Thu, 22 Jun 2017 18:20:04 +0200
Updated chardet to version 3.0.4 and corrected the changelog file.
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is Mozilla Universal charset detector code.
#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 2001
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
#   Mark Pilgrim - port to Python
#   Shy Shalom - original C code
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
"""
Module containing the UniversalDetector detector class, which is the primary
class a user of ``chardet`` should use.

:author: Mark Pilgrim (initial port to Python)
:author: Shy Shalom (original C code)
:author: Dan Blanchard (major refactoring for 3.0)
:author: Ian Cordasco
"""


import codecs
import logging
import re

from .charsetgroupprober import CharSetGroupProber
from .enums import InputState, LanguageFilter, ProbingState
from .escprober import EscCharSetProber
from .latin1prober import Latin1Prober
from .mbcsgroupprober import MBCSGroupProber
from .sbcsgroupprober import SBCSGroupProber


class UniversalDetector(object):
    """
    The ``UniversalDetector`` class underlies the ``chardet.detect`` function
    and coordinates all of the different charset probers.

    To get a ``dict`` containing an encoding and its confidence, you can simply
    run:

    .. code::

            u = UniversalDetector()
            u.feed(some_bytes)
            u.close()
            detected = u.result

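One way the ``result`` dict might be consumed afterwards is sketched below,
under stated assumptions: the dict is hard-coded to the value that ``feed``
assigns for input starting with a UTF-8 BOM, and ``decode_with_result`` and
its ``fallback`` parameter are illustrative names, not part of ``chardet``.

```python
import codecs


def decode_with_result(data, result, fallback='utf-8'):
    # 'encoding' is None when nothing was detected, so use the fallback.
    encoding = result['encoding'] or fallback
    return data.decode(encoding, errors='replace')


# Hard-coded stand-in for the dict that feed() assigns on a UTF-8 BOM.
bom_result = {'encoding': 'UTF-8-SIG', 'confidence': 1.0, 'language': ''}
text = decode_with_result(codecs.BOM_UTF8 + b'hello', bom_result)
print(text)
```

The fallback branch matters because ``reset`` initialises ``encoding`` to
``None``, so a caller should not pass the value to ``decode`` unchecked.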
    """

    MINIMUM_THRESHOLD = 0.20
    HIGH_BYTE_DETECTOR = re.compile(b'[\x80-\xFF]')
    ESC_DETECTOR = re.compile(b'(\033|~{)')
    WIN_BYTE_DETECTOR = re.compile(b'[\x80-\x9F]')
    ISO_WIN_MAP = {'iso-8859-1': 'Windows-1252',
                   'iso-8859-2': 'Windows-1250',
                   'iso-8859-5': 'Windows-1251',
                   'iso-8859-6': 'Windows-1256',
                   'iso-8859-7': 'Windows-1253',
                   'iso-8859-8': 'Windows-1255',
                   'iso-8859-9': 'Windows-1254',
                   'iso-8859-13': 'Windows-1257'}

    def __init__(self, lang_filter=LanguageFilter.ALL):
        self._esc_charset_prober = None
        self._charset_probers = []
        self.result = None
        self.done = None
        self._got_data = None
        self._input_state = None
        self._last_char = None
        self.lang_filter = lang_filter
        self.logger = logging.getLogger(__name__)
        self._has_win_bytes = None
        self.reset()

    def reset(self):
        """
        Reset the UniversalDetector and all of its probers back to their
        initial states.  This is called by ``__init__``, so you only need to
        call this directly in between analyses of different documents.
        """
        self.result = {'encoding': None, 'confidence': 0.0, 'language': None}
        self.done = False
        self._got_data = False
        self._has_win_bytes = False
        self._input_state = InputState.PURE_ASCII
        self._last_char = b''
        if self._esc_charset_prober:
            self._esc_charset_prober.reset()
        for prober in self._charset_probers:
            prober.reset()

    def feed(self, byte_str):
        """
        Takes a chunk of a document and feeds it through all of the relevant
        charset probers.

        After calling ``feed``, you can check the value of the ``done``
        attribute to see if you need to continue feeding the
        ``UniversalDetector`` more data, or if it has made a prediction
        (in the ``result`` attribute).

        .. note::
           You should always call ``close`` when you're done feeding in your
           document if ``done`` is not already ``True``.
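
The feed-until-done protocol described above can be sketched as follows. To
keep the sketch self-contained, ``BomOnlyDetector`` is a hypothetical stand-in
that only recognises a UTF-8 BOM; it is not the real ``UniversalDetector``,
and ``detect_chunks`` is an illustrative helper name.

```python
import codecs


class BomOnlyDetector:
    # Hypothetical stand-in exposing feed/close/done/result.
    # It only recognises a UTF-8 BOM and does no statistical probing.

    def __init__(self):
        self.done = False
        self.result = {'encoding': None, 'confidence': 0.0, 'language': None}

    def feed(self, byte_str):
        # Ignore further input once a decision has been made.
        if self.done or not byte_str:
            return
        if bytes(byte_str).startswith(codecs.BOM_UTF8):
            self.result = {'encoding': 'UTF-8-SIG', 'confidence': 1.0,
                           'language': ''}
            self.done = True

    def close(self):
        # In this stand-in, closing only finalises the current result.
        self.done = True
        return self.result


def detect_chunks(detector, chunks):
    # Feed chunks until the detector is done, then close it if needed.
    fed = 0
    for chunk in chunks:
        detector.feed(chunk)
        fed += 1
        if detector.done:
            break
    if not detector.done:
        detector.close()
    return detector.result, fed


result, fed = detect_chunks(BomOnlyDetector(),
                            [codecs.BOM_UTF8 + b'abc', b'def', b'ghi'])
print(result['encoding'], fed)
```

Checking ``done`` after each chunk lets a caller stop reading a large
document early, while the final ``close`` covers input that never produced
an early decision.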
        """
        if self.done:
            return

        if not len(byte_str):
            return

        if not isinstance(byte_str, bytearray):
            byte_str = bytearray(byte_str)

        # First check for known BOMs, since these are guaranteed to be correct
        if not self._got_data:
            # If the data starts with BOM, we know it is UTF
            if byte_str.startswith(codecs.BOM_UTF8):
                # EF BB BF  UTF-8 with BOM
                self.result = {'encoding': "UTF-8-SIG",
                               'confidence': 1.0,
                               'language': ''}
            elif byte_str.startswith((codecs.BOM_UTF32_LE,
                                      codecs.BOM_UTF32_BE)):
                # FF FE 00 00  UTF-32, little-endian BOM
                # 00 00 FE FF  UTF-32, big-endian BOM
                self.result = {'encoding': "UTF-32",
                               'confidence': 1.0,
                               'language': ''}
            elif byte_str.startswith(b'\xFE\xFF\x00\x00'):
                # FE FF 00 00  UCS-4, unusual octet order BOM (3412)
                self.result = {'encoding': "X-ISO-10646-UCS-4-3412",
                               'confidence': 1.0,
                               'language': ''}
            elif byte_str.startswith(b'\x00\x00\xFF\xFE'):
                # 00 00 FF FE  UCS-4, unusual octet order BOM (2143)
                self.result = {'encoding': "X-ISO-10646-UCS-4-2143",
                               'confidence': 1.0,
                               'language': ''}
            elif byte_str.startswith((codecs.BOM_LE, codecs.BOM_BE)):
                # FF FE  UTF-16, little endian BOM
                # FE FF  UTF-16, big endian BOM
                self.result = {'encoding': "UTF-16",
                               'confidence': 1.0,
                               'language': ''}

            self._got_data = True
            if self.result['encoding'] is not None:
                self.done = True
                return

        # If none of those matched and we've only seen ASCII so far, check
        # for high bytes and escape sequences
        if self._input_state == InputState.PURE_ASCII:
            if self.HIGH_BYTE_DETECTOR.search(byte_str):
                self._input_state = InputState.HIGH_BYTE
            elif self._input_state == InputState.PURE_ASCII and \
                    self.ESC_DETECTOR.search(self._last_char + byte_str):
                self._input_state = InputState.ESC_ASCII

        self._last_char = byte_str[-1:]

        # If we've seen escape sequences, use the EscCharSetProber, which
        # uses a simple state machine to check for known escape sequences in
        # HZ and ISO-2022 encodings, since those are the only encodings that
        # use such sequences.
        if self._input_state == InputState.ESC_ASCII:
            if not self._esc_charset_prober:
                self._esc_charset_prober = EscCharSetProber(self.lang_filter)
            if self._esc_charset_prober.feed(byte_str) == ProbingState.FOUND_IT:
                self.result = {'encoding':
                               self._esc_charset_prober.charset_name,
                               'confidence':
                               self._esc_charset_prober.get_confidence(),
                               'language':
                               self._esc_charset_prober.language}
                self.done = True
        # If we've seen high bytes (i.e., those with values greater than 127),
        # we need to do more complicated checks using all our multi-byte and
        # single-byte probers that are left.  The single-byte probers
        # use character bigram distributions to determine the encoding, whereas
        # the multi-byte probers use a combination of character unigram and
        # bigram distributions.
        elif self._input_state == InputState.HIGH_BYTE:
            if not self._charset_probers:
                self._charset_probers = [MBCSGroupProber(self.lang_filter)]
                # If we're checking non-CJK encodings, use single-byte prober
                if self.lang_filter & LanguageFilter.NON_CJK:
                    self._charset_probers.append(SBCSGroupProber())
                self._charset_probers.append(Latin1Prober())
            for prober in self._charset_probers:
                if prober.feed(byte_str) == ProbingState.FOUND_IT:
                    self.result = {'encoding': prober.charset_name,
                                   'confidence': prober.get_confidence(),
                                   'language': prober.language}
                    self.done = True
                    break
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
217 | if self.WIN_BYTE_DETECTOR.search(byte_str): |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
218 | self._has_win_bytes = True |
3537
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
12
diff
changeset
|
219 | |
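The high-byte branch of `feed` above can be read as: feed every prober the same chunk, stop at the first one that reports `FOUND_IT`, then note whether the chunk contains Windows-specific bytes. The sketch below is a minimal, self-contained illustration of that flow and is not the library's code. `StubProber`, `feed_high_bytes`, the local `ProbingState` enum and the `0x80-0x9F` byte range are all assumptions introduced for the example, and it assumes the Windows-byte check runs after the loop (the listing has lost its indentation).

```python
import re
from enum import Enum


class ProbingState(Enum):
    # Local stand-in for the prober states used in the listing.
    DETECTING = 0
    FOUND_IT = 1
    NOT_ME = 2


class StubProber:
    """Hypothetical prober: reports FOUND_IT when a marker sequence appears."""

    def __init__(self, charset_name, marker, confidence):
        self.charset_name = charset_name
        self.language = ''
        self._marker = marker
        self._confidence = confidence

    def feed(self, byte_str):
        if self._marker in byte_str:
            return ProbingState.FOUND_IT
        return ProbingState.DETECTING

    def get_confidence(self):
        return self._confidence


# Assumed pattern: 0x80-0x9F are printable in Windows code pages but are
# C1 control codes in the ISO-8859 family.
WIN_BYTE_DETECTOR = re.compile(b'[\x80-\x9f]')


def feed_high_bytes(probers, byte_str):
    """Return (result, has_win_bytes); result stays None unless a prober is sure."""
    result = None
    for prober in probers:
        if prober.feed(byte_str) == ProbingState.FOUND_IT:
            result = {'encoding': prober.charset_name,
                      'confidence': prober.get_confidence(),
                      'language': prober.language}
            break  # the first certain prober wins; later ones are not consulted
    has_win_bytes = WIN_BYTE_DETECTOR.search(byte_str) is not None
    return result, has_win_bytes
```

A caller would keep feeding chunks until a result appears, and would remember `has_win_bytes` across calls so that the final prediction can prefer a Windows code page name.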
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
12
diff
changeset
|
220 | def close(self): |
5714
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
221 | """ |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
222 | Stop analyzing the current document and come up with a final |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
223 | prediction. |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
224 | |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
225 | :returns: The ``result`` attribute, a ``dict`` with the keys |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
226 | ``encoding``, ``confidence``, and ``language``. |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
227 | """ |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
228 | # Don't bother with checks if we're already done |
3537
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
12
diff
changeset
|
229 | if self.done: |
5714
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
230 | return self.result |
3537
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
12
diff
changeset
|
231 | self.done = True |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
12
diff
changeset
|
232 | |
5714
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
233 | if not self._got_data: |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
234 | self.logger.debug('no data received!') |
3537
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
12
diff
changeset
|
235 | |
5714
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
236 | # Default to ASCII if it is all we've seen so far |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
237 | elif self._input_state == InputState.PURE_ASCII: |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
238 | self.result = {'encoding': 'ascii', |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
239 | 'confidence': 1.0, |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
240 | 'language': ''} |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
241 | |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
242 | # If we have seen non-ASCII, return the best that met MINIMUM_THRESHOLD |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
243 | elif self._input_state == InputState.HIGH_BYTE: |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
244 | prober_confidence = None |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
245 | max_prober_confidence = 0.0 |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
246 | max_prober = None |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
247 | for prober in self._charset_probers: |
3537
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
12
diff
changeset
|
248 | if not prober: |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
12
diff
changeset
|
249 | continue |
5714
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
250 | prober_confidence = prober.get_confidence() |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
251 | if prober_confidence > max_prober_confidence: |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
252 | max_prober_confidence = prober_confidence |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
253 | max_prober = prober |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
254 | if max_prober and (max_prober_confidence > self.MINIMUM_THRESHOLD): |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
255 | charset_name = max_prober.charset_name |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
256 | lower_charset_name = max_prober.charset_name.lower() |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
257 | confidence = max_prober.get_confidence() |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
258 | # Use Windows encoding name instead of ISO-8859 if we saw any |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
259 | # extra Windows-specific bytes |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
260 | if lower_charset_name.startswith('iso-8859'): |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
261 | if self._has_win_bytes: |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
262 | charset_name = self.ISO_WIN_MAP.get(lower_charset_name, |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
263 | charset_name) |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
264 | self.result = {'encoding': charset_name, |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
265 | 'confidence': confidence, |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
266 | 'language': max_prober.language} |
3537
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
12
diff
changeset
|
267 | |
5714
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
268 | # Log all prober confidences if none met MINIMUM_THRESHOLD |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
269 | if self.logger.getEffectiveLevel() <= logging.DEBUG: |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
270 | if self.result['encoding'] is None: |
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
271 | self.logger.debug('no probers hit minimum threshold') |
5763
e2d839b69ff3
Updated chardet to version 3.0.4 and corrected the changelog file.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5714
diff
changeset
|
272 | for group_prober in self._charset_probers: |
e2d839b69ff3
Updated chardet to version 3.0.4 and corrected the changelog file.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5714
diff
changeset
|
273 | if not group_prober: |
5714
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
274 | continue |
5763
e2d839b69ff3
Updated chardet to version 3.0.4 and corrected the changelog file.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5714
diff
changeset
|
275 | if isinstance(group_prober, CharSetGroupProber): |
e2d839b69ff3
Updated chardet to version 3.0.4 and corrected the changelog file.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5714
diff
changeset
|
276 | for prober in group_prober.probers: |
e2d839b69ff3
Updated chardet to version 3.0.4 and corrected the changelog file.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5714
diff
changeset
|
277 | self.logger.debug('%s %s confidence = %s', |
e2d839b69ff3
Updated chardet to version 3.0.4 and corrected the changelog file.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5714
diff
changeset
|
278 | prober.charset_name, |
e2d839b69ff3
Updated chardet to version 3.0.4 and corrected the changelog file.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5714
diff
changeset
|
279 | prober.language, |
e2d839b69ff3
Updated chardet to version 3.0.4 and corrected the changelog file.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5714
diff
changeset
|
280 | prober.get_confidence()) |
e2d839b69ff3
Updated chardet to version 3.0.4 and corrected the changelog file.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5714
diff
changeset
|
281 | else: |
e2d839b69ff3
Updated chardet to version 3.0.4 and corrected the changelog file.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5714
diff
changeset
|
282 | self.logger.debug('%s %s confidence = %s', |
e2d839b69ff3
Updated chardet to version 3.0.4 and corrected the changelog file.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5714
diff
changeset
|
283 | group_prober.charset_name, |
e2d839b69ff3
Updated chardet to version 3.0.4 and corrected the changelog file.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5714
diff
changeset
|
284 | group_prober.language, |
e2d839b69ff3
Updated chardet to version 3.0.4 and corrected the changelog file.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5714
diff
changeset
|
285 | group_prober.get_confidence()) |
5714
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
286 | return self.result |
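The selection step in `close` above picks the prober with the highest confidence, accepts it only if that confidence exceeds `MINIMUM_THRESHOLD`, and swaps an ISO-8859 name for its Windows counterpart when Windows-specific bytes were seen. The following is a minimal sketch of that logic on plain tuples rather than prober objects. `pick_result`, the threshold value and the two-entry mapping are assumptions chosen for illustration, not values taken from the listing.

```python
MINIMUM_THRESHOLD = 0.20  # assumed value, for illustration only

# Assumed subset of an ISO-8859 -> Windows code page mapping.
ISO_WIN_MAP = {'iso-8859-1': 'Windows-1252',
               'iso-8859-7': 'Windows-1253'}


def pick_result(candidates, has_win_bytes):
    """Choose a final prediction.

    candidates: list of (charset_name, confidence, language) tuples.
    has_win_bytes: True if bytes specific to Windows code pages were seen.
    """
    result = {'encoding': None, 'confidence': 0.0, 'language': None}
    # max() returns the first of several equal maxima, matching a loop that
    # only replaces the current best on a strictly greater confidence.
    best = max(candidates, key=lambda c: c[1], default=None)
    if best is not None and best[1] > MINIMUM_THRESHOLD:
        charset_name, confidence, language = best
        lower_name = charset_name.lower()
        # Prefer the Windows name only for ISO-8859 guesses, and only when
        # Windows-specific bytes were actually observed.
        if lower_name.startswith('iso-8859') and has_win_bytes:
            charset_name = ISO_WIN_MAP.get(lower_name, charset_name)
        result = {'encoding': charset_name,
                  'confidence': confidence,
                  'language': language}
    return result
```

When no candidate clears the threshold, the encoding stays `None`, which is the case the debug logging at the end of `close` is meant to help diagnose.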