Wed, 13 Jul 2022 15:34:50 +0200
Revisions <no_multi_processing, Variables Viewer, with_python2> closed.
3537
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
12
diff
changeset
|
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is Mozilla Universal charset detector code.
#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 2001
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
#   Mark Pilgrim - port to Python
#   Shy Shalom - original C code
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################

from .charsetprober import CharSetProber
5714
90c57b50600f
Updated chardet to 3.0.2.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
5310
diff
changeset
|
from .enums import ProbingState
3537
FREQ_CAT_NUM = 4

UDF = 0  # undefined
OTH = 1  # other
ASC = 2  # ascii capital letter
ASS = 3  # ascii small letter
ACV = 4  # accent capital vowel
ACO = 5  # accent capital other
ASV = 6  # accent small vowel
ASO = 7  # accent small other
CLASS_NUM = 8  # total classes

Latin1_CharToClass = (
    OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH,   # 00 - 07
    OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH,   # 08 - 0F
    OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH,   # 10 - 17
    OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH,   # 18 - 1F
    OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH,   # 20 - 27
    OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH,   # 28 - 2F
    OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH,   # 30 - 37
    OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH,   # 38 - 3F
    OTH, ASC, ASC, ASC, ASC, ASC, ASC, ASC,   # 40 - 47
    ASC, ASC, ASC, ASC, ASC, ASC, ASC, ASC,   # 48 - 4F
    ASC, ASC, ASC, ASC, ASC, ASC, ASC, ASC,   # 50 - 57
    ASC, ASC, ASC, OTH, OTH, OTH, OTH, OTH,   # 58 - 5F
    OTH, ASS, ASS, ASS, ASS, ASS, ASS, ASS,   # 60 - 67
    ASS, ASS, ASS, ASS, ASS, ASS, ASS, ASS,   # 68 - 6F
    ASS, ASS, ASS, ASS, ASS, ASS, ASS, ASS,   # 70 - 77
    ASS, ASS, ASS, OTH, OTH, OTH, OTH, OTH,   # 78 - 7F
    OTH, UDF, OTH, ASO, OTH, OTH, OTH, OTH,   # 80 - 87
    OTH, OTH, ACO, OTH, ACO, UDF, ACO, UDF,   # 88 - 8F
    UDF, OTH, OTH, OTH, OTH, OTH, OTH, OTH,   # 90 - 97
    OTH, OTH, ASO, OTH, ASO, UDF, ASO, ACO,   # 98 - 9F
    OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH,   # A0 - A7
    OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH,   # A8 - AF
    OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH,   # B0 - B7
    OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH,   # B8 - BF
    ACV, ACV, ACV, ACV, ACV, ACV, ACO, ACO,   # C0 - C7
    ACV, ACV, ACV, ACV, ACV, ACV, ACV, ACV,   # C8 - CF
    ACO, ACO, ACV, ACV, ACV, ACV, ACV, OTH,   # D0 - D7
    ACV, ACV, ACV, ACV, ACV, ACO, ACO, ACO,   # D8 - DF
    ASV, ASV, ASV, ASV, ASV, ASV, ASO, ASO,   # E0 - E7
    ASV, ASV, ASV, ASV, ASV, ASV, ASV, ASV,   # E8 - EF
    ASO, ASO, ASV, ASV, ASV, ASV, ASV, OTH,   # F0 - F7
    ASV, ASV, ASV, ASV, ASV, ASO, ASO, ASO,   # F8 - FF
)

# 0 : illegal
# 1 : very unlikely
# 2 : normal
# 3 : very likely
Latin1ClassModel = (
5714
# UDF OTH ASC ASS ACV ACO ASV ASO
3537
    0,  0,  0,  0,  0,  0,  0,  0,  # UDF
    0,  3,  3,  3,  3,  3,  3,  3,  # OTH
    0,  3,  3,  3,  3,  3,  3,  3,  # ASC
    0,  3,  3,  3,  1,  1,  3,  3,  # ASS
    0,  3,  3,  3,  1,  2,  1,  2,  # ACV
    0,  3,  3,  3,  3,  3,  3,  3,  # ACO
    0,  3,  1,  3,  1,  1,  1,  3,  # ASV
    0,  3,  1,  3,  1,  1,  3,  3,  # ASO
)


class Latin1Prober(CharSetProber):
    def __init__(self):
5714
        super(Latin1Prober, self).__init__()
        self._last_char_class = None
        self._freq_counter = None
3537
        self.reset()

    def reset(self):
5714
        self._last_char_class = OTH
        self._freq_counter = [0] * FREQ_CAT_NUM
3537
        CharSetProber.reset(self)
5714
    @property
    def charset_name(self):
        return "ISO-8859-1"

    @property
    def language(self):
        return ""

    def feed(self, byte_str):
        byte_str = self.filter_with_english_letters(byte_str)
        for c in byte_str:
            char_class = Latin1_CharToClass[c]
            freq = Latin1ClassModel[(self._last_char_class * CLASS_NUM)
                                    + char_class]
3537
            if freq == 0:
5714
                self._state = ProbingState.NOT_ME
3537
                break
5714
            self._freq_counter[freq] += 1
            self._last_char_class = char_class

        return self.state
3537
    def get_confidence(self):
5714
        if self.state == ProbingState.NOT_ME:
3537
            return 0.01
5714
        total = sum(self._freq_counter)
3537
        if total < 0.01:
            confidence = 0.0
        else:
5714
            confidence = ((self._freq_counter[3] - self._freq_counter[1] * 20.0)
5310
f2b774d78b4a
Updated chardet to version 2.3.0.
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
3537
diff
changeset
|
                          / total)
3537
        if confidence < 0.0:
            confidence = 0.0
        # lower the confidence of latin1 so that other, more accurate
        # detectors can take priority.
5310
        confidence = confidence * 0.73
3537
        return confidence
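
The arithmetic in get_confidence() can be tried in isolation. The following is a minimal sketch, not part of chardet: the function name latin1_confidence and its list argument are hypothetical, and it assumes the counter holds non-negative integer counts indexed by the pair ratings 0 to 3 described above Latin1ClassModel (so testing total == 0 matches the listing's total < 0.01 check).

```python
# Hypothetical standalone helper mirroring the arithmetic of
# Latin1Prober.get_confidence(); it is an illustration, not chardet code.

def latin1_confidence(freq_counter):
    """Return a confidence value from per-rating pair counts.

    freq_counter[k] is the number of adjacent character-class pairs
    rated k, where 1 means "very unlikely" and 3 means "very likely".
    """
    total = sum(freq_counter)
    if total == 0:
        # No pairs seen yet, so there is no evidence either way.
        return 0.0
    # Each "very unlikely" pair outweighs twenty "very likely" pairs.
    confidence = (freq_counter[3] - freq_counter[1] * 20.0) / total
    # Negative scores are clamped to zero.
    confidence = max(confidence, 0.0)
    # Same 0.73 damping factor as the listing, so that more specific
    # probers can win when they also match.
    return confidence * 0.73


if __name__ == "__main__":
    print(latin1_confidence([0, 0, 0, 10]))   # only "very likely" pairs
    print(latin1_confidence([0, 1, 0, 10]))   # one unlikely pair among ten likely
    print(latin1_confidence([0, 1, 0, 50]))   # one unlikely pair among fifty likely
```

With ten "very likely" pairs and nothing else the sketch returns 0.73, the ceiling imposed by the damping factor. A single "very unlikely" pair among ten likely ones drives the result to 0.0, which shows how heavily the factor of 20 penalizes implausible class transitions.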