Mon, 28 Dec 2009 16:03:33 +0000
Started porting eric4 to Python3
0
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
1 | ######################## BEGIN LICENSE BLOCK ######################## |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
2 | # The Original Code is Mozilla Universal charset detector code. |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
3 | # |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
4 | # The Initial Developer of the Original Code is |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
5 | # Netscape Communications Corporation. |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
6 | # Portions created by the Initial Developer are Copyright (C) 2001 |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
7 | # the Initial Developer. All Rights Reserved. |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
8 | # |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
9 | # Contributor(s): |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
10 | # Mark Pilgrim - port to Python |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
11 | # Shy Shalom - original C code |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
12 | # |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
13 | # This library is free software; you can redistribute it and/or |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
14 | # modify it under the terms of the GNU Lesser General Public |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
15 | # License as published by the Free Software Foundation; either |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
16 | # version 2.1 of the License, or (at your option) any later version. |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
17 | # |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
18 | # This library is distributed in the hope that it will be useful, |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
19 | # but WITHOUT ANY WARRANTY; without even the implied warranty of |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
20 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
21 | # Lesser General Public License for more details. |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
22 | # |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
23 | # You should have received a copy of the GNU Lesser General Public |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
24 | # License along with this library; if not, write to the Free Software |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
25 | # Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
26 | # 02110-1301 USA |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
27 | ######################### END LICENSE BLOCK ######################### |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
28 | |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
29 | import constants, sys |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
30 | from charsetprober import CharSetProber |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
31 | |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
32 | SAMPLE_SIZE = 64 |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
33 | SB_ENOUGH_REL_THRESHOLD = 1024 |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
34 | POSITIVE_SHORTCUT_THRESHOLD = 0.95 |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
35 | NEGATIVE_SHORTCUT_THRESHOLD = 0.05 |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
36 | SYMBOL_CAT_ORDER = 250 |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
37 | NUMBER_OF_SEQ_CAT = 4 |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
38 | POSITIVE_CAT = NUMBER_OF_SEQ_CAT - 1 |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
39 | #NEGATIVE_CAT = 0 |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
40 | |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
41 | class SingleByteCharSetProber(CharSetProber): |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
42 | def __init__(self, model, reversed=constants.False, nameProber=None): |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
43 | CharSetProber.__init__(self) |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
44 | self._mModel = model |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
45 | self._mReversed = reversed # TRUE if we need to reverse every pair in the model lookup |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
46 | self._mNameProber = nameProber # Optional auxiliary prober for name decision |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
47 | self.reset() |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
48 | |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
49 | def reset(self): |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
50 | CharSetProber.reset(self) |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
51 | self._mLastOrder = 255 # char order of last character |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
52 | self._mSeqCounters = [0] * NUMBER_OF_SEQ_CAT |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
53 | self._mTotalSeqs = 0 |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
54 | self._mTotalChar = 0 |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
55 | self._mFreqChar = 0 # characters that fall in our sampling range |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
56 | |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
57 | def get_charset_name(self): |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
58 | if self._mNameProber: |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
59 | return self._mNameProber.get_charset_name() |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
60 | else: |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
61 | return self._mModel['charsetName'] |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
62 | |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
63 | def feed(self, aBuf): |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
64 | if not self._mModel['keepEnglishLetter']: |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
65 | aBuf = self.filter_without_english_letters(aBuf) |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
66 | aLen = len(aBuf) |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
67 | if not aLen: |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
68 | return self.get_state() |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
69 | for c in aBuf: |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
70 | order = self._mModel['charToOrderMap'][ord(c)] |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
71 | if order < SYMBOL_CAT_ORDER: |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
72 | self._mTotalChar += 1 |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
73 | if order < SAMPLE_SIZE: |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
74 | self._mFreqChar += 1 |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
75 | if self._mLastOrder < SAMPLE_SIZE: |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
76 | self._mTotalSeqs += 1 |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
77 | if not self._mReversed: |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
78 | self._mSeqCounters[self._mModel['precedenceMatrix'][(self._mLastOrder * SAMPLE_SIZE) + order]] += 1 |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
79 | else: # reverse the order of the letters in the lookup |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
80 | self._mSeqCounters[self._mModel['precedenceMatrix'][(order * SAMPLE_SIZE) + self._mLastOrder]] += 1 |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
81 | self._mLastOrder = order |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
82 | |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
83 | if self.get_state() == constants.eDetecting: |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
84 | if self._mTotalSeqs > SB_ENOUGH_REL_THRESHOLD: |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
85 | cf = self.get_confidence() |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
86 | if cf > POSITIVE_SHORTCUT_THRESHOLD: |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
87 | if constants._debug: |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
88 | sys.stderr.write('%s confidence = %s, we have a winner\n' % (self._mModel['charsetName'], cf)) |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
89 | self._mState = constants.eFoundIt |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
90 | elif cf < NEGATIVE_SHORTCUT_THRESHOLD: |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
91 | if constants._debug: |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
92 | sys.stderr.write('%s confidence = %s, below negative shortcut threshhold %s\n' % (self._mModel['charsetName'], cf, NEGATIVE_SHORTCUT_THRESHOLD)) |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
93 | self._mState = constants.eNotMe |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
94 | |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
95 | return self.get_state() |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
96 | |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
97 | def get_confidence(self): |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
98 | r = 0.01 |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
99 | if self._mTotalSeqs > 0: |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
100 | # print self._mSeqCounters[POSITIVE_CAT], self._mTotalSeqs, self._mModel['mTypicalPositiveRatio'] |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
101 | r = (1.0 * self._mSeqCounters[POSITIVE_CAT]) / self._mTotalSeqs / self._mModel['mTypicalPositiveRatio'] |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
102 | # print r, self._mFreqChar, self._mTotalChar |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
103 | r = r * self._mFreqChar / self._mTotalChar |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
104 | if r >= 1.0: |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
105 | r = 0.99 |
de9c2efb9d02
Started porting eric4 to Python3
Detlev Offenbach <detlev@die-offenbachs.de>
parents:
diff
changeset
|
106 | return r |