Fri, 25 Apr 2014 22:07:19 +0200
updated CharDet to 2.2.1, updated changelog
3537
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
1 | <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
2 | <html lang="en"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
3 | <head> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
4 | <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
5 | <title>How it works [Universal Encoding Detector]</title> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
6 | <link rel="stylesheet" href="css/chardet.css" type="text/css"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
7 | <link rev="made" href="mailto:mark@diveintomark.org"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
8 | <meta name="generator" content="DocBook XSL Stylesheets V1.65.1"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
9 | <meta name="keywords" content="character, set, encoding, detection, Python, XML, feed"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
10 | <link rel="start" href="index.html" title="Documentation"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
11 | <link rel="up" href="index.html" title="Documentation"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
12 | <link rel="prev" href="usage.html" title="Usage"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
13 | <link rel="next" href="history.html" title="Revision history"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
14 | </head> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
15 | <body id="chardet-feedparser-org" class="docs"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
16 | <div class="z" id="intro"><div class="sectionInner"><div class="sectionInner2"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
17 | <div class="s" id="pageHeader"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
18 | <h1><a href="/">Universal Encoding Detector</a></h1> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
19 | <p>Character encoding auto-detection in Python. As smart as your browser. Open source.</p> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
20 | </div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
21 | <div class="s" id="quickSummary"><ul> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
22 | <li class="li1"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
23 | <a href="http://chardet.feedparser.org/download/">Download</a> ·</li> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
24 | <li class="li2"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
25 | <a href="index.html">Documentation</a> ·</li> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
26 | <li class="li3"><a href="faq.html" title="Frequently Asked Questions">FAQ</a></li> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
27 | </ul></div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
28 | </div></div></div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
29 | <div id="main"><div id="mainInner"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
30 | <p id="breadcrumb">You are here: <a href="index.html">Documentation</a> → <span class="thispage">How it works</span></p> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
31 | <div class="section" lang="en"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
32 | <div class="titlepage"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
33 | <div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
34 | <div><h2 class="title"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
35 | <a name="howitworks" class="skip" href="#howitworks" title="link to this section"><img src="images/permalink.gif" alt="[link]" title="link to this section" width="8" height="9"></a> How it works</h2></div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
36 | <div><div class="abstract"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
37 | <h3 class="title"></h3> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
38 | <p>This is a brief guide to navigating the code itself.</p> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
39 | </div></div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
40 | </div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
41 | <div></div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
42 | </div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
43 | <p>First, you should read <a href="http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html">A composite approach to language/encoding detection</a>, which explains the detection algorithm and how it was derived. This will help you later when you stumble across the huge character frequency distribution tables like <tt class="filename">big5freq.py</tt> and language models like <tt class="filename">langcyrillicmodel.py</tt>.</p> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
44 | <p>The main entry point for the detection algorithm is <tt class="filename">universaldetector.py</tt>, which has one class, <tt class="classname">UniversalDetector</tt>. (You might think the main entry point is the <tt class="function">detect</tt> function in <tt class="filename">chardet/__init__.py</tt>, but that’s really just a convenience function that creates a <tt class="classname">UniversalDetector</tt> object, calls it, and returns its result.)</p> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
45 | <p>There are 5 categories of encodings that <tt class="classname">UniversalDetector</tt> handles:</p> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
46 | <div class="orderedlist"><ol type="1"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
47 | <li> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
48 | <tt class="literal">UTF-n</tt> with a <acronym title="Byte Order Mark">BOM</acronym>. This includes <tt class="literal">UTF-8</tt>, both <acronym title="Big Endian">BE</acronym> and <acronym title="Little Endian">LE</acronym> variants of <tt class="literal">UTF-16</tt>, and all 4 byte-order variants of <tt class="literal">UTF-32</tt>.</li> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
49 | <li>Escaped encodings, which are entirely 7-bit <acronym>ASCII</acronym> compatible, where non-<acronym>ASCII</acronym> characters start with an escape sequence. Examples: <tt class="literal">ISO-2022-JP</tt> (Japanese) and <tt class="literal">HZ-GB-2312</tt> (Chinese).</li> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
50 | <li>Multi-byte encodings, where each character is represented by a variable number of bytes. Examples: <tt class="literal">Big5</tt> (Chinese), <tt class="literal">SHIFT_JIS</tt> (Japanese), <tt class="literal">EUC-KR</tt> (Korean), and <tt class="literal">UTF-8</tt> without a <acronym title="Byte Order Mark">BOM</acronym>.</li> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
51 | <li>Single-byte encodings, where each character is represented by one byte. Examples: <tt class="literal">KOI8-R</tt> (Russian), <tt class="literal">windows-1255</tt> (Hebrew), and <tt class="literal">TIS-620</tt> (Thai).</li> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
52 | <li> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
53 | <tt class="literal">windows-1252</tt>, which is used primarily on Microsoft Windows by middle managers who don’t know a character encoding from a hole in the ground.</li> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
54 | </ol></div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
55 | <div class="section" lang="en"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
56 | <div class="titlepage"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
57 | <div><div><h3 class="title"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
58 | <a name="how.bom" class="skip" href="#how.bom" title="link to this section"><img src="images/permalink.gif" alt="[link]" title="link to this section" width="8" height="9"></a> <tt class="literal">UTF-n</tt> with a <acronym title="Byte Order Mark">BOM</acronym> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
59 | </h3></div></div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
60 | <div></div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
61 | </div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
62 | <p>If the text starts with a <acronym title="Byte Order Mark">BOM</acronym>, we can reasonably assume that the text is encoded in <tt class="literal">UTF-8</tt>, <tt class="literal">UTF-16</tt>, or <tt class="literal">UTF-32</tt>. (The <acronym title="Byte Order Mark">BOM</acronym> will tell us exactly which one; that’s what it’s for.) This is handled inline in <tt class="classname">UniversalDetector</tt>, which returns the result immediately without any further processing.</p> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
63 | </div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
64 | <div class="section" lang="en"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
65 | <div class="titlepage"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
66 | <div><div><h3 class="title"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
67 | <a name="how.esc" class="skip" href="#how.esc" title="link to this section"><img src="images/permalink.gif" alt="[link]" title="link to this section" width="8" height="9"></a> Escaped encodings</h3></div></div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
68 | <div></div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
69 | </div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
70 | <p>If the text contains a recognizable escape sequence that might indicate an escaped encoding, <tt class="classname">UniversalDetector</tt> creates an <tt class="classname">EscCharSetProber</tt> (defined in <tt class="filename">escprober.py</tt>) and feeds it the text.</p> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
71 | <p><tt class="classname">EscCharSetProber</tt> creates a series of state machines, based on models of <tt class="literal">HZ-GB-2312</tt>, <tt class="literal">ISO-2022-CN</tt>, <tt class="literal">ISO-2022-JP</tt>, and <tt class="literal">ISO-2022-KR</tt> (defined in <tt class="filename">escsm.py</tt>). <tt class="classname">EscCharSetProber</tt> feeds the text to each of these state machines, one byte at a time. If any state machine ends up uniquely identifying the encoding, <tt class="classname">EscCharSetProber</tt> immediately returns the positive result to <tt class="classname">UniversalDetector</tt>, which returns it to the caller. If any state machine hits an illegal sequence, it is dropped and processing continues with the other state machines.</p> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
72 | </div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
73 | <div class="section" lang="en"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
74 | <div class="titlepage"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
75 | <div><div><h3 class="title"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
76 | <a name="how.mb" class="skip" href="#how.mb" title="link to this section"><img src="images/permalink.gif" alt="[link]" title="link to this section" width="8" height="9"></a> Multi-byte encodings</h3></div></div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
77 | <div></div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
78 | </div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
79 | <p>Assuming no <acronym title="Byte Order Mark">BOM</acronym>, <tt class="classname">UniversalDetector</tt> checks whether the text contains any high-bit characters. If so, it creates a series of “<span class="quote">probers</span>” for detecting multi-byte encodings, single-byte encodings, and as a last resort, <tt class="literal">windows-1252</tt>.</p> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
80 | <p>The multi-byte encoding prober, <tt class="classname">MBCSGroupProber</tt> (defined in <tt class="filename">mbcsgroupprober.py</tt>), is really just a shell that manages a group of other probers, one for each multi-byte encoding: <tt class="literal">Big5</tt>, <tt class="literal">GB2312</tt>, <tt class="literal">EUC-TW</tt>, <tt class="literal">EUC-KR</tt>, <tt class="literal">EUC-JP</tt>, <tt class="literal">SHIFT_JIS</tt>, and <tt class="literal">UTF-8</tt>. <tt class="classname">MBCSGroupProber</tt> feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to <tt class="classname">UniversalDetector</tt>.<tt class="methodname">feed</tt> will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, <tt class="classname">MBCSGroupProber</tt> reports this positive result to <tt class="classname">UniversalDetector</tt>, which reports the result to the caller.</p> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
81 | <p>Most of the multi-byte encoding probers are inherited from <tt class="classname">MultiByteCharSetProber</tt> (defined in <tt class="filename">mbcharsetprober.py</tt>), and simply hook up the appropriate state machine and distribution analyzer and let <tt class="classname">MultiByteCharSetProber</tt> do the rest of the work. <tt class="classname">MultiByteCharSetProber</tt> runs the text through the encoding-specific state machine, one byte at a time, to look for byte sequences that would indicate a conclusive positive or negative result. At the same time, <tt class="classname">MultiByteCharSetProber</tt> feeds the text to an encoding-specific distribution analyzer.</p> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
82 | <p>The distribution analyzers (each defined in <tt class="filename">chardistribution.py</tt>) use language-specific models of which characters are used most frequently. Once <tt class="classname">MultiByteCharSetProber</tt> has fed enough text to the distribution analyzer, it calculates a confidence rating based on the number of frequently-used characters, the total number of characters, and a language-specific distribution ratio. If the confidence is high enough, <tt class="classname">MultiByteCharSetProber</tt> returns the result to <tt class="classname">MBCSGroupProber</tt>, which returns it to <tt class="classname">UniversalDetector</tt>, which returns it to the caller.</p> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
83 | <p>The case of Japanese is more difficult. Single-character distribution analysis is not always sufficient to distinguish between <tt class="literal">EUC-JP</tt> and <tt class="literal">SHIFT_JIS</tt>, so the <tt class="classname">SJISProber</tt> (defined in <tt class="filename">sjisprober.py</tt>) also uses 2-character distribution analysis. <tt class="classname">SJISContextAnalysis</tt> and <tt class="classname">EUCJPContextAnalysis</tt> (both defined in <tt class="filename">jpcntx.py</tt> and both inheriting from a common <tt class="classname">JapaneseContextAnalysis</tt> class) check the frequency of Hiragana syllabary characters within the text. Once enough text has been processed, they return a confidence level to <tt class="classname">SJISProber</tt>, which checks both analyzers and returns the higher confidence level to <tt class="classname">MBCSGroupProber</tt>.</p> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
84 | </div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
85 | <div class="section" lang="en"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
86 | <div class="titlepage"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
87 | <div><div><h3 class="title"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
88 | <a name="how.sb" class="skip" href="#how.sb" title="link to this section"><img src="images/permalink.gif" alt="[link]" title="link to this section" width="8" height="9"></a> Single-byte encodings</h3></div></div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
89 | <div></div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
90 | </div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
91 | <p>The single-byte encoding prober, <tt class="classname">SBCSGroupProber</tt> (defined in <tt class="filename">sbcsgroupprober.py</tt>), is also just a shell that manages a group of other probers, one for each combination of single-byte encoding and language: <tt class="literal">windows-1251</tt>, <tt class="literal">KOI8-R</tt>, <tt class="literal">ISO-8859-5</tt>, <tt class="literal">MacCyrillic</tt>, <tt class="literal">IBM855</tt>, and <tt class="literal">IBM866</tt> (Russian); <tt class="literal">ISO-8859-7</tt> and <tt class="literal">windows-1253</tt> (Greek); <tt class="literal">ISO-8859-5</tt> and <tt class="literal">windows-1251</tt> (Bulgarian); <tt class="literal">ISO-8859-2</tt> and <tt class="literal">windows-1250</tt> (Hungarian); <tt class="literal">TIS-620</tt> (Thai); <tt class="literal">windows-1255</tt> and <tt class="literal">ISO-8859-8</tt> (Hebrew).</p> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
92 | <p><tt class="classname">SBCSGroupProber</tt> feeds the text to each of these encoding+language-specific probers and checks the results. These probers are all implemented as a single class, <tt class="classname">SingleByteCharSetProber</tt> (defined in <tt class="filename">sbcharsetprober.py</tt>), which takes a language model as an argument. The language model defines how frequently different 2-character sequences appear in typical text. <tt class="classname">SingleByteCharSetProber</tt> processes the text and tallies the most frequently used 2-character sequences. Once enough text has been processed, it calculates a confidence level based on the number of frequently-used sequences, the total number of characters, and a language-specific distribution ratio.</p> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
93 | <p>Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis, <tt class="classname">HebrewProber</tt> (defined in <tt class="filename">hebrewprober.py</tt>) tries to distinguish between Visual Hebrew (where the source text actually stored “<span class="quote">backwards</span>” line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about direction of the source text, and return the appropriate encoding (<tt class="literal">windows-1255</tt> for Logical Hebrew, or <tt class="literal">ISO-8859-8</tt> for Visual Hebrew).</p> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
94 | </div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
95 | <div class="section" lang="en"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
96 | <div class="titlepage"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
97 | <div><div><h3 class="title"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
98 | <a name="how.windows1252" class="skip" href="#how.windows1252" title="link to this section"><img src="images/permalink.gif" alt="[link]" title="link to this section" width="8" height="9"></a> windows-1252</h3></div></div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
99 | <div></div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
100 | </div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
101 | <p>If <tt class="classname">UniversalDetector</tt> detects a high-bit character in the text, but none of the other multi-byte or single-byte encoding probers return a confident result, it creates a <tt class="classname">Latin1Prober</tt> (defined in <tt class="filename">latin1prober.py</tt>) to try to detect English text in a <tt class="literal">windows-1252</tt> encoding. This detection is inherently unreliable, because English letters are encoded in the same way in many different encodings. The only way to distinguish <tt class="literal">windows-1252</tt> is through commonly used symbols like smart quotes, curly apostrophes, copyright symbols, and the like. <tt class="classname">Latin1Prober</tt> automatically reduces its confidence rating to allow more accurate probers to win if at all possible.</p> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
102 | </div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
103 | </div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
104 | <div class="footernavigation"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
105 | <div style="float: left">← <a class="NavigationArrow" href="usage.html">Usage</a> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
106 | </div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
107 | <div style="text-align: right"> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
108 | <a class="NavigationArrow" href="history.html">Revision history</a> →</div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
109 | </div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
110 | <hr> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
111 | <div id="footer"><p class="copyright">Copyright © 2006, 2007, 2008 Mark Pilgrim · <a href="mailto:mark@diveintomark.org">mark@diveintomark.org</a> · <a href="license.html">Terms of use</a></p></div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
112 | </div></div> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
113 | </body> |
7662053c3906
updated CharDet to 2.2.1, updated changelog
T.Rzepka <Tobias.Rzepka@gmail.com>
parents:
0
diff
changeset
|
114 | </html> |