{"id":1074,"date":"2023-10-15T20:05:19","date_gmt":"2023-10-15T20:05:19","guid":{"rendered":"https:\/\/www.skynext.tech\/?p=1074"},"modified":"2023-11-05T17:43:40","modified_gmt":"2023-11-05T17:43:40","slug":"lightweight-english-text-stream-compression-in-python","status":"publish","type":"post","link":"https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/","title":{"rendered":"Lightweight ASCII English Text Stream Compression in Python."},"content":{"rendered":"\n<p><strong>NOTE : updated documentation and source code is available at :<\/strong><\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/rodv92\/PLETSC\">https:\/\/github.com\/rodv92\/PLETSC<\/a><\/p>\n\n\n\n<p><\/p>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"user-content-pletsc\"><a href=\"https:\/\/github.com\/rodv92\/PLETSC#pletsc\">PLETSC<\/a><\/h1>\n\n\n\n<p>Lightweight english text stream compression, with word tokens, ngrams, session dictionaries and Huffmann for unknown words.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"user-content-how-to-use-\"><a href=\"https:\/\/github.com\/rodv92\/PLETSC#how-to-use-\">How to use :<\/a><\/h1>\n\n\n\n<p>git clone and decompress dics.zip in the current folder.<\/p>\n\n\n\n<p>Syntax for compression :<\/p>\n\n\n\n<p>python3 dicstrv.py -c txt_inputfile compressed_outputfile Reads txt_inputfile and writes compressed text stream to compressed_outputfile.<\/p>\n\n\n\n<p>python3 dicstrv.py -c txt_inputfile Reads txt_input file and writes compressed output to stdout<\/p>\n\n\n\n<p>Syntax for decompression :<\/p>\n\n\n\n<p>python3 dicstrv.py -x compressed_inputfile txt_outputfile Reads compressed_inputfile and writes cleartext to txt_outputfile.<\/p>\n\n\n\n<p>python3 dicstrv.py -x compressed_inputfile<\/p>\n\n\n\n<p>Reads compressed_input file and writes cleartext output to stdout<\/p>\n\n\n\n<p>Syntax to generate a compiled dictionary of ngrams :<\/p>\n\n\n\n<p>python3 dicstrv.py -d cleartext_ngrams_inputfile 
compressed_ngrams<\/p>\n\n\n\n<p>This is rarely used in normal operation.<\/p>\n\n\n\n<p>NOTE: the dictionary file count1_w.txt must be in the same directory as the script. outngrams.bin must also be in that directory if ngrams are used (secondpass=True).<\/p>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"user-content-description-\"><a href=\"https:\/\/github.com\/rodv92\/PLETSC#description-\">Description :<\/a><\/h1>\n\n\n\n<p>This script is useful for ASCII English text stream compression. It&#8217;s pedantic (the P in PLETSC stands for &#8220;pedantic&#8221;) because its final goal is to enforce, at a minimum, some English syntactic rules, such as whitespace after &#8220;,&#8221; but not before, capitalization after a &#8220;.&#8221;, etc. (but not grammar). Spell checking is recommended, but should be done upstream (by another application layer). It ensures a better compression ratio, since compression is based on words of the English dictionary.<\/p>\n\n\n\n<p>Its compression method is primarily based on a token (words and punctuation) dictionary. It leverages the frequency of modern English words:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Words of the primary dictionary are sorted from most used to least used.<\/li>\n\n\n\n<li>The line number is used as an index (+1); index 0 is reserved for whitespace.<\/li>\n<\/ul>\n\n\n\n<p>It also uses adaptive length encoding (1-2-3 bytes). The first 128 most used tokens are encoded on 1 byte. Tokens up to index 16384 + 128 are encoded on 2 bytes. 
Tokens up to index 2097152 + 16384 + 128 are encoded on 3 bytes.<\/p>\n\n\n\n<p>The 3 byte address space is split in two :<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>First part (when byte0 msb is 1 and byte1 msb is 1 and byte2 msb is 0) is further divided into two subspaces.\n<ul class=\"wp-block-list\">\n<li>The first subspace is for the remainder of the primary dictionary (it has 333 333 tokens).<\/li>\n\n\n\n<li>The second subspace holds an Ngram dictionary (more on that later).<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Second part (when byte0 msb is 1 and byte1 msb is 1 and byte2 msb is 1) is further divided into two subspaces.\n<ul class=\"wp-block-list\">\n<li>First part is for a session dictionary. A session dictionary is used to hold repeating unknown tokens. There are 2097152 &#8211; 5 codes available for this use. It is initially empty and kept in RAM only, hence a SESSION dictionary. This session dictionary does not need to be sent between the two parties, as it can be reconstructed entirely from the compressed stream.<\/li>\n\n\n\n<li>Second part is only 5 codes (TODO: for now there is just 1 code, and the switch between Huffmann and no compression is done with a bool parameter). It is an escape sequence meaning that the following bytes will be encoded with one of the following methods :\n<ul class=\"wp-block-list\">\n<li>first code : As a stream of chars (no compression), plus a C style termination (chr(0)).<\/li>\n\n\n\n<li>second code : Huffmann encoding, lowercase only.<\/li>\n\n\n\n<li>third code : Huffmann, lowercase + uppercase or uppercase only.<\/li>\n\n\n\n<li>fourth code : Huffmann, lowercase + uppercase + numbers, or numbers only.<\/li>\n\n\n\n<li>fifth code : All printable ASCII space, mainly for passwords. 
Each of these codes tells which Huffmann tree to use.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"user-content-performance-\"><a href=\"https:\/\/github.com\/rodv92\/PLETSC#performance-\">Performance :<\/a><\/h1>\n\n\n\n<p>It offers a good compression ratio (between 2.6 and 3.0+), that is, a compressed size of around 33% to 38% of the original size, depending mainly on the lexical complexity or lexical archaism of the source text, and on the presence of unknown or misspelled words.<\/p>\n\n\n\n<p>Higher lexical complexity or archaic texts, that is, input text using words that are less common in current (2023) usage, will yield lower compression ratios.<\/p>\n\n\n\n<p>The compression ratio is more or less stable : it is largely independent of text length.<\/p>\n\n\n\n<p>This contrasts with block algorithms, which compress small files poorly because of a significant overhead. For larger corpora, block algorithms usually perform better. Modern methods may also use ML to provide context and adapt the encoding accordingly, but they&#8217;re usually slower.<\/p>\n\n\n\n<p>This is why this algorithm is intended for stream compression (on the fly). However, its current implementation is based on reading files 
and outputting to a file or stdout.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"user-content-compression-speed-all-options-enabled\"><a href=\"https:\/\/github.com\/rodv92\/PLETSC#compression-speed-all-options-enabled\">Compression speed (all options enabled)<\/a><\/h1>\n\n\n\n<p>For this test :<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The file is call_of_cthulhu.txt; its uncompressed size is 69 kB.<\/li>\n\n\n\n<li>Compression speed is around 23.3 kB\/s on an Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz (a computer from 2011) with SSD storage.<\/li>\n<\/ul>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"user-content-footprint-filesystem\"><a href=\"https:\/\/github.com\/rodv92\/PLETSC#footprint-filesystem\">Footprint (filesystem)<\/a><\/h1>\n\n\n\n<p>The zipped size of count1_w.txt + outngrams.bin is 11 566 806 bytes. The unzipped size is 31 327 633 bytes + 3 157 445 bytes = 34 485 078 bytes.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"user-content-footprint-memory\"><a href=\"https:\/\/github.com\/rodv92\/PLETSC#footprint-memory\">Footprint (memory)<\/a><\/h1>\n\n\n\n<p>To be determined<\/p>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"user-content-dependecies\"><a href=\"https:\/\/github.com\/rodv92\/PLETSC#dependecies\">Dependencies<\/a><\/h1>\n\n\n\n<p>These Python modules are required :<\/p>\n\n\n\n<p>codecs, nltk, re, bitstring, bitarray, struct, time, dahuffman<\/p>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"user-content-requirements\"><a href=\"https:\/\/github.com\/rodv92\/PLETSC#requirements\">Requirements<\/a><\/h1>\n\n\n\n<p>The input text file must be ASCII (for now) or UTF-8 decodable to ASCII (English). Conversion errors are ignored. The decoded file will be encoded in ASCII. 
It should be in English to get adequate conversion.<\/p>\n\n\n\n<p>Both ends (sender and receiver) MUST have the SAME dictionaries and the SAME Huffmann tables, as these are not sent with the data.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"user-content-information-about-the-dictionaries\"><a href=\"https:\/\/github.com\/rodv92\/PLETSC#information-about-the-dictionaries\">Information about the dictionaries<\/a><\/h1>\n\n\n\n<p>The primary dictionary is based on the &#8220;count_1w.txt&#8221; English dictionary of 333 333 words (ordered by lexical prevalence), tweaked with added special characters (also listed by order of prevalence) and added English contractions, and with the word counts stripped off.<\/p>\n\n\n\n<p>The original primary dictionary file is available at :&nbsp;<a href=\"https:\/\/norvig.com\/ngrams\/\">https:\/\/norvig.com\/ngrams\/<\/a><\/p>\n\n\n\n<p>The script also features a secondary (optional) compression pass based on a compiled dictionary named outngrams.bin.<\/p>\n\n\n\n<p>This pass compresses 4 and 5 word ngrams found in the stream produced by the first compression step. 
Ngrams of fewer than 4 words are not considered worthwhile, as the first pass will usually encode them on 3 bytes, the same size as a compressed ngram.<\/p>\n\n\n\n<p>Compression and decompression require the primary dictionary to be available, and also the secondary one if the boolean secondpass is set to True (the default).<\/p>\n\n\n\n<p><strong>The zip &#8220;dics.zip&#8221; already contains a compiled version of these dictionaries.<\/strong><\/p>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"user-content-more-information\"><a href=\"https:\/\/github.com\/rodv92\/PLETSC#more-information\">More information<\/a><\/h1>\n\n\n\n<p>The algorithm is heavily commented in the code.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"user-content-field-of-application\"><a href=\"https:\/\/github.com\/rodv92\/PLETSC#field-of-application\">Field of application<\/a><\/h1>\n\n\n\n<p>Main applications could be messaging over low bandwidth links like POCSAG radio text, or JS8Call for HAM radio, and IoT.<\/p>\n\n\n\n<p>However, note that the underlying digital mode should allow binary transmission (not only transmission of printable ASCII characters) for seamless integration.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"user-content-todo-and-issues-\"><a href=\"https:\/\/github.com\/rodv92\/PLETSC#todo-and-issues-\">TODO and ISSUES :<\/a><\/h1>\n\n\n\n<p>See comments in the code.<\/p>\n\n\n\n<p>The main issues for now are syntactic rules and spurious whitespace, missing whitespace where it should have been kept, problems with hyphenated tokens, spurious newlines, problems with some possessive forms, and special constructs other than emails and well formed URLs.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"user-content-ngrams-processing-from-scratch-\"><a href=\"https:\/\/github.com\/rodv92\/PLETSC#ngrams-processing-from-scratch-\">Ngrams Processing from scratch :<\/a><\/h1>\n\n\n\n<p>This is useful if you want to tweak or create your own dictionaries. We&#8217;ll mainly discuss the outngrams.bin dictionary, 
as count_1w.txt tweaking is straightforward. Note that count1_w.txt should not be modified once outngrams.bin is generated, or you&#8217;ll have to rebuild outngrams.bin.<\/p>\n\n\n\n<p>A preparatory step is required to generate a compressed version of the ngrams files, if you want to do it from scratch :<\/p>\n\n\n\n<p>First create the ngrams CSV using this code repo :&nbsp;<a href=\"https:\/\/github.com\/orgtre\/google-books-ngram-frequency\/tree\/main\/python\">https:\/\/github.com\/orgtre\/google-books-ngram-frequency\/tree\/main\/python<\/a><\/p>\n\n\n\n<p>The repo contains scripts that perform the download and concatenation of ngrams according to criteria you specify. Note that LETSC has limited room in the first part of the 3 byte address space : more or less 2097152 &#8211; 333333 codes. I have created a list of 1571125 ngrams. The distribution between 4grams and 5grams is roughly 50%\/50%.<\/p>\n\n\n\n<p>The resulting CSV files need to be further processed by our algorithm.<\/p>\n\n\n\n<p>The script that creates outngrams.bin (the secondary compiled dictionary based on the primary dictionary and the ngrams csv files from google-books-ngram) is called ngrams_format_dic.py. Each line of this script is commented.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" 
fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"# LIGHTWEIGHT ENGLISH TEXT STREAM COMPRESSION (LETSC)\n# (adaptive encoding length 1byte\/2byte\/3byte based on word dictionary with statistical prevalence ordering - count1_w.txt)\n# Huffmann encoding for uknown tokens\n# Enforces English syntax rules for punctuation\n# Takes into account possessives and contractions\n# Has URLs and e-mails processing rules, more to follow\n# Second pass compression using a dictionary of the most frequent 4 N-Grams of English fiction.\n\n#GPL 3 License\n# www.skynext.tech\n# Rodrigo Verissimo\n# v0.92\n# October 21th, 2023\n\n\n# Python + packages Requirements\n\n# Python 3.9\n# nltk, bitarray, bitstring, re, dahuffmann\n\n# Performance : ratios between x2.6 for Middle to Modern and elaborate English (ex: Shakespeare)\n# Up to x3 and more for simple english.\n# adapted for text messaging \/ streaming\n# Requires the same dictionary on both channel endpoints.\n\n# ALGORITHM. Very straightforward. (adaptive encoding length based on dictionary with statistical ordering)\n\n#################################################################################\n# First byte :\n\n#if MSB is 0, a whole word is encoded on the first 7 bits of one byte only.\n#This makes 127 possible words. These are reserved for the first 127 most used \n#english words. punctuation also appears as a possible word\n\n# Second byte :\n\n#if MSB of first byte is 1, and MSB of second byte is 0, a whole word is encoded\n# with the help of the 7 LSB of byte 1 plus the 7 LSB of byte 2. 
\n# This makes room for the next 16384 most used english words.\n\n# Third byte :\n# if MSB of first byte is 1 and MSB of second byte is 1, and the MSB of third byte is 0\n# a whole word is encoded\n# with the help of the 7 + 7 + 7 = 21 bits (2 097 152 possible words)\n\n# For now, the 3 byte address space is split into two 2 097 152 address spaces\n# That is, the case of all 3 bytes MSB being 1 is treated separately.\n# In this adress space, only a handful of codes are used as an escape sequence for particular \n# Huffmann trees, see below.\n\n#-&gt;\n#load dictionary of english words from most used to least used.\n#punctuation and special characters have been added with order of prevalence.\n#punctuation frequency is from wikipedia corpus. (around 1.3 billion words) \n#it has been normalized to the frequency of the 1\/3 million word list based \n#on the google web trillon word corpus. that is, frequencies for special chars have been multiplied by 788.39\n#wikipedia punctuation is not optimized for chat, as it lower prevalence of chars like question marks\n#that may appear more frequently in chat situations.\n\n# the first tokenizer used does not separate any special character attached (without whitespace) to a word\n# this will mostly result in an unknown word in the dictionary\n# this key absence in the reverse dict will be catched and treated by another tokenizer (mainly for possessive\n# forms and contractions)\n\n#for possessives ex: &quot;dog's bone&quot; or plural &quot;dogs' bones&quot; a separate tokenizer is used to split into\n# &quot;dog&quot; , &quot;'s&quot;\n# &quot;'s&quot; and &quot;'&quot; also appear in the dictionary.\n\n# ROADMAP\n# remove whitespaces left of punctuation DONE\n# manage new lines DONE\n# manage websites and emails DONE\n# TODO\n# add spell check ! 
\n# TODO\n# Remove spurious new lines that appear after encoding special sequences such mails or URLS\n# DONE (basic Huffmann, some chars missing in tree)\n# add Huffmann encoding for absent words in dictionary (neologisms,colloqualisms,dialects, or misspellings) DONE\n# DONE\n\n# TODO : test with more texts such as wikipedia XML and various authors works, to catch as much\n# use cases and formatting issues that arise to improve the algorithm\n\n# add adaptive Huffmann. use 4 Huffmann trees. (see below)\n# Assuming there are 4 codes for hufmmann : hufmann lower case, hufmann lower + capitals, huffmann\n# lower + capitals + numeric, all printable ASCII excluding whitespace : same as preceding category plus \n# special chars.\n# Chosing the tree to use would be done by string regex.\n\n#DONE\n# Detect UTF-8 and transcode to ASCII (potentially lossy)\n#DONE\n\n\n# TODO\n# Dictionary Learn over time (re-shuffle the order of tokens)\n# Without transmission of any info between parties\n# Dangerous if sync is lost between the two parties\n# TODO\n\n# TODO\n# optimize Huffmann part to remove the need for the chr(0) termination = scan for EOF sequence in Huffmann to get\n# the Huffmann byte sequence length. TODO\n\n\n# DONE\n# Add second pass compression using word N-grams lookup table. (4 and 5 N-grams seem to be a good compromize)\n# The idea is to encode 4 and 5 token substrings in a line by a single 3 byte code.\n# There is plenty of room left in the 3 byte address space. 
For now, there is 333 333 - 16384 - 128 tokens used = 316821 tokens used\n# from 4194304 - 3 total address space.\n# DONE using 1 571 125 codes for a 50\/50 mix of 4grams and 5grams.\n# There is still at least 2million codes left.\n#  for now we plan 4 escape sequences for the selection of one of the 4 Huffmann trees.\n\n\n# ngrams processing is first done with the create_ngrams_dic.sh script.\n&quot;&quot;&quot;\npython3 ngrams_format_dic.py 4grams_english-fiction.csv outngrams4.txt #remove counts and process contractions\npython3 ngrams_format_dic.py 5grams_english-fiction.csv outngrams5.txt #remove counts and process contractions\n\npython3 dicstrv4.py -d outngrams4.txt outngrams4.bin.dup #convert ngrams txt to compressed form\npython3 dicstrv4.py -d outngrams5.txt outngrams5.bin.dup #convert ngrams txt to compressed form\nawk '!seen[$0]++' outngrams4.bin.dup &gt; outngrams4.bin #Remove spurious duplicates that may arise\nawk '!seen[$0]++' outngrams5.bin.dup &gt; outngrams5.bin #Remove spurious duplicates that may arise\nsed -i '786001,$ d' outngrams4.bin # truncate to fit target address space\nsed -i '786001,$ d' outngrams5.bin # truncate to fit target address space\n\ncat outngrams4.bin outngrams5.bin &gt; outngrams.bin # concatenate. this is our final form\ncat outngrams.bin | awk '{ print length, bash $0 }' | sort -n -s | cut -d&quot; &quot; -f2- &gt; sorted.txt # sort by size to have an idea of distribution\n\n# ngrams that encode as less than 4 bytes have been pruned since the ratio is 1\n\n&quot;&quot;&quot;\n\n# DONE \n# It is probable that the most used 4 tokens N-grams are based on already frequent words. 
that individually\n# encode as 1 byte or two bytes.\n# Worst case : all the 4 tokens are encoded in the 1 to 128 addres space, so they take a total 4 bytes.\n# The resulting code will be 3 bytes, a deflate percent of 25%\n# If one of the tokens is 2 byte (128 to 16384 -1 address space), then it uses 5 bytes.\n# deflate percent is 40%\n# The unknown is the statistical prevalence of two million 4 token N-grams.\n# (ex: coming from english fiction corpus) in a standard chat text.\n\n# First encode the google most frequent 4 and 5 N-grams csv file to replace the tokens in each N-gram by the corrsponding \n# byte sequences from our codes in the count_1w.txt dictionary. This will be another pre-process script.\n# The resulting new csv format will be :\n# some 3 byte index = x04x09x23.\n# The 3 byte index is simply the line number of the compressed ngram. \n\n# read that in ram. Conservative Estimate 4 bytes + 3 bytes per entry 7 bytes * 2 000 000 = 14 Meg memory footprint.\n# We already have a 4 MB * 3  12 Meg footprint from count_1w (estimate)\n\n# Generate the inverse map dictionary (mapping sequences to 3 byte indexes)\n# x04x09x23' = some 3 byte index\n# Should not be a problem since there is a 1 to 1 relationship between the two\n\n# Then perform a first pass compression.\n# Then scan the first pass compression file using a 4 token sliding window.\n# Contractions is a case that will have to be managed.\n\n# If there are overlapping matches, chose the match that result in the best deflation, if any.\n# If the unknown word escape codes appears, stop processing and resume after the escaped word\n\n# Overall, replace the byte sequence by the corrsponding 3 byte sequence.\n# DONE\n\n\n\nimport sys\nimport traceback\n\n#print(len(sys.argv))\n#op = (sys.argv[1]).encode(&quot;ascii&quot;).decode(&quot;ascii&quot;)\n#print(op)\n#quit()\n\nif ((len(sys.argv) &lt; 3) or (len(sys.argv) &gt; 4)):\n    print(&quot;Syntax for compression :\\n&quot;)\n    print(&quot;python3 
dicstrv.py -c &lt;txt_inputfile&gt; &lt;compressed_outputfile&gt;&quot;)\n    print(&quot;Reads txt_inputfile and writes compressed text stream to compressed_outputfile.\\n&quot;) \n    \n    print(&quot;python3 dicstrv.py -c &lt;txt_inputfile&gt;&quot;)\n    print(&quot;Reads txt_input file and writes compressed output to stdout\\n&quot;)\n\n    print(&quot;Syntax for decompression :\\n&quot;)\n    print(&quot;python3 dicstrv.py -x &lt;compressed_inputfile&gt; &lt;txt_outputfile&gt;&quot;)\n    print(&quot;Reads compressed_inputfile and writes cleartext to txt_outputfile.\\n&quot;) \n    \n    print(&quot;python3 dicstrv.py -x &lt;compressed_inputfile&gt;\\n&quot;)\n    print(&quot;Reads compressed_input file and writes cleartext output to stdout\\n&quot;)\n\n    print(&quot;NOTE: dictionary file count1_w.txt must be in the same directory as the script.&quot;)    \n    quit()\n\nif (sys.argv[1] == &quot;-c&quot;):\n    compress = True\n    gendic = False\nelif (sys.argv[1] == &quot;-d&quot;):\n    compress = True\n    gendic = True\nelif (sys.argv[1] == &quot;-x&quot;):\n    compress = False\n    gendic = False\nelse:\n    # report the offending option (argv[1]), not the script name (argv[0])\n    print(&quot;unknown operation: &quot; + str(sys.argv[1]) + &quot; type 'python3 dicstrv.py' for help&quot;)\n    # stop here: compress and gendic are undefined for an unknown operation\n    quit()\n\nif (len(sys.argv) == 3):\n    infile = sys.argv[2]\n    outfile = ''\nif (len(sys.argv) == 4):\n    infile = sys.argv[2]\n    outfile = sys.argv[3]\n\nimport codecs\nimport nltk\nfrom nltk.tokenize import TweetTokenizer\ntknzr = TweetTokenizer()\n\nimport re\nimport bitstring\nfrom bitarray import bitarray\nimport struct\nimport time\nfrom dahuffman import HuffmanCodec\n\n\ndebug_on = False\ndebug_ngrams_dic = False\nsecondpass = True\nuse_huffmann = False\nunknown_token_idx = 16384 + 128 + 2097152\n\n\ndef debugw(strdebug):\n    if (debug_on):\n        print(strdebug)\n\n# Huffmann is only used for absent words in count1_w.txt dictionary\n# General lower and upper case frequency combined as lowercase\n\n\n\ncodec_lower = 
HuffmanCodec.from_frequencies(\n{'e' :   56.88,\t'm' :\t15.36,\n'a'\t:\t43.31,\t'h'\t:\t15.31,\n'r'\t:\t38.64,\t'g'\t:\t12.59,\n'i'\t:\t38.45,\t'b'\t:\t10.56,\n'o'\t:\t36.51,\t'f'\t:\t9.24,\n't'\t:\t35.43,\t'y'\t:\t9.06,\n'n'\t:\t33.92,\t'w'\t:\t6.57,\n's'\t:\t29.23,\t'k'\t:\t5.61,\n'l'\t:\t27.98,\t'v'\t:\t5.13,\n'c'\t:\t23.13,\t'x'\t:\t1.48,\n'u'\t:\t18.51,\t'z'\t:\t1.39,\n'd'\t:\t17.25,\t'j'\t:\t1,\n'p'\t:\t16.14,\t'q'\t:\t1\n}\n)\n\ndebugw(codec_lower.get_code_table())\n\n# following is ASCII mixed upper and lower case frequency from an English writer from Palm OS PDA memos in 2002\n# Credit : http:\/\/fitaly.com\/board\/domper3\/posts\/136.html\n\ncodec_upperlower = HuffmanCodec.from_frequencies(\n\n{'A' : 0.3132,\n'B' : 0.2163,\n'C' : 0.3906,\n'D' : 0.3151,\n'E' : 0.2673,\n'F' : 0.1416,\n'G' : 0.1876,\n'H' : 0.2321,\n'I' : 0.3211,\n'J' : 0.1726,\n'K' : 0.0687,\n'L' : 0.1884,\n'M' : 0.3529,\n'N' : 0.2085,\n'O' : 0.1842,\n'P' : 0.2614,\n'Q' : 0.0316,\n'R' : 0.2519,\n'S' : 0.4003,\n'T' : 0.3322,\n'U' : 0.0814,\n'V' : 0.0892,\n'W' : 0.2527,\n'X' : 0.0343,\n'Y' : 0.0304,\n'Z' : 0.0076,\n'a' : 5.1880,\n'b' : 1.0195,\n'c' : 2.1129,\n'd' : 2.5071,\n'e' : 8.5771,\n'f' : 1.3725,\n'g' : 1.5597,\n'h' : 2.7444,\n'i' : 4.9019,\n'j' : 0.0867,\n'k' : 0.6753,\n'l' : 3.1750,\n'm' : 1.6437,\n'n' : 4.9701,\n'o' : 5.7701,\n'p' : 1.5482,\n'q' : 0.0747,\n'r' : 4.2586,\n's' : 4.3686,\n't' : 6.3700,\n'u' : 2.0999,\n'v' : 0.8462,\n'w' : 1.3034,\n'x' : 0.1950,\n'y' : 1.1330,\n'z' : 0.0596\n})\n\ndebugw(codec_upperlower.get_code_table())\n\n# following is ASCII alpha numeric frequency from an English writer from Palm OS PDA memos in 2002\n# Credit : http:\/\/fitaly.com\/board\/domper3\/posts\/136.html\n\ncodec_alphanumeric = HuffmanCodec.from_frequencies(\n\n{'0' : 0.5516,\n'1' : 0.4594,\n'2' : 0.3322,\n'3' : 0.1847,\n'4' : 0.1348,\n'5' : 0.1663,\n'6' : 0.1153,\n'7' : 0.1030,\n'8' : 0.1054,\n'9' : 0.1024,\n'A' : 0.3132,\n'B' : 0.2163,\n'C' : 0.3906,\n'D' : 0.3151,\n'E' : 0.2673,\n'F' : 
0.1416,\n'G' : 0.1876,\n'H' : 0.2321,\n'I' : 0.3211,\n'J' : 0.1726,\n'K' : 0.0687,\n'L' : 0.1884,\n'M' : 0.3529,\n'N' : 0.2085,\n'O' : 0.1842,\n'P' : 0.2614,\n'Q' : 0.0316,\n'R' : 0.2519,\n'S' : 0.4003,\n'T' : 0.3322,\n'U' : 0.0814,\n'V' : 0.0892,\n'W' : 0.2527,\n'X' : 0.0343,\n'Y' : 0.0304,\n'Z' : 0.0076,\n'a' : 5.1880,\n'b' : 1.0195,\n'c' : 2.1129,\n'd' : 2.5071,\n'e' : 8.5771,\n'f' : 1.3725,\n'g' : 1.5597,\n'h' : 2.7444,\n'i' : 4.9019,\n'j' : 0.0867,\n'k' : 0.6753,\n'l' : 3.1750,\n'm' : 1.6437,\n'n' : 4.9701,\n'o' : 5.7701,\n'p' : 1.5482,\n'q' : 0.0747,\n'r' : 4.2586,\n's' : 4.3686,\n't' : 6.3700,\n'u' : 2.0999,\n'v' : 0.8462,\n'w' : 1.3034,\n'x' : 0.1950,\n'y' : 1.1330,\n'z' : 0.0596\n})\n\ndebugw(codec_alphanumeric.get_code_table())\n\n# following is Whole ASCII printable chars frequency except whitespace from an English writer from Palm OS PDA memos in 2002\n# Credit : http:\/\/fitaly.com\/board\/domper3\/posts\/136.html\n\ncodec_all = HuffmanCodec.from_frequencies(\n\n{'!' : 0.0072,\n'\\&quot;' : 0.2442,\n'#' : 0.0179,\n'$' : 0.0561,\n'%' : 0.0160,\n'&amp;' : 0.0226,\n'\\'' : 0.2447,\n'(' : 0.2178,\n')' : 0.2233,\n'*' : 0.0628,\n'+' : 0.0215,\n',' : 0.7384,\n'-' : 1.3734,\n'.' : 1.5124,\n'\/' : 0.1549,\n'0' : 0.5516,\n'1' : 0.4594,\n'2' : 0.3322,\n'3' : 0.1847,\n'4' : 0.1348,\n'5' : 0.1663,\n'6' : 0.1153,\n'7' : 0.1030,\n'8' : 0.1054,\n'9' : 0.1024,\n':' : 0.4354,\n';' : 0.1214,\n'&lt;' : 0.1225,\n'=' : 0.0227,\n'&gt;' : 0.1242,\n'?' 
: 0.1474,\n'@' : 0.0073,\n'A' : 0.3132,\n'B' : 0.2163,\n'C' : 0.3906,\n'D' : 0.3151,\n'E' : 0.2673,\n'F' : 0.1416,\n'G' : 0.1876,\n'H' : 0.2321,\n'I' : 0.3211,\n'J' : 0.1726,\n'K' : 0.0687,\n'L' : 0.1884,\n'M' : 0.3529,\n'N' : 0.2085,\n'O' : 0.1842,\n'P' : 0.2614,\n'Q' : 0.0316,\n'R' : 0.2519,\n'S' : 0.4003,\n'T' : 0.3322,\n'U' : 0.0814,\n'V' : 0.0892,\n'W' : 0.2527,\n'X' : 0.0343,\n'Y' : 0.0304,\n'Z' : 0.0076,\n'[' : 0.0086,\n'\\\\' : 0.0016,\n']' : 0.0088,\n'^' : 0.0003,\n'_' : 0.1159,\n'`' : 0.0009,\n'a' : 5.1880,\n'b' : 1.0195,\n'c' : 2.1129,\n'd' : 2.5071,\n'e' : 8.5771,\n'f' : 1.3725,\n'g' : 1.5597,\n'h' : 2.7444,\n'i' : 4.9019,\n'j' : 0.0867,\n'k' : 0.6753,\n'l' : 3.1750,\n'm' : 1.6437,\n'n' : 4.9701,\n'o' : 5.7701,\n'p' : 1.5482,\n'q' : 0.0747,\n'r' : 4.2586,\n's' : 4.3686,\n't' : 6.3700,\n'u' : 2.0999,\n'v' : 0.8462,\n'w' : 1.3034,\n'x' : 0.1950,\n'y' : 1.1330,\n'z' : 0.0596,\n'{' : 0.0026,\n'|' : 0.0007,\n'}' : 0.0026,\n'~' : 0.0003,\n})\n\ndebugw(codec_all.get_code_table())\n#quit()        \n\ndef check_file_is_utf8(filename):\n    debugw(&quot;checking encoding of:&quot;)\n    debugw(filename)\n    try:\n        f = codecs.open(filename, encoding='utf-8', errors='strict')\n        for line in f:\n            pass\n        debugw(&quot;Valid utf-8&quot;)\n        return True\n    except UnicodeDecodeError:\n        debugw(&quot;invalid utf-8&quot;)\n        return False\n\ndef find_huffmann_to_use(token):\n\n    if(not use_huffmann):\n        debugw(&quot;do not use Huffmann, encode char by char&quot;)\n        return 0\n    \n    # re.search needs the string to scan (token) as its second argument\n    not_alllower = re.search(&quot;[^a-z]&quot;, token)\n    \n    if(not not_alllower):\n        debugw(&quot;all lower case&quot;)\n        return 1\n    \n    not_alllowerorupper = re.search(&quot;[^A-Za-z]&quot;, token)\n    \n    if(not not_alllowerorupper):\n        debugw(&quot;all lower or upper&quot;)\n        return 2\n    \n    not_alllalphanumeric = re.search(&quot;[^A-Za-z0-9]&quot;, token)\n    \n    if(not not_alllalphanumeric):\n        
debugw(&quot;all alpha numeric&quot;)\n        return 3\n    else:\n        debugw(&quot;all printable, except whitespace&quot;)\n        return 4\n    \ndef encode_unknown(token,treecode):\n\n    if (treecode == 0):\n        bytes_unknown = bytearray()\n        for charidx in range(0, len(token)):\n            debugw(&quot;appending chars..&quot;)\n            debugw(token[charidx])\n\n            # only append if it is not an unexpected termination in the unknown token\n            if (not ord(token[charidx]) == 0):\n                bytes_unknown.append(ord(token[charidx]))\n            else:\n                debugw(&quot;unexpected termination chr(0) in unknown token, discarding character&quot;)\n\n\n        return bytes_unknown\n    if (treecode == 1):\n        return codec_lower.encode(token)\n    if (treecode == 2):\n        return codec_upperlower.encode(token)           \n    if (treecode == 3):\n        return codec_alphanumeric.encode(token)                      \n    if (treecode == 4):\n        return codec_all.encode(token)                      \n\ndef decode_unknown(bytetoken,treecode):\n\n    if (treecode == 1):\n        return codec_lower.decode(bytetoken)\n    if (treecode == 2):\n        return codec_upperlower.decode(bytetoken)           \n    if (treecode == 3):\n        return codec_alphanumeric.decode(bytetoken)                      \n    if (treecode == 4):\n        return codec_all.decode(bytetoken)  \n\ndef compress_token_or_subtoken(compressed,line_token,token_of_line_count,lentoken,gendic):\n  \n    \n    global unknown_token_idx\n\n    try:\n\n        # is the token in english dictionary ?\n        debugw(&quot;line_token:&quot; + line_token)\n        tokenid = engdictrev[line_token]\n        subtokensid = [tokenid]\n\n        \n    except:\n        debugw(&quot;unknown word, special chars adjunct, or possessive form&quot;)\n        # let's try to split the unknown word from possible adjunct special chars\n        # for this we use 
another tokenizer\n        subtokens = nltk.word_tokenize(line_token)\n        if (len(subtokens) == 1):\n            # no luck...\n            # TODO : do not drop the word silently, encode it !\n            # If we encode a ngram dic, skip ngrams with unknown tokens in the primary dic.\n            # and return empty bytearray to signify ngram compression failure \n            if(gendic):\n                compressed = bytearray()\n                debugw(&quot;gendic : unknown word&quot;)\n                return (compressed, token_of_line_count)\n        \n            debugw(&quot;unknown word&quot;)\n\n            #AMEND dictionary \n            # add this unknown subtoken to a session dic so it can be recalled.\n            debugw(&quot;unknown word: &quot; + subtokens[0] + &quot; adding to session dic at id: &quot; + str(unknown_token_idx))\n            debugw(&quot;unknown word, adding to session dic at id: &quot; + str(unknown_token_idx))\n            \n            engdictrev[subtokens[0]] = unknown_token_idx\n            engdict[unknown_token_idx] = subtokens[0]\n            unknown_token_idx += 1\n                       \n\n            #subtokensid = [4194304 - 1] # subtoken code for unknown word escape sequence.                       
\n            subtokensid = [4194303 - find_huffmann_to_use(subtokens[0])]                   \n            #print(subtokensid)\n            #continue\n        else:\n            debugw(&quot;possible special char found&quot;)\n            subtokensid = []\n            for subtoken in subtokens:\n                debugw(&quot;subtoken=&quot;)\n                debugw(subtoken)\n                try:\n                    subtokensid.append(engdictrev[subtoken])\n                except:\n                    # no luck...\n                    # TODO : do not drop the word silently, encode it !\n        \n                    # If we encode a ngram dic, skip ngrams with unknown tokens in the primary dic.\n                    # and return empty bytearray to signify ngram compression failure \n                    if(gendic):\n                        compressed = bytearray()\n                        debugw(&quot;gendic : unknown word&quot;)\n                        return (compressed, token_of_line_count)\n        \n                    debugw(&quot;unknown subtoken&quot;)\n                    subtokensid.append(4194303 - find_huffmann_to_use(subtoken))\n                    #subtokensid.append(4194304 - 1)\n                    \n                    # add this unknown subtoken to a session dic so it can be recalled.\n                    #AMEND dictionary \n                    # add this unknown subtoken to a session dic so it can be recalled.\n                    debugw(&quot;unknown subtoken: &quot; + subtoken + &quot; adding to session dic at id: &quot; + str(unknown_token_idx))\n                    debugw(&quot;unknown subtoken, adding to session dic at id: &quot; + str(unknown_token_idx))\n                    engdictrev[subtoken] = unknown_token_idx\n                    engdict[unknown_token_idx] = subtoken\n                    unknown_token_idx += 1\n                    #continue\n    subtokenidx = 0\n    for subtokenid in subtokensid:        \n        \n        
debugw(&quot;subtokenid=&quot;)\n        debugw(subtokenid)\n        # maximum level of token unpacking is done\n        if(subtokenid &lt; 128):\n\n            debugw(&quot;super common word&quot;)\n            debugw(engdict[subtokenid])\n\n            #convert to bytes\n            byte0 = subtokenid.to_bytes(1, byteorder='little')\n            debugw(&quot;hex:&quot;)\n            debugw(byte0.hex())\n\n            #append to bytearray\n            compressed.append(byte0[0])\n\n        if(128 &lt;= subtokenid &lt; 16384 + 128):\n\n            debugw(&quot;common word&quot;)\n\n            #remove offset\n            debugw(engdict[subtokenid])\n            subtokenid -= 128\n            \n            #convert to bytes1 (array of 2 bytes)\n            bytes1 = subtokenid.to_bytes(2,byteorder='little')\n            debugw(&quot;&quot;.join([f&quot;\\\\x{byte:02x}&quot; for byte in bytes1]))\n        \n            #convert to bitarray\n            c = bitarray(endian='little')\n            c.frombytes(bytes1)\n            debugw(c)\n            \n            # set msb of first byte to 1 and shift the more significant bits up.\n            c.insert(7,1)\n            debugw(c)\n            \n            # remove excess bit\n            del c[16:17:1]\n            debugw(c)\n            \n            # append our two tweaked bytes to the compressed bytearray\n            compressed.append((c.tobytes())[0])\n            compressed.append((c.tobytes())[1])\n\n        #if(16384 +128 &lt;= subtokenid &lt; 4194304 - 1):\n        if(16384 +128 &lt;= subtokenid &lt; 2097152 + 16384 + 128):\n\n\n            debugw(&quot;rare word&quot;)\n            \n            # remove offset\n            debugw(engdict[subtokenid])\n            subtokenid -= (16384 + 128)\n\n            #convert to bytes1 (array of 3 bytes)\n            bytes2 = subtokenid.to_bytes(3,byteorder='little')\n            debugw(&quot;&quot;.join([f&quot;\\\\x{byte:02x}&quot; for byte in bytes2]))\n\n         
   #convert to bitarray\n            c = bitarray(endian='little')\n            c.frombytes(bytes2)\n            debugw(c)\n            \n            # set msb of first byte to 1 and shift the bits above up.\n            c.insert(7,1)\n            debugw(c)\n\n            # set msb of second byte to 1 and shift the bits above up.\n            c.insert(15,1)\n            debugw(c)\n\n            # remove two excess bits that arose from our shifts\n            del c[24:26:1]\n            debugw(c)\n            \n            # append our three tweaked bytes to the compressed bytearray\n            compressed.append((c.tobytes())[0])\n            compressed.append((c.tobytes())[1])\n            compressed.append((c.tobytes())[2])\n\n\n                #if(16384 +128 &lt;= subtokenid &lt; 4194304 - 1):\n        if(16384 +128 + 2097152 &lt;= subtokenid &lt; 4194304 - 5):\n\n\n            debugw(&quot;unknown word from session DIC&quot;)\n            \n            # remove offset\n            debugw(engdict[subtokenid])\n            subtokenid -= (2097152 + 16384 + 128)\n\n            #convert to bytes1 (array of 3 bytes)\n            bytes2 = subtokenid.to_bytes(3,byteorder='little')\n            debugw(&quot;&quot;.join([f&quot;\\\\x{byte:02x}&quot; for byte in bytes2]))\n\n            #convert to bitarray\n            c = bitarray(endian='little')\n            c.frombytes(bytes2)\n            debugw(c)\n            \n            # set msb of first byte to 1 and shift the bits above up.\n            c.insert(7,1)\n            debugw(c)\n\n            # set msb of second byte to 1 and shift the bits above up.\n            c.insert(15,1)\n            debugw(c)\n\n            # set msb of third byte to 1 and shift the bits above up.\n            c.insert(23,1)\n            debugw(c)\n\n\n            # remove three excess bits that arose from our shifts\n            del c[24:27:1]\n            debugw(c)\n            \n            # append our three tweaked bytes to the 
compressed bytearray\n            compressed.append((c.tobytes())[0])\n            compressed.append((c.tobytes())[1])\n            compressed.append((c.tobytes())[2])\n\n\n        #if(subtokenid == (4194304 - 1)):\n        if(subtokenid in range(4194299,4194304)):\n\n            #compressed.append(255)\n            #compressed.append(255)\n            #compressed.append(255)\n            debugw(&quot;huffmann tree code :&quot; + str(subtokenid))\n\n            # TODO : Use Huffmann tree instead of byte-&gt;byte encoding.\n            \n            #convert to bytes1 (array of 3 bytes)\n            bytes2 = subtokenid.to_bytes(3,byteorder='little')\n            debugw(&quot;&quot;.join([f&quot;\\\\x{byte:02x}&quot; for byte in bytes2]))\n\n            #convert to bitarray\n            c = bitarray(endian='little')\n            c.frombytes(bytes2)\n            debugw(c)\n            \n            # set msb of first byte to 1 and shift the bits above up.\n            c.insert(7,1)\n            debugw(c)\n\n            # set msb of second byte to 1 and shift the bits above up.\n            c.insert(15,1)\n            debugw(c)\n\n            # no need to set  msb of third byte to 1 since the range will take care of it.\n            #c.insert(23,1)\n            #debugw(c)\n\n            # remove two excess bits that arose from our shifts\n            del c[24:26:1]\n            debugw(c)\n            \n            # append our three tweaked bytes that signify the huffmann tree to use to the compressed bytearray\n            compressed.append((c.tobytes())[0])\n            compressed.append((c.tobytes())[1])\n            compressed.append((c.tobytes())[2])\n\n            if (len(subtokens) == 1):\n                if(not use_huffmann):\n                    debugw(&quot;encoding unkown word&quot;)\n                    #for charidx in range(0, len(line_token)):\n                    #    debugw(&quot;appending chars..&quot;)\n                    #    
debugw(line_token[charidx])\n                    #    compressed.append(ord(line_token[charidx]))\n                    compressed.extend(encode_unknown(line_token,0))\n                else:\n                    debugw(&quot;encoding unkown line token with Huffmann&quot;)\n                    huffmann_tree_code = -(subtokenid - 4194303)\n                    compressed.extend(encode_unknown(line_token,huffmann_tree_code))\n            else:\n                if(not use_huffmann):\n                    debugw(&quot;encoding unkown subtoken&quot;)\n                    #for charidx in range(0, len(subtokens[subtokenidx])):\n                    #    debugw(&quot;appending chars..&quot;)\n                    #    debugw((subtokens[subtokenidx])[charidx])\n                    #    compressed.append(ord((subtokens[subtokenidx])[charidx]))\n                    compressed.extend(encode_unknown(subtokens[subtokenidx],0))\n                else:\n                    debugw(&quot;encoding unkown subtoken with Huffmann&quot;)\n                    debugw(subtokens[subtokenidx])\n                    #huffmann_tree_code = find_huffmann_to_use(subtokens[subtokenidx])\n                    huffmann_tree_code = -(subtokenid - 4194303)\n                    compressed.extend(encode_unknown(subtokens[subtokenidx],huffmann_tree_code))\n            compressed.append(0) # terminate c string style\n        subtokenidx += 1        \n    token_of_line_count += 1\n\n    debugw(&quot;token of line count&quot;)\n    debugw(token_of_line_count)\n    debugw(&quot;lentoken&quot;)\n    debugw(lentoken)\n\n    if((token_of_line_count == lentoken) and (not gendic)):\n        # newline\n        debugw(&quot;append new line&quot;)\n        compressed.append(0)\n        #quit()  \n\n    return (compressed,token_of_line_count)\n\n\ndef compress_tokens(tokens,gendic):\n\n    #time.sleep(0.001)    \n    # Init byte array\n    compressed = bytearray()\n    \n    debugw(&quot;tokens are:&quot;)\n    
debugw(tokens)\n\n    for token in tokens:\n\n        debugw(&quot;token is:&quot;)\n        debugw(token)\n\n        token_of_line_count = 0\n        # start compression run\n        if(not len(token) and (not gendic)):\n            debugw(&quot;paragraph&quot;)\n            compressed.append(0)\n            #compressed.append(0)\n            #quit()\n        lentoken = len(token)\n        if (not gendic):\n            for line_token in token:           \n                (compressed, token_of_line_count) = compress_token_or_subtoken(compressed,line_token,token_of_line_count,lentoken,gendic)\n        else:\n                (compressed, token_of_line_count) = compress_token_or_subtoken(compressed,token,token_of_line_count,lentoken,gendic)           \n                if(not len(compressed)):\n                    debugw(&quot;unknown word in gendic sequence, aborting&quot;)\n                    compressed = bytearray()\n                    return compressed\n    # dump whole compressed stream\n    debugw(&quot;compressed ngram is=&quot;)\n    debugw(compressed.hex())\n    debugw(&quot;compressed ngram byte length is=&quot;)\n    debugw(len(compressed))\n\n    return compressed\n\ndef compress_second_pass(compressed):\n\n    ngram_compressed = bytearray()\n    ngram_length = 0\n    ngram_byte_length = 0\n    index_jumps = []\n    candidates = []\n    idx = 0\n    # second pass main loop\n    #debugw(&quot;compressed=&quot;)\n    #debugw(compressed)\n    while (idx &lt; len(compressed)):\n\n        debugw(&quot;second pass idx=&quot;)\n        debugw(idx)\n        idxchar = 0\n        reset_ngram = False\n        debugw(&quot;indexjumps=&quot;)\n        debugw(index_jumps)\n\n\n        if(not (compressed[idx] &amp; 128)):\n            ngram_compressed.append(compressed[idx])\n            debugw(&quot;&quot;.join([f&quot;\\\\x{byte:02x}&quot; for byte in ngram_compressed]))\n            debugw(&quot;super common ext&quot;)\n            idx += 1\n            
index_jumps.append(1)\n            ngram_byte_length += 1\n        elif((compressed[idx] &amp; 128) and (not (compressed[idx+1] &amp; 128))):\n            ngram_compressed.extend(compressed[idx:idx+2])\n            debugw(&quot;&quot;.join([f&quot;\\\\x{byte:02x}&quot; for byte in ngram_compressed]))\n            debugw(&quot;common ext&quot;)\n            idx += 2\n            index_jumps.append(2)\n            ngram_byte_length += 2\n        elif((compressed[idx] &amp; 128) and (compressed[idx+1] &amp; 128) and (not compressed[idx+2] &amp; 128)):\n            ngram_compressed.extend(compressed[idx:idx+3]) \n            debugw(&quot;&quot;.join([f&quot;\\\\x{byte:02x}&quot; for byte in ngram_compressed]))\n            debugw(&quot;rare ext&quot;)\n            idx += 3  \n            index_jumps.append(3)\n            ngram_byte_length += 3     \n        elif((compressed[idx] == 255) and (compressed[idx+1] == 255) and (compressed[idx+2] == 255)):\n            # TODO : take into account 4 escape sequences instead of only one.\n            #reset ngram_compressed\n            char = compressed[idx+3]\n            debugw(&quot;unknown token sequence detected&quot;)\n            #print(char)\n            #str = &quot;&quot;\n            idxchar = 0\n            while(char != 0):\n                   idxchar += 1\n                   char = compressed[idx+3+idxchar]\n                   debugw(&quot;char=&quot;)\n                   debugw(char)\n            debugw(&quot;end of unknown token sequence detected at idx:&quot;)\n            idx += (3 + idxchar)\n            debugw(idx)\n            index_jumps.append(3 + idxchar)\n            ngram_length -= 1\n            reset_ngram = True\n         \n        elif((compressed[idx] &amp; 128) and (compressed[idx+1] &amp; 128) and (compressed[idx+2] &amp; 128)):\n            # Session DIC space, breaks ngram construction.\n            debugw(&quot;session DIC space, we break ngram construction&quot;)\n            idx += 3\n     
       index_jumps.append(3)\n            ngram_length -= 1\n            reset_ngram = True\n    \n\n        ngram_length += 1\n        debugw(&quot;indexjumps=&quot;)\n        debugw(index_jumps)\n        debugw(&quot;ngram_length&quot;)\n        debugw(ngram_length)\n\n        if (((ngram_length == 3) and (ngram_byte_length &gt; 3)) or (ngram_length == 4)):\n            # if there are contractions, apparent ngram length will be one token less and potentially present in N4 ngrams\n            # try to replace the ngram if it exists, and only if ngram_byte_length is &gt; 3, otherwise there will be no compression gain.\n            # save index jumps for rewind operations.\n            # TO BE CONTINUED .....\n            try: \n                \n                ngram_compressed_no_ascii = &quot;&quot;.join([f&quot;\\\\x{byte:02x}&quot; for byte in ngram_compressed])\n                ngram_compressed_no_ascii = ngram_compressed_no_ascii.replace(&quot;\\\\&quot;,&quot;&quot;)\n                debugw(ngram_compressed_no_ascii)\n                code = ngram_dict[ngram_compressed_no_ascii]\n                debugw(&quot;****FOUND*****&quot;)\n                ratio = ngram_byte_length\/3 # all ngrams are encoded in a 3 byte address space, hence div by 3\n                removebytes = ngram_byte_length\n                if(idxchar):\n                    insertpos = idx - ngram_byte_length - (3 + idxchar)\n                else:\n                    insertpos = idx - ngram_byte_length                \n                candidates.append((code,insertpos,removebytes,ratio))\n            except:\n                #traceback.print_exc()\n                debugw(&quot;no luck 3N\/4N&quot;)\n\n            # reset all ngram data\n            ngram_length = 0\n            ngram_byte_length = 0\n            ratio = 0\n            removebytes = 0\n            ngram_compressed = bytearray()\n\n            #rewind...and retry a new ngram window from initial token index + one token shift\n    
        #BUG HERE !!\n            debugw(&quot;indexjumps=&quot;)\n            debugw(index_jumps)\n            #time.sleep(0.1)\n            debugw(&quot;lastindexjumps_except_first=&quot;)\n            debugw(index_jumps[-len(index_jumps)+1:])\n            debugw(&quot;index_before_rewind=&quot;)\n            debugw(idx)\n\n            idx -= sum(index_jumps[-len(index_jumps)+1:])\n            index_jumps = []\n            debugw(&quot;idx after rewind=&quot;)\n            debugw(idx)\n\n        elif (reset_ngram):\n            debugw(&quot;ngram reset : unknown token starts before ngram_length 3 or 4&quot;)\n            ngram_length = 0\n            ngram_byte_length = 0\n            ratio = 0\n            removebytes = 0\n            #do not rewind : reset pos after unknown sequence\n            index_jumps = []\n\n    return candidates        \n\n\ndef process_candidates_v2(candidates):\n\n    #here we scan all candidates.\n    #if there are overlaps, we select the candidate with the best ratio, if any.\n    #The result is a reduced list of candidates data.\n\n    #Next we recreate the compressed stream and replace the bytes at insertpos by the candidate code\n    debugw(candidates)\n    candidates_reduced = []\n    idx_reduced = 0\n    idx = 0\n    deleted_candidates_number = 0\n\n    mutual_overlaps = []\n    overlap_idx = 0\n\n    while(idx &lt; len(candidates)):\n        \n        code = candidates[idx][0]\n        insertpos = candidates[idx][1]\n        removebytes = candidates[idx][2]\n        ratio = candidates[idx][3]\n\n        first_overlap = True\n        \n        for idx_lookahead in range(idx+1,len(candidates)):\n            \n            code_lookahead = candidates[idx_lookahead][0]\n            insertpos_lookahead = candidates[idx_lookahead][1]\n            removebytes_lookahead = candidates[idx_lookahead][2]\n            ratio_lookahead = candidates[idx_lookahead][3]\n\n            if((insertpos + removebytes - 1) &gt;= insertpos_lookahead):\n 
               \n                debugw(&quot;overlap!&quot;)\n                debugw(code)\n                debugw(code_lookahead)\n                \n                #add mutually overlapping indexes to an array\n                if(first_overlap):\n                    mutual_overlaps.append([idx])\n                    mutual_overlaps[overlap_idx].append(idx_lookahead)\n                    first_overlap = False\n\n                else:\n                    # case for a mutual overlap of at least 3 ngrams\n                    debugw(&quot;len mutual overlap:&quot;)\n                    debugw(len(mutual_overlaps))\n                    debugw(&quot;overlap_idx&quot;)\n                    debugw(overlap_idx)\n                    mutual_overlaps[overlap_idx].append(idx_lookahead)\n                 \n                    overlap_idx += 1\n                \n            else:\n                #end of mutual overlap (current lookahead is not overlapping with original idx)\n                break\n        idx += 1        \n    #keep best ratio from all overlap lists\n    keep_idxs = []\n    remove_idx_shift = 0\n        \n    for overlap in mutual_overlaps:\n\n        prev_candidate_ratio = 0\n        \n        for candidate_idx in overlap:\n\n            debugw(&quot;candidate_idx:&quot;)\n            debugw(candidate_idx)\n            candidate_ratio = candidates[candidate_idx - remove_idx_shift][3]\n            if (candidate_ratio &gt;= prev_candidate_ratio):\n                keep_idx = candidate_idx\n                prev_candidate_ratio = candidate_ratio\n\n        keep_idxs.append(keep_idx)\n\n        \n\n        for candidate_idx in overlap:\n            if(candidate_idx != keep_idx):\n                debugw(&quot;candidate len:&quot;)\n                debugw(len(candidates))\n                \n                debugw(&quot;will delete idx:&quot;)\n                debugw(str(candidate_idx - remove_idx_shift))\n                \n                del 
candidates[candidate_idx - remove_idx_shift]\n                deleted_candidates_number += 1\n                debugw(&quot;deleted idx:&quot;)\n                debugw(str(candidate_idx - remove_idx_shift))\n                remove_idx_shift += 1\n                #keep the best ratio only from the list of mutual overlaps\n\n    if (deleted_candidates_number &gt; 0):\n        debugw(&quot;recursive&quot;)\n        deleted_candidates_number = 0\n        process_candidates_v2(candidates)\n\n    #need to exit recursion when len candidates stops decreasing\n\n    return candidates\n\ndef ngram_insert_reserved_bits(ngram_compressed):\n            \n    debugw(&quot;&quot;.join([f&quot;\\\\x{byte:02x}&quot; for byte in ngram_compressed]))\n\n    #convert to bitarray\n    c = bitarray(endian='little')\n    c.frombytes(ngram_compressed)\n    debugw(c)\n    \n    # set msb of first byte to 1 and shift the bits above up.\n    c.insert(7,1)\n    debugw(c)\n\n    # set msb of second byte to 1 and shift the bits above up.\n    c.insert(15,1)\n    debugw(c)\n\n    # remove two excess bits that arose from our shifts\n    del c[24:26:1]\n    debugw(c)\n    \n    # replace the original ngram_compressed bytearray with our tweaked bytes\n    ngram_compressed = bytearray()\n    ngram_compressed.append((c.tobytes())[0])\n    ngram_compressed.append((c.tobytes())[1])\n    ngram_compressed.append((c.tobytes())[2])\n\n    return ngram_compressed\n                \n\ndef replace_candidates_in_processed(candidates,processed):\n\n    byteshift = 0\n    shiftcode = 0\n    for candidate in candidates:\n            insertpos = candidate[1] - byteshift\n            removebytes = candidate[2]\n            del processed[insertpos:insertpos + removebytes]\n            byteshift += removebytes\n            ## first we need to convert candidate code to proper 3 byte format\n            # we add our 4 ngram code space at a 2^20 shift in the 3 bytes address space. 
\n            shifted_code = 524416 + candidate[0]\n            # now we convert our shifted ngram code to a byte sequence in the compressed format\n            bytes_shiftedcode = shifted_code.to_bytes(3, byteorder='little')\n            # print it\n            debugw(bytes_shiftedcode)\n            # tweak the bytes to insert reserved bits for 1\/2\/3 bytes variable length encoding\n            # compliance.\n            bytes_shiftedcode = ngram_insert_reserved_bits(bytes_shiftedcode)\n            # print it\n            debugw(bytes_shiftedcode)\n            # now we insert it at the position of the non-compressed ngram\n            processed[insertpos:insertpos] = bytes_shiftedcode\n            # we added 3 bytes, we have to compensate to keep future insertpos valid.\n            byteshift -= 3\n\n    return processed\n\n\ndef ngram_process_rules(subtokens):\n\n    ### VARIOUS DETOKENIZER CLEANUP\/FORMATTING OPERATIONS\n    processed_ngram_string = &quot;&quot;\n    capitalize = False\n    token_idx = 0\n    for token in subtokens:\n\n        if(capitalize):\n            token = token.capitalize()\n            capitalize = False\n\n        # English syntactic rules : remove whitespace left of &quot;!?.&quot; \n        # and enforce capitalization on first non whitespace character following.\n        if (re.match(&quot;[!\\?\\.]&quot;,token)):\n            processed_ngram_string += token\n            capitalize = True\n\n        # English syntactic rules : remove whitespace left of &quot;,;:&quot; \n        elif (re.match(&quot;[,;:]&quot;,token)):         \n            processed_ngram_string += token\n            capitalize = False\n\n        # append whitespace left of added token\n        else:\n            processed_ngram_string = processed_ngram_string + &quot; &quot; + token\n\n        token_idx += 1\n        \n        if(len(subtokens) == token_idx):\n            debugw(&quot;last token of ngram&quot;)\n            processed_ngram_string += &quot; 
&quot;\n\n    return processed_ngram_string\n\ndef decompress_ngram_bytes(compressed):\n\n    idx = 0\n    detokenizer_ngram = []\n    \n    while(idx &lt; len(compressed)):\n    \n        if(not (compressed[idx] &amp; 128)):\n            \n            # current index byte msb is at 0, \n            # it is one of the 128 first tokens in the dictionary.\n            debugw(&quot;super common word&quot;)\n            #decode in place\n            \n            inta = compressed[idx]        \n            detokenizer_ngram.append(engdict[inta])\n            idx += 1\n\n        elif((compressed[idx] &amp; 128) and (not (compressed[idx+1] &amp; 128))):\n\n            # current index byte msb is at 1, and next byte msb is at 0. \n            # it is one of the 16384 next tokens in the dictionary.\n            debugw(&quot;common word&quot;)\n\n            # populate bitarray from the two bytes\n            c = bitarray(endian='little')\n            c.frombytes(compressed[idx:idx+2])\n            debugw(c)\n\n            # remove first byte msb (shift down the bits above)\n            del c[7]\n            debugw(c)\n\n            # convert bytes array to 16 bit unsigned integer\n            inta = (struct.unpack(&quot;&lt;H&quot;, c.tobytes()))[0]\n            # add offset back so we get a valid dictionary key\n            inta += 128\n\n            # print word\n            detokenizer_ngram.append(engdict[inta])\n            # increment byte counter with step 2, we processed 2 bytes.\n            idx += 2\n\n        #elif((compressed[idx] &amp; 128) and (compressed[idx+1] &amp; 128)):\n        elif((compressed[idx] &amp; 128) and (compressed[idx+1] &amp; 128) and (not compressed[idx+2] &amp; 128)):\n            \n            # current index byte msb is at 1, and next byte mbs is at 1. 
\n            # it is one of the 2097152 next tokens in the dictionary.\n            debugw(&quot;rare word&quot;)\n            \n            chunk = compressed[idx:idx+3]\n\n            # populate bitarray from the three bytes\n            c = bitarray(endian='little')\n            #c.frombytes(compressed[idx:idx+3])\n            c.frombytes(chunk)\n            \n            debugw(c)\n\n            # remove second byte msb (shift down the bits above)\n            del c[15]\n            debugw(c)\n\n            # remove first byte msb (shift down the bits above)\n            del c[7]\n            debugw(c)\n\n            c.extend(&quot;0000000000&quot;) \n            # pad to 4 bytes (32 bit integer format) : 22 remaining bits + 10 padding bits \n            # because we previously removed two bits with del c[15] and del c[7]\n            debugw(c)\n\n            # convert bytes array to 32 bit unsigned integer\n            inta = (struct.unpack(&quot;&lt;L&quot;, c.tobytes()))[0]\n\n            inta += (16384 + 128)\n\n            detokenizer_ngram.append(engdict[inta])\n\n            # increment byte counter with step 3, we processed 3 bytes.\n            idx += 3\n\n    return detokenizer_ngram\n\n\n###INLINE START###\n\n#downloading tokenizer model if missing\nnltk.download('punkt')\n\n#opening the english dict of most used 1\/3 million words from google corpus of 1 trillion words.\n#special characters have been added with their respective prevalence (from wikipedia corpus)\n#contractions also have been added in their form with a quote just after (next line) the form \n# without quote. 
ex : next line after &quot;dont&quot; appears &quot;don't&quot;\n\nfile1 = open('count_1w.txt', 'r')\nLines = file1.readlines()\n\n#initializing Python dicts\ncount = 1\nengdict = {}\nengdictrev = {}\n\n\n# special case : byte val 0 is equal to new line.\n# TODO : make sure that windows CRLF is taken care of.\nengdict[0] = &quot;\\n&quot;\nengdictrev[&quot;\\n&quot;] = 0\n\n# populating dicts\nfor line in Lines:\n    # Strips the newline character\n    engdict[count] = line.strip()\n    engdictrev[line.strip()] = count\n    count += 1\n\n### populating ngram dict\n\nfilengrams = open('outngrams.bin', 'rt')\nngramlines = filengrams.readlines()\n\nngram_dict = {}\nngram_dict_rev = {}\n\n\ncount = 0\n# populating dicts\nfor ngramline in ngramlines:\n# Strips the newline character\n    #keystr = &quot;&quot;.join([f&quot;\\\\x{byte:02x}&quot; for byte in ngramline.strip()])\n    #keystr = keystr.replace(&quot;\\\\&quot;,&quot;&quot;)\n    #if(count == 71374):\n    keystr = ngramline.strip()\n    #print(ngramline.strip())\n    #print(keystr)\n    #quit()\n    ngram_dict_rev[count] = keystr\n    ngram_dict[keystr] = count\n    count += 1\n\nidx = 0\ndebugw(&quot;first ngram in dict:&quot;)\ntest = ngram_dict_rev[0]\ndebugw(test)\ndebugw(ngram_dict[test])\ncount = 0\n\n\nif (compress):\n\n    tokens = []\n    # check if file is utf-8\n    if(check_file_is_utf8(infile)):\n        with codecs.open(infile, 'r', encoding='utf-8') as utf8_file:\n            # Read the content of the UTF-8 file and transcode it to ASCII\n            # encode('ascii','ignore') MAY replace unknown char with chr(0)\n            # We don't want that, as it is a termination char for unknown strings.\n            # on the other hand backslashreplace replaces too many chars that could be transcribed\n            # the best option for now is to check for chr(0) presence before writing the unknown token representation.\n            ascii_content = utf8_file.read().encode('ascii', 
'ignore').decode('ascii')\n            #debugw(ascii_content)\n            Linesin = ascii_content.splitlines()\n            if(debug_on):\n                outfile_ascii = infile + &quot;.asc&quot;\n                with codecs.open(outfile_ascii, &quot;w&quot;, encoding='ascii') as ascii_file:\n                    ascii_file.write(ascii_content)\n    else:\n        # Reading file to be compressed\n        file2 = open(infile,'r')\n        #text = file2.read()\n        Linesin = file2.readlines()\n\n    if(gendic):\n         if(len(outfile)):\n                fh = open(outfile, 'wt')\n\n    lineidx = 0\n    for line in Linesin:\n        line = line.lower()\n\n        # First pass tokenizer (does not split adjunct special chars)\n        line_tokens = tknzr.tokenize(line)\n        #debugw(line_tokens)\n\n        if( not gendic):\n            tokens.append(line_tokens)\n        else:\n            compressed = compress_tokens(line_tokens,gendic)\n            if(len(outfile) and len(compressed)):\n                # write compressed binary stream to file if supplied in args or to stdout otherwise.\n                hexstr = &quot;&quot;.join([f&quot;\\\\x{byte:02x}&quot; for byte in compressed])\n                hexstr = hexstr.replace(&quot;\\\\&quot;,&quot;&quot;)\n                fh.write(hexstr)\n                if(debug_ngrams_dic):\n                    fh.write(&quot;\\t&quot;)\n                    strline = str(lineidx)\n                    fh.write(strline)\n                fh.write(&quot;\\n&quot;)\n            else:\n                sys.stdout.buffer.write(compressed)\n                sys.stdout.buffer.write(b&quot;\\n&quot;)\n        lineidx += 1\n    #line_tokens.append(&quot;\\n&quot;)\n    #tokens = tokens + line_tokens\n    debugw(tokens)\n    \n    if (not gendic):\n\n        compressed = compress_tokens(tokens,gendic)\n\n        if(secondpass):\n            candidates = compress_second_pass(compressed)\n            debugw(&quot;candidates:&quot;)\n        
    debugw(candidates)\n            processed_candidates = process_candidates_v2(candidates)\n            debugw(&quot;processed candidates:&quot;)\n            debugw(processed_candidates)\n            compressed = replace_candidates_in_processed(processed_candidates,compressed)\n\n\n        # write compressed binary stream to file if supplied in args or to stdout otherwise.\n        if(len(outfile)):\n            with open(outfile, 'wb') as fh:\n                fh.write(compressed)\n        else:\n            sys.stdout.buffer.write(compressed)\n\n        for sessidx in range(2113664,unknown_token_idx):\n            debugw(&quot;session_index:&quot; + str(sessidx))\n            debugw(engdict[sessidx])\n            debugw(engdictrev[engdict[sessidx]])\n            debugw(&quot;session_index:&quot; + str(sessidx))\n\n    # close the ngram dictionary output file only if it was opened above\n    # (gendic mode with an output file); otherwise fh is closed or undefined.\n    if(gendic and len(outfile)):\n        fh.close()\n\n# decompress mode\nelse:\n\n    # decoding part\n    debugw(&quot;decoding...&quot;)\n    detokenizer = []\n    detokenizer_idx = 0\n\n    if(len(infile)):\n        with open(infile, 'rb') as fh:\n            compressed = bytearray(fh.read())\n\n    idx = 0\n    #FirstCharOfLine = 1\n    CharIsUpperCase = 1\n    #CharIsUpperCase2 = 0\n    \n    # main decoding loop\n    while (idx &lt; len(compressed)):\n            \n            # write each byte\n            debugw(hex(compressed[idx]))\n\n            #if( (idx &gt; 0) and compressed[idx] == 0 and compressed[idx - 1] == 0):\n            #find len of consecutive 0 chars\n\n            if(idx &lt; len(compressed) -1):\n                if((compressed[idx] == 0) and (compressed[idx+1] != 0)):\n                    #FirstCharOfLine = 1\n                    CharIsUpperCase = 1\n                elif(CharIsUpperCase == 1):\n                    #FirstCharOfLine = 2\n                    CharIsUpperCase = 2\n                        \n            if(len(detokenizer) &gt; 0):\n\n\n                ### VARIOUS DETOKENIZER CLEANUP\/FORMATTING OPERATIONS\n\n                #ensure this is not the 
end of an ngram. ngrams necessarily contain whitespaces\n                if (not re.search(&quot; &quot;,detokenizer[detokenizer_idx-2])):\n                    # English syntactic rules : remove whitespace left of &quot;!?.&quot; \n                    # and enforce capitalization on first non whitespace character following.\n                    if (re.match(&quot;[!\\?\\.]&quot;,detokenizer[detokenizer_idx-2]) and detokenizer_idx &gt; 2):\n                        del detokenizer[detokenizer_idx-3]\n                        detokenizer_idx -= 1\n                        if(CharIsUpperCase != 1):\n                            CharIsUpperCase = 2\n\n                    # English syntactic rules : remove whitespace left of &quot;,;:&quot; \n                    if (re.match(&quot;[,;:]&quot;,detokenizer[detokenizer_idx-2]) and detokenizer_idx &gt; 2):         \n                        del detokenizer[detokenizer_idx-3]\n                        detokenizer_idx -= 1\n\n                    # URL\/URI detected, remove any spurious whitespace before &quot;\/\/&quot; \n                    if (re.match(&quot;^\\\/\\\/&quot;,detokenizer[detokenizer_idx-2]) and detokenizer_idx &gt; 2):         \n                        del detokenizer[detokenizer_idx-3]\n                        detokenizer_idx -= 1\n                    \n                    # E-mail detected, remove whitespaces left and right of &quot;@&quot;\n                    if (re.match(&quot;@&quot;,detokenizer[detokenizer_idx-2]) and detokenizer_idx &gt; 2):         \n                        del detokenizer[detokenizer_idx-3]\n                        detokenizer_idx -= 1\n                        del detokenizer[detokenizer_idx-1]\n                        detokenizer_idx -= 1\n\n            if(not (compressed[idx] &amp; 128)):\n                \n                # current index byte msb is at 0, \n                # it is one of the 128 first tokens in the dictionary.\n                debugw(&quot;super common word&quot;)\n    
            #decode in place\n                \n                inta = compressed[idx]\n                       \n                if(CharIsUpperCase == 2):\n                    detokenizer.append(engdict[inta].capitalize())\n                    detokenizer_idx += 1\n                    CharIsUpperCase = 0\n                else:    \n                    detokenizer.append(engdict[inta])\n                    detokenizer_idx += 1\n                  \n                # print to stdout\n                if(CharIsUpperCase != 1):\n                    detokenizer.append(&quot; &quot;)\n                    detokenizer_idx += 1\n\n                debugw(engdict[inta])\n                idx += 1\n\n            elif((compressed[idx] &amp; 128) and (not (compressed[idx+1] &amp; 128))):\n    \n                # current index byte msb is at 1, and next byte msb is at 0. \n                # it is one of the 16384 next tokens in the dictionary.\n                debugw(&quot;common word&quot;)\n    \n                # populate bitarray from the two bytes\n                c = bitarray(endian='little')\n                c.frombytes(compressed[idx:idx+2])\n                debugw(c)\n    \n                # remove first byte msb (shift down the bits above)\n                del c[7]\n                debugw(c)\n\n                # convert bytes array to 16 bit unsigned integer\n                inta = (struct.unpack(&quot;&lt;H&quot;, c.tobytes()))[0]\n                # add offset back so we get a valid dictionary key\n                inta += 128\n    \n                # print word\n                if(CharIsUpperCase == 2):\n                    detokenizer.append(engdict[inta].capitalize())\n                    detokenizer_idx += 1\n                    CharIsUpperCase = 0\n                else:\n                    detokenizer.append(engdict[inta])\n                    detokenizer_idx += 1   \n\n                if(CharIsUpperCase != 1):\n                    detokenizer.append(&quot; &quot;)\n 
                   detokenizer_idx += 1 \n                \n                debugw(engdict[inta])\n                # increment byte counter with step 2, we processed 2 bytes.\n                idx += 2\n    \n            #elif((compressed[idx] &amp; 128) and (compressed[idx+1] &amp; 128)):\n            elif((compressed[idx] &amp; 128) and (compressed[idx+1] &amp; 128) and (not compressed[idx+2] &amp; 128)):\n                \n                # current index byte msb is at 1, and next byte mbs is at 1. \n                # it is one of the 4194304 next tokens in the dictionary.\n                debugw(&quot;rare word&quot;)\n                \n                chunk = compressed[idx:idx+3]\n\n                # populate bitarray from the three bytes\n                c = bitarray(endian='little')\n                #c.frombytes(compressed[idx:idx+3])\n                c.frombytes(chunk)\n                \n                debugw(c)\n\n                # remove second byte msb (shift down the bits above)\n                del c[15]\n                debugw(c)\n\n                # remove first byte msb (shift down the bits above)\n                del c[7]\n                debugw(c)\n\n                c.extend(&quot;0000000000&quot;) \n                # pad to 4 bytes (32 bit integer format) : 3 bytes + 10 bits \n                # because we previously removed two bits with del c[15] and del c[7]\n                debugw(c)\n\n                # convert bytes array to 32 bit unsigned integer\n                inta = (struct.unpack(&quot;&lt;L&quot;, c.tobytes()))[0]\n\n                if (inta &gt;= 524416):\n                    # this is a ngram.\n                    # remove offset to get into ngram dic code range.\n                    inta -= 524416\n                    debugw(&quot;this is an ngram. 
code:&quot;)\n                    debugw(inta)\n                    # process ngram through ngram dictionary\n                    # replace ngram code with corresponding ngram string and add them to the tokenizer\n                    ngram_string = ngram_dict_rev[inta]\n                    debugw(&quot;ngram string:&quot;)\n                    debugw(ngram_string)\n                    subs = 0\n                    #(ngram_string,subs) = re.subn(r'x',r'\\\\x',ngram_string)\n                    (ngram_string,subs) = re.subn(r'x',r'',ngram_string)   \n                    debugw(&quot;ngram string:&quot;)\n                    debugw(ngram_string)\n                    ngram_bytes = bytes.fromhex(ngram_string)\n                    subtokens = decompress_ngram_bytes(ngram_bytes)\n                    #bytes = bytearray(ngram_string,encoding=&quot;ascii&quot;)\n                    #subtokens.insert(0,&quot;PREFIX&quot;)\n                    #subtokens.append(&quot;SUFFIX&quot;)\n                    \n                    \n                    #subtokens = nltk.word_tokenize(ngram_string)\n                    # We know there shouldn't be any new lines in the subtokens.\n                    # possessives, contractions or punctuation may occur.\n                    # we need to add capitalization rules and spaces after punctuation rules.\n                    # These should be catched by the detokenizer backward processor (detokenizer_idx -2)\n                    # The problem is we append more than one token.\n                    # So we should process rules for first subtoken insertion only.\n                    # The rest should have inline processing (here)\n\n                    if(CharIsUpperCase == 2):\n                        detokenizer.append(subtokens[0].capitalize())\n                        detokenizer_idx += 1\n                        CharIsUpperCase = 0\n                    else:\n                        detokenizer.append(subtokens[0])\n                        
detokenizer_idx += 1 \n                    #if(CharIsUpperCase != 1):\n                    #    detokenizer.append(&quot; &quot;) \n                    #    detokenizer_idx += 1\n\n                    ngram_processed_string = ngram_process_rules(subtokens[1:])\n                    # We shoud take care that the backward detokenizer processor does not mingle\n                    # with the the rest of the ngram string.\n                    # Such a special token will be the only one to have whitespaces in it\n                    # So we can detect it this way\n                    detokenizer.append(ngram_processed_string)\n                    detokenizer_idx += 1\n                                        \n\n                else:\n                    inta += (16384 + 128)\n\n                    if(CharIsUpperCase == 2):\n                        detokenizer.append(engdict[inta].capitalize())\n                        detokenizer_idx += 1\n                        CharIsUpperCase = 0\n                    else:\n                        detokenizer.append(engdict[inta])\n                        detokenizer_idx += 1 \n                    if(CharIsUpperCase != 1):\n                        detokenizer.append(&quot; &quot;) \n                        detokenizer_idx += 1\n                    \n                    debugw(engdict[inta])\n                    # increment byte counter with step 3, we processed 3 bytes.\n                idx += 3\n\n            #elif((compressed[idx] == 255) and (compressed[idx+1] == 255) and (compressed[idx+2] == 255)):   \n            elif((compressed[idx] &amp; 128) and (compressed[idx+1] &amp; 128) and (compressed[idx+2] &amp; 128)):\n            \n                #check if Huffmann first\n\n                chunk = compressed[idx:idx+3]\n\n                # populate bitarray from the three bytes\n                c = bitarray(endian='little')\n                #c.frombytes(compressed[idx:idx+3])\n                c.frombytes(chunk)\n                \n 
               debugw(c)\n\n                # remove third byte msb (shift down the bits above)\n                del c[23]\n                debugw(c)\n\n                # remove second byte msb (shift down the bits above)\n                del c[15]\n                debugw(c)\n\n                # remove first byte msb (shift down the bits above)\n                del c[7]\n                debugw(c)\n\n                c.extend(&quot;00000000000&quot;) \n                # pad to 4 bytes (32 bit integer format) : 3 bytes + 8 bits + 3 bits \n                # because we previously removed three bits with del c[23], del c[15] and del c[7]\n                debugw(c)\n\n                # convert bytes array to 32 bit unsigned integer\n                inta = (struct.unpack(&quot;&lt;L&quot;, c.tobytes()))[0]\n                inta -= 2097151\n                # if it is a Huffmann select tree code it will be 0 to 4 included\n                # if it is a session DIC it will be shifted in the negatives.\n\n                if (inta in range(0,5)):        \n\n                    # unknown word\n                    # end check if Huffmann first\n                    debugw(&quot;unknown word escape sequence detected, code: &quot; + str(inta))\n                    #unknown word escape sequence detected.\n                    if(inta == 0):\n                        char = compressed[idx+3]\n                        stra = &quot;&quot;\n                        idxchar = 0\n                        while(char != 0):\n                            debugw(&quot;char=&quot;)\n                            debugw(char)\n                            stra += chr(char)\n                            debugw(&quot;printing string state=&quot;)\n                            debugw(stra)\n                            idxchar += 1\n                            char = compressed[idx+3 + idxchar]\n                        debugw(&quot;termination char detected=&quot;)\n                        debugw(char)\n        
            else:\n                        bstr = bytearray()\n                        idxchar = 0\n                        while(char != 0):\n                            bstr.append(char)\n                            idxchar += 1\n                            char = compressed[idx+3 + idxchar]\n                        debugw(&quot;huffmann : termination char detected=&quot;)\n                        debugw(char)\n                        stra = decode_unknown(bstr,inta)\n                        #stra = codec.decode(bstr)    \n                    \n                    debugw(&quot;we append that unknown word in our session dic at idx: &quot; + str(unknown_token_idx) + &quot; since it may be recalled&quot;)\n                    engdictrev[stra] = unknown_token_idx\n                    engdict[unknown_token_idx] = stra\n                    unknown_token_idx += 1\n                    \n                        \n                    if(CharIsUpperCase == 2):\n                        detokenizer.append(stra.capitalize())\n                        detokenizer_idx += 1\n                        CharIsUpperCase = 0\n                    else:\n                        detokenizer.append(stra)\n                        detokenizer_idx += 1 \n                    if(CharIsUpperCase != 1):\n                        detokenizer.append(&quot; &quot;) \n                        detokenizer_idx += 1\n    \n                else:\n\n                    inta += 2097151\n                    # it is a session DIC, shifting back to 0.\n                    inta += (2097152 + 16384 + 128)\n                    # it is a session DIC, shifting back session dic address space.\n\n                    debugw(&quot;recalled word:&quot;)\n                    \n                    try:\n                        debugw(engdict[inta])\n                        # print word\n                    \n                        if(CharIsUpperCase == 2):\n                            
detokenizer.append(engdict[inta].capitalize())\n                            detokenizer_idx += 1\n                            CharIsUpperCase = 0\n                        else:\n                            detokenizer.append(engdict[inta])\n                            detokenizer_idx += 1   \n\n                        if(CharIsUpperCase != 1):\n                            detokenizer.append(&quot; &quot;)\n                            detokenizer_idx += 1 \n                    \n                    except:\n                        debugw(&quot;something went wrong, could not find word in session DIC&quot;)\n\n                        for sessidx in range(2113664,unknown_token_idx):\n                            debugw(&quot;session_index:&quot; + str(sessidx))\n                            debugw(engdict[sessidx])\n                            debugw(engdictrev[engdict[sessidx]])\n                            debugw(&quot;session_index:&quot; + str(sessidx))\n\n\n                idx += 3 + idxchar\n\n    debugw(detokenizer)\n    if not(len(outfile)):\n        print(''.join(detokenizer))\n    else:\n        # write clear text to file if supplied in args\n        with open(outfile, 'w') as fh:\n            fh.write(''.join(detokenizer))\n    \n\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre 
class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #616E88\"># LIGHTWEIGHT ENGLISH TEXT STREAM COMPRESSION (LETSC)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># (adaptive encoding length 1byte\/2byte\/3byte based on word dictionary with statistical prevalence ordering - count1_w.txt)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Huffmann encoding for unknown tokens<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Enforces English syntax rules for punctuation<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Takes into account possessives and contractions<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Has URLs and e-mails processing rules, more to follow<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Second pass compression using a dictionary of the most frequent 4 N-Grams of English fiction.<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#GPL 3 License<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># www.skynext.tech<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Rodrigo Verissimo<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># v0.92<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># October 21st, 2023<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Python + packages Requirements<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Python 3.9<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># nltk, bitarray, bitstring, re, dahuffman<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Performance : ratios between x2.6 
for Middle to Modern and elaborate English (ex: Shakespeare)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Up to x3 and more for simple english.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># adapted for text messaging \/ streaming<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Requires the same dictionary on both channel endpoints.<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># ALGORITHM. Very straightforward. (adaptive encoding length based on dictionary with statistical ordering)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#################################################################################<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># First byte :<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#if MSB is 0, a whole word is encoded on the first 7 bits of one byte only.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#This makes 127 possible words. These are reserved for the first 127 most used <\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#english words. punctuation also appears as a possible word<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Second byte :<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#if MSB of first byte is 1, and MSB of second byte is 0, a whole word is encoded<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># with the help of the 7 LSB of byte 1 plus the 7 LSB of byte 2. 
<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># This makes room for the next 16384 most used english words.<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Third byte :<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># if MSB of first byte is 1 and MSB of second byte is 1, and the MSB of third byte is 0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># a whole word is encoded<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># with the help of the 7 + 7 + 7 = 21 bits (2 097 152 possible words)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># For now, the 3 byte address space is split into two 2 097 152 address spaces<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># That is, the case of all 3 bytes MSB being 1 is treated separately.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># In this address space, only a handful of codes are used as an escape sequence for particular <\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Huffmann trees, see below.<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#-&gt;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#load dictionary of english words from most used to least used.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#punctuation and special characters have been added with order of prevalence.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#punctuation frequency is from wikipedia corpus. 
(around 1.3 billion words) <\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#it has been normalized to the frequency of the 1\/3 million word list based <\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#on the Google Web Trillion Word Corpus. that is, frequencies for special chars have been multiplied by 788.39<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#wikipedia punctuation is not optimized for chat, as it has a lower prevalence of chars like question marks<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#that may appear more frequently in chat situations.<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># the first tokenizer used does not separate any special character attached (without whitespace) to a word<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># this will mostly result in an unknown word in the dictionary<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># this key absence in the reverse dict will be caught and treated by another tokenizer (mainly for possessive<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># forms and contractions)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#for possessives ex: &quot;dog&#39;s bone&quot; or plural &quot;dogs&#39; bones&quot; a separate tokenizer is used to split into<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># &quot;dog&quot; , &quot;&#39;s&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># &quot;&#39;s&quot; and &quot;&#39;&quot; also appear in the dictionary.<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># ROADMAP<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># remove whitespaces left of punctuation DONE<\/span><\/span>\n<span 
class=\"line\"><span style=\"color: #616E88\"># manage new lines DONE<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># manage websites and emails DONE<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># <\/span><span style=\"color: #81A1C1\">TODO<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># add spell check ! <\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># <\/span><span style=\"color: #81A1C1\">TODO<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Remove spurious new lines that appear after encoding special sequences such mails or URLS<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># DONE (basic Huffmann, some chars missing in tree)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># add Huffmann encoding for absent words in dictionary (neologisms,colloqualisms,dialects, or misspellings) DONE<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># DONE<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># <\/span><span style=\"color: #81A1C1\">TODO<\/span><span style=\"color: #616E88\"> : test with more texts such as wikipedia XML and various authors works, to catch as much<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># use cases and formatting issues that arise to improve the algorithm<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># add adaptive Huffmann. use 4 Huffmann trees. 
(see below)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Assuming there are 4 codes for Huffmann : Huffmann lower case, Huffmann lower + capitals, Huffmann<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># lower + capitals + numeric, all printable ASCII excluding whitespace : same as preceding category plus <\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># special chars.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Choosing the tree to use would be done by string regex.<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#DONE<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Detect UTF-8 and transcode to ASCII (potentially lossy)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#DONE<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># <\/span><span style=\"color: #81A1C1\">TODO<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Dictionary Learn over time (re-shuffle the order of tokens)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Without transmission of any info between parties<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Dangerous if sync is lost between the two parties<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># <\/span><span style=\"color: #81A1C1\">TODO<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># <\/span><span style=\"color: #81A1C1\">TODO<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># optimize Huffmann part to remove the need for the chr(0) termination = scan for EOF sequence in Huffmann to get<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># the Huffmann byte sequence length. 
<\/span><span style=\"color: #81A1C1\">TODO<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># DONE<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Add second pass compression using word N-grams lookup table. (4 and 5 N-grams seem to be a good compromize)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># The idea is to encode 4 and 5 token substrings in a line by a single 3 byte code.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># There is plenty of room left in the 3 byte address space. For now, there is 333 333 - 16384 - 128 tokens used = 316821 tokens used<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># from 4194304 - 3 total address space.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># DONE using 1 571 125 codes for a 50\/50 mix of 4grams and 5grams.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># There is still at least 2million codes left.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#  for now we plan 4 escape sequences for the selection of one of the 4 Huffmann trees.<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># ngrams processing is first done with the create_ngrams_dic.sh script.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&quot;&quot;&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #A3BE8C\">python3 ngrams_format_dic.py 4grams_english-fiction.csv outngrams4.txt #remove counts and process contractions<\/span><\/span>\n<span class=\"line\"><span style=\"color: #A3BE8C\">python3 ngrams_format_dic.py 5grams_english-fiction.csv outngrams5.txt #remove counts and process contractions<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #A3BE8C\">python3 
dicstrv4.py -d outngrams4.txt outngrams4.bin.dup #convert ngrams txt to compressed form<\/span><\/span>\n<span class=\"line\"><span style=\"color: #A3BE8C\">python3 dicstrv4.py -d outngrams5.txt outngrams5.bin.dup #convert ngrams txt to compressed form<\/span><\/span>\n<span class=\"line\"><span style=\"color: #A3BE8C\">awk &#39;!seen[$0]++&#39; outngrams4.bin.dup &gt; outngrams4.bin #Remove spurious duplicates that may arise<\/span><\/span>\n<span class=\"line\"><span style=\"color: #A3BE8C\">awk &#39;!seen[$0]++&#39; outngrams5.bin.dup &gt; outngrams5.bin #Remove spurious duplicates that may arise<\/span><\/span>\n<span class=\"line\"><span style=\"color: #A3BE8C\">sed -i &#39;786001,$ d&#39; outngrams4.bin # truncate to fit target address space<\/span><\/span>\n<span class=\"line\"><span style=\"color: #A3BE8C\">sed -i &#39;786001,$ d&#39; outngrams5.bin # truncate to fit target address space<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #A3BE8C\">cat outngrams4.bin outngrams5.bin &gt; outngrams.bin # concatenate. this is our final form<\/span><\/span>\n<span class=\"line\"><span style=\"color: #A3BE8C\">cat outngrams.bin | awk &#39;{ print length, bash $0 }&#39; | sort -n -s | cut -d&quot; &quot; -f2- &gt; sorted.txt # sort by size to have an idea of distribution<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #A3BE8C\"># ngrams that encode as less than 4 bytes have been pruned since the ratio is 1<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&quot;&quot;&quot;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># DONE <\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># It is probable that the most used 4 tokens N-grams are based on already frequent words. 
that individually<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># encode as 1 byte or two bytes.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Worst case : all the 4 tokens are encoded in the 1 to 128 address space, so they take a total of 4 bytes.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># The resulting code will be 3 bytes, a deflate percent of 25%<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># If one of the tokens is 2 byte (128 to 16384 -1 address space), then it uses 5 bytes.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># deflate percent is 40%<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># The unknown is the statistical prevalence of two million 4 token N-grams.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># (ex: coming from english fiction corpus) in a standard chat text.<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># First encode the google most frequent 4 and 5 N-grams csv file to replace the tokens in each N-gram by the corresponding <\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># byte sequences from our codes in the count_1w.txt dictionary. This will be another pre-process script.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># The resulting new csv format will be :<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># some 3 byte index = x04x09x23.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># The 3 byte index is simply the line number of the compressed ngram. <\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># read that in ram. 
Conservative estimate: 4 bytes + 3 bytes = 7 bytes per entry; 7 bytes * 2 000 000 = 14 Meg memory footprint.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># We already have a 4 MB * 3 = 12 Meg footprint from count_1w (estimate)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Generate the inverse map dictionary (mapping sequences to 3 byte indexes)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># x04x09x23&#39; = some 3 byte index<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Should not be a problem since there is a 1 to 1 relationship between the two<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Then perform a first pass compression.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Then scan the first pass compression file using a 4 token sliding window.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Contractions are a case that will have to be managed.<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># If there are overlapping matches, choose the match that results in the best deflation, if any.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># If the unknown word escape code appears, stop processing and resume after the escaped word<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Overall, replace the byte sequence by the corresponding 3 byte sequence.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># DONE<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">import<\/span><span style=\"color: #D8DEE9FF\"> sys<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">import<\/span><span 
style=\"color: #D8DEE9FF\"> traceback<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#print(len(sys.argv))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#op = (sys.argv[1]).encode(&quot;ascii&quot;).decode(&quot;ascii&quot;)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#print(op)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#quit()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">sys<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">argv<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&lt;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">or<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">sys<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">argv<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&gt;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">4<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">print<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Syntax for compression 
:<\/span><span style=\"color: #EBCB8B\">\\n<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">print<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">python3 dicstrv.py -c &lt;txt_inputfile&gt; &lt;compressed_outputfile&gt;<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">print<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Reads txt_inputfile and writes compressed text stream to compressed_outputfile.<\/span><span style=\"color: #EBCB8B\">\\n<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">print<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">python3 dicstrv.py -c &lt;txt_inputfile&gt;<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">print<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Reads txt_inputfile and writes compressed output to stdout<\/span><span style=\"color: #EBCB8B\">\\n<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: 
#D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">print<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Syntax for decompression :<\/span><span style=\"color: #EBCB8B\">\\n<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">print<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">python3 dicstrv.py -x &lt;compressed_inputfile&gt; &lt;txt_outputfile&gt;<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">print<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Reads compressed_inputfile and writes cleartext to txt_outputfile.<\/span><span style=\"color: #EBCB8B\">\\n<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">print<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">python3 dicstrv.py -x &lt;compressed_inputfile&gt;<\/span><span style=\"color: #EBCB8B\">\\n<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">print<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Reads compressed_input file and writes 
cleartext output to stdout<\/span><span style=\"color: #EBCB8B\">\\n<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">print<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">NOTE: dictionary file count1_w.txt must be in the same directory as the script.<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">quit<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">sys<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">argv<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">-c<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    compress <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">True<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    gendic <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">False<\/span><\/span>\n<span class=\"line\"><span 
style=\"color: #81A1C1\">elif<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">sys<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">argv<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">-d<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    compress <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">True<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    gendic <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">True<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">elif<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">sys<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">argv<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">-x<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    compress <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: 
#81A1C1\">False<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    gendic <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">False<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">print<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">unknown operation: <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">str<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">sys<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">argv<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">])<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\"> type &#39;python3 dicstrv.py&#39; for help<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">quit<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">sys<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">argv<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> 
<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    infile <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> sys<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">argv<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    outfile <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&#39;&#39;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">sys<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">argv<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">4<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    infile <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> sys<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">argv<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    outfile <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> sys<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">argv<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: 
#B48EAD\">3<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">import<\/span><span style=\"color: #D8DEE9FF\"> codecs<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">import<\/span><span style=\"color: #D8DEE9FF\"> nltk<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">from<\/span><span style=\"color: #D8DEE9FF\"> nltk<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">tokenize <\/span><span style=\"color: #81A1C1\">import<\/span><span style=\"color: #D8DEE9FF\"> TweetTokenizer<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">tknzr <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">TweetTokenizer<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">import<\/span><span style=\"color: #D8DEE9FF\"> re<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">import<\/span><span style=\"color: #D8DEE9FF\"> bitstring<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">from<\/span><span style=\"color: #D8DEE9FF\"> bitarray <\/span><span style=\"color: #81A1C1\">import<\/span><span style=\"color: #D8DEE9FF\"> bitarray<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">import<\/span><span style=\"color: #D8DEE9FF\"> struct<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">import<\/span><span style=\"color: #D8DEE9FF\"> time<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">from<\/span><span style=\"color: #D8DEE9FF\"> dahuffman <\/span><span style=\"color: #81A1C1\">import<\/span><span style=\"color: #D8DEE9FF\"> HuffmanCodec<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: 
#D8DEE9FF\">debug_on <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">False<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">debug_ngrams_dic <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">False<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">secondpass <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">True<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">use_huffmann <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">False<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">unknown_token_idx <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">16384<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2097152<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">def<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">strdebug<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">debug_on<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span 
class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">print<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">strdebug<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Huffmann is only used for absent words in count1_w.txt dictionary<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># General lower and upper case frequency combined as lowercase<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">codec_lower <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> HuffmanCodec<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">from_frequencies<\/span><span style=\"color: #ECEFF4\">(<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">{<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">e<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">   <\/span><span style=\"color: #B48EAD\">56.88<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">m<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">15.36<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">a<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: 
#D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">43.31<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">h<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">15.31<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">r<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">38.64<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">g<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">12.59<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">i<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">38.45<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">b<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">10.56<\/span><span style=\"color: 
#ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">o<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">36.51<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">f<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">9.24<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">t<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">35.43<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">y<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">9.06<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">n<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">33.92<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: 
#A3BE8C\">w<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">6.57<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">s<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">29.23<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">k<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">5.61<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">l<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">27.98<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">v<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">5.13<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">c<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: 
#ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">23.13<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">x<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">1.48<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">u<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">18.51<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">z<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">1.39<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">d<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">17.25<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">j<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">1<\/span><span 
style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">p<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">16.14<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">q<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">\t<\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">}<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">codec_lower<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">get_code_table<\/span><span style=\"color: #ECEFF4\">())<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># following is ASCII mixed upper and lower case frequency from an English writer from Palm OS PDA memos in 2002<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Credit : http:\/\/fitaly.com\/board\/domper3\/posts\/136.html<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">codec_upperlower <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> HuffmanCodec<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">from_frequencies<\/span><span style=\"color: #ECEFF4\">(<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: 
#ECEFF4\">{<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">A<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.3132<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">B<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2163<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">C<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.3906<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">D<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.3151<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">E<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2673<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">F<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span 
style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1416<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">G<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1876<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">H<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2321<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">I<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.3211<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">J<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1726<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">K<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: 
#B48EAD\">0.0687<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">L<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1884<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">M<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.3529<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">N<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2085<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">O<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1842<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">P<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2614<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: 
#A3BE8C\">Q<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0316<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">R<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2519<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">S<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.4003<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">T<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.3322<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">U<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0814<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">V<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span 
style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0892<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">W<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2527<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">X<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0343<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">Y<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0304<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">Z<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0076<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">a<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">5.1880<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span 
style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">b<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1.0195<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">c<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2.1129<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">d<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2.5071<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">e<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">8.5771<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">f<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1.3725<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">g<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> 
<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1.5597<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">h<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2.7444<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">i<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">4.9019<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">j<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0867<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">k<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.6753<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">l<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">3.1750<\/span><span style=\"color: 
#ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">m<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1.6437<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">n<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">4.9701<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">o<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">5.7701<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">p<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1.5482<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">q<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0747<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">r<\/span><span style=\"color: 
#ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">4.2586<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">s<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">4.3686<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">t<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">6.3700<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">u<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2.0999<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">v<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.8462<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">w<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span 
style=\"color: #B48EAD\">1.3034<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">x<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1950<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">y<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1.1330<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">z<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0596<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">})<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">codec_upperlower<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">get_code_table<\/span><span style=\"color: #ECEFF4\">())<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># The following are ASCII alphanumeric character frequencies, measured on the Palm OS PDA memos of an English writer in 2002<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Credit : http:\/\/fitaly.com\/board\/domper3\/posts\/136.html<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">codec_alphanumeric 
<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> HuffmanCodec<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">from_frequencies<\/span><span style=\"color: #ECEFF4\">(<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">{<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">0<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.5516<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">1<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.4594<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">2<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.3322<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">3<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1847<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">4<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span 
style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1348<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">5<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1663<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">6<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1153<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">7<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1030<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">8<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1054<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">9<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1024<\/span><span style=\"color: 
#ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">A<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.3132<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">B<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2163<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">C<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.3906<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">D<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.3151<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">E<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2673<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">F<\/span><span style=\"color: 
#ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1416<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">G<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1876<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">H<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2321<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">I<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.3211<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">J<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1726<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">K<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span 
style=\"color: #B48EAD\">0.0687<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">L<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1884<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">M<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.3529<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">N<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2085<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">O<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1842<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">P<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2614<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span 
style=\"color: #A3BE8C\">Q<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0316<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">R<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2519<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">S<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.4003<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">T<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.3322<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">U<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0814<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">V<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: 
#ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0892<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">W<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2527<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">X<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0343<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">Y<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0304<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">Z<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0076<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">a<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">5.1880<\/span><span style=\"color: 
#ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">b<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1.0195<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">c<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2.1129<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">d<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2.5071<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">e<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">8.5771<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">f<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1.3725<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">g<\/span><span style=\"color: 
#ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1.5597<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">h<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2.7444<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">i<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">4.9019<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">j<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0867<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">k<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.6753<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">l<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span 
style=\"color: #B48EAD\">3.1750<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">m<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1.6437<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">n<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">4.9701<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">o<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">5.7701<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">p<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1.5482<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">q<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0747<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span 
style=\"color: #A3BE8C\">r<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">4.2586<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">s<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">4.3686<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">t<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">6.3700<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">u<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2.0999<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">v<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.8462<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">w<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: 
#ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1.3034<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">x<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1950<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">y<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1.1330<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">z<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0596<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">})<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">codec_alphanumeric<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">get_code_table<\/span><span style=\"color: #ECEFF4\">())<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Frequencies of all printable ASCII characters (whitespace excluded), taken from the Palm OS PDA memos of an English writer in 2002<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Credit: http:\/\/fitaly.com\/board\/domper3\/posts\/136.html<\/span><\/span>\n<span
class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">codec_all <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> HuffmanCodec<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">from_frequencies<\/span><span style=\"color: #ECEFF4\">(<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">{<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">!<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0072<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #EBCB8B\">\\&quot;<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2442<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">#<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0179<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">$<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0561<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">%<\/span><span 
style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0160<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">&amp;<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0226<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #EBCB8B\">\\&#39;<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2447<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">(<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2178<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">)<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2233<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">*<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: 
#D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0628<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">+<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0215<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">,<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.7384<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">-<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1.3734<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">.<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1.5124<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">\/<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1549<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: 
#ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">0<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.5516<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">1<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.4594<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">2<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.3322<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">3<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1847<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">4<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1348<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">5<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span 
style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1663<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">6<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1153<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">7<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1030<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">8<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1054<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">9<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1024<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">:<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.4354<\/span><span style=\"color: 
#ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">;<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1214<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">&lt;<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1225<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">=<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0227<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">&gt;<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1242<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">?<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1474<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">@<\/span><span 
style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0073<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">A<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.3132<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">B<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2163<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">C<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.3906<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">D<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.3151<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">E<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> 
<\/span><span style=\"color: #B48EAD\">0.2673<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">F<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1416<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">G<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1876<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">H<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2321<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">I<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.3211<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">J<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1726<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: 
#ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">K<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0687<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">L<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1884<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">M<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.3529<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">N<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2085<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">O<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1842<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">P<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span 
style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2614<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">Q<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0316<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">R<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2519<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">S<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.4003<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">T<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.3322<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">U<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0814<\/span><span style=\"color: 
#ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">V<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0892<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">W<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.2527<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">X<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0343<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">Y<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0304<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">Z<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0076<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">[<\/span><span style=\"color: 
#ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0086<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #EBCB8B\">\\\\<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0016<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">]<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0088<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">^<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0003<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">_<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1159<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">`<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span 
style=\"color: #B48EAD\">0.0009<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">a<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">5.1880<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">b<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1.0195<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">c<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2.1129<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">d<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2.5071<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">e<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">8.5771<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span 
style=\"color: #A3BE8C\">f<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1.3725<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">g<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1.5597<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">h<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2.7444<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">i<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">4.9019<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">j<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0867<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">k<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: 
#ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.6753<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">l<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">3.1750<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">m<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1.6437<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">n<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">4.9701<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">o<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">5.7701<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">p<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1.5482<\/span><span style=\"color: 
#ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">q<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0747<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">r<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">4.2586<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">s<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">4.3686<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">t<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">6.3700<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">u<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2.0999<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">v<\/span><span style=\"color: 
#ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.8462<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">w<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1.3034<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">x<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.1950<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">y<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1.1330<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">z<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0596<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">{<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span 
style=\"color: #B48EAD\">0.0026<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">|<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0007<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">}<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0026<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">~<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0.0003<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #ECEFF4\">})<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">codec_all<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">get_code_table<\/span><span style=\"color: #ECEFF4\">())<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#quit()        <\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">def<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">check_file_is_utf8<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">filename<\/span><span style=\"color: 
#ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">checking encoding of:<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">filename<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">try<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        f <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> codecs<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">open<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">filename<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">encoding<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">utf-8<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">errors<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">strict<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> line <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: 
#D8DEE9FF\"> f<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">pass<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Valid utf-8<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">True<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">except<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #8FBCBB\">UnicodeDecodeError<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">invalid utf-8<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">False<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">def<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">find_huffmann_to_use<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">token<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    
<\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> use_huffmann<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">do not use Huffmann, encode char by char<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    not_alllower <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> re<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">search<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">[^a-z]<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> token<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> not_alllower<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">all lower case<\/span><span 
style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    not_alllowerorupper <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> re<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">search<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">[^A-Za-z]<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> token<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> not_alllowerorupper<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">all lower or upper<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    not_alllalphanumeric <\/span><span style=\"color: #81A1C1\">=<\/span><span 
style=\"color: #D8DEE9FF\"> re<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">search<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">[^A-Za-z0-9]<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> token<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> not_alllalphanumeric<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">all alpha numeric<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">3<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">all printable, except whitespace<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: 
#B48EAD\">4<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">def<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">encode_unknown<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">token<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9\">treecode<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">treecode <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        bytes_unknown <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">bytearray<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> charidx <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">range<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">token<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span 
style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">appending chars..<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">token<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">charidx<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># only append if it is not an unexpected termination in the unknown token<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">ord<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">token<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">charidx<\/span><span style=\"color: #ECEFF4\">])<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                bytes_unknown<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">ord<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">token<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">charidx<\/span><span style=\"color: 
#ECEFF4\">]))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">unexpected termination chr(0) in unknown token, discarding character<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> bytes_unknown<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">treecode <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> codec_lower<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">encode<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">token<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">treecode <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: 
#B48EAD\">2<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> codec_upperlower<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">encode<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">token<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\">           <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">treecode <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> codec_alphanumeric<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">encode<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">token<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\">                      <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">treecode <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">4<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> codec_all<\/span><span style=\"color: 
#ECEFF4\">.<\/span><span style=\"color: #88C0D0\">encode<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">token<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\">                      <\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">def<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">decode_unknown<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">bytetoken<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9\">treecode<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">treecode <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> codec_lower<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">decode<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">bytetoken<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">treecode <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        
<\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> codec_upperlower<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">decode<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">bytetoken<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\">           <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">treecode <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> codec_alphanumeric<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">decode<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">bytetoken<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\">                      <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">treecode <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">4<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> codec_all<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">decode<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: 
#D8DEE9FF\">bytetoken<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\">  <\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">def<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">compress_token_or_subtoken<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">compressed<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9\">line_token<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9\">token_of_line_count<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9\">lentoken<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9\">gendic<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">  <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">global<\/span><span style=\"color: #D8DEE9FF\"> unknown_token_idx<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">try<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #616E88\"># is the token in english dictionary ?<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">line_token:<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> line_token<\/span><span style=\"color: 
#ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        tokenid <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> engdictrev<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">line_token<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        subtokensid <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">tokenid<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">except<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">unknown word, special chars adjunct, or possessive form<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #616E88\"># let&#39;s try to split the unknown word from possible adjunct special chars<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #616E88\"># for this we use another tokenizer<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        subtokens <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> nltk<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">word_tokenize<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: 
#D8DEE9FF\">line_token<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">subtokens<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># no luck...<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># <\/span><span style=\"color: #81A1C1\">TODO<\/span><span style=\"color: #616E88\"> : do not drop the word silently, encode it !<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># If we encode a ngram dic, skip ngrams with unknown tokens in the primary dic.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># and return empty bytearray to signify ngram compression failure <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">gendic<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                compressed <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">bytearray<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><span 
style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">gendic : unknown word<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> token_of_line_count<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">unknown word<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#AMEND dictionary <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># add this unknown subtoken to a session dic so it can be recalled.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">unknown word: <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> subtokens<\/span><span style=\"color: 
#ECEFF4\">[<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\"> adding to session dic at id: <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">str<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">unknown_token_idx<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">unknown word, adding to session dic at id: <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">str<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">unknown_token_idx<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            engdictrev<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">subtokens<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">]]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> unknown_token_idx<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span 
style=\"color: #D8DEE9FF\">unknown_token_idx<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> subtokens<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            unknown_token_idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                       <\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#subtokensid = [4194304 - 1] # subtoken code for unknown word escape sequence.                       <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            subtokensid <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">4194303<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">find_huffmann_to_use<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">subtokens<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">])]<\/span><span style=\"color: #D8DEE9FF\">                   <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#print(subtokensid)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#continue<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span 
style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">possible special char found<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            subtokensid <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">[]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> subtoken <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> subtokens<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">subtoken=<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">subtoken<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">try<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    subtokensid<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span 
style=\"color: #D8DEE9FF\">engdictrev<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">subtoken<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">except<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># no luck...<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># <\/span><span style=\"color: #81A1C1\">TODO<\/span><span style=\"color: #616E88\"> : do not drop the word silently, encode it !<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># If we encode a ngram dic, skip ngrams with unknown tokens in the primary dic.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># and return empty bytearray to signify ngram compression failure <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">gendic<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        compressed <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">bytearray<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: 
#ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">gendic : unknown word<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> token_of_line_count<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">unknown subtoken<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    subtokensid<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">4194303<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">find_huffmann_to_use<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">subtoken<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\">#subtokensid.append(4194304 - 1)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># add 
this unknown subtoken to a session dic so it can be recalled.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\">#AMEND dictionary <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># add this unknown subtoken to a session dic so it can be recalled.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">unknown subtoken: <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> subtoken <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\"> adding to session dic at id: <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">str<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">unknown_token_idx<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">unknown subtoken, adding to session dic at id: <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">str<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: 
#D8DEE9FF\">unknown_token_idx<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    engdictrev<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">subtoken<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> unknown_token_idx<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">unknown_token_idx<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> subtoken<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    unknown_token_idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\">#continue<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    subtokenidx <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> subtokenid <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> subtokensid<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">        <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: 
#ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">subtokenid=<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">subtokenid<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #616E88\"># maximum level of token unpacking is done<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">subtokenid <\/span><span style=\"color: #81A1C1\">&lt;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">super common word<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">subtokenid<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#convert to bytes<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            byte0 <\/span><span 
style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> subtokenid<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">to_bytes<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">byteorder<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">little<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">hex:<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">byte0<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">hex<\/span><span style=\"color: #ECEFF4\">())<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#append to bytearray<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">byte0<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">if<\/span><span 
style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&lt;=<\/span><span style=\"color: #D8DEE9FF\"> subtokenid <\/span><span style=\"color: #81A1C1\">&lt;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">16384<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">common word<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#remove offset<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">subtokenid<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            subtokenid <\/span><span style=\"color: #81A1C1\">-=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#convert to bytes1 (array of 2 bytes)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            bytes1 
<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> subtokenid<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">to_bytes<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9\">byteorder<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">little<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;&quot;<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">join<\/span><span style=\"color: #ECEFF4\">([<\/span><span style=\"color: #81A1C1\">f<\/span><span style=\"color: #A3BE8C\">&quot;<\/span><span style=\"color: #EBCB8B\">\\\\<\/span><span style=\"color: #A3BE8C\">x<\/span><span style=\"color: #EBCB8B\">{<\/span><span style=\"color: #D8DEE9FF\">byte<\/span><span style=\"color: #81A1C1\">:02x<\/span><span style=\"color: #EBCB8B\">}<\/span><span style=\"color: #A3BE8C\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> byte <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> bytes1<\/span><span style=\"color: #ECEFF4\">]))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#convert to bitarray<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            c <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">bitarray<\/span><span 
style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">endian<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">little<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">frombytes<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">bytes1<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># set msb of first byte to 1 and shift the more significant bits up.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">insert<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">7<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># remove excess bit<\/span><\/span>\n<span 
class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">del<\/span><span style=\"color: #D8DEE9FF\"> c<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">16<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #B48EAD\">17<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># append our two tweaked bytes to the compressed bytearray<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">tobytes<\/span><span style=\"color: #ECEFF4\">())[<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">tobytes<\/span><span style=\"color: #ECEFF4\">())[<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        
<\/span><span style=\"color: #616E88\">#if(16384 +128 &lt;= subtokenid &lt; 4194304 - 1):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">16384<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&lt;=<\/span><span style=\"color: #D8DEE9FF\"> subtokenid <\/span><span style=\"color: #81A1C1\">&lt;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2097152<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">16384<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">rare word<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># remove offset<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: 
#D8DEE9FF\">subtokenid<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            subtokenid <\/span><span style=\"color: #81A1C1\">-=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">16384<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#convert to bytes2 (array of 3 bytes)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            bytes2 <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> subtokenid<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">to_bytes<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9\">byteorder<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">little<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;&quot;<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">join<\/span><span style=\"color: #ECEFF4\">([<\/span><span style=\"color: #81A1C1\">f<\/span><span style=\"color: #A3BE8C\">&quot;<\/span><span style=\"color: #EBCB8B\">\\\\<\/span><span style=\"color: #A3BE8C\">x<\/span><span style=\"color: #EBCB8B\">{<\/span><span style=\"color: #D8DEE9FF\">byte<\/span><span style=\"color: 
#81A1C1\">:02x<\/span><span style=\"color: #EBCB8B\">}<\/span><span style=\"color: #A3BE8C\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> byte <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> bytes2<\/span><span style=\"color: #ECEFF4\">]))<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#convert to bitarray<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            c <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">bitarray<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">endian<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">little<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">frombytes<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">bytes2<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># set msb of first byte to 1 and shift the bits above up.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            c<\/span><span style=\"color: 
#ECEFF4\">.<\/span><span style=\"color: #88C0D0\">insert<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">7<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># set msb of second byte to 1 and shift the bits above up.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">insert<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">15<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># remove two excess bits that arose from our shifts<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">del<\/span><span style=\"color: #D8DEE9FF\"> c<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">24<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #B48EAD\">26<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: 
#ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># append our three tweaked bytes to the compressed bytearray<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">tobytes<\/span><span style=\"color: #ECEFF4\">())[<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">tobytes<\/span><span style=\"color: #ECEFF4\">())[<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">tobytes<\/span><span style=\"color: #ECEFF4\">())[<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><\/span>\n<span 
class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\">#if(16384 +128 &lt;= subtokenid &lt; 4194304 - 1):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">16384<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2097152<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&lt;=<\/span><span style=\"color: #D8DEE9FF\"> subtokenid <\/span><span style=\"color: #81A1C1\">&lt;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">4194304<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">5<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">unknown word from session DIC<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># remove offset<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span 
style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">subtokenid<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            subtokenid <\/span><span style=\"color: #81A1C1\">-=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">2097152<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">16384<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#convert to bytes2 (array of 3 bytes)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            bytes2 <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> subtokenid<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">to_bytes<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9\">byteorder<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">little<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;&quot;<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">join<\/span><span style=\"color: 
#ECEFF4\">([<\/span><span style=\"color: #81A1C1\">f<\/span><span style=\"color: #A3BE8C\">&quot;<\/span><span style=\"color: #EBCB8B\">\\\\<\/span><span style=\"color: #A3BE8C\">x<\/span><span style=\"color: #EBCB8B\">{<\/span><span style=\"color: #D8DEE9FF\">byte<\/span><span style=\"color: #81A1C1\">:02x<\/span><span style=\"color: #EBCB8B\">}<\/span><span style=\"color: #A3BE8C\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> byte <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> bytes2<\/span><span style=\"color: #ECEFF4\">]))<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#convert to bitarray<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            c <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">bitarray<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">endian<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">little<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">frombytes<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">bytes2<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            
<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># set msb of first byte to 1 and shift the bits above up.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">insert<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">7<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># set msb of second byte to 1 and shift the bits above up.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">insert<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">15<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># set msb of third byte to 1 and shift the bits above up.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: 
#88C0D0\">insert<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">23<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># remove three excess bits that arose from our shifts<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">del<\/span><span style=\"color: #D8DEE9FF\"> c<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">24<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #B48EAD\">27<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># append our three tweaked bytes to the compressed bytearray<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">.<\/span><span 
style=\"color: #88C0D0\">tobytes<\/span><span style=\"color: #ECEFF4\">())[<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">tobytes<\/span><span style=\"color: #ECEFF4\">())[<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">tobytes<\/span><span style=\"color: #ECEFF4\">())[<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #616E88\">#if(subtokenid == (4194304 - 1)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">subtokenid <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">range<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">4194299<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #B48EAD\">4194304<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span 
style=\"color: #616E88\">#compressed.append(255)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#compressed.append(255)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#compressed.append(255)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">huffmann tree code :<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">str<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">subtokenid<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># <\/span><span style=\"color: #81A1C1\">TODO<\/span><span style=\"color: #616E88\"> : Use Huffmann tree instead of byte-&gt;byte encoding.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#convert to bytes2 (array of 3 bytes)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            bytes2 <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> subtokenid<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">to_bytes<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9\">byteorder<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: 
#ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">little<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;&quot;<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">join<\/span><span style=\"color: #ECEFF4\">([<\/span><span style=\"color: #81A1C1\">f<\/span><span style=\"color: #A3BE8C\">&quot;<\/span><span style=\"color: #EBCB8B\">\\\\<\/span><span style=\"color: #A3BE8C\">x<\/span><span style=\"color: #EBCB8B\">{<\/span><span style=\"color: #D8DEE9FF\">byte<\/span><span style=\"color: #81A1C1\">:02x<\/span><span style=\"color: #EBCB8B\">}<\/span><span style=\"color: #A3BE8C\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> byte <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> bytes2<\/span><span style=\"color: #ECEFF4\">]))<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#convert to bitarray<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            c <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">bitarray<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">endian<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">little<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: 
#88C0D0\">frombytes<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">bytes2<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># set msb of first byte to 1 and shift the bits above up.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">insert<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">7<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># set msb of second byte to 1 and shift the bits above up.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">insert<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">15<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: 
#88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># no need to set  msb of third byte to 1 since the range will take care of it.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#c.insert(23,1)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#debugw(c)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># remove two excess bits that arose from our shifts<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">del<\/span><span style=\"color: #D8DEE9FF\"> c<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">24<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #B48EAD\">26<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># append our three tweaked bytes that signify the huffmann tree to use to the compressed bytearray<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            compressed<\/span><span style=\"color: 
#ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">tobytes<\/span><span style=\"color: #ECEFF4\">())[<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">tobytes<\/span><span style=\"color: #ECEFF4\">())[<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">tobytes<\/span><span style=\"color: #ECEFF4\">())[<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">subtokens<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                
<\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> use_huffmann<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">encoding unknown word<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\">#for charidx in range(0, len(line_token)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\">#    debugw(&quot;appending chars..&quot;)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\">#    debugw(line_token[charidx])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\">#    compressed.append(ord(line_token[charidx]))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">extend<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">encode_unknown<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">line_token<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: 
#D8DEE9FF\">                    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">encoding unknown line token with Huffmann<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    huffmann_tree_code <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">subtokenid <\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">4194303<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">extend<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">encode_unknown<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">line_token<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">huffmann_tree_code<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> use_huffmann<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span 
style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">encoding unknown subtoken<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\">#for charidx in range(0, len(subtokens[subtokenidx])):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\">#    debugw(&quot;appending chars..&quot;)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\">#    debugw((subtokens[subtokenidx])[charidx])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\">#    compressed.append(ord((subtokens[subtokenidx])[charidx]))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">extend<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">encode_unknown<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">subtokens<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">subtokenidx<\/span><span style=\"color: #ECEFF4\">],<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">encoding unknown subtoken with Huffmann<\/span><span style=\"color: 
#ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">subtokens<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">subtokenidx<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\">#huffmann_tree_code = find_huffmann_to_use(subtokens[subtokenidx])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    huffmann_tree_code <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">subtokenid <\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">4194303<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">extend<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">encode_unknown<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">subtokens<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">subtokenidx<\/span><span style=\"color: #ECEFF4\">],<\/span><span style=\"color: #D8DEE9FF\">huffmann_tree_code<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: 
#ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #616E88\"># terminate c string style<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        subtokenidx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #D8DEE9FF\">        <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    token_of_line_count <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">token of line count<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">token_of_line_count<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">lentoken<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">lentoken<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">if<\/span><span 
style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">token_of_line_count <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> lentoken<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> gendic<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #616E88\"># newline<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">append new line<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #616E88\">#quit()  <\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">token_of_line_count<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">def<\/span><span 
style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">compress_tokens<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">tokens<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9\">gendic<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\">#time.sleep(0.001)    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\"># Init byte array<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    compressed <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">bytearray<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">tokens are:<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">tokens<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> token <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> tokens<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: 
#88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">token is:<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">token<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        token_of_line_count <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #616E88\"># start compression run<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">token<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> gendic<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">paragraph<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: 
#D8DEE9FF\">            compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#compressed.append(0)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#quit()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        lentoken <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">token<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> gendic<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> line_token <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> token<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">           <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> token_of_line_count<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> 
<\/span><span style=\"color: #88C0D0\">compress_token_or_subtoken<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">line_token<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">token_of_line_count<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">lentoken<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">gendic<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> token_of_line_count<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">compress_token_or_subtoken<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">token<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">token_of_line_count<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">lentoken<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">gendic<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\">           <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: 
#81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">unknown word in gendic sequence, aborting<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    compressed <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">bytearray<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> compressed<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\"># dump whole compressed stream<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">compressed ngram is=<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">hex<\/span><span style=\"color: #ECEFF4\">())<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    
<\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">compressed ngram byte length is=<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> compressed<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">def<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">compress_second_pass<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">compressed<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    ngram_compressed <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">bytearray<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    ngram_length <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    ngram_byte_length <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    index_jumps <\/span><span style=\"color: 
#81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">[]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    candidates <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">[]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    idx <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\"># second pass main loop<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\">#debugw(&quot;compressed=&quot;)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\">#debugw(compressed)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">while<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">idx <\/span><span style=\"color: #81A1C1\">&lt;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">second pass idx=<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span 
style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        idxchar <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        reset_ngram <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">False<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">indexjumps=<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">index_jumps<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            ngram_compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span 
style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;&quot;<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">join<\/span><span style=\"color: #ECEFF4\">([<\/span><span style=\"color: #81A1C1\">f<\/span><span style=\"color: #A3BE8C\">&quot;<\/span><span style=\"color: #EBCB8B\">\\\\<\/span><span style=\"color: #A3BE8C\">x<\/span><span style=\"color: #EBCB8B\">{<\/span><span style=\"color: #D8DEE9FF\">byte<\/span><span style=\"color: #81A1C1\">:02x<\/span><span style=\"color: #EBCB8B\">}<\/span><span style=\"color: #A3BE8C\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> byte <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> ngram_compressed<\/span><span style=\"color: #ECEFF4\">]))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">super common ext<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            index_jumps<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: 
#88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            ngram_byte_length <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">elif<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">))):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            ngram_compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">extend<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span 
style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;&quot;<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">join<\/span><span style=\"color: #ECEFF4\">([<\/span><span style=\"color: #81A1C1\">f<\/span><span style=\"color: #A3BE8C\">&quot;<\/span><span style=\"color: #EBCB8B\">\\\\<\/span><span style=\"color: #A3BE8C\">x<\/span><span style=\"color: #EBCB8B\">{<\/span><span style=\"color: #D8DEE9FF\">byte<\/span><span style=\"color: #81A1C1\">:02x<\/span><span style=\"color: #EBCB8B\">}<\/span><span style=\"color: #A3BE8C\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> byte <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> ngram_compressed<\/span><span style=\"color: #ECEFF4\">]))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">common ext<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            index_jumps<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: 
#ECEFF4\">(<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            ngram_byte_length <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">elif<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: 
#D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            ngram_compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">extend<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">])<\/span><span style=\"color: #D8DEE9FF\"> <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;&quot;<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">join<\/span><span style=\"color: #ECEFF4\">([<\/span><span style=\"color: #81A1C1\">f<\/span><span style=\"color: #A3BE8C\">&quot;<\/span><span style=\"color: #EBCB8B\">\\\\<\/span><span style=\"color: #A3BE8C\">x<\/span><span style=\"color: #EBCB8B\">{<\/span><span style=\"color: #D8DEE9FF\">byte<\/span><span style=\"color: #81A1C1\">:02x<\/span><span style=\"color: #EBCB8B\">}<\/span><span style=\"color: #A3BE8C\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> byte <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> ngram_compressed<\/span><span style=\"color: #ECEFF4\">]))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: 
#ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">rare ext<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #D8DEE9FF\">  <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            index_jumps<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            ngram_byte_length <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #D8DEE9FF\">     <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">elif<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">255<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span 
style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">255<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">255<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># <\/span><span style=\"color: #81A1C1\">TODO<\/span><span style=\"color: #616E88\"> : take into account 4 escape sequences instead of only one.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#reset ngram_compressed<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            char <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">unknown token sequence detected<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span 
style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#print(char)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#str = &quot;&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            idxchar <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">while<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">char <\/span><span style=\"color: #81A1C1\">!=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                   idxchar <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                   char <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\">idxchar<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                   <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">char=<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                   <\/span><span style=\"color: 
#88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">char<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">end of unknown token sequence detected at idx:<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> idxchar<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            index_jumps<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> idxchar<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            ngram_length <\/span><span style=\"color: #81A1C1\">-=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            reset_ngram <\/span><span style=\"color: 
#81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">True<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">         <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">elif<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: 
#ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># Session DIC space, breaks ngram construction.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">session DIC space, we break ngram construction<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">3<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            index_jumps<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            ngram_length <\/span><span style=\"color: #81A1C1\">-=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            reset_ngram <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">True<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        ngram_length <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: 
#ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">indexjumps=<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">index_jumps<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">ngram_length<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">ngram_length<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(((<\/span><span style=\"color: #D8DEE9FF\">ngram_length <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">ngram_byte_length <\/span><span style=\"color: #81A1C1\">&gt;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">))<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">or<\/span><span 
style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">ngram_length <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">4<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># if there are contractions, apparent ngram length will be one token less and potentially present in N4 ngrams<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># try to replace the ngram if it exists, and only if ngram_byte_length is &gt; 3, otherwise there will be no compression gain.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># save index jumps for rewind operations.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># TO BE CONTINUED .....<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">try<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\"> <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                ngram_compressed_no_ascii <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;&quot;<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">join<\/span><span style=\"color: #ECEFF4\">([<\/span><span style=\"color: #81A1C1\">f<\/span><span style=\"color: #A3BE8C\">&quot;<\/span><span style=\"color: #EBCB8B\">\\\\<\/span><span style=\"color: #A3BE8C\">x<\/span><span style=\"color: #EBCB8B\">{<\/span><span style=\"color: 
#D8DEE9FF\">byte<\/span><span style=\"color: #81A1C1\">:02x<\/span><span style=\"color: #EBCB8B\">}<\/span><span style=\"color: #A3BE8C\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> byte <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> ngram_compressed<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                ngram_compressed_no_ascii <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> ngram_compressed_no_ascii<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">replace<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #EBCB8B\">\\\\<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #ECEFF4\">&quot;&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">ngram_compressed_no_ascii<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                code <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> ngram_dict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">ngram_compressed_no_ascii<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">****FOUND*****<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: 
#ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                ratio <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> ngram_byte_length<\/span><span style=\"color: #81A1C1\">\/<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #616E88\"># all ngrams are encoded in a 3 byte address space, hence div by 3<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                removebytes <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> ngram_byte_length<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">idxchar<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    insertpos <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> idx <\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #D8DEE9FF\"> ngram_byte_length <\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> idxchar<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    insertpos <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> idx <\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #D8DEE9FF\"> ngram_byte_length                
<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                candidates<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">code<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">insertpos<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">removebytes<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">ratio<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">except<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\">#traceback.print_exc()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">no luck 3N\/4N<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># reset all ngram data<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            ngram_length <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            ngram_byte_length <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            ratio <\/span><span style=\"color: 
#81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            removebytes <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            ngram_compressed <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">bytearray<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#rewind...and retry a new ngram window from initial token index + one token shift<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#<\/span><span style=\"color: #81A1C1\">BUG<\/span><span style=\"color: #616E88\"> HERE !!<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">indexjumps=<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">index_jumps<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#time.sleep(0.1)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: 
#ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">lastindexjumps_except_first=<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">index_jumps<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">index_jumps<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">:])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">index_before_rewind=<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            idx <\/span><span style=\"color: #81A1C1\">-=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">sum<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">index_jumps<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">index_jumps<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: 
#81A1C1\">+<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">:])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            index_jumps <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">[]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">idx after rewind=<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">elif<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">reset_ngram<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">ngram reset : unknown token starts before ngram_length 3 or 4<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            ngram_length <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            ngram_byte_length <\/span><span style=\"color: 
#81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            ratio <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            removebytes <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#do not rewind : reset pos after unknown sequence<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            index_jumps <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">[]<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> candidates        <\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">def<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">process_candidates_v2<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">candidates<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\">#here we scan all candidates.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\">#if there are overlaps, we select the candidate with the best ratio, if any.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\">#The result is a reduced 
list of candidates data.<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\">#Next we recreate the compressed stream and replace the bytes at insertpos by the candidate code<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">candidates<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    candidates_reduced <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">[]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    idx_reduced <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    idx <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    deleted_candidates_number <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    mutual_overlaps <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">[]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    overlap_idx <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">while<\/span><span 
style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">idx <\/span><span style=\"color: #81A1C1\">&lt;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">candidates<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        code <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> candidates<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">][<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        insertpos <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> candidates<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">][<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        removebytes <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> candidates<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">][<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        ratio <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> candidates<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">][<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span 
class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        first_overlap <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">True<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> idx_lookahead <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">range<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">candidates<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            code_lookahead <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> candidates<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx_lookahead<\/span><span style=\"color: #ECEFF4\">][<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            insertpos_lookahead <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> candidates<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx_lookahead<\/span><span style=\"color: #ECEFF4\">][<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            removebytes_lookahead 
<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> candidates<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx_lookahead<\/span><span style=\"color: #ECEFF4\">][<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            ratio_lookahead <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> candidates<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx_lookahead<\/span><span style=\"color: #ECEFF4\">][<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">insertpos <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> removebytes <\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&gt;=<\/span><span style=\"color: #D8DEE9FF\"> insertpos_lookahead<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">overlap!<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span 
style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">code<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">code_lookahead<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\">#add mutually overlapping indexes to an array<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">first_overlap<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    mutual_overlaps<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">([<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    mutual_overlaps<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">].<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">idx_lookahead<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    first_overlap <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">False<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                
<\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># case for a mutual overlap of at least 3 ngrams<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">len mutual overlap:<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">mutual_overlaps<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">overlap_idx<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">overlap_idx<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    mutual_overlaps<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">].<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: 
#D8DEE9FF\">idx_lookahead<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                 <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    overlap_idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\">#end of mutual overlap (current lookahead is not overlapping with original idx)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">break<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #D8DEE9FF\">        <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\">#keep best ratio from all overlap lists<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    keep_idxs <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">[]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    remove_idx_shift <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: 
#81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> overlap <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> mutual_overlaps<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        prev_candidate_ratio <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> candidate_idx <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> overlap<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">candidate_idx:<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">candidate_idx<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            candidate_ratio <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> candidates<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">candidate_idx <\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #D8DEE9FF\"> remove_idx_shift<\/span><span style=\"color: #ECEFF4\">][<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: 
#ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">candidate_ratio <\/span><span style=\"color: #81A1C1\">&gt;=<\/span><span style=\"color: #D8DEE9FF\"> prev_candidate_ratio<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                keep_idx <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> candidate_idx<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                prev_candidate_ratio <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> candidate_ratio<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        keep_idxs<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">keep_idx<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> candidate_idx <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> overlap<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">candidate_idx <\/span><span style=\"color: #81A1C1\">!=<\/span><span style=\"color: #D8DEE9FF\"> keep_idx<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span 
class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">candidate len:<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">candidates<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">will delete idx:<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">str<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">candidate_idx <\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #D8DEE9FF\"> remove_idx_shift<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">del<\/span><span style=\"color: #D8DEE9FF\"> candidates<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">candidate_idx <\/span><span style=\"color: #81A1C1\">-<\/span><span 
style=\"color: #D8DEE9FF\"> remove_idx_shift<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                deleted_candidates_number <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">deleted idx:<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">str<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">candidate_idx <\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #D8DEE9FF\"> remove_idx_shift<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                remove_idx_shift <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\">#keep the best ratio only from the list of mutual overlaps<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">deleted_candidates_number <\/span><span style=\"color: #81A1C1\">&gt;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: 
#ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">recursive<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        deleted_candidates_number <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">process_candidates_v2<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">candidates<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\">#need to exit recursion when len candidates stops decreasing<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> candidates<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">def<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">ngram_insert_reserved_bits<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">ngram_compressed<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;&quot;<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: 
#88C0D0\">join<\/span><span style=\"color: #ECEFF4\">([<\/span><span style=\"color: #81A1C1\">f<\/span><span style=\"color: #A3BE8C\">&quot;<\/span><span style=\"color: #EBCB8B\">\\\\<\/span><span style=\"color: #A3BE8C\">x<\/span><span style=\"color: #EBCB8B\">{<\/span><span style=\"color: #D8DEE9FF\">byte<\/span><span style=\"color: #81A1C1\">:02x<\/span><span style=\"color: #EBCB8B\">}<\/span><span style=\"color: #A3BE8C\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> byte <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> ngram_compressed<\/span><span style=\"color: #ECEFF4\">]))<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\">#convert to bitarray<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    c <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">bitarray<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">endian<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">little<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">frombytes<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">ngram_compressed<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    
<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\"># set msb of first byte to 1 and shift the bits above up.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">insert<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">7<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\"># set msb of second byte to 1 and shift the bits above up.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">insert<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">15<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\"># remove two excess bits that arose from our shifts<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">del<\/span><span style=\"color: #D8DEE9FF\"> c<\/span><span style=\"color: #ECEFF4\">[<\/span><span 
style=\"color: #B48EAD\">24<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #B48EAD\">26<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\"># replace the original ngram_compressed bytearray with our tweaked bytes<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    ngram_compressed <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">bytearray<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    ngram_compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">tobytes<\/span><span style=\"color: #ECEFF4\">())[<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    ngram_compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">tobytes<\/span><span style=\"color: #ECEFF4\">())[<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: 
#D8DEE9FF\">    ngram_compressed<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">tobytes<\/span><span style=\"color: #ECEFF4\">())[<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> ngram_compressed<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">def<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">replace_candidates_in_processed<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">candidates<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9\">processed<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    byteshift <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    shiftcode <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> candidate <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> candidates<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            insertpos <\/span><span 
style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> candidate<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #D8DEE9FF\"> byteshift<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            removebytes <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> candidate<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">del<\/span><span style=\"color: #D8DEE9FF\"> processed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">insertpos<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">insertpos <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> removebytes<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            byteshift <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> removebytes<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">## first we need to convert candidate code to proper 3 byte format<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># we add our 4 ngram code space at an offset of 524416 (2^19 + 128) in the 3 bytes address space. 
<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            shifted_code <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">524416<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> candidate<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># now we convert our shifted ngram code to a byte sequence in the compressed format<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            bytes_shiftedcode <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> shifted_code<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">to_bytes<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">byteorder<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">little<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># print it<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">bytes_shiftedcode<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># tweak the bytes to insert reserved bits for 1\/2\/3 bytes variable length 
encoding<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># compliance.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            bytes_shiftedcode <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">ngram_insert_reserved_bits<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">bytes_shiftedcode<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># print it<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">bytes_shiftedcode<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># now we insert it at the position of the non-compressed ngram<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            processed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">insertpos<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">insertpos<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> bytes_shiftedcode<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># we added 3 bytes, we have to compensate to keep future insertpos valid.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            byteshift <\/span><span style=\"color: #81A1C1\">-=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: 
#B48EAD\">3<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> processed<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">def<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">ngram_process_rules<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">subtokens<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\">### VARIOUS DETOKENIZER CLEANUP\/FORMATTING OPERATIONS<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    processed_ngram_string <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    capitalize <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">False<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    token_idx <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> token <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> subtokens<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">capitalize<\/span><span 
style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            token <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> token<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">capitalize<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            capitalize <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">False<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #616E88\"># English syntactic rules : remove whitespace left of &quot;!?.&quot; <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #616E88\"># and enforce capitalization on first non whitespace character following.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">re<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">match<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">[!\\?\\.]<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">token<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            processed_ngram_string <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> token<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            capitalize <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span 
style=\"color: #81A1C1\">True<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #616E88\"># English syntactic rules : remove whitespace left of &quot;,;:&quot; <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">elif<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">re<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">match<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">[,;:]<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">token<\/span><span style=\"color: #ECEFF4\">)):<\/span><span style=\"color: #D8DEE9FF\">         <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            processed_ngram_string <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> token<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            capitalize <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">False<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #616E88\"># append whitespace left of added token<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            processed_ngram_string <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> processed_ngram_string <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> 
<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> token<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        token_idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">subtokens<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> token_idx<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">last token of ngram<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            processed_ngram_string <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> 
processed_ngram_string<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">def<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">decompress_ngram_bytes<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">compressed<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    idx <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    detokenizer_ngram <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">[]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">while<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">idx <\/span><span style=\"color: #81A1C1\">&lt;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span 
style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># current index byte msb is at 0, <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># it is one of the 128 first tokens in the dictionary.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">super common word<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#decode in place<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            inta <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\">        <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            detokenizer_ngram<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">inta<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            idx <\/span><span style=\"color: 
#81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">elif<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">))):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># current index byte msb is at 1, and next byte msb is at 0. 
<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># it is one of the 16384 next tokens in the dictionary.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">common word<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># populate bitarray from the two bytes<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            c <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">bitarray<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">endian<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">little<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">frombytes<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: 
#D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># remove first byte msb (shift down the bits above)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">del<\/span><span style=\"color: #D8DEE9FF\"> c<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">7<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># convert bytes array to 16 bit unsigned integer<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            inta <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">struct<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">unpack<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">&lt;H<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">tobytes<\/span><span style=\"color: #ECEFF4\">()))[<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># add offset back so we get a valid dictionary 
key<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            inta <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># append the decoded word to the ngram token list<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            detokenizer_ngram<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">inta<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># increment byte counter with step 2, we processed 2 bytes.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #616E88\">#elif((compressed[idx] &amp; 128) and (compressed[idx+1] &amp; 128)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">elif<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span 
style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># current index byte msb is at 1, next byte msb is at 1, and third byte msb is at 0. 
<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># it is one of the 2097152 (2^21) next tokens in the dictionary.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">rare word<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            chunk <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># populate bitarray from the three bytes<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            c <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">bitarray<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">endian<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">little<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#c.frombytes(compressed[idx:idx+3])<\/span><\/span>\n<span class=\"line\"><span 
style=\"color: #D8DEE9FF\">            c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">frombytes<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">chunk<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># remove second byte msb (shift down the bits above)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">del<\/span><span style=\"color: #D8DEE9FF\"> c<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">15<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># remove first byte msb (shift down the bits above)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">del<\/span><span style=\"color: #D8DEE9FF\"> c<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">7<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: 
#ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">extend<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">0000000000<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># pad to 4 bytes (32 bit integer format) : 3 bytes + 10 bits <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># because we previously removed two bits with del c[15] and del c[7]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># convert bytes array to 32 bit unsigned integer<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            inta <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">struct<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">unpack<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">&lt;L<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> c<\/span><span 
style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">tobytes<\/span><span style=\"color: #ECEFF4\">()))[<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            inta <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">16384<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            detokenizer_ngram<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">inta<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># increment byte counter with step 3, we processed 3 bytes.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">3<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">return<\/span><span style=\"color: #D8DEE9FF\"> detokenizer_ngram<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">###INLINE START###<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: 
#616E88\">#downloading tokenizer model if missing<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">nltk<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">download<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">punkt<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#opening the english dict of most used 1\/3 million words from google corpus of 1 trillion words.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#special characters have been added with their respective prevalence (from wikipedia corpus)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#contractions also have been added in their form with a quote just after (next line) the form <\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># without quote. 
ex : next line after &quot;dont&quot; appears &quot;don&#39;t&quot;<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">file1 <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">open<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">count_1w.txt<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">r<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">Lines <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> file1<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">readlines<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">#initializing Python dicts<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">count <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">engdict <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">{}<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">engdictrev <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">{}<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># special case : byte val 0 is equal to new line.<\/span><\/span>\n<span class=\"line\"><span 
style=\"color: #616E88\"># <\/span><span style=\"color: #81A1C1\">TODO<\/span><span style=\"color: #616E88\"> : make sure that windows CRLF is taken care of.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #EBCB8B\">\\n<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">engdictrev<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #EBCB8B\">\\n<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># populating dicts<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> line <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> Lines<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\"># Strips the newline character<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">count<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> line<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: 
#88C0D0\">strip<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    engdictrev<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">line<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">strip<\/span><span style=\"color: #ECEFF4\">()]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> count<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    count <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\">### populating ngram dict<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">filengrams <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">open<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">outngrams.bin<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">rt<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">ngramlines <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> filengrams<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">readlines<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">ngram_dict <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span 
style=\"color: #ECEFF4\">{}<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">ngram_dict_rev <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">{}<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">count <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># populating dicts<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> ngramline <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> ngramlines<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Strips the newline character<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\">#keystr = &quot;&quot;.join([f&quot;\\\\x{byte:02x}&quot; for byte in ngramline.strip()])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\">#keystr = keystr.replace(&quot;\\\\&quot;,&quot;&quot;)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\">#if(count == 71374):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    keystr <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> ngramline<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">strip<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\">#print(ngramline.strip())<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: 
#616E88\">#print(keystr)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\">#quit()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    ngram_dict_rev<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">count<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> keystr<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    ngram_dict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">keystr<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> count<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    count <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">idx <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">first ngram in dict:<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">test <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> ngram_dict_rev<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span 
style=\"color: #D8DEE9FF\">test<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">ngram_dict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">test<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">count <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compress<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    tokens <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">[]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\"># check if file is utf-8<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">check_file_is_utf8<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">infile<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">with<\/span><span style=\"color: #D8DEE9FF\"> codecs<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">open<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">infile<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: 
#D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">r<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">encoding<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">utf-8<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">as<\/span><span style=\"color: #D8DEE9FF\"> utf8_file<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># Read the content of the UTF-8 file and transcode it to ASCII<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># encode(&#39;ascii&#39;,&#39;ignore&#39;) MAY replace unknown char with chr(0)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># We don&#39;t want that, as it is a termination char for unknown strings.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># on the other hand backslashreplace replaces too many chars that could be transcribed<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># the best option for now is to check for chr(0) presence before writing the unknown token representation.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            ascii_content <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> utf8_file<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">read<\/span><span style=\"color: 
#ECEFF4\">().<\/span><span style=\"color: #88C0D0\">encode<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">ascii<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">ignore<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">).<\/span><span style=\"color: #88C0D0\">decode<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">ascii<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#debugw(ascii_content)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            Linesin <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> ascii_content<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">splitlines<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">debug_on<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                outfile_ascii <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> infile <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">.asc<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: 
#81A1C1\">with<\/span><span style=\"color: #D8DEE9FF\"> codecs<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">open<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">outfile_ascii<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">w<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">encoding<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">ascii<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">as<\/span><span style=\"color: #D8DEE9FF\"> ascii_file<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    ascii_file<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">write<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">ascii_content<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #616E88\"># Reading file to be compressed<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        file2 <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">open<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">infile<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span 
style=\"color: #A3BE8C\">r<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #616E88\">#text = file2.read()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        Linesin <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> file2<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">readlines<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">gendic<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">         <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">outfile<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                fh <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">open<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">outfile<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">wt<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    lineidx <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span 
style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> line <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> Linesin<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        line <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> line<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">lower<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #616E88\"># First pass tokenizer (does not split adjunct special chars)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        line_tokens <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> tknzr<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">tokenize<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">line<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #616E88\">#debugw(line_tokens)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> gendic<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            tokens<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">line_tokens<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span 
class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            compressed <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">compress_tokens<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">line_tokens<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">gendic<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">outfile<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># write compressed binary stream to file if supplied in args or to stdout otherwise.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                hexstr <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;&quot;<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">join<\/span><span style=\"color: #ECEFF4\">([<\/span><span style=\"color: #81A1C1\">f<\/span><span style=\"color: #A3BE8C\">&quot;<\/span><span style=\"color: #EBCB8B\">\\\\<\/span><span style=\"color: #A3BE8C\">x<\/span><span style=\"color: 
#EBCB8B\">{<\/span><span style=\"color: #D8DEE9FF\">byte<\/span><span style=\"color: #81A1C1\">:02x<\/span><span style=\"color: #EBCB8B\">}<\/span><span style=\"color: #A3BE8C\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> byte <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> compressed<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                hexstr <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> hexstr<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">replace<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #EBCB8B\">\\\\<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #ECEFF4\">&quot;&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                fh<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">write<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">hexstr<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">debug_ngrams_dic<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    fh<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">write<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #EBCB8B\">\\t<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: 
#ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    strline <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">str<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">lineidx<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    fh<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">write<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">strline<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                fh<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">write<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #EBCB8B\">\\n<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                sys<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">stdout<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">buffer<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">write<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                sys<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">stdout<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">buffer<\/span><span style=\"color: 
#ECEFF4\">.<\/span><span style=\"color: #88C0D0\">write<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">b<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #EBCB8B\">\\n<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        lineidx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\">#line_tokens.append(&quot;\\n&quot;)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\">#tokens = tokens + line_tokens<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">tokens<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> gendic<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        compressed <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">compress_tokens<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">tokens<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">gendic<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span 
class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">secondpass<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            candidates <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">compress_second_pass<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">candidates:<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">candidates<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            processed_candidates <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">process_candidates_v2<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">candidates<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">processed candidates:<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span 
style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">processed_candidates<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            compressed <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">replace_candidates_in_processed<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">processed_candidates<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #616E88\"># write compressed binary stream to file if supplied in args or to stdout otherwise.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">outfile<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">with<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">open<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">outfile<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">wb<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">as<\/span><span style=\"color: #D8DEE9FF\"> fh<\/span><span 
style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                fh<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">write<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            sys<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">stdout<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">buffer<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">write<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> sessidx <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">range<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">2113664<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">unknown_token_idx<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">session_index:<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: 
#88C0D0\">str<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">sessidx<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">sessidx<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">engdictrev<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">sessidx<\/span><span style=\"color: #ECEFF4\">]])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">session_index:<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">str<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">sessidx<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    fh<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">close<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># decompress mode<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">else<\/span><span style=\"color: 
#ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\"># decoding part<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">decoding...<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    detokenizer <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">[]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    detokenizer_idx <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">infile<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">with<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">open<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">infile<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">rb<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">as<\/span><span style=\"color: #D8DEE9FF\"> fh<\/span><span style=\"color: 
#ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            compressed <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">bytearray<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">fh<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">read<\/span><span style=\"color: #ECEFF4\">())<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    idx <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\">#FirstCharOfLine = 1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    CharIsUpperCase <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\">#CharIsUpperCase2 = 0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #616E88\"># main decoding loop<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">while<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">idx <\/span><span style=\"color: #81A1C1\">&lt;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><\/span>\n<span 
class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\"># write each byte<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">hex<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">]))<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#if( (idx &gt; 0) and compressed[idx] == 0 and compressed[idx - 1] == 0):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#find len of consecutive 0 chars<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">idx <\/span><span style=\"color: #81A1C1\">&lt;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span 
style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">!=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\">#FirstCharOfLine = 1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    CharIsUpperCase <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">elif<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">CharIsUpperCase <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\">#FirstCharOfLine = 2<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    CharIsUpperCase <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2<\/span><\/span>\n<span class=\"line\"><span style=\"color: 
#D8DEE9FF\">                        <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">detokenizer<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&gt;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\">### VARIOUS DETOKENIZER CLEANUP\/FORMATTING OPERATIONS<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\">#ensure this is not the end of an ngram. 
ngrams necessarily contain whitespaces<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> re<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">search<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">detokenizer<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">detokenizer_idx<\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">])):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># English syntactic rules : remove whitespace left of &quot;!?.&quot; <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># and enforce capitalization on first non whitespace character following.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">re<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">match<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">[!\\?\\.]<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">detokenizer<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: 
#D8DEE9FF\">detokenizer_idx<\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">])<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> detokenizer_idx <\/span><span style=\"color: #81A1C1\">&gt;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        <\/span><span style=\"color: #81A1C1\">del<\/span><span style=\"color: #D8DEE9FF\"> detokenizer<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">detokenizer_idx<\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        detokenizer_idx <\/span><span style=\"color: #81A1C1\">-=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">CharIsUpperCase <\/span><span style=\"color: #81A1C1\">!=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                            CharIsUpperCase <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># English syntactic rules : remove whitespace left of &quot;,;:&quot; <\/span><\/span>\n<span 
class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">re<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">match<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">[,;:]<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">detokenizer<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">detokenizer_idx<\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">])<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> detokenizer_idx <\/span><span style=\"color: #81A1C1\">&gt;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">):<\/span><span style=\"color: #D8DEE9FF\">         <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        <\/span><span style=\"color: #81A1C1\">del<\/span><span style=\"color: #D8DEE9FF\"> detokenizer<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">detokenizer_idx<\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        detokenizer_idx <\/span><span style=\"color: #81A1C1\">-=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># URL\/URI detected, remove any 
spurious whitespace before &quot;\/\/&quot; <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">re<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">match<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">^\\\/\\\/<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">detokenizer<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">detokenizer_idx<\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">])<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> detokenizer_idx <\/span><span style=\"color: #81A1C1\">&gt;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">):<\/span><span style=\"color: #D8DEE9FF\">         <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        <\/span><span style=\"color: #81A1C1\">del<\/span><span style=\"color: #D8DEE9FF\"> detokenizer<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">detokenizer_idx<\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        detokenizer_idx <\/span><span style=\"color: #81A1C1\">-=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><\/span>\n<span 
class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># E-mail detected, remove whitespaces left and right of &quot;@&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">re<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">match<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">@<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">detokenizer<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">detokenizer_idx<\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">])<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> detokenizer_idx <\/span><span style=\"color: #81A1C1\">&gt;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">):<\/span><span style=\"color: #D8DEE9FF\">         <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        <\/span><span style=\"color: #81A1C1\">del<\/span><span style=\"color: #D8DEE9FF\"> detokenizer<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">detokenizer_idx<\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        detokenizer_idx <\/span><span style=\"color: #81A1C1\">-=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: 
#B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        <\/span><span style=\"color: #81A1C1\">del<\/span><span style=\"color: #D8DEE9FF\"> detokenizer<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">detokenizer_idx<\/span><span style=\"color: #81A1C1\">-<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        detokenizer_idx <\/span><span style=\"color: #81A1C1\">-=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># current index byte msb is at 0, <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># it is one of the 128 first tokens in the dictionary.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: 
#ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">super common word<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\">#decode in place<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                inta <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                       <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">CharIsUpperCase <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    detokenizer<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">inta<\/span><span style=\"color: #ECEFF4\">].<\/span><span style=\"color: #88C0D0\">capitalize<\/span><span style=\"color: #ECEFF4\">())<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    detokenizer_idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    
CharIsUpperCase <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    detokenizer<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">inta<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    detokenizer_idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                  <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># print to stdout<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">CharIsUpperCase <\/span><span style=\"color: #81A1C1\">!=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    detokenizer<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: 
#ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    detokenizer_idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">inta<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">elif<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: 
#D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">))):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># current index byte msb is at 1, and next byte msb is at 0. <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># it is one of the 16384 next tokens in the dictionary.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">common word<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># populate bitarray from the two bytes<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                c <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">bitarray<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">endian<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">little<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">frombytes<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: 
#D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># remove first byte msb (shift down the bits above)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">del<\/span><span style=\"color: #D8DEE9FF\"> c<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">7<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># convert bytes array to 16 bit unsigned integer<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                inta <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">struct<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">unpack<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: 
#ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">&lt;H<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">tobytes<\/span><span style=\"color: #ECEFF4\">()))[<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># add offset back so we get a valid dictionary key<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                inta <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># print word<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">CharIsUpperCase <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    detokenizer<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">inta<\/span><span style=\"color: #ECEFF4\">].<\/span><span style=\"color: #88C0D0\">capitalize<\/span><span style=\"color: #ECEFF4\">())<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    detokenizer_idx 
<\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    CharIsUpperCase <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    detokenizer<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">inta<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    detokenizer_idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #D8DEE9FF\">   <\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">CharIsUpperCase <\/span><span style=\"color: #81A1C1\">!=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    detokenizer<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span 
style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    detokenizer_idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #D8DEE9FF\"> <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">inta<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># increment byte counter with step 2, we processed 2 bytes.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#elif((compressed[idx] &amp; 128) and (compressed[idx+1] &amp; 128)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">elif<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> 
<\/span><span style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #D8DEE9FF\"> compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># current index byte msb is at 1, next byte msb is at 1, and third byte msb is at 0. 
<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># it is one of the next 2097152 (2^21) three-byte codes: a rare dictionary token or an ngram.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">rare word<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                chunk <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># populate bitarray from the three bytes<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                c <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">bitarray<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">endian<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">little<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\">#c.frombytes(compressed[idx:idx+3])<\/span><\/span>\n<span 
class=\"line\"><span style=\"color: #D8DEE9FF\">                c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">frombytes<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">chunk<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># remove second byte msb (shift down the bits above)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">del<\/span><span style=\"color: #D8DEE9FF\"> c<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">15<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># remove first byte msb (shift down the bits above)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">del<\/span><span style=\"color: #D8DEE9FF\"> c<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">7<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: 
#88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">extend<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">0000000000<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># pad to 4 bytes (32-bit integer format): 22 remaining bits + 10 zero bits, <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># because we previously removed two of the 24 bits with del c[15] and del c[7]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># convert bytes array to 32 bit unsigned integer<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                inta <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">struct<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">unpack<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">&lt;L<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: 
#ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">tobytes<\/span><span style=\"color: #ECEFF4\">()))[<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">inta <\/span><span style=\"color: #81A1C1\">&gt;=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">524416<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># this is a ngram.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># remove offset to get into ngram dic code range.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    inta <\/span><span style=\"color: #81A1C1\">-=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">524416<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">this is an ngram. 
code:<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">inta<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># process ngram through ngram dictionary<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># replace ngram code with corresponding ngram string and add them to the tokenizer<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    ngram_string <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> ngram_dict_rev<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">inta<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">ngram string:<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">ngram_string<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    subs <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                  
  <\/span><span style=\"color: #616E88\">#(ngram_string,subs) = re.subn(r&#39;x&#39;,r&#39;\\\\x&#39;,ngram_string)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">ngram_string<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">subs<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> re<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">subn<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #81A1C1\">r<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #EBCB8B\">x<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #81A1C1\">r<\/span><span style=\"color: #ECEFF4\">&#39;&#39;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">ngram_string<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\">   <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">ngram string:<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">ngram_string<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    ngram_bytes <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: 
#88C0D0\">bytes<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">fromhex<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">ngram_string<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    subtokens <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">decompress_ngram_bytes<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">ngram_bytes<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\">#bytes = bytearray(ngram_string,encoding=&quot;ascii&quot;)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\">#subtokens.insert(0,&quot;PREFIX&quot;)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\">#subtokens.append(&quot;SUFFIX&quot;)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\">#subtokens = nltk.word_tokenize(ngram_string)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># We know there shouldn&#39;t be any new lines in the subtokens.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># possessives, contractions or punctuation may occur.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: 
#616E88\"># we need to apply the capitalization and space-after-punctuation rules.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># These should be caught by the detokenizer backward processor (detokenizer_idx -2)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># The problem is we append more than one token.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># So we should process rules for the first subtoken insertion only.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># The rest should have inline processing (here)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">CharIsUpperCase <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        detokenizer<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">subtokens<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">].<\/span><span style=\"color: #88C0D0\">capitalize<\/span><span style=\"color: #ECEFF4\">())<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        detokenizer_idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: 
#B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        CharIsUpperCase <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        detokenizer<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">subtokens<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        detokenizer_idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #D8DEE9FF\"> <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\">#if(CharIsUpperCase != 1):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\">#    detokenizer.append(&quot; &quot;) <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\">#    detokenizer_idx += 1<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    ngram_processed_string <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">ngram_process_rules<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">subtokens<\/span><span style=\"color: 
#ECEFF4\">[<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">:])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># We should take care that the backward detokenizer processor does not interfere<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># with the rest of the ngram string.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># Such a special token will be the only one to have whitespace in it,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># so we can detect it this way.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    detokenizer<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">ngram_processed_string<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    detokenizer_idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                                        <\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    inta <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">16384<\/span><span 
style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">CharIsUpperCase <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        detokenizer<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">inta<\/span><span style=\"color: #ECEFF4\">].<\/span><span style=\"color: #88C0D0\">capitalize<\/span><span style=\"color: #ECEFF4\">())<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        detokenizer_idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        CharIsUpperCase <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        detokenizer<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: 
#ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">inta<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        detokenizer_idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #D8DEE9FF\"> <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">CharIsUpperCase <\/span><span style=\"color: #81A1C1\">!=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        detokenizer<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        detokenizer_idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">inta<\/span><span style=\"color: 
#ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># increment byte counter with step 3, we processed 3 bytes.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">3<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #616E88\">#elif((compressed[idx] == 255) and (compressed[idx+1] == 255) and (compressed[idx+2] == 255)):   <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><span style=\"color: #81A1C1\">elif<\/span><span style=\"color: #ECEFF4\">((<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">and<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span 
style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&amp;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\">#check if Huffmann first<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                chunk <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #ECEFF4\">:<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># populate bitarray from the three bytes<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                c <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">bitarray<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9\">endian<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">little<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span 
style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\">#c.frombytes(compressed[idx:idx+3])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">frombytes<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">chunk<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># remove third byte msb (shift down the bits above)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">del<\/span><span style=\"color: #D8DEE9FF\"> c<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">23<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># remove second byte msb (shift down the bits above)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">del<\/span><span style=\"color: #D8DEE9FF\"> c<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">15<\/span><span 
style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># remove first byte msb (shift down the bits above)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">del<\/span><span style=\"color: #D8DEE9FF\"> c<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #B48EAD\">7<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">extend<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">00000000000<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># pad to 4 bytes (32 bit integer format) : 3 bytes + 8 bits + 3 bits <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># because we previously removed three bits with del c[23], del c[15] and del c[7]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                
<\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">c<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># convert bytes array to 32 bit unsigned integer<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                inta <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">struct<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">unpack<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">&lt;L<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> c<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">tobytes<\/span><span style=\"color: #ECEFF4\">()))[<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                inta <\/span><span style=\"color: #81A1C1\">-=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2097151<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># if it is a Huffmann select tree code it will be 0 to 4 included<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #616E88\"># if it is a session DIC it will be shifted in the negatives.<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> 
<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">inta <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">range<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #B48EAD\">5<\/span><span style=\"color: #ECEFF4\">)):<\/span><span style=\"color: #D8DEE9FF\">        <\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># unknown word<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># end check if Huffmann first<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">unknown word escape sequence detected, code: <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">str<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">inta<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\">#unknown word escape sequence detected.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">inta <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: 
#ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        char <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        stra <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        idxchar <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        <\/span><span style=\"color: #81A1C1\">while<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">char <\/span><span style=\"color: #81A1C1\">!=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">char=<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">char<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                            stra 
<\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">chr<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">char<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">printing string state=<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">stra<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                            idxchar <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                            char <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> idxchar<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">termination char detected=<\/span><span style=\"color: 
#ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">char<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        bstr <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">bytearray<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        idxchar <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        <\/span><span style=\"color: #81A1C1\">while<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">char <\/span><span style=\"color: #81A1C1\">!=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                            bstr<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">char<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                            idxchar <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: 
#D8DEE9FF\">                            char <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> compressed<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">idx<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> idxchar<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">huffmann : termination char detected=<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">char<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        stra <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">decode_unknown<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">bstr<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">inta<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        <\/span><span style=\"color: #616E88\">#stra = codec.decode(bstr)    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: 
#ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">we append that unknown word in our session dic at idx: <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">str<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">unknown_token_idx<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\"> since it may be recalled<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    engdictrev<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">stra<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> unknown_token_idx<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">unknown_token_idx<\/span><span style=\"color: #ECEFF4\">]<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> stra<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    unknown_token_idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        <\/span><\/span>\n<span 
class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">CharIsUpperCase <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        detokenizer<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">stra<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">capitalize<\/span><span style=\"color: #ECEFF4\">())<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        detokenizer_idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        CharIsUpperCase <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        detokenizer<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">stra<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        detokenizer_idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><span 
style=\"color: #D8DEE9FF\"> <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">CharIsUpperCase <\/span><span style=\"color: #81A1C1\">!=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        detokenizer<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        detokenizer_idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                <\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    inta <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2097151<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># it is a session DIC, shifting back to 0.<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    inta <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: 
#ECEFF4\">(<\/span><span style=\"color: #B48EAD\">2097152<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">16384<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">128<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #616E88\"># it is a session DIC, shifting back session dic address space.<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">recalled word:<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #81A1C1\">try<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">inta<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        <\/span><span style=\"color: #616E88\"># print word<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        <\/span><span 
style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">CharIsUpperCase <\/span><span style=\"color: #81A1C1\">==<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">2<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                            detokenizer<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">inta<\/span><span style=\"color: #ECEFF4\">].<\/span><span style=\"color: #88C0D0\">capitalize<\/span><span style=\"color: #ECEFF4\">())<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                            detokenizer_idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                            CharIsUpperCase <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        <\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                            detokenizer<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">inta<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                            detokenizer_idx <\/span><span style=\"color: 
#81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #D8DEE9FF\">   <\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">CharIsUpperCase <\/span><span style=\"color: #81A1C1\">!=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                            detokenizer<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">append<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                            detokenizer_idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">1<\/span><span style=\"color: #D8DEE9FF\"> <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                    <\/span><span style=\"color: #81A1C1\">except<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                        <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">something went wrong, could not find word in session DIC<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span 
class=\"line\"><span style=\"color: #D8DEE9FF\">                        <\/span><span style=\"color: #81A1C1\">for<\/span><span style=\"color: #D8DEE9FF\"> sessidx <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">range<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #B48EAD\">2113664<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\">unknown_token_idx<\/span><span style=\"color: #ECEFF4\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">session_index:<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">str<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">sessidx<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">sessidx<\/span><span style=\"color: #ECEFF4\">])<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">engdictrev<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">engdict<\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #D8DEE9FF\">sessidx<\/span><span style=\"color: #ECEFF4\">]])<\/span><\/span>\n<span class=\"line\"><span 
style=\"color: #D8DEE9FF\">                            <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">session_index:<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">str<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">sessidx<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">                idx <\/span><span style=\"color: #81A1C1\">+=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">3<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9FF\"> idxchar<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">debugw<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">detokenizer<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">if<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">not<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">outfile<\/span><span style=\"color: #ECEFF4\">)):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">print<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&#39;&#39;<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">join<\/span><span 
style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">detokenizer<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #81A1C1\">else<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #616E88\"># write clear text to file if supplied in args<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #81A1C1\">with<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">open<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">outfile<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">w<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">as<\/span><span style=\"color: #D8DEE9FF\"> fh<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">            fh<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">write<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&#39;&#39;<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">join<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">detokenizer<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><\/span>\n<span class=\"line\"><\/span><\/code><\/pre><\/div>\n","protected":false},"excerpt":{"rendered":"<p>NOTE : updated documentation and source code is available at : https:\/\/github.com\/rodv92\/PLETSC PLETSC Lightweight english text stream compression, with word tokens, 
ngrams, session dictionaries and Huffman for unknown words. How to use : git clone and decompress dics.zip in the current folder. Syntax for compression : python3 dicstrv.py -c txt_inputfile compressed_outputfile Reads txt_inputfile and writes<\/p><\/div>\n<div class=\"blog-btn\"><a href=\"https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/\" class=\"home-blog-btn\">Read More<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_editorskit_title_hidden":false,"_editorskit_reading_time":0,"_editorskit_is_block_options_detached":false,"_editorskit_block_options_position":"{}","pmpro_default_level":"","footnotes":""},"categories":[16],"tags":[],"class_list":["post-1074","post","type-post","status-publish","format-standard","hentry","category-information-technology","pmpro-has-access"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.9 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Lightweight ASCII English Text Stream Compression in Python. - SKYNEXT Tech.<\/title>\n<meta name=\"description\" content=\"Lightweight English text stream compression, with word tokens, ngrams, session dictionaries and Huffman trees for unknown words.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Lightweight ASCII English Text Stream Compression in Python. 
- SKYNEXT Tech.\" \/>\n<meta property=\"og:description\" content=\"Lightweight English text stream compression, with word tokens, ngrams, session dictionaries and Huffman trees for unknown words.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/\" \/>\n<meta property=\"og:site_name\" content=\"SKYNEXT Tech.\" \/>\n<meta property=\"article:published_time\" content=\"2023-10-15T20:05:19+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-11-05T17:43:40+00:00\" \/>\n<meta name=\"author\" content=\"R.Verissimo\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"R.Verissimo\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"42 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/\"},\"author\":{\"name\":\"R.Verissimo\",\"@id\":\"https:\/\/www.skynext.tech\/#\/schema\/person\/6b71040d3e4353a85583550901159cd8\"},\"headline\":\"Lightweight ASCII English Text Stream Compression in Python.\",\"datePublished\":\"2023-10-15T20:05:19+00:00\",\"dateModified\":\"2023-11-05T17:43:40+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/\"},\"wordCount\":1287,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/www.skynext.tech\/#organization\"},\"articleSection\":[\"Information 
Technology\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/\",\"url\":\"https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/\",\"name\":\"Lightweight ASCII English Text Stream Compression in Python. - SKYNEXT Tech.\",\"isPartOf\":{\"@id\":\"https:\/\/www.skynext.tech\/#website\"},\"datePublished\":\"2023-10-15T20:05:19+00:00\",\"dateModified\":\"2023-11-05T17:43:40+00:00\",\"description\":\"Lightweight English text stream compression, with word tokens, ngrams, session dictionaries and Huffman trees for unknown words.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.skynext.tech\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Lightweight ASCII English Text Stream Compression in Python.\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.skynext.tech\/#website\",\"url\":\"https:\/\/www.skynext.tech\/\",\"name\":\"SKYNEXT Tech.\",\"description\":\"Power Electronics &amp; Reverse 
Engineering\",\"publisher\":{\"@id\":\"https:\/\/www.skynext.tech\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.skynext.tech\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.skynext.tech\/#organization\",\"name\":\"SKYNEXT Tech.\",\"alternateName\":\"DELIVERYSIMO EIRL\",\"url\":\"https:\/\/www.skynext.tech\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.skynext.tech\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.skynext.tech\/wp-content\/uploads\/2019\/09\/cropped-SKYNEXT_logo_square-2.png\",\"contentUrl\":\"https:\/\/www.skynext.tech\/wp-content\/uploads\/2019\/09\/cropped-SKYNEXT_logo_square-2.png\",\"width\":210,\"height\":210,\"caption\":\"SKYNEXT Tech.\"},\"image\":{\"@id\":\"https:\/\/www.skynext.tech\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.skynext.tech\/#\/schema\/person\/6b71040d3e4353a85583550901159cd8\",\"name\":\"R.Verissimo\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.skynext.tech\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/0bf345444b71baae1301e50a1a8cbeb98a5b7f41b85ffe9e1e6c2640ef23b528?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/0bf345444b71baae1301e50a1a8cbeb98a5b7f41b85ffe9e1e6c2640ef23b528?s=96&d=mm&r=g\",\"caption\":\"R.Verissimo\"},\"url\":\"https:\/\/www.skynext.tech\/index.php\/author\/wpadmin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Lightweight ASCII English Text Stream Compression in Python. 
- SKYNEXT Tech.","description":"Lightweight English text stream compression, with word tokens, ngrams, session dictionaries and Huffman trees for unknown words.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/","og_locale":"en_US","og_type":"article","og_title":"Lightweight ASCII English Text Stream Compression in Python. - SKYNEXT Tech.","og_description":"Lightweight English text stream compression, with word tokens, ngrams, session dictionaries and Huffman trees for unknown words.","og_url":"https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/","og_site_name":"SKYNEXT Tech.","article_published_time":"2023-10-15T20:05:19+00:00","article_modified_time":"2023-11-05T17:43:40+00:00","author":"R.Verissimo","twitter_card":"summary_large_image","twitter_misc":{"Written by":"R.Verissimo","Est. 
reading time":"42 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/#article","isPartOf":{"@id":"https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/"},"author":{"name":"R.Verissimo","@id":"https:\/\/www.skynext.tech\/#\/schema\/person\/6b71040d3e4353a85583550901159cd8"},"headline":"Lightweight ASCII English Text Stream Compression in Python.","datePublished":"2023-10-15T20:05:19+00:00","dateModified":"2023-11-05T17:43:40+00:00","mainEntityOfPage":{"@id":"https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/"},"wordCount":1287,"commentCount":0,"publisher":{"@id":"https:\/\/www.skynext.tech\/#organization"},"articleSection":["Information Technology"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/","url":"https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/","name":"Lightweight ASCII English Text Stream Compression in Python. 
- SKYNEXT Tech.","isPartOf":{"@id":"https:\/\/www.skynext.tech\/#website"},"datePublished":"2023-10-15T20:05:19+00:00","dateModified":"2023-11-05T17:43:40+00:00","description":"Lightweight English text stream compression, with word tokens, ngrams, session dictionaries and Huffman trees for unknown words.","breadcrumb":{"@id":"https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.skynext.tech\/index.php\/2023\/10\/15\/lightweight-english-text-stream-compression-in-python\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.skynext.tech\/"},{"@type":"ListItem","position":2,"name":"Lightweight ASCII English Text Stream Compression in Python."}]},{"@type":"WebSite","@id":"https:\/\/www.skynext.tech\/#website","url":"https:\/\/www.skynext.tech\/","name":"SKYNEXT Tech.","description":"Power Electronics &amp; Reverse Engineering","publisher":{"@id":"https:\/\/www.skynext.tech\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.skynext.tech\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.skynext.tech\/#organization","name":"SKYNEXT Tech.","alternateName":"DELIVERYSIMO 
EIRL","url":"https:\/\/www.skynext.tech\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.skynext.tech\/#\/schema\/logo\/image\/","url":"https:\/\/www.skynext.tech\/wp-content\/uploads\/2019\/09\/cropped-SKYNEXT_logo_square-2.png","contentUrl":"https:\/\/www.skynext.tech\/wp-content\/uploads\/2019\/09\/cropped-SKYNEXT_logo_square-2.png","width":210,"height":210,"caption":"SKYNEXT Tech."},"image":{"@id":"https:\/\/www.skynext.tech\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/www.skynext.tech\/#\/schema\/person\/6b71040d3e4353a85583550901159cd8","name":"R.Verissimo","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.skynext.tech\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/0bf345444b71baae1301e50a1a8cbeb98a5b7f41b85ffe9e1e6c2640ef23b528?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/0bf345444b71baae1301e50a1a8cbeb98a5b7f41b85ffe9e1e6c2640ef23b528?s=96&d=mm&r=g","caption":"R.Verissimo"},"url":"https:\/\/www.skynext.tech\/index.php\/author\/wpadmin\/"}]}},"_links":{"self":[{"href":"https:\/\/www.skynext.tech\/index.php\/wp-json\/wp\/v2\/posts\/1074","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.skynext.tech\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.skynext.tech\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.skynext.tech\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.skynext.tech\/index.php\/wp-json\/wp\/v2\/comments?post=1074"}],"version-history":[{"count":33,"href":"https:\/\/www.skynext.tech\/index.php\/wp-json\/wp\/v2\/posts\/1074\/revisions"}],"predecessor-version":[{"id":1131,"href":"https:\/\/www.skynext.tech\/index.php\/wp-json\/wp\/v2\/posts\/1074\/revisions\/1131"}],"wp:attachment":[{"href":"https:\/\/www.skynext.tech\/index.php\/wp-json\/wp\/v2\/media?parent=1074"}],"wp:term":[{"taxonomy":"category","
embeddable":true,"href":"https:\/\/www.skynext.tech\/index.php\/wp-json\/wp\/v2\/categories?post=1074"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.skynext.tech\/index.php\/wp-json\/wp\/v2\/tags?post=1074"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}