Lightweight Natural Language Text Compression
Nieves Brisaboa, Antonio Fariña, Gonzalo Navarro, and José
Paramá
Variants of Huffman codes where words are taken as the source
symbols are currently the most attractive choices to compress
natural language text databases. In particular, Tagged Huffman
Code by Moura et al. offers fast direct searching on the
compressed text and random access capabilities, in exchange for
producing around 11% larger compressed files.
This work describes End-Tagged Dense Code and
(s,c)-Dense Code, two new semistatic statistical methods for
compressing natural language texts. These techniques permit
simpler and faster encoding and obtain better compression ratios
than Tagged Huffman Code, while maintaining its fast direct
search and random access capabilities.
We show that Dense Codes improve Tagged Huffman Code
compression ratio by about 10%, reaching only 0.6% overhead over
the optimal Huffman compression ratio. Being simpler, Dense Codes
are generated 45% to 60% faster than Huffman codes. This makes
Dense Codes a very attractive alternative to Huffman code variants
for various reasons: they are simpler to program, faster to build,
of almost optimal size, and as fast and easy to search as the best
Huffman variants, which are not so close to the optimal size.