Boosting Text Compression with Word-based Statistical Encoding
Antonio Fariña, Gonzalo Navarro, and José Paramá
Semistatic word-based byte-oriented compressors are known to be attractive
alternatives to compress natural language texts. With compression ratios
around 30-35%, they allow fast direct searching of compressed text. In
this article we reveal that these compressors have even more benefits. We
show that most of the state-of-the-art compressors benefit from compressing
not the original text, but the compressed representation obtained by a
word-based byte-oriented statistical compressor. For example, p7zip with
a dense-coding preprocessing achieves even better compression ratios and
much faster compression than p7zip alone. We reach compression ratios
below 17% in typical large English texts, which was obtained only by
the slow PPM compressors. Furthermore, searches perform much faster if
the final compressor operates over word-based compressed text. We show that
typical self-indexes also profit from our preprocessing step. They achieve
much better space and time performance when indexing is preceded by a
compression step. Apart from using the well-known Tagged Huffman code,
we present a new suffix-free Dense-Code-based compressor that compresses
slightly better. We also show how some self-indexes can handle non-suffix-free
codes. As a result, the compressed/indexed text requires around 35% of the
space of the original text and allows indexed searches for both words and
phrases.