Direct Pattern Matching on Compressed Text
Edleno de Moura, Gonzalo Navarro, Nivio Ziviani and Ricardo Baeza-Yates
We present a fast compression and decompression technique for natural language
texts.
The novelty is that the exact search can be done on the compressed text
directly, using any known sequential pattern matching algorithm.
Approximate search can also be done efficiently without any decoding.
The compression scheme uses a semi-static word-based modeling and a
Huffman coding where the coding alphabet is byte-oriented rather than
bit-oriented.
We use the first bit of each byte to mark the beginning of a word, which
allows the searching of the compressed pattern directly on the compressed
text.
We achieve about 33% compression ratio for typical English texts.
When searching for simple patterns, our experiments show that running
our algorithm on a compressed text is almost twice as fast as running
agrep on the uncompressed version of the same text.
When searching complex or approximate patterns,
our algorithm is up to 8 times faster than agrep.