Fast Searching on Compressed Text Allowing Errors
Edleno de Moura, Gonzalo Navarro, Nivio Ziviani and Ricardo Baeza-Yates
We present a fast compression and decompression scheme for natural language
texts that allows efficient and flexible
string matching by searching the compressed text directly.
The compression scheme uses a word-based Huffman encoding and the
coding alphabet is byte-oriented rather than bit-oriented.
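The byte-oriented scheme can be viewed as a 256-ary Huffman code over the word vocabulary, so that every codeword is a whole number of bytes. The following Python sketch is illustrative only (the function name `dary_huffman_codes` and the toy text are ours, not from the paper):

```python
import heapq
from collections import Counter
from itertools import count

def dary_huffman_codes(freqs, d=256):
    """Return {symbol: tuple_of_digits} d-ary Huffman codes.
    With d = 256 each digit fits in one byte, so every codeword
    is a whole sequence of bytes (byte-oriented coding)."""
    tie = count()  # tiebreaker so heapq never compares payloads
    heap = [(f, next(tie), sym) for sym, f in freqs.items()]
    # Pad with zero-frequency dummies so every merge takes exactly d nodes.
    while len(heap) > 1 and (len(heap) - 1) % (d - 1) != 0:
        heap.append((0, next(tie), None))
    heapq.heapify(heap)
    while len(heap) > 1:
        children = [heapq.heappop(heap) for _ in range(min(d, len(heap)))]
        total = sum(f for f, _, _ in children)
        heapq.heappush(heap, (total, next(tie), children))
    codes = {}
    def walk(node, prefix):
        _, _, payload = node
        if isinstance(payload, list):
            for digit, child in enumerate(payload):
                walk(child, prefix + (digit,))
        elif payload is not None:  # skip the dummy padding symbols
            codes[payload] = prefix or (0,)
    walk(heap[0], ())
    return codes

words = Counter("to be or not to be that is the question".split())
codes = dary_huffman_codes(words, d=256)  # 8 words -> one byte each
```

With only eight distinct words, every codeword is a single byte; on a real text, frequent words get shorter byte sequences than rare ones, which is what yields the reported compression ratios.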
We compress typical English texts to about 30% of their original size,
against 40% and 35% for Compress and Gzip, respectively.
Compression times are close to the times of Compress and
approximately half the times of Gzip, and decompression times are
lower than those of Gzip and one third of those of Compress.
The searching algorithm allows a large number of variations of the
exact and approximate compressed string matching problem, such as
phrases, ranges, complements, wild cards and arbitrary regular expressions.
Separators and stopwords can be discarded at search time
without significantly increasing the cost.
The algorithm is based on a word-oriented shift-or algorithm
and a fast Boyer-Moore-type filter.
The algorithm also exploits the vocabulary of the text, which is
available as part of the Huffman coding data.
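The bit-parallel shift-or technique underlying the search can be illustrated by its classical character-level form; this is a minimal sketch in our own naming, not the paper's word-oriented variant:

```python
def shift_or(text, pattern):
    """Classical bit-parallel Shift-Or exact matching.
    Returns the start positions of all occurrences of pattern in text."""
    m = len(pattern)
    assert 0 < m <= 64  # state must fit in one machine word
    all_ones = (1 << m) - 1
    # B[c]: bit i is 0 iff pattern[i] == c (0 bits mark allowed positions)
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, all_ones) & ~(1 << i)
    state = all_ones
    hits = []
    for pos, c in enumerate(text):
        # Shifting left feeds a 0 into bit 0 (the empty prefix always matches).
        state = ((state << 1) | B.get(c, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:  # bit m-1 clear: full match
            hits.append(pos - m + 1)
    return hits
```

The word-oriented version applies the same state update per codeword rather than per character, and extending the table `B` to classes of characters (or classes of words) is what makes ranges, complements, and wild cards essentially free.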
When searching for simple patterns, our experiments show that running
our algorithm on a compressed text is twice as fast as running Agrep
on the uncompressed version of the same text.
When searching for complex or approximate patterns,
our algorithm is up to 8 times faster than Agrep.
We also discuss the impact of our technique on inverted files that
point to documents or logical blocks, as in Glimpse.