The compression scheme uses a semi-static word-based model and a Huffman code in which the coding alphabet is byte-oriented rather than bit-oriented. We compress typical English texts to about 30% of their original size, against 40% and 35% for Compress and Gzip, respectively. Compression time is close to that of Compress and approximately half that of Gzip, and decompression time is lower than that of Gzip and one third that of Compress.
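The core of the scheme can be illustrated with a degree-ary Huffman construction: when the degree is 256, every code symbol is a whole byte, so each word in the vocabulary is encoded as a byte string rather than a bit string. The sketch below is a minimal illustration of that idea (the function name `dary_huffman_codes` and the toy vocabulary are ours, not the paper's), not a reproduction of the paper's implementation.

```python
import heapq
import itertools
from collections import Counter

def dary_huffman_codes(freqs, degree=256):
    """Build a degree-ary Huffman code for the word frequencies in `freqs`.
    With degree=256 each code symbol is a whole byte, so every codeword
    is a sequence of bytes rather than bits."""
    if len(freqs) == 1:
        return {w: (0,) for w in freqs}
    # Pad with zero-frequency dummy leaves so every merge is full:
    # an all-full degree-ary tree needs (n - 1) mod (degree - 1) == 0 leaves.
    pad = (degree - 1 - (len(freqs) - 1) % (degree - 1)) % (degree - 1)
    tiebreak = itertools.count()
    heap = [(f, next(tiebreak), [w]) for w, f in freqs.items()]
    heap += [(0, next(tiebreak), [None]) for _ in range(pad)]
    heapq.heapify(heap)
    codes = {w: () for w in freqs}
    while len(heap) > 1:
        merged_words, total = [], 0
        for digit in range(degree):
            f, _, words = heapq.heappop(heap)
            total += f
            for w in words:
                if w is not None:
                    codes[w] = (digit,) + codes[w]  # prepend branch symbol
            merged_words.extend(words)
        heapq.heappush(heap, (total, next(tiebreak), merged_words))
    return codes

# Toy vocabulary: with degree=256 each of these words fits in a single byte.
freqs = Counter("to be or not to be".split())
codes = dary_huffman_codes(freqs)
```

For realistic vocabularies (tens of thousands of words), most words still receive one- or two-byte codewords, which is what makes the byte-oriented alphabet attractive: decoding and searching operate on whole bytes instead of bit manipulations.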
We present three algorithms to search the compressed text. They allow a large number of variations over the basic word and phrase search capability, such as sets of characters, arbitrary regular expressions, and approximate matching (i.e., allowing errors in the occurrences). Separators and stopwords (articles, prepositions, etc.) can be discarded at search time without significantly increasing the cost. When searching for simple words, the experiments show that running our algorithms on a compressed text is twice as fast as running the best existing software (Agrep) on the uncompressed version of the same text. When searching for complex or approximate patterns, our algorithms are up to 8 times faster than searching the uncompressed text.
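The central idea behind searching without decompressing can be sketched as follows: compress the query word into its byte codeword once, then compare codewords while walking the compressed stream codeword-aligned. The code table below is a small hypothetical prefix-free byte code chosen for illustration, and the helper names (`compress`, `count_word`) are ours; the paper's actual algorithms are considerably faster and support the complex patterns mentioned above, but this captures the core mechanism.

```python
# Hypothetical prefix-free byte-oriented code table for a toy vocabulary;
# a real table would come from the word-based Huffman model.
CODES = {"to": (0,), "be": (1,), "or": (2, 0), "not": (2, 1)}

def compress(words):
    """Concatenate the byte codewords of the given words."""
    out = []
    for w in words:
        out.extend(CODES[w])
    return out

def count_word(compressed, pattern):
    """Count occurrences of `pattern` directly in the compressed bytes:
    the pattern is compressed once and compared codeword by codeword,
    so the text is never decompressed."""
    target = CODES[pattern]
    codewords = set(CODES.values())
    count, cur = 0, ()
    for byte in compressed:
        cur += (byte,)
        if cur in codewords:   # prefix-free: the first hit ends a codeword
            count += cur == target
            cur = ()
    return count

stream = compress("to be or not to be".split())
```

Because comparisons happen on short byte codewords instead of characters, the search examines fewer symbols than a scan over the uncompressed text, which is the source of the speedups reported above.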
We also discuss the impact of our technique on inverted files pointing to logical blocks (as in Glimpse) and argue for the possibility of keeping the text compressed at all times, decompressing only for display purposes.