Compressing Dynamic Text Collections via Phrase-Based Coding
Nieves Brisaboa, Antonio Fariña, Gonzalo Navarro and José Paramá
We present a new statistical compression method, which we call
Phrase Based Dense Code (PBDC), aimed at
compressing large digital libraries. PBDC compresses the text
collection to 30-32% of its original size, allows the text to be
kept compressed at all times, and offers efficient on-line
information retrieval services. The novelty of PBDC is that it
supports continuous growth of the compressed text collection by
automatically adapting the vocabulary both to new words and to
changes in the word frequency distribution, without degrading the
compression ratio. Text compressed with PBDC can be searched directly
without decompression, using fast Boyer-Moore-type algorithms.
It is also possible to decompress arbitrary portions of
the collection. Alternative compression methods oriented to
information retrieval focus on static collections and are thus less
well suited to growing digital libraries.