New Adaptive Compressors for Natural Language Text
Nieves Brisaboa, Antonio Fariña, Gonzalo Navarro, and José Paramá
Semistatic byte-oriented word-based compression codes have been
shown to be an attractive alternative for compressing natural language
text databases, because of the combination of speed, effectiveness,
and direct searchability they offer.
In particular, our recently proposed family of
dense compression codes has been shown to be superior to the more
traditional byte-oriented word-based Huffman codes in most
aspects. In this paper, we focus on the problem of transmitting
texts among peers that do not share the vocabulary. This is the
typical scenario for adaptive compression methods. We design
adaptive variants of our semistatic dense codes, showing that they
are much simpler and faster than dynamic Huffman codes and reach
almost the same compression effectiveness. We show that our
variants offer a compelling trade-off among
compression/decompression speed, compression ratio, and search
speed compared with most state-of-the-art general-purpose
compressors.