Improved Antidictionary Based Compression

Maxime Crochemore and Gonzalo Navarro

The compression of binary texts using antidictionaries is a novel technique based on the fact that some substrings (called "antifactors") never appear in the text. Let sb be an antifactor, where b is its last bit. Every time s appears in the text we know that the next bit is not b and hence omit its representation. Since building the set of all antifactors is space consuming at compression time, it is customary to limit the maximum length of antifactors considered up to a constant k. Larger k yields better compression of the text but requires more space at compression time. In this paper we introduce the notion of almost antifactors, which are strings that rarely appear in the text. More formally, almost antifactors are strings that, if we consider them as antifactors and separately code their occurrences as exceptions, the compression ratio improves. We show that almost antifactors permit improving compression with a limited amount of main memory to compress. Our experiments show that they obtain the same compression of the classical algorithm using only 30%-55% of its memory space.