Lempel-Ziv Compressed Structures for Document Retrieval

Héctor Ferrada and Gonzalo Navarro

Document retrieval structures index a collection of string documents, to quickly retrieve those that satisfy a given condition for a pattern string p. For document listing, we must list all the documents where p appears. For top-k retrieval, we must list the k most relevant documents for p, for some relevance criterion (such as the frequency of p in the documents). There exist optimal-time and linear-space solutions for both problems, but they use too much space in practice. Most current research uses compressed suffix arrays, but fast indices still use 17-21 bpc (bits per character), whereas small indices take several milliseconds to return each answer. This work presents the first document retrieval structures based on Lempel-Ziv compression. We build on the LZ78 parsing of the collection, which yields structures using 7-10 bpc that dominate an important area in the current space/time tradeoff map. In addition, the structures allow for much more efficient partial or approximate answers, which are acceptable in many applications: the structure for document listing outputs the first 75%-80% of the answers at a rate of one answer per microsecond, and the top-k retrieval structure returns a result of 90% quality at the same rate and using just 4-6 bpc. Current indices using such a little space are orders of magnitude slower, whereas indices achieving those speeds are 2-4 times larger.