Lempel-Ziv Compressed Structures for Document Retrieval
Héctor Ferrada and Gonzalo Navarro
Document retrieval structures index a collection of string documents,
to quickly retrieve those
that satisfy a given condition for a pattern string p. For document listing,
we must list all the documents where p appears. For top-k retrieval, we
must
list the k most relevant documents for p, for some relevance criterion
(such as the frequency of p in the documents).
There exist optimal-time and linear-space solutions for both problems, but
they
use too much space in practice. Most current research uses compressed suffix
arrays, but fast indices still use 17-21 bpc (bits per character), whereas
small indices take several milliseconds to return each answer.
This work presents the first document retrieval structures based on Lempel-Ziv
compression. We build on the LZ78 parsing of the collection, which yields
structures using 7-10 bpc that dominate an important area in the current
space/time tradeoff map. In addition, the structures allow for much more
efficient partial or approximate answers, which are acceptable in many
applications: the structure for document listing outputs the first 75%-80%
of the answers at a rate of one answer per microsecond, and the top-k
retrieval structure returns a result of 90% quality at the same rate and
using
just 4-6 bpc. Current indices using such a little space are orders of
magnitude slower, whereas indices achieving those speeds are 2-4 times
larger.