Practical Compact Indexes for Top-k Document Retrieval
Simon Gog, Roberto Konow, and Gonzalo Navarro
We present a fast and compact index for top-k document retrieval on general
string collections. That is,
given a string pattern, the index returns the k documents where it appears
most often. We adapt a linear-space
and optimal-time theoretical solution, whose implementation poses various
algorithm engineering
challenges. While a naive implementation of the optimal solution is estimated
to require around 80n bytes
for a text collection of n symbols, our implementation requires
2.5n-3.0n
bytes, text included, and answers
queries within microseconds. This outperforms all the previous practical
indexes by orders of magnitude; the
only index using less space is hundreds of times slower. Our index can be
built on collections of hundreds of
gigabytes and on tokenized text collections.