Practical Compact Indexes for Top-k Document Retrieval

Simon Gog, Roberto Konow, and Gonzalo Navarro

We present a fast and compact index for top-k document retrieval on general string collections. That is, given a string pattern, the index returns the k documents where it appears most often. We adapt a linear-space and optimal-time theoretical solution, whose implementation poses various algorithm engineering challenges. While a naive implementation of the optimal solution is estimated to require around 80n bytes for a text collection of n symbols, our implementation requires 2.5n-3.0n bytes, text included, and answers queries within microseconds. This outperforms all the previous practical indexes by orders of magnitude; the only index using less space is hundreds of times slower. Our index can be built on collections of hundreds of gigabytes and on tokenized text collections.