Improved Single-Term Top-k Document Retrieval
Simon Gog and Gonzalo Navarro
On natural language text collections, finding the k documents most relevant
to a query is generally solved with inverted indexes. On general string
collections, however, more sophisticated data structures are necessary.
Navarro
and Nekrich [SODA 2012] showed that a linear-space index can solve such
top-k queries in optimal time O(m+k), where m is the query length.
Konow and Navarro [DCC 2013] implemented the scheme, managing to solve
top-k queries within microseconds with an index using 3.3-4.0 bytes per
character (this includes the storage of the collection itself). In this paper
we introduce a new implementation using significantly less space, 2.5-3.0
bytes
per character (again, including the collection), and retaining similar query
times.
For short queries, which are the most difficult, our new index
actually outperforms the previous one, as well as all the other solutions in
the
literature. We also show that our index can be built on very large text
collections, and that it can
handle phrase queries efficiently on natural language text collections. In the
latter case, it uses about the same space of the tokenized text (and replaces
it), while answering phrase queries an order of magnitude faster than a
positional inverted index.