Time-Optimal Top-k Document Retrieval
Gonzalo Navarro and Yakov Nekrich
Let DD be a collection of D documents, which are
strings over an alphabet of size s, of total length n.
We describe a data structure that uses linear space and
and reports k most relevant documents that contain a query
pattern P, which is a string of length p packed in
p / log_s n
words, in time O(p / log_s n + k).
This is optimal in the RAM model in the general case where log D =
Theta(log n), and involves a novel RAM-optimal suffix tree search.
Our construction supports an ample set of important relevance
measures, such as the number of times P appears in a document (called term
frequency), a fixed document importance, and the minimal distance
between two occurrences of P in a document.
\n\n
When log D = o(log n), we show how to reduce the space
of the data structure from O(n log n) to O(n( log s + log D + log log
n))
bits, and to O(n(log s + log D)) bits in the case of the popular
term frequency measure of relevance, at the price of an additive term
O(log^e_s n) in the query time, for any constant
e>0.
\n\n
We also consider the dynamic scenario, where documents
can be inserted and deleted from the collection. We obtain linear space and
query time O(p(log log n)^2 / log_s n + log n + k log log k),
whereas insertions and deletions require O(log^(1+e) n) time per symbol,
for any constant e>0.
\n\n
Finally, we consider an extended static scenario where an extra parameter
par(P,d) is defined, and the query must retrieve only documents
d
such that par(P,d)\in [t_1,t_2], where this range is specified at
query time. We solve these queries using linear space and
O(p / log_s n + \log^(1+e) n + k log^e n) time, for any constant
e>0.
\n\n
Our technique is to translate these top-k problems into multidimensional
geometric search problems. As a bonus, we describe some
improvements to those problems.