Time-Optimal Top-k Document Retrieval

Gonzalo Navarro and Yakov Nekrich

Let DD be a collection of D documents, which are strings over an alphabet of size s, of total length n. We describe a data structure that uses linear space and and reports k most relevant documents that contain a query pattern P, which is a string of length p packed in p / log_s n words, in time O(p / log_s n + k). This is optimal in the RAM model in the general case where log D = Theta(log n), and involves a novel RAM-optimal suffix tree search. Our construction supports an ample set of important relevance measures, such as the number of times P appears in a document (called term frequency), a fixed document importance, and the minimal distance between two occurrences of P in a document. \n\n When log D = o(log n), we show how to reduce the space of the data structure from O(n log n) to O(n( log s + log D + log log n)) bits, and to O(n(log s + log D)) bits in the case of the popular term frequency measure of relevance, at the price of an additive term O(log^e_s n) in the query time, for any constant e>0. \n\n We also consider the dynamic scenario, where documents can be inserted and deleted from the collection. We obtain linear space and query time O(p(log log n)^2 / log_s n + log n + k log log k), whereas insertions and deletions require O(log^(1+e) n) time per symbol, for any constant e>0. \n\n Finally, we consider an extended static scenario where an extra parameter par(P,d) is defined, and the query must retrieve only documents d such that par(P,d)\in [t_1,t_2], where this range is specified at query time. We solve these queries using linear space and O(p / log_s n + \log^(1+e) n + k log^e n) time, for any constant e>0. \n\n Our technique is to translate these top-k problems into multidimensional geometric search problems. As a bonus, we describe some improvements to those problems.