Top-<i>k</i> Document Retrieval in Compressed Space

Top-k Document Retrieval in Compressed Space

Gonzalo Navarro and Yakov Nekrich

Let D be a collection of d strings of total length n over an alphabet of size s. We consider the so-called top-k document retrieval problem: given a short string P and an integer k, list the identifiers of k strings in D most relevant to P, in decreasing order of relevance. Relevance may be a fixed value associated with the strings where P occurs, or the number of times P occurs in the strings. While RAM-optimal solutions using O(n log n) bits and O(|P| / log_s n + k) time exist, solving the problem optimally within space close to O(n log s) bits is open.

We describe a data structure for the top-k document retrieval problem that uses O(log log n) bits per symbol on top of any compressed suffix array (CSA) of D, and supports queries in essentially optimal time, in the following sense. Given a CSA using |CSA| bits of space, that finds the suffix array range of a query string P in time t_count, and accesses a suffix array entry in time t_SA, listing any k pattern occurrences would take time O(t_count +k t_SA). Our top-k data structure uses |CSA|+O(n log log n) bits and reports k most relevant documents that contain P in time O(t_count + k (t_SA + log log n)). On every known CSA using O(n log s) bits, t_SA is Omega(log log n) in virtually all cases, thus our time is O(t_count + k t_SA) in most situations.

When the query string P is sufficiently long, some CSAs reach time O(t_count + k) to list any k occurrences of P. Our structure also does better, obtaining time O(t_count + t_sort(k,n)) on every known CSA, where t_sort(k,n) is the time to sort k integers in [1,n]. This time is already O(t_count + k) in the typical regimes, k = O(polylog n) and k = Omega(n^eps) for any constant eps > 0. If we can deliver the results in unsorted order of relevance, then the time for long patterns is always O(t_count + k), which is optimal with respect to the CSA, and reaches the RAM-optimal time O(|P| / log_s n + k) on a particular CSA. No top-k solution using o(n log d) bits of space has achieved this before.