Faster Top-k Document Retrieval in Optimal Space
Gonzalo Navarro and Sharma Thankachan
We consider the problem of retrieving the k documents from a collection of
strings where a given pattern P appears most often. We show that, by
representing the collection using a Compressed Suffix Array
CSA, a data structure using the asymptotically optimal
|CSA|+o(n) bits can answer queries in the time needed by
CSA to find
the suffix array interval of
the pattern plus O(k lg^2 k lg^e n) accesses to suffix array cells,
for any constant e>0. This is lg n / lg k times faster than the
only previous solution using optimal space, lg k times slower than the
fastest structure
that uses twice the space, and lg^2 k lg^e n times the lower-bound
cost of obtaining k document identifiers from the CSA. To obtain the
result we introduce a tool called the sampled document array, which
can be of independent interest.