New Space/Time Tradeoffs for Top-k Document Retrieval on Sequences
Gonzalo Navarro and Sharma V. Thankachan
We address the problem of indexing a collection DD = {T_1,T_2,...T_D}
of D string documents of total length n, so that we can efficiently answer
{\em top-k queries}: retrieve k documents most relevant to a pattern
P
of length p given at query time. There exist linear-space data structures,
that is, using O(n) words, that answer such queries in optimal
O(p+k) time
for an ample set of notions of relevance. However, using linear space is not
sufficiently good for large text collections. In this paper we explore how far
the space/time tradeoff for this problem can be pushed. We obtain three
results: (1) When relevance is measured as term frequency (number of times
P
appears in a document T_i), an index occupying |CSA|+o(n) bits answers
the query in time O(tsearch(p)+k lg^2 k lg^e n), where
CSA is a
compressed suffix array indexing DD, tsearch is its time to find the
suffix array interval of P, and e>0 is any constant. (2) With the same
measure of relevance, an index occupying |CSA|+n lg D+ o(n lg
s+n lg D) bits answers the query in time O(tsearch(p)+k lg*
k),
where lg* k is the iterated logarithm of k. (3) When the relevance
depends only on the documents, an index occupying |CSA|+O(n lg lg n) bits
answers the query in O(tsearch(p) + k tSA) time, where tSA is the time
the CSA needs to retrieve a suffix array cell. On our way, we obtain some
other results of independent interest.