###
Top-*k* Term-Proximity in Succinct Space

####
Ian Munro, Gonzalo Navarro, Jesper Sindahl Nielsen, Rahul Shah, and Sharma
Thankachan

Let *DD = {T_1,T_2,...T_D}* be a collection of *D* string
documents of *n* characters in total, that are drawn from an alphabet
set *S=[s]*. The *top-k document retrieval problem*
is to preprocess *DD* into a data structure that, given a query
*(P[1..p],k)*, can return the *k* documents of *DD* most relevant to pattern
*P*. The relevance is
captured using a predefined ranking function, which
depends on the set of occurrences of *P* in *T_d*. For example,
it can be the term frequency (i.e., the
number of occurrences of *P* in *T_d*), or it can be the term
proximity (i.e., the distance between the closest pair of
occurrences of *P* in *T_d*), or a pattern-independent
importance score of *T_d* such as PageRank. Linear space and
optimal query time solutions already exist for this problem.
Compressed and compact space solutions are also known, but only for
a few ranking functions such as term frequency and importance. However,
space efficient data structures for term proximity based
retrieval have been evasive. In this paper we present the first sub-linear
space data structure for this relevance function, which uses only *o(n)* bits
on top of any compressed suffix array of *DD* and solves queries in time
*O((p+k) polylog n)*.