Top-k Term-Proximity in Succinct Space
Ian Munro, Gonzalo Navarro, Jesper Sindahl Nielsen, Rahul Shah, and Sharma
Thankachan
Let DD = {T_1,T_2,...T_D} be a collection of D string
documents of n characters in total, that are drawn from an alphabet
set S=[s]. The top-k document retrieval problem
is to preprocess DD into a data structure that, given a query
(P[1..p],k), can return the k documents of DD most relevant to pattern
P. The relevance is
captured using a predefined ranking function, which
depends on the set of occurrences of P in T_d. For example,
it can be the term frequency (i.e., the
number of occurrences of P in T_d), or it can be the term
proximity (i.e., the distance between the closest pair of
occurrences of P in T_d), or a pattern-independent
importance score of T_d such as PageRank. Linear space and
optimal query time solutions already exist for this problem.
Compressed and compact space solutions are also known, but only for
a few ranking functions such as term frequency and importance. However,
space efficient data structures for term proximity based
retrieval have been evasive. In this paper we present the first sub-linear
space data structure for this relevance function, which uses only o(n) bits
on top of any compressed suffix array of DD and solves queries in time
O((p+k) polylog n).