Fast, Small, and Simple Document Listing on Repetitive Text Collections
Dustin Cobas and Gonzalo Navarro
Document listing on string collections is the task of finding all documents
where a pattern appears. It is regarded as the most fundamental document
retrieval problem, and is useful in various applications. Many of the
fastest-growing
string collections are composed of very similar documents, such as versioned
code and document collections, genome repositories, etc. Plain pattern-matching
indexes designed for repetitive text collections achieve orders-of-magnitude
reductions in space. Instead, there are not many analogous indexes for document
retrieval. In this paper we present a simple document listing index for
repetitive string collections of total length n that lists the
ndoc
distinct documents where a pattern of length m appears in time
O(m + ndoc log n). We exploit the repetitiveness of the document
array (i.e., the suffix array coarsened to document identifiers) to
grammar-compress it while precomputing the answers to nonterminals, and store
them in grammar-compressed form as well. Our experimental results show that our
index sharply outperforms existing alternatives in the space/time tradeoff map.