Improved Compressed Indexes for Full-Text Document Retrieval
Djamal Belazzougui and Gonzalo Navarro
We give new space/time tradeoffs for compressed indexes that answer document
retrieval queries on general sequences. On a collection of D documents of
total length n, current approaches require at least |CSA|+
O(n lg D / lg lg D) or 2|CSA|+o(n) bits of space, where
CSA is
a full-text index. Using monotone minimum perfect hash functions, we give new
algorithms for document listing with frequencies and top-k document
retrieval
using just |CSA|+O(n lg lg lg D) bits. We also improve current solutions
that use 2|CSA|+o(n) bits, and consider other problems such as colored
range
listing, top-k most important documents, and computing arbitrary
frequencies.