Improved Compressed Indexes for Full-Text Document Retrieval
Djamal Belazzougui, Gonzalo Navarro, and Daniel Valenzuela
We give new space/time tradeoffs for compressed indexes that answer document
retrieval queries on general sequences. On a collection of D documents of
total length n, current approaches require at least |CSA| +
O(n lg D / lg lg D) or 2|CSA|+o(n) bits of space, where CSA is
a full-text index. Using monotone minimal perfect hash functions (mmphfs), we
give new
algorithms for document listing with frequencies and top-k document
retrieval
using just |CSA|+O(n lg lg lg D) bits. We also improve current solutions
that use 2|CSA|+o(n) bits, and consider other problems such as colored
range
listing, top-k most important documents, and computing arbitrary
frequencies.
We give proof-of-concept experimental results that show that using mmphfs
may provide relevant practical tradeoffs for document listing with
frequencies.