Document Listing on Repetitive Collections
Travis Gagie, Kalle Karhu, Gonzalo Navarro, Simon Puglisi, and
Jouni Sirén
Many document collections consist largely of repeated material, and several
indexes have been designed to take advantage of this. There has been only
preliminary work, however, on document retrieval for repetitive collections.
In this paper we show how one of those indexes, the run-length compressed
suffix array (RLCSA), can be extended to support document listing. In our
experiments, our additional structures on top of the RLCSA can reduce the
query time for document listing by an order of magnitude while still using
total space that is only a fraction of the uncompressed size of the
collection. As a byproduct, we develop a new document listing technique
for general collections that is of independent interest.