Practical Compressed Document Retrieval

Gonzalo Navarro, Simon Puglisi, and Daniel Valenzuela

Recent research on document retrieval for general texts has established the virtues of explicitly representing the so-called document array, which stores the document each pointer of the suffix array belongs to. While it makes document retrieval faster, this array occupies a significative amount of redundant space and is not easily compressible. In this paper we present the first practical proposal to compress the document array. We show that the resulting structure is significatively smaller than the uncompressed counterpart, and than alternatives to the document array proposed in the literature. We also compare various known algorithms for document listing and top-k retrieval, and find that the most useful combinations of algorithms run over our new compressed document arrays.