Document Counting in Compressed Space
Travis Gagie, Aleksi Hartikainen, Juha Kärkkäinen, Gonzalo
Navarro, Simon Puglisi, and Jouni Sirén
We address the problem of counting the number of strings in a collection
where a given pattern appears, which has applications in information retrieval
and data mining. Existing solutions are in a theoretical stage. In this paper
we implement these solutions and explore compressed variants, aiming to
reduce data structure size. Our main result is to uncover some unexpected
compressibility properties of the fastest known data structure for the
problem.
By taking advantage of these properties, we can reduce the size of the
structure
by a factor of 5-400, depending on the dataset.