Fast and Small Subsampled R-indexes

Dustin Cobas, Travis Gagie, and Gonzalo Navarro

The r-index (Gagie et al., JACM 2020) represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude in query time. Its space usage, O(r) where r is the number of runs in the Burrows-Wheeler Transform of the text, is however higher than Lempel-Ziv and grammar-based indexes, and makes it uninteresting in various real-life scenarios of milder repetitiveness. In this paper we introduce the sr-index, a variant that limits a large fraction of the space to O(min(r,n/s)) for a text of length n and a given parameter s, at the expense of multiplying bys the time per occurrence reported. The sr-index is obtained by carefully subsampling the text positions indexed by the r-index, in a way that we prove is still able to support pattern matching with guaranteed performance. Our experiments demonstrate that the theoretical analysis falls short in describing the practical advantages of the sr-index, because it performs much better on real texts than on synthetic ones: the sr-index retains the performance of the r-index while using 1.5-4.0 times less space, sharply outperforming virtually every other compressed index on repetitive texts in both time and space. Only a particular Lempel-Ziv-based index uses less space--about half--than the sr-index, but it is an order of magnitude slower.

Our second contribution are the r-csa and sr-csa indexes. Just like the r-index adapts the well-known FM-Index to repetitive texts, the r-csa adapts Sadakane's Compressed Suffix Array (CSA) to this case. We show that the principles used on the r-index turn out to fit naturally and efficiently in the CSA framework. The sr-csa is the corresponding subsampled version of the r-csa. While the CSA performs better than the FM-Index on classic texts with alphabets larger than DNA, our experiments show that the sr-csa outperforms the sr-index on repetitive texts not only over those larger alphabets, but on some DNA texts as well.

Overall, our new subsampled indexes sweep the table of the existing indexes for highly repetitive text collection, by combining the exceptional speed of the r-index with drastically reduced storage use.