Fast and Small Subsampled R-indexes
Dustin Cobas, Travis Gagie, and Gonzalo Navarro
The r-index (Gagie et al., JACM 2020) represented a breakthrough in compressed indexing of repetitive
text collections, outperforming its alternatives by orders of magnitude in
query time. Its space usage, O(r)
where r is the number of runs in the Burrows-Wheeler Transform of the text, is however higher than
Lempel-Ziv and grammar-based indexes, and makes it uninteresting in various real-life scenarios of milder
repetitiveness. In this paper we introduce the sr-index, a variant that limits a large fraction of the space to
O(min(r,n/s)) for a text of length n and a given parameter
s, at the expense of multiplying bys the time
per occurrence reported. The sr-index is obtained by carefully subsampling the text positions indexed by
the r-index, in a way that we prove is still able to support pattern matching with guaranteed performance.
Our experiments demonstrate that the theoretical analysis falls short in describing the practical advantages
of the sr-index, because it performs much better on real texts than on synthetic ones: the sr-index retains
the performance of the r-index while using 1.5-4.0 times less space, sharply outperforming virtually every
other compressed index on repetitive texts in both time and space. Only a particular Lempel–Ziv-based
index uses less space--about half--than the sr-index, but it is an order of magnitude slower.
Our second contribution are the r-csa and sr-csa indexes. Just like the r-index adapts the well-known
FM-Index to repetitive texts, the r-csa adapts Sadakane's Compressed Suffix Array (CSA) to this case. We
show that the principles used on the r-index turn out to fit naturally and efficiently in the CSA framework.
The sr-csa is the corresponding subsampled version of the r-csa. While the CSA performs better than the
FM-Index on classic texts with alphabets larger than DNA, our experiments show that the sr-csa outperforms
the sr-index on repetitive texts not only over those larger alphabets, but on some DNA texts as well.
Overall, our new subsampled indexes sweep the table of the existing indexes for highly repetitive text
collection, by combining the exceptional speed of the r-index with drastically reduced storage use.