An LMS-based Grammar Self-index with Local Consistency Properties
Diego Díaz-Domínguez, Gonzalo Navarro, and Alejandro Pacheco
A grammar self-index of a text T (Claude et al. 2012) consists of a
grammar G that only produces T and a geometric data
structure that indexes the string cuts of the right-hand sides of
G's rules. This representation uses space proportional to
g, the size of the grammar, which is small when the text is repetitive.
However, the index is slow for matching long patterns; it finds the occ
occurrences of a pattern P[1..m] in O((m^2 + occ) log g) time. The
most expensive part is a set of binary searches for the different cuts
P[1..j]P[j+1..m] in the geometric data structure. Christiansen et al.
(2019) solved this problem by building a locally consistent grammar that only
searches for O(log m) cuts of P. Their representation, however,
requires significant extra space (tough still in O(g)) to store a set of
permutations for the nonterminal symbols. In this work, we propose another
locally consistent grammar that builds on the idea of LMS substrings (Nong et
al. 2009). Our grammar also requires to try O(log m) cuts when searching
for P, but it does not need to store permutations.
As a result, we obtain a self-index that searches in time O((m log m + occ)
log g) and is of practical size. Our experiments showed that our index is faster than previous grammar-indexes at the price of increasing the space by a 1.8x factor on average. Other experimental results showed that our data structure becomes convenient when the patterns to search for are long.