An LMS-based Grammar Self-index with Local Consistency Properties

Diego Díaz-Domínguez, Gonzalo Navarro, and Alejandro Pacheco

A grammar self-index of a text T (Claude et al. 2012) consists of a grammar G that only produces T and a geometric data structure that indexes the string cuts of the right-hand sides of G's rules. This representation uses space proportional to g, the size of the grammar, which is small when the text is repetitive. However, the index is slow for matching long patterns; it finds the occ occurrences of a pattern P[1..m] in O((m^2 + occ) log g) time. The most expensive part is a set of binary searches for the different cuts P[1..j]P[j+1..m] in the geometric data structure. Christiansen et al. (2019) solved this problem by building a locally consistent grammar that only searches for O(log m) cuts of P. Their representation, however, requires significant extra space (tough still in O(g)) to store a set of permutations for the nonterminal symbols. In this work, we propose another locally consistent grammar that builds on the idea of LMS substrings (Nong et al. 2009). Our grammar also requires to try O(log m) cuts when searching for P, but it does not need to store permutations. As a result, we obtain a self-index that searches in time O((m log m + occ) log g) and is of practical size. Our experiments showed that our index is faster than previous grammar-indexes at the price of increasing the space by a 1.8x factor on average. Other experimental results showed that our data structure becomes convenient when the patterns to search for are long.