Self-Indexed Grammar-Based Compression

Francisco Claude and Gonzalo Navarro

Self-indexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current self-indexes are unable of fully exploiting the redundancy of highly repetitive text collections that arise in several applications. Grammar-based compression is well suited to exploit such repetitiveness.

We introduce the first grammar-based self-index. It builds on Straight-Line Programs (SLPs), a rather general kind of context-free grammars. If an SLP of n rules represents a text T[1,u], then an SLP-compressed representation of T requires 2n log_2 n bits. For that same SLP, our self-index takes O(n log n) + n log_2 u bits. It extracts any text substring of length m in time O((m+h) log n), and finds occ occurrences of a pattern string of length m in time O((m(m+h)+h occ) log n), where h is the height of the parse tree of the SLP. No previous grammar representation had achieved o(n) search time. \\ % As byproducts we introduce (i) a representation of SLPs that takes 2n log_2 n (1+o(1)) bits and efficiently supports more operations than a plain array of rules; (ii) a representation for binary relations with labels supporting various extended queries; (iii) a generalization of our self-index to grammar compressors that reduce T to a sequence of terminals and nonterminals, such as Re-Pair and LZ78.