Text Indexing for Simple Regular Expressions

Hideo Bannai, Philip Bille, Inge Li Gørtz, Gad M. Landau, Gonzalo Navarro, Nicola Prezza, Teresa Anna Steiner, and Simon Rumle Tarnow

We study the problem of indexing a text T[1..n] in Sigma^n so that, later, given a query regular expression pattern R of size m=|R|, we can report all the occ substrings T[i..j] of T matching R. The problem is known to be hard for arbitrary patterns R, so in this paper we consider the following two types of patterns. (1) Character-class Kleene-star patterns of the form P_1 D^* P_2, where P_1 and P_2 are strings and D = \{c_1, ..., c_k\} subset of Sigma is a character-class that is shorthand for the regular expression (c_1 | c_2 | ... | c_k). (2) String Kleene-star patterns of the form P_1 P^* P_2 where P, P_1 and P_2 are strings. In case (1), we describe an index of O(n log^(1+e) n) space (for any constant e > 0) solving queries in time O(m + log n / log log n + occ) on constant-sized alphabets. We also describe a more general solution working on any alphabet size. This result is conditioned on the existence of an anchor: a character of P_1P_2 that does not belong to D. We justify this assumption by proving that if an anchor is not present, no efficient indexing solution can exist unless the Set Disjointness Conjecture fails. In case (2), we describe an index of size O(n) answering queries in time O(m + (occ+1) log^e n) on any alphabet size.