Text Indexing for Simple Regular Expressions
Hideo Bannai, Philip Bille, Inge Li Gørtz, Gad M. Landau,
Gonzalo Navarro, Nicola Prezza, Teresa Anna Steiner, and Simon Rumle Tarnow
We study the problem of indexing a text T[1..n] in Sigma^n so that,
later, given a query regular expression pattern R of size m=|R|,
we can report all the occ substrings T[i..j] of T matching
R. The problem is known to be hard for arbitrary patterns R, so
in this paper we consider the following two types of patterns. (1)
Character-class Kleene-star patterns of the form P_1 D^* P_2,
where P_1 and P_2 are strings and D = \{c_1, ..., c_k\}
subset of Sigma is a character-class that is shorthand for the
regular expression (c_1 | c_2 | ... | c_k). (2) String
Kleene-star patterns of the form P_1 P^* P_2 where P,
P_1 and P_2 are strings. In case (1), we describe an index of
O(n log^(1+e) n) space (for any constant e > 0)
solving queries in time O(m + log n / log log n + occ) on
constant-sized alphabets. We also describe a more general solution working on
any alphabet size. This result is conditioned on the existence of an
anchor: a character of P_1P_2 that does not belong to D.
We justify this assumption by proving that if an anchor is not present, no
efficient indexing solution can exist unless the Set Disjointness Conjecture
fails. In case (2), we describe an index of size O(n) answering queries
in time O(m + (occ+1) log^e n) on any alphabet size.