Position-Restricted Substring Searching
Veli Mäkinen and Gonzalo Navarro
A full-text index is a data structure built over a text string T[1,n].
The most basic functionality provided is (a) counting how many times a
pattern string P[1,m] appears in T and (b) locating all
those occ positions. There exist several indexes that solve (a)
in O(m) time and (b) in O(occ) time. In this paper we
propose two new queries, (c) counting how many times P[1,m]
appears in T[l,r] and (d) locating all those occ(l,r)
positions. These can be solved using (a) and (b) but this
requires O(occ) time. We present two solutions to (c) and
(d) in this paper. The first is an index that requires O(n*log n)
bits of space and answers (c) in O(m + log n) time and (d)
in O(log n) time per occurrence (that is, O(occ(l,r)*log n) time
overall). A variant of the first solution answers (c) in
O(m + log log n) time and (d) in constant time per occurrence,
but requires O(n*log^(1+e) n) bits of space for any constant e>0.
The second solution requires n*m*log s (1+o(1)) bits of space,
solving (c) in O(m*ceil(log s / log log n)) time and (d)
in O(m*ceil(log s / log log n)) time per occurrence, where s is
the alphabet size. This second structure takes less space when the text is
compressible.
Our solutions can be seen as a generalization of rank and select
dictionaries, which allow computing how many times a given character c
appears in a prefix T[1,i] and also locate the i-th occurrence of
c in T. Our solution to (c) extends character rank
queries to substring rank queries, and our solution to (d)
extends character select to substring select queries.
As a byproduct, we show how rank queries can be used to implement
fractional cascading in little space, so as to obtain a succinct version of a
well-known two-dimensional range search data structure by Chazelle. We also
show how Grossi et al.'s wavelet trees are suitable for two-dimensional
range searching, and their connection with Chazelle's data structure.