Document Listing on Repetitive Collections with Guaranteed Performance
Gonzalo Navarro
We consider document listing on string collections, that is, finding in which
strings a given pattern appears. In particular, we focus on repetitive
collections: a collection of size N over alphabet [1,sigma] is composed
of D copies of a string of size n, and s
edits are applied on ranges of copies. We introduce the first document listing
index with size Õ(n+s), precisely O((n log sigma+s log^2
N) log D)
bits, and with useful worst-case time guarantees: Given a pattern of length
m,
the index reports the ndoc > 0 strings where it appears in time
O(m log^(1+e) N * ndoc), for any constant
e > 0 (and
tells in time O(m log N) if ndoc = 0). Our technique is to augment a range
data structure that is commonly used on grammar-based indexes, so that instead
of retrieving all the pattern occurrences, it computes useful summaries on
them. We show that the idea has
independent interest: we introduce the first grammar-based index that, on
a text T[1,N] with a grammar of size r, uses O(r log
N) bits and
counts the number of occurrences of a pattern P[1,m] in time
O(m^2 + m log^(2+e) r), for any constant e > 0. We also give
the first index using O(z log(N/z) log N) bits, where T is parsed by
Lempel-Ziv into z phrases, counting occurrences in time
O(m log^(2+e) N).