Document Listing on Repetitive Collections with Guaranteed Performance

Gonzalo Navarro

We consider document listing on string collections, that is, finding in which strings a given pattern appears. In particular, we focus on repetitive collections: a collection of size N over alphabet [1,sigma] is composed of D copies of a string of size n, and s edits are applied on ranges of copies. We introduce the first document listing index with size Õ(n+s), precisely O((n log sigma+s log^2 N) log D) bits, and with useful worst-case time guarantees: Given a pattern of length m, the index reports the ndoc > 0 strings where it appears in time O(m log^(1+e) N * ndoc), for any constant e > 0 (and tells in time O(m log N) if ndoc = 0). Our technique is to augment a range data structure that is commonly used on grammar-based indexes, so that instead of retrieving all the pattern occurrences, it computes useful summaries on them. We show that the idea has independent interest: we introduce the first grammar-based index that, on a text T[1,N] with a grammar of size r, uses O(r log N) bits and counts the number of occurrences of a pattern P[1,m] in time O(m^2 + m log^(2+e) r), for any constant e > 0. We also give the first index using O(z log(N/z) log N) bits, where T is parsed by Lempel-Ziv into z phrases, counting occurrences in time O(m log^(2+e) N).