Optimal-Time Dictionary-Compressed Indexes
Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka,
Gonzalo Navarro, and Nicola Prezza
We describe the first self-indexes able to count and locate pattern
occurrences in optimal time within a space bounded by the size of the most
popular dictionary compressors. To achieve this result we combine several
recent findings, including string attractors --- new combinatorial
objects encompassing most known compressibility measures for highly repetitive
texts ---, and grammars based on locally-consistent parsing.
More in detail, let γ be the size of the smallest attractor for a
text T of length n. The measure γ is an (asymptotic) lower
bound to the size of dictionary compressors based on Lempel--Ziv, context-free
grammars, and many others. The smallest known text representations in terms of
attractors use space O(γ log(n/γ)), and
our lightest indexes work within the same asymptotic space. Let
e>0 be a suitably small constant fixed at construction time,
m be the pattern length, and occ be the number of its text
occurrences. Our index counts pattern occurrences in
O(m + log^(2+e) n) time, and locates them in
O(m + (occ+1) log^e n) time.
These times already outperform those of most dictionary-compressed indexes,
while obtaining the least asymptotic space for any index searching within
O((m+occ) polylog n) time.
Further, by increasing the space to O(γ log(n/γ) log^e
n), we reduce the
locating time to the optimal O(m+occ), and within
O(γ log(n/γ) log n)
space we can also count in optimal O(m) time. No dictionary-compressed index had obtained this time before. All our indexes can be
constructed in O(n) space and O(n log n) expected time.
As a byproduct of independent interest, we show how to build, in O(n)
expected time and without knowing the size γ of the smallest
attractor (which is NP-hard to find), a run-length context-free grammar of
size O(γ log(n/γ)) generating (only) T.
As a result, our indexes can be built without knowing γ.