Computing MEMs and Relatives on Repetitive Text Collections

Gonzalo Navarro

We consider the problem of computing the Maximal Exact Matches (MEMs) of a given pattern P[1..m] on a large repetitive text collection T[1..n] over an alphabet of size σ, which is represented as a (hopefully much smaller) run-length context-free grammar of size g_rl. We show that the problem can be solved in time O(m^2 log^e n), for any constant e > 0, on a data structure of size O(g_rl). Further, on a locally consistent grammar of size O(δ log (n log σ / (δ log n))), the time decreases to O(m log m (log m + log^e n)). The value δ is a function of the substring complexity of T and Omega(δ log (n log σ / (δ log n))) is a tight lower bound on the compressibility of repetitive texts T, so our structure has optimal size in terms of n, σ, and δ. We extend our results to several related problems, such as finding k-MEMs, MUMs, rare MEMs, and applications.