Towards a Definitive Measure of Repetitiveness

Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza

Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad hoc measures like the size z of the Lempel-Ziv parse are frequently used to estimate repetitiveness. Recently, a more principled measure, the size γ of the smallest attractor of a string S[1..n], was introduced. The measure γ lower bounds all the previous relevant ones (including z), yet S can be represented and indexed within space O(γ log(n/γ)), which also upper bounds most measures. While γ is certainly a better measure of repetitiveness than z, it is NP-hard to compute, and it is not known whether S can be represented in o(γ log n) space. In this paper, we study a smaller measure, δ ≤ γ, which can be computed in linear time. We show that δ better captures the compressibility of repetitive strings: for every length n and every value δ ≥ 2, we construct a string such that γ = Ω(δ log(n/δ)). Still, we show a representation of any string S in O(δ log(n/δ)) space that supports direct access to any character S[i] in time O(log(n/δ)) and finds the occ occurrences of any pattern P[1..m] in time O(m log n + occ log^ε n) for any constant ε > 0. Further, we prove that no o(δ log n)-space representation exists in general: for every length n and every value 2 ≤ δ ≤ n^(1-ε), we exhibit a string family whose elements can only be encoded in Ω(δ log(n/δ)) space. We complete our characterization of δ by showing that, although γ, z, and other repetitiveness measures are always O(δ log(n/δ)), in some string families the smallest context-free grammar must have size g = Ω(δ log^2 n / log log n). No such separation is known for γ.
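To make the measure concrete: in the body of the paper, δ(S) is defined as max over k ≥ 1 of d_k(S)/k, where d_k(S) is the number of distinct substrings of length k in S. The linear-time algorithm mentioned in the abstract relies on suffix-tree machinery; the sketch below is only a straightforward quadratic-time illustration of the definition, not the paper's algorithm.

```python
def delta(s: str) -> float:
    """Compute delta(s) = max_k d_k(s)/k by enumerating all
    substring lengths k and counting distinct length-k substrings.
    Quadratic-time sketch; the paper achieves linear time."""
    n = len(s)
    best = 0.0
    for k in range(1, n + 1):
        # d_k: number of distinct substrings of length k
        d_k = len({s[i:i + k] for i in range(n - k + 1)})
        best = max(best, d_k / k)
    return best
```

For example, a unary string a^n has d_k = 1 for every k, so δ = 1, while a string of n distinct characters has d_1 = n, so δ = n; highly repetitive strings sit close to the lower end of this range.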