Towards a Definitive Measure of Repetitiveness
Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza
Unlike in statistical compression, where Shannon's entropy is a definitive
lower bound, no such clear measure exists for the compressibility of
repetitive sequences. Since statistical entropy does not capture
repetitiveness, ad-hoc measures like the size z of the Lempel-Ziv parse are frequently used to estimate it.
Recently, a more principled measure, the size γ of the smallest
attractor of a string S[1..n], was introduced. The measure
γ lower bounds all the previous relevant ones (including z),
yet S can be represented and indexed within space
O(γ log (n/γ)), which also upper bounds most measures.
While γ is certainly a better measure of repetitiveness than z,
it is NP-hard to compute, and it is not known whether S can be represented
in o(γ log n) space.
In this paper, we study a smaller measure, δ ≤ γ, which can
be computed in linear time. We show that δ better captures the
compressibility of repetitive strings. For every length n and every
value δ ≥ 2, we construct a string such that
γ = Ω(δ log (n/δ)).
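As a concrete illustration (assuming the standard definition δ = max_k d_k(S)/k, where d_k(S) counts the distinct length-k substrings of S), the measure can be sketched with a naive quadratic-time computation; the linear-time algorithm referred to above would instead use a suffix tree:

```python
def delta(s: str) -> float:
    """Naive O(n^2)-time sketch of the measure delta:
    the maximum over k of d_k(s)/k, where d_k(s) is the
    number of distinct substrings of s of length k."""
    n = len(s)
    best = 0.0
    for k in range(1, n + 1):
        # Collect the distinct length-k substrings.
        d_k = len({s[i:i + k] for i in range(n - k + 1)})
        best = max(best, d_k / k)
    return best
```

For example, a unary string like "aaaa" has d_k = 1 for every k, so δ = 1, while a string with three distinct letters has d_1 = 3 and hence δ ≥ 3.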
Still, we show a representation of any string S in
O(δ log (n/δ)) space that supports direct access to any
character S[i] in time O(log (n/δ)) and finds the
occ occurrences of any pattern P[1..m] in time O(m log n +
occ log^ε n) for any constant ε > 0.
Further, we prove that no o(δ log n)-space representation
exists: for every length n and any value 2 ≤ δ ≤
n^(1-ε), we exhibit a string family whose elements can only
be encoded in Ω(δ log (n/δ)) space, which is Θ(δ log n) in this regime.
We complete our
characterization of δ by showing that, although γ, z,
and other repetitiveness measures are always
O(δ log (n/δ)), in some string
families the smallest context-free grammar must be of size
g = Ω(δ log^2 n / log log n). No such separation is known for γ.