###
Towards a Definitive Measure of Repetitiveness

####
Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza

Unlike in statistical compression, where Shannon's entropy is a definitive
lower bound, no such a clear measure exists for the compressibility of
repetitive sequences. Since statistical entropy does not capture
repetitiveness, ad-hoc measures like the size *z* of the Lempel-Ziv parse are frequently used to estimate repetitiveness.
Recently, a more principled measure, the size *γ* of the smallest
*attractor* of a string *S[1..n]*, was introduced. The measure
*γ* lower bounds all the previous relevant ones (including
*z*),
yet *S* can be represented and indexed within space
*O(γ log (n/γ))*, which also upper bounds most measures.
While *γ* is certainly a better measure of repetitiveness than
*z*,
it is NP-complete to compute, and it is not known if *S* can be represented
in *o(γ log n)* space.
In this paper, we study a smaller measure, *δ ≤ γ*, which can
be computed in linear time. We show that *δ* better captures the
compressibility of repetitive strings. For every length *n and every
value **δ ≥ 2*, we construct a string such that
*γ = Omega(δ log (n/δ))*.
Still, we show a representation of any string *S* in
*O(δ log (n/δ))* space that supports direct access to any
character *S[i]* in time *O(log (n/δ))* and finds the
*occ* occurrences of any pattern *P[1..m]* in time *O(m log n +
occ log^ε n)* for any constant *ε > 0*.
Further, we prove that no *o(δ log n)*-space representation
exists: for every length *n* and any value *2 ≤ δ ≤
n^(1-ε)*, we exhibit a string family whose elements can only
be encoded in *Omega(δ log (n/δ))* space.
We complete our
characterization of *δ* by showing that, although *γ*,
*z*,
and other repetitiveness measures are always
*O(δ log (n/δ))*, in some string
families the smallest context-free grammar can be of size
*g = Omega(δ log^2 n / log log n)*. No such a separation is known for
*γ*.