Repetitiveness Measures Based on String Morphisms

Gonzalo Navarro and Cristian Urbina

We study deterministic Lindenmayer systems, and more generally string morphisms, as mechanisms to capture the repetitiveness of string collections. In particular, we define L-systems, which extend CPD0L-systems with two parameters, d and n, to unambiguously determine a single string w = τ(φ^d(s))[1..n], where φ is the morphism of the system, τ is a coding, and s is its starting symbol. We define ell(w) as the size of the shortest description of an L-system generating w, and show that ell, which builds on the self-similarities that arise in the sequence, is a relevant measure of repetitiveness.

We study the relation between ell and a better-established measure called δ, which builds on substring complexity. Our results show that ell and δ are largely orthogonal, in the sense that either can be larger than the other, depending on the string family. In particular, ell can be Theta(sqrt(n)) times smaller than δ, whereas ell is upper-bounded by the smallest context-free grammar that generates the string, and thus can be only polylogarithmically larger than δ. This suggests that both mechanisms capture different kinds of regularities related to repetitiveness.

We then combine the mechanisms of L-systems and of bidirectional macro schemes, the smallest known representation scheme that captures substring copies. We call the combination NU-systems, and show that they can be asymptotically strictly smaller than both mechanisms for the same fixed string family. The size ν of the smallest NU-system is a new minimal reachable repetitiveness measure, and it is also shown to be uncomparable with δ. This shows that combining morphism substitution with copy-paste mechanisms yields a stronger approach to capture string repetitiveness.

We also study algorithmic problems on L-systems and NU-systems, such as the cost of decompression and of accessing substrings of the encoded string without fully decompressing it, and explore various simplifications of L-systems that trade compression effectiveness by algorithmic efficiency to handle them.