RePair and All Irreducible Grammars are Upper Bounded by High-Order Empirical Entropy

Carlos Ochoa and Gonzalo Navarro

Irreducible grammars are a class of context-free grammars with well-known representatives, such as RePair (with a few tweaks), Longest Match, Greedy, and Sequential. We show that a grammar-based compression method described by Kieffer and Yang (2000) is upper bounded by the high-order empirical entropy of the string when the underlying grammar is irreducible. Specifically, given a string S over an alphabet of size s, we prove that if the underlying grammar is irreducible, then the length of the binary code output by this grammar-based compression method is bounded by |S| H_k(S) + o(|S| log s) for any k in o(log_s |S|), where H_k(S) is the k-order empirical entropy of S. This is the first bound encompassing the whole class of irreducible grammars in terms of the high-order empirical entropy, with coefficient 1.