Implicit Compression Boosting with Applications to Self-Indexing
Veli Mäkinen and Gonzalo Navarro
Compression boosting (Ferragina & Manzini, SODA 2004) is a new technique
to enhance zeroth order entropy compressors' performance to k-th order
entropy. It works by constructing the Burrows-Wheeler transform of the input
text, finding optimal partitioning of the transform, and then compressing each
piece using an arbitrary zeroth order compressor. The optimal partitioning has
the property that the achieved compression is boosted to k-th order
entropy, for any k The technique has an application to text indexing:
Essentially, building a wavelet tree (Grossi et al., SODA 2003) for each
piece in the partitioning yields a k-th order compressed full-text
self-index providing efficient substring searches on the indexed text
(Ferragina et al., SPIRE 2004). In this paper, we show that using explicit
compression boosting with wavelet trees is not necessary; our new analysis
reveals that the size of the wavelet tree built for the complete Burrows-Wheeler
transformed text is, in essence, the sum of those built for the pieces in the
optimal partitioning. Hence, the technique provides a way to do compression
boosting implicitly, with a trivial linear time algorithm, but fixed to a
specific zeroth order compressor (Raman et al., SODA 2002). In addition to
having these consequences on compression and static full-text self-indexes,
the analysis shows that a recent dynamic zeroth order compressed
self-index (Mäkinen & Navarro, CPM 2006) occupies in fact space
proportional to k-th order entropy.