Dynamic Entropy-Compressed Sequences and Full-Text Indexes

Veli Mäkinen and Gonzalo Navarro

We give new solutions to the Searchable Partial Sums with Indels problem. Given a sequence of n k-bit numbers, we present a structure taking kn + o(kn) bits of space, able of performing operations sum, search, insert, and delete, all in O(log n) worst-case time, for any k = O(log n). This extends previous results by Hon et al. [ISAAC 2003] achieving the same space and O(log n / log log n) time complexities for the queries, yet offering complexities for insert and delete that are amortized and worse than ours, and supported only for k = O(1). Our result matches an existing lower bound for large values of k.

We also give new solutions to the Dynamic Sequence problem. Given a sequence of n symbols in the range [1,s] with binary zero-order entropy H_0, we present a dynamic data structure that requires nH_0 + o(n log s) bits of space, which is able of performing rank and select, as well as inserting and deleting symbols at arbitrary positions, in O(log n * log s) time. Our result is the first entropy-bound dynamic data structure for rank and select over general sequences.

In the case s=2, where both previous problems coincide, we improve the dynamic solution of Hon et al. in that we compress the sequence. The only previous result with entropy-bound space for dynamic binary sequences is by Blandford and Blelloch [SODA 2004], which has the same complexities as our structure, but does not achieve constant 1 multiplying the entropy term in the space complexity.

Finally, we present a new dynamic compressed full-text self-index, for a collection of texts over an alphabet of size s, of overall length n and h-th order empirical entropy H_h. The index requires nH_h + o(n log s) bits of space, for any h <= a * log_s n and constant 0<a<1. It can count the number of occurrences of a pattern of length m in time O(m * log n * log s). Each such occurrence can be reported in O(log^2 n * log log n) time, and displaying a context of length l from a text takes time O(log n * (l log s + log n log log n)). Insertion/deletion of a text to/from the collection takes O(log n * log s) time per symbol. This significantly improves the space of a previous result by Chan et al. [CPM 2004] in exchange for a slight time complexity penalty. We achieve at the same time the first dynamic index requiring essentially nH_h bits of space, and the first construction of a compressed full-text self-index within that working space. Previous results achieve at best O(nH_h) space with constants larger than 1 (Ferragina and Manzini [FOCS 2000], Arroyuelo and Navarro [ISAAC 2005]) and higher time complexities.

An important result we prove in this paper is that the wavelet tree of the Burrows-Wheeler transform of a text, if compressed with a technique that achieves zero-order compression locally (e.g., Raman et al. [SODA 2002]), automatically achieves h-th order entropy space for any h. This unforeseen relation is essential for the results of the previous paragraph, but it also derives into significant simplifications on many existing static compressed full-text self-indexes that build on wavelet trees.