Fortunately, this error does not propagate any further. The central part of the paper (the block identifier encoding, which already needed f(n) = sqrt(log n)) remains untouched.
Thanks to Jakub Opuszański, from the University of Wrocław, for pointing out this error.
Another point, raised by Bojian Xu from the University of Kansas, is not really an error but needs clarification. The problem is in Lemma 4, where we say that the bit vectors B^j are precisely those that would result if we built the wavelet tree just for L^j. This is not completely true, because the local text can have a smaller vocabulary than the global one. In the extreme case, one can have two partitions a^n and b^n, so that the local wavelet trees have entropy zero, yet each would require a whole root bitmap if turned into a wavelet tree on the alphabet {a,b}.
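The extreme case above can be made concrete with a small sketch. The helper `root_bitmap` below is hypothetical (not from the paper); it just builds the root bitmap of a binary wavelet tree, sending symbols in the left half of the alphabet to 0 and the rest to 1:

```python
def root_bitmap(text, alphabet):
    """Root bitmap of a binary wavelet tree: 0 for symbols in the
    left half of the alphabet, 1 for symbols in the right half."""
    left = set(alphabet[:len(alphabet) // 2])
    return [0 if ch in left else 1 for ch in text]

n = 8
text = "a" * n + "b" * n   # two partitions a^n and b^n
# Each local text uses a single symbol, so its local wavelet tree is a
# lone leaf and stores no bitmap at all (zero entropy).  The global
# tree on the alphabet {a,b} still stores a full root bitmap of 2n bits:
print(root_bitmap(text, "ab"))   # n zeros followed by n ones
```

Note that the extra bits are pure runs (n zeros, then n ones), which is exactly what the argument in the next paragraph exploits.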
The answer is that those overheads add up to o(n log s) bits (s = sigma). Consider expanding the local wavelet trees before integrating them into the global one. Each unused symbol we add to a local wavelet tree converts a leaf into an internal node whose bitmap is all 0s or all 1s. With as few as log s new symbols we can already force the creation of n log s such bits. However, all the bits we add form runs of 0s or 1s, and thus the (c,o) encoding of RRR reduces them to o(n log s) bits.
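The effect of the (c,o) encoding on those runs can be sketched numerically. The function below is a simplified space estimate, not the full RRR structure (it ignores the superblock pointers, which only add lower-order terms): each block of b bits is stored as its class c (its popcount, in ceil(log2(b+1)) bits) plus an offset identifying the block among the C(b,c) blocks of that class. For an all-0s block, C(b,0) = 1, so the offset costs 0 bits and only the class bits remain:

```python
import math

def rrr_bits(bitmap, b):
    """Approximate (c,o) space of RRR: per block of b bits, a class
    field of ceil(log2(b+1)) bits plus an offset of ceil(log2 C(b,c))
    bits.  Superblock pointers (lower-order terms) are ignored."""
    class_bits = math.ceil(math.log2(b + 1))
    total = 0
    for i in range(0, len(bitmap), b):
        block = bitmap[i:i + b]
        c = sum(block)                      # class = number of 1s
        combos = math.comb(len(block), c)   # blocks in this class
        total += class_bits
        if combos > 1:
            total += math.ceil(math.log2(combos))
    return total

n = 1 << 16
b = 32
run = [0] * n                        # an all-0s run, as in the expanded trees
mixed = [i & 1 for i in range(n)]    # a balanced bitmap, for contrast
print(rrr_bits(run, b))    # only (n/b) * log(b+1) class bits: sublinear in n
print(rrr_bits(mixed, b))  # close to n bits
```

Running this, the all-0s run costs only the class fields (a small fraction of n bits), while the balanced bitmap stays near n bits, which is the behavior the o(n log s) bound relies on.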