Space-Efficient Construction of Lempel-Ziv Compressed Text Indexes
Diego Arroyuelo and Gonzalo Navarro
A compressed full-text self-index is a data structure that replaces
a text and in addition gives indexed access to it, while taking space
proportional to the compressed text size.
This is very important nowadays, since one can accommodate the index
of very large texts entirely in main memory, avoiding the slower access to
secondary storage.
In particular, the LZ-index [G. Navarro, Journal of Discrete Algorithms, 2004]
stands out for its good performance at extracting text passages and locating
pattern occurrences. Given a text T[1..u] over an alphabet of size
s,
the LZ-index requires 4|LZ|(1+o(1)) bits of space, where |LZ| is the size
of the LZ78-compression of T. This can be bounded by
|LZ| = u Hk(T)+o(u log s), where Hk(T) is the k-th order empirical
entropy of T, for any k=o(log_s u). The LZ-index is built in
O(u log s) time, yet requiring O(u log u) bits of main memory in the
worst case. In practice, the LZ-index occupies 1.0-1.5 times the text size
(and replaces the text), but
its construction requires around 5 times the text size. This limits
its applicability to medium-sized texts. In this paper we present a
space-efficient algorithm to construct the LZ-index in
O(u(log s + log log u)) time and requiring
4|LZ|(1+o(1)) bits of
main memory, that is, asymptotically the same space of the final index.
We also adapt our algorithm to construct more recent reduced versions
of the LZ-index, which occupy from 1 to 3 times |LZ|(1+o(1)) bits, and
show that these can also be built using asymptotically the same space of
the final index.
Finally, we study an alternative model in which we are given only a limited
amount of main memory to carry out the indexing process (less than that
required by the final index), and must use the disk for the rest.
We show how to build all the LZ-index
variants in O(u(log s + log log u)) time, and
within |LZ|(1+o(1)) bits of main memory, that is, asymptotically just the
space to hold the LZ78-compressed text.
Our experimental results show that our method is efficient in practice,
needing
an amount of memory close to that of the final index, and being competitive
with the best construction times of other compressed indexes.