A Lempel-Ziv Text Index on Secondary Storage
Diego Arroyuelo and Gonzalo Navarro
Full-text searching consists in locating the occurrences of a given pattern
P[1..m] in a text T[1..u], both sequences over an alphabet of size s.
In this paper we define a new index for full-text searching
on secondary storage, based on the Lempel-Ziv compression
algorithm and requiring 8uH_k(T) + o(u log s) bits of space,
where H_k(T) denotes the k-th order empirical entropy of T,
for any k=o(log_s u).
Our experimental results show that our index is significantly
smaller than any other practical secondary-memory data structure:
1.4-2.3 times the text size including the text,
which means 39%-65% the size of traditional indexes
like String B-trees [Ferragina and Grossi, JACM 1999].
In exchange, our index requires more disk access to locate the pattern
occurrences. Our index is able to report
up to 600 occurrences per disk access. If we only need
to count pattern occurrences, the space can be reduced to
about 1.04-1.68 times the text size, requiring about
20-60 disk accesses, depending on the pattern length.