LZgrep: A Boyer-Moore String Matching Tool for Ziv-Lempel Compressed Text
Gonzalo Navarro and Jorma Tarhio
We present a Boyer-Moore approach to string matching over LZ78 and LZW
compressed text. The idea is to search the text directly in compressed form
instead of decompressing and then searching it. We modify the Boyer-Moore
approach so as to skip text using the characters explicitly represented in the
LZ78/LZW formats, modifying the basic technique where the algorithm can choose
which characters to inspect. We present and compare several solutions for
single and multipattern search. We show that our algorithms obtain speedups of
up to 50% compared to the simple decompress-then-search approach. Finally,
we present a public tool, LZgrep, which uses our algorithms to offer
grep-like capabilities searching directly files compressed using Unix's
Compress, a LZW compressor. LZgrep can also search files compressed
with Unix gzip, using new decompress-then-search techniques we develop,
which are faster than the current tools. This way, users can always keep their
files in compressed form and still search them, uncompressing only when they
want to see them.