Lempel-Ziv Compression of Structured Text
Joaquín Adiego, Gonzalo Navarro and Pablo de la Fuente.
We describe a novel Lempel-Ziv approach suitable for compressing structured
documents, called LZCS, which takes advantage of redundant information
that can appear in the structure. The main idea is that frequently repeated
subtrees may exist and these can be replaced by a backward reference to
their first occurrence. The main advantage is that compressed documents
generated by LZCS are easy to display, access at random, and navigate. In a
second stage, processed documents can be further compressed using a
semiadaptive technique, so that random access and navigability remain possible.
LZCS is especially efficient at compressing collections of highly structured data,
such as XML forms, invoices, e-commerce and web-service exchange documents.
A comparison against structure-based and standard compressors shows that
LZCS is a competitive choice for this type of document, while the others are
not well-suited to support navigation or random access.
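
The core idea of replacing repeated subtrees with backward references can be illustrated with a minimal Python sketch. This is not the authors' implementation: the token format, the function names (subtree_key, compress, decompress), and the sample document are all hypothetical. The sketch assumes data-centric XML with no mixed content (text after a child element is ignored), and it does not escape special characters on output.

```python
import xml.etree.ElementTree as ET


def subtree_key(node, keys):
    """Build a hashable description of the subtree rooted at node."""
    key = (
        node.tag,
        tuple(sorted(node.attrib.items())),
        (node.text or "").strip(),
        tuple(subtree_key(child, keys) for child in node),
    )
    keys[node] = key
    return key


def compress(root):
    """Emit a token list in which a repeated subtree becomes ("ref", pos)."""
    keys = {}
    subtree_key(root, keys)
    first_pos = {}  # subtree key -> token index of its first occurrence
    tokens = []

    def emit(node):
        key = keys[node]
        if key in first_pos:
            # Seen before: point back to the first occurrence.
            tokens.append(("ref", first_pos[key]))
            return
        first_pos[key] = len(tokens)
        tokens.append(("open", node.tag, dict(node.attrib)))
        text = (node.text or "").strip()
        if text:
            tokens.append(("text", text))
        for child in node:
            emit(child)
        tokens.append(("close", node.tag))

    emit(root)
    return tokens


def decompress(tokens):
    """Rebuild the serialized document, following references on the fly."""
    out = []

    def copy_subtree(i):
        # Append the subtree starting at token i; return the index after it.
        tok = tokens[i]
        if tok[0] == "ref":
            copy_subtree(tok[1])
            return i + 1
        attrs = "".join(f' {k}="{v}"' for k, v in sorted(tok[2].items()))
        out.append(f"<{tok[1]}{attrs}>")
        i += 1
        while tokens[i][0] != "close":
            if tokens[i][0] == "text":
                out.append(tokens[i][1])
                i += 1
            else:
                i = copy_subtree(i)
        out.append(f"</{tok[1]}>")
        return i + 1

    copy_subtree(0)
    return "".join(out)


# Hypothetical sample: the second <item> and the third <qty> are repeats.
doc = (
    "<orders>"
    "<item><sku>A1</sku><qty>2</qty></item>"
    "<item><sku>A1</sku><qty>2</qty></item>"
    "<item><sku>B7</sku><qty>2</qty></item>"
    "</orders>"
)
tokens = compress(ET.fromstring(doc))
restored = decompress(tokens)
```

Because a reference is just the position of an earlier subtree in the output, a reader can follow it directly without decoding everything that precedes it, which is the property that keeps display, navigation, and random access simple in this kind of scheme.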