Lempel-Ziv Compression of Highly Structured Documents
Joaquín Adiego, Gonzalo Navarro and Pablo de la Fuente
We describe LZCS, a novel Lempel-Ziv approach suitable for
compressing structured documents. LZCS takes advantage of repeated substructures
that may appear in the documents by replacing them with a backward reference
to their previous occurrence. The result of the LZCS transformation is still
a valid structured document that is human-readable and can be transmitted over
ASCII channels. Moreover, LZCS-transformed documents are easy to search,
display, access at random, and navigate. In a second stage, the transformed
documents can be further compressed either with any semi-static technique, so that
all those operations can still be performed efficiently, or with any
adaptive technique to boost compression. LZCS is especially efficient at
compressing collections of highly structured data, such as XML forms, invoices,
e-commerce and web-service exchange documents. A comparison with other
structure-aware and standard compressors shows that LZCS is a competitive
choice for this type of document, while the others are not well suited to
supporting navigation or random access. When combined with an adaptive compressor,
LZCS obtains by far the best compression ratios.
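
The core idea of replacing a repeated substructure with a backward reference can be illustrated with a minimal Python sketch. This is not the LZCS implementation. The `<ref to="N"/>` marker, the preorder numbering of nodes, the whitespace-insensitive comparison of subtrees, and the function names `subtree_key`, `transform`, and `expand` are all assumptions made for illustration, and the sketch assumes the input contains no element already named `ref`.

```python
import xml.etree.ElementTree as ET

REF_TAG = "ref"  # hypothetical marker element, not the paper's reference syntax


def subtree_key(elem):
    """Hashable key for a subtree: tag, attributes, text, and children."""
    children = tuple((subtree_key(c), (c.tail or "").strip()) for c in elem)
    attrs = tuple(sorted(elem.attrib.items()))
    return (elem.tag, attrs, (elem.text or "").strip(), children)


def transform(root):
    """Copy the tree, replacing each repeated subtree with <ref to="N"/>.

    N is the preorder index, in the output tree, of the first occurrence.
    """
    seen = {}      # subtree key -> preorder index of its first occurrence
    counter = [0]  # next preorder index in the output tree

    def visit(elem):
        index = counter[0]
        counter[0] += 1
        key = subtree_key(elem)  # recomputed per node; fine for a sketch
        if key in seen:
            out = ET.Element(REF_TAG, {"to": str(seen[key])})
        else:
            seen[key] = index
            out = ET.Element(elem.tag, dict(elem.attrib))
            out.text = elem.text
            for child in elem:
                out.append(visit(child))
        out.tail = elem.tail
        return out

    return visit(root)


def expand(root):
    """Invert transform() by following each reference to its target."""
    nodes = list(root.iter())  # document order matches the numbering above

    def build(elem):
        src = nodes[int(elem.get("to"))] if elem.tag == REF_TAG else elem
        out = ET.Element(src.tag, dict(src.attrib))
        out.text = src.text
        for child in src:
            out.append(build(child))
        out.tail = elem.tail
        return out

    return build(root)


if __name__ == "__main__":
    doc = ET.fromstring(
        "<orders>"
        "<order><item>pen</item><qty>2</qty></order>"
        "<order><item>ink</item><qty>2</qty></order>"
        "<order><item>pen</item><qty>2</qty></order>"
        "</orders>"
    )
    packed = transform(doc)
    print(ET.tostring(packed, encoding="unicode"))
    print(ET.tostring(expand(packed), encoding="unicode"))
```

In this example the third `order` is replaced by a reference to the first, and the repeated `qty` inside the second `order` by a reference to its earlier copy, while the output remains a well-formed XML document that can be read or navigated without decompression. A practical version would presumably skip subtrees whose text is shorter than the reference itself; that check is omitted here for brevity.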