Combining Structural and Textual Contexts for Compressing Semistructured
Joaquín Adiego, Pablo de la Fuente and Gonzalo Navarro.
We describe a compression technique for semistructured documents, called
SCMPPM, which combines the Prediction by Partial Matching (PPM) technique
with the Structural Contexts Model (SCM) idea. SCMPPM takes advantage of the
context information usually implicit in the structure of the text.
The idea is to use a separate PPM model to compress the text that lies inside
each structure type (e.g., each distinct XML tag). The intuition
behind the idea is that the distribution of the texts that belong to a
given structure type should be similar, and different from that of other
structure types. This should allow PPM to make better predictions.
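
The intuition can be illustrated with a minimal sketch. It is not the SCMPPM implementation: in place of PPM it uses a much simpler adaptive order-k character model with Laplace smoothing, and it reports ideal code lengths in bits rather than producing an arithmetic-coded stream. The names OrderKModel, split_by_tag, cost_single_model and cost_per_tag_models are hypothetical, and the splitter assumes flat, non-nested elements of the form <tag>text</tag>.

```python
import math
import re
from collections import defaultdict


class OrderKModel:
    """Adaptive order-k character model with Laplace smoothing.

    A simplified stand-in for PPM (assumption): no escape mechanism,
    no blending of context orders.
    """

    def __init__(self, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def cost_and_update(self, text):
        """Return the ideal code length of `text` in bits, learning as we go."""
        bits = 0.0
        history = ""
        for ch in text:
            # Smoothed probability of `ch` given the last `order` characters.
            p = (self.counts[history][ch] + 1) / (
                self.totals[history] + self.alphabet_size
            )
            bits -= math.log2(p)
            self.counts[history][ch] += 1
            self.totals[history] += 1
            history = (history + ch)[-self.order:] if self.order else ""
        return bits


def split_by_tag(doc):
    """Return (tag, text) pairs for flat <tag>text</tag> elements."""
    return re.findall(r"<(\w+)>([^<]*)</\1>", doc)


def cost_single_model(segments, order=2):
    """Code all text segments with one shared model (plain modelling)."""
    model = OrderKModel(order)
    return sum(model.cost_and_update(text) for _, text in segments)


def cost_per_tag_models(segments, order=2):
    """Code each segment with the model that belongs to its tag."""
    models = defaultdict(lambda: OrderKModel(order))
    return sum(models[tag].cost_and_update(text) for tag, text in segments)


if __name__ == "__main__":
    # Toy input made up for illustration only.
    doc = (
        "<name>Ana Ruiz</name><phone>555-0101</phone>"
        "<name>Ana Rios</name><phone>555-0102</phone>"
        "<name>Ana Ramos</name><phone>555-0103</phone>"
    )
    segments = split_by_tag(doc)
    print("single model  :", round(cost_single_model(segments), 1), "bits")
    print("per-tag models:", round(cost_per_tag_models(segments), 1), "bits")
```

Whether the per-tag variant comes out ahead depends on the input and on the model used; the sketch only shows where the structural context enters the modelling, namely in the choice of which statistics are consulted and updated for each piece of text.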
We test our idea against plain PPM modelling, as well as against other
structure-aware techniques. Results show that the new compression method
obtains significant improvements in compression ratios.