Adding Compression to Block Addressing Inverted Indexes
Gonzalo Navarro, Edleno de Moura, Marden Neubert, Nivio Ziviani and
Ricardo Baeza-Yates
Inverted index compression, block addressing and sequential search on
compressed text are three techniques that have been separately developed
for efficient, low-overhead
text retrieval. Modern text compression techniques can reduce the text to
less than 30% of its size and allow searching it directly and faster than
the uncompressed text.
Inverted index compression significantly reduces the size of the index
at the same processing speed.
Block addressing makes the inverted lists point to text blocks instead of
exact positions, and pays for the reduction in space with some sequential
text scanning.
In this work we combine the three ideas in a single scheme. We present a
compressed inverted file that indexes compressed text and uses block
addressing. We consider different techniques to compress the index and study
their performance with respect to the block size.
We compare the index against the three separate techniques for
varying block sizes, showing that our index is superior to
each isolated approach.
For instance, with just 4% extra space overhead, the index has to
scan less than 12% of the text for exact searches and about
20% when allowing one error in the matches.
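
The block addressing idea described above can be illustrated with a minimal
Python sketch. It is not the scheme of this work: the text and the index are
left uncompressed, blocks are assumed to hold a fixed number of words, and the
names build_block_index, search and words_per_block are hypothetical, chosen
only for illustration. The index stores block numbers rather than exact
positions, so a query must sequentially scan each candidate block.

```python
from collections import defaultdict


def build_block_index(words, words_per_block):
    """Split the word sequence into blocks and index each word by block number.

    Assumption for this sketch: a block is a fixed number of words.
    """
    blocks = [
        words[i:i + words_per_block]
        for i in range(0, len(words), words_per_block)
    ]
    index = defaultdict(list)
    for block_no, block in enumerate(blocks):
        # Each word is recorded once per block, however often it occurs there.
        for word in set(block):
            index[word].append(block_no)
    return blocks, dict(index)


def search(word, blocks, index, words_per_block):
    """Return the exact word positions of `word` in the original sequence.

    The index only narrows the search to candidate blocks; the exact
    positions are recovered by scanning those blocks sequentially.
    """
    positions = []
    for block_no in index.get(word, []):
        for offset, candidate in enumerate(blocks[block_no]):
            if candidate == word:
                positions.append(block_no * words_per_block + offset)
    return positions


words = "to be or not to be that is the question".split()
blocks, index = build_block_index(words, words_per_block=4)
print(index["to"])                        # block numbers containing "to"
print(search("to", blocks, index, 4))     # exact word positions of "to"
```

Larger blocks shrink the index, since each list holds fewer distinct block
numbers, at the price of more sequential scanning per query. This is the
trade-off that is studied here with respect to the block size.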