Block Addressing Indices for Approximate Text Retrieval

Ricardo Baeza-Yates and Gonzalo Navarro

Although the issue of approximate text retrieval is gaining importance in the last years, it is currently addressed by only a few indexing schemes. To reduce space requirements, the indices may point to text blocks instead of exact word positions. This is called "block addressing". The most notorious index of this kind is Glimpse. However, block addressing has not been well studied yet, especially regarding approximate searching.

Our main contribution is an analytical study of the space-time trade-offs related to the block size. We find that, under reasonable assumptions, it is possible to build an index which is simultaneously sublinear in space overhead and in query time. We validate the analysis with extensive experiments, obtaining typical performance figures. These results are valid not only for approximate searching queries but also for classical ones.

Finally, we propose a new strategy for approximate searching on block addressing indices, which we experimentally find 4-5 times faster than Glimpse. This algorithm takes advantage of the index even if the whole text has to be scanned. As a side effect, we find that using blocks of fixed size is better than, say, addressing files.