In recent years, several indexing schemes with very low space overhead have appeared, some of which support approximate searching. These low-overhead indices (of which the best-known example is Glimpse) are modified inverted files, where space is saved by making the lists of occurrences point to text blocks instead of exact word positions. Despite their existence, little is known about the expected behavior of these "block addressing" indices, and even less is known about how they cope with approximate searching.
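To make the idea of block addressing concrete, the following is a minimal sketch, not the implementation used by Glimpse or by this work. It assumes that the text is split into blocks of a fixed number of words and that a query first consults the index and then scans each candidate block sequentially. The names `tokenize`, `build_block_index`, `search`, and `index_entries` are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words (a simplifying assumption)."""
    return re.findall(r"\w+", text.lower())


def build_block_index(words, block_size):
    """Map each word to the increasing list of block numbers containing it.

    Assumption: a block is a run of `block_size` consecutive words.
    """
    index = defaultdict(list)
    for position, word in enumerate(words):
        block = position // block_size
        blocks = index[word]
        # Store each block at most once per word; this is where space is saved.
        if not blocks or blocks[-1] != block:
            blocks.append(block)
    return dict(index)


def search(words, index, query, block_size):
    """Return exact word positions of `query` by scanning candidate blocks."""
    query = query.lower()
    hits = []
    for block in index.get(query, []):
        start = block * block_size
        # Sequential scan of the block recovers the exact positions.
        for offset, word in enumerate(words[start:start + block_size]):
            if word == query:
                hits.append(start + offset)
    return hits


def index_entries(index):
    """Total number of pointers stored, a rough proxy for space overhead."""
    return sum(len(blocks) for blocks in index.values())


words = tokenize("to be or not to be that is the question to be")
index = build_block_index(words, block_size=4)
positions = search(words, index, "be", block_size=4)
```

In this sketch, a larger block size means fewer stored pointers but more text to scan per candidate block, which is the space-time trade-off in question.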
Our main contribution is an analytical study of the space-time trade-offs for indexed text searching. We study the space overhead and retrieval times as functions of the block size. We find that, under reasonable assumptions, it is possible to build an index which is simultaneously sublinear in space overhead and in query time. This surprising analytical conclusion is validated with extensive experiments, which also yield typical performance figures. These results are valid for classical exact queries as well as for approximate searching.
We apply our analysis to the Web, using recent statistics on the distribution of document sizes. We show that pointing to documents instead of fixed-size blocks reduces space requirements but increases search times.