Efficient Compressed Indexing for Approximate Top-k String Retrieval
Héctor Ferrada and Gonzalo Navarro
Given a collection of strings (called documents),
the {\em top-k document retrieval} problem is
that of, given a string pattern p, finding the k documents
where p
appears
most often. This is a basic task in most information retrieval scenarios. The
best current implementations require 20-30 bits per character (bpc) and
k
to
4k microseconds per query, or 12-24 bpc and 1-10 milliseconds per query.
We introduce a Lempel-Ziv compressed data structure that occupies 5-10 bpc
to answer queries in around k microseconds. The drawback is that the answer
is approximate, but we show that its quality improves asymptotically with the
size of the collection, being over 85% already for patterns of length 4-6 on
rather small collections, and improving for larger ones.