Top-k Ranked Document Search in General Text Databases
Shane Culpepper, Gonzalo Navarro, Simon Puglisi, and Andrew Turpin
Text search engines return a set of k documents ranked by similarity
to a query. Typically, documents and queries are drawn from natural
language text, which can readily be partitioned into words, allowing
optimizations of data structures and algorithms for ranking.
However, in many new search domains (DNA, multimedia, OCR texts,
Far East languages) there is often no obvious definition of words,
and traditional indexing approaches are not so easily adapted, or
break down entirely.
We present two new algorithms for ranking documents against a query
without making any assumptions about the structure of the underlying
text. We build on existing theoretical techniques, which we have
implemented and compared empirically with the new approaches
introduced in this paper.
Our best approach is significantly faster than existing methods in
RAM, and is even three times faster than a state-of-the-art inverted
file implementation for English text when word queries are issued.