Word-based Self-Indexes for Natural Language Text
Antonio Fariña, Nieves Brisaboa, Gonzalo Navarro, Francisco Claude,
Ángeles Places, and Eduardo Rodríguez
The inverted index supports efficient full-text searches on natural language
text collections. It requires some extra space over the compressed text that
can be traded for search speed. It is usually fast for single-word searches,
yet phrase searches require more expensive intersections. In this article we
introduce a different kind of index. It replaces the text, using essentially
the same space required by the compressed text alone (a compression ratio
around 35%). Within this space it supports not only decompression of arbitrary
passages, but also efficient word and phrase searches. Searches are orders of
magnitude faster than those over inverted indexes when looking for phrases,
and still faster on single-word searches when little space is available. Our new
indexes are particularly fast at counting the occurrences of words or
phrases. This is useful for computing relevance of words or phrases.
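The expense of phrase searches on inverted indexes can be seen in a minimal sketch (using a hypothetical toy collection, not the article's data structures): every word of the phrase contributes a posting list of (document, position) pairs, and the lists must be intersected position by position.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the list of (doc, position) pairs where it occurs."""
    index = defaultdict(list)
    for d, text in enumerate(docs):
        for p, word in enumerate(text.lower().split()):
            index[word].append((d, p))
    return index

def phrase_search(index, phrase):
    """Locate a phrase by intersecting posting lists: a candidate start
    (doc, p) survives only if word i of the phrase occurs at (doc, p + i)."""
    words = phrase.lower().split()
    candidates = index[words[0]]
    for i, word in enumerate(words[1:], start=1):
        nxt = set(index[word])
        candidates = [(d, p) for (d, p) in candidates if (d, p + i) in nxt]
    return candidates

docs = ["the rose rose", "a rose is a rose is a rose"]
idx = build_inverted_index(docs)
print(phrase_search(idx, "rose is a rose"))  # → [(1, 1), (1, 4)]
```

Each additional phrase word forces another pass over the surviving candidates, which is why phrase queries cost noticeably more than single-word lookups on this kind of index.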
We adapt self-indexes, which have succeeded in indexing arbitrary strings
within compressed space, to deal with large alphabets. Natural language texts are then
regarded as sequences of words, not characters, to achieve word-based
self-indexes. We design an architecture that separates the searchable
sequence from its presentation aspects. This permits applying case
folding, stemming, stopword removal, etc., as is usual with inverted indexes.
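The separation of the searchable sequence from its presentation can be sketched as follows (a hypothetical illustration under simple assumptions, not the article's implementation): the text is tokenized once, the searchable layer keeps only normalized word identifiers, and a parallel presentation layer retains what is needed to reproduce the original text exactly.

```python
import re

# Hypothetical stopword set for illustration only.
STOPWORDS = {"the", "a", "an", "of", "is"}

def split_layers(text):
    """Split a text into a searchable sequence of normalized word IDs
    (case-folded, stopwords dropped) and a presentation layer that
    preserves the original tokens, including case and separators."""
    tokens = re.findall(r"\w+|\W+", text)   # alternating words and separators
    vocab, searchable, presentation = {}, [], []
    for tok in tokens:
        presentation.append(tok)            # exact original form
        if tok.strip() and tok[0].isalnum():
            norm = tok.lower()              # case folding
            if norm in STOPWORDS:           # stopword removal
                continue
            wid = vocab.setdefault(norm, len(vocab))
            searchable.append(wid)
    return vocab, searchable, presentation

vocab, seq, pres = split_layers("The Rose is a rose")
# Both occurrences of "rose" share one word ID in the searchable layer,
# while "".join(pres) reconstructs the original text exactly.
```

Searches then run over the compact normalized sequence, while decompression of a passage consults the presentation layer, mirroring the normalization options traditionally offered by inverted indexes.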