Implicit Indexing of Natural Language Text by Reorganizing Bytecodes
Nieves Brisaboa, Antonio Fariña, Susana Ladra, and Gonzalo Navarro.
Word-based byte-oriented compression has succeeded on large
natural language text databases, by providing competitive compression ratios,
fast random access, and direct sequential searching. We show that by just
rearranging the target symbols of the compressed text into a tree-shaped
structure, and using negligible additional space, we obtain a new
implicitly indexed representation of the compressed text, where search times
are drastically improved. The occurrences of a word can be listed directly,
without any text scanning, and in general any inverted-index-like capability,
such as efficient phrase searches, can be emulated without storing any
inverted list information.
We experimentally show that our proposal performs not only much more
efficiently than sequential searches over compressed text, but also than
explicit inverted indexes and other types of indexes, when using little
extra space. Our representation is especially successful when searching for
single words and short phrases.