Self-Indexing Natural Language
Nieves Brisaboa, Antonio Fariña, Gonzalo Navarro,
Ángeles Places, and Eduardo Rodríguez.
Self-indexing is a concept developed for indexing arbitrary
strings. It has been enormously successful to reduce the size ofthe large
indexes typically used on strings, namely suffix trees
and arrays. Self-indexes represent a string in a space close to
its compressed size and provide indexed searching on it. On
natural language, a compressed inverted index over the compressed
text already provides a reasonable alternative, in space and time,
for indexed searching of words and phrases. In this paper we
explore the possibility of regarding natural language text as a
string of words and applying a self-index to it. There are several
challenges involved, such as dealing with a very large alphabet
and detaching searchable content from non-searchable presentation
aspects in the text. As a result, we show that the self-index
requires space very close to that of the best word-based
compressors, and that it obtains better search time than inverted
indexes (using the same overall space) when searching phrases.