Smaller Self-Indexes for Natural Language
Alberto Ordóñez, Nieves Brisaboa, and Gonzalo Navarro
Self-indexes for natural-language texts, where these are regarded as token
(word or separator) sequences, achieve very attractive space and search time.
However, they suffer from a space penalty due to their large vocabulary.
In this paper we show that replacing the Huffman encoding they implicitly
use with the slightly weaker Hu-Tucker encoding, which respects the lexical
order of the vocabulary, improves both their space and their search time.
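To illustrate the trade-off the abstract describes, the sketch below compares the cost (total weighted codeword length) of an optimal Huffman code against an optimal alphabetic code, i.e. one whose codewords preserve the lexical order of the symbols, as a Hu-Tucker code does. This is not the paper's implementation: the alphabetic optimum is computed here with a simple O(n^3) dynamic program, whereas the Hu-Tucker algorithm reaches the same optimum in O(n log n), and the word frequencies are hypothetical.

```python
import heapq

def huffman_cost(freqs):
    """Total weighted codeword length of an optimal (Huffman) prefix code."""
    heap = list(freqs)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b  # each merge pushes the a+b leaves one level deeper
        heapq.heappush(heap, a + b)
    return total

def alphabetic_cost(freqs):
    """Optimal order-preserving (alphabetic) prefix code cost, O(n^3) DP.
    A Hu-Tucker code achieves this same optimum in O(n log n)."""
    n = len(freqs)
    prefix = [0] * (n + 1)
    for i, f in enumerate(freqs):
        prefix[i + 1] = prefix[i] + f
    cost = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            w = prefix[j + 1] - prefix[i]  # total weight of leaves i..j
            cost[i][j] = w + min(cost[i][k] + cost[k + 1][j]
                                 for k in range(i, j))
    return cost[0][n - 1]

# Hypothetical word frequencies, listed in lexical order of the words.
freqs = [5, 1, 3, 7, 2]
print(huffman_cost(freqs))     # → 38
print(alphabetic_cost(freqs))  # → 40
```

The alphabetic code is at most slightly worse in total length (38 vs. 40 bits here), which is the "slightly weaker" penalty the abstract refers to; in exchange, the codewords sort in the same order as the words themselves.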