Compressing Huffman Models on Large Alphabets
Gonzalo Navarro and Alberto Ordóñez
A naive storage of a Huffman model on a text of length n over an alphabet of size s
requires O(s log n) bits. This can be reduced to s log s + O(s) bits using canonical
codes. This overhead over the entropy can be significant when s is comparable to n,
and it also dictates the amount of main memory required to compress or decompress.
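
As a concrete illustration (a minimal sketch in Python, not the scheme proposed in this
paper, with illustrative names and a toy input), a canonical-code model can be stored as
the symbols sorted by codeword length plus the number of codewords of each length: the
permutation accounts for the s log s bits and the per-length counts for the O(s) bits.
The sketch below builds such a model and checks that it suffices to encode and decode.

from collections import Counter
import heapq

def huffman_lengths(freqs):
    # Return {symbol: codeword length} for the given frequency table.
    if len(freqs) == 1:                      # degenerate single-symbol alphabet
        return {next(iter(freqs)): 1}
    heap = [(f, i, (sym,)) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in freqs}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, group1 = heapq.heappop(heap)
        f2, _, group2 = heapq.heappop(heap)
        for sym in group1 + group2:          # every merge deepens the merged leaves
            lengths[sym] += 1
        heapq.heappush(heap, (f1 + f2, tie, group1 + group2))
        tie += 1
    return lengths

def canonical_model(lengths):
    # Model: symbols sorted by (length, symbol), plus #codewords of each length.
    order = sorted(lengths, key=lambda sym: (lengths[sym], sym))
    per_length = [0] * (max(lengths.values()) + 1)
    for sym in order:
        per_length[lengths[sym]] += 1
    return order, per_length

def canonical_codes(order, per_length):
    # Canonical assignment: consecutive codewords within each length.
    codes, code, i = {}, 0, 0
    for length, count in enumerate(per_length):
        for _ in range(count):
            codes[order[i]] = format(code, '0{}b'.format(length))
            code += 1
            i += 1
        code <<= 1                           # first codeword of the next length
    return codes

def decode(bits, order, per_length):
    # Decoding needs only the model (order, per_length), not the explicit codes.
    inverse = {c: sym for sym, c in canonical_codes(order, per_length).items()}
    out, current = [], ''
    for b in bits:
        current += b
        if current in inverse:               # prefix-freeness: first hit is a codeword
            out.append(inverse[current])
            current = ''
    return ''.join(out)

text = "abracadabra"                         # toy input standing in for a long text
order, per_length = canonical_model(huffman_lengths(Counter(text)))
codes = canonical_codes(order, per_length)
encoded = ''.join(codes[c] for c in text)
assert decode(encoded, order, per_length) == text
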
We design an encoding scheme that requires O(s log log n) bits in the
worst case, and typically less, while supporting encoding and decoding of
symbols in O(log log n) time. We show that, on various real-life sequences over large
alphabets, our technique reduces the storage size of the model obtained with current
techniques to around 15%, while still offering reasonable compression/decompression
times.
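
To see where the O(s log log n) figure comes from, note that a canonical code is fully
determined by the codeword length of each symbol, under the convention that codewords of
equal length are assigned in increasing symbol order, and that Huffman codeword lengths
on a text of length n are O(log n), so each length fits in O(log log n) bits. The Python
sketch below only illustrates this space bound; it is not the scheme of the paper, which
typically uses less space and additionally supports O(log log n)-time encoding and
decoding. The lengths used at the end are a hypothetical example.

import math

def pack_lengths(lengths):
    # One fixed-width cell per symbol; width = ceil(log2(max length + 1)),
    # i.e. O(log log n) bits, since Huffman lengths are O(log n).
    width = max(1, math.ceil(math.log2(max(lengths) + 1)))
    packed = 0
    for i, l in enumerate(lengths):
        packed |= l << (i * width)
    return packed, width

def unpack_length(packed, width, i):
    # Read back the code length of symbol i.
    return (packed >> (i * width)) & ((1 << width) - 1)

def codes_from_lengths(lengths):
    # Rebuild the canonical codewords from the lengths alone.
    order = sorted(range(len(lengths)), key=lambda i: (lengths[i], i))
    codes, code, prev = [None] * len(lengths), 0, 0
    for i in order:
        code <<= lengths[i] - prev           # jump to the first codeword of this length
        codes[i] = format(code, '0{}b'.format(lengths[i]))
        code += 1
        prev = lengths[i]
    return codes

lengths = [1, 3, 3, 3, 3]                    # hypothetical lengths; they satisfy Kraft's equality
packed, width = pack_lengths(lengths)
assert all(unpack_length(packed, width, i) == l for i, l in enumerate(lengths))
assert codes_from_lengths(lengths) == ['0', '100', '101', '110', '111']
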