Faster and Smaller Inverted Indices with Treaps
Roberto Konow, Gonzalo Navarro, Charles Clarke and Alejandro
López-Ortíz
We introduce a new representation of the inverted index that performs faster
ranked unions and intersections while using less space. Our index is based on
the treap data structure, which allows us to intersect/merge the document
identifiers while simultaneously thresholding by frequency, instead of the
costlier two-step classical processing methods. To achieve compression we
represent the treap topology using compact data structures. Further, the treap
invariants allow us to elegantly encode differentially both document
identifiers and frequencies. Results show that the space consumption is below
10% of the size of the corpus and the index performs queries up to twice as
fast than previous compact representations, which in addition require more
space. Modern two-stage (massive filtering / detailed ranking) information
retrieval systems would benefit from this boosting of the filtration stage of
the query resolution process, which would free more resources for the ranking
stage, thus enabling more precise results within a given time budget.