Fast In-Memory XPath Search over Compressed Text and Tree Indexes
Diego Arroyuelo, Francisco Claude, Sebastian Maneth, Veli Mäkinen,
Gonzalo Navarro, Kim Nguyễn, Jouni Sirén, and Niko
Välimäki
A large fraction of an XML document typically consists
of text data.
The XPath query language allows text search via
the equal, contains, and starts-with predicates.
Such predicates can efficiently be implemented using
a compressed self-index of the document's text nodes.
Most queries, however, contain some parts of querying
the text of the document, plus some parts
of querying the tree structure.
It is therefore a challenge to choose an appropriate
evaluation order for a given query, which optimally leverages
the execution speeds of the text and tree indexes.
Here the SXSI system is introduced; it stores the tree
structure of an XML document using a bit array of
opening and closing brackets, and stores the text nodes of the
document using a global compressed self-index.
On top of these indexes sits an XPath query engine that is
based on tree automata. The engine uses fast counting queries
of the text index in order to dynamically determine whether to
evaluate top-down or bottom-up with respect to
the tree structure. The resulting system has several
advantages over existing systems:
(1) on pure tree queries (without text search) such
as the XPathMark queries, the SXSI system performs on par or
better than the fastest known systems MonetDB and Qizx,
(2) on queries that use text search, SXSI outperforms
the existing systems by 1-3 orders of magnitude
(depending on the size of the result set),
and
(3) with respect to memory consumption, SXSI outperforms
all other systems for counting-only queries.