Large Text Searching Allowing Errors
Marcio Araújo, Gonzalo Navarro and Nivio Ziviani
We present a full inverted index for exact and approximate string matching
in large texts.
The index is composed of a table containing the vocabulary of words of the text
and a list of positions in the text corresponding to each word.
The size of the table of words is usually much less than 1% of the text size
and hence can be kept in main memory, where most query processing
takes place.
The text, on the other hand, is not accessed at all.
The algorithm permits a large number of variations of the exact and
approximate string
search problem, such as phrases, string matching with sets of
characters (range and arbitrary set of characters, complements, wild cards),
approximate search with nonuniform costs and arbitrary regular expressions.
The whole index can be built in linear time, in a single sequential pass over
the text, takes near 1/3 the space of the text, and
retrieval times are near O (sqrt(n)) for typical cases.
Experimental results show that the algorithm works well in practice:
for a one-gigabyte text collection, all matchings of a phrase
of 3 words allowing up to 1 error can be found
in approximately 6 seconds and allowing no errors can
be found in under half a second.
This index has been implemented in a
software package called Igrep, which is publicly available.
Experiments show that Igrep is much faster than Glimpse in
typical queries.