Fast and Simple Character Classes and Bounded Gaps
Pattern Matching, with Applications to Protein Searching.
Gonzalo Navarro and Mathieu Raffinot
The problem of fast exact and approximate searching for a pattern
that contains Classes of characters and Bounded size Gaps (CBG) in a
text has a wide range of applications, among which a very important
one is protein pattern matching (for instance, one PROSITE protein
site is associated with the CBG [RK]-x(2,3)-[DE]-x(2,3)-Y, where
the brackets match any of the letters inside, and $x(2,3)$ a gap of
length between 2 and 3). Currently, the only way to search for a
CBG in a text is to convert it into a full regular expression (RE).
However, a RE is more sophisticated than a CBG, and searching for it
with a RE pattern matching algorithm complicates the search and
makes it slow. This is the reason why we design in this article two
new practical CBG matching algorithms that are much simpler and
faster than all the RE search techniques. The first one looks
exactly once at each text character. The second one does not need to
consider all the text characters and hence it is usually faster than
the first one, but in bad cases may have to read the same text
character more than once. We then propose a criterion based on the
form of the CBG to choose a-priori the fastest between both.
We also show how to search permitting a few mistakes in the
occurrences. We performed many practical experiments using the
PROSITE database, and all them show that our algorithms are the
fastest in virtually all cases.