Fast and Simple Character Classes and Bounded Gaps Pattern
Matching, with Application to Protein Searching
Gonzalo Navarro and Mathieu Raffinot
The problem of fast searching of a pattern that contains Classes of characters
and Bounded size Gaps (CBG) in a text has a wide range of applications, among
which a very important one is protein pattern matching (for instance, one
PROSITE protein is associated with the CBG [RK]-x(2,3)-[DE]-x(2,3)-Y,
where the brackets match any of the letters inside, and x(2,3) a gap of
length between 2 and 3). Currently, the only way to search a CBG in a text is
to convert it into a full regular expression (RE). However, an RE is more
sophisticated than a CBG, and searching it with an RE pattern matching
algorithm complicates the search and makes it slow. This is the reason why we
design in this article two new practical CBG matching algorithms, that are much
simpler and faster than all the RE search techniques. The first one looks
exactly once at each text character. The second one does not need to consider
all the text characters and hence it is usually faster than the first one, but
in bad cases may have to read the same text character more than once. We then
propose a criterion based on the form of the CBG to choose a-priori the
fastest between both. We performed many practical experiments using the PROSITE
database, and all them show that our algorithms are the fastest in virtually
all cases.