This is one of the oldest and most pervasive problems in computer science. Applications requiring some form of string matching can be found virtually everywhere. However, recent years have witnessed a dramatic increase in interest in string matching problems, especially within the rapidly growing communities of information retrieval and computational biology.

Not only are these communities facing a drastic increase in the text sizes they have to manage, but they are demanding more and more sophisticated searches. The patterns of interest are not just simple strings but also include wild cards, gaps, and regular expressions. The definition of a match may also permit slight differences between the pattern and its occurrence in the text. This is called ``approximate matching'' and is especially interesting in text retrieval and computational biology.

The problems arising in this field can be addressed from different viewpoints. In particular, string matching is well known for being amenable to approaches that range from the extremely theoretical to the extremely practical. The theoretical solutions have given rise to important algorithmic achievements, but they are rarely useful in practice: A well-known fact in the community is that simpler ideas work better in practice. Two typical examples are the famous Knuth-Morris-Pratt algorithm, which in practice is twice as slow as the brute force approach, and the well-known Boyer-Moore family, whose most successful members in practice are highly simplified variants of the original proposal.

It is hard, however, to find the simpler ideas in the literature. In most current books on text algorithms, the string matching part covers only the classic theoretical algorithms. There are three reasons for that.

First, the practical algorithms are quite recent, the oldest being barely a decade old. Some developments are too new to have reached the established literature or books. These algorithms are usually based on novel techniques, such as bit-parallelism, which emerged with recent generations of computers.

The second reason is that in this area the theoretical achievements are dissociated from the practical advantages. The algorithmic community is interested in theoretically appealing algorithms, that is, those achieving the best complexities and involving complicated algorithmic concepts. The development community focuses solely on algorithms known to be fast in practice. Neither community pays much attention to what the other does. Only in the last few years have new algorithms emerged that combine aspects of both theory and practice (such as BNDM), and the result has been a new trend of fast and robust string matching algorithms. These new algorithms have also not yet found a place in the established literature.

Finally, the search for extended patterns, of much interest nowadays, is largely unrepresented in the established literature. There are no books dealing with such new search problems as multiple or approximate pattern matching.

These reasons make it extremely difficult to find the correct algorithm if one
is not in the field: The right algorithms exist, but only an expert can find and
recognize them. Consider the case of software practitioners, computational
biologists, researchers, or students who are not directly involved in the field
and are faced with a text searching problem. They are forced to dig into dozens
of articles, most of them of theoretical value but extremely complicated to
implement.
In the end, they get lost in an ocean of choices, without the background
necessary to decide which is best. The typical outcomes of this situation are
*(a)* they implement the simplest approach, which in many cases yields
extremely poor performance and hurts the overall quality of the development; or
*(b)* they make a (normally unfortunate) choice and invest a lot of work in
implementing it, only to obtain a result that in practice is as bad as the
naive approach or even worse.

The aim of our book is to present, for a large class of patterns (strings, sets
of strings, extended strings, and regular expressions), the
existing exact and approximate search approaches, and to cover in depth the
most practical algorithms. By ``practical'' we mean that they are efficient in
practice and that a normal programmer can implement them in a few hours.
Fortunately, these criteria normally coincide in string matching. We focus on
*on-line* searching, which means that we do not build data structures on
the text. Indexed searching, although based on on-line searching, would deserve
a complete volume by itself.

This book is intended to be of use to a large audience. Computer scientists will find everything needed to understand and implement the fastest search algorithms. If they want to go further in studying text algorithms, we give precise links and research references (books, proceedings, articles) to many related problems. Computational biologists will be able to go deeply into the pattern matching field and directly find the simplest and most efficient algorithms for their sequence searches.

We have implemented and experimented with all the algorithms presented in this book. Moreover, some are ours. We give experimental maps whenever possible to help the reader see at a glance the most appropriate algorithms for a particular application.

This book is not a complete survey on text algorithms. This field is too large for a single book. We prefer to focus on a precise topic and present it in detail. We give a list of related recent books in Section 7.2.

We present string matching algorithms according to three general approaches, depending on the way the pattern is searched for in the text.

The first approach consists in reading all the characters in the text one after
the other and at each step updating some variables so as to identify a possible
occurrence. The **Knuth-Morris-Pratt** algorithm is of this kind, as is the
faster **Shift-Or**, which is extensible to more complicated patterns.
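To make this first approach concrete, here is a minimal sketch of **Shift-Or** in Python; the function name and the single-machine-word length limit are our own simplifications:

```python
def shift_or(pattern, text):
    """Exact search: report the starting positions of pattern in text.
    Sketch for patterns that fit in one machine word (here, <= 63 bits)."""
    m = len(pattern)
    if not 0 < m <= 63:
        raise ValueError("pattern length must be between 1 and 63")
    ones = (1 << m) - 1
    # B[c] has bit j CLEARED iff pattern[j] == c (Shift-Or keeps 0 = active)
    B = {}
    for j, c in enumerate(pattern):
        B[c] = B.get(c, ones) & ~(1 << j)
    D = ones  # no pattern prefix recognized yet
    occ = []
    for i, c in enumerate(text):
        D = ((D << 1) | B.get(c, ones)) & ones
        if D & (1 << (m - 1)) == 0:  # bit m-1 active: full match ends here
            occ.append(i - m + 1)
    return occ
```

Shift-And is the dual formulation, in which set bits rather than cleared bits mark the active states.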

The second approach consists in searching for the string *p* in a window that slides
along the text *T*. For every position of this window, we search backwards for
a suffix of the window that matches a suffix of *p*. The **Boyer-Moore**
algorithm uses this approach, but it is generally slower than one of its
simplifications, **Horspool**; and even when it is not, it is outperformed by
algorithms from the other approaches.
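The **Horspool** simplification reduces the Boyer-Moore machinery to a single shift table indexed by the last character of the window. An illustrative Python sketch of ours:

```python
def horspool(pattern, text):
    """Exact search: report the starting positions of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or n < m:
        return []
    # shift[c]: how far the window slides when c is its last character,
    # given by the rightmost occurrence of c in pattern[:-1]
    shift = {c: m - 1 - j for j, c in enumerate(pattern[:-1])}
    occ = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            occ.append(pos)
        # characters absent from the pattern allow the maximal shift m
        pos += shift.get(text[pos + m - 1], m)
    return occ
```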

The third approach is more recent and leads to the most efficient algorithms in
practice for long enough *p*. As with the second approach, the search is done
backward in a window, but this time we search for the longest suffix of the
window that is also a factor of *p*. The first algorithm using this approach was
**BDM**, which, when *p* is short enough, leads to the simpler and more
efficient **BNDM**. For longer patterns, a new algorithm, **BOM**, is
the fastest.
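A sketch of **BNDM**'s backward factor search, in our own illustrative Python form (again assuming the pattern fits in one machine word):

```python
def bndm(pattern, text):
    """Exact search by bit-parallel backward factor matching (BNDM sketch)."""
    m, n = len(pattern), len(text)
    # B[c]: bit j set iff the reversed pattern has c at position j
    B = {}
    for j in range(m):
        c = pattern[m - 1 - j]
        B[c] = B.get(c, 0) | (1 << j)
    occ = []
    pos = 0
    while pos <= n - m:
        j, last = m, m
        D = (1 << m) - 1  # every factor of the pattern is still possible
        while D != 0:
            D &= B.get(text[pos + j - 1], 0)  # read the window backwards
            j -= 1
            if D & (1 << (m - 1)):
                if j > 0:
                    last = j          # a pattern prefix starts here
                else:
                    occ.append(pos)   # the whole window matches the pattern
                    break
            D <<= 1
        pos += last  # safe shift: no occurrence can start before pos + last
    return occ
```

The variable `last` records the position of the longest pattern prefix found inside the window, which yields the safe shift.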

We give an experimental map to easily choose the fastest algorithm according to the length of the pattern and the size of the alphabet.

The three approaches represent a general framework in which the most efficient algorithms fit. There exist other algorithms, for instance, those based on hashing, but they are not efficient enough. We give references to these algorithms in the last section.

All three approaches to searching for a single string extend to searching for a
set of strings. The first leads to the well-known **Aho-Corasick** algorithm and, when
the sum of the pattern lengths, *|P|*, is very small, to the **Multiple
Shift-And** algorithm.
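As an illustration of the first approach for multiple strings, here is a compact **Aho-Corasick** sketch (an illustrative, unoptimized implementation of ours):

```python
from collections import deque

def aho_corasick(patterns, text):
    """Report (end_position, pattern) for every occurrence of every pattern."""
    # Trie over the patterns: goto transitions, output sets, failure links
    goto, out, fail = [{}], [set()], [0]
    for p in patterns:
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({}); out.append(set()); fail.append(0)
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(p)
    # Breadth-first traversal sets the failure links level by level
    queue = deque(goto[0].values())  # depth-1 states keep failure link 0
    while queue:
        s = queue.popleft()
        for c, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            out[t] |= out[fail[t]]  # inherit outputs reachable via failure
    # Scan the text in a single pass
    res, s = [], 0
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        res.extend((i, p) for p in out[s])
    return res
```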

The second one leads to the famous **Commentz-Walter** algorithm, which is
not very efficient in practice. The extension of the **Horspool** algorithm,
**Set Horspool**, is efficient for very small sets on large alphabets. A
last algorithm, **Wu-Manber**, mixes the suffix search approach with a
hashing paradigm and is usually fast in practice.

The third approach permits an extension of **BOM**, the **SBOM** algorithm,
which becomes very efficient as the minimum pattern length grows. Similarly to
**Shift-Or**, **BNDM** leads to **Multiple BNDM** when *|P|* is very
small.

We give experimental maps that permit choosing which algorithm to use depending
on the total pattern size *|P|*, the minimum length, and the size of the
alphabet.

The simplest extension is to permit the pattern to be a sequence of classes (sets) of characters instead of just characters. Any text character in the class will match that pattern position. It is also possible for the classes to appear in the text, not only in the pattern.
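Bit-parallel algorithms accommodate classes almost for free: for each character, one simply sets the bits of every pattern position whose class contains it. A small illustrative Shift-And sketch (names are ours):

```python
def shift_and_classes(pattern, text):
    """Shift-And where each pattern position is a set of characters.
    E.g. [{'a'}, {'b', 'c'}, {'a'}] matches both 'aba' and 'aca'."""
    m = len(pattern)
    # B[c]: bit j set iff c belongs to the class at position j
    B = {}
    for j, cls in enumerate(pattern):
        for c in cls:
            B[c] = B.get(c, 0) | (1 << j)
    D = 0
    occ = []
    for i, c in enumerate(text):
        D = ((D << 1) | 1) & B.get(c, 0)
        if D & (1 << (m - 1)):
            occ.append(i - m + 1)  # start of a full match
    return occ
```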

A second extension is bounded length gaps: Some pattern positions are designated to match any text sequence whose length is between specified minimum and maximum values. This is of interest in computational biology applications, for example, to search for PROSITE patterns.

A third extension is optional and repeatable characters. An optional pattern character may or may not appear in its text occurrence, while a repeatable character may appear one or more times.

Problems arising from these three extensions and combinations thereof can be
solved by adapting **Shift-Or** or **BNDM**. Both algorithms involve
bit-parallelism to simulate a nondeterministic automaton that finds all the
pattern occurrences (see Section 1.3). In this case we have more
complex automata, and the core of the problem is finding a way to simulate them.
Extending **Shift-Or** leads to an algorithm unable to
skip text characters but whose efficiency is unaffected by the complexity of
the pattern. Extending **BNDM**, on the other hand, is normally faster, but
its efficiency is affected by the minimum length of an occurrence,
the alphabet size, and the sizes of the classes and gaps.
No classical algorithm can be extended so easily while retaining the same efficiency.

Finally, we show that a small set of short strings can be searched for using a similar approach, and give references to other theoretical algorithms that search specific kinds of extended strings.

Searching for a regular expression is a multistage process. First, we need to parse it so as to obtain a more workable tree representation. We show at the end of Chapter 5 how to do this. We then use the tree representation throughout the chapter.

The second stage is to build a nondeterministic finite automaton (NFA) from the pattern. The NFA is a state machine whose set of active states changes as we read text characters; an occurrence is recognized whenever a state designated as ``final'' becomes active. There are two interesting ways to obtain an NFA from a regular expression. Thompson's algorithm obtains an NFA whose number of transitions is proportional to the length of the regular expression and which satisfies some regularity properties of interest. Glushkov's algorithm produces an NFA that has the minimum number of states, along with other interesting regularities.
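As a tiny illustration of how an NFA is simulated with a set of active states, consider a hand-built Glushkov-style automaton for the regular expression a b* a; the transition table below is our own toy example, not the output of either construction:

```python
# One state per character position of "a b* a", plus the initial state 0.
DELTA = {
    (0, 'a'): {1},
    (1, 'b'): {2}, (1, 'a'): {3},
    (2, 'b'): {2}, (2, 'a'): {3},
}
FINAL = {3}

def nfa_search(delta, final, text):
    """Report the text positions where an occurrence ends.
    State 0 is kept active so occurrences may start anywhere."""
    active = {0}
    occ = []
    for i, c in enumerate(text):
        active = {q for s in active for q in delta.get((s, c), ())} | {0}
        if active & final:
            occ.append(i)
    return occ
```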

The NFA can be used directly for searching (we call this algorithm
**NFAThompson**), but this is slow because many states
can be active at any time. It can also be converted into a deterministic finite automaton (DFA), which has
only one active state at a time. The DFA is appealing for text searching
and is used in one of the most classical algorithms for regular expression
searching. We call this algorithm **DFAClassical**. Its main problem is that
the size of the DFA can be exponential in that of the NFA, which makes the
approach workable only for small patterns. For longer patterns, a hybrid approach
that we dub **DFAModules** builds an NFA of small DFAs and retains
reasonable efficiency.

Another trend is to simulate the NFAs using bit-parallelism instead of
converting them to DFAs. We present two relatively new approaches,
**BPThompson** and **BPGlushkov**, which are based on simulating the respective
NFAs using their specific properties. We show that **BPGlushkov** should
always be preferred over **BPThompson**.

A third approach, also novel, permits skipping text characters. The algorithm
**MultiStringRE** computes the minimum length *lmin* of an occurrence of
the regular expression and collects all the prefixes (of that length) of all
the occurrences. It then conducts a multistring search (Chapter 2) for all those
strings. When one such prefix is found, it tries to complete the occurrence.
An extension of it, **MultiFactRE**, selects a set of strings of length
*lmin* such that some of these strings must appear inside any occurrence
(the set of prefixes is just one option). Finally, **RegularBNDM** extends
**BNDM** by simulating Glushkov's NFA.

Choosing the best algorithm is complex and depends on the structure of the regular expression. We give simple criteria based on properties of the pattern to decide which algorithm to use.

Approximate matching is modeled using a distance function that tells how similar
two strings are. We are given the pattern and a threshold *k*, which
is the maximum allowed distance between the pattern and its occurrences.
In this chapter we concentrate on the Levenshtein (or edit) distance, which is the minimum
number of character insertions, deletions, and substitutions needed to make both
strings equal. Many applications use variants of this distance.

We divide the existing algorithms into four types. The first is based on dynamic programming. This is the oldest approach and still the most flexible one to deal with distances other than edit distance. However, algorithms of this kind are not among the most efficient.
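The core of the dynamic programming approach can be sketched as one column update per text character, in the classic formulation where row 0 stays at zero so that occurrences may start anywhere (the function name is ours):

```python
def dp_search(pattern, text, k):
    """Report end positions of approximate occurrences of pattern in text,
    with at most k errors under edit distance."""
    m = len(pattern)
    # d[i] = minimum edit distance between pattern[:i] and some suffix of
    # the text read so far; d[0] stays 0: a match may start anywhere
    d = list(range(m + 1))
    occ = []
    for pos, c in enumerate(text):
        diag = d[0]  # value of d[i-1] from the previous column
        for i in range(1, m + 1):
            diag, d[i] = d[i], min(d[i] + 1,      # text character unmatched
                                   d[i - 1] + 1,  # pattern character unmatched
                                   diag + (pattern[i - 1] != c))  # subst/match
        if d[m] <= k:
            occ.append(pos)
    return occ
```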

The second type of algorithm reduces the problem to an NFA search: the
automaton is built as a function of the pattern and *k* and then
made deterministic. The resulting algorithms behave reasonably
well on short patterns, but they are not as fast as newer techniques.

Bit-parallelism is the third approach, and it yields many of the most successful
results. The algorithms **BPR** and **BPD** simulate the NFA, while
**BPM** simulates the dynamic programming algorithm. **BPM** and **BPD** are the
most efficient of the class, but **BPR** is more flexible and can be
adapted to more complex patterns.
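A hedged sketch of the row-wise bit-parallel simulation in the style of **BPR** (our own illustrative version): row *i* tracks the pattern prefixes matched with at most *i* errors, and each text character updates all rows.

```python
def bpr(pattern, text, k):
    """Approximate search simulating k+1 rows of the NFA with bit masks."""
    m = len(pattern)
    mask = (1 << m) - 1
    B = {}
    for j, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << j)
    R = [(1 << i) - 1 for i in range(k + 1)]  # row i allows i initial deletions
    occ = []
    for pos, c in enumerate(text):
        b = B.get(c, 0)
        old = R[0]
        R[0] = ((R[0] << 1) | 1) & b
        for i in range(1, k + 1):
            tmp = R[i]
            R[i] = ((((R[i] << 1) & b)           # match a character
                     | old                       # insertion in the text
                     | ((old | R[i - 1]) << 1)   # substitution / deletion
                     | 1) & mask)
            old = tmp
        if R[k] & (1 << (m - 1)):
            occ.append(pos)
    return occ
```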

Finally, the fourth approach is filtration. A fast algorithm is used to discard
large text areas that cannot contain a match, and another (nonfiltration)
algorithm is used to check the remaining text areas. These algorithms are among
the fastest, but their efficiency degrades quickly as
*k* becomes large compared to the pattern length *m*.

Among the many filtration algorithms, we present the two most efficient ones.
**PEX** splits the pattern into *k+1* pieces and resorts to a multistring
search for them, as at least one piece must appear unaltered in any occurrence.
**ABNDM** is an extension of **BNDM** that simulates the NFA of approximate
searching.
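The filtration principle behind **PEX** can be illustrated as follows (our own simplified version, using a naive exact search for the pieces and dynamic programming for verification):

```python
def pex(pattern, text, k):
    """Filtration sketch: if pattern occurs with <= k errors, at least one of
    k+1 disjoint pieces occurs unaltered. Search the pieces exactly, then
    verify the candidate text areas. Assumes len(pattern) >= k + 1."""
    m, n = len(pattern), len(text)

    def verify(window):
        # end positions in window matching the whole pattern with <= k errors
        d = list(range(m + 1))
        ends = []
        for pos, c in enumerate(window):
            diag = d[0]
            for i in range(1, m + 1):
                diag, d[i] = d[i], min(d[i] + 1, d[i - 1] + 1,
                                       diag + (pattern[i - 1] != c))
            if d[m] <= k:
                ends.append(pos)
        return ends

    # split the pattern into k+1 pieces of roughly equal length
    cuts = [i * m // (k + 1) for i in range(k + 2)]
    hits = set()
    for p in range(k + 1):
        off, piece = cuts[p], pattern[cuts[p]:cuts[p + 1]]
        start = text.find(piece)
        while start != -1:
            # verify a window of width m + 2k around the candidate
            lo = max(0, start - off - k)
            hi = min(n, start - off + m + k)
            hits.update(lo + e for e in verify(text[lo:hi]))
            start = text.find(piece, start + 1)
    return sorted(hits)
```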

We present an experimental map comparing these algorithms. In general, filtration
approaches work better for low *k/m* values. **ABNDM** is best for small
alphabet sizes (such as DNA) while **PEX** is best for larger alphabets (such as proteins or
natural language). For larger *k/m* values, and also to verify the text
areas that the filters cannot discard, the best algorithms are the bit-parallel
ones.

There are some developments for approximate searching of other types of
patterns. For multiple pattern matching with errors, the main algorithms
are **MultiHash**, which works only for *k=1* but is efficient even when the
number of patterns is large; **MultiPEX**, which takes *k+1* strings from each
pattern and is the most efficient choice for low *k/m* values; and
**MultiBP**, which superimposes the NFAs of all the patterns and uses the
result as a filter, this being the best choice for intermediate *k/m* values.

For matching extended strings and regular expressions with errors,
there are a few approaches: one based on dynamic programming for
regular expressions, one based on an NFA of DFAs permitting errors,
and a bit-parallel one based on **BPR**. This last one is the most
attractive because of the combination of simplicity and efficiency it offers.