This is one of the oldest and most pervasive problems in computer science. Applications requiring some form of string matching can be found virtually everywhere. However, recent years have witnessed a dramatic increase in interest in string matching problems, especially within the rapidly growing communities of information retrieval and computational biology.
Not only are these communities facing a drastic increase in the text sizes they have to manage, but they are demanding more and more sophisticated searches. The patterns of interest are not just simple strings but also include wild cards, gaps, and regular expressions. The definition of a match may also permit slight differences between the pattern and its occurrence in the text. This is called ``approximate matching'' and is especially interesting in text retrieval and computational biology.
The problems arising in this field can be addressed from different viewpoints. In particular, string matching is well known for being amenable to approaches that range from the extremely theoretical to the extremely practical. The theoretical solutions have given rise to important algorithmic achievements, but they are rarely useful in practice: A well-known fact in the community is that simpler ideas work better in practice. Two typical examples are the famous Knuth-Morris-Pratt algorithm, which in practice is twice as slow as the brute force approach, and the well-known Boyer-Moore family, whose most successful members in practice are highly simplified variants of the original proposal.
It is hard, however, to find the simpler ideas in the literature. In most current books on text algorithms, the string matching part covers only the classic theoretical algorithms. There are three reasons for that.
First, the practical algorithms are quite recent, the oldest being barely a decade old, and many developments are too new to have reached the established literature or books. These algorithms are usually based on new techniques, such as bit-parallelism, that appeared with the recent generations of computers.
The second reason is that in this area the theoretical achievements are dissociated from practical performance. The algorithmic community is interested in theoretically appealing algorithms, that is, those achieving the best complexities and involving sophisticated algorithmic concepts. The development community focuses solely on algorithms known to be fast in practice. Neither community pays much attention to what the other does. Only in the last few years have new algorithms emerged that combine aspects of both theory and practice (such as BNDM), and the result has been a new trend of fast and robust string matching algorithms. These new algorithms have also not yet found a place in the established literature.
Finally, the search for extended patterns, of much interest nowadays, is largely unrepresented in the established literature. There are no books dealing with such new search problems as multiple or approximate pattern matching.
These reasons make it extremely difficult to find the right algorithm if one is not in the field: The right algorithms exist, but only an expert can find and recognize them. Consider the case of software practitioners, computational biologists, researchers, or students who are not directly involved in the field and are faced with a text searching problem. They are forced to dig into dozens of articles, most of them of theoretical value and extremely complicated to implement, and they end up lost in an ocean of choices, without the background needed to decide which is best. The typical outcomes of this situation are (a) they implement the simplest approach, which usually yields extremely poor performance and degrades the overall quality of the development; or (b) they make a (normally unfortunate) choice and invest a lot of work in implementing it, only to obtain a result that in practice is as bad as the naive approach or even worse.
The aim of our book is to present, for a large class of patterns (strings, sets of strings, extended strings, and regular expressions), the existing exact and approximate search approaches, and to present in depth the most practical algorithms. By ``practical'' we mean that they are efficient in practice and that a normal programmer can implement them in a few hours. Fortunately, these criteria normally coincide in string matching. We focus on on-line searching, which means that we do not build data structures on the text. Indexed searching, although based on on-line searching, deserves a complete volume by itself.
This book is intended to be of use to a large audience. Computer scientists will find everything needed to understand and implement the fastest search algorithms. For those who want to go further into text algorithms, we give precise links and research references (books, proceedings, articles) to many related problems. Computational biologists will be able to go deeply into the pattern matching field and find directly the simplest and most efficient algorithms for their sequence searches.
We have implemented and experimented with all the algorithms presented in this book. Moreover, some are ours. We give experimental maps whenever possible to help the reader see at a glance the most appropriate algorithms for a particular application.
This book is not a complete survey on text algorithms. This field is too large for a single book. We prefer to focus on a precise topic and present it in detail. We give a list of related recent books in Section 7.2.
We present string matching algorithms according to three general approaches, depending on the way the pattern is searched for in the text.
The first approach consists in reading all the characters in the text one after the other and at each step updating some variables so as to identify a possible occurrence. The Knuth-Morris-Pratt algorithm is of this kind, as is the faster Shift-Or, which is extensible to more complicated patterns.
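To make this first approach concrete, here is a minimal Python sketch of Shift-Or (ours, for illustration; the book presents the algorithm in full): each bit of the computer word D tracks one prefix of the pattern, and an occurrence is detected when the bit corresponding to the whole pattern becomes zero.

    def shift_or(pattern, text):
        # Bit-parallel Shift-Or search: yields the starting position of
        # every occurrence of pattern in text. Assumes a nonempty pattern.
        m = len(pattern)
        all_ones = (1 << m) - 1
        # B[c] has a 0 bit at position i exactly when pattern[i] == c.
        B = {}
        for i, c in enumerate(pattern):
            B[c] = B.get(c, all_ones) & ~(1 << i)
        D = all_ones  # bit i is 0 when pattern[0..i] matches a suffix of the text read so far
        for pos, c in enumerate(text):
            D = ((D << 1) | B.get(c, all_ones)) & all_ones
            if D & (1 << (m - 1)) == 0:  # the whole pattern was recognized
                yield pos - m + 1

Python integers are unbounded, so the usual restriction of the pattern length to the machine word size disappears here, at some cost in speed.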
The second approach consists in searching for the string p inside a window that slides along the text T. For each window position, the window is read backwards in search of a suffix of it that matches a suffix of p. The Boyer-Moore algorithm uses this approach, but in practice it is generally slower than one of its simplifications, Horspool; and where it is not, algorithms from the other approaches are faster still.
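The following Python sketch of Horspool (again ours, for illustration) shows how little machinery the algorithm needs: the window is shifted using only the text character aligned with the last pattern position.

    def horspool(pattern, text):
        # Horspool search: yields every starting position of pattern in text.
        # Assumes a nonempty pattern.
        m, n = len(pattern), len(text)
        # shift[c]: how far the window may slide when c is its last character,
        # derived from the rightmost occurrence of c in pattern[0..m-2].
        shift = {}
        for i in range(m - 1):
            shift[pattern[i]] = m - 1 - i
        pos = 0
        while pos <= n - m:
            if text[pos:pos + m] == pattern:  # verify the window
                yield pos
            # Characters absent from the pattern allow a shift by the full length m.
            pos += shift.get(text[pos + m - 1], m)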
The third approach is more recent and leads to the most efficient algorithms in practice for long enough p. As with the second approach, the search is done backward in a window, but this time we search for the longest suffix of the window that is also a factor of p. The first algorithm using this approach was BDM, which, when p is short enough, leads to the simpler and more efficient BNDM. For longer patterns, a new algorithm, BOM, is the fastest.
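A Python sketch of BNDM (ours, simplified) conveys the idea: a bit-parallel automaton recognizes the factors of the reversed pattern while the window is read backwards, and the position of the last pattern prefix seen determines how far the window can be shifted.

    def bndm(pattern, text):
        # BNDM search: yields every starting position of pattern in text.
        m, n = len(pattern), len(text)
        mask = (1 << m) - 1
        # B[c]: bit (m-1-i) is set when pattern[i] == c (the pattern is stored reversed).
        B = {}
        for i, c in enumerate(pattern):
            B[c] = B.get(c, 0) | (1 << (m - 1 - i))
        pos = 0
        while pos <= n - m:
            j, last = m, m
            D = mask  # every pattern factor is still plausible
            while D:
                D &= B.get(text[pos + j - 1], 0)  # read one window character backwards
                j -= 1
                if D & (1 << (m - 1)):  # a pattern prefix starts at window position j
                    if j > 0:
                        last = j  # remember it: the next window may begin here
                    else:
                        yield pos  # the whole window matched the pattern
                D = (D << 1) & mask
            pos += last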
We give an experimental map to easily choose the fastest algorithm according to the length of the pattern and the size of the alphabet.
The three approaches represent a general framework in which the most efficient algorithms fit. There exist other algorithms, for instance, those based on hashing, but they are not efficient enough. We give references to these algorithms in the last section.
All three approaches to searching for a single string extend to searching for a set of strings. The first leads to the well-known Aho-Corasick algorithm and, when the sum of the pattern lengths, |P|, is very small, to the Multiple Shift-And algorithm.
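A compact Python sketch of Aho-Corasick (ours, for illustration): a trie of the patterns is extended with failure links, after which the text is scanned in a single pass.

    from collections import deque

    def aho_corasick(patterns, text):
        # Yields (start_position, pattern) for every occurrence of any pattern.
        goto = [{}]  # goto[s][c]: trie transition from state s by character c
        out = [[]]   # out[s]: patterns recognized upon reaching state s
        for pat in patterns:
            s = 0
            for c in pat:
                if c not in goto[s]:
                    goto.append({})
                    out.append([])
                    goto[s][c] = len(goto) - 1
                s = goto[s][c]
            out[s].append(pat)
        # Failure links, computed breadth-first over the trie.
        fail = [0] * len(goto)
        queue = deque(goto[0].values())
        while queue:
            s = queue.popleft()
            for c, t in goto[s].items():
                queue.append(t)
                f = fail[s]
                while f and c not in goto[f]:
                    f = fail[f]
                fail[t] = goto[f].get(c, 0)
                out[t] += out[fail[t]]  # inherit occurrences ending at the failure state
        # Scan the text, following failure links on mismatches.
        s = 0
        for i, c in enumerate(text):
            while s and c not in goto[s]:
                s = fail[s]
            s = goto[s].get(c, 0)
            for pat in out[s]:
                yield (i - len(pat) + 1, pat)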
The second approach leads to the famous Commentz-Walter algorithm, which is not very efficient in practice. The extension of the Horspool algorithm, Set Horspool, is efficient for very small sets on large alphabets. A last algorithm, Wu-Manber, mixes the suffix search approach with a hashing paradigm and is usually fast in practice.
The third approach permits an extension of BOM, the SBOM algorithm, which becomes very efficient as the minimum pattern length grows. Similarly to Shift-Or, BNDM leads to Multiple BNDM when |P| is very small.
We give experimental maps that permit choosing which algorithm to use depending on the total pattern size |P|, the minimum length, and the size of the alphabet.
The simplest extension is to permit the pattern to be a sequence of classes (sets) of characters instead of just characters. Any text character in the class will match that pattern position. It is also possible for the classes to appear in the text, not only in the pattern.
A second extension is bounded length gaps: Some pattern positions are designated to match any text sequence whose length is between specified minimum and maximum values. This is of interest in computational biology applications, for example, to search for PROSITE patterns.
A third extension is optional and repeatable characters. An optional pattern character may or may not appear in its text occurrence, while a repeatable character may appear one or more times.
Problems arising from these three extensions and combinations thereof can be solved by adapting Shift-Or or BNDM. Both algorithms use bit-parallelism to simulate a nondeterministic automaton that finds all the pattern occurrences (see Section 1.3). In this case the automata are more complex, and the core of the problem is finding a way to simulate them. Extending Shift-Or leads to an algorithm that cannot skip text characters but whose efficiency is unaffected by the complexity of the pattern. Extending BNDM, on the other hand, normally gives a faster algorithm, but its efficiency is affected by the minimum length of an occurrence, the alphabet size, and the sizes of the classes and gaps. No classical algorithm can be extended so easily while retaining this efficiency.
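As a small taste of how naturally bit-parallelism absorbs classes of characters, here is a Python sketch (ours) of Shift-And where every pattern position is a set of characters; compared with the plain algorithm, only the construction of the B table changes.

    def shift_and_classes(pattern, text):
        # pattern is a list of character classes, e.g. [{'a'}, {'b', 'B'}, {'c'}].
        # Yields the starting position of every occurrence in text.
        m = len(pattern)
        B = {}
        for i, cls in enumerate(pattern):
            for c in cls:  # every character of the class activates bit i
                B[c] = B.get(c, 0) | (1 << i)
        D = 0
        for pos, c in enumerate(text):
            D = ((D << 1) | 1) & B.get(c, 0)
            if D & (1 << (m - 1)):
                yield pos - m + 1

    # Example: list(shift_and_classes([{'a'}, {'b', 'B'}, {'c'}], "xaBc")) == [1]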
Finally, we show that a small set of short strings can be searched for using a similar approach, and give references to other theoretical algorithms that search specific kinds of extended strings.
Searching for a regular expression is a multistage process. First, we need to parse it so as to obtain a more workable tree representation. We show at the end of Chapter 5 how to do this. We then use the tree representation throughout the chapter.
The second stage is to build a nondeterministic finite automaton (NFA) from the pattern. An NFA is a state machine in which several states may be active at once; the set of active states changes as we read text characters, and an occurrence is recognized whenever a state designated as ``final'' becomes active. There are two interesting ways to obtain an NFA from a regular expression. Thompson's algorithm obtains an NFA whose number of transitions is proportional to the length of the regular expression and which satisfies some regularity properties of interest. Glushkov's algorithm produces an NFA that has the minimum number of states and other interesting regularities.
The NFA can be used directly for searching (we call this algorithm NFAThompson), but this is slow because many states can be active at the same time. The NFA can also be converted into a deterministic finite automaton (DFA), which has only one active state at a time. The DFA is appealing for text searching and is used in one of the most classical algorithms for regular expression searching, which we call DFAClassical. Its main problem is that the size of the DFA can be exponential in that of the NFA, which makes the approach workable only for small patterns. On longer patterns, a hybrid approach that we dub DFAModules builds an NFA of small DFAs and retains reasonable efficiency.
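The conversion is the textbook subset construction. A minimal Python sketch (ours, on a small hand-built, epsilon-free NFA) shows both the mechanism and where the exponential blow-up comes from: every reachable set of NFA states becomes one DFA state.

    def determinize(nfa, start, finals, alphabet):
        # nfa[s][c] is the set of NFA states reachable from s by character c.
        # Returns the DFA table, its start state, and its final states;
        # DFA states are frozensets of NFA states (up to 2^n of them).
        dfa, todo = {}, [frozenset([start])]
        while todo:
            S = todo.pop()
            if S in dfa:
                continue
            dfa[S] = {}
            for c in alphabet:
                T = frozenset(t for s in S for t in nfa.get(s, {}).get(c, ()))
                dfa[S][c] = T
                todo.append(T)
        return dfa, frozenset([start]), {S for S in dfa if S & finals}

    # NFA searching for "ab" over {a, b}: state 0 loops on every character,
    # and 0 -a-> 1 -b-> 2 (final) spells out the pattern.
    nfa = {0: {'a': {0, 1}, 'b': {0}}, 1: {'b': {2}}}
    dfa, q0, finals = determinize(nfa, 0, {2}, {'a', 'b'})
    S = q0
    for i, c in enumerate("abab"):
        S = dfa[S][c]
        if S in finals:
            print("occurrence ending at", i)  # prints 1 and 3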
Another trend is to simulate the NFAs using bit-parallelism instead of converting them to DFAs. We present two relatively new approaches, BPThompson and BPGlushkov, which are based on simulating the respective NFAs using their specific properties. We show that BPGlushkov should always be preferred over BPThompson.
A third approach, also novel, permits skipping text characters. The algorithm MultiStringRE computes the minimum length lmin of an occurrence of the regular expression and extracts all the prefixes (of that length) of all the occurrences. It then conducts a multistring search (Chapter 2) for all those strings. When one such prefix is found, it tries to complete the occurrence. An extension of it, MultiFactRE, selects a set of strings of length lmin such that some of these strings must appear inside any occurrence (the set of prefixes is just one option). Finally, RegularBNDM extends BNDM by simulating Glushkov's NFA.
Choosing the best algorithm is complex, since the answer depends on the structure of the regular expression. We give simple criteria based on properties of the pattern to decide which algorithm to use.
Approximate matching is modeled using a distance function that tells how similar two strings are. We are given the pattern and a threshold k, which is the maximum allowed distance between the pattern and its occurrences. In this chapter we concentrate on the Levenshtein (or edit) distance, which is the minimum number of character insertions, deletions, and substitutions needed to make both strings equal. Many applications use variants of this distance.
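For reference, the edit distance is computed with the classical dynamic programming recurrence over a matrix C, where C[i,j] is the distance between the first i characters of one string and the first j characters of the other: C[i,0] = i, C[0,j] = j, and C[i,j] = min(C[i-1,j-1] + d, C[i-1,j] + 1, C[i,j-1] + 1), where d is 0 if the i-th character of the first string equals the j-th character of the second, and 1 otherwise.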
We divide the existing algorithms into four types. The first is based on dynamic programming. This is the oldest approach and still the most flexible one to deal with distances other than edit distance. However, algorithms of this kind are not among the most efficient.
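A Python sketch (ours) of the classical dynamic programming search: the matrix is processed one text character at a time, keeping a single column, and setting C[0] = 0 at every step lets an occurrence start anywhere in the text.

    def dp_search(pattern, text, k):
        # Yields every text position where an occurrence of pattern
        # ends with edit distance at most k.
        m = len(pattern)
        C = list(range(m + 1))  # column: pattern prefixes against the empty text
        for pos, c in enumerate(text):
            prev_diag, C[0] = C[0], 0  # occurrences may start at any text position
            for i in range(1, m + 1):
                diag = prev_diag + (pattern[i - 1] != c)  # match or substitution
                prev_diag = C[i]
                C[i] = min(diag, C[i] + 1, C[i - 1] + 1)  # also try insertion/deletion
            if C[m] <= k:
                yield pos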
The second type of algorithm models the problem as an NFA search: The automaton is built as a function of the pattern and k, and is then made deterministic. The resulting algorithms behave reasonably well on short patterns, but they are not as fast as newer techniques.
Bit-parallelism is the third approach, and it yields many of the most successful results. The algorithms BPR and BPD simulate the NFA, while BPM simulates the dynamic programming algorithm. BPM and BPD are the most efficient of the class, but BPR is more flexible and can be adapted to more complex patterns.
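A Python sketch of the idea behind BPR (ours, following the row-wise formulation of Wu and Manber): k+1 Shift-And registers are kept, where R[d] tracks the pattern prefixes that match with at most d errors, and each text character updates all of them in turn.

    def bpr_search(pattern, text, k):
        # Yields every text position where an occurrence of pattern
        # ends with edit distance at most k.
        m = len(pattern)
        mask, match_bit = (1 << m) - 1, 1 << (m - 1)
        B = {}
        for i, c in enumerate(pattern):
            B[c] = B.get(c, 0) | (1 << i)
        R = [(1 << d) - 1 for d in range(k + 1)]  # level d allows d leading deletions
        for pos, c in enumerate(text):
            old_prev = R[0]
            R[0] = ((R[0] << 1) | 1) & B.get(c, 0)  # exact Shift-And step
            for d in range(1, k + 1):
                old_this = R[d]
                R[d] = ((((old_this << 1) | 1) & B.get(c, 0))  # match
                        | old_prev                 # insertion: consume c, stay
                        | (old_prev << 1)          # substitution: consume c, advance
                        | (R[d - 1] << 1)          # deletion: advance without consuming
                        | 1) & mask                # the first character always fits with d >= 1 errors
                old_prev = old_this
            if R[k] & match_bit:
                yield pos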
Finally, the fourth approach is filtration. A fast algorithm is used to discard large text areas that cannot contain a match, and another (nonfiltration) algorithm is used to check the remaining text areas. These algorithms are among the fastest, but their efficiency degrades quickly as k becomes large compared to the pattern length m.
Among the many filtration algorithms, we present the two most efficient ones. PEX splits the pattern into k+1 pieces and resorts to a multistring search for them, since at least one piece must appear unaltered in any occurrence. ABNDM is an extension of BNDM that simulates the NFA of approximate searching.
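A Python sketch of the PEX filter (ours, simplified: plain str.find stands in for the multistring search, and the dp_search sketch above does the verification). By the pigeonhole principle, k errors cannot alter all of k+1 disjoint pieces, so every occurrence is caught.

    def pex_search(pattern, text, k):
        # Returns the sorted end positions of the occurrences of pattern
        # in text with at most k errors. Assumes len(pattern) > k.
        m = len(pattern)
        plen = m // (k + 1)
        # k+1 disjoint pieces of the pattern (the last one takes the remainder).
        pieces = [(i * plen,
                   pattern[i * plen:(i + 1) * plen] if i < k else pattern[i * plen:])
                  for i in range(k + 1)]
        found = set()
        for offset, piece in pieces:
            start = text.find(piece)
            while start != -1:
                # Verify a text neighborhood around the candidate with dp_search.
                lo = max(0, start - offset - k)
                hi = min(len(text), start - offset + m + 2 * k)
                for end in dp_search(pattern, text[lo:hi], k):
                    found.add(lo + end)
                start = text.find(piece, start + 1)
        return sorted(found)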
We present an experimental map comparing these algorithms. In general, filtration approaches work better for low k/m values. ABNDM is best for small alphabet sizes (such as DNA) while PEX is best for larger alphabets (such as proteins or natural language). For larger k/m values, and also to verify the text areas that the filters cannot discard, the best algorithms are the bit-parallel ones.
There are some developments for approximate searching of other types of patterns. For multiple pattern matching with errors, the main algorithms are MultiHash, which works only for k=1 but is efficient even when the number of patterns is large; MultiPEX, which takes k+1 strings from each pattern and is the most efficient choice for low k/m values; and MultiBP, which superimposes the NFAs of all the patterns and uses the result as a filter, this being the best choice for intermediate k/m values.
For matching extended strings and regular expressions with errors, there are a few approaches: one based on dynamic programming for regular expressions, one based on an NFA of DFAs permitting errors, and a bit-parallel one based on BPR. This last one is the most attractive because of the combination of simplicity and efficiency it offers.