One of the major ways to solve performance bottlenecks is to find a better algorithm. In our increasingly data rich societies, efficient searching of vast databases is becoming increasingly important as well as increasingly difficult. Not only do we need to search data for exact matches but we also need to do so for partial matches, combinations of substrings and for matches with, possibly large, gaps between one part and another. We also have to cope with a variety of different 'alphabets.' For example, the explosive growth of work in genetics and in proteins means that we have databases with more restricted alphabets but with substantial amounts of data to search (just consider the size of the human genome). In many areas of algorithmic research improvements have been relatively slow and normally only by small amounts. This means that most experienced programmers can quickly find an algorithm that is within 10% of the best known for a task with a given set of constraints. This is not the case for pattern matching. A very large amount of work has been done over the last couple of decades with the results scattered through a vast number of academic papers. To make matters worse, theoretically good mechanisms often fail to deliver when the constraints of various resources such as memory size, register width etc. come into play.
The main objective of the authors of this book is to provide a source of practical information for working programmers. Between chapter 1 which is an important introduction to the topic and chapter 7 which is a conclusion that includes advice as to where to look for further information, the authors cover straight string matching, multiple string matching, extended string matching, regular expression matching and approximate matching. They make no attempt at comprehensive coverage but the provide a mix of highly practical information, theoretical explanations and pragmatic results.
While it might seem obvious that the correct algorithm to use would depend on the number of symbols in the alphabet, it is far less obvious that the width of the data registers as well as the length of the string being sought would be critical. Then there is the curious fact that English text behaves much more like random strings composed of an alphabet of 16 symbols than text composed from an alphabet of 26 symbols. I would be interested to learn how other languages compare, particularly as straight searches through random text built from 16 letter alphabets is unusually sensitive to the register width. The authors' experimental results suggest that you should choose a different algorithm depending on whether you are using 16-bit, 32-bit or 64-bit registers.
The subject matter is tough going which makes the provision of pseudocode for all the algorithms covered particularly valuable. I have to admit that I have not had time to actually convert the pseudocode into working programmes but the provision of carefully worked simple examples of the use of each algorithm leads me to expect that the authors do deliver on their promises.
If you need efficient pattern matching for any kind of string then this is the only book I know that comes even close to providing you the tools for the job. It maybe relatively expensive for a slim volume, and in practice you would only use a small part of the content. However the time it will save you and the resulting improvement in performance of your code will repay the initial expenditure many times over. You will still have to work hard, but at least you will have an adequate roadmap and a reasonable expectation of success.
String matching problems range from searching a single text for a string of characters to searching a database for approximate occurrences of a complex pattern. Recently, interest in sophisticated string matching problems has increased dramatically, especially in information retrieval and computational biology. The authors present a practical approach to string matching problems, focusing on the algorithms and implementations that perform best in practice. The book also covers searches for simple, multiple, and extended strings and for regular expressions. Detailed explanations, step-by-step examples, algorithm pseudocode, and implementation efficiency maps give readers the tools to choose appropriate algorithms for their applications.
Pattern matching in strings is a basic problem in many areas of computer science, but particularly in applications that deal with text searching and genetic sequences. Information retrieval and computational biology are generating dramatic increases both in the size of texts to search and in the sophistication of the searches. The authors are two academics with bioinformatics industry experience. They use this book to make the case that information on pattern matching algorithms is not well understood except by experts in the area, and that for non-experts useful, practical implementations are nearly impossible to construct from available literature. Further, they claim that the only way to truly determine the fastest algorithm is empirical testing of the implementation. The intended audience of this book is software engineers, computational biologists, and other non-experts. The goal is to provide a means for them to determine the most simple and efficient algorithms in practice.
The algorithms covered are all online algorithms, meaning that no data structures are built on the text as a whole. In other words, the text can be examined as it is generated and need not be stored. Indexed searching is not covered. In the first chapter the authors review a few classic string matching algorithms including Knuth-Morris-Pratt, Boyer-Moore, and some of their variations. They also cover basic concepts in the areas of bit-parallelism, bit operations, and automata. Other chapters discuss algorithms for multiple string matching (finding a pattern in a series of strings), extended string matching (character classes, bounded length gaps, optional characters, and repeatable characters), regular expressions, approximate matching, and available resources, such as software and technical literature.
Some of the most interesting parts of the book are where the authors attempt to provide a practical map for which pattern matching algorithm to use in a given case. They present an experimental framework based on the length of the pattern and the size of the alphabet. The chapter on extended string matching demonstrates how some tradeo.s can be made when the overhead of full regular expression processing is not required.
The book is well written and engaging. An undergraduate computer science theory background should be su.cient to understand all the material. This would be a useful book for any practitioner who needs to implement pattern matching algorithms, or as a graduate text.
Solid theory - 4 out of 5 stars.
This is an excellent book for someone with a strong foundation in computer science wanting to learn about regular expression and string matching. It is not for the feint-hearted, and would not be recommended to computer science novices. Written by one of the world experts in the field, the relatively thin book is packed with theory and comments on application. The algorithms are presented in a very concise form which often requires a fair amount of thought when trying to implement. However once understood, the book makes an excellent reference for the various algorithms.
The book is on the domain of searching. Data are plain text documents that are searched for patterns of various kinds expressed by different mechanisms such as regular expressions or approximation requirements. The generic term used by the author is flexible patterns. So, the book focuses on searching strategies for flexible patterns in texts. Solutions are all on-line with respect to the text, which means that indexing techniques are not involved as a main feature. Searching texts is a pervasive problem in computer science since it occurs, for example, in basic system software, data retrieval systems, Web search engine development, and computational biology.
The authors orient the exposition toward searching algorithms that are expected to provide efficient solutions in software development, rather than to presenting theoretical solutions that are already treated in other books. Along this line, much attention is given to the technique called bit-parallelism. Experimental results are provided to support the evaluation of algorithms because asymptotic complexity measures cannot support the notion of "practical efficiency" claimed by the authors.
The titles of the main chapters are: String matching, Multiple string matching, Extended string matching, Regular expression matching, Approximate matching. A conclusion discusses related searching topics, available software and references.
The book is actually unique on the subject. It is intended to be used by a large audience of practitioners who want to understand and implement the fastest searching algorithms.
Thin but useful.
Don't let its small size (~200 pages) fool you. There's a lot of good material here, and in very usable form. In fact, that's the whole focus of the book: practical, efficient implementations.
The main topics, chapter by chapter, are simple matching of one desired word to a string, matching of multiple words, two levels of complexity in wildcards and regular expressions, and approximate matching. A number of important and historical algorithms are discussed in each chapter, in great detail. There's pseudo-code for the most important algorithms. Quite a few also have examples worked in detail. The mechanics are tedious and somewhat bulky, but anyone actually trying to implement these techniques will appreciate the examples.
What's really interesting is what's not in this book. You won't find a lot of theory, and you won't find some of the most famous algorithms in string matching. The authors make it clear that this is about practical algorithms with efficient implementations. Lots of the algorithms beloved by theoreticians are impractically complex or just plain slow. Those may be mentioned in passing or as the base for more practical algorithms, but are not welcome on these pages.
It's not an easy read, but it's not a book for people with easy problems. It discusses tradeoffs, like when one technique works well for short strings but another works better on long strings. It addresses the different needs of English-language processing and bioinformatics - just the different numbers of letters in each alphabet make a difference, in some cases.
This is a good one for anyone who takes string processing seriously. There's no cut&paste code here, but plenty for a knowledgable programmer to use. Even better, it offers references to the literature and to working code, and pointers to some books on related topics. I expect to get a lot of use out of this one.
Information for an important research field, string pattern matching, is provided in this book. The authors have conducted premium quality research on topics related to regular expression matching and approximate ...