Extra Material and Comments

  1. Section 2.2 (Shift-And) should now mention the recent average-optimal algorithm by Fredriksson and Grabowski, based on Shift-And, which appeared in SPIRE 2005. Interestingly, it is still a prefix-based algorithm, but it examines only one text character out of every b, for a properly chosen b (a small sketch of the underlying Shift-And scan is given after this list).
  2. In Section 2.3.2 (Horspool) we have not considered the single-pattern version of the Wu-Manber algorithm (Section 3.3.3), which could be competitive in the area where BOM is claimed to be the best, if Wu-Manber is parameterized appropriately. At least it is clearly average-optimal if we choose the block length B = log_s(m) (a sketch of the idea is given after this list).
  3. In Section 2.4 we should have mentioned the possibility of not using "last" but simply shifting the window to just past the character that killed the automaton. This has been done, for example, in nrgrep [Navarro, Software Practice and Experience 2001], and is faster than Horspool on Intel machines (a sketch of this shift rule is given after this list).
  4. Section 4.5 can be improved by noticing that S[c] = S & B[c], where S is the union of S[c] for all c. That is, the i-th bit of S is active if state i is repeatable. Hence we can simulate repeatable characters by doing "D <- (((D<<1)|1) | (D & S)) & B[c]". This solution makes only one memory access per character.
  5. In Section 3.4, Figure 3.19, the use of CL could have been avoided by inserting D <- D & ~DF as a new line between lines 24 and 25, removing line 8, and leaving line 26 as just D <- D << 1. The result is possibly a little faster.
  6. For Section 5.2, there are newer methods that achieve smaller NFAs than Glushkov's. A short review is given in the CPM 2002 paper by Lucian Ilie (p. 279). However, these constructions do not preserve Glushkov's property that all edges arriving at a node are labeled by the same character or class, so they are not as attractive for the bit-parallel algorithms described in Section 5.4. Still, they could be interesting in other cases. We plan to work on a combination that minimizes the NFA while preserving Glushkov's property.
  7. In Section 5.3.3, the same complexity can be obtained without recursion in the structure: build modules only at the leaves and a normal NFA for the rest, so that some tree leaves are actually modules. In Thompson's construction a leaf becomes an edge between an initial and a final node; a leaf that is a module becomes instead a DFA running between those two nodes. At each step one feeds every DFA module with the current text character, activating its initial state whenever its initial node in the NFA is active; and each time the final state of a DFA module becomes active, the module's final node in the NFA is activated.
  8. In Section 5.4.2 (page 122), it should be noted that we do not need to represent the initial state of the NFA, since it is always active. Hence we need 2^m table entries rather than 2^(m+1). This seems trivial, but it halves the space! The same trick cannot be used, however, in Section 5.5.3.
  9. There is news for Section 5.7 (page 139) about NFA construction: V. Geffert, "Translation of binary regular expressions into nondeterministic epsilon-free automata with O(n log n) transitions", JCSS 66(3):451-472, 2003. Finally there is an algorithm achieving the minimum possible number of transitions, O(m log m) in our notation. [Thanks to Lucian Ilie, Univ. of Western Ontario, Canada.]
  10. There has been a notable burst of activity regarding Myers' bit-parallel algorithm (Section 6.4.2) in recent years. Some results can be found in the last CPM, SPIRE, and PSC conferences.
  11. Section 6.5.2 should be replaced/enriched with a new method to implement ABNDM over Myers' simulation (explained in Section 6.4.2). The algorithm, given by Heikki Hyyrö and Gonzalo Navarro in CPM 2002 (p. 203), is better than the version in the book for k > 3. This should enlarge the area of ABNDM in Figure 6.21. Preliminary work on this is already mentioned in the book ([HN01], on page 184). A sketch of Myers' core simulation, on which this method builds, is given after this list.
  12. Sections 6.5 and 6.6 could be enriched with a new algorithm for multipattern approximate searching by Kimmo Fredriksson and Gonzalo Navarro. It builds on the 1994 paper by Chang and Marr (for one pattern), extending it to multiple patterns and trying some variants as well. It seems to be a relevant filtration algorithm not only for multiple patterns but also for a single one. The paper appeared in CPM 2003 (p. 109).
  13. For Section 6.9, there is a new bit-parallel algorithm for approximate regular expression searching with arbitrary integer costs, by Gonzalo Navarro (ISAAC 2003). In the book we mention [BH01] as an attempt in this direction. The new technique is implemented and shows a clear improvement over the algorithms of Sections 6.7.1 and 6.7.2. It is not as fast as that of Section 6.7.3, but the latter can handle only integer costs and a few extensions.
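
Below are a few code sketches illustrating some of the items above. They are our own illustrations in C, written for clarity rather than speed; they are not code from the book or from the cited papers, and all function names and parameter choices are ours.

For item 1, a minimal sketch of the classical Shift-And scan of Section 2.2, assuming the pattern fits in a computer word. It only recalls the prefix-based simulation that the Fredriksson-Grabowski algorithm builds on; their machinery for skipping all but one text character out of every b is not shown.

      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      void shift_and(const char *pat, const char *txt)
      {
          size_t m = strlen(pat), n = strlen(txt);
          uint64_t B[256] = {0}, D = 0;

          if (m == 0 || m > 64) return;          /* pattern must fit in one word      */
          for (size_t i = 0; i < m; i++)         /* bit i of B[c] set iff pat[i] == c */
              B[(unsigned char)pat[i]] |= (uint64_t)1 << i;

          for (size_t j = 0; j < n; j++) {
              D = ((D << 1) | 1) & B[(unsigned char)txt[j]];  /* extend every prefix  */
              if (D & ((uint64_t)1 << (m - 1)))               /* last state active    */
                  printf("occurrence ending at position %zu\n", j);
          }
      }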
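
For item 2, a sketch of a single-pattern, Horspool-like scan that shifts using the block formed by the last B characters of the window, in the spirit of the Wu-Manber algorithm restricted to one pattern. The block length B = 2, the table size, and the direct block-to-index mapping are illustrative choices, not the book's exact parameterization; in practice one picks B close to log_s(m) and may hash the block into a smaller table, storing for each slot the smallest shift among colliding blocks.

      #include <stdio.h>
      #include <string.h>

      #define B 2                          /* block (q-gram) length                    */
      #define TABLE (256 * 256)            /* one entry per possible 2-character block */

      static unsigned block(const unsigned char *s)    /* block -> table index */
      {
          return ((unsigned)s[0] << 8) | s[1];
      }

      void block_horspool(const char *pat, const char *txt)
      {
          size_t m = strlen(pat), n = strlen(txt);
          static size_t shift[TABLE];

          if (m < B || n < m) return;
          for (size_t i = 0; i < TABLE; i++)   /* block absent from the pattern:       */
              shift[i] = m - B + 1;            /* jump the window past it              */
          for (size_t i = 0; i + B <= m; i++)  /* otherwise align its rightmost        */
              shift[block((const unsigned char *)pat + i)] = m - B - i;  /* occurrence */

          size_t pos = 0;
          while (pos + m <= n) {
              size_t s = shift[block((const unsigned char *)txt + pos + m - B)];
              if (s == 0) {                    /* last window block matches: verify    */
                  if (memcmp(txt + pos, pat, m) == 0)
                      printf("occurrence at position %zu\n", pos);
                  pos++;                       /* safe shift after a verification      */
              } else {
                  pos += s;
              }
          }
      }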
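
For item 3, a sketch of BNDM where, instead of keeping track of "last", the window is simply moved so that it starts right after the character that killed the automaton; when the automaton survives the whole window the code falls back to a shift of 1. This only illustrates the shift rule; it is not nrgrep's actual code.

      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      void bndm_kill_shift(const char *pat, const char *txt)
      {
          size_t m = strlen(pat), n = strlen(txt);
          uint64_t B[256] = {0};

          if (m == 0 || m > 64 || n < m) return;
          for (size_t i = 0; i < m; i++)       /* factors of the reversed pattern */
              B[(unsigned char)pat[i]] |= (uint64_t)1 << (m - 1 - i);

          size_t pos = 0;
          while (pos + m <= n) {
              uint64_t D = (m == 64) ? ~(uint64_t)0 : (((uint64_t)1 << m) - 1);
              size_t j = m;                    /* scan the window right to left */
              for (;;) {
                  j--;
                  D &= B[(unsigned char)txt[pos + j]];
                  if (D == 0) {                /* txt[pos+j..pos+m-1] is not a factor,   */
                      pos += j + 1;            /* so no occurrence starts in pos..pos+j  */
                      break;
                  }
                  if (j == 0) {
                      if (D & ((uint64_t)1 << (m - 1)))
                          printf("occurrence at position %zu\n", pos);
                      pos++;                   /* automaton survived the whole window */
                      break;
                  }
                  D <<= 1;
              }
          }
      }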
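
For item 4, the update above on a hardcoded example: the extended pattern a b+ c, where the second position is repeatable. S has bit i set iff state i is repeatable, so a single table access per text character suffices.

      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      int main(void)
      {
          const char *txt = "xabbbcx";         /* contains one occurrence, "abbbc"  */
          size_t n = strlen(txt), m = 3;       /* states 0, 1, 2 stand for a, b+, c */
          uint64_t B[256] = {0}, S, D = 0;

          B['a'] = (uint64_t)1 << 0;           /* which character activates each state */
          B['b'] = (uint64_t)1 << 1;
          B['c'] = (uint64_t)1 << 2;
          S = (uint64_t)1 << 1;                /* only state 1 (the b) is repeatable   */

          for (size_t j = 0; j < n; j++) {
              D = (((D << 1) | 1) | (D & S)) & B[(unsigned char)txt[j]];
              if (D & ((uint64_t)1 << (m - 1)))
                  printf("occurrence ending at position %zu\n", j);
          }
          return 0;
      }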
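
For item 11, a compact sketch of Myers' bit-parallel simulation (Section 6.4.2) in its plain forward-search form, assuming the pattern fits in a computer word; the CPM 2002 ABNDM variant reuses this column update inside the backward window scan, which is not shown here. Pv/Mv and Ph/Mh are the usual vertical and horizontal +1/-1 difference vectors of the dynamic programming matrix.

      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      void myers_search(const char *pat, const char *txt, unsigned k)
      {
          size_t m = strlen(pat), n = strlen(txt);
          uint64_t Peq[256] = {0};

          if (m == 0 || m > 64) return;
          uint64_t Pv = ~(uint64_t)0, Mv = 0;
          uint64_t high = (uint64_t)1 << (m - 1);
          unsigned score = (unsigned)m;        /* bottom cell of the current column */

          for (size_t i = 0; i < m; i++)       /* bit i of Peq[c] set iff pat[i] == c */
              Peq[(unsigned char)pat[i]] |= (uint64_t)1 << i;

          for (size_t j = 0; j < n; j++) {
              uint64_t Eq = Peq[(unsigned char)txt[j]];
              uint64_t Xv = Eq | Mv;
              uint64_t Xh = (((Eq & Pv) + Pv) ^ Pv) | Eq;
              uint64_t Ph = Mv | ~(Xh | Pv);
              uint64_t Mh = Pv & Xh;

              if (Ph & high) score++;          /* track the value of the last row     */
              else if (Mh & high) score--;

              Ph <<= 1;                        /* the top row of the matrix is all 0  */
              Mh <<= 1;
              Pv = Mh | ~(Xv | Ph);
              Mv = Ph & Xv;

              if (score <= k)
                  printf("occurrence ending at %zu with %u errors\n", j, score);
          }
      }
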
Something to add? Please let us know and enter our Hall of Fame!