Extra Material and Comments
- Section 2.2 (Shift-And) should now mention the recent average-optimal
algorithm by Fredriksson and Grabowski based on Shift-And. It appeared in
SPIRE 2005. Interestingly, it is still a prefix-based algorithm, but it
reads one text character out of b, for a properly chosen b.
- In Section 2.3.2 (Horspool) we have not considered the single-pattern
version of Wu-Manber algorithm (Section 3.3.3), which could be competitive
in the area where BOM is claimed to be the best, if we parameterize Wu-Manber
appropriately. At least it is clearly optimal if we choose B=log_s(m).
- In Section 2.4 we should have mentioned the possibility of not considering
"last" but simply shifting the window just after the character that killed the
automaton. This has been done, for example, in nrgrep [Navarro, Software
Practice and Experience 2001], and is faster than Horspool on Intel machines.
- Section 4.5 can be improved by noticing that S[c] = S & B[c], where S is
the union of S[c] for all c. That is, the i-th bit of S is active if state i is
repeatable. Hence we can simulate repeatable characters by doing
"D <- (((D<<1)|1) | (D & S)) & B[c]". This solution makes only one memory
access per character.
- In Section 3.4, Figure 3.19, the use of CL could have been avoided
by doing D <- D & ~DF in a new line between 24 and 25,
hence removing line 8 and leaving line 26 just as D <- D << 1.
The result is possibly a little bit faster.
- For Section 5.2, there are newer methods to achieve smaller NFAs than
Glushkov's. A short review is given in CPM 2002 paper by Lucian Ilie (p. 279).
However, they do not preserve Glushkov's property that all edges arriving at
a node are labeled by the same character/class, so they are not so attractive
for the bit-parallel algorithms described in Section 5.4. However, they could
be interesting for other cases. We plan to work on a combination that
minimizes the NFA while preserving Glushkov's property.
- In Section 5.3.3, the same complexity is obtained without recursion in
the structure. Just make modules at the leaves and build a normal NFA with the
rest. Some tree leaves will actually be modules. A leaf, in Thompson's
construction, becomes an edge between the initial and final node. Now leaves
that are modules are a DFA that goes between those initial and final nodes. At
each step one must feed each DFA module with the current character, as well as
activating its initial state when the initial node is active. Each time the
final node of a DFA module is activated, we activate the final node it has in
the NFA.
- In Section 5.4.2 (page 122), it should be noted that we do not need to
represent the initial state of the NFA, since it is always active. Hence we
need 2^m table entries rather than 2^(m+1). This seems trivial
but represents half of the space! The same trick cannot be used, however, in
Section 5.5.3.
- There are news for Section 5.7 (page 139), about NFA construction. A
new paper by V. Geffert, "Translation of binary regular expressions into
nondeterministic epsilon-free automata with O(n log n) transitions", JCSS
66(3):451-472, 2003. Finally they found an algorithm to achieve the minimum
possible number of transitions, O(m log m). [Thanks to Lucian Ilie, Univ. of
Western Ontario, Canada].
- There has been a notable burst of activity regarding Myers' bit-parallel
algorithm (Section 6.4.2) in recent years. Some results can be found in
the last CPM, SPIRE, and PSC conferences.
- Section 6.5.2 should be replaced/enriched with a new method to implement
ABNDM over Myers' simulation (explained in Section 6.4.2). The algorithm, given
by Heikki Hyyrö and Gonzalo Navarro in CPM 2002 (p. 203), is better than the
version of the book for k > 3. This should enlarge the area of ABNDM in
Figure 6.21. Preliminary work on this is already mentioned in the book ([HN01],
in page 184).
- Sections 6.5 and 6.6 could be enriched with a new algorithm for
multipattern approximate searching by Kimmo Fredriksson and Gonzalo Navarro.
This inherits from Chang and Marr 1994 paper (for one pattern), extending it
for multiple patterns and trying some variants too. It seems to be a relevant
filtration algorithm not only for several but also for a single pattern. The
paper has appeared in CPM 2003 (p. 109).
- For Section 6.9, there is a new bit-parallel algorithm for handling
approximate regular expression searching with arbitrary integer costs, by
Gonzalo Navarro (ISAAC 2003). In the book we mention [BH01]
as an attempt in this direction. The new technique is implemented and shows
a clear improvement over the algorithms in Sections 6.7.1 and 6.7.2. It is
not as fast as that of Section 6.7.3, but this one can handle only integer costs
and a few extensions.
Something to add? Please let us know
and enter our Hall of Fame!