Matchsimile: A Flexible Approximate Matching Tool for Personal Names Searching

Gonzalo Navarro, Ricardo Baeza-Yates and João Marcelo Arcoverde

In this paper we present the architecture and algorithms behind Matchsimile, an approximate string matching lookup tool designed especially for searching human and company names in a large textual database. Part of a larger information retrieval environment, the engine accepts an input text file containing a set of personal and company names together with a set of restrictions for the search. After batch processing, the engine outputs another text file containing the occurrences that match each record of the input names file, according to its search parameters. Beyond the similarity search applied to each word that forms a name, the tool applies a set of personal name formation rules to those words, covering combination, abbreviation, character mapping, duplicate detection, ordering, and word omission and insertion, among others. The engine is used in a successful commercial application (also named Matchsimile), which uses it to search for lawyers' names in many official law journal publications.
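To make the combination of per-word similarity search and name formation rules more concrete, the following is a minimal illustrative sketch, not the Matchsimile implementation. It assumes plain Levenshtein distance for word similarity and covers only three of the rules mentioned above (abbreviation, ordering, word omission). The function names and the parameters `max_errors` and `max_omitted` are hypothetical, and the greedy word assignment is a simplification.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_matches(pattern: str, candidate: str, max_errors: int = 1) -> bool:
    """True if candidate is within max_errors of pattern, or abbreviates it."""
    p = pattern.lower()
    c = candidate.lower().rstrip(".")
    if len(c) == 1:
        # Assumed abbreviation rule: a single letter stands for a word
        # that starts with it, e.g. "R." for "Ricardo".
        return p.startswith(c)
    return edit_distance(p, c) <= max_errors


def name_matches(pattern: str, text: str,
                 max_errors: int = 1, max_omitted: int = 1) -> bool:
    """Greedily pair each pattern word with a distinct text word.

    Word order is ignored, and up to max_omitted pattern words
    may be missing from the text.
    """
    remaining = text.split()
    omitted = 0
    for word in pattern.split():
        hit = next((t for t in remaining
                    if word_matches(word, t, max_errors)), None)
        if hit is None:
            omitted += 1
        else:
            remaining.remove(hit)
    return omitted <= max_omitted
```

For example, under these assumptions `name_matches("Gonzalo Navarro", "Navaro G.")` accepts a reordered, abbreviated and misspelled occurrence, while `name_matches("Gonzalo Navarro", "Ricardo Baeza-Yates")` rejects an unrelated name.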