Fault-tolerant Search


Classical approaches to error-tolerant search

This is a brief description of the most relevant "classical" approaches to tackle the problem of error-tolerant search.

Deterministic Finite Automaton (DFA or Transducer)

DFAs are based on regular languages and are widely used in compilers and interpreters to quickly check words for membership in a defined word set. Retrieval in this kind of structure is a directed walk through a set of states, which is usually extremely fast and independent of the size of the dictionary. Error tolerance can only be added either by considering expected variations of the query string or by adding such variations to the dictionary. Both end up being trial-and-error approaches, which are naturally very limited.

In the former case the retrieval time grows exponentially with the number of allowed errors; in the latter case, memory consumption becomes unmanageable for large dictionaries. In practice this approach is workable for at most two corrections. Its major drawback is that resource consumption grows exponentially with the number of corrections.
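As a minimal sketch of the first point above (not taken from any particular product), a dictionary can be compiled into a trie-shaped DFA whose lookup time depends only on the length of the query word, with error tolerance achieved only by adding hand-picked expected variants to the dictionary:

```python
# Sketch: a dictionary compiled into a trie-shaped DFA.
# Lookup walks one state per character, independent of dictionary size.

def build_dfa(words):
    """Build a trie of states; each state maps a character to the next state."""
    root = {}
    for word in words:
        state = root
        for ch in word:
            state = state.setdefault(ch, {})
        state["$"] = True  # accepting-state marker
    return root

def accepts(dfa, word):
    """Directed walk through the states; fails on the first missing edge."""
    state = dfa
    for ch in word:
        if ch not in state:
            return False
        state = state[ch]
    return "$" in state

# Error "tolerance" only via hand-added expected variants -- trial and error:
dfa = build_dfa(["search", "serach", "saerch"])
```

Any misspelling that was not anticipated (e.g. "serch") is simply rejected, which illustrates why this strategy stays limited.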

Hash Tables

There is a wide range of hash table implementations. All share the advantage that lookup is extremely fast (constant time), but none is fault tolerant at all. Error tolerance is usually added by doubling the hash table and accessing it with the so-called "front hash"/"back hash" procedure, which assumes that an error occurs in only one half of the word. Memory requirements are comparatively small. Some implementations add a built-in exact-match hash for special situations where a perfect match is sufficient and time can be saved. There is no commercially available solution that relies on this technology.
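A hedged sketch of the "front hash"/"back hash" idea described above: each entry is indexed twice, once by its front half and once by its back half. If a single substitution error falls in one half, the unaffected half still yields an exact hit (insertions and deletions shift the split point and are not covered by this simple model):

```python
from collections import defaultdict

def build_index(words):
    """Index every word twice: by its front half and by its back half."""
    front, back = defaultdict(set), defaultdict(set)
    for w in words:
        mid = len(w) // 2
        front[w[:mid]].add(w)
        back[w[mid:]].add(w)
    return front, back

def lookup(front, back, query):
    """Return candidates whose front OR back half matches the query exactly."""
    mid = len(query) // 2
    return front.get(query[:mid], set()) | back.get(query[mid:], set())
```

For example, the query "exomple" (error in the front half) still retrieves "example" through the back-half index.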

Associative Memory (Neural Nets)

This approach relies heavily on a learning scheme. Memory consumption is significant because a large set of learned coupling constants has to be stored. Retrieval time depends on the dictionary size, but not on the number of errors. The biggest disadvantage of neural nets is the lack of result transparency: the results and their credibility depend heavily on random factors during the learning period, and overall performance is slow. Other associative-memory approaches try to define an objective set of "features" (such as bigrams) in order to increase transparency. However, these features are often not very distinctive and produce a lot of "cross-talk", in other words the retrieval of unwanted matches.
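The cross-talk problem can be made concrete with a small sketch (an illustration, not a description of any specific product): scoring words by bigram overlap (here with the Dice coefficient) can rank an unrelated word above a simple transposition of the query:

```python
def bigrams(word):
    """The set of adjacent character pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def similarity(a, b):
    """Dice coefficient over bigram sets, in [0, 1]."""
    ba, bb = bigrams(a), bigrams(b)
    return 2 * len(ba & bb) / (len(ba) + len(bb))

# Cross-talk: the unrelated word "starch" shares more bigrams with
# "search" than the near-miss transposition "serach" does.
print(similarity("search", "starch"))  # higher score
print(similarity("search", "serach"))  # lower score
```

This is exactly the kind of unwanted match the text calls "cross-talk": the feature set is not distinctive enough to separate genuine near-misses from coincidental overlaps.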

Bipartite Matching

A totally different approach is bipartite matching. Here a cost function is introduced for displacing a character: the idea is to match the query string to the reference string with the lowest possible total displacement cost. Unfortunately this technology can only be used by "compiling" the query and running it against the data, with roughly the performance of grep or other scanners. The results are intuitively similar, but there are no clear criteria for result quality. Developers of solutions based on this technology admit that calculating the edit distance would be clearly superior, but that they had no efficient way to implement it.
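A hedged sketch of one possible displacement-cost model (the concrete cost function and the penalty for unmatched characters are assumptions here, not taken from the text): each query character is paired with an occurrence of the same character in the reference, and the cost is the sum of position displacements.

```python
from collections import defaultdict

def displacement_cost(query, reference, unmatched_penalty=2):
    """Sum of |position shift| over paired equal characters.

    Pairing equal characters in left-to-right order minimises the
    sum of |i - j| for this simple cost model. Characters with no
    remaining partner pay a fixed penalty (an assumption).
    """
    pos = defaultdict(list)
    for j, ch in enumerate(reference):
        pos[ch].append(j)
    cost = 0
    used = defaultdict(int)
    for i, ch in enumerate(query):
        slots = pos[ch]
        k = used[ch]
        if k < len(slots):
            cost += abs(i - slots[k])
            used[ch] += 1
        else:
            cost += unmatched_penalty
    return cost
```

The transposition "serach" vs. "search" costs 2 (the displaced "r" and "a" each moved one position), which matches the intuition of "similar" results, but, as the text notes, such scores lack clear quality criteria compared with a proper edit distance.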

Longest-Common-Subsequence (LCS)

This is considered a simplified variation of the Levenshtein algorithm. It has the drawback that directly adjacent sequences of characters receive the same credibility as scattered sequences. In cases where a clear relationship between query length and dictionary-entry length cannot be given, it is a convenient algorithm for approximate retrieval, but it also has no efficient implementation.
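The classic dynamic program for the LCS length makes the drawback above easy to demonstrate: a query scattered across a much longer entry scores exactly as well as a perfect match.

```python
def lcs_length(a, b):
    """Classic O(len(a) * len(b)) dynamic program for the LCS length."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

# Adjacent and scattered matches get the same credit:
print(lcs_length("search", "search"))       # 6
print(lcs_length("search", "sxexaxrxcxh"))  # also 6
```

Unlike the Levenshtein distance, the LCS length does not penalise the interleaved junk characters, which is exactly the "same credibility for scattered sequences" problem.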

Glob-Style Matching, Regular Expressions

The most common method of imprecise querying is regular-expression or glob-style matching. Regular expressions (egrep) are cumbersome to use, and if not specified with care, the correct entry can still be missed; as a result, neither reliability nor transparency is provided. Search performance for full-table scans of large databases is so poor that the technology is not commercially viable.
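A small illustration of the reliability problem (names and pattern are made up for the example): a hand-written pattern catches only the spelling variants its author anticipated.

```python
import re

entries = ["Meyer", "Meier", "Mayr"]

# A pattern written "with care" for common variants of the name...
pattern = re.compile(r"M[ae][iy]er")

hits = [e for e in entries if pattern.fullmatch(e)]
# "Mayr" is still missed: the pattern writer did not anticipate the
# dropped "e", so correctness depends entirely on foresight.
print(hits)
```

Every unanticipated variant requires revising the pattern, and a full regex scan of a large table is linear in the table size, which is why the text dismisses the approach for large databases.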
