Saturday 11 August 2012

Fuzzy Thing: Approximate String Matching (Intro & BKTree)

Understanding different fuzzy-matching approaches

So I have these tasks at work that often involve having to look up a word/name of one data set in another data set, whether it is found or which are the found records that match this word/name closest.
In the following I'll use a tiny sample data set to examine how different approaches on fuzzy matching work. I'll focus on the offline-solutions that create an index of the LookUp data set in order to speed up the querying process. I won't go into theoretical details as there are excellent descriptions out there for each of the methods I'll use, for me it's rather important to understand what they do in a hands-on way. Today I start with Burkhard-Keller Trees and hopefully I'll follow up with other approaches.

Definitions:

I have a Pattern, that is one column of a record in my Source data-set, for which I want to find similar records in a LookUp data-set. Each record consists of two columns, I am basically interested in the values of colA, but eventually I want to get back the whole record.

Sample LookUp data-set

      colA                colB
0   valueX  rec1_value_of_colB
1   valueY  rec2_value_of_colB
2   valueX  rec3_value_of_colB
3  valueXZ  rec4_value_of_colB
4  valueYZ  rec5_value_of_colB

Burkhard-Keller Trees

A Burkhard-Keller (BK) Tree algorithm(1) can be used to create a tree-index of a data-set that stores the data-entries according to their similarities/distances towards each other in a hierarchical tree structure. The expression of similarity/distance has to be expressible in a metric space, it as to obey the law of linear triangularity. An excellent introduction into the theoretical approach: Damn Cool Algorithms, Part 1: BK-Trees.
I use this python implementation by Adam Hupp with a few modifications. For example I've edited it to accept (index, word)-tuples instead of only words, as I eventually want to get back the whole record rather than only a matching pattern.
Our BKTree class gets called with only one argument the distance function to use; I use the Levenshtein edit distance that expresses the distance between two patterns as the total number of edits (insertions, deletions and/or substitutions of characters) one has to apply in order to get from one pattern to the other. There is a very well performing C-extension Levenshtein module available in python.
class BKTree(object):
    def __init__(self, distfn):
        """
        Create a new BK-tree from a list of words using `distfn`.
        `distfn`: a binary function that returns the distance between
            two words.  Return value is a non-negative integer.  the
            distance function must be a metric space.
        """
        self.distfn = distfn

Creating the index

Let's say we want to create an index on colA of the LookUp data-set. We provide a list of (index_of_record, value_of_colA)-tuples to the following function:
    def create_tree(self, indexed_words):
        """
        The 'index creating function'.
        `indexed_words`: A list of tuples ((index, word)), where
        `index` is the index of a record represented by the 
        `word` in a record list.
        """
        it = indexed_words.iteritems()
        root = it.next()
        tree = (root, {})
        for i in it:
            self._add_word(tree, i)
        return tree
The principle of indexing here is that we start with the first record of our LookUp data-source and use it as the 'root' of our tree. Then we add each remaining record and it's metric distance to 'root' to our tree.
    def _add_word(self, parent, indexed_word):
        index, word = indexed_word
        (pindex, pword), children = parent
        d = self.distfn(word, pword)
        if d in children:
            self._add_word(children[d], indexed_word)
        else:
            children[d] = (indexed_word, {})
So for each (index, word)-tuple from our remaining 'create_tree' list we calculate the edit distance to the parent element; if a previous item has already been registered with this distance to the parent, this item is taken as parent instead and the new word is compared to this one.
In that way we create recursively a hierarchical tree that contains information about the distance relations between *all* items in the dataset without any duplication. We can do that due to the guaranty of triangular inequality in the metric spaces expressed via the edit distance.
Here is the index that will be used for querying:
((0, 'valueX'),
 {0: ((2, 'valueX'), {}),
  1: ((1, 'valueY'), {2: ((3, 'valueXZ'), {})}),
  2: ((4, 'valueYZ'), {})})
Note that (3, 'valueXZ') is put into relation with (1, 'valueY') of a distance of 2; this due to the fact that both patterns share the same distance (1) to the root (0, 'valueX').
A potential problem resulting from this approach occurs if we have to deal with a data set where the column we want to index has a lot of *identical* entries for its records. That would lead to an index with a huge amount of hierarchy-levels but without benefits, we'll see that later in a query example.

Querying

    def query(self, word, n, tree):
        """
        Return all words in the `tree` that are within a distance of
        `n` from `word`.
        Yields tuples (distance, (index, word))
        NOTE: the original implementation returns a sorted list,
        however as I'm using a generator here, the yielded results
        aren't sorted!
        """
        def rec(parent):
            (pindex, pword), children = parent
            d = self.distfn(word, pword)
            if d <= n:
                yield (d, (pindex, pword))
            for i in range(d - n, d + n + 1):
                child = children.get(i)
                if child is not None:
                    for c in rec(child):
                        yield c
        return rec(tree) 
Let's query for similar words to the word 'valu--Z' with an accepted max distance n of 2. I also provide as tree the previously created index. Comparing the pattern visually to the LookUp records let's us expect only two valid matches: Record 3 and 4, both with an edit distance of 2. But what happens:

Firstly the query function will call the inner function rec with the whole tree. This one is then split up into the 'root' tuple (with the index 0 and the word 'valueX') and its child elements. The computed edit distance between my query word 'valu--Z' and root's 'valueX' is 3, it's not within the accepted range and so for not yielded.

Now comes the awesome bit: Based on this distance an accepted range of [1, 2, 3, 4, 5] is created that excludes all words of which it is sure (triangular inequality guaranty) they are out of the desired scope. In this way they query 'jumps' through the index picking items that are likely to fit, ignoring the ones of which it knows they wouldn't. LookupRecord 2 with its identical 'valueX' will be therefore ignored.

Still the range of records checked is always n * 2 + 1 which leads to a significant slow-down of bk-trees towards higher accepted edit distances. The length of the words on the other hand does only have a negative impact on the peformance of the distance function.

The result of our query is as expected:
[(2, (3, 'valueXZ')), (2, (4, 'valueYZ'))]
The indices allow to get back the whole records from the LookUp Data Set, adding further acceptance criteria is a straight forward option.

No comments:

Post a Comment