Sunday, 19 August 2012

FuzzyMatching (2): NGram

After having explored Burkhard-Keller Trees I will now have a look at the n-gram string matching method. I'll use the same LookUp data set as in the previous post in order to compare the methods at a later point.

I use this great python implementation in order to understand how it works: https://github.com/gpoulter/python-ngram.

Create the index 

I will initialise NGram with the the same (record-index, value-of-colA)-tuple-list I used for the bk-trees. I specify the following parameters:
  • key: A lambda function providing the index of the actual strings in the tuples.
  • N: The size of the grams (chunks) our strings will be split in.
>>> from ngram import NGram
>>> indexed_words = [(0, 'valueX'), (1, 'valueY'), (2, 'valueX'),
                     (3, 'valueXZ'), (4, 'valueYZ')]
>>> index = NGram(indexed_words, key=lambda x:x[1], N=3)
The NGram class itself inherits from set(), one part of the initialisation is simply to add all items of the list to this set. As I provided indexed items I don't 'benefit' from the uniqueness of the set.
Let's look at the grams that have been created:
>>> index._grams
{'$$v': {(0, 'valueX'): 1,
         (1, 'valueY'): 1,
         (2, 'valueX'): 1,
         (3, 'valueXZ'): 1,
         (4, 'valueYZ'): 1},
 '$va': {(0, 'valueX'): 1,
         (1, 'valueY'): 1,
         (2, 'valueX'): 1,
         (3, 'valueXZ'): 1,
         (4, 'valueYZ'): 1},
 'X$$': {(0, 'valueX'): 1, (2, 'valueX'): 1},
 'XZ$': {(3, 'valueXZ'): 1},
 'Y$$': {(1, 'valueY'): 1},
 'YZ$': {(4, 'valueYZ'): 1},
 'Z$$': {(3, 'valueXZ'): 1, (4, 'valueYZ'): 1},
 'alu': {(0, 'valueX'): 1,
         (1, 'valueY'): 1,
         (2, 'valueX'): 1,
         (3, 'valueXZ'): 1,
         (4, 'valueYZ'): 1},
 'eX$': {(0, 'valueX'): 1, (2, 'valueX'): 1},
 'eXZ': {(3, 'valueXZ'): 1},
 'eY$': {(1, 'valueY'): 1},
 'eYZ': {(4, 'valueYZ'): 1},
 'lue': {(0, 'valueX'): 1,
         (1, 'valueY'): 1,
         (2, 'valueX'): 1,
         (3, 'valueXZ'): 1,
         (4, 'valueYZ'): 1},
 'ueX': {(0, 'valueX'): 1, (2, 'valueX'): 1, (3, 'valueXZ'): 1},
 'ueY': {(1, 'valueY'): 1, (4, 'valueYZ'): 1},
 'val': {(0, 'valueX'): 1,
         (1, 'valueY'): 1,
         (2, 'valueX'): 1,
         (3, 'valueXZ'): 1,
         (4, 'valueYZ'): 1}}
Quite big compared to the bk-tree index, init?
So the grams are stored in a dictionary, where the keys are the n-grams themselves, the values are mappings of the indexed strings that contain the specific gram and a count of how many times the gram occurs in the string.
Note the 'padding' characters $ that wrap the strings to create full n-grams containing the start- and end-characters.

Querying

I now query the index for the same value as I did in the bk-tree post. The difference is that NGram accepts as similarity parameter a threshold rather than an edit-distance. This threshold is defined as a decimal number between 0 and 1, the closer to one, the more similar two strings are. We'll see that there is a warp parameter that'll help us to tweak the impact of the word-length.
>>> query = 'valu--Z'
>>> index.search(query, threshold=0.5)
I'll just step through the process here and list the interesting bits it's doing:
  • split the query itself in n-grams:
  • ipdb> pp list(self.split(query))
    ['$$v', '$va', 'val', 'alu', 'lu-', 'u--', '--Z', '-Z$', 'Z$$']
    
  • for each of them do an *exact* (==quick) look-up in the ._grams dict
  • write each of the matches in a dictionary and keep track of the number of their shared grams (how many grams were matching for each index value)
  • ipdb> pp shared # for gram '$$v'
    {(0, 'valueX'): 4,
     (1, 'valueY'): 4,
     (2, 'valueX'): 4,
     (3, 'valueXZ'): 5,
     (4, 'valueYZ'): 5}
  • Based on the combined number of distinct n-grams and on the number of common n-grams the similarity is calculated, the standard formula is the number of matching n-grams divided by the common number of distinct n-grams of query *and* match (0 - 1).
  • ipdb> pp similarity # for comparing query 'valu--Z' with 'valueXZ'
    0.38461538461538464
    
    Outch, that's not a lot! That will not be accepted by our threshold of 0.5. Funny, they have a edit-distance of 2, that's not so much. But both strings are relatively short, so the ratio all_grams / same_grams must be necessarily more sensitive for short word pairs than for longer ones. I mentioned this warp parameter, that, being set higher, smoothens the punishment if there are only some n-grams missing. This is by default 1. Let's set it to 2 and see what happens:
    ipdb> warp = 2
    ipdb> pp self.ngram_similarity(samegrams, allgrams, warp)
    0.621301775147929
    
    Whohoo! That somehow worked, but in order to understand it better, we have to look deeper into the details. The similarity is calculated on the following formula:
    diffgrams = float(allgrams - samegrams)
    similarity = ((allgrams ** warp - diffgrams ** warp) / (allgrams ** warp))
    
    For warp == 1 this results exactly in the equation specified above. But what if the warp varies?

    Sensitivity analysis of the impact of warp


    So a warp > 1 smothens the effect of a relatively small number of different n-grams, a warp between 0 and 1 increases the punishment.The ladder effect is apparently not desired as the code I use here prevents an increased punishmend by specifying:
    if abs(warp - 1.0) < 1e-9:
         similarity = float(samegrams) / allgrams
    
  • However with warp = 2 and threshold = 0.5 all of my test-records would be accepted, this it not what I want. So it apparently makes sense to increase the threshold then as well, in order to increase sensitivity in a specific region.
    >>> from ngram import NGram
    >>> indexed_words = [(0, 'valueX'), (1, 'valueY'), (2, 'valueX'),
                         (3, 'valueXZ'), (4, 'valueYZ')]
    >>> index = NGram(indexed_words, key=lambda x:x[1], N=3, warp=2)
    >>> index.search('valu__X', threshold=0.6)
    [((4, 'valueYZ'), 0.621301775147929),
     ((3, 'valueXZ'), 0.621301775147929)] 
    I'm happy with that for now! So with a meaningful tweaking that takes into account which data one is dealing with, one can get very satisfying results.

No comments:

Post a Comment