Sunday 19 August 2012

FuzzyMatching (2): NGram

After having explored Burkhard-Keller trees, I will now have a look at the n-gram string-matching method. I'll use the same LookUp data set as in the previous post so that I can compare the two methods at a later point.

I use this great Python implementation to understand how it works: https://github.com/gpoulter/python-ngram.

Create the index 

I will initialise NGram with the same (record-index, value-of-colA) tuple list I used for the bk-trees. I specify the following parameters:
  • key: A lambda function that extracts the string to be indexed from each tuple.
  • N: The size of the grams (chunks) our strings will be split into.
>>> from ngram import NGram
>>> indexed_words = [(0, 'valueX'), (1, 'valueY'), (2, 'valueX'),
                     (3, 'valueXZ'), (4, 'valueYZ')]
>>> index = NGram(indexed_words, key=lambda x:x[1], N=3)
The NGram class itself inherits from set; one part of the initialisation is simply to add all items of the list to this set. As I provide (index, value) tuples, I don't 'benefit' from the uniqueness of the set: (0, 'valueX') and (2, 'valueX') are distinct tuples, so both remain separate members.
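Just to make that tangible (assuming the tuples really do end up as members of the set, as described above), the index itself can be used like a set of the items:
>>> len(index)
5
>>> (0, 'valueX') in index
True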
Let's look at the grams that have been created:
>>> index._grams
{'$$v': {(0, 'valueX'): 1,
         (1, 'valueY'): 1,
         (2, 'valueX'): 1,
         (3, 'valueXZ'): 1,
         (4, 'valueYZ'): 1},
 '$va': {(0, 'valueX'): 1,
         (1, 'valueY'): 1,
         (2, 'valueX'): 1,
         (3, 'valueXZ'): 1,
         (4, 'valueYZ'): 1},
 'X$$': {(0, 'valueX'): 1, (2, 'valueX'): 1},
 'XZ$': {(3, 'valueXZ'): 1},
 'Y$$': {(1, 'valueY'): 1},
 'YZ$': {(4, 'valueYZ'): 1},
 'Z$$': {(3, 'valueXZ'): 1, (4, 'valueYZ'): 1},
 'alu': {(0, 'valueX'): 1,
         (1, 'valueY'): 1,
         (2, 'valueX'): 1,
         (3, 'valueXZ'): 1,
         (4, 'valueYZ'): 1},
 'eX$': {(0, 'valueX'): 1, (2, 'valueX'): 1},
 'eXZ': {(3, 'valueXZ'): 1},
 'eY$': {(1, 'valueY'): 1},
 'eYZ': {(4, 'valueYZ'): 1},
 'lue': {(0, 'valueX'): 1,
         (1, 'valueY'): 1,
         (2, 'valueX'): 1,
         (3, 'valueXZ'): 1,
         (4, 'valueYZ'): 1},
 'ueX': {(0, 'valueX'): 1, (2, 'valueX'): 1, (3, 'valueXZ'): 1},
 'ueY': {(1, 'valueY'): 1, (4, 'valueYZ'): 1},
 'val': {(0, 'valueX'): 1,
         (1, 'valueY'): 1,
         (2, 'valueX'): 1,
         (3, 'valueXZ'): 1,
         (4, 'valueYZ'): 1}}
Quite big compared to the bk-tree index, innit?
So the grams are stored in a dictionary: the keys are the n-grams themselves, and each value maps the indexed items that contain the specific gram to the number of times the gram occurs in the string.
Note the 'padding' characters $ that wrap the strings to create full n-grams containing the start- and end-characters.
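To see concretely what the splitting and the index construction amount to, here is a minimal sketch, not the library's actual code, of padding a string, cutting it into trigrams and building such a gram dictionary (split_into_ngrams and build_gram_index are my own hypothetical helpers):
from collections import defaultdict

# Simplified sketch, not the python-ngram implementation: pad with N-1
# pad characters on each side, then slide a window of size N over the result.
def split_into_ngrams(s, N=3, pad_char='$'):
    padded = pad_char * (N - 1) + s + pad_char * (N - 1)
    return [padded[i:i + N] for i in range(len(padded) - N + 1)]

# Build a mapping gram -> {item: number of occurrences}, analogous to the
# _grams dictionary dumped above.
def build_gram_index(items, key=lambda x: x[1], N=3):
    grams = defaultdict(lambda: defaultdict(int))
    for item in items:
        for gram in split_into_ngrams(key(item), N):
            grams[gram][item] += 1
    return grams
Calling split_into_ngrams('valueXZ') returns exactly the nine grams of 'valueXZ' that appear in the dump above: ['$$v', '$va', 'val', 'alu', 'lue', 'ueX', 'eXZ', 'XZ$', 'Z$$'].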

Querying

I now query the index for the same value as I did in the bk-tree post. The difference is that NGram takes a threshold as its similarity parameter rather than an edit distance. This threshold is a decimal number between 0 and 1: the closer to one, the more similar two strings have to be. We'll also see that there is a warp parameter that helps us tweak the impact of the word length.
>>> query = 'valu--Z'
>>> index.search(query, threshold=0.5)
I'll just step through the process here and list the interesting bits it does:
  • split the query itself into n-grams:
  • ipdb> pp list(self.split(query))
    ['$$v', '$va', 'val', 'alu', 'lu-', 'u--', '--Z', '-Z$', 'Z$$']
    
  • for each of them do an *exact* (==quick) look-up in the ._grams dict
  • write each of the matches into a dictionary and keep track of the number of shared grams (how many of the query's grams match each indexed item)
  • ipdb> pp shared # after all query grams have been processed
    {(0, 'valueX'): 4,
     (1, 'valueY'): 4,
     (2, 'valueX'): 4,
     (3, 'valueXZ'): 5,
     (4, 'valueYZ'): 5}
  • Based on the number of common n-grams and the combined number of distinct n-grams, the similarity is calculated. The standard formula is the number of matching n-grams divided by the number of distinct n-grams of query *and* match combined, yielding a value between 0 and 1. (A small self-contained sketch at the end of this post reproduces the numbers that follow.)
  • ipdb> pp similarity # for comparing query 'valu--Z' with 'valueXZ'
    0.38461538461538464
    
    Ouch, that's not a lot! It will not be accepted by our threshold of 0.5. Funny, the two strings have an edit distance of only 2, which isn't much. But both strings are relatively short, so the ratio same_grams / all_grams is necessarily more sensitive for short word pairs than for longer ones. I mentioned the warp parameter: set higher than 1, it smooths the punishment when only a few n-grams are missing. Its default is 1. Let's set it to 2 and see what happens:
    ipdb> warp = 2
    ipdb> pp self.ngram_similarity(samegrams, allgrams, warp)
    0.621301775147929
    
    Woohoo! That somehow worked, but in order to understand it better, we have to look deeper into the details. The similarity is calculated with the following formula:
    diffgrams = float(allgrams - samegrams)
    similarity = ((allgrams ** warp - diffgrams ** warp) / (allgrams ** warp))
    
    For warp == 1 this reduces exactly to the formula specified above. But what happens when the warp varies?

    [Figure: Sensitivity analysis of the impact of warp]


    So a warp > 1 smooths the effect of a relatively small number of differing n-grams, while a warp between 0 and 1 increases the punishment. The latter effect is apparently not desired, as the code I use here prevents an increased punishment by specifying:
    if abs(warp - 1.0) < 1e-9:
         similarity = float(samegrams) / allgrams
    
  • However, with warp = 2 and threshold = 0.5 all of my test records would be accepted, which is not what I want. So it apparently makes sense to increase the threshold as well, in order to increase sensitivity in the region of interest.
    >>> from ngram import NGram
    >>> indexed_words = [(0, 'valueX'), (1, 'valueY'), (2, 'valueX'),
                         (3, 'valueXZ'), (4, 'valueYZ')]
    >>> index = NGram(indexed_words, key=lambda x:x[1], N=3, warp=2)
    >>> index.search('valu--Z', threshold=0.6)
    [((4, 'valueYZ'), 0.621301775147929),
     ((3, 'valueXZ'), 0.621301775147929)] 
    I'm happy with that for now! So with some meaningful tweaking that takes the data one is dealing with into account, one can get very satisfying results.
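To wrap up, here is the small self-contained sketch promised above. It is my own simplification rather than the library code (it ignores repeated grams within a string, which doesn't matter for these short test values), but it reproduces both similarity values from the walkthrough:
def split_into_ngrams(s, N=3, pad_char='$'):
    # Same padding-and-sliding-window helper as in the earlier sketch,
    # repeated here so this snippet runs on its own.
    padded = pad_char * (N - 1) + s + pad_char * (N - 1)
    return [padded[i:i + N] for i in range(len(padded) - N + 1)]

def ngram_similarity(query, candidate, N=3, warp=1.0):
    query_grams = set(split_into_ngrams(query, N))
    candidate_grams = set(split_into_ngrams(candidate, N))
    samegrams = len(query_grams & candidate_grams)
    # allgrams: distinct grams of query and candidate combined.
    allgrams = len(query_grams) + len(candidate_grams) - samegrams
    if abs(warp - 1.0) < 1e-9:
        return float(samegrams) / allgrams
    diffgrams = float(allgrams - samegrams)
    return (allgrams ** warp - diffgrams ** warp) / allgrams ** warp
Calling ngram_similarity('valu--Z', 'valueXZ') gives roughly 0.3846, and with warp=2 it gives roughly 0.6213, matching the ipdb output above.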
