I use this great python implementation in order to understand how it works: https://github.com/gpoulter/python-ngram.
Create the index
I will initialise NGram with the the same (record-index, value-of-colA)-tuple-list I used for the bk-trees. I specify the following parameters:- key: A lambda function providing the index of the actual strings in the tuples.
- N: The size of the grams (chunks) our strings will be split in.
>>> from ngram import NGram >>> indexed_words = [(0, 'valueX'), (1, 'valueY'), (2, 'valueX'), (3, 'valueXZ'), (4, 'valueYZ')] >>> index = NGram(indexed_words, key=lambda x:x[1], N=3)The NGram class itself inherits from set(), one part of the initialisation is simply to add all items of the list to this set. As I provided indexed items I don't 'benefit' from the uniqueness of the set.
Let's look at the grams that have been created:
>>> index._grams {'$$v': {(0, 'valueX'): 1, (1, 'valueY'): 1, (2, 'valueX'): 1, (3, 'valueXZ'): 1, (4, 'valueYZ'): 1}, '$va': {(0, 'valueX'): 1, (1, 'valueY'): 1, (2, 'valueX'): 1, (3, 'valueXZ'): 1, (4, 'valueYZ'): 1}, 'X$$': {(0, 'valueX'): 1, (2, 'valueX'): 1}, 'XZ$': {(3, 'valueXZ'): 1}, 'Y$$': {(1, 'valueY'): 1}, 'YZ$': {(4, 'valueYZ'): 1}, 'Z$$': {(3, 'valueXZ'): 1, (4, 'valueYZ'): 1}, 'alu': {(0, 'valueX'): 1, (1, 'valueY'): 1, (2, 'valueX'): 1, (3, 'valueXZ'): 1, (4, 'valueYZ'): 1}, 'eX$': {(0, 'valueX'): 1, (2, 'valueX'): 1}, 'eXZ': {(3, 'valueXZ'): 1}, 'eY$': {(1, 'valueY'): 1}, 'eYZ': {(4, 'valueYZ'): 1}, 'lue': {(0, 'valueX'): 1, (1, 'valueY'): 1, (2, 'valueX'): 1, (3, 'valueXZ'): 1, (4, 'valueYZ'): 1}, 'ueX': {(0, 'valueX'): 1, (2, 'valueX'): 1, (3, 'valueXZ'): 1}, 'ueY': {(1, 'valueY'): 1, (4, 'valueYZ'): 1}, 'val': {(0, 'valueX'): 1, (1, 'valueY'): 1, (2, 'valueX'): 1, (3, 'valueXZ'): 1, (4, 'valueYZ'): 1}}Quite big compared to the bk-tree index, init?
So the grams are stored in a dictionary, where the keys are the n-grams themselves, the values are mappings of the indexed strings that contain the specific gram and a count of how many times the gram occurs in the string.
Note the 'padding' characters $ that wrap the strings to create full n-grams containing the start- and end-characters.
Querying
I now query the index for the same value as I did in the bk-tree post. The difference is that NGram accepts as similarity parameter a threshold rather than an edit-distance. This threshold is defined as a decimal number between 0 and 1, the closer to one, the more similar two strings are. We'll see that there is a warp parameter that'll help us to tweak the impact of the word-length.>>> query = 'valu--Z'
>>> index.search(query, threshold=0.5)I'll just step through the process here and list the interesting bits it's doing:
- split the query itself in n-grams:
ipdb> pp list(self.split(query)) ['$$v', '$va', 'val', 'alu', 'lu-', 'u--', '--Z', '-Z$', 'Z$$']
ipdb> pp shared # for gram '$$v' {(0, 'valueX'): 4, (1, 'valueY'): 4, (2, 'valueX'): 4, (3, 'valueXZ'): 5, (4, 'valueYZ'): 5}
ipdb> pp similarity # for comparing query 'valu--Z' with 'valueXZ' 0.38461538461538464Outch, that's not a lot! That will not be accepted by our threshold of 0.5. Funny, they have a edit-distance of 2, that's not so much. But both strings are relatively short, so the ratio all_grams / same_grams must be necessarily more sensitive for short word pairs than for longer ones. I mentioned this warp parameter, that, being set higher, smoothens the punishment if there are only some n-grams missing. This is by default 1. Let's set it to 2 and see what happens:
ipdb> warp = 2 ipdb> pp self.ngram_similarity(samegrams, allgrams, warp) 0.621301775147929Whohoo! That somehow worked, but in order to understand it better, we have to look deeper into the details. The similarity is calculated on the following formula:
diffgrams = float(allgrams - samegrams) similarity = ((allgrams ** warp - diffgrams ** warp) / (allgrams ** warp))For warp == 1 this results exactly in the equation specified above. But what if the warp varies?
Sensitivity analysis of the impact of warp
if abs(warp - 1.0) < 1e-9: similarity = float(samegrams) / allgrams
- However with warp = 2 and threshold = 0.5 all of my test-records would be accepted, this it not what I want. So it apparently makes sense to increase the threshold then as well, in order to increase sensitivity in a specific region.
>>> from ngram import NGram >>> indexed_words = [(0, 'valueX'), (1, 'valueY'), (2, 'valueX'), (3, 'valueXZ'), (4, 'valueYZ')] >>> index = NGram(indexed_words, key=lambda x:x[1], N=3, warp=2) >>> index.search('valu__X', threshold=0.6) [((4, 'valueYZ'), 0.621301775147929), ((3, 'valueXZ'), 0.621301775147929)]
I'm happy with that for now! So with a meaningful tweaking that takes into account which data one is dealing with, one can get very satisfying results.