Thursday, 31 January 2013

Compare CSV Files - A use case from cluster analysis

Comparing CSV or tab-separated output files after the process that produces them has changed slightly is a common task.
Which tool is best suited for such a comparison?

 Technology choices 

Besides Excel et al., which are surely a decent way of getting a visual overview, the existing 'diff'-like tools I know (including the otherwise excellent graphical variant 'meld') don't really cope with the fact that, apart from a change in the order of columns, two rows may still contain the same information. And if I want to compare files with information nested within some columns, where the order might differ but is not important, it can get quite tricky. A quick way of dealing with that is some basic set theory plus your programming language of choice, which I want to demonstrate in the following using Python.
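The basic idea, as a minimal sketch: a row becomes a frozenset of its values, so column order no longer matters, and a whole file becomes a set of such row-sets that can be compared directly:
>>> frozenset(['B', 'F', 'G']) == frozenset(['G', 'F', 'B'])
True
>>> # frozensets are hashable, so they can be members of an outer set
>>> rows_a = set([frozenset(['A', 'E']), frozenset(['B', 'F', 'G'])])
>>> rows_b = set([frozenset(['E', 'A']), frozenset(['G', 'F', 'B'])])
>>> rows_a == rows_b
True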

 Use case: tweaking a cluster analysis - a work in progress

Say I have a set of 13 records (A-M), and I want to cluster them in order to find potential duplicates in the dataset. The way I define the clustering algorithm shall not be of concern here. The important bit is that I'm still not sure how to configure it and I want to compare two different approaches. As an outcome of the two clustering approaches I have two tab-separated files. Each file has two columns: the first one contains a "MasterID", the second one contains a list of IDs that are assigned to the cluster represented by this MasterID. As I have changed something in my clustering process, either the code itself or the order of the input rows, the clusters turn out differently:
$ cat file1.tsv
MasterID  IDs
A         E
B         F,G
C         H,I,J
D         K,L,M
$ cat file2.tsv
MasterID  IDs
E         A
B         G,F
C         H,M,J
I         K,L,D
At this stage I don't distinguish between MasterIDs and other IDs; I just want to know whether the rows in the two files represent the same clusters.

(1) Preparation: A set of sets

import re

def get_unique_clusters(filepath, delimits='\t|,', to_delete='\n',
                        ignore_header=True):
    """
    Creates a set of (frozen) sets from the content of a file with tabular
    data; the set represents the file, the frozensets its lines.

    Splits each line of the file at `filepath` into the desired items,
    converts them into a frozenset and adds that to a big set which is
    then returned.

    Unfortunately I haven't found a nice way of dealing with more than one
    separator in python's otherwise very nice csv module, hence I do the
    splitting manually with a regular expression.
    """
    myfile = open(filepath, 'rb')
    if ignore_header:
        myfile.next()
    items = set()
    for line in myfile:
        # we use frozensets here as they are immutable, in contrast to
        # normal sets, and a (hash-)set can't contain mutable members.
        cluster = frozenset([col
                             for col in re.split(delimits, re.sub(to_delete, '', line))
                             if col])
        items.add(cluster)
    myfile.close()
    return items
>>> clusters_1 = get_unique_clusters('file1.tsv')
>>> clusters_2 = get_unique_clusters('file2.tsv')
>>> clusters_1, clusters_2
(set([frozenset(['A', 'E']),
      frozenset(['I', 'H', 'C', 'J']),
      frozenset(['B', 'G', 'F']),
      frozenset(['K', 'M', 'D', 'L'])]),
 set([frozenset(['H', 'C', 'J', 'M']),
      frozenset(['I', 'K', 'L', 'D']),
      frozenset(['A', 'E']),
      frozenset(['B', 'G', 'F'])]))

(2) A quick error check

>>> # Let's have a look at all items occurring in each of our datasets;
>>> # they should match up if there is no mistake in the previously
>>> # exercised clustering process.
>>> all_items_1 = set([col for row in clusters_1 for col in row])
>>> all_items_2 = set([col for row in clusters_2 for col in row])
>>> sorted(all_items_1)
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M']
>>> # For a bigger data set a direct comparison is obviously easier:
>>> all_items_1 == all_items_2
True

(3) Find the different clusters

>>> only_in_1 = clusters_1.difference(clusters_2)
>>> only_in_1 # expecting the 3rd and 4th row of file1
set([frozenset(['I', 'H', 'C', 'J']), frozenset(['K', 'M', 'D', 'L'])])
>>> only_in_2 = clusters_2.difference(clusters_1)
>>> only_in_2 # expecting the 3rd and 4th row of file2
set([frozenset(['H', 'C', 'J', 'M']), frozenset(['I', 'K', 'L', 'D'])])

(4) Double check all well done

>>> # All members of the two difference sets added together should match,
>>> # as the clustering shouldn't have swallowed any items, only wrapped
>>> # them differently.
>>> all_items_only_1 = set([col for row in only_in_1 for col in row])
>>> all_items_only_2 = set([col for row in only_in_2 for col in row])
>>> all_items_only_1 == all_items_only_2
True
# further eyeballing can be done in excel et al.
import csv
with open('only_in_1.csv', 'wb') as csvfile:
    mywriter = csv.writer(csvfile, delimiter='\t')
    for i in only_in_1:
        mywriter.writerow(list(i))
with open('only_in_2.csv', 'wb') as csvfile:
    mywriter = csv.writer(csvfile, delimiter='\t')
    for i in only_in_2:
        mywriter.writerow(list(i))

Sorting csv files on columns in vim

Nice one, it's actually very easy to sort a [comma or tab or whatever] separated file in vim on a certain column.
The following example
:sort n/\([^\t]*\t\)\{9\}
  • sorts (:sort )
  • a "column" containing *numbers* ( n )
  • defined as anything that is not a tab, followed by a tab ( \([^\t]*\t\) )
  • after this has occurred 9 times ( \{9\} ), i.e. on the *10th* column.

Sunday, 19 August 2012

FuzzyMatching (2): NGram

After having explored Burkhard-Keller Trees I will now have a look at the n-gram string matching method. I'll use the same LookUp data set as in the previous post in order to compare the methods at a later point.

I use this great python implementation in order to understand how it works: https://github.com/gpoulter/python-ngram.

Create the index 

I will initialise NGram with the same (record-index, value-of-colA)-tuple-list I used for the bk-trees. I specify the following parameters:
  • key: A lambda function providing the index of the actual strings in the tuples.
  • N: The size of the grams (chunks) our strings will be split into.
>>> from ngram import NGram
>>> indexed_words = [(0, 'valueX'), (1, 'valueY'), (2, 'valueX'),
                     (3, 'valueXZ'), (4, 'valueYZ')]
>>> index = NGram(indexed_words, key=lambda x:x[1], N=3)
The NGram class itself inherits from set(); one part of the initialisation is simply to add all items of the list to this set. As I provided indexed items I don't 'benefit' from the uniqueness of the set.
Let's look at the grams that have been created:
>>> index._grams
{'$$v': {(0, 'valueX'): 1,
         (1, 'valueY'): 1,
         (2, 'valueX'): 1,
         (3, 'valueXZ'): 1,
         (4, 'valueYZ'): 1},
 '$va': {(0, 'valueX'): 1,
         (1, 'valueY'): 1,
         (2, 'valueX'): 1,
         (3, 'valueXZ'): 1,
         (4, 'valueYZ'): 1},
 'X$$': {(0, 'valueX'): 1, (2, 'valueX'): 1},
 'XZ$': {(3, 'valueXZ'): 1},
 'Y$$': {(1, 'valueY'): 1},
 'YZ$': {(4, 'valueYZ'): 1},
 'Z$$': {(3, 'valueXZ'): 1, (4, 'valueYZ'): 1},
 'alu': {(0, 'valueX'): 1,
         (1, 'valueY'): 1,
         (2, 'valueX'): 1,
         (3, 'valueXZ'): 1,
         (4, 'valueYZ'): 1},
 'eX$': {(0, 'valueX'): 1, (2, 'valueX'): 1},
 'eXZ': {(3, 'valueXZ'): 1},
 'eY$': {(1, 'valueY'): 1},
 'eYZ': {(4, 'valueYZ'): 1},
 'lue': {(0, 'valueX'): 1,
         (1, 'valueY'): 1,
         (2, 'valueX'): 1,
         (3, 'valueXZ'): 1,
         (4, 'valueYZ'): 1},
 'ueX': {(0, 'valueX'): 1, (2, 'valueX'): 1, (3, 'valueXZ'): 1},
 'ueY': {(1, 'valueY'): 1, (4, 'valueYZ'): 1},
 'val': {(0, 'valueX'): 1,
         (1, 'valueY'): 1,
         (2, 'valueX'): 1,
         (3, 'valueXZ'): 1,
         (4, 'valueYZ'): 1}}
Quite big compared to the bk-tree index, innit?
So the grams are stored in a dictionary, where the keys are the n-grams themselves and the values are mappings from the indexed strings that contain the specific gram to a count of how many times the gram occurs in that string.
Note the 'padding' characters $ that wrap the strings to create full n-grams containing the start- and end-characters.
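To see where these keys come from, we can split a single value ourselves, using the same split() method that shows up in the search walkthrough below (a quick sketch; I assume it behaves the same on an indexed value as on a query):
>>> list(index.split('valueX'))
['$$v', '$va', 'val', 'alu', 'lue', 'ueX', 'eX$', 'X$$']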

Querying

I now query the index for the same value as I did in the bk-tree post. The difference is that NGram accepts a similarity threshold rather than an edit distance. This threshold is a decimal number between 0 and 1: the closer to 1, the more similar two strings are. We'll also see that there is a warp parameter that helps to tweak the impact of the word length.
>>> query = 'valu--Z'
>>> index.search(query, threshold=0.5)
I'll just step through the process here and list the interesting bits it's doing:
  • split the query itself into n-grams:
  • ipdb> pp list(self.split(query))
    ['$$v', '$va', 'val', 'alu', 'lu-', 'u--', '--Z', '-Z$', 'Z$$']
    
  • for each of them do an *exact* (==quick) look-up in the ._grams dict
  • write each of the matches in a dictionary and keep track of the number of their shared grams (how many grams were matching for each index value)
  • ipdb> pp shared # after all query grams have been looked up
    {(0, 'valueX'): 4,
     (1, 'valueY'): 4,
     (2, 'valueX'): 4,
     (3, 'valueXZ'): 5,
     (4, 'valueYZ'): 5}
  • Based on the number of common n-grams and the total number of distinct n-grams the similarity is calculated; the standard formula is the number of matching n-grams divided by the number of distinct n-grams of query *and* match combined, yielding a value between 0 and 1.
  • ipdb> pp similarity # for comparing query 'valu--Z' with 'valueXZ'
    0.38461538461538464
    
    Ouch, that's not a lot! That will not be accepted by our threshold of 0.5. Funny, the two strings have an edit distance of only 2, which isn't much. But both strings are relatively short, so the ratio same_grams / all_grams is necessarily more sensitive for short word pairs than for longer ones. I mentioned the warp parameter, which, when set higher, softens the punishment if only a few n-grams are missing. By default it is 1. Let's set it to 2 and see what happens:
    ipdb> warp = 2
    ipdb> pp self.ngram_similarity(samegrams, allgrams, warp)
    0.621301775147929
    
    Woohoo! That somehow worked, but in order to understand it better we have to look deeper into the details. The similarity is calculated using the following formula:
    diffgrams = float(allgrams - samegrams)
    similarity = ((allgrams ** warp - diffgrams ** warp) / (allgrams ** warp))
    
    For warp == 1 this results exactly in the equation specified above. But what if the warp varies?

    Sensitivity analysis of the impact of warp
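    To get a feeling for this, we can plug the numbers from the 'valu--Z' vs. 'valueXZ' comparison above (samegrams = 5, allgrams = 13) into that formula for a few warp values (a minimal sketch, not part of the library):
    >>> def sim(samegrams, allgrams, warp):
    ...     # same formula as in ngram_similarity above
    ...     diffgrams = float(allgrams - samegrams)
    ...     return (allgrams ** warp - diffgrams ** warp) / (allgrams ** warp)
    ...
    >>> [(w, round(sim(5, 13, w), 3)) for w in (0.5, 1, 2, 3)]
    [(0.5, 0.216), (1, 0.385), (2, 0.621), (3, 0.767)]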


    So a warp > 1 smoothens the effect of a relatively small number of differing n-grams, while a warp between 0 and 1 increases the punishment. The latter effect is apparently not desired, as the code I use here prevents an increased punishment by specifying:
    if abs(warp - 1.0) < 1e-9:
        similarity = float(samegrams) / allgrams
    
  • However with warp = 2 and threshold = 0.5 all of my test records would be accepted, which is not what I want. So it apparently makes sense to increase the threshold as well, in order to increase the sensitivity in a specific region.
    >>> from ngram import NGram
    >>> indexed_words = [(0, 'valueX'), (1, 'valueY'), (2, 'valueX'),
                         (3, 'valueXZ'), (4, 'valueYZ')]
    >>> index = NGram(indexed_words, key=lambda x:x[1], N=3, warp=2)
    >>> index.search('valu--Z', threshold=0.6)
    [((4, 'valueYZ'), 0.621301775147929),
     ((3, 'valueXZ'), 0.621301775147929)] 
    I'm happy with that for now! So with some meaningful tweaking that takes the actual data into account, one can get very satisfying results.

Saturday, 18 August 2012

PyLucene in a virtualenv on Ubuntu12.04

Ubuntu ships with a PyLucene version in its repositories; installing that one is straightforward:
$ sudo apt-get install pylucene

However, that version is a bit outdated, so I'll try installing it from source. For people who don't want to install it into their local virtualenv, the whole process is well described here: http://john.wesorick.com/2011/11/installing-pylucene-on-ubuntu

I'll stick to the structure of this description but tweak it to be installed in my virtualenv.

Get Java

$ sudo apt-get install openjdk-7-jre openjdk-7-jdk ant

Get and unpack PyLucene's source

$ wget http://www.eu.apache.org/dist/lucene/pylucene/pylucene-(latest)
$ tar xvfz pylucene-(latest)
In my case (latest) is version 3.6.0-2

Build jcc

$ cd pylucene-3.6.0-2/jcc/
Edit setup.py and change the dictionary JDK to contain the Java version just installed:
JDK = {
    'linux2': '/usr/lib/jvm/java-7-openjdk-i386',
}
Now run the setup. I didn't have to patch setuptools; I guess that's due to the different version you get in your virtualenv. Anyway:
$ python setup.py install

Build PyLucene

Edit the Makefile in the root directory; I have to set the following variables:
# pylucene-3.6.0-2/Makefile
# ...
PREFIX_PYTHON=(path/to/your/virtualenv)
ANT=JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386 /usr/bin/ant
PYTHON=$(PREFIX_PYTHON)/bin/python
JCC=$(PYTHON) -m jcc --shared
NUM_FILES=4
If you don't know the path to your virtualenv, type 'which python'; $PREFIX_PYTHON is the bit in front of '/bin'.
Now build PyLucene and see whether it works.
$ make
$ make install
$ make test
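To check that the build really works inside the virtualenv, a quick smoke test can be run (a sketch; it assumes the module is importable as 'lucene' and uses the initVM() call and VERSION constant that the PyLucene samples use):
$ python -c "import lucene; lucene.initVM(); print lucene.VERSION"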

Saturday, 11 August 2012

Fuzzy Thing: Approximate String Matching (Intro & BKTree)

Understanding different fuzzy-matching approaches

So I have these tasks at work that often involve looking up a word/name from one data set in another data set, to see whether it is found at all, or which records match this word/name most closely.
In the following I'll use a tiny sample data set to examine how different approaches to fuzzy matching work. I'll focus on the offline solutions that create an index of the LookUp data set in order to speed up the querying process. I won't go into theoretical details, as there are excellent descriptions out there for each of the methods I'll use; for me it's rather important to understand what they do in a hands-on way. Today I start with Burkhard-Keller Trees and hopefully I'll follow up with other approaches.

Definitions:

I have a Pattern, that is the value of one column of a record in my Source data-set, for which I want to find similar records in a LookUp data-set. Each record consists of two columns; I am basically interested in the values of colA, but eventually I want to get back the whole record.

Sample LookUp data-set

      colA                colB
0   valueX  rec1_value_of_colB
1   valueY  rec2_value_of_colB
2   valueX  rec3_value_of_colB
3  valueXZ  rec4_value_of_colB
4  valueYZ  rec5_value_of_colB

Burkhard-Keller Trees

A Burkhard-Keller (BK) Tree algorithm(1) can be used to create a tree index of a data-set that stores the data entries according to their similarities/distances towards each other in a hierarchical tree structure. The similarity/distance measure has to define a metric space, i.e. it has to obey the triangle inequality. An excellent introduction into the theoretical approach: Damn Cool Algorithms, Part 1: BK-Trees.
I use this python implementation by Adam Hupp with a few modifications. For example I've edited it to accept (index, word)-tuples instead of only words, as I eventually want to get back the whole record rather than only a matching pattern.
Our BKTree class gets called with only one argument, the distance function to use; I use the Levenshtein edit distance, which expresses the distance between two patterns as the total number of edits (insertions, deletions and/or substitutions of characters) one has to apply in order to get from one pattern to the other. There is a very well performing C-extension Levenshtein module available for Python.
class BKTree(object):
    def __init__(self, distfn):
        """
        Create a new BK-tree from a list of words using `distfn`.
        `distfn`: a binary function that returns the distance between
            two words.  Return value is a non-negative integer.  the
            distance function must be a metric space.
        """
        self.distfn = distfn
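For example, the tree can be instantiated with the C extension just mentioned (a minimal sketch assuming the python-Levenshtein package, whose distance() function takes two strings and returns the edit distance as an integer):
>>> import Levenshtein
>>> Levenshtein.distance('valueX', 'valu--Z')
3
>>> bktree = BKTree(Levenshtein.distance)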

Creating the index

Let's say we want to create an index on colA of the LookUp data-set. We provide a list of (index_of_record, value_of_colA)-tuples to the following function:
    def create_tree(self, indexed_words):
        """
        The 'index creating function'.
        `indexed_words`: A list of tuples ((index, word)), where
        `index` is the index of a record represented by the 
        `word` in a record list.
        """
        it = iter(indexed_words)
        root = it.next()
        tree = (root, {})
        for i in it:
            self._add_word(tree, i)
        return tree
The principle of indexing here is that we start with the first record of our LookUp data source and use it as the 'root' of our tree. Then we add each remaining record, together with its metric distance to 'root', to the tree.
    def _add_word(self, parent, indexed_word):
        index, word = indexed_word
        (pindex, pword), children = parent
        d = self.distfn(word, pword)
        if d in children:
            self._add_word(children[d], indexed_word)
        else:
            children[d] = (indexed_word, {})
So for each (index, word)-tuple from our remaining 'create_tree' list we calculate the edit distance to the parent element; if a previous item has already been registered with this distance to the parent, that item is taken as parent instead and the new word is compared to it.
In that way we recursively create a hierarchical tree that contains information about the distance relations between *all* items in the dataset without any duplication. We can do that thanks to the guarantee of the triangle inequality in the metric space defined by the edit distance.
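Putting this together for the sample LookUp data set (a sketch, reusing the bktree instance from above and the same (index, word)-tuple list):
>>> indexed_words = [(0, 'valueX'), (1, 'valueY'), (2, 'valueX'),
...                  (3, 'valueXZ'), (4, 'valueYZ')]
>>> tree = bktree.create_tree(indexed_words)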
Here is the index that will be used for querying:
((0, 'valueX'),
 {0: ((2, 'valueX'), {}),
  1: ((1, 'valueY'), {2: ((3, 'valueXZ'), {})}),
  2: ((4, 'valueYZ'), {})})
Note that (3, 'valueXZ') is put into relation with (1, 'valueY') at a distance of 2; this is due to the fact that both patterns share the same distance (1) to the root (0, 'valueX').
A potential problem with this approach occurs if we have to deal with a data set where the column we want to index has a lot of *identical* entries across its records. That would lead to an index with a huge number of hierarchy levels but without any benefit; we'll see that later in a query example.

Querying

    def query(self, word, n, tree):
        """
        Return all words in the `tree` that are within a distance of
        `n` from `word`.
        Yields tuples (distance, (index, word))
        NOTE: the original implementation returns a sorted list,
        however as I'm using a generator here, the yielded results
        aren't sorted!
        """
        def rec(parent):
            (pindex, pword), children = parent
            d = self.distfn(word, pword)
            if d <= n:
                yield (d, (pindex, pword))
            for i in range(d - n, d + n + 1):
                child = children.get(i)
                if child is not None:
                    for c in rec(child):
                        yield c
        return rec(tree) 
Let's query for similar words to the word 'valu--Z' with an accepted maximum distance n of 2. I also provide the previously created index as tree. Comparing the pattern visually to the LookUp records lets us expect only two valid matches: records 3 and 4, both with an edit distance of 2. But what happens?

Firstly the query function calls the inner function rec with the whole tree. This one is then split up into the 'root' tuple (with the index 0 and the word 'valueX') and its child elements. The computed edit distance between my query word 'valu--Z' and root's 'valueX' is 3; it's not within the accepted range and is therefore not yielded.

Now comes the awesome bit: based on this distance an accepted range of [1, 2, 3, 4, 5] is created that excludes all words of which it is certain (thanks to the triangle inequality guarantee) that they are out of the desired scope. In this way the query 'jumps' through the index picking items that are likely to fit, ignoring the ones of which it knows they wouldn't. LookUp record 2 with its identical 'valueX' will therefore be ignored.

Still, the number of child distances checked per node is always n * 2 + 1, which leads to a significant slow-down of BK-trees towards higher accepted edit distances. The length of the words, on the other hand, only has a negative impact on the performance of the distance function.
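The call itself looks like this (a sketch; since my modified query returns a generator, I wrap it in sorted() to get an ordered list):
>>> sorted(bktree.query('valu--Z', 2, tree))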

The result of our query is as expected:
[(2, (3, 'valueXZ')), (2, (4, 'valueYZ'))]
The indices allow us to get back the whole records from the LookUp data set, and adding further acceptance criteria is a straightforward option.