Thursday 31 January 2013

Compare CSV Files - A usecase of cluster analisis

Comparing output files in CSV or tab-separated format after the process of which they are the product has changed slightly is a common task.
Which tool would be best to use for such comparison?

 Technology choices 

Besides Excel et al. which are surely a decent way of getting a visual overview, the existing 'diff' like tools I know (including the otherwise excellent graphical 'diff' variant 'meld') don't really cope with the fact that apart from a change in the order of columns two rows still contain the same information. And if I want to compare files with information nested within some columns, where the order might differ but is not important, it can get quite tricky. A quick way of dealing with that could be using some basic set theory and your programming language of choice, which I want to demonstrate in the following using python.

 Usecase: tweaking a cluster analysis - a work in progress

Say I have a set of 13 records (A-M), and I want to cluster them in order to find potential duplicates in the dataset. They way I define the clustering algorithm shall not be of concern here. The important bit is that I'm still not sure how to configure it and I want to compare two different approaches. As an outcome of the two clustering approaches I have two tab-separated files. Each file has two columns, the first one contains a "MasterID", the second one contains a list of IDs that are assigned to the cluster represented by this MasterID. As I have changed something in my clustering process, either the code itself or the order of the input-rows, I have different occurrences of clusters:
>>> cat file_1.tsv
MasterID IDs
   A         E
   B         F,G
   C         H,I,J
   D       K,L,M
>>> cat file2.tsv
MasterID IDs
   E         A
   B         G,F
   C         H,M,J
   I         K,L,D
At this stage I don't make a difference between MasterIDs and other IDs, I just want to know whether the rows in the two files represent the same clusters.

(1) Preparation: A set of sets

# we use frozensets here as they are immutable in opposite to normal sets and a
# (hash-)set can't contain mutable members
import re
def get_unique_clusters(filepath, delimits='\t|,', to_delete='\n',
                        ignore_header=True):
    """
    Creates a set of (frozen) sets of the content of a file with tabular
    data, the set representing the file, the frozensets its lines.
 
    Splits each line of a file at `filepath` into the desired items,
    converts them in a frozenset and adds them to a big set that is then returned.
 
    Unfortunately I haven't found a nice way of dealing with more than one
    separator in python's otherwise very nice csv module, hence I have to do it manually.
    """
    myfile = open(filepath, 'rb')
    if ignore_header:
        myfile.next()
    items = set()
    for line in myfile:
        # we use frozensets here as they are immutable in opposite to
        # normal sets and a (hash-)set can't contain mutable members.
        cluster = frozenset([col
                             for col in re.split(delimitss, re.sub(to_delete, '', line))
                             if col])
        items.add(cluster)
    return items
>>> clusters_1 = get_unique_clusters('file1.tsv')
>>> clusters_2 = get_unique_clusters('file2.tsv')
>>> clusters_1, clusters_2
(set([frozenset(['A', 'E']),
      frozenset(['I', 'H', 'C', 'J']),
      frozenset(['B', 'G', 'F']),
      frozenset(['K', 'M', 'D', 'L'])]),
 set([frozenset(['H', 'C', 'J', 'M']),
      frozenset(['I', 'K', 'L', 'D']),
      frozenset(['A', 'E']),
      frozenset(['B', 'G', 'F'])]))

(2) A quick error check

>>> # Let's have a look at all occuring items in each of our datasets;
>>> # they should match up if there
>>> # is no mistake in the previously excercised clustering process.
>>> all_items_1 = set([col for row in clusters_1 for col in row])
>>> all_items_2 = set([col for row in clusters_2 for col in row])
>>> all_items_1, all_items_2
set([frozenset(['I', 'H', 'C', 'J']), frozenset(['K', 'M', 'D', 'L'])])
>>> # A bigger data set obviously compares easier like that:
>>> all_items_1 == all_items_2
True

(3) Find the different clusters

>>> only_in_1 = clusters_1.difference(clusters_2)
>>> only_in_1 # expecting the 3rd and 4th row of the original file
set([frozenset(['I', 'H', 'C', 'J']), frozenset(['K', 'M', 'D', 'L'])])
>>> only_in_2 = clusters_2.difference(clusters_1)
>>> only_in_2 # expecting the 3rd and 4th row of the original file
set([frozenset(['H', 'C', 'J', 'M']), frozenset(['I', 'K', 'L', 'D'])])

(4) Double check all well done

>>> # all members in each of the difference sets added together should match the others,
>>> # as the clustering shouldn't have swallowed items, only it wrapped them differently.
>>> all_items_only_1 = set([col for row in only_in_1 for col in row]
>>> all_items_only_2 = set([col for row in only_in_2 for col in row])
>>> all_items_only_1 == all_items_only_2
True
# further eyeballing can be done in excel et al.
import csv
with open('only_in_1.csv', 'wb') as csvfile:
    mywriter = csv.writer(csvfile, delimiter='\t')
    for i in only_in_1:
        mywriter.writerow(list(i))
with open('only_in_2.csv', 'wb') as csvfile:
    mywriter = csv.writer(csvfile, delimiter='\t')
    for i in only_in_2:
        mywriter.writerow(list(i))

Sorting csv files on columns in vim

Nice one, it's actually very easy to sort a [comma or tab or whatever] separated file in vim on a certain column.
The following example
:sort n/\([^\t]*\t\)\{9\}
  • sorts (:sort )
  • a "column" containing *numbers* ( n )
  • defined as anything which is no tab until a tab ( \([^\t]*\t\) )
  • after this having occurred 9 times \{9\} == on the *10th* column.