Compare CSV Files - A usecase of cluster analisis
Comparing output files in CSV or tab-separated format after the process of which they are the product has changed slightly is a common task.Which tool would be best to use for such comparison?
Technology choices
Besides Excel et al. which are surely a decent way of getting a visual overview, the existing 'diff' like tools I know (including the otherwise excellent graphical 'diff' variant 'meld') don't really cope with the fact that apart from a change in the order of columns two rows still contain the same information. And if I want to compare files with information nested within some columns, where the order might differ but is not important, it can get quite tricky. A quick way of dealing with that could be using some basic set theory and your programming language of choice, which I want to demonstrate in the following using python.Usecase: tweaking a cluster analysis - a work in progress
Say I have a set of 13 records (A-M), and I want to cluster them in order to find potential duplicates in the dataset. They way I define the clustering algorithm shall not be of concern here. The important bit is that I'm still not sure how to configure it and I want to compare two different approaches. As an outcome of the two clustering approaches I have two tab-separated files. Each file has two columns, the first one contains a "MasterID", the second one contains a list of IDs that are assigned to the cluster represented by this MasterID. As I have changed something in my clustering process, either the code itself or the order of the input-rows, I have different occurrences of clusters:>>> cat file_1.tsv MasterID IDs A E B F,G C H,I,J D K,L,M
>>> cat file2.tsv MasterID IDs E A B G,F C H,M,J I K,L,DAt this stage I don't make a difference between MasterIDs and other IDs, I just want to know whether the rows in the two files represent the same clusters.
(1) Preparation: A set of sets
# we use frozensets here as they are immutable in opposite to normal sets and a # (hash-)set can't contain mutable members import re def get_unique_clusters(filepath, delimits='\t|,', to_delete='\n', ignore_header=True): """ Creates a set of (frozen) sets of the content of a file with tabular data, the set representing the file, the frozensets its lines. Splits each line of a file at `filepath` into the desired items, converts them in a frozenset and adds them to a big set that is then returned. Unfortunately I haven't found a nice way of dealing with more than one separator in python's otherwise very nice csv module, hence I have to do it manually. """ myfile = open(filepath, 'rb') if ignore_header: myfile.next() items = set() for line in myfile: # we use frozensets here as they are immutable in opposite to # normal sets and a (hash-)set can't contain mutable members. cluster = frozenset([col for col in re.split(delimitss, re.sub(to_delete, '', line)) if col]) items.add(cluster) return items
>>> clusters_1 = get_unique_clusters('file1.tsv') >>> clusters_2 = get_unique_clusters('file2.tsv') >>> clusters_1, clusters_2 (set([frozenset(['A', 'E']), frozenset(['I', 'H', 'C', 'J']), frozenset(['B', 'G', 'F']), frozenset(['K', 'M', 'D', 'L'])]), set([frozenset(['H', 'C', 'J', 'M']), frozenset(['I', 'K', 'L', 'D']), frozenset(['A', 'E']), frozenset(['B', 'G', 'F'])]))
(2) A quick error check
>>> # Let's have a look at all occuring items in each of our datasets; >>> # they should match up if there >>> # is no mistake in the previously excercised clustering process. >>> all_items_1 = set([col for row in clusters_1 for col in row]) >>> all_items_2 = set([col for row in clusters_2 for col in row]) >>> all_items_1, all_items_2 set([frozenset(['I', 'H', 'C', 'J']), frozenset(['K', 'M', 'D', 'L'])])
>>> # A bigger data set obviously compares easier like that: >>> all_items_1 == all_items_2 True
(3) Find the different clusters
>>> only_in_1 = clusters_1.difference(clusters_2) >>> only_in_1 # expecting the 3rd and 4th row of the original file set([frozenset(['I', 'H', 'C', 'J']), frozenset(['K', 'M', 'D', 'L'])])
>>> only_in_2 = clusters_2.difference(clusters_1) >>> only_in_2 # expecting the 3rd and 4th row of the original file set([frozenset(['H', 'C', 'J', 'M']), frozenset(['I', 'K', 'L', 'D'])])
(4) Double check all well done
>>> # all members in each of the difference sets added together should match the others, >>> # as the clustering shouldn't have swallowed items, only it wrapped them differently. >>> all_items_only_1 = set([col for row in only_in_1 for col in row] >>> all_items_only_2 = set([col for row in only_in_2 for col in row]) >>> all_items_only_1 == all_items_only_2 True
# further eyeballing can be done in excel et al. import csv with open('only_in_1.csv', 'wb') as csvfile: mywriter = csv.writer(csvfile, delimiter='\t') for i in only_in_1: mywriter.writerow(list(i)) with open('only_in_2.csv', 'wb') as csvfile: mywriter = csv.writer(csvfile, delimiter='\t') for i in only_in_2: mywriter.writerow(list(i))