<< back to other nerdy projects

part 1: resemblance with the jaccard coefficient

part 2: fastmap projection using jaccard distances

part 3: the simhash algorithm

part 4: a sketching algorithm

huh?

i started working on another rss feed classification technique using a data duplication algorithm to classify articles.
the idea is that an article can be classified by determining which class it is most likely a duplicate of.
however half way through i realised this technique could work against a problem we were seeing at work and changed to start work on that data instead
it's a bit sad i know but data is data and it's still an interesting problem.
i'll use nothing but publicly available data for this, and if it looks promising i might get a chance to work on it further during business hours!

all discussed ruby/c++ code is available from http://github.com/matpalm/resemblance

so what is the actual problem?

given two very similiar business names, address pairs can we decide if they are actually the same company?
let's consider some examples...

eg1
Burra Hotel, 5 Market Sq, Burra, SA, 5417
Camping Country Superstore, 401 Pacific Hwy, Belmont North, NSW, 2280
it's pretty obvious these are not the same company. next!

eg2
One Stop Bakery, 1304 High St Rd, Wantirna, VIC, 3152
One Stop Bakery, 1304 High Street Rd, Wantirna South, VIC, 3152
i think these are the same, it's just one is using an abbrev for street.

eg3
Park Beach Interiors, Showroom Park Beach Plaza Pacific Hwy, Coffs Harbour, NSW, 2450
Park Beach Interiors, Showroom Park Beach Plaza Pacific Highway, Coffs Harbour, NSW, 2450
Park Beach Interiors, Park Beach Plaza Pacific Hwy, Coffs Harbour, NSW, 2450
Park Beach Interiors, 26 Park Beach Plaza, Pacific Hwy, Coffs Harbour, NSW, 2450
i think these are all the same.

eg 4
Weaver Interiors, 955 Pacific Hwy, Pymble, NSW, 2073
Weaver Interiors, 997 Pacific Hwy, Pymble, NSW, 2073
this pair is interesting.... they might be the same, but maybe not...

eg 5
Gibbon Hamor Commercial Interiors, 233 Johnston St, Annandale, NSW, 2038
Gibbon Hamor Development Planners, 233 Johnston St, Annandale, NSW, 2038
this pair is also interesting for the same reasons.

shingling

shingling is a way of generating a set that represents a bit of data which can be used for comparisons
eg. the 4 bigram shingles of "the cat sat on the cat" are...

  1. the cat
  2. cat sat
  3. sat on
  4. on the
(note: this is a set so we only count the shingle "the cat" once)

the jaccard index

the jaccard index is a simple measure of how similiar two sets are.
it's simply the ratio of the size of the intersection of the sets and the size of the union of the sets.

eg. if J(A,B) is jaccard index between sets A and B
and A = {1,2,3}, B = {2,3,4}, C = {4,5,6},
then J(A,B) = 2/4 = 0.5,
and J(A,C) = 0/6 = 0,
and J(B,C) = 1/5 = 0.2
so the most "similiar" sets are A and B and the least similiar are A and C
(note also J(A,A) = J(B,B) = J(C,C) = 1)

putting it all together

so given two business name/addresses we can build a shingling set for each and use the jaccard index to decide how similiar they are.
we'll use bigrams for building our sets but lets use character bigrams, not word bigrams.
this is since the documents are quite small and we want to include puncutation in the comparisons...

lets run through our above examples again...

eg 1
Burra Hotel, 5 Market Sq, Burra, SA, 5417
is represented by the set of 2 character-gram shingles
{" 5", " B", " H", " M", " S", ", ", "17", "41", "5 ", "54", "A,", "Bu", "Ho", "Ma", "SA", "Sq", "a ", "a,", "ar", "el", "et", "ke", "l,", "ot", "q,", "ra", "rk", "rr", "t ", "te", "ur"}

Camping Country Superstore, 401 Pacific Hwy, Belmont North, NSW, 2280
is represented by the set of 2 character-gram shingles
{" 2", " 4", " B", " C", " H", " N", " P", " S", ", ", "01", "1 ", "22", "28", "40", "80", "Be", "Ca", "Co", "Hw", "NS", "No", "Pa", "SW", "Su", "W,", "ac", "am", "c ", "ci", "e,", "el", "er", "fi", "g ", "h,", "ic", "if", "in", "lm", "mo", "mp", "ng", "nt", "on", "or", "ou", "pe", "pi", "re", "rs", "rt", "ry", "st", "t ", "th", "to", "tr", "un", "up", "wy", "y ", "y,"}

they have an intersection size of 6 shingles and a union size of 87 shingles, hence a jaccard index of 6/87 = 0.068

eg 2
One Stop Bakery, 1304 High St Rd, Wantirna, VIC, 3152 and
One Stop Bakery, 1304 High Street Rd, Wantirna South, VIC, 3152
have an intersection size of 46 shingles and a union size of 57 shingles, hence a jaccard index of 46/57 = 0.807

eg 3
a) Park Beach Interiors, Showroom Park Beach Plaza Pacific Hwy, Coffs Harbour, NSW, 2450
b) Park Beach Interiors, Showroom Park Beach Plaza Pacific Highway, Coffs Harbour, NSW, 2450
c) Park Beach Interiors, Park Beach Plaza Pacific Hwy, Coffs Harbour, NSW, 2450
d) Park Beach Interiors, 26 Park Beach Plaza, Pacific Hwy, Coffs Harbour, NSW, 2450
have indexes J(ab)=0.888, J(ac)=0.861, J(ad)=0.808, J(bc)=0.760, J(bd)=0.716, J(cd)=0.932

eg 4
Weaver Interiors, 955 Pacific Hwy, Pymble, NSW, 2073 and
Weaver Interiors, 997 Pacific Hwy, Pymble, NSW, 2073
have an intersection size of 43 shingles and a union size of 49 shingles, hence a jaccard index of 43/49 = 0.877

eg 5
Gibbon Hamor Commercial Interiors, 233 Johnston St, Annandale, NSW, 2038 and
Gibbon Hamor Development Planners, 233 Johnston St, Annandale, NSW, 2038
have an intersection size of 49 shingles and a union size of 76 shingles, hence a jaccard index of 49/76 = 0.644

conclusion

though there is no obvious magic cutoff point it seems to give pretty good values.
it would find some obvious duplicates, though would require a bit of human double checking to make sure.

here's a histogram of the frequency of resemblance values from the comparison of all pairs of 2000 name addresses
(a total of 1,999,000 comparisons and notice the y scale is logarithmic)

algorithmic discussion

order n squared sucks

the jaccard coefficient is, unfortunately, not transistive
(ie if we know J(A,B) and J(B,C) it tells use nothing about J(A,C)

naively then to determine the pair with the highest similarity requires we compare every element with every other element.
this is O(n2) and O(n2) sucks since we are looking at (n(n-1))/2 comparisons, joy!

lets examine some of the ruby runtimes
num recordscomparisonstime
501,2250.2s
1004,9500.9s
25031,1255.6s
500124,75024s
750280,87552s
20001,999,0006m 57s

and just say i ran this over a subset of the full data, say, 1,000,000 records
it would be 499,999,500,000 comparisons
and at about 300,000 per minute we'll be here till christmas (2011)

( luckily the actual data allows me to do something which reduces the runtime to be O(n) but i'm not going to talk about it out of work)

bit level optimisation in c++

i decided to reimplement this in c++ and go the whole hog by using a bit level representation of the data to wring everything out of the machine.

the big question is: how to optimise the jaccard index calculation? it's where the time is spent.

consider the shingle sets for "cat" and "mat", ie {"ca","at"} and {"ma","at"}
we can convert shingles to ints by taking all the unique ones and mapping them to ints from a sequence starting at 0
ie { "ca" => 0, "at" => 1, "ma" => 2}
giving us the two equivalent shingle sets {0,1} and {2,1}
finally we can use the values in these sets to set bits in a nibble
giving us the two nibbles 0011 (setting bits 0 and 1) and 0110 (setting bits 2 and 1)

now consider the bit representations and the results of the bitwise operators | and &
  0011 (equivalent to {"ca","at"})
  0110 (equivalent to {"ma","at"})
& 0010 => and'ing the bits strings gives us their intersection!
| 0111 => or'ing the bits strings gives us their union!

the number of bits set in x0010 (size of intersection) is 1 and
the number of bits set in x0111 (size of union) is 3
so the jaccard index of "cat" and "mat" is 1/3

note: we can count the number of bits set with a crazy bit of c like
inline int count_number_bits_set(long l) {
 unsigned int c;
  for(c=0;l;c++)
   l &= l-1;
 return c;
}

(thanks to brian kernighan for that one)

using this method we can calculate the union or intersection of a 4 byte long (ie 32 set elements) in a single | or &!
bamm!

finally we can use the awesome openmp library ( available as part of gcc since 4.2 )
with two additional lines of code (both pragma statements) we can give hints to the compiler where the code can be multithreaded

num recordscomparisonsruby timec++ timec++ openmp time
501,2250.29s0.008s0.013s
1004,9500.97s0.01s0.013s
25031,1255.5s0.04s0.04s
500124,75022s0.12s0.09s
1000499,5001m 30s0.37s0.2s
20001,999,0006m 34s1.2s0.5s
40007,998,000?7.4s1.8s
800031,996,000?21s6.2s
16000127,992,000??26s

so the ruby code is getting about 5,000 a second
the single threaded c++ implementation is getting about 1,500,000 a second
and the c++ implementation using openmp on a quad core box (utilising about 350% cpu) is getting about 5,000,000 a second
this is a speed up of about 1,000 times
booya! that's more like it!

now lets consider the jaccard distance after which we'll consider the simhash algorithm as a way of avoiding all that O(n2) nastiness.

june 11 2009
me on twitter
me on google+