<< back to other nerdy projects

part 1: resemblance with the jaccard coefficient

part 2: fastmap projection using jaccard distances

part 3: the simhash algorithm

part 4: a sketching algorithm

what is it?

sketching is another similiar near duplicate finding technique.
a very good description of one sketching algorithm is given in s19.6 of manning et al's awesome text introduction to information retrieval

the basic idea is built around hashing a document's shingles a number of times to make a 'sketch' of a document.
turns out that the jaccard coefficient of two documents is estimated by the overlap of the document's sketch.

it requires a large datastructure but the building and checking of it is NOT O(n2) in runtime


so trying it out it's very fast! here's my simple ruby implementation

entries shingling c++ simhash16 simhash32 usimhash64 sketch
1,600 3s 27s (67%) 48s (75%) 2m 52s (97%) 3s (95%)
3,200 5s 58s (50%) 1m 43s (61%) 6m 38s (92%) 15s (94%)
6,400 14s 5m (48%) 19m 29s (83%) 1m 10s (98%)
12,800 1m 8m 10s(42%) 1m 52s (91%)

however a slight problem
entries time memory usagekb per entry
1,600 3s 16mb10
3,200 15s 66mb21
6,400 1m 10s 290mb46
12,800 1m 52s 1gb81

so seems we have encountered a classic time space tradeoff.
though the time is approximately linear the algorithm is quadratic in space (we're looking at needing 65gb ram for only 100,000 entries)

we'll have to get out of ram and start writing things to disk...

enter days of our lives like dream sequence skipping forward five months in time....

at this stage i went off and had a good month long hack session in erlang writing an awesome framework for handling the problems associated with this large dataset. it was very interesting, lots and fun and probably the most complex piece of erlang code i've written. however i never really finished it since right near the end of it i realised i was just writing a map reduce framework. again, interesting stuff, but why bother when there is one even more awesome that the one i wrote called hadoop

so after picking up hadoop i started playing with some other bits and pieces, like statistically improbable phrases and some random graph stuff around twitter social networks (like e10 and e11) and am finally back to this resemblance problem. more stuff coming soon!

june 2009
me on twitter
me on google+