part 4: a sketching algorithm

what is it?

sketching is another similiar near duplicate finding technique.
a very good description of one sketching algorithm is given in s19.6 of manning et al's awesome text introduction to information retrieval

the basic idea is built around hashing a document's shingles a number of times to make a 'sketch' of a document.
turns out that the jaccard coefficient of two documents is estimated by the overlap of the document's sketch.

it requires a large datastructure but the building and checking of it is NOT O(n²) in runtime

experiments

so trying it out it's very fast! here's my simple ruby implementation

entries shingling c++ simhash16 simhash32 usimhash64 sketch

1,600 3s 27s (67%) 48s (75%) 2m 52s (97%) 3s (95%)

3,200 5s 58s (50%) 1m 43s (61%) 6m 38s (92%) 15s (94%)

6,400 14s 5m (48%) 19m 29s (83%) 1m 10s (98%)

12,800 1m 8m 10s(42%) 1m 52s (91%)

however a slight problem

entries time memory usage kb per entry

1,600 3s 16mb 10

3,200 15s 66mb 21

6,400 1m 10s 290mb 46

12,800 1m 52s 1gb 81

so seems we have encountered a classic time space tradeoff.
though the time is approximately linear the algorithm is quadratic in space (we're looking at needing 65gb ram for only 100,000 entries)

we'll have to get out of ram and start writing things to disk...

enter days of our lives like dream sequence skipping forward five months in time....

at this stage i went off and had a good month long hack session in erlang writing an awesome framework for handling the problems associated with this large dataset. it was very interesting, lots and fun and probably the most complex piece of erlang code i've written. however i never really finished it since right near the end of it i realised i was just writing a map reduce framework. again, interesting stuff, but why bother when there is one even more awesome that the one i wrote called hadoop

so after picking up hadoop i started playing with some other bits and pieces, like statistically improbable phrases and some random graph stuff around twitter social networks (like e10 and e11) and am finally back to this resemblance problem. more stuff coming soon!

june 2009
me on twitter
me on google+

<< back to other nerdy projects

part 1: resemblance with the jaccard coefficient

part 2: fastmap projection using jaccard distances

part 3: the simhash algorithm

part 4: a sketching algorithm

what is it?

experiments

entries	shingling c++	simhash16	simhash32	usimhash64	sketch
1,600	3s	27s (67%)	48s (75%)	2m 52s (97%)	3s (95%)
3,200	5s	58s (50%)	1m 43s (61%)	6m 38s (92%)	15s (94%)
6,400	14s		5m (48%)	19m 29s (83%)	1m 10s (98%)
12,800	1m		8m 10s(42%)		1m 52s (91%)

entries	time	memory usage	kb per entry
1,600	3s	16mb	10
3,200	15s	66mb	21
6,400	1m 10s	290mb	46
12,800	1m 52s	1gb	81