sketching is another similiar near duplicate finding technique. a very good description of one sketching algorithm is given in s19.6 of manning et al's awesome text introduction to information retrieval
the basic idea is built around hashing a document's shingles a number of times to make a 'sketch' of a document. turns out that the jaccard coefficient of two documents is estimated by the overlap of the document's sketch.
it requires a large datastructure but the building and checking of it is NOT O(n2) in runtime
so trying it out it's very fast! here's my simple ruby implementation
entries | shingling c++ | simhash16 | simhash32 | usimhash64 | sketch |
1,600 | 3s | 27s (67%) | 48s (75%) | 2m 52s (97%) | 3s (95%) |
3,200 | 5s | 58s (50%) | 1m 43s (61%) | 6m 38s (92%) | 15s (94%) |
6,400 | 14s | 5m (48%) | 19m 29s (83%) | 1m 10s (98%) | |
12,800 | 1m | 8m 10s(42%) | 1m 52s (91%) |
however a slight problem
entries | time | memory usage | kb per entry |
1,600 | 3s | 16mb | 10 |
3,200 | 15s | 66mb | 21 |
6,400 | 1m 10s | 290mb | 46 |
12,800 | 1m 52s | 1gb | 81 |
so seems we have encountered a classic time space tradeoff. though the time is approximately linear the algorithm is quadratic in space (we're looking at needing 65gb ram for only 100,000 entries)
we'll have to get out of ram and start writing things to disk...
enter days of our lives like dream sequence skipping forward five months in time....
at this stage i went off and had a good month long hack session in erlang writing an awesome framework for handling the problems associated with this large dataset. it was very interesting, lots and fun and probably the most complex piece of erlang code i've written. however i never really finished it since right near the end of it i realised i was just writing a map reduce framework. again, interesting stuff, but why bother when there is one even more awesome that the one i wrote called hadoop
so after picking up hadoop i started playing with some other bits and pieces, like statistically improbable phrases and some random graph stuff around twitter social networks (like e10 and e11) and am finally back to this resemblance problem. more stuff coming soon!
june 2009 me on twitter me on google+