on the weekend i did another experiment, using shingling and the jaccard index to try to determine whether two sets of data are “duplicates”.
it works quite well, and the code includes ruby and c++ versions with low level bit operations.
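
for anyone who hasn't met the two ideas: shingling breaks each item into a set of overlapping n-grams, and the jaccard index is the size of the intersection of two such sets divided by the size of their union. here is a minimal ruby sketch of that, not the code from the project; the names `shingles`, `jaccard` and `resemblance`, the use of character shingles, and the shingle length of 3 are all assumptions made for illustration.

```ruby
require 'set'

# hypothetical helper: the set of overlapping character n-grams ("shingles")
# in a string. n = 3 is an arbitrary choice for this sketch.
def shingles(text, n = 3)
  chars = text.downcase.chars
  # a string shorter than n has no n-grams, so treat it as a single shingle
  return Set.new([chars.join]) if chars.size < n
  Set.new(chars.each_cons(n).map(&:join))
end

# jaccard index: |a & b| / |a | b|, ranging from 0.0 (disjoint) to 1.0 (identical)
def jaccard(a, b)
  union_size = (a | b).size
  return 1.0 if union_size.zero? # two empty sets are treated as identical
  (a & b).size.to_f / union_size
end

# hypothetical convenience wrapper: shingle both strings, then compare the sets
def resemblance(s1, s2, n = 3)
  jaccard(shingles(s1, n), shingles(s2, n))
end

# "abcd" -> {abc, bcd}, "bcde" -> {bcd, cde}
# 1 shared shingle ("bcd") out of 3 distinct ones, so the score is 1/3
puts resemblance("abcd", "bcde")
```

a score near 1.0 suggests a likely duplicate; where to put the cut-off is a judgement call that depends on the data.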
project page is www.matpalm.com/resemblance
code at github.com/matpalm/resemblance
i was going to put the discussion here but the page ended up too long. next time i’ll break it into chunks, maybe.
Tags: algorithms, c++, deduplication, ruby