shingling and the jaccard index

on the weekend i did another experiment using shingling and the jaccard index to try to determine whether two sets of data were “duplicates”.

it works quite well, and the project includes ruby and c++ versions with low level bit operations.
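for a rough idea of what the technique looks like, here is a minimal python sketch. it is not the project's code (that is in ruby and c++); the function names, the use of character shingles and the shingle size of 3 are all assumptions made for illustration. each string is broken into overlapping substrings (shingles) and the jaccard index is the size of the intersection of the two shingle sets divided by the size of their union.

```python
def shingles(text, k=3):
    """Return the set of all k-character substrings (shingles) of text.

    k=3 is an arbitrary choice for this sketch, not a value from the project.
    """
    if len(text) < k:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {text} if text else set()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b|."""
    if not a and not b:
        # Convention chosen here: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(a | b)


def resemblance(s1, s2, k=3):
    """Similarity of two strings in [0, 1], based on their shingle sets."""
    return jaccard(shingles(s1, k), shingles(s2, k))
```

a score near 1 suggests the two strings are near duplicates and a score near 0 suggests they share almost nothing. where to put the cutoff for calling something a “duplicate” depends on the data.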

project page is www.matpalm.com/resemblance

code at github.com/matpalm/resemblance

i was going to put the discussion here but the page ended up too long. next time i’ll break it into chunks, maybe.

