Posts Tagged ‘deduplication’

fastmap and the jaccard distance

Friday, October 31st, 2008

given a set of pairwise distances, how do you find points that correspond to those distances?

my latest experiment considers this problem for jaccard distances, a dissimilarity measure closely related to the jaccard coefficient used in a previous experiment (the distance is simply one minus the coefficient)

using the fastmap algorithm we get points from distances, and once you have points you have visualisation!
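
to make the "points from distances" idea concrete, here is a rough python sketch of fastmap driven by a jaccard distance. it is not the code from the experiment: the function names, the choice of starting pivot and the default of k=2 dimensions are all my own assumptions. the projection formula and the "jump to the farthest point twice" pivot heuristic follow the usual description of the algorithm. jaccard distance is not euclidean, so residual squared distances are clamped at zero.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets count as identical
    return 1.0 - len(a & b) / len(union)


def fastmap(items, dist, k=2):
    """Map each item to a point in k dimensions using only pairwise distances.

    `items`, `dist` and `k` are illustrative names, not from the original post.
    """
    n = len(items)
    coords = [[0.0] * k for _ in range(n)]

    def d2(i, j, dim):
        # squared distance left over after the first `dim` coordinates
        total = dist(items[i], items[j]) ** 2
        for c in range(dim):
            total -= (coords[i][c] - coords[j][c]) ** 2
        return max(total, 0.0)  # clamp: the input need not be euclidean

    for dim in range(k):
        # pivot heuristic: start at item 0, jump to the farthest item,
        # then to the item farthest from that one
        a = 0
        b = max(range(n), key=lambda j: d2(a, j, dim))
        a = max(range(n), key=lambda j: d2(b, j, dim))
        dab2 = d2(a, b, dim)
        if dab2 == 0.0:
            break  # nothing left to separate, remaining coordinates stay 0
        dab = dab2 ** 0.5
        for i in range(n):
            # project item i onto the line through the two pivots
            coords[i][dim] = (d2(a, i, dim) + dab2 - d2(b, i, dim)) / (2 * dab)
    return coords


if __name__ == "__main__":
    docs = [{1, 2, 3}, {1, 2, 3}, {7, 8, 9}]
    for doc, point in zip(docs, fastmap(docs, jaccard_distance, k=2)):
        print(sorted(doc), point)
```

with k=2 the resulting points can go straight into a scatter plot, which is where the visualisation comes from.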

shingling and the jaccard index

Monday, October 6th, 2008

on the weekend i did another experiment using shingling and the jaccard index to try to determine if two sets of data were “duplicates”

it works quite well, and the code comes in both a ruby and a c++ version with low level bit operations.
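
for a quick feel of the idea, here is a minimal python sketch of shingling plus the jaccard index. it is not the ruby or c++ code from the project and it skips the bit operations entirely. character shingles, the shingle length of 3 and the function names are assumptions made just for this example.

```python
def shingles(text, n=3):
    """Return the set of all n-character substrings of text."""
    if len(text) < n:
        return {text}  # assumption: a short text is its own single shingle
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_index(a, b):
    """|a & b| / |a | b| for two sets."""
    union = a | b
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(union)


def resemblance(s, t, n=3):
    """Jaccard index of the shingle sets of two strings."""
    return jaccard_index(shingles(s, n), shingles(t, n))


if __name__ == "__main__":
    print(resemblance("12 main street", "12 main st"))
    print(resemblance("12 main street", "98 ocean drive"))
```

a value near 1 suggests the two strings are "duplicates" and a value near 0 suggests they are unrelated. where to draw the line is a judgement call that depends on the data.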

project page is www.matpalm.com/resemblance

code at github.com/matpalm/resemblance

i was going to put the discussion here but the page ended up too long. next time i’ll break it into chunks, maybe.