brain of mat kelcey
fuzzy jaccard
July 31, 2012 at 08:00 PM | categories: text, similarity
the jaccard coefficient is one of the fundamental measures for doing set similarity. ( recall jaccard(set1, set2) = |intersection| / |union|. when set1 == set2 this evaluates to 1.0 and when set1 and set2 have no intersection it evaluates to 0.0 )

one thing that's always annoyed me about it though is that it loses any sense of partial similarity. as a set based measure it's all or nothing.

so consider the sets set1 = {i1, i2, i3} and set2 = {i1, i2, i4}

jaccard(set1, set2) = 2/4 = 0.5 which is fine given you have no prior info about the relationship between...
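
as a quick sanity check of the arithmetic above, here's a minimal python sketch of the plain (non fuzzy) jaccard coefficient. the function name and the string items are just illustrative, and returning 1.0 for two empty sets is an assumption made here, since the definition is 0/0 in that case.

```python
def jaccard(set1, set2):
    """plain jaccard coefficient: |intersection| / |union|."""
    union = set1 | set2
    if not union:
        # assumption: two empty sets are treated as identical
        return 1.0
    return len(set1 & set2) / len(union)


set1 = {"i1", "i2", "i3"}
set2 = {"i1", "i2", "i4"}

# intersection is {i1, i2}, union is {i1, i2, i3, i4}
print(jaccard(set1, set2))  # 0.5
```

note that i3 and i4 contribute nothing to the score no matter how closely related they might be, which is the all or nothing behaviour described above.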
old projects...
- latent semantic analysis via the singular value decomposition (for dummies)
- semi supervised naive bayes
- statistical synonyms
- round the world tweets
- decomposing social graphs on twitter
- do it yourself statistically improbable phrases
- should i burn it?
- the median of a trillion numbers
- deduping with resemblance metrics
- simple supervised learning / should i read it?
- audioscrobbler experiments
- chaoscope experiment