brain of mat kelcey
simple text search in ruby using ferret
September 12, 2010 at 09:28 PM | categories: search, ruby, ferret
ferret is a lightweight text search engine for ruby, a bit like lucene but with less (ie no) java.

i've been looking at it today as part of my named entity extraction prototype which needs to be able to fuzzily match one short string against a list of other short strings.

let's go through an example, it's the only way my brain works sorry.

making a ferret index is simple; we'll just make a memory based index for this demo.

    require 'ferret'
    index = Ferret::Index::Index.new()

next we'll add a handful of places in africa and europe to our index.

each document we add is simply a hash...
shingling and the jaccard index
October 06, 2008 at 11:30 AM | categories: ruby, algorithms, deduplication, c++
on the weekend i did another experiment using shingling and the jaccard index to try to determine if two sets of data were “duplicates”.

it works quite well and includes a ruby and c++ version with low level bit operations.

project page is www.matpalm.com/resemblance
code at github.com/matpalm/resemblance...
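to make the idea concrete, here's a minimal sketch of shingling plus the jaccard index in plain ruby. it is not the code from the resemblance project (that one uses low level bit operations); the method names `shingles`, `jaccard` and `resemblance`, the choice of character shingles and the default shingle size of 3 are all assumptions made for illustration.

```ruby
require 'set'

# break a string into the set of its overlapping character n-grams (shingles).
# strings no longer than n are treated as a single shingle.
# (character shingles and n = 3 are assumptions for this sketch)
def shingles(text, n = 3)
  chars = text.downcase.chars
  return Set[text.downcase] if chars.size <= n
  (0..chars.size - n).map { |i| chars[i, n].join }.to_set
end

# jaccard index: size of the intersection divided by size of the union.
# two empty sets are treated as identical.
def jaccard(a, b)
  union = a | b
  return 1.0 if union.empty?
  (a & b).size.to_f / union.size
end

# resemblance of two strings: jaccard index of their shingle sets
def resemblance(s1, s2, n = 3)
  jaccard(shingles(s1, n), shingles(s2, n))
end
```

for example, `resemblance("abcde", "abcdf")` gives 0.5: the shingle sets are {abc, bcd, cde} and {abc, bcd, cdf}, which share 2 of 4 distinct shingles. a score near 1.0 suggests the two strings are probably duplicates; where to put the cutoff depends on the data.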
old projects...
- latent semantic analysis via the singular value decomposition (for dummies)
- semi supervised naive bayes
- statistical synonyms
- round the world tweets
- decomposing social graphs on twitter
- do it yourself statistically improbable phrases
- should i burn it?
- the median of a trillion numbers
- deduping with resemblance metrics
- simple supervised learning / should i read it?
- audioscrobbler experiments
- chaoscope experiment