
brain of mat kelcey


simple text search in ruby using ferret

September 12, 2010 at 09:28 PM | categories: search, ruby, ferret

ferret is a lightweight text search engine for ruby, a bit like lucene but with less (ie no) java.

i've been looking at it today as part of my named entity extraction prototype, which needs to be able to fuzzily match one short string against a list of other short strings.

let's go through an example; it's the only way my brain works, sorry.

making a ferret index is simple; we'll just make a memory based index for this demo.

    require 'ferret'
    index = Ferret::Index::Index.new()

next we'll add a handful of places in africa and europe to our index. each document we add is simply a hash...

shingling and the jaccard index

October 06, 2008 at 11:30 AM | categories: ruby, algorithms, deduplication, c++

on the weekend i did another experiment using shingling and the jaccard index to try to determine if two sets of data were “duplicates”.

it works quite well and includes a ruby and c++ version with low level bit operations.

project page is www.matpalm.com/resemblance

code at github.com/matpalm/resemblance...
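
for readers who haven't met the idea before, here is a minimal sketch of shingling plus the jaccard index in plain ruby. it is not the code from the resemblance project (which uses low level bit operations); the method names `shingles`, `jaccard` and `resemblance`, and the choice of 3 character shingles, are assumptions made purely for illustration.

```ruby
require 'set'

# break a string into its set of overlapping n character shingles.
# n = 3 is an arbitrary choice for this sketch.
def shingles(text, n = 3)
  normalised = text.downcase.gsub(/\s+/, ' ').strip
  # strings no longer than n become a single shingle
  return Set[normalised] if normalised.length <= n
  (0..normalised.length - n).map { |i| normalised[i, n] }.to_set
end

# jaccard index: size of the intersection over size of the union
def jaccard(a, b)
  union = a | b
  return 1.0 if union.empty?
  (a & b).size.to_f / union.size
end

# resemblance of two strings: 1.0 means identical shingle sets,
# 0.0 means no shingles in common
def resemblance(s1, s2, n = 3)
  jaccard(shingles(s1, n), shingles(s2, n))
end
```

calling `resemblance(a, b)` on two strings gives a score between 0.0 and 1.0; a pair could then be treated as likely duplicates when the score is above some threshold, which would need tuning for the data at hand.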

old projects...