brain of mat kelcey...
simple text search in ruby using ferret
September 12, 2010 at 09:28 PM | categories: Uncategorized
ferret is a lightweight text search engine for ruby, a bit like lucene but with less (ie no) java.
i've been looking at it today as part of my named entity extraction prototype which needs to be able to fuzzily match one short string against a list of other short strings.
let's go through an example, it's the only way my brain works sorry. making a ferret index is simple; we'll just make a memory based index for this demo.
    require 'ferret'
    index = Ferret::Index::Index.new()
next we'll add a handful of places in africa and europe to our index. each document we add is simply a hash with whatever fields we want to be able to search or return.
    african_places = ['Ain Sefra','Algiers','South Algiers','Batna','Batni']
    african_places.each do |place|
      index << { :continent => 'africa', :name => place }
    end
    european_places = ['Paris','London','Batni']
    european_places.each do |place|
      index << { :continent => 'europe', :name => place }
    end
the simplest query just searches across all fields; in our example these are continent and name. each search hit gives the id of the document found and a relevancy score for ranking. the full contents of a document can be looked up by its id (and are lazily loaded unless an explicit load is given).
    index.search("europe").hits.each { |hit| puts hit.inspect }
    # #<struct Ferret::Search::Hit doc=6, score=0.446250796318054>
    # #<struct Ferret::Search::Hit doc=7, score=0.446250796318054>
    # #<struct Ferret::Search::Hit doc=8, score=0.446250796318054>
    puts index[7].load.inspect
    # {:continent=>"europe", :name=>"London"}
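if it helps to picture what's going on underneath, here is a toy inverted index in plain ruby (no ferret needed). it is only a sketch of the general idea, not how ferret is actually implemented: `ToyIndex` and everything in it is made up for illustration, it has no scoring at all, and it assumes a doc id is just the position at which the document was added.

```ruby
# toy inverted index: NOT ferret's implementation, just the general idea.
# maps [field, term] => list of doc ids. a doc id here is simply the
# position at which the document was added (an assumption of this sketch).
class ToyIndex
  def initialize
    @docs = []
    @postings = Hash.new { |hash, key| hash[key] = [] }
  end

  # add a document (a hash of field => text) and index each lowercased word.
  def <<(doc)
    id = @docs.size
    @docs << doc
    doc.each do |field, text|
      text.downcase.split.uniq.each { |term| @postings[[field, term]] << id }
    end
    self
  end

  # look up a stored document by id.
  def [](id)
    @docs[id]
  end

  # ids of documents containing the term, in one field or in any field.
  def search(term, field = nil)
    fields = field ? [field] : @docs.flat_map(&:keys).uniq
    fields.flat_map { |f| @postings.fetch([f, term.downcase], []) }.uniq.sort
  end
end

toy = ToyIndex.new
%w[Algiers Batna].each { |place| toy << { :continent => 'africa', :name => place } }
toy << { :continent => 'africa', :name => 'South Algiers' }
toy << { :continent => 'europe', :name => 'Paris' }

p toy.search('europe')          # ids of docs matching in any field
p toy.search('algiers', :name)  # ids of docs matching in the name field only
p toy[3]                        # the stored document for id 3
```

a real engine adds analysis, relevancy scoring and on-disk storage on top of this, which is exactly the part you want a library for.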
the query syntax will be very familiar to those who know lucene.
as seen above, the simplest query matches a term in any field. a particular field can be targeted, though, using the query form FIELD:VALUE.
    index.search('name:algiers').hits.each do |hit|
      puts "score=#{hit.score} doc=#{index[hit.doc].load.inspect}"
    end
    # score=2.0986123085022 doc={:continent=>"africa", :name=>"Algiers"}
    # score=1.31163263320923 doc={:continent=>"africa", :name=>"South Algiers"}
wildcarding is done with an asterisk.
    index.search('name:ba*').hits.each do |hit|
      puts "score=#{hit.score} doc=#{index[hit.doc].load.inspect}"
    end
    # score=1.81093037128448 doc={:continent=>"africa", :name=>"Batna"}
    # score=1.81093037128448 doc={:continent=>"africa", :name=>"Batni"}
    # score=1.81093037128448 doc={:continent=>"europe", :name=>"Batni"}
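conceptually a wildcard query gets expanded against the terms that are in the index. a rough sketch of that expansion in plain ruby is below; `wildcard_to_regexp` is a hypothetical helper of mine (not part of ferret), the term list is typed in by hand, and i'm not claiming ferret does it this way internally.

```ruby
# conceptual sketch only: expand a wildcard pattern against a list of terms.
# wildcard_to_regexp is a hypothetical helper, not part of ferret.
def wildcard_to_regexp(pattern)
  # escape everything, then turn the escaped asterisk back into "any characters".
  Regexp.new('\A' + Regexp.escape(pattern).gsub('\*', '.*') + '\z')
end

# lowercased terms from the name field of the example documents
terms = %w[ain sefra algiers south batna batni paris london]
matcher = wildcard_to_regexp('ba*')
p terms.select { |term| term =~ matcher }  # => ["batna", "batni"]
```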
fuzzy search is denoted by a tilde. an optional fuzziness factor can be supplied, from 0 (very fuzzy match) to 1 (exact match only). a reasonable default is assumed if a factor is not given.
    index.search('name:bitna~0.4').hits.each do |hit|
      puts "score=#{hit.score} doc=#{index[hit.doc].load.inspect}"
    end
    # score=1.44874429702759 doc={:continent=>"africa", :name=>"Batna"}
    # score=1.08655822277069 doc={:continent=>"africa", :name=>"Batni"}
    # score=1.08655822277069 doc={:continent=>"europe", :name=>"Batni"}
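to get a feel for why Batna outranks Batni here, it helps to look at edit distance. the sketch below is plain ruby and makes one big assumption: that similarity is 1 minus the levenshtein distance divided by the length of the shorter string, which is the lucene-style formulation as i understand it. ferret's real scoring may well differ, so treat the numbers as illustrative only.

```ruby
# classic dynamic-programming levenshtein edit distance.
def levenshtein(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost].min
    end
    previous = current
  end
  previous.last
end

# ASSUMPTION: similarity = 1 - distance / length of the shorter string.
# this is not taken from ferret's source.
def similarity(a, b)
  1.0 - levenshtein(a, b).to_f / [a.length, b.length].min
end

%w[batna batni].each do |term|
  puts format('%-6s distance=%d similarity=%.1f',
              term, levenshtein('bitna', term), similarity('bitna', term))
end
```

under that assumption "batna" is one edit away (similarity 0.8) and "batni" is two edits away (0.6), so both clear the 0.4 threshold and the closer one ranks first, which matches the ordering above.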
and, not surprisingly, a full set of boolean logic operators is supported.
    index.search('continent:africa AND name:bitna~').hits.each do |hit|
      puts "score=#{hit.score} doc=#{index[hit.doc].load.inspect}"
    end
    # score=1.90322256088257 doc={:continent=>"africa", :name=>"Batna"}
    # score=1.6052508354187 doc={:continent=>"africa", :name=>"Batni"}
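one way to think about the boolean operators is as set operations on the doc ids each sub-query matches. a tiny plain ruby sketch, assuming the usual AND / OR / NOT trio; the id lists are made up for illustration and don't come from a real ferret run.

```ruby
# hypothetical doc ids, made up for illustration (not from a real ferret run)
in_africa   = [0, 1, 2, 3, 4]  # docs matching continent:africa
fuzzy_bitna = [3, 4, 7]        # docs matching name:bitna~

p in_africa & fuzzy_bitna  # AND: ids in both lists      => [3, 4]
p in_africa | fuzzy_bitna  # OR: ids in either list      => [0, 1, 2, 3, 4, 7]
p fuzzy_bitna - in_africa  # AND NOT: fuzzy, not africa  => [7]
```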
though i've no idea how it scales to a larger dataset, it's doing the job pretty well for me with a modest index of approx 250,000 small text items.
loads more api doco is provided on the ferret site.