Posts Tagged ‘e12’

e12.3 stat syns FAIL!

Friday, February 5th, 2010

after quite a bit of hacking the statistical synonyms idea doesn’t seem to give terribly interesting results, so i’m going on to do something else.

for the record, here’s what i did do though:

  1. generate 3grams from 800e3 tweets
  2. collect n-grams together that share the same first and last term; eg ‘the blue cat’, ‘the green cat’, ‘the red cat’
  3. for each set generate all the combos of the middle terms; eg ‘blue green’, ‘blue red’, ‘green red’
  4. count the occurrences of each pair
  5. draw a graph of the 150 top occurring pairs
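
steps 1 to 4 above can be sketched roughly as below. this is a minimal illustration in python rather than the original code; the whitespace tokeniser, the choice to count each distinct middle term once per context, and the names `trigrams` and `middle_term_pair_counts` are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def trigrams(tokens):
    """Yield consecutive (first, middle, last) triples from a token list."""
    return zip(tokens, tokens[1:], tokens[2:])


def middle_term_pair_counts(texts):
    """Count how often two terms share a (first, last) context.

    Assumption: naive lowercase whitespace tokenisation, and each distinct
    middle term is counted once per (first, last) context.
    """
    # steps 1 and 2: group the middle terms of 3grams by (first, last)
    middles = defaultdict(set)
    for text in texts:
        tokens = text.lower().split()
        for first, mid, last in trigrams(tokens):
            middles[(first, last)].add(mid)

    # steps 3 and 4: every pair of middle terms in a group scores one count
    pair_counts = Counter()
    for mids in middles.values():
        for pair in combinations(sorted(mids), 2):
            pair_counts[pair] += 1
    return pair_counts


tweets = [
    "the blue cat sat",
    "the green cat ran",
    "the red cat hid",
    "a blue car",
    "a red car",
]
counts = middle_term_pair_counts(tweets)
top_pairs = counts.most_common(150)  # the pairs that would go into the graph
```

on this toy input ‘blue’ and ‘red’ share two contexts (‘the _ cat’ and ‘a _ car’), so that pair ranks first.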

voila! [image: graph.840k.150]

some interesting results. few of the more complex things i was trying worked. they were mainly based on trying to incorporate the frequencies of terms, but it seemed the simplest approach gave the best result (i think it’s because my assumptions about how to use the data were wrong).

here’s the code, feel free to read my notes, correct my terrible statistical assumptions and make a better image!

e12.2 entity set expansion

Thursday, January 28th, 2010

i’ve been doing some reading for my statistical synonyms project and have uncovered a heap of cool papers. most of them are around an idea (from the 1950s!) called the distributional hypothesis, which simply states that words that appear in similar contexts often have similar meanings.

the coolest paper so far is ‘Web-Scale Distributional Similarity and Entity Set Expansion’ by Pantel, Crestan, Borkovsky et al., which has introduced me to an area of research i didn’t really know existed: entity set expansion.

entity set expansion is a bit like thesaurus building for proper nouns; given a seed set of related items can you expand the set to include other semantically similar items?

an example might be brands of japanese motorbikes. starting with ‘yamaha’ and ‘kawasaki’ we might expect the set to be expanded to include ‘honda’.
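
to make the motorbike example concrete, here’s a toy sketch of one way set expansion could work. it is not the method from the paper: it just assumes each term is described by counts of its immediate left and right neighbours, and ranks candidates by cosine similarity to the summed vectors of the seeds. the corpus and the names `context_vectors`, `cosine` and `expand` are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences):
    """Map each term to counts of its immediate left/right neighbours."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, term in enumerate(tokens):
            if i > 0:
                vectors[term][("L", tokens[i - 1])] += 1
            if i + 1 < len(tokens):
                vectors[term][("R", tokens[i + 1])] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand(seeds, vectors, top_n=3):
    """Rank non-seed terms by similarity to the summed seed vectors."""
    centroid = Counter()
    for seed in seeds:
        centroid.update(vectors.get(seed, Counter()))
    scores = {
        term: cosine(centroid, vec)
        for term, vec in vectors.items()
        if term not in seeds
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# hypothetical toy corpus, invented for this example
corpus = [
    "my yamaha motorbike is fast",
    "my kawasaki motorbike is loud",
    "my honda motorbike is new",
    "my banana smoothie is cold",
]
vectors = context_vectors(corpus)
ranked = expand({"yamaha", "kawasaki"}, vectors)
```

here ‘honda’ comes out on top because it shares both neighbours with the seeds, while ‘banana’ only shares the left one and so scores lower.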

i started hacking around in pig but today switched back to ruby for slightly quicker prototyping. who knows, i might give piglet a go!

the code is on github

e12.1 statistical synonyms

Saturday, January 23rd, 2010

i’ve had an idea brewing in my head for a while now, seeded by a great talk by peter norvig about statistical approaches to finding patterns in data.

one thing he alludes to is the generation of synonyms based on n-gram models.

the basic intuition is this; if a corpus contains occurrences of the phrases ‘a x b’ and ‘a y b’ then to some degree x and y are synonymous.

the question becomes: how do we calculate the strength of the relationship? how is it a function of the frequencies of a, b, x, y, ‘a x b’, ‘a y b’, ‘a ? b’ in the corpus? what else can we take into account?
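
as a starting point, one candidate (and only a candidate, it ignores raw frequencies entirely) is the jaccard overlap of the ‘a ? b’ contexts that x and y each appear in. the sketch below assumes the 3grams are already extracted as tuples; `contexts_by_middle` and `jaccard_strength` are names invented for the example.

```python
from collections import defaultdict


def contexts_by_middle(trigram_list):
    """Map each middle term x to the set of (a, b) contexts it appears in."""
    contexts = defaultdict(set)
    for a, x, b in trigram_list:
        contexts[x].add((a, b))
    return contexts


def jaccard_strength(x, y, contexts):
    """Shared contexts of x and y divided by all contexts of either term."""
    cx = contexts.get(x, set())
    cy = contexts.get(y, set())
    union = cx | cy
    return len(cx & cy) / len(union) if union else 0.0


# hypothetical 3grams, invented for this example
grams = [
    ("the", "blue", "cat"),
    ("the", "red", "cat"),
    ("a", "blue", "car"),
    ("a", "red", "car"),
    ("the", "red", "wine"),
    ("the", "fat", "cat"),
]
contexts = contexts_by_middle(grams)
blue_red = jaccard_strength("blue", "red", contexts)
blue_fat = jaccard_strength("blue", "fat", contexts)
```

with these 3grams ‘blue’ and ‘red’ share two of three contexts, so they score higher than ‘blue’ and ‘fat’, which share one of two.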