latent semantic analysis via the singular value decomposition (for dummies)

April 19th, 2010

i’ve been trying to get a deeper understanding of latent semantic analysis for a while now.
last week i came to the conclusion the only way to truly understand it would be to start from the ground up
so here goes; mat’s guide to latent semantic analysis via the singular value decomposition (for dummies)

cool bash stuff; mkfifo

April 15th, 2010

mkfifo is one of those shell commands provided as part of coreutils that not many people seem to know about.

here’s a (semi contrived) example, close to something i did the other day, to show how awesome it is

say you have a number of largish presorted files; run-00 to run-03; and you want to find the most frequent lines. you could do something like the following…

sort -m run-* | uniq -c | sort -nr | head
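for comparison, here’s a rough python sketch of what that pipeline computes. it’s only an illustration; `most_frequent_lines` is a made up name and each run is assumed to be an already sorted iterable of lines.

```python
import heapq
from itertools import groupby


def most_frequent_lines(sorted_runs, n=10):
    """merge presorted runs, count repeated lines, return the n most frequent."""
    merged = heapq.merge(*sorted_runs)                  # like `sort -m`
    counts = [(len(list(group)), line)                  # like `uniq -c`
              for line, group in groupby(merged)]
    counts.sort(key=lambda pair: pair[0], reverse=True)  # like `sort -nr`
    return counts[:n]                                   # like `head`


# three tiny presorted "runs" standing in for run-00, run-01, ...
runs = [["a", "b", "b"], ["b", "c"], ["a", "c", "c"]]
top = most_frequent_lines(runs, n=2)
```

open file objects can be passed in place of the lists since they iterate line by line. the order of lines with equal counts may differ from what `sort -nr` gives.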


e10.6 community detection for my twitter network

April 4th, 2010

last night i applied my network decomposition algorithm to a graph of some of the people near me in twitter.

first i built a friend graph for 100 people ‘around’ me (taken from a crawl i did last year). by ‘friend’ i mean that if alice follows bob then bob also follows alice.
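that definition of ‘friend’ boils down to keeping only the reciprocal follows. a minimal sketch, assuming the crawl is available as (follower, followee) pairs; `mutual_friends` and the sample names are made up for illustration.

```python
def mutual_friends(follows):
    """keep only reciprocal follow pairs, returned as undirected edges."""
    follows = set(follows)
    return {tuple(sorted((a, b)))
            for a, b in follows
            if a != b and (b, a) in follows}


follows = [("alice", "bob"), ("bob", "alice"), ("alice", "carol")]
friends = mutual_friends(follows)   # carol doesn't follow alice back, so that edge is dropped
```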

here’s the graph. some things to note though; it was an unfinished crawl (can a crawl of twitter EVER be finished?) and was done in october last year so is a bit out of date.

(image: friends)

e10.5 revisiting community detection

March 30th, 2010

i’ve decided to switch back to some previous work i did on community detection in (social) graphs

the last chunk of code i wrote which tried to deal with weighted directed graphs was terribly, terribly, broken but it seems that simplifying to undirected graphs is giving me much saner results. yay!

here’s an example of my work in progress generated from the new version of the code

consider the graph

(image: p97)

and its corresponding decomposition

(image: p97.dendrogram)

the results are reasonable; the initial breaking of clusters [1,2,3,4,5,6] and [7,8,9,10,11,12] is the most obvious but some of the others are not as intuitive

[1,2,5] and [7,8,10] remain as unbreakable cliques though it’s arbitrary that 11 was broken off from [7,8,10] instead of 10 (arbitrary but an artifact related to my shortest path calculation for the edge betweenness)

the idea of identifying the edge to remove using edge betweenness works well but it is often the case that there are many edges with the same maximal betweenness and you have to choose only one. i think my current implementation of picking one is a bit naive and i’m not sure if i should move to a stochastic / monte carlo style approach or focus more on modularity maximisation
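to make the tie problem concrete, here’s a small sketch of edge betweenness for an unweighted, undirected graph. it is not the implementation discussed above; it uses the usual breadth first search plus back-accumulation approach (brandes style), and the function names and the adjacency dict format are assumptions for the example.

```python
from collections import defaultdict, deque


def edge_betweenness(adj):
    """edge betweenness for an unweighted, undirected graph.

    adj maps each node to the set of its neighbours.
    """
    score = defaultdict(float)
    for source in adj:
        # breadth first search, counting shortest paths to every node
        dist = {source: 0}
        paths = defaultdict(int)
        paths[source] = 1
        preds = defaultdict(list)
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # walk back from the farthest nodes, sharing credit across edges
        credit = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                share = paths[v] / paths[w] * (1 + credit[w])
                score[frozenset((v, w))] += share
                credit[v] += share
    # every pair of nodes was counted from both ends, so halve
    return {edge: s / 2 for edge, s in score.items()}


def max_betweenness_edges(adj):
    """all edges tied for the highest betweenness."""
    scores = edge_betweenness(adj)
    top = max(scores.values())
    return [edge for edge, s in scores.items() if abs(s - top) < 1e-9]


# two triangles joined by one bridge: the bridge is the clear winner
bridge = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
          4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
# a plain 4 cycle: every edge ties, so some rule has to pick one
square = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
```

on the bridge graph `max_betweenness_edges` returns just the 3-4 edge, but on the square all four edges tie, which is exactly the case where the choice of which one to remove becomes arbitrary.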

brutally short intro to collaborative filtering

March 18th, 2010

my favourite recommendation system is the collaborative filter; it gives good results
and is easy to understand and extend as required.

it works on the intuition that
if i like coffee, chocolate and ice cream
and you like coffee and chocolate
you might also like ice cream
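that intuition fits in a few lines of python. this is only a toy sketch; scoring by the raw count of shared likes is an assumption made for the example (a real filter would use a proper similarity measure), and `recommend` is a made up name.

```python
from collections import Counter


def recommend(likes, user):
    """score items the user hasn't liked by how much taste they share with others."""
    mine = likes[user]
    scores = Counter()
    for other, theirs in likes.items():
        if other == user:
            continue
        overlap = len(mine & theirs)      # how much taste we share
        for item in theirs - mine:        # things they like that the user doesn't yet
            scores[item] += overlap
    return [item for item, score in scores.most_common() if score > 0]


likes = {
    "me": {"coffee", "chocolate", "ice cream"},
    "you": {"coffee", "chocolate"},
}
suggestions = recommend(likes, "you")
```

here `suggestions` holds just ice cream, because ‘you’ and ‘me’ already overlap on coffee and chocolate.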


sentiment analysis training data using mechanical turk

March 12th, 2010

i want to try doing some sentiment analysis work on tweets but i need some good training data.

i could label a heap of tweets myself as being positive, neutral or negative but instead this seems to be the perfect job for mechanical turk

so i put up 100 ‘cream cheese’ tweets on mechanical turk, asked for 3 opinions per tweet and offered $0.01 per opinion. took under 30 minutes to get back all 300 opinions and only cost $4.50 ($3 for the work, $1.50 admin fee)

the results are interesting in themselves…

mostly they are consistent;

for example all three sentiments for “bagels and cream cheese for breakfast. very original” were neutral

and all three sentiments for “very few things are as good as a warm nyc bagel with cream cheese first thing in the am” were positive.

but occasionally they aren’t consistent;

the tweet “developing a recipe for orange cream cheese swirled cardamom brownies… that’s too long a name. hmm… suggestions?” had one positive, one neutral and one negative

interestingly there was no case of a tweet having all opinions being negative; even “bad idea. dont eat bagel with mixed berry cream cheese, right after u washed ur mouth with listerine.” ended up with two negatives and one positive (?)

hmmmm
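one simple way to turn three opinions per tweet into a single training label is a majority vote, leaving three way splits unlabelled. a minimal sketch; the `aggregate` name and the choice to return no label on a tie are assumptions, not something i’ve settled on.

```python
from collections import Counter


def aggregate(opinions):
    """return (label, unanimous) for one tweet; label is None when the top spot is tied."""
    ranked = Counter(opinions).most_common()
    top_label, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None, False            # e.g. one positive, one neutral, one negative
    return top_label, top_count == len(opinions)


consistent = aggregate(["neutral", "neutral", "neutral"])
split = aggregate(["positive", "neutral", "negative"])
listerine = aggregate(["negative", "negative", "positive"])
```

the unanimous flag makes it easy to keep only the tweets where all three opinions agree if a cleaner (but smaller) training set is wanted.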

mongodb + twitter + yahoo term extractor = fun!

March 7th, 2010

ran a little experiment in using yahoo term extraction yesterday and it worked well enough. here’s some code to pass some text to yahoo and get back an array of terms

i’ve got to say mongodb is such an easy tool for working with json data. these 20 odd lines insert a text json tweet stream into mongo. so simple, why can’t all code be this easy…

what to do with a week off?

February 22nd, 2010

this week i’m between jobs so i have (a little) more time than usual to hack.

i’ve got a list of pending things to do but can’t decide what to do next, here’s my list in (sort of) priority order…

  • fix up my numerical underflow / overflow problems in my recent semi supervised classification project.
  • work through the exercises from the first few chapters of introductory statistics with r and all of statistics. i’m particularly keen to write an intro stats blog post about statistical significance.
  • do this mongodb tute i found; shouldn’t take too long.
  • do a weka screencast. i did some little talks at work lately about weka and they seemed to be interesting enough to others that it might be worth doing a screencast on it.
  • do some work on modelling of periodic functions. seemed like trending topics is an interesting area at the moment and this would be a good chance to learn some more about R. fourier series look like a potential solution. there is also some interesting stuff to do in this area around majority evaluation from a stream of data.
  • finish my work on detecting resemblance with hadoop. something that’s been hanging over my head for about 2 years is the first piece of work i did that led me onto hadoop. i’ve had a long running project on resemblance that ended up with me writing a map/reduce framework in erlang (until i (re)discovered hadoop).
  • revisit mahout, it’s looking a bit more polished nowadays.
  • redo and finish my project on latent semantic analysis; need to include some comparison work with probabilistic latent semantic analysis and latent dirichlet allocation (which is close to winning the scariest-formulas-on-a-wikipedia-page award)
  • finish my twitter classifier; haven’t worked on it since lists were introduced and i think they would be an interesting addition to the algorithm.

decisions, decisions….

semi supervised naive bayes for text classification

February 14th, 2010

experiment 13; a test of semi supervised naive bayes for text classification is complete.

semi supervised algorithms seem to work pretty well and i can see how they are a huge benefit for text classification where you can have an enormous corpus but not enough time to label it all…

e12.3 stat syns FAIL!

February 5th, 2010

after quite a bit of hacking the statistical synonyms idea doesn’t seem to give terribly interesting results so i’m going on to do something else.

for the record here’s what I did do though….

  1. generate 3grams from 800e3 tweets
  2. collect n-grams together that share the same first and last term; eg ‘the blue cat’, ‘the green cat’, ‘the red cat’
  3. for each set generate all the combos of the middle terms; eg ‘blue green’, ‘blue red’, ‘green red’
  4. count the occurrences of each pair
  5. draw a graph of the 150 top occurring pairs
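steps 1 to 4 above can be sketched roughly as follows. this is an illustration, not the actual code; the whitespace tokeniser, the `middle_term_pairs` name and counting each pair once per shared first/last context are all assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations


def middle_term_pairs(tweets):
    """count pairs of middle terms that appear between the same first and last term."""
    middles = defaultdict(set)
    for tweet in tweets:
        terms = tweet.lower().split()                  # naive tokeniser (assumption)
        for first, middle, last in zip(terms, terms[1:], terms[2:]):  # step 1: 3grams
            middles[(first, last)].add(middle)         # step 2: group by first and last term
    pairs = Counter()
    for group in middles.values():
        for pair in combinations(sorted(group), 2):    # step 3: combos of middle terms
            pairs[pair] += 1                           # step 4: count each pair
    return pairs


tweets = ["the blue cat", "the green cat", "the red cat",
          "a blue car", "a red car"]
pairs = middle_term_pairs(tweets)
```

`pairs.most_common(150)` would then give the top pairs to feed into whatever draws the graph in step 5.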

(image: graph.840k.150)

voila!

some interesting results. few of the more complex things i was trying worked. they were mainly based on trying to incorporate the frequencies of terms but it seemed the simplest approach gave the best result (i think it’s because my assumptions about how to use the data were wrong).

here’s the code, feel free to read my notes, correct my incorrect terrible statistical assumptions and make a better image!