i’ve been trying to get a deeper understanding of latent semantic analysis for a while now.
last week i came to the conclusion the only way to truly understand it would be to start from the ground up
so here goes; mat’s guide to latent semantic analysis via the singular value decomposition (for dummies)
latent semantic analysis via the singular value decomposition (for dummies)
April 19th, 2010
cool bash stuff; mkfifo
April 15th, 2010
mkfifo is one of those shell commands provided as part of coreutils that not many people seem to know about.
here’s a (semi contrived) example close to something i did the other day to show how awesome it is
say you have a number of largish presorted files; run-00 to run-03; and you want to find the most frequent lines. you could do something like the following…
sort -m run-* | uniq -c | sort -nr | head
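for anyone who'd rather see the same idea outside the shell, here's a rough python sketch of what that pipeline does. it assumes each run is already sorted (same as `sort -m` does); the function name `top_lines` and the toy data are made up for illustration, and ties may come out in a different order than `sort -nr` would give.

```python
import heapq
from itertools import groupby


def top_lines(sorted_runs, n=10):
    """Return the n most frequent lines as (count, line) pairs.

    sorted_runs: iterables of lines, each one already sorted.
    """
    # merge the presorted runs into one sorted stream (like `sort -m`)
    merged = heapq.merge(*sorted_runs)
    # collapse adjacent duplicates into (count, line) pairs (like `uniq -c`)
    counts = ((len(list(group)), line) for line, group in groupby(merged))
    # keep only the n biggest counts (like `sort -nr | head`)
    return heapq.nlargest(n, counts)


# toy stand-ins for the presorted files
runs = [
    ["apple", "apple", "cherry"],
    ["apple", "banana"],
    ["banana", "cherry", "cherry"],
]
result = top_lines(runs, n=2)
print(result)
```

with real files you could pass open file handles instead of lists, since a file object iterates line by line.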
brutally short intro to collaborative filtering
March 18th, 2010
my favourite recommendation system is the collaborative filter; it gives good results
and is easy to understand and extend as required.
it works on the intuition that
if i like coffee, chocolate and ice cream
and you like coffee and chocolate
you might also like ice cream
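a minimal sketch of that intuition, assuming likes are stored as plain sets per user. the scoring (count how many likes two people share) is just one simple choice for illustration, not necessarily what the full post uses; `recommend` and the `likes` dict are hypothetical names.

```python
from collections import Counter


def recommend(likes, user):
    """Suggest items liked by people who share likes with `user`."""
    mine = likes[user]
    scores = Counter()
    for other, theirs in likes.items():
        if other == user:
            continue
        overlap = len(mine & theirs)   # how many likes we have in common
        for item in theirs - mine:     # things they like that `user` hasn't got
            scores[item] += overlap
    # best scoring suggestions first, dropping anything with no shared likes
    return [item for item, score in scores.most_common() if score > 0]


likes = {
    "i": {"coffee", "chocolate", "ice cream"},
    "you": {"coffee", "chocolate"},
}
print(recommend(likes, "you"))
```

real systems usually normalise the overlap (eg jaccard or cosine similarity) so that people who like everything don't dominate.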
sentiment analysis training data using mechanical turk
March 12th, 2010
want to try doing some sentiment analysis work on tweets but i need some good training data.
i could label a heap of tweets myself as being positive, neutral or negative but instead this seems to be the perfect job for mechanical turk
so i put up 100 ‘cream cheese’ tweets on mechanical turk, asked for 3 opinions per tweet and offered $0.01 per opinion. took under 30 minutes to get back all 300 opinions and only cost $4.50 ($3 for the work, $1.50 admin fee)
the results are interesting in themselves…
mostly they are consistent;
for example all three sentiments for ‘bagels and cream cheese for breakfast. very original’ were neutral
and all three sentiments for ‘very few things are as good as a warm nyc bagel with cream cheese first thing in the am’ were positive.
but occasionally they aren’t consistent;
the tweet ‘developing a recipe for orange cream cheese swirled cardamom brownies… that’s too long a name. hmm… suggestions?’ had one positive, one neutral and one negative
interestingly there was no case of a tweet having all opinions being negative; even ‘bad idea. dont eat bagel with mixed berry cream cheese, right after u washed ur mouth with listerine.’ ended up with two negatives and one positive (?)
hmmmm
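one way to turn three opinions per tweet into a single training label is a simple majority vote, keeping track of which tweets the workers disagreed on. this is only a sketch of that idea (the post doesn't say how the labels were combined); `summarise` is a made up name. the cost check just redoes the arithmetic above in cents.

```python
from collections import Counter


def summarise(opinions):
    """Return (majority label or None, True if every worker agreed)."""
    counts = Counter(opinions)
    label, votes = counts.most_common(1)[0]
    # only accept a label that more than half the workers chose
    majority = label if votes > len(opinions) / 2 else None
    return majority, len(counts) == 1


print(summarise(["neutral", "neutral", "neutral"]))     # consistent
print(summarise(["positive", "neutral", "negative"]))   # no majority at all
print(summarise(["negative", "negative", "positive"]))  # majority, not unanimous

# cost check in cents: 100 tweets x 3 opinions x 1 cent, plus the 150 cent fee
tweets, opinions_per_tweet, cents_per_opinion, fee_cents = 100, 3, 1, 150
total_cents = tweets * opinions_per_tweet * cents_per_opinion + fee_cents
print(total_cents)
```

tweets that come back with no majority could be dropped or sent out for more opinions.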
mongodb + twitter + yahoo term extractor = fun!
March 7th, 2010
ran a little experiment in using yahoo term extraction yesterday and it worked well enough. here’s some code to pass some text to yahoo and get back an array of terms
i’ve got to say mongodb is such an easy tool for working with json data. these 20 odd lines insert a text json tweet stream into mongo. so simple, why can’t all code be this easy…
what to do with a week off?
February 22nd, 2010
this week i’m between jobs so i have (a little) more time than usual to hack.
i’ve got a list of pending things to do but can’t decide what to do next, here’s my list in (sort of) priority order…
- fix up my numerical underflow / overflow problems in my recent semi supervised classification project.
- work through the exercises from the first few chapters of introductory statistics with r and all of statistics. i’m particularly keen to write an intro stats blog post about statistical significance.
- do this mongodb tute i found; shouldn’t take too long.
- do a weka screencast. i did some little talks at work lately about weka and they seemed to be interesting enough to others that it might be worth doing a screencast on it.
- do some work on modelling of periodic functions. seemed like trending topics is an interesting area at the moment and this would be a good chance to learn some more about R. fourier series look like a potential solution. there is also some interesting stuff to do in this area around majority evaluation from a stream of data.
- finish my work on detecting resemblance with hadoop. something that’s been hanging over my head for about 2 years is the first piece of work i did that led me onto hadoop. i’ve had a long running project on resemblance that ended up with me writing a map/reduce framework in erlang (until i (re)discovered hadoop).
- revisit mahout, it’s looking a bit more polished nowadays.
- redo and finish my project on latent semantic analysis; need to include some comparison work with probabilistic latent semantic analysis and latent dirichlet allocation (which is close to winning the scariest-formulas-on-a-wikipedia-page award)
- finish my twitter classifier; haven’t worked on it since lists were introduced and i think they would be an interesting addition to the algorithm.
decisions, decisions….
semi supervised naive bayes for text classification
February 14th, 2010
experiment 13; a test of semi supervised naive bayes for text classification is complete.
semi supervised algorithms seem to work pretty well and i can see how they are a huge benefit for text classification where you can have an enormous corpus but not enough time to label it all…
e12.3 stat syns FAIL!
February 5th, 2010
after quite a bit of hacking the statistical synonyms idea doesn’t seem to give terribly interesting results so i’m going on to do something else.
for the record here’s what I did do though….
- generate 3grams from 800e3 tweets
- collect n-grams together that share the same first and last term; eg ‘the blue cat’, ‘the green cat’, ‘the red cat’
- for each set generate all the combos of the middle terms; eg ‘blue green’, ‘blue red’, ‘green red’
- count the occurrences of each pair
- draw a graph of the 150 top occurring pairs
voila! click this image for a bigger version
some interesting results. few of the more complex things i was trying were working. they were mainly based on trying to incorporate the frequencies of terms but it seemed the simplest gave the best result (i think it’s because my assumptions about how to use the data were wrong).
here’s the code, feel free to read my notes, correct my incorrect terrible statistical assumptions and make a better image!
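as a rough standalone sketch of the steps listed above (this is not the linked code, just an illustration): build 3grams, group them by first and last term, then count pairs of middle terms. it assumes whitespace tokenising and counts each pair once per shared context, both of which are guesses; `middle_term_pairs` is a made up name and the graph drawing step is left out.

```python
from collections import Counter, defaultdict
from itertools import combinations


def middle_term_pairs(texts):
    """Count pairs of middle terms that appear between the same first/last terms."""
    groups = defaultdict(set)
    for text in texts:
        terms = text.lower().split()  # assumption: naive whitespace tokeniser
        # slide a window of three terms over the text to get the 3grams
        for first, middle, last in zip(terms, terms[1:], terms[2:]):
            groups[(first, last)].add(middle)
    pair_counts = Counter()
    for middles in groups.values():
        # all combos of middle terms sharing this context, eg ('blue', 'red')
        for pair in combinations(sorted(middles), 2):
            pair_counts[pair] += 1
    return pair_counts


texts = [
    "the blue cat",
    "the green cat",
    "the red cat",
    "a blue car",
    "a red car",
]
pairs = middle_term_pairs(texts)
print(pairs.most_common(1))
```

taking `pairs.most_common(150)` would give the candidate edges for a graph like the one above.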


