simple supervised learning

part 1: should i read it? the m-algorithm

there is lots and lots of stuff online, way more than there is time to read.
more importantly there's heaps i wouldn't want to read, even if i did have the time.
how can a machine help me decide what to read and what to ignore?

let's write a program that tries to decide, given an article from an rss feed, whether i would like it.
its inputs will be the words of the article and its output will be whether i should read it or not.

as a measure of how good the algorithm is i'll consider the percentage of times i agree with its recommendation.
so, as a start, what's the worst performance an algorithm could have? that i agree with it 0% of the time?
not really. agreeing with it 0% of the time is the same as disagreeing with it 100% of the time.
in that case i'd just do what it told me not to do; ie read the article if it said i didn't like it and vice versa.
actually the worst performance is if i agree with it 50% of the time.

with that said the first algorithm that we'll consider as a baseline for performance is a coin toss.

but to make comparisons easier we actually want a deterministic coin toss,
(ie one that returns the same value each time you ask it the same question.)
so let's decide to read the article if the first letter of the first word is in the first half of the alphabet (letters before and including m).
i call this the m-algorithm. (and i'll write m in italics to make it look a bit more scientific)
it should agree with me roughly 50% of the time.
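
in ruby the decision might look something like the following minimal sketch. (m_algorithm_says_read? is just a name i'm making up for illustration.)

  def m_algorithm_says_read?(article)
    # read the article iff the first letter of its first word
    # falls in the first half of the alphabet (a..m)
    first_letter = article.split.first.downcase[0]
    ('a'..'m').cover?(first_letter)
  end

  m_algorithm_says_read?("machine learning is fun")  # => true
  m_algorithm_says_read?("zebras are stripey")       # => false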

but how do we test this idea?
i'm too lazy to read a bunch of articles and say whether i liked them or not.
but i can say two things...

  1. i wouldn't want to read any articles from los angeles celebrity gossip rss feed perezhilton.com
  2. i do read most articles from the rss feed for theregister.co.uk

so can the m-algorithm correctly recommend i read each article from the register and recommend i don't read any article from perez hilton?
let's take 4,500 articles from each of the feeds over the last six months and see how it goes...
for the 9,000 articles it has an accuracy of 53.1%, pretty much what we expected.

here's the code and a small dataset to try it with.
> bunzip2 -c perez_register_small.bz2 | ./really_complex_m_algorithm.rb
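
i won't walk through really_complex_m_algorithm.rb here, but the guts of the evaluation might look something like this sketch. it assumes each line of the dataset is a label, a tab, then the article text, with 1 meaning an article i'd read and 0 one i wouldn't; the real file format could easily differ. it uses the m_algorithm_says_read? helper from above.

  correct = total = 0
  ARGF.each_line do |line|
    label, text = line.chomp.split("\t", 2)
    next if text.nil? || text.empty?
    # predict with the m-algorithm and compare against my label
    prediction = m_algorithm_says_read?(text) ? '1' : '0'
    correct += 1 if prediction == label
    total += 1
  end
  printf("accuracy %.1f%%\n", 100.0 * correct / total)

reading from ARGF means it works equally well on a file argument or on stdin, which is why the bunzip2 pipe above works.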

so we've got a baseline. let's try something a tad more intelligent and consider word occurrences.

july 08 2008