simple supervised learning

part 3: should i read it? the naive bayes method


our last experiment in seeing whether a word exists in a feed does pretty well, but can we do better?
let's try the tried and true machine learning classification technique of naive bayes
(it's bayes because it makes use of that cornerstone of probability, bayes theorem;
i'll describe why it's naive in a bit)

furthermore, 80% success from our last experiment is too good; let's make the problem harder!
we'll introduce articles from another feed, the auto industry related autoblog,
and change the rules to be that...

  1. i want to read anything from theregister
  2. i want to read any article with the word lamborghini in it

a quick peek at our collected data shows...

rss feed | total articles | articles with the word 'lamborghini'

long example

let's work through an example

consider the articles below (our training data) and whether we should read them

text                 feed          should read it?
on the linux         the register  yes (rule 1)
cat on ferrari       autoblog      no
on the hollywood     perezhilton   no
lamborghini on cat   autoblog      yes (rule 2)
hollywood cat        perezhilton   no
the lamborghini      perezhilton   yes (rule 2)
cat on linux         the register  yes (rule 1)

in terms of word occurrences (we'll add word frequencies soon enough) our training data breaks down to

text                 on  the  linux  cat  ferrari  hollywood  lamborghini  read?
on the linux         Y   Y    Y      N    N        N          N            Y
cat on ferrari       Y   N    N      Y    Y        N          N            N
on the hollywood     Y   Y    N      N    N        Y          N            N
lamborghini on cat   Y   N    N      Y    N        N          Y            Y
hollywood cat        N   N    N      Y    N        Y          N            N
the lamborghini      N   Y    N      N    N        N          Y            Y
cat on linux         Y   N    Y      Y    N        N          N            Y
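
(as an aside, here's a minimal python sketch of how such a table can be built;
it's not the actual code behind this post, just an illustration with made up names)

    # toy training data from the tables above: (article text, should read?)
    training_data = [
        ("on the linux", True),        # rule 1: it's from theregister
        ("cat on ferrari", False),
        ("on the hollywood", False),
        ("lamborghini on cat", True),  # rule 2: contains 'lamborghini'
        ("hollywood cat", False),
        ("the lamborghini", True),     # rule 2
        ("cat on linux", True),        # rule 1
    ]

    # the vocabulary is every word seen in training
    vocab = sorted({w for text, _ in training_data for w in text.split()})

    # one row per article: word occurrences plus the read/ignore decision
    print("words considered:", vocab)
    for text, read in training_data:
        words = set(text.split())
        occurrences = ["Y" if w in words else "N" for w in vocab]
        print(text.ljust(20), " ".join(occurrences), "->", "read" if read else "ignore")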

from our word occurrence table we can determine some terribly exciting probabilities
if P(X) denotes the probability of X occurring, then
we have decided to read 4 of the 7 articles; so P(read) = 4/7
therefore we have decided to ignore the other 3; so P(ignore) = 3/7
how excitement!

we can also determine some probabilities relating the occurrence of words to whether we should read or ignore
if P(A|B) denotes the probability of event A occurring given event B has occurred, then
of the 4 articles to read 2 have the word 'lamborghini'
so the probability of the word lamborghini given we decided to read the article, P('lamborghini'|read) = 2/4
of the 3 articles to ignore none have the word 'lamborghini'
so the probability of the word lamborghini given we decided to ignore the article, P('lamborghini'|ignore) = 0/3
mind bending stuff indeed.
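
(continuing the sketch from above, these probabilities fall straight out of counting)

    # split the training articles by decision; each article becomes a set of words
    read_articles   = [set(text.split()) for text, read in training_data if read]
    ignore_articles = [set(text.split()) for text, read in training_data if not read]

    # the priors: P(read) = 4/7, P(ignore) = 3/7
    p_read   = len(read_articles) / len(training_data)
    p_ignore = len(ignore_articles) / len(training_data)

    def p_word_given(word, articles):
        # fraction of the given articles containing the word
        return sum(word in article for article in articles) / len(articles)

    print(p_word_given("lamborghini", read_articles))    # P('lamborghini'|read)   = 2/4
    print(p_word_given("lamborghini", ignore_articles))  # P('lamborghini'|ignore) = 0/3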

but who cares about stuff we already know, we're interested in stuff we don't know yet!
so say we are given a new article, should we read it or not?

text                        feed     should read it?
the hollywood lamborghini   unknown  ?

we are interested in two probabilities

  1. the probability that i should read an article given the article contains the words 'the hollywood lamborghini'
  2. the probability that i should ignore an article given the article contains the words 'the hollywood lamborghini'
if the probability for reading is higher we'll read it; if the probability of ignoring is higher we'll ignore it.

so using the P(A|B) notation we're interested in the maximum of the probabilities
max(P(read|'the hollywood lamborghini'), P(ignore|'the hollywood lamborghini'))

but how do we relate our known probabilities (such as P('lamborghini'|read)) to this function?

enter the mighty bayes theorem!

again the maximum we're looking for is

max( P(read|'the hollywood lamborghini') , P(ignore|'the hollywood lamborghini') )

which, from bayes theorem, is equivalent to

max( P('the hollywood lamborghini'|read) . P(read) / P('the hollywood lamborghini') ,
     P('the hollywood lamborghini'|ignore) . P(ignore) / P('the hollywood lamborghini') )

since the denominators are the same all that we care about is

max( P('the hollywood lamborghini'|read) . P(read) ,
     P('the hollywood lamborghini'|ignore) . P(ignore) )

and, if we make the naive assumption that the probabilities of 'the', 'hollywood' and 'lamborghini' occurring in an article are totally independent, we can say

P('the hollywood lamborghini'|read) = P('the'|read) . P('hollywood'|read) . P('lamborghini'|read)
(and likewise for ignore)

which, if we substitute into our equality, gives us

max( P('the'|read) . P('hollywood'|read) . P('lamborghini'|read) . P(read) ,
     P('the'|ignore) . P('hollywood'|ignore) . P('lamborghini'|ignore) . P(ignore) )

now, as luck would have it, these are exactly the sort of probabilities we can determine from our training data as we saw earlier!

P('the'|read) = 2/4      P('hollywood'|read) = 0/4      P('lamborghini'|read) = 2/4      P(read) = 4/7
P('the'|ignore) = 1/3    P('hollywood'|ignore) = 2/3    P('lamborghini'|ignore) = 0/3    P(ignore) = 3/7

substituting back into our max function gives
max ( (2/4 . 0/4 . 2/4 . 4/7) , (1/3 . 2/3 . 0/3 . 3/7) )
= max ( (1/2 . 0 . 1/2 . 4/7) , (1/3 . 2/3 . 0 . 3/7) )
= max ( 0 , 0 )
= 0
those pesky 0's completely clobbered everything
what to do?

it turns out laplace solved this one 200 or so years ago
we can, errr, avoid the zeros by adding one to each numerator and compensating by adding the number of words under consideration (three, here) to each denominator
this technique is called laplace's rule of succession

doing this for the P(A|B) probabilities we have

P('the'|read) = 3/7      P('hollywood'|read) = 1/7      P('lamborghini'|read) = 3/7
P('the'|ignore) = 2/6    P('hollywood'|ignore) = 3/6    P('lamborghini'|ignore) = 1/6

now we have the function
max ( (3/7 . 1/7 . 3/7 . 4/7) , (2/6 . 3/6 . 1/6 . 3/7) )
= max ( 0.0149 , 0.0119 )
= max ( 56% , 44% ) [normalisation]

so the probability we should read it (56%) just outweighs the probability we should ignore it (44%),
hence we should read this article.
(phew, what a lot of work...)
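
(if you'd rather let the computer do all that work, here's how the whole calculation might look in python, building on the sketch above; again, not the post's actual code)

    def smoothed_p_word_given(word, articles, num_words):
        # laplace adjusted: add one to the numerator and the number of
        # words under consideration to the denominator
        return (sum(word in article for article in articles) + 1) / \
               (len(articles) + num_words)

    def should_read(text):
        # return the normalised (read, ignore) probabilities for an article
        words = text.split()
        score_read, score_ignore = p_read, p_ignore
        for word in words:
            score_read   *= smoothed_p_word_given(word, read_articles, len(words))
            score_ignore *= smoothed_p_word_given(word, ignore_articles, len(words))
        total = score_read + score_ignore  # normalise so the pair sums to one
        return score_read / total, score_ignore / total

    print(should_read("the hollywood lamborghini"))  # roughly (0.56, 0.44) -> read it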

another short example

let's quickly run through another example...

text                    feed     should read it?
the hollywood ferrari   unknown  ?


applying the same laplace adjustment (again three words, so the same compensation on the denominators) gives the probabilities

P('the'|read) = 3/7      P('hollywood'|read) = 1/7      P('ferrari'|read) = 1/7
P('the'|ignore) = 2/6    P('hollywood'|ignore) = 3/6    P('ferrari'|ignore) = 2/6

and the max function
max ( (3/7 . 1/7 . 1/7 . 4/7) , (2/6 . 3/6 . 2/6 . 3/7) )
= max ( 0.0050 , 0.0238 )
= max ( 17% , 83% ) [normalisation]
which is a strong indicator to ignore this article

run against the big dataset

firstly, let's change how we have been running the tests a little
instead of training a bit, testing a bit, etc, we'll use a process called cross validation

in cross validation we break the set of articles into a number of groups, say 16
for each of the groups we train a classifier with the 15 other groups and then use the selected group for testing
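
(a quick sketch of the idea; train() and accuracy() below are hypothetical stand-ins for whichever classifier is being tested, the real code is linked at the end)

    import random

    def cross_validate(articles, num_folds=16):
        # shuffle, deal the articles into num_folds groups, then for each
        # group train on the rest and test on the held out group
        articles = articles[:]
        random.shuffle(articles)
        folds = [articles[i::num_folds] for i in range(num_folds)]
        scores = []
        for i, test_fold in enumerate(folds):
            training = [a for j, fold in enumerate(folds) if j != i for a in fold]
            classifier = train(training)                    # hypothetical helper
            scores.append(accuracy(classifier, test_fold))  # hypothetical helper
        return scores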

so how does this algorithm run against the 13,500 articles we have for theregister, perezhilton and autoblog then?
it turns out it's not as good as the much simpler version from our last experiment

the graph to the left shows the accuracy of the three classification algorithms we've discussed so far
(thick lines denote the median performance of the algorithm over a number of runs,
crosses denote individual cross validation runs)

- the m-algorithm performs the worst
- naive bayes does better
- but the best is still just considering word occurrences

curious! just goes to show that you don't always need a complex solution.

while we're in a bayes mood let's try a slight variation on naive bayes called multinomial bayes

view the code at

july 2008