simple supervised learning

part 4: should i read it? multinomial bayes

what is it?

multinomial bayes is a variation of naive bayes that considers not the frequency of articles in a class but the frequencies of the individual words in a class

example revisited

let's revisit our test data from the last experiment, with some slight variations
text                    feed          should read it?
linux on the linux      the register  yes (rule 1)
cat on ferrari          autoblog      no
on the hollywood        perezhilton   no
on lamborghini on cat   autoblog      yes (rule 2)
hollywood cat           perezhilton   no
the lamborghini         perezhilton   yes (rule 2)
cat on linux cat        the register  yes (rule 1)

considering just a few of the words, this breaks down to

word        total occurrences   number in should read   number in should ignore
on          6                   4                       2
linux       3                   3                       0
the         3                   2                       1
hollywood   2                   0                       2
totals      14                  9                       5
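
to make the counting concrete, here's a small python sketch (an illustration, not the actual code from the repo) that builds these per class word counts from the training data above

    from collections import Counter

    # training data: (article text, should we read it?)
    training = [
        ("linux on the linux",    True),
        ("cat on ferrari",        False),
        ("on the hollywood",      False),
        ("on lamborghini on cat", True),
        ("hollywood cat",         False),
        ("the lamborghini",       True),
        ("cat on linux cat",      True),
    ]

    # count word occurrences per class, not article occurrences per class
    counts = {True: Counter(), False: Counter()}
    for text, should_read in training:
        counts[should_read].update(text.split())

    # reproduce the table for the words we're considering
    words = ["on", "linux", "the", "hollywood"]
    for word in words:
        read, ignore = counts[True][word], counts[False][word]
        print(word, read + ignore, read, ignore)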

from this table we can derive various word / class probabilities, including...

P('on' | read) = 4/9 = 44%
P('linux' | ignore) = 0/5 = 0%
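
the conditional probabilities fall straight out of those counts. continuing the sketch above, with class totals taken over just the four words in the table

    # total word occurrences per class, over just the words considered
    total = {c: sum(counts[c][w] for w in words) for c in (True, False)}

    def p(word, should_read):
        # P(word | class) as a simple relative frequency
        return counts[should_read][word] / total[should_read]

    print(p("on", True))      # 4/9 ~= 0.44
    print(p("linux", False))  # 0/5  = 0.0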

we can use a multinomial distribution to determine the probability of a given test article under each class

should we read 'linux the linux'?

so for the test article 'linux the linux' (three words: 'linux' twice, 'the' once) the probability of should read is

P('linux the linux' | read) = 3!/(2! 1!) x P('linux' | read)^2 x P('the' | read)
                            = 3 x (3/9)^2 x (2/9)

which is roughly 7.4%

and for the same article the probability of should ignore is

P('linux the linux' | ignore) = 3!/(2! 1!) x P('linux' | ignore)^2 x P('the' | ignore)
                              = 3 x (0/5)^2 x (1/5)

which is strictly 0%, since 'linux' never appears in a should ignore article. using a laplace estimator (as seen in the last experiment) to smooth away the zero count we instead get 2.2%, less than the should read probability, so the classifier would recommend we read this article
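
putting it together, here's a sketch of the multinomial scoring for 'linux the linux' (continuing the code above; it uses plain add-one laplace smoothing, so the smoothed number may differ slightly from the 2.2% above depending on the exact estimator used)

    from math import factorial

    def laplace_p(word, should_read):
        # add-one smoothing so unseen words don't zero out the whole product
        return (counts[should_read][word] + 1) / (total[should_read] + len(words))

    def multinomial_score(article, should_read, prob=None):
        prob = prob or p
        word_counts = Counter(article.split())
        n = sum(word_counts.values())
        # multinomial coefficient n! / (x1! x2! ...)
        coef = factorial(n)
        for x in word_counts.values():
            coef //= factorial(x)
        score = coef
        for word, x in word_counts.items():
            score *= prob(word, should_read) ** x
        return score

    article = "linux the linux"
    print(multinomial_score(article, True))              # 3 x (3/9)^2 x (2/9) ~= 0.074
    print(multinomial_score(article, False))             # strictly 0.0
    print(multinomial_score(article, False, laplace_p))  # small but non-zero

note that the multinomial coefficient is the same for both classes of a given article, so if all we care about is which class wins it can be dropped (and in practice the products are usually done in log space to avoid underflow)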

run against the big data set

so how does this algorithm do against the 13,500 articles we have from theregister, perezhilton and autoblog?
whereas naive bayes did worse than simple word occurrences, multinomial bayes kicks ass!

the graph to the left shows the accuracy of the three classification algorithms we've discussed so far
(thick lines denote the median performance of an algorithm over a number of runs;
crosses denote individual cross validation runs)

well, i've had enough of bayes, let's try a classifier based on markov chains!

view the code at github.com/matpalm/rss-feed-experiments

july 2008