so far we have considered

- whether a word occurs for a given class (read or ignore)
- the frequency of words per class
- the frequency of words per test article

now let's try markov chains; a technique that considers the actual sequence of words

a markov chain is a way of modelling the probabilistic transitions between a number of states

let's do an example and model the road rage levels of two people, calm carl and agro arnold

carl and arnold can be in one of two states; calm or agro

if carl is calm there's a 70% chance he'll stay calm and a 30% chance he'll lose it and turn agro

once he's agro there's only a 10% chance he'll stay agro, the other 90% of the time he becomes calm again

arnold on the other hand only has a 30% chance of staying calm, and once he's agro has a 50% chance of staying agro

then given a sequence of **calm** and **agro** states we can compare it to both chains and see which person it was likely to be

eg consider **calm -> calm -> agro -> agro**

the probability of carl's behaviour being like this is

```
= p(calm -> calm) x p(calm -> agro) x p(agro -> agro)
= 0.7 x 0.3 x 0.1
= 0.021
```

the probability for this being an example of arnold's behaviour is

`= p(calm -> calm) x p(calm -> agro) x p(agro -> agro)`

`= 0.3 x 0.7 x 0.5`

`= 0.105`

which after normalisation gives

carl `= 0.021 / 0.126 = 17%`

arnold `= 0.105 / 0.126 = 83%`

so it's far more likely this was arnold
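as a sketch in python (the dict layout and helper name are mine), the two chains and the comparison above might look like:

```python
# transition probabilities, keyed as chain[current_state][next_state]
carl = {
    'calm': {'calm': 0.7, 'agro': 0.3},
    'agro': {'calm': 0.9, 'agro': 0.1},
}
arnold = {
    'calm': {'calm': 0.3, 'agro': 0.7},
    'agro': {'calm': 0.5, 'agro': 0.5},
}

def sequence_probability(chain, states):
    # multiply the transition probabilities along the observed sequence
    p = 1.0
    for current, nxt in zip(states, states[1:]):
        p *= chain[current][nxt]
    return p

observed = ['calm', 'calm', 'agro', 'agro']
p_carl = sequence_probability(carl, observed)      # 0.7 * 0.3 * 0.1 = 0.021
p_arnold = sequence_probability(arnold, observed)  # 0.3 * 0.7 * 0.5 = 0.105
print(p_arnold / (p_carl + p_arnold))              # ~0.83, so probably arnold
```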

an article in our classification problem can be represented as a markov chain where the nodes are words and transitions exist between adjacent words in the article

eg the article `'the end is the beginning is the end'` can be represented by the chain

from this we can see that `is` always follows `beginning` and `end` follows `the` 2/3rds of the time

using this chain we can decide on the probabilities of other articles having been generated by the system that generated this one

eg which of `'the end is the end'` and `'the beginning is the beginning'` is more likely?

define `p(word1 -> word2)` as the probability of word2 following word1 in an article

```
p('the end is the end')
= p(the -> end) x p(end -> is) x p(is -> the) x p(the -> end)
= 2/3 x 1/2 x 2/2 x 2/3
= 2/9
```

```
p('the beginning is the beginning')
= p(the -> beginning) x p(beginning -> is) x p(is -> the) x p(the -> beginning)
= 1/3 x 1/1 x 2/2 x 1/3
= 1/9
```

so `'the end is the end'` is twice as likely as `'the beginning is the beginning'`
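a sketch of this in python (function names are mine); note the denominator is the number of occurrences of the first word, which is what makes `p(end -> is)` come out as 1/2 above:

```python
from collections import Counter, defaultdict

def build_chain(words):
    # count transitions between adjacent words, plus word occurrences
    transitions = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        transitions[a][b] += 1
    return transitions, Counter(words)

def article_probability(transitions, occurrences, words):
    # p(word1 -> word2) = times word2 followed word1 / occurrences of word1
    p = 1.0
    for a, b in zip(words, words[1:]):
        p *= transitions[a][b] / occurrences[a]
    return p

transitions, occurrences = build_chain('the end is the beginning is the end'.split())
p1 = article_probability(transitions, occurrences, 'the end is the end'.split())
p2 = article_probability(transitions, occurrences, 'the beginning is the beginning'.split())
print(p1, p2)  # 2/9 and 1/9
```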

multiple articles can be used to build a markov chain representing a particular feed

eg the chain for the two articles `'the end is the beginning is the end'` and `'the end of the road'` is

given our two article example we have clear transition probabilities for `p(the -> road)` and `p(end -> of)`

but what are the values for `p(the -> is)`, `p(blah -> is)` or even `p(blah -> foo)`?

strictly these are all zero probabilities but we've seen previously how zeros clobber everything so we should assign these as 0/N and apply a laplace estimator

but the question is; what N to use?

for `p(the -> is)` 0/3 might be sensible since there are 3 edges in total leaving `'the'`

`p(blah -> is)` is a bit harder though since there are no edges leaving `'blah'`

there are edges entering `'is'` but the probabilities are driven by `p(a -> b)` not `p(b <- a)`

`p(blah -> foo)` is harder still since we have no info either way on nodes `'blah' and 'foo'`

we'll start with the simplest approach of all and just assign all these cases the low but sensible probability of `0 / total_number_of_edges` and see how we go
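one way to sketch this in python (a plain add-one laplace estimator; the exact smoothing scheme and the edge count of 15 here are illustrative assumptions):

```python
def smoothed_probability(count, denominator, total_edges):
    # add-one laplace estimate: unseen transitions get a small non-zero
    # probability instead of clobbering the whole product with a 0
    return (count + 1) / (denominator + total_edges)

# p(the -> is): 0 observed transitions out of 3 edges leaving 'the',
# in a chain with (say) 15 edges in total
print(smoothed_probability(0, 3, 15))   # 1/18

# p(blah -> foo): no information either way, so fall back to
# 0 / total_number_of_edges before smoothing
print(smoothed_probability(0, 15, 15))  # 1/30
```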

special consideration can also be made for the first and last words of an article when building a chain

we can include special nodes for START and END in a chain and then

for articles that start with `'word'` we can include an edge from `START -> 'word'`

for articles that end with `'word'` we can include an edge from `'word' -> END`
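building such a chain over several articles might look like (a sketch, names mine):

```python
from collections import Counter, defaultdict

def build_chain(articles):
    # transition counts over all articles, with special START and END
    # nodes marking where each article begins and ends
    transitions = defaultdict(Counter)
    for article in articles:
        words = ['START'] + article.split() + ['END']
        for a, b in zip(words, words[1:]):
            transitions[a][b] += 1
    return transitions

chain = build_chain([
    'the end is the beginning is the end',
    'the end of the road',
])
print(chain['START'])       # both articles start with 'the'
print(chain['end']['END'])  # one article ends with 'end'
```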

revisiting the chain for two articles `'the end is the beginning is the end'` and `'end of the road'` we have

we can ask the question

which of `'the end of road end'` and `'beginning of the road'` is more likely?

```
p('the end of road end')
= p(START -> the) x p(the -> end) x p(end -> of) x p(of -> road) x p(road -> end) x p(end -> END)
= 1/2 x 3/5 x 1/3 x 0/15 x 0/15 x 1/3
= 15/30 x 18/30 x 10/30 x 0/30 x 0/30 x 10/30
= 16/36 x 19/36 x 11/36 x 1/36 x 1/36 x 11/36 (with estimator)
= 0.000017
```

```
p('beginning of the road')
= p(START -> beginning) x p(beginning -> of) x p(of -> the) x p(the -> road) x p(road -> END)
= 0/15 x 0/15 x 1/1 x 1/5 x 1/1
= 0/15 x 0/15 x 15/15 x 3/15 x 15/15
= 1/20 x 1/20 x 16/20 x 4/20 x 16/20 (with estimator)
= 0.00032
```

which after normalisation gives

p('the end of road end') `= 0.000017 / 0.000337 = 5%`

p('beginning of the road') `= 0.00032 / 0.000337 = 95%`

so `'beginning of the road'` is a more probable match to these two articles

notice in the last example how our prenormalised probabilities are getting to be pretty small numbers

it's not going to take long before the numbers are less than a machine can accurately handle

what can we do about it? logarithms save the day!

consider the two probability products we had from our first carl vs arnold example

carl was `0.7 x 0.3 x 0.1`, arnold was `0.3 x 0.7 x 0.5`

it turns out in this case, and in fact all the cases we've considered to date,

we're not interested in the *actual* values of the probability products

just whether one is greater than another

we're trying to decide if

`0.7 x 0.3 x 0.1 > 0.3 x 0.7 x 0.5`

recall from high school maths the following two facts about logarithms

1) they preserve order; ie `log(a) > log(b) if a > b`

2) they *reduce* multiplication to addition; ie `log(ab) = log(a) + log(b)`

so `0.7 x 0.3 x 0.1 > 0.3 x 0.7 x 0.5`

is equivalent to `log(0.7 x 0.3 x 0.1) > log(0.3 x 0.7 x 0.5)` (from 1)

is equivalent to `log(0.7) + log(0.3) + log(0.1) > log(0.3) + log(0.7) + log(0.5)` (from 2)

bye bye multiplication of really small numbers, hellooooo high precision comparisons :)
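in python the carl vs arnold comparison done in log space looks like:

```python
import math

carl_probs = [0.7, 0.3, 0.1]
arnold_probs = [0.3, 0.7, 0.5]

log_carl = sum(math.log(p) for p in carl_probs)
log_arnold = sum(math.log(p) for p in arnold_probs)

# same ordering as the raw products, but summing logs doesn't underflow
# the way multiplying thousands of tiny probabilities would
print(log_arnold > log_carl)  # True, arnold is still the better match
```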

so coming back to our classification problem we can build two chains; one for articles to read and one for articles to ignore

each new article can be compared to each chain to see which is more likely to be applicable
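pulling the pieces together, a toy sketch of that classifier (all names mine; smoothing here is plain add-one over a nominal vocabulary size, not necessarily the exact scheme used for the numbers above):

```python
import math
from collections import Counter, defaultdict

def build_chain(articles):
    # transition counts over all articles, with START/END boundary nodes
    transitions = defaultdict(Counter)
    for article in articles:
        words = ['START'] + article.split() + ['END']
        for a, b in zip(words, words[1:]):
            transitions[a][b] += 1
    return transitions

def log_likelihood(chain, article, vocab_size):
    # sum of log smoothed transition probabilities; logs instead of a
    # raw product so long articles don't underflow
    words = ['START'] + article.split() + ['END']
    total = 0.0
    for a, b in zip(words, words[1:]):
        seen = chain[a][b]
        leaving = sum(chain[a].values())
        total += math.log((seen + 1) / (leaving + vocab_size))
    return total

def classify(article, read_chain, ignore_chain, vocab_size=100):
    read_score = log_likelihood(read_chain, article, vocab_size)
    ignore_score = log_likelihood(ignore_chain, article, vocab_size)
    return 'read' if read_score > ignore_score else 'ignore'
```

with `read_chain` built from articles you read and `ignore_chain` from ones you skipped, each new article goes to whichever chain gives it the higher log likelihood.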

how does it compare to our previous techniques?

but first does it do better including or excluding the start and end probabilities?

turns out it's slightly better when we include these probabilities

and how does it do against our other classifier methods?

pretty poorly actually...

and it's interesting how wildly varying the markov chain results are

view the code at github.com/matpalm/rss-feed-experiments

also download the dot files used to generate the markov chain images

july 2008