# simple supervised learning

## so far

so far we have considered

now lets try markov chains; a technique that will consider the actual sequence of words

## whats a markov chain then?

a markov chain is a way of modelling the probabisitic transistions between a number of states

lets do an example and model the road rage levels of two people, calm carl and agro arnold
carl and arnold can be in one of two states; calm or agro

if carl is calm there's a 70% chance he'll stay calm and a 30% chance he'll lose it and turn agro
once he's agro there's only 10% chance he'll stay agro, the other 90% of the time he becomes calm again

arnold of the other hand only has a 30% of staying calm, and once he's agro has a 50% chance of staying agro  then given a sequence of calm and agro states we can compare it to both chains and see which person it was likely to be

eg consider calm -> calm -> agro -> agro

the probability of carl's behaviour being like this is
= p(calm -> calm) . p(calm -> agro) . p(agro -> agro)
= 0.7 . 0.3 . 0.1
= 0.021

the probability for this being an example of arnold's behaviour is
= p(calm -> calm) x p(calm -> agro) x p(agro -> agro)
= 0.3 x 0.7 x 0.5
= 0.105

which after normalisation gives
carl = 0.021 / 0.126 = 14%
arnold = 0.105 / 0.126 = 83%
so it's far more likely this was arnold

## markov chains from articles

an article in our classification problem can be represented as a markov chain where the nodes are words and transistions between words exist for adjacent words in an article

eg the article 'the end is the beginning is the end' can be represented by the chain from this we can see that is always follows beginning and end follows the 2/3rds of the time

using this chain we can decide on the probabilities of other articles having been generated by the system that generated this one
eg which of 'the end is the end' and 'the beginning is the beginning' is more likely?

define p(word1 -> word2) as the probabiltiy of word2 following word1 in an article

p('the end is the end')
= p(the -> end) x p(end -> is) x p(is -> the) x p(the -> end)
= 2/3 x 1/2 x 1/1 x 2/3
= 2/9

p('the beginning is the beginning')
= p(the -> beginning) x p(beginning -> is) x p(is -> the) x p(the -> beginning)
= 1/3 x 1/1 x 1/1 x 1/3
= 1/9

so 'the end is the end' is twice as likely as 'the beginning is the beginning'

multiple articles can be used to build a markov chain representing a particular feed
eg the chain for the two articles 'the end is the beginning is the end' and 'the end of the road' is ## the zero probability problem, again

given our two article example we have clear transistion probabilities for p(the -> road) and p(end->of)
but what are the values for p(the -> is), p(blah -> is) or even p(blah -> foo)?

strictly these are all zero probabilties but we've seen previously how zeros clobber everything so we should assign these as 0/N and apply a laplace estimator
but the question is; what N to use?

for p(the -> is) 0/3 might be sensible since there are 3 edges in total leaving 'the'

p(blah -> is) is a bit harder though since there are no edges leaving 'blah'
there are edges entering 'is' but the probabilities are driven by p(a -> b) not p(b <- a)

p(blah -> foo) is harder still since we have no info either way on nodes 'blah' and 'foo'

we'll start with the simplest of all and just assign all these cases to the low sensible probability of 0 / total_number_of_edges and see how we go

## the first and last words of an article

special consideration can also be made to the frist and last words of an article when building a chain
we can include special nodes for START and END in a chain and then
for articles that start with 'word' we can include an edge from START -> 'word'
for articles that end with 'word' we can include an edge from 'word' -> END

revisiting the chain for two articles 'the end is the beginning is the end' and 'end of the road' we have which of 'the end of road end' and 'beginning of the road' is more likely?

= p(START -> the) x p(the -> end) x p(end -> of) x p(of -> road) x p(road -> end) x p(end -> END)
= 1/2 x 3/5 x 1/3 x 0/15 x 0/15 x 1/3
= 15/30 x 18/30 x 10/30 x 0/30 x 0/30 x 10/30
= 16/36 x 19/36 x 11/36 x 1/36 x 1/36 x 11/36 (with estimator)
= 0.000017

= p(start -> beginning) x p(beginning -> of) x p(of -> the) x p(the -> road) x p(road -> END)
= 0/15 x 0/15 x 1/1 x 1/5 x 1/1
= 0/15 x 0/15 x 15/15 x 3/15 x 15/15
= 1/20 x 1/20 x 16/20 x 4/20 x 16/20 (with estimator)
= 0.00032

after normalisation gives
p('the end of road end') = 0.000017 / 0.000337 = 5%
p('beginning of the road') = 0.00032 / 0.000337 = 95%
so 'beginning of the road' is a more probable match to these two articles

## digression: underflow and the magic of logarithms

notice in the last example how our prenormalised probabilities are getting to be pretty small numbers
it's not going to take long before the numbers are less than a machine can accurately handle
what can we do about it? logarithms save the day!

consider the two probabilities products we had from our first carl vs arnold example
carl was 0.7 x 0.3 x 0.1, arnold was 0.3 x 0.7 x 0.5

it turns out in this case, and in fact all the cases we've considered to date,
we're not interested in the actual values of the probabiltity sums
just whether one is greater than another

we're trying to decide if
0.7 x 0.3 x 0.1 > 0.3 x 0.7 x 0.5

recall from high school maths the following two facts about logarithms
1) they preserve order; ie log(a) > log(b) if a > b
2) they reduce multiplication to addition; ie log(ab) = log(a) + log(b)

so 0.7 x 0.3 x 0.1 > 0.3 x 0.7 x 0.5
is equivalent to log(0.7 x 0.3 x 0.1) > log(0.3 x 0.7 x 0.5) (from 1)
is equivalent to log(0.7) + log(0.3) + log(0.1) > log(0.3) + log(0.7) + log(0.5) (from 2)

bye bye multiplication of really small numbers, hellooooo high precision comparisons :)

## markov chains for our article classification problem

so coming back to our classification problem we can build two chains; one for articles to read and one for articles to ignore
each new article can be compared to each chain to see which is more likely to be applicable
how does it compare to our previous techniques?

but first does it do better including or excluding the start and end probabilities?
turns out it's slightly better when we include these probabilities

and how does it do against our other classifier methods?
pretty poorly actually...
and it's interesting how wildly varying the markov chain results are  