let's pretend we want to classify the new rss articles we receive each day to decide if we want to bother reading them. in a normal naive bayes system, which i've discussed in a previous article, we'd train a classifier with some pre-collected articles that we have taken the time to label as either to-read or to-ignore.
for example, say we have collected a huge historical corpus of six articles (taken from the register and perez hilton). to make full use of the data we would need to label all six examples, but due to time constraints we have only managed to label four (realistically, though, the number of unlabelled examples left over would be much greater than the number of labelled examples).
feed | doc | article text | class |
the register | 1 | new linux goes fast | read |
perez hilton | 2 | hollywood celebrity goes fast | ignore |
the register | 3 | quake runs on linux | read |
perez hilton | 4 | quake occurs in hollywood | ignore |
the register | 5 | linux website | ? |
perez hilton | 6 | hollywood fashion website | ? |
a new article then comes along...
feed | article text | class |
perez hilton | movie on fashion | ?? |
to which class should we assign this article?
in a traditional naive bayes we can train only using the four labelled examples and have no way of making use of the two unlabelled documents.
for example, to classify the 'movie on fashion' article above we first build a term occurrence matrix for the four training examples...
class  | doc# | new | linux | goes | fast | hollywood | celebrity | quake | runs | on | occurs | in
read   |  1   |  1  |   1   |  1   |  1   |           |           |       |      |    |        |
ignore |  2   |     |       |  1   |  1   |     1     |     1     |       |      |    |        |
read   |  3   |     |   1   |      |      |           |           |   1   |  1   | 1  |        |
ignore |  4   |     |       |      |      |     1     |           |   1   |      |    |   1    | 1
and then estimate both P(read | 'movie on fashion') and P(ignore | 'movie on fashion')
P(read | 'movie on fashion')
  ~= P('movie on fashion' | read) * P(read)
  ~= P('movie' | read) * P('on' | read) * P('fashion' | read) * P(read)    # naive bayes assumption
  ~= 0/2 * 1/2 * 0/2 * P(read)
  ~= 1/5 * 2/5 * 1/5 * P(read)                                             # additive smoothing
  ~= 2/125 * P(read)
  ~= 2/125 * 1/2
  ~= 2/250

P(ignore | 'movie on fashion')
  ~= P('movie on fashion' | ignore) * P(ignore)
  ~= P('movie' | ignore) * P('on' | ignore) * P('fashion' | ignore) * P(ignore)    # naive bayes assumption
  ~= 0/2 * 0/2 * 0/2 * P(ignore)
  ~= 1/5 * 1/5 * 1/5 * P(ignore)                                                   # additive smoothing
  ~= 1/125 * P(ignore)
  ~= 1/125 * 1/2
  ~= 1/250
as the estimated P(read) > P(ignore) we would class this as an article to-read.
diving into this a little bit we can see this prediction is made entirely based on the presence of the term 'on' in a previous to-read article. the terms 'movie' and 'fashion', having never been seen in the training set, were not part of the decision...
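to make the arithmetic above concrete, here's a minimal python sketch of the same calculation. it counts, per class, how many documents each term appears in and applies the same smoothing as the worked example (add 1 to each count and add the number of query terms to the denominator). the code and names are my own illustration, not from any particular library.

from collections import defaultdict

# the four labelled training documents from the table above
labelled = [
    ("new linux goes fast".split(),           "read"),
    ("hollywood celebrity goes fast".split(), "ignore"),
    ("quake runs on linux".split(),           "read"),
    ("quake occurs in hollywood".split(),     "ignore"),
]

def train(docs):
    # per-class document counts, and per-class document frequency of each term
    n_docs = defaultdict(float)
    df = defaultdict(lambda: defaultdict(float))
    for tokens, cls in docs:
        n_docs[cls] += 1
        for term in set(tokens):
            df[cls][term] += 1
    return n_docs, df

def classify(tokens, n_docs, df):
    total = sum(n_docs.values())
    scores = {}
    for cls in n_docs:
        score = n_docs[cls] / total                    # the class prior, P(class)
        for term in tokens:
            # additive smoothing as in the worked example:
            # +1 on the count, +len(tokens) on the denominator
            score *= (df[cls][term] + 1) / (n_docs[cls] + len(tokens))
        scores[cls] = score
    z = sum(scores.values())
    return {cls: s / z for cls, s in scores.items()}   # normalised distribution

n_docs, df = train(labelled)
print(classify("movie on fashion".split(), n_docs, df))
# {'read': 0.666..., 'ignore': 0.333...}  i.e. 2/250 vs 1/250 before normalising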
so how does a semi supervised version of naive bayes make use of the unlabelled documents?
let's look at the same corpus as before, with four labelled and two unlabelled examples. this time, though, instead of recording the class directly we'll record a probability distribution over the two classes...
feed | doc | article text | P(read) | P(ignore) |
the register | 1 | new linux goes fast | 1.00 | 0.00 |
perez hilton | 2 | hollywood celebrity goes fast | 0.00 | 1.00 |
the register | 3 | quake runs on linux | 1.00 | 0.00 |
perez hilton | 4 | quake occurs in hollywood | 0.00 | 1.00 |
the register | 5 | linux website | ? | ? |
perez hilton | 6 | hollywood fashion website | ? | ? |
first we classify documents 5 and 6 using labelled docs 1 thru 4.
P(read | 'linux website')
  ~= P('linux website' | read) * P(read)
  ~= P('linux' | read) * P('website' | read) * P(read)    # naive bayes assumption
  ~= 2/2 * 0/2 * P(read)
  ~= 3/4 * 1/4 * P(read)                                  # additive smoothing
  ~= 3/16 * P(read)
  ~= 3/16 * 1/2
  ~= 3/32

P(ignore | 'linux website')
  ~= P('linux website' | ignore) * P(ignore)
  ~= P('linux' | ignore) * P('website' | ignore) * P(ignore)    # naive bayes assumption
  ~= 0/2 * 0/2 * P(ignore)
  ~= 1/4 * 1/4 * P(ignore)                                      # additive smoothing
  ~= 1/16 * P(ignore)
  ~= 1/16 * 1/2
  ~= 1/32
normalising these estimates gives...
P(read | 'linux website')   ~= 3/32 ~= 0.75
P(ignore | 'linux website') ~= 1/32 ~= 0.25

usually we would classify this as a to-read article since 75% > 25%, but for the semi supervised case we want to keep this distribution.
similarly for 'hollywood fashion website'...
P(read | 'hollywood fashion website')
  ~= P('hollywood' | read) * P('fashion' | read) * P('website' | read) * P(read)
  ~= 0/2 * 0/2 * 0/2 * P(read)
  ~= 1/5 * 1/5 * 1/5 * P(read)
  ~= 1/125 * P(read)
  ~= 1/125 * 1/2
  ~= 1/250

P(ignore | 'hollywood fashion website')
  ~= P('hollywood' | ignore) * P('fashion' | ignore) * P('website' | ignore) * P(ignore)
  ~= 2/2 * 0/2 * 0/2 * P(ignore)
  ~= 3/5 * 1/5 * 1/5 * P(ignore)
  ~= 3/125 * P(ignore)
  ~= 3/125 * 1/2
  ~= 3/250
normalising these estimates gives...
P(read | 'hollywood fashion website')   ~= 1/250 ~= 0.25
P(ignore | 'hollywood fashion website') ~= 3/250 ~= 0.75
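in code, this soft-labelling step is just the classify function from the earlier sketch applied to the two unlabelled documents, keeping the whole distribution rather than taking the most likely class:

# soft-label the unlabelled docs using the classifier trained on docs 1-4
unlabelled = ["linux website".split(), "hollywood fashion website".split()]
soft = [classify(tokens, n_docs, df) for tokens in unlabelled]
# soft ~ [{'read': 0.75, 'ignore': 0.25}, {'read': 0.25, 'ignore': 0.75}]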
we can now insert these predicted probabilities into our table
feed | doc | article text | P(read) | P(ignore) |
the register | 1 | new linux goes fast | 1.00 | 0.00 |
perez hilton | 2 | hollywood celebrity goes fast | 0.00 | 1.00 |
the register | 3 | quake runs on linux | 1.00 | 0.00 |
perez hilton | 4 | quake occurs in hollywood | 0.00 | 1.00 |
the register | 5 | linux website | 0.75 | 0.25 |
perez hilton | 6 | hollywood fashion website | 0.25 | 0.75 |
and given these new values we can now reclassify documents 5 and 6 using a classifier trained with all six documents
read | ignore | doc# | new | linux | goes | fast | hollywood | celebrity | quake | runs | on | occurs | in | website | fashion
1.00 |  0.00  |  1   |  1  |   1   |  1   |  1   |           |           |       |      |    |        |    |         |
0.00 |  1.00  |  2   |     |       |  1   |  1   |     1     |     1     |       |      |    |        |    |         |
1.00 |  0.00  |  3   |     |   1   |      |      |           |           |   1   |  1   | 1  |        |    |         |
0.00 |  1.00  |  4   |     |       |      |      |     1     |           |   1   |      |    |   1    | 1  |         |
0.75 |  0.25  |  5   |     |   1   |      |      |           |           |       |      |    |        |    |    1    |
0.25 |  0.75  |  6   |     |       |      |      |     1     |           |       |      |    |        |    |    1    |    1
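training with these fractional labels just means letting each document add its class probabilities, rather than a hard 0 or 1, to the counts. a sketch, continuing the hypothetical helpers from the earlier snippets:

def as_dist(cls):
    # a hard label expressed as a probability distribution over the two classes
    return {"read": 1.0 if cls == "read" else 0.0,
            "ignore": 0.0 if cls == "read" else 1.0}

def train_soft(docs):
    # docs: list of (tokens, {class: probability}) pairs.
    # each document contributes its probability mass to each class, so doc 5
    # adds 0.75 of a document count to 'read' and 0.25 to 'ignore'
    n_docs = defaultdict(float)
    df = defaultdict(lambda: defaultdict(float))
    for tokens, dist in docs:
        for cls, p in dist.items():
            n_docs[cls] += p
            for term in set(tokens):
                df[cls][term] += p
    return n_docs, df

soft_docs = [(tokens, as_dist(cls)) for tokens, cls in labelled] + list(zip(unlabelled, soft))
n_docs, df = train_soft(soft_docs)
print(n_docs["read"], df["read"]["linux"])   # 3.0 2.75, matching the matrix above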
P(read | 'linux website')
  ~= P('linux website' | read) * P(read)
  ~= P('linux' | read) * P('website' | read) * P(read)
  ~= 2.75/3 * 1/3 * P(read)
  ~= 2.75/9 * P(read)
  ~= 2.75/9 * 3/6
  ~= 0.152

P(ignore | 'linux website')
  ~= P('linux website' | ignore) * P(ignore)
  ~= P('linux' | ignore) * P('website' | ignore) * P(ignore)
  ~= 0.25/3 * 1/3 * P(ignore)
  ~= 0.25/9 * P(ignore)
  ~= 0.25/9 * 3/6
  ~= 0.014

normalising these estimates gives...

P(read | 'linux website')   ~= 0.152 ~= 0.92
P(ignore | 'linux website') ~= 0.014 ~= 0.08
P(read | 'hollywood fashion website')
  ~= P('hollywood fashion website' | read) * P(read)
  ~= P('hollywood' | read) * P('fashion' | read) * P('website' | read) * P(read)
  ~= 0.25/3 * 0.25/3 * 1/3 * P(read)
  ~= 0.0023 * 3/6
  ~= 0.001

P(ignore | 'hollywood fashion website')
  ~= P('hollywood fashion website' | ignore) * P(ignore)
  ~= P('hollywood' | ignore) * P('fashion' | ignore) * P('website' | ignore) * P(ignore)
  ~= 2.75/3 * 0.75/3 * 1/3 * P(ignore)
  ~= 0.076 * 3/6
  ~= 0.038

normalising these estimates gives...

P(read | 'hollywood fashion website')   ~= 0.001 ~= 0.03
P(ignore | 'hollywood fashion website') ~= 0.038 ~= 0.97
which gives us new values for docs 5 and 6
feed | doc | article text | P(read) | P(ignore) |
the register | 1 | new linux goes fast | 1.00 | 0.00 |
perez hilton | 2 | hollywood celebrity goes fast | 0.00 | 1.00 |
the register | 3 | quake runs on linux | 1.00 | 0.00 |
perez hilton | 4 | quake occurs in hollywood | 0.00 | 1.00 |
the register | 5 | linux website | 0.92 | 0.08 |
perez hilton | 6 | hollywood fashion website | 0.03 | 0.97 |
this process of retraining and reclassifying the initially unlabelled articles can be repeated until the probability distribution converges. (note: in this case it looks like the distribution is moving towards a clear-cut to-read or to-ignore decision, but it's not always like that)
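wired together, the whole procedure is a small expectation-maximisation style loop: train on whatever labels we have, re-estimate the distributions for the unlabelled documents, retrain with the new fractional counts, and stop when the distributions stop moving. a sketch using the hypothetical helpers above (it keeps the smoothing term in every round, so its numbers differ slightly from the hand calculation, but the shape of the loop is the same):

def self_train(labelled, unlabelled, max_iters=50, tol=1e-4):
    hard = [(tokens, as_dist(cls)) for tokens, cls in labelled]
    soft = None
    for _ in range(max_iters):
        docs = hard + (list(zip(unlabelled, soft)) if soft else [])
        n_docs, df = train_soft(docs)                              # retrain with current labels
        new_soft = [classify(t, n_docs, df) for t in unlabelled]   # re-estimate the distributions
        if soft and all(abs(a[c] - b[c]) < tol
                        for a, b in zip(soft, new_soft) for c in a):
            break                                                  # distributions have converged
        soft = new_soft
    return n_docs, df, soft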
once the probabilities have converged we can classify our new document 'movie on fashion' using the classifier trained with all 6 examples.
P(read | 'movie on fashion')
  ~= P('movie on fashion' | read) * P(read)
  ~= P('movie' | read) * P('on' | read) * P('fashion' | read) * P(read)
  ~= 0/2.95 * 1/2.95 * 0.03/2.95 * P(read)
  ~= 1/5.95 * 2/5.95 * 1.03/5.95 * P(read)
  ~= 2.06/210.6 * 2.95/6
  ~= 0.0048

P(ignore | 'movie on fashion')
  ~= P('movie on fashion' | ignore) * P(ignore)
  ~= P('movie' | ignore) * P('on' | ignore) * P('fashion' | ignore) * P(ignore)
  ~= 0/3.05 * 0/3.05 * 0.97/3.05 * P(ignore)
  ~= 1/6.05 * 1/6.05 * 1.97/6.05 * P(ignore)
  ~= 1.97/221.4 * 3.05/6
  ~= 0.0045

normalising these estimates gives...

P(read | 'movie on fashion')   ~= 0.0048 ~= 0.52
P(ignore | 'movie on fashion') ~= 0.0045 ~= 0.48
recall that previously this was being classified by the normal naive bayes classifier as P(read) ~= 0.67 and P(ignore) ~= 0.33
we can see now how the inclusion of 'fashion', as related to 'hollywood' through doc 6, has increased the to-ignore probability of this example.
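the same shift shows up if you run the sketch above end to end (the exact numbers differ a little because it keeps the smoothing term in every round):

n_docs, df, soft = self_train(labelled, unlabelled)
print(classify("movie on fashion".split(), n_docs, df))
# the 'ignore' probability comes out noticeably higher than the 1/3 the plain
# supervised classifier gave, because 'fashion' now carries weight via doc 6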
so this semi supervised algorithm looks promising on paper but the real question is does it do any better on real data?
february two thousand and ten