let's pretend we want to classify the new rss articles we receive each day to decide if we want to bother reading them. in a normal naive bayes system, which i've discussed in a previous article, we'd train a classifier with some pre-collected articles that we have taken the time to label as either to-read or to-ignore.
for example, say we have collected a huge historical corpus of six articles (taken from the register and perez hilton). to make full use of the data we would need to label all six examples, but due to time constraints we have only managed to label four (realistically, though, the number of unlabelled examples left over would be much greater than the number of labelled examples).
feed | doc | article text | class |
the register | 1 | new linux goes fast | read |
perez hilton | 2 | hollywood celebrity goes fast | ignore |
the register | 3 | quake runs on linux | read |
perez hilton | 4 | quake occurs in hollywood | ignore |
the register | 5 | linux website | ? |
perez hilton | 6 | hollywood fashion website | ? |
a new article then comes along...
feed | article text | class |
perez hilton | movie on fashion | ?? |
to which class should we assign this article?
in a traditional naive bayes we can train only using the four labelled examples and have no way of making use of the two unlabelled documents.
for example, to classify the 'movie on fashion' article above we first build a term occurrence matrix for the four training examples...
class  | doc# | new | linux | goes | fast | hollywood | celebrity | quake | runs | on | occurs | in
read   |  1   |  1  |   1   |  1   |  1   |           |           |       |      |    |        |
ignore |  2   |     |       |  1   |  1   |     1     |     1     |       |      |    |        |
read   |  3   |     |   1   |      |      |           |           |   1   |  1   | 1  |        |
ignore |  4   |     |       |      |      |     1     |           |   1   |      |    |   1    | 1
and then estimate both P(read | 'movie on fashion') and P(ignore | 'movie on fashion')
P(read | 'movie on fashion')
  ~= P('movie on fashion' | read) * P(read)
  ~= P('movie' | read) * P('on' | read) * P('fashion' | read) * P(read)    # naive bayes assumption
  ~= 0/2 * 1/2 * 0/2 * P(read)
  ~= 1/5 * 2/5 * 1/5 * P(read)                                             # additive smoothing
  ~= 2/125 * P(read)
  ~= 2/125 * 1/2
  ~= 2/250

P(ignore | 'movie on fashion')
  ~= P('movie on fashion' | ignore) * P(ignore)
  ~= P('movie' | ignore) * P('on' | ignore) * P('fashion' | ignore) * P(ignore)    # naive bayes assumption
  ~= 0/2 * 0/2 * 0/2 * P(ignore)
  ~= 1/5 * 1/5 * 1/5 * P(ignore)                                                   # additive smoothing
  ~= 1/125 * P(ignore)
  ~= 1/125 * 1/2
  ~= 1/250
as the estimated P(read) > P(ignore) we would class this as an article to-read.
diving into this a little bit we can see this prediction is made entirely based on the presence of the term 'on' in a previous to-read article. the terms 'movie' and 'fashion', having never been seen in the training set, were not part of the decision...
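to make the arithmetic above concrete, here's a minimal python sketch of the same calculation. it counts, per class, how many documents each term appears in and applies the same smoothing as the worked example (add 1 to each count and add the number of query terms to the denominator). the code and names are my own illustration, not from any particular library.

from collections import defaultdict

# the four labelled training documents from the table above
labelled = [
    ("new linux goes fast".split(),           "read"),
    ("hollywood celebrity goes fast".split(), "ignore"),
    ("quake runs on linux".split(),           "read"),
    ("quake occurs in hollywood".split(),     "ignore"),
]

def train(docs):
    # per-class document counts, and per-class document frequency of each term
    n_docs = defaultdict(float)
    df = defaultdict(lambda: defaultdict(float))
    for tokens, cls in docs:
        n_docs[cls] += 1
        for term in set(tokens):
            df[cls][term] += 1
    return n_docs, df

def classify(tokens, n_docs, df):
    total = sum(n_docs.values())
    scores = {}
    for cls in n_docs:
        score = n_docs[cls] / total                    # the class prior, P(class)
        for term in tokens:
            # additive smoothing as in the worked example:
            # +1 on the count, +len(tokens) on the denominator
            score *= (df[cls][term] + 1) / (n_docs[cls] + len(tokens))
        scores[cls] = score
    z = sum(scores.values())
    return {cls: s / z for cls, s in scores.items()}   # normalised distribution

n_docs, df = train(labelled)
print(classify("movie on fashion".split(), n_docs, df))
# {'read': 0.666..., 'ignore': 0.333...}  i.e. 2/250 vs 1/250 before normalising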
so how does a semi supervised version of naive bayes make use of the unlabelled documents?
let's look at the same corpus as before, with four labelled and two unlabelled examples. this time, though, instead of recording the class directly we'll record a probability distribution over the two classes...
feed | doc | article text | P(read) | P(ignore) |
the register | 1 | new linux goes fast | 1.00 | 0.00 |
perez hilton | 2 | hollywood celebrity goes fast | 0.00 | 1.00 |
the register | 3 | quake runs on linux | 1.00 | 0.00 |
perez hilton | 4 | quake occurs in hollywood | 0.00 | 1.00 |
the register | 5 | linux website | ? | ? |
perez hilton | 6 | hollywood fashion website | ? | ? |
first we classify documents 5 and 6 using labelled docs 1 thru 4.
P(read | 'linux website')
  ~= P('linux website' | read) * P(read)
  ~= P('linux' | read) * P('website' | read) * P(read)    # naive bayes assumption
  ~= 2/2 * 0/2 * P(read)
  ~= 3/4 * 1/4 * P(read)                                  # additive smoothing
  ~= 3/16 * P(read)
  ~= 3/16 * 1/2
  ~= 3/32

P(ignore | 'linux website')
  ~= P('linux website' | ignore) * P(ignore)
  ~= P('linux' | ignore) * P('website' | ignore) * P(ignore)    # naive bayes assumption
  ~= 0/2 * 0/2 * P(ignore)
  ~= 1/4 * 1/4 * P(ignore)                                      # additive smoothing
  ~= 1/16 * P(ignore)
  ~= 1/16 * 1/2
  ~= 1/32
normalising these estimates gives...
P(read | 'linux website')   ~= 3/32 ~= 0.75
P(ignore | 'linux website') ~= 1/32 ~= 0.25

usually we would classify this as a to-read article since 75% > 25%, but for the semi supervised case we want to keep this distribution.
similarly for 'hollywood fashion website'...
P(read | 'hollywood fashion website')
  ~= P('hollywood' | read) * P('fashion' | read) * P('website' | read) * P(read)
  ~= 0/2 * 0/2 * 0/2 * P(read)
  ~= 1/5 * 1/5 * 1/5 * P(read)
  ~= 1/125 * P(read)
  ~= 1/125 * 1/2
  ~= 1/250

P(ignore | 'hollywood fashion website')
  ~= P('hollywood' | ignore) * P('fashion' | ignore) * P('website' | ignore) * P(ignore)
  ~= 2/2 * 0/2 * 0/2 * P(ignore)
  ~= 3/5 * 1/5 * 1/5 * P(ignore)
  ~= 3/125 * P(ignore)
  ~= 3/125 * 1/2
  ~= 3/250
normalising these estimates gives...
P(read | 'hollywood fashion website')   ~= 1/250 ~= 0.25
P(ignore | 'hollywood fashion website') ~= 3/250 ~= 0.75
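in code, this soft-labelling step is just the classify function from the earlier sketch applied to the two unlabelled documents, keeping the whole distribution rather than taking the most likely class:

# soft-label the unlabelled docs using the classifier trained on docs 1-4
unlabelled = ["linux website".split(), "hollywood fashion website".split()]
soft = [classify(tokens, n_docs, df) for tokens in unlabelled]
# soft ~ [{'read': 0.75, 'ignore': 0.25}, {'read': 0.25, 'ignore': 0.75}]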
we can now insert these predicted probabilities into our table
feed | doc | article text | P(read) | P(ignore) |
the register | 1 | new linux goes fast | 1.00 | 0.00 |
perez hilton | 2 | hollywood celebrity goes fast | 0.00 | 1.00 |
the register | 3 | quake runs on linux | 1.00 | 0.00 |
perez hilton | 4 | quake occurs in hollywood | 0.00 | 1.00 |
the register | 5 | linux website | 0.75 | 0.25 |
perez hilton | 6 | hollywood fashion website | 0.25 | 0.75 |
and given these new values we can now reclassify documents 5 and 6 using a classifier trained with all six documents
read | ignore | doc# | new | linux | goes | fast | hollywood | celebrity | quake | runs | on | occurs | in | website | fashion
1.00 |  0.00  |  1   |  1  |   1   |  1   |  1   |           |           |       |      |    |        |    |         |
0.00 |  1.00  |  2   |     |       |  1   |  1   |     1     |     1     |       |      |    |        |    |         |
1.00 |  0.00  |  3   |     |   1   |      |      |           |           |   1   |  1   | 1  |        |    |         |
0.00 |  1.00  |  4   |     |       |      |      |     1     |           |   1   |      |    |   1    | 1  |         |
0.75 |  0.25  |  5   |     |   1   |      |      |           |           |       |      |    |        |    |    1    |
0.25 |  0.75  |  6   |     |       |      |      |     1     |           |       |      |    |        |    |    1    |    1
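training with these fractional labels just means letting each document add its class probabilities, rather than a hard 0 or 1, to the counts. a sketch, continuing the hypothetical helpers from the earlier snippets:

def as_dist(cls):
    # a hard label expressed as a probability distribution over the two classes
    return {"read": 1.0 if cls == "read" else 0.0,
            "ignore": 0.0 if cls == "read" else 1.0}

def train_soft(docs):
    # docs: list of (tokens, {class: probability}) pairs.
    # each document contributes its probability mass to each class, so doc 5
    # adds 0.75 of a document count to 'read' and 0.25 to 'ignore'
    n_docs = defaultdict(float)
    df = defaultdict(lambda: defaultdict(float))
    for tokens, dist in docs:
        for cls, p in dist.items():
            n_docs[cls] += p
            for term in set(tokens):
                df[cls][term] += p
    return n_docs, df

soft_docs = [(tokens, as_dist(cls)) for tokens, cls in labelled] + list(zip(unlabelled, soft))
n_docs, df = train_soft(soft_docs)
print(n_docs["read"], df["read"]["linux"])   # 3.0 2.75, matching the matrix above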
P(read | 'linux website')
  ~= P('linux website' | read) * P(read)
  ~= P('linux' | read) * P('website' | read) * P(read)
  ~= 2.75/3 * 1/3 * P(read)
  ~= 2.75/9 * P(read)
  ~= 2.75/9 * 3/6
  ~= 0.152

P(ignore | 'linux website')
  ~= P('linux website' | ignore) * P(ignore)
  ~= P('linux' | ignore) * P('website' | ignore) * P(ignore)
  ~= 0.25/3 * 1/3 * P(ignore)
  ~= 0.25/9 * P(ignore)
  ~= 0.25/9 * 3/6
  ~= 0.014

normalising these estimates gives...

P(read | 'linux website')   ~= 0.152 ~= 0.92
P(ignore | 'linux website') ~= 0.014 ~= 0.08
P(read | 'hollywood fashion website')
  ~= P('hollywood fashion website' | read) * P(read)
  ~= P('hollywood' | read) * P('fashion' | read) * P('website' | read) * P(read)
  ~= 0.25/3 * 0.25/3 * 1/3 * P(read)
  ~= 0.0023 * 3/6
  ~= 0.001

P(ignore | 'hollywood fashion website')
  ~= P('hollywood fashion website' | ignore) * P(ignore)
  ~= P('hollywood' | ignore) * P('fashion' | ignore) * P('website' | ignore) * P(ignore)
  ~= 2.75/3 * 0.75/3 * 1/3 * P(ignore)
  ~= 0.076 * 3/6
  ~= 0.038

normalising these estimates gives...

P(read | 'hollywood fashion website')   ~= 0.001 ~= 0.03
P(ignore | 'hollywood fashion website') ~= 0.038 ~= 0.97
which gives us new values for docs 5 and 6
feed | doc | article text | P(read) | P(ignore) |
the register | 1 | new linux goes fast | 1.00 | 0.00 |
perez hilton | 2 | hollywood celebrity goes fast | 0.00 | 1.00 |
the register | 3 | quake runs on linux | 1.00 | 0.00 |
perez hilton | 4 | quake occurs in hollywood | 0.00 | 1.00 |
the register | 5 | linux website | 0.92 | 0.08 |
perez hilton | 6 | hollywood fashion website | 0.03 | 0.97 |
this process of retraining and reclassifying the initially unlabelled articles can be repeated until the probability distribution converges. (note: in this case it looks like the distribution is moving towards a clear-cut to-read or to-ignore decision, but it's not always like that)
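wired together, the whole procedure is a small expectation-maximisation style loop: train on whatever labels we have, re-estimate the distributions for the unlabelled documents, retrain with the new fractional counts, and stop when the distributions stop moving. a sketch using the hypothetical helpers above (it keeps the smoothing term in every round, so its numbers differ slightly from the hand calculation, but the shape of the loop is the same):

def self_train(labelled, unlabelled, max_iters=50, tol=1e-4):
    hard = [(tokens, as_dist(cls)) for tokens, cls in labelled]
    soft = None
    for _ in range(max_iters):
        docs = hard + (list(zip(unlabelled, soft)) if soft else [])
        n_docs, df = train_soft(docs)                              # retrain with current labels
        new_soft = [classify(t, n_docs, df) for t in unlabelled]   # re-estimate the distributions
        if soft and all(abs(a[c] - b[c]) < tol
                        for a, b in zip(soft, new_soft) for c in a):
            break                                                  # distributions have converged
        soft = new_soft
    return n_docs, df, soft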
once the probabilities have converged we can classify our new document 'movie on fashion' using the classifier trained with all 6 examples.
P(read | 'movie on fashion')
  ~= P('movie on fashion' | read) * P(read)
  ~= P('movie' | read) * P('on' | read) * P('fashion' | read) * P(read)
  ~= 0/2.95 * 1/2.95 * 0.03/2.95 * P(read)
  ~= 1/5.95 * 2/5.95 * 1.03/5.95 * P(read)
  ~= 2.06/210.6 * 2.95/6
  ~= 0.0048

P(ignore | 'movie on fashion')
  ~= P('movie on fashion' | ignore) * P(ignore)
  ~= P('movie' | ignore) * P('on' | ignore) * P('fashion' | ignore) * P(ignore)
  ~= 0/3.05 * 0/3.05 * 0.97/3.05 * P(ignore)
  ~= 1/6.05 * 1/6.05 * 1.97/6.05 * P(ignore)
  ~= 1.97/221.4 * 3.05/6
  ~= 0.0045

normalising these estimates gives...

P(read | 'movie on fashion')   ~= 0.0048 ~= 0.52
P(ignore | 'movie on fashion') ~= 0.0045 ~= 0.48
recall that previously this was being classified by the normal naive bayes classifier as P(read) ~= 0.67 and P(ignore) ~= 0.33
we can see now how the inclusion of 'fashion', as related to 'hollywood' through doc 6, has increased the to-ignore probability of this example.
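the same shift shows up if you run the sketch above end to end (the exact numbers differ a little because it keeps the smoothing term in every round):

n_docs, df, soft = self_train(labelled, unlabelled)
print(classify("movie on fashion".split(), n_docs, df))
# the 'ignore' probability comes out noticeably higher than the 1/3 the plain
# supervised classifier gave, because 'fashion' now carries weight via doc 6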
so this semi supervised algorithm looks promising on paper but the real question is does it do any better on real data?
february two thousand and ten