there are a few problems with the last approach that prevent us from using it on anything but the smallest data sets. they're mainly around trying to calculate an accurate probability distribution for the unlabelled data instead of just choosing the most likely class. to be honest these are problems with my implementation, not fundamental problems with the algorithm.
so to make things easier for myself i'm going to reimplement with some changes...
let's redo the example from our last experiment over the same 6 documents with our above changes...
| feed | doc | article text | class |
|---|---|---|---|
| the register | 1 | new linux goes fast | read |
| perez hilton | 2 | hollywood celebrity goes fast | ignore |
| the register | 3 | quake runs on linux | read |
| perez hilton | 4 | quake occurs in hollywood | ignore |
| the register | 5 | linux website | ? |
| perez hilton | 6 | hollywood fashion website | ? |
a new article then comes along...
| feed | article text | class |
|---|---|---|
| perez hilton | movie on fashion | ?? |
first build a term occurrence matrix for the four training examples...
| class | doc# | new | linux | goes | fast | hollywood | celebrity | quake | runs | on | occurs | in |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| read | 1 | 1 | 1 | 1 | 1 | | | | | | | |
| ignore | 2 | | | 1 | 1 | 1 | 1 | | | | | |
| read | 3 | | 1 | | | | | 1 | 1 | 1 | | |
| ignore | 4 | | | | | 1 | | 1 | | | 1 | 1 |
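to make the bookkeeping concrete, here's a rough python sketch of this step. the names (`labelled_docs`, `occurrences`) and the use of python are just my choices for illustration, not the actual implementation.

```python
from collections import Counter

# the four labelled training documents from the table above, as (text, class) pairs
labelled_docs = [
    ("new linux goes fast", "read"),
    ("hollywood celebrity goes fast", "ignore"),
    ("quake runs on linux", "read"),
    ("quake occurs in hollywood", "ignore"),
]

# one row of the occurrence matrix per document, keyed by (class, doc number)
occurrences = {}
for doc_num, (text, klass) in enumerate(labelled_docs, start=1):
    occurrences[(klass, doc_num)] = Counter(text.split())

print(occurrences[("read", 1)])  # Counter({'new': 1, 'linux': 1, 'goes': 1, 'fast': 1})
```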
we then collapse into just counts for each class...
| class | #docs | new | linux | goes | fast | hollywood | celebrity | quake | runs | on | occurs | in | total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| read | 2 | 1 | 2 | 1 | 1 | | | 1 | 1 | 1 | | | 8 |
| ignore | 2 | | | 1 | 1 | 2 | 1 | 1 | | | 1 | 1 | 8 |
| total | 4 | 1 | 2 | 2 | 2 | 2 | 1 | 2 | 1 | 1 | 1 | 1 | 16 |
and to avoid zero probabilities we fill in zero entries with a small value, say 0.1
| class | #docs | new | linux | goes | fast | hollywood | celebrity | quake | runs | on | occurs | in | total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| read | 2 | 1 | 2 | 1 | 1 | 0.1 | 0.1 | 1 | 1 | 1 | 0.1 | 0.1 | 8.4 |
| ignore | 2 | 0.1 | 0.1 | 1 | 1 | 2 | 1 | 1 | 0.1 | 0.1 | 1 | 1 | 8.4 |
| total | 4 | 1.1 | 2.1 | 2 | 2 | 2.1 | 1.1 | 2 | 1.1 | 1.1 | 1.1 | 1.1 | 16.8 |
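continuing the sketch above, the collapse and smoothing could look something like this (again, illustrative only)...

```python
SMOOTH = 0.1  # small value used in place of zero counts

# collapse the per document counts into per class counts
vocab = set()
class_counts = {}        # class -> Counter of term counts
docs_per_class = Counter()
for (klass, _doc_num), term_counts in occurrences.items():
    docs_per_class[klass] += 1
    class_counts.setdefault(klass, Counter()).update(term_counts)
    vocab.update(term_counts)

# replace zero entries with the smoothing value and keep per class totals
smoothed = {
    klass: {term: counts[term] or SMOOTH for term in vocab}
    for klass, counts in class_counts.items()
}
totals = {klass: sum(term_counts.values()) for klass, term_counts in smoothed.items()}

print(docs_per_class)                                             # Counter({'read': 2, 'ignore': 2})
print({klass: round(total, 1) for klass, total in totals.items()})  # {'read': 8.4, 'ignore': 8.4}
```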
we can now calculate term probabilities conditional on the class; eg P('on'|read) = 1.0/8.4 since 'on' appears 1 time across the 8.4 words in the 'read' articles. (the 0.4 comes from the 4 x 0.1 smoothing values)
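in terms of the sketch above...

```python
def p_term_given_class(term, klass):
    # P(term | class) from the smoothed counts; a term never seen in training
    # just gets the smoothing value
    return smoothed[klass].get(term, SMOOTH) / totals[klass]

print(round(p_term_given_class("on", "read"), 3))  # 0.119, ie 1.0/8.4
```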
then to classify the new article, 'movie on fashion', we calculate which of P(read) and P(ignore) is greater.
    P(read | 'movie on fashion')
     = P(read | 'on')                                  # ignore unseen terms 'movie' and 'fashion'
     = P('on'|read) * P(read) / P('on')                # bayes law
     = 1! * (P('on'|read)^1)/1! * P(read) / P('on')    # assume multinomial distribution, see my previous experiment for more info
     = (P('on'|read)^1)/1! * P(read)                   # ignore constants
     = 1.0/8.4 * 2/4
     = 0.06

    P(ignore | 'movie on fashion')
     = P(ignore | 'on')
     = P('on'|ignore) * P(ignore) / P('on')
     = 1! * (P('on'|ignore)^1)/1! * P(ignore) / P('on')
     = (P('on'|ignore)^1)/1! * P(ignore)
     = 0.1/8.4 * 2/4
     = 0.006

    0.06 > 0.006 so classify 'movie on fashion' as 'read'
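as code, still just sketching, the scoring might look like the following. these are unnormalised scores since the shared P(text) denominator is dropped, which is fine because we only ever compare them.

```python
priors = {"read": 2 / 4, "ignore": 2 / 4}  # P(read) and P(ignore) over the four labelled docs

def score(text, klass):
    # unnormalised P(class | text): the class prior times the product of
    # P(term|class) over the terms we've seen in training. unseen terms and
    # the constant P(text) denominator are dropped, as in the working above
    result = priors[klass]
    for term in text.split():
        if term in vocab:
            result *= p_term_given_class(term, klass)
    return result

def classify(text):
    return max(priors, key=lambda klass: score(text, klass))

print(round(score("movie on fashion", "read"), 3))    # 0.06
print(round(score("movie on fashion", "ignore"), 3))  # 0.006
print(classify("movie on fashion"))                   # read
```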
this is not the best result; you'd think that 'movie' and 'fashion' would imply that i don't want to read this. but since they haven't been seen before the only word we've got to work with is 'on'
so how does a semi supervised version of naive bayes make use of the unlabelled documents?
let's look at the same corpus as before, with four labelled and two unlabelled examples.
| feed | doc | article text | class |
|---|---|---|---|
| the register | 1 | new linux goes fast | read |
| perez hilton | 2 | hollywood celebrity goes fast | ignore |
| the register | 3 | quake runs on linux | read |
| perez hilton | 4 | quake occurs in hollywood | ignore |
| the register | 5 | linux website | ? |
| perez hilton | 6 | hollywood fashion website | ? |
first we classify documents 5 and 6 using labelled docs 1 thru 4.
for 'linux website'...
    P(read | 'linux website')
     = P(read | 'linux')                                   # ignore unseen term 'website'
     = P('linux'|read) * P(read) / P('linux')              # bayes law
     = 1! * (P('linux'|read)^1)/1! * P(read) / P('linux')  # multinomial distribution
    ~= (P('linux'|read)^1)/1! * P(read)                    # ignore constants
     = P('linux'|read) * P(read)
     = 2.0/8.4 * 2/4
     = 0.119

    P(ignore | 'linux website')
    ~= P('linux'|ignore) * P(ignore)
     = 0.1/8.4 * 2/4
     = 0.006

    0.119 > 0.006 => 'linux website' is classed 'read'
and for 'hollywood fashion website'...
    P(read | 'hollywood fashion website')
     = P(read | 'hollywood')                # ignore unseen terms 'fashion' and 'website'
     = P('hollywood'|read) * P(read)
     = 0.1/8.4 * 2/4
     = 0.006

    P(ignore | 'hollywood fashion website')
     = P(ignore | 'hollywood')
     = P('hollywood'|ignore) * P(ignore)
     = 2.0/8.4 * 2/4
     = 0.119

    0.119 > 0.006 => 'hollywood fashion website' is classed 'ignore'
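the sketch code gives the same two answers...

```python
# classify the two unlabelled documents with the classifier trained on docs 1 to 4
for text in ["linux website", "hollywood fashion website"]:
    print(text, "->", classify(text))
# linux website -> read
# hollywood fashion website -> ignore
```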
we can now insert these predicted classes into our table
| feed | doc | article text | class |
|---|---|---|---|
| the register | 1 | new linux goes fast | read |
| perez hilton | 2 | hollywood celebrity goes fast | ignore |
| the register | 3 | quake runs on linux | read |
| perez hilton | 4 | quake occurs in hollywood | ignore |
| the register | 5 | linux website | read |
| perez hilton | 6 | hollywood fashion website | ignore |
rebuild the term occurrence matrix...
| class | doc# | new | linux | goes | fast | hollywood | celebrity | quake | runs | on | occurs | in | website | fashion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| read | 1 | 1 | 1 | 1 | 1 | | | | | | | | | |
| ignore | 2 | | | 1 | 1 | 1 | 1 | | | | | | | |
| read | 3 | | 1 | | | | | 1 | 1 | 1 | | | | |
| ignore | 4 | | | | | 1 | | 1 | | | 1 | 1 | | |
| read | 5 | | 1 | | | | | | | | | | 1 | |
| ignore | 6 | | | | | 1 | | | | | | | 1 | 1 |
and collapse based on class...
| class | #docs | new | linux | goes | fast | hollywood | celebrity | quake | runs | on | occurs | in | website | fashion | total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| read | 3 | 1 | 3 | 1 | 1 | | | 1 | 1 | 1 | | | 1 | | 10 |
| ignore | 3 | | | 1 | 1 | 3 | 1 | 1 | | | 1 | 1 | 1 | 1 | 11 |
| total | 6 | 1 | 3 | 2 | 2 | 3 | 1 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 21 |
and finally include smoothing values...
| class | #docs | new | linux | goes | fast | hollywood | celebrity | quake | runs | on | occurs | in | website | fashion | total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| read | 3 | 1 | 3 | 1 | 1 | 0.1 | 0.1 | 1 | 1 | 1 | 0.1 | 0.1 | 1 | 0.1 | 10.5 |
| ignore | 3 | 0.1 | 0.1 | 1 | 1 | 3 | 1 | 1 | 0.1 | 0.1 | 1 | 1 | 1 | 1 | 11.4 |
| total | 6 | 1.1 | 3.1 | 2 | 2 | 3.1 | 1.1 | 2 | 1.1 | 1.1 | 1.1 | 1.1 | 2 | 1.1 | 21.9 |
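in the sketch this retraining is just the same count building as before, run over all six documents. i've wrapped it up in a (made up) `train` helper so it can be reused below; it rewrites the same module level tables that `score` and `classify` read.

```python
def train(docs, smooth=SMOOTH):
    # rebuild vocab, smoothed class counts, totals and priors over the given
    # (text, class) pairs; same tables as above, just wrapped up for reuse
    global vocab, smoothed, totals, priors
    vocab, class_counts, docs_per_class = set(), {}, Counter()
    for text, klass in docs:
        docs_per_class[klass] += 1
        class_counts.setdefault(klass, Counter()).update(text.split())
        vocab.update(text.split())
    smoothed = {k: {t: c[t] or smooth for t in vocab} for k, c in class_counts.items()}
    totals = {k: sum(tc.values()) for k, tc in smoothed.items()}
    priors = {k: n / len(docs) for k, n in docs_per_class.items()}

# retrain over all six documents, using the predicted classes for docs 5 and 6
all_docs = labelled_docs + [("linux website", "read"), ("hollywood fashion website", "ignore")]
train(all_docs)
print({k: round(v, 1) for k, v in totals.items()})  # {'read': 10.5, 'ignore': 11.4}
```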
given all this we can now reclassify documents 5 and 6 using a classifier trained with all 6 documents. we do this since some documents may now end up with a different class to the one they were first given.
for doc 5, linux website...
    P(read | 'linux website')
     = P('linux website'|read) * P(read) / P('linux website')       # bayes law
     = (P('linux'|read)^1)/1! * (P('website'|read)^1)/1! * P(read)  # multinomial with constants removed
     = P('linux'|read) * P('website'|read) * P(read)
     = 3.0/10.5 * 1.0/10.5 * 3/6
     = 0.0136

    P(ignore | 'linux website')
     = P('linux'|ignore) * P('website'|ignore) * P(ignore)          # bayes + multinomial + removed constants
     = 0.1/11.4 * 1.0/11.4 * 3/6
     = 0.0004

    0.0136 > 0.0004 so 'linux website' remains as 'read'
for doc 6, hollywood fashion website...
    P(read | 'hollywood fashion website')
     = P('hollywood'|read) * P('fashion'|read) * P('website'|read) * P(read)           # bayes + multinomial + constants
     = 0.1/10.5 * 0.1/10.5 * 1.0/10.5 * 3/6
     = 0.000004

    P(ignore | 'hollywood fashion website')
     = P('hollywood'|ignore) * P('fashion'|ignore) * P('website'|ignore) * P(ignore)   # bayes + multinomial + constants
     = 3.0/11.4 * 1.0/11.4 * 1.0/11.4 * 3/6
     = 0.001

    0.001 > 0.000004 so 'hollywood fashion website' remains as 'ignore'
the process of retraining and reclassifying the initially unlabelled articles can be repeated until the classes assigned to them stabilise. in this example it was immediate, but a non trivial case might take a number of iterations.
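the whole self training loop, in sketch form, is then something like...

```python
# retrain on the labelled docs plus the current predictions for the unlabelled
# docs, reclassify them, and stop once the predicted classes stop changing
unlabelled_texts = ["linux website", "hollywood fashion website"]

train(labelled_docs)                                    # start from the labelled docs only
predictions = [classify(text) for text in unlabelled_texts]
while True:
    train(labelled_docs + list(zip(unlabelled_texts, predictions)))
    new_predictions = [classify(text) for text in unlabelled_texts]
    if new_predictions == predictions:                  # classes have stabilised
        break
    predictions = new_predictions

print(predictions)  # ['read', 'ignore'], converged after a single pass here
```

once the classes have converged we can classify our new document 'movie on fashion' using the classifier trained with all 6 examples.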
    P(read | 'movie on fashion')
     = P(read | 'on fashion')                            # ignore the unseen term 'movie'
     = P('on fashion'|read) * P(read) / P('on fashion')  # bayes law
     = P('on'|read) * P('fashion'|read) * P(read)        # multinomial + remove constants
     = 1.0/10.5 * 0.1/10.5 * 3/6
     = 0.00045

    P(ignore | 'movie on fashion')
     = P('movie on fashion'|ignore) * P(ignore)
     = P('on'|ignore) * P('fashion'|ignore) * P(ignore)  # unseen + bayes + multinomial + constants
     = 0.1/11.4 * 1.0/11.4 * 3/6
     = 0.00038

    0.00045 > 0.00038 so classify as 'read'

originally 'movie on fashion' was strongly classified as 'read' (0.06 vs 0.006) but now it's only just 'read'; the additional content from doc 6 ('fashion') has brought it down.
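and from the sketch code, with the model now trained over all six documents...

```python
print(round(score("movie on fashion", "read"), 5))    # 0.00045
print(round(score("movie on fashion", "ignore"), 5))  # 0.00038
print(classify("movie on fashion"))                   # read, but only just
```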
here's the code for this new implementation. how does this one do? and does it scale? let's have a look
march two thousand and ten