there are a few problems with the last approach that prevent us from using it on anything but the smallest data sets. they're mainly around trying to calculate an accurate probability distribution for the unlabelled data instead of just choosing the most likely class. to be honest these are problems with my implementation, not fundamental problems with the algorithm.
so to make things easier for myself i'm going to reimplement with some changes...
let's redo the example from our last experiment over the same 6 documents with our above changes...
| feed | doc | article text | class |
|---|---|---|---|
| the register | 1 | new linux goes fast | read |
| perez hilton | 2 | hollywood celebrity goes fast | ignore |
| the register | 3 | quake runs on linux | read |
| perez hilton | 4 | quake occurs in hollywood | ignore |
| the register | 5 | linux website | ? |
| perez hilton | 6 | hollywood fashion website | ? |
a new article then comes along...
| feed | article text | class |
|---|---|---|
| perez hilton | movie on fashion | ?? |
first build a term occurrence matrix for the four training examples...
| class | doc# | new | linux | goes | fast | hollywood | celebrity | quake | runs | on | occurs | in |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| read | 1 | 1 | 1 | 1 | 1 | | | | | | | |
| ignore | 2 | | | 1 | 1 | 1 | 1 | | | | | |
| read | 3 | | 1 | | | | | 1 | 1 | 1 | | |
| ignore | 4 | | | | | 1 | | 1 | | | 1 | 1 |
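to make the bookkeeping concrete, here's a rough python sketch of this step. the names (`labelled_docs`, `occurrences`) and the use of python are just my choices for illustration, not the actual implementation.

```python
from collections import Counter

# the four labelled training documents from the table above, as (text, class) pairs
labelled_docs = [
    ("new linux goes fast", "read"),
    ("hollywood celebrity goes fast", "ignore"),
    ("quake runs on linux", "read"),
    ("quake occurs in hollywood", "ignore"),
]

# one row of the occurrence matrix per document, keyed by (class, doc number)
occurrences = {}
for doc_num, (text, klass) in enumerate(labelled_docs, start=1):
    occurrences[(klass, doc_num)] = Counter(text.split())

print(occurrences[("read", 1)])  # Counter({'new': 1, 'linux': 1, 'goes': 1, 'fast': 1})
```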
we then collapse into just counts for each class...
| class | #docs | new | linux | goes | fast | hollywood | celebrity | quake | runs | on | occurs | in | total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| read | 2 | 1 | 2 | 1 | 1 | | | 1 | 1 | 1 | | | 8 |
| ignore | 2 | | | 1 | 1 | 2 | 1 | 1 | | | 1 | 1 | 8 |
| total | 4 | 1 | 2 | 2 | 2 | 2 | 1 | 2 | 1 | 1 | 1 | 1 | 16 |
and to avoid zero probabilities we fill in zero entries with a small value, say 0.1
| class | #docs | new | linux | goes | fast | hollywood | celebrity | quake | runs | on | occurs | in | total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| read | 2 | 1 | 2 | 1 | 1 | 0.1 | 0.1 | 1 | 1 | 1 | 0.1 | 0.1 | 8.4 |
| ignore | 2 | 0.1 | 0.1 | 1 | 1 | 2 | 1 | 1 | 0.1 | 0.1 | 1 | 1 | 8.4 |
| total | 4 | 1.1 | 2.1 | 2 | 2 | 2.1 | 1.1 | 2 | 1.1 | 1.1 | 1.1 | 1.1 | 16.8 |
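continuing the sketch above, the collapse and smoothing could look something like this (again, illustrative only)...

```python
SMOOTH = 0.1  # small value used in place of zero counts

# collapse the per document counts into per class counts
vocab = set()
class_counts = {}        # class -> Counter of term counts
docs_per_class = Counter()
for (klass, _doc_num), term_counts in occurrences.items():
    docs_per_class[klass] += 1
    class_counts.setdefault(klass, Counter()).update(term_counts)
    vocab.update(term_counts)

# replace zero entries with the smoothing value and keep per class totals
smoothed = {
    klass: {term: counts[term] or SMOOTH for term in vocab}
    for klass, counts in class_counts.items()
}
totals = {klass: sum(term_counts.values()) for klass, term_counts in smoothed.items()}

print(docs_per_class)                                             # Counter({'read': 2, 'ignore': 2})
print({klass: round(total, 1) for klass, total in totals.items()})  # {'read': 8.4, 'ignore': 8.4}
```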
we can now calculate term probabilities conditional on the class; eg P('on'|read) = 1.0/8.4 since 'on' appears 1 time across the 8.4 words in the 'read' articles. (the 0.4 comes from the 4 x 0.1 smoothing values)
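in terms of the sketch above...

```python
def p_term_given_class(term, klass):
    # P(term | class) from the smoothed counts; a term never seen in training
    # just gets the smoothing value
    return smoothed[klass].get(term, SMOOTH) / totals[klass]

print(round(p_term_given_class("on", "read"), 3))  # 0.119, ie 1.0/8.4
```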
then to classify the new article, 'movie on fashion', we calculate which of P(read) and P(ignore) is greater.
    P(read | 'movie on fashion')
     = P(read | 'on')                                  # ignore unseen terms 'movie' and 'fashion'
     = P('on'|read) * P(read) / P('on')                # bayes law
     = 1! * (P('on'|read)^1)/1! * P(read) / P('on')    # assume multinomial distribution, see my previous experiment for more info
     = (P('on'|read)^1)/1! * P(read)                   # ignore constants
     = 1.0/8.4 * 2/4
     = 0.06

    P(ignore | 'movie on fashion')
     = P(ignore | 'on')
     = P('on'|ignore) * P(ignore) / P('on')
     = 1! * (P('on'|ignore)^1)/1! * P(ignore) / P('on')
     = (P('on'|ignore)^1)/1! * P(ignore)
     = 0.1/8.4 * 2/4
     = 0.006

    0.06 > 0.006 so classify 'movie on fashion' as 'read'
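as code, still just sketching, the scoring might look like the following. these are unnormalised scores since the shared P(text) denominator is dropped, which is fine because we only ever compare them.

```python
priors = {"read": 2 / 4, "ignore": 2 / 4}  # P(read) and P(ignore) over the four labelled docs

def score(text, klass):
    # unnormalised P(class | text): the class prior times the product of
    # P(term|class) over the terms we've seen in training. unseen terms and
    # the constant P(text) denominator are dropped, as in the working above
    result = priors[klass]
    for term in text.split():
        if term in vocab:
            result *= p_term_given_class(term, klass)
    return result

def classify(text):
    return max(priors, key=lambda klass: score(text, klass))

print(round(score("movie on fashion", "read"), 3))    # 0.06
print(round(score("movie on fashion", "ignore"), 3))  # 0.006
print(classify("movie on fashion"))                   # read
```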
this is not the best result; you'd think that 'movie' and 'fashion' would imply that i don't want to read this. but since they haven't been seen before the only word we've got to work with is 'on'
so how does a semi supervised version of naive bayes make use of the unlabelled documents?
let's look at the same corpus as before, with four labelled and two unlabelled examples.
| feed | doc | article text | class |
|---|---|---|---|
| the register | 1 | new linux goes fast | read |
| perez hilton | 2 | hollywood celebrity goes fast | ignore |
| the register | 3 | quake runs on linux | read |
| perez hilton | 4 | quake occurs in hollywood | ignore |
| the register | 5 | linux website | ? |
| perez hilton | 6 | hollywood fashion website | ? |
first we classify documents 5 and 6 using labelled docs 1 thru 4.
for 'linux website'...
    P(read | 'linux website')
     = P(read | 'linux')                                   # ignore unseen term 'website'
     = P('linux'|read) * P(read) / P('linux')              # bayes law
     = 1! * (P('linux'|read)^1)/1! * P(read) / P('linux')  # multinomial distribution
    ~= (P('linux'|read)^1)/1! * P(read)                    # ignore constants
     = P('linux'|read) * P(read)
     = 2.0/8.4 * 2/4
     = 0.119

    P(ignore | 'linux website')
    ~= P('linux'|ignore) * P(ignore)
     = 0.1/8.4 * 2/4
     = 0.006

    0.119 > 0.006 => 'linux website' is classed 'read'
and for 'hollywood fashion website'...
    P(read | 'hollywood fashion website')
     = P(read | 'hollywood')                # ignore unseen terms 'fashion' and 'website'
     = P('hollywood'|read) * P(read)
     = 0.1/8.4 * 2/4
     = 0.006

    P(ignore | 'hollywood fashion website')
     = P(ignore | 'hollywood')
     = P('hollywood'|ignore) * P(ignore)
     = 2.0/8.4 * 2/4
     = 0.119

    0.119 > 0.006 => 'hollywood fashion website' is classed 'ignore'
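the sketch code gives the same two answers...

```python
# classify the two unlabelled documents with the classifier trained on docs 1 to 4
for text in ["linux website", "hollywood fashion website"]:
    print(text, "->", classify(text))
# linux website -> read
# hollywood fashion website -> ignore
```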
we can now insert these predicted classes into our table
| feed | doc | article text | class |
|---|---|---|---|
| the register | 1 | new linux goes fast | read |
| perez hilton | 2 | hollywood celebrity goes fast | ignore |
| the register | 3 | quake runs on linux | read |
| perez hilton | 4 | quake occurs in hollywood | ignore |
| the register | 5 | linux website | read |
| perez hilton | 6 | hollywood fashion website | ignore |
rebuild the term occurrence matrix...
| class | doc# | new | linux | goes | fast | hollywood | celebrity | quake | runs | on | occurs | in | website | fashion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| read | 1 | 1 | 1 | 1 | 1 | | | | | | | | | |
| ignore | 2 | | | 1 | 1 | 1 | 1 | | | | | | | |
| read | 3 | | 1 | | | | | 1 | 1 | 1 | | | | |
| ignore | 4 | | | | | 1 | | 1 | | | 1 | 1 | | |
| read | 5 | | 1 | | | | | | | | | | 1 | |
| ignore | 6 | | | | | 1 | | | | | | | 1 | 1 |
and collapse based on class...
| class | #docs | new | linux | goes | fast | hollywood | celebrity | quake | runs | on | occurs | in | website | fashion | total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| read | 3 | 1 | 3 | 1 | 1 | | | 1 | 1 | 1 | | | 1 | | 10 |
| ignore | 3 | | | 1 | 1 | 3 | 1 | 1 | | | 1 | 1 | 1 | 1 | 11 |
| total | 6 | 1 | 3 | 2 | 2 | 3 | 1 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 21 |
and finally include smoothing values...
| class | #docs | new | linux | goes | fast | hollywood | celebrity | quake | runs | on | occurs | in | website | fashion | total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| read | 3 | 1 | 3 | 1 | 1 | 0.1 | 0.1 | 1 | 1 | 1 | 0.1 | 0.1 | 1 | 0.1 | 10.5 |
| ignore | 3 | 0.1 | 0.1 | 1 | 1 | 3 | 1 | 1 | 0.1 | 0.1 | 1 | 1 | 1 | 1 | 11.4 |
| total | 6 | 1.1 | 3.1 | 2 | 2 | 3.1 | 1.1 | 2 | 1.1 | 1.1 | 1.1 | 1.1 | 2 | 1.1 | 21.9 |
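in the sketch this retraining is just the same count building as before, run over all six documents. i've wrapped it up in a (made up) `train` helper so it can be reused below; it rewrites the same module level tables that `score` and `classify` read.

```python
def train(docs, smooth=SMOOTH):
    # rebuild vocab, smoothed class counts, totals and priors over the given
    # (text, class) pairs; same tables as above, just wrapped up for reuse
    global vocab, smoothed, totals, priors
    vocab, class_counts, docs_per_class = set(), {}, Counter()
    for text, klass in docs:
        docs_per_class[klass] += 1
        class_counts.setdefault(klass, Counter()).update(text.split())
        vocab.update(text.split())
    smoothed = {k: {t: c[t] or smooth for t in vocab} for k, c in class_counts.items()}
    totals = {k: sum(tc.values()) for k, tc in smoothed.items()}
    priors = {k: n / len(docs) for k, n in docs_per_class.items()}

# retrain over all six documents, using the predicted classes for docs 5 and 6
all_docs = labelled_docs + [("linux website", "read"), ("hollywood fashion website", "ignore")]
train(all_docs)
print({k: round(v, 1) for k, v in totals.items()})  # {'read': 10.5, 'ignore': 11.4}
```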
given all this we can now reclassify documents 5 and 6 using a classifier trained with all 6 documents. we do this since some documents may now end up with a different class to the one they were first given.
for doc 5, linux website...
    P(read | 'linux website')
     = P('linux website'|read) * P(read) / P('linux website')       # bayes law
     = (P('linux'|read)^1)/1! * (P('website'|read)^1)/1! * P(read)  # multinomial with constants removed
     = P('linux'|read) * P('website'|read) * P(read)
     = 3.0/10.5 * 1.0/10.5 * 3/6
     = 0.0136

    P(ignore | 'linux website')
     = P('linux'|ignore) * P('website'|ignore) * P(ignore)          # bayes + multinomial + removed constants
     = 0.1/11.4 * 1.0/11.4 * 3/6
     = 0.0004

    0.0136 > 0.0004 so 'linux website' remains as 'read'
for doc 6, hollywood fashion website...
    P(read | 'hollywood fashion website')
     = P('hollywood'|read) * P('fashion'|read) * P('website'|read) * P(read)           # bayes + multinomial + constants
     = 0.1/10.5 * 0.1/10.5 * 1.0/10.5 * 3/6
     = 0.000004

    P(ignore | 'hollywood fashion website')
     = P('hollywood'|ignore) * P('fashion'|ignore) * P('website'|ignore) * P(ignore)   # bayes + multinomial + constants
     = 3.0/11.4 * 1.0/11.4 * 1.0/11.4 * 3/6
     = 0.001

    0.001 > 0.000004 so 'hollywood fashion website' remains as 'ignore'
the process of retraining and reclassifying the initially unlabelled articles can be repeated until the classes assigned to them stabilise. in this example it was immediate, but a non trivial case might take a number of iterations.
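the whole self training loop, in sketch form, is then something like...

```python
# retrain on the labelled docs plus the current predictions for the unlabelled
# docs, reclassify them, and stop once the predicted classes stop changing
unlabelled_texts = ["linux website", "hollywood fashion website"]

train(labelled_docs)                                    # start from the labelled docs only
predictions = [classify(text) for text in unlabelled_texts]
while True:
    train(labelled_docs + list(zip(unlabelled_texts, predictions)))
    new_predictions = [classify(text) for text in unlabelled_texts]
    if new_predictions == predictions:                  # classes have stabilised
        break
    predictions = new_predictions

print(predictions)  # ['read', 'ignore'], converged after a single pass here
```

once the classes have converged we can classify our new document 'movie on fashion' using the classifier trained with all 6 examples.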
    P(read | 'movie on fashion')
     = P(read | 'on fashion')                            # ignore the unseen term 'movie'
     = P('on fashion'|read) * P(read) / P('on fashion')  # bayes law
     = P('on'|read) * P('fashion'|read) * P(read)        # multinomial + remove constants
     = 1.0/10.5 * 0.1/10.5 * 3/6
     = 0.00045

    P(ignore | 'movie on fashion')
     = P('movie on fashion'|ignore) * P(ignore)
     = P('on'|ignore) * P('fashion'|ignore) * P(ignore)  # unseen + bayes + multinomial + constants
     = 0.1/11.4 * 1.0/11.4 * 3/6
     = 0.00038

    0.00045 > 0.00038 so classify as 'read'

originally 'movie on fashion' was strongly classified as 'read' (0.06 vs 0.006) but now it's only just 'read'; the additional content from doc 6 ('fashion') has brought it down.
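and from the sketch code, with the model now trained over all six documents...

```python
print(round(score("movie on fashion", "read"), 5))    # 0.00045
print(round(score("movie on fashion", "ignore"), 5))  # 0.00038
print(classify("movie on fashion"))                   # read, but only just
```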
here's the code for this new implementation. how does this one do? and does it scale? let's have a look
march two thousand and ten