but does it do any better?

comparing plain supervised naive bayes vs semi supervised naive bayes

as a test i took a random sample of 300 rss articles from a total of 8000. there was roughly a 50/50 split of articles between perez hilton and the register.

they were partitioned into 3 sets...

article set	number articles
labelled training	30
labelled test	30
unlabelled training	240
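as a rough illustration of the split above, here's a minimal sketch. the function name `partition`, its parameters and the fixed seed are my own assumptions for the example, and integers stand in for the real articles.

```python
import random

def partition(articles, n_train=30, n_test=30, seed=None):
    """shuffle the articles and split them into the three sets."""
    rng = random.Random(seed)
    shuffled = list(articles)
    rng.shuffle(shuffled)
    labelled_train = shuffled[:n_train]
    labelled_test = shuffled[n_train:n_train + n_test]
    unlabelled_train = shuffled[n_train + n_test:]  # whatever is left over
    return labelled_train, labelled_test, unlabelled_train

# stand-in for the corpus: 8000 article ids, of which we sample 300
sample = random.Random(0).sample(range(8000), 300)
train, test, unlabelled = partition(sample, seed=0)
print(len(train), len(test), len(unlabelled))  # 30 30 240
```

the labels of the unlabelled training set are simply ignored during training; only the labelled test set is used for scoring.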

a standard naive bayes classifier was trained using the labelled training set and evaluated for correctness using the labelled test set.

a semi supervised naive bayes classifier was trained with both the labelled training and unlabelled training sets using the scheme we've been discussing. correctness was again evaluated using the labelled test set.
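to make the comparison concrete, here's a small sketch of the two classifiers side by side. it is not the implementation from this series: i'm assuming a multinomial naive bayes with laplace smoothing and a simple hard-label loop (guess labels for the unlabelled articles, retrain on everything, repeat), which may differ from the scheme discussed earlier. all function names and the toy articles are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs is a list of (tokens, label) pairs."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab

def classify(model, tokens):
    """return the label with the highest log probability (ties go to the first label alphabetically)."""
    class_counts, token_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_counts):
        total_tokens = sum(token_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for t in tokens:
            # laplace smoothing so unseen tokens don't zero out the product
            score += math.log((token_counts[label][t] + 1) / (total_tokens + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def train_ssnb(labelled, unlabelled, iterations=5):
    """assumed scheme: label the unlabelled docs with the current model, retrain, repeat."""
    model = train_nb(labelled)
    for _ in range(iterations):
        guessed = [(tokens, classify(model, tokens)) for tokens in unlabelled]
        model = train_nb(labelled + guessed)
    return model

def accuracy(model, test_docs):
    correct = sum(1 for tokens, label in test_docs if classify(model, tokens) == label)
    return correct / len(test_docs)

# toy data, invented for this example only
labelled = [("celebrity gossip party".split(), "perez"),
            ("linux kernel patch".split(), "register")]
unlabelled = ["gossip redcarpet".split(), "party redcarpet".split(),
              "kernel outage".split(), "patch outage".split()]
test_docs = [(["redcarpet"], "perez"), (["outage"], "register")]

nb_accuracy = accuracy(train_nb(labelled), test_docs)
ssnb_accuracy = accuracy(train_ssnb(labelled, unlabelled), test_docs)
print(nb_accuracy, ssnb_accuracy)
```

in this toy case the test words never appear in the labelled articles, so plain nb can only guess, while the semi supervised version picks them up from the unlabelled articles. the toy numbers say nothing about the real experiment.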

the experiment was repeated 7 times, each time with a different random sample of 300 articles. the results are plotted below, showing the additional gain over naive bayes (nb) from using the semi supervised version (ssnb) with 20, 50, 100 or 200 unlabelled examples.
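the bookkeeping for the repeated runs can be sketched like this. `run_experiment` is a hypothetical callable (not something from this series) that takes a run number and a count of unlabelled examples and returns accuracy on the labelled test set, with a count of 0 meaning plain nb. the stub below returns placeholder numbers purely so the sketch runs; they are not results.

```python
def gains_over_nb(run_experiment, n_runs=7, unlabelled_sizes=(20, 50, 100, 200)):
    """for each run, record ssnb accuracy minus nb accuracy at each unlabelled size."""
    table = {}
    for run in range(1, n_runs + 1):
        baseline = run_experiment(run, 0)  # plain nb on this run's sample
        table[run] = {n: run_experiment(run, n) - baseline for n in unlabelled_sizes}
    return table

def stub_experiment(run, n_unlabelled):
    # placeholder only: a real version would sample, train and score
    return 0.5 + n_unlabelled / 1000

gains = gains_over_nb(stub_experiment)
print(sorted(gains))       # run numbers 1 to 7
print(sorted(gains[1]))    # the four unlabelled sizes
```

passing the run number through lets a real `run_experiment` reuse the same random sample for the baseline and for every unlabelled size within a run, so the gains are comparable.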

in every case we can see that adding unlabelled data makes an improvement. yay!
it's interesting that, for runs 1 and 3, 200 unlabelled examples did no better than 100 unlabelled examples.
in general, it seems semi supervised works pretty well!

trouble is my clumsy implementation doesn't scale past a few hundred articles so we need to change it

february two thousand and ten