<< semi supervised naive bayes
index
v2: rewriting for scale >>

# but does it do any better?

## comparing plain supervised naive bayes vs semi supervised naive bayes

as a test i took a random sample of 300 rss articles from a total of 8000. there was roughly a 50/50 split of articles between perez hilton and the register.

they were partitioned into 3 sets...

article set | number articles |

labelled training | 30 |

labelled test | 30 |

unlabelled training | 240 |

a standard naive bayes classifier was training using the *labelled training* set and evaluated for correctness using the *labelled test* set.

a semi supervised naive bayes classifier was trained with both the *labelled training* and *unlabelled training* sets using the scheme
we've been discussing.correctness was again evaluated using the *labelled test* set.

the experiment was repeated 7 times for a different random 300 articles with the results plotted below showing the addition gain over naive bayes (nb) using a semi supervised version (ssnb) with 20, 50, 100 or 200 unlabelled examples.

in every case we can see having the adding additional unlablled data makes an improvement. yay!

it's interesting that, for run 1 and 3, 200 unlabelled examples did no better than 100 unlabelled examples.

in general, seems semi supervised works pretty well!

trouble is my clumsy implementation doesn't scale past a few hundred articles so we need to change it

february two thousand and ten