what is a statistically improbable phrase?

amazon has an interesting feature where, for each book, they calculate a number of small snippets that are representative of the text. they call these snippets statistically improbable phrases (sips), and they are useful as a type of keyword phrase for a book, to aid in searching for books on a certain topic.

for example, the great data mining book 'data mining - practical machine learning tools and techniques' had its sips listed as...

informational loss function, category utility formula, generic object editor, synthetic binary attributes, multiresponse linear regression, replicated subtree problem, contact lens data, subtree raising, hypermetrope yes, arff file, myope yes, labor negotiations data, practical data mining, false yes rainy, numeric prediction, outlook attribute, kernel perceptron, subtree replacement, machine learning schemes, holdout set, iris dataset, relative squared error, numeric attributes, weighted instances, maximum margin hyperplane

diy sip

at first i thought these would be pretty easy to calculate. after some more thinking it turns out it's not as simple as it might first appear (like a lot of things i guess). so like all good devs i've decided to roll my own as an exercise.

and to be honest this is as much about trying hadoop as anything else..

the data source

firstly we need a corpus to work with, and i decided to use a bunch of etexts taken from project gutenberg (mainly cause it's so easy to get). the only bad thing i can see about it is that the books are quite old (out of copyright, which is why they are free to download) and span a reasonable time frame, so there will be some language drift. perhaps this drift itself could be an interesting thing to look at another day?

cleaning the data

in general for any machine learning application a fair amount of work needs to go into "cleaning" the data, whatever cleaning might mean...

for the gutenberg texts some of the problems with the data include...

each text includes a project gutenberg header and footer which is just noise for our purpose
a number of the files aren't even in english
some texts are more reference like; eg 'the first 1001 fibonacci numbers' or 'a complete grammar of esperanto' and will have large chunks of "unnatural" speech

i ignored the first item for a while but then decided to clean a little bit of the header info with these scripts. it's not perfect but it removes about 4e6 lines of header/footer guff.
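as a rough idea of what that sort of cleaning can look like, here's a minimal python sketch. it is not the scripts mentioned above, just an illustration. it assumes each text wraps its body in '*** START OF ...' and '*** END OF ...' marker lines; the exact wording of those markers varies between files, so the two regexes (and the function name) are my assumptions and would need tuning against the real corpus.

```python
import re

# assumed marker shapes; real gutenberg files vary, so treat these as a starting point
START_RE = re.compile(r"^\*\*\*\s*START OF (THE|THIS) PROJECT GUTENBERG", re.IGNORECASE)
END_RE = re.compile(r"^\*\*\*\s*END OF (THE|THIS) PROJECT GUTENBERG", re.IGNORECASE)


def strip_gutenberg_boilerplate(lines):
    """return only the lines between the start and end markers.

    if a marker is missing, that side of the text is left untouched.
    """
    start = 0
    end = len(lines)
    for i, line in enumerate(lines):
        if START_RE.match(line):
            start = i + 1  # body begins on the line after the start marker
        elif END_RE.match(line):
            end = i        # body stops just before the end marker
            break
    return lines[start:end]


if __name__ == "__main__":
    sample = [
        "some header guff",
        "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***",
        "it was a dark and stormy night",
        "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***",
        "some license guff",
    ]
    print(strip_gutenberg_boilerplate(sample))
```

a file with no markers at all passes through unchanged, which is the safe default when the goal is only to shave off noise.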

as for the other two issues, i've decided to ignore them to see if i can determine how much of an effect they'll have (it's not at all because i'm lazy)

let's take our first crack at diy sips with trigram frequencies

sept 2009