tf/icf

the data

for this experiment i'm using 4500 rss articles taken from theregister.co.uk; here's a sample (one article per line)

i picked one particular article to investigate; here it is in its crude tokenised form

wireds advice could seriously damage your business exclusive the most comprehensive empirical study of digital music sales ever conducted has some bad news for californian technology utopians since 2004 wired magazine editor chris anderson has been hawking his long tail proposition around the world blockbusters will matter less and businesses will sell more of less the graph has become iconic a kind of hockey stick for web 2 0 with the author applying his message to many different business sectors alas following the wired way of business as a matter of faith could be catastrophic for your business and investment decisions
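the tokenisation is deliberately crude: lowercase everything, drop apostrophes and strip everything else that isn't a letter or digit. a minimal sketch of that kind of tokeniser (not necessarily the exact one used for this experiment):

```python
import re

def tokenise(text):
    # drop apostrophes so "wired's" becomes "wireds", lowercase,
    # replace any other non-alphanumeric run with a space, then split
    text = text.lower().replace("'", "")
    return re.sub(r"[^a-z0-9]+", " ", text).split()

print(tokenise("Wired's advice could seriously damage your business!"))
# ['wireds', 'advice', 'could', 'seriously', 'damage', 'your', 'business']
```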

tf/idf

here's how the terms in this article get weighted with traditional tf/idf calculated across the entire corpus (word size is proportional to the relevance contribution of the word to this document according to tf/idf)

[the article rendered with each word sized by its tf/idf weight]

looks like what we'd expect. common english words such as 'a' and 'the' carry little weight. words common to this document but uncommon elsewhere get a big weight; in particular 'wired' really stands out.
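for reference, here's a minimal sketch of one standard tf/idf weighting (raw term frequency times log inverse document frequency over the whole corpus); the exact variant behind the rendering above isn't spelled out here, so treat this as illustrative:

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    # corpus is a list of token lists, assumed to include doc_tokens
    n_docs = len(corpus)
    df = Counter()                      # document frequency of each term
    for doc in corpus:
        df.update(set(doc))
    tf = Counter(doc_tokens)
    return {term: (count / len(doc_tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()}
```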

tf/icf

how about tf/icf?

the term weightings in this case vary considerably based on how many documents have already been processed; the more documents previously seen, the more accurate (ie closer to the "true" tf/idf) they become
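a sketch of the idea, assuming tf/icf here means computing the inverse-frequency component from only the documents seen so far (an incremental stand-in for idf); again illustrative rather than the exact code behind these numbers:

```python
import math
from collections import Counter

class TfIcf:
    def __init__(self):
        self.n_docs_seen = 0
        self.cf = Counter()   # how many docs seen so far contain each term

    def weigh(self, doc_tokens):
        # fold the new document into the corpus counts, then weight it
        # using only what has been seen up to this point
        self.n_docs_seen += 1
        self.cf.update(set(doc_tokens))
        tf = Counter(doc_tokens)
        return {term: (count / len(doc_tokens)) *
                      math.log(self.n_docs_seen / self.cf[term])
                for term, count in tf.items()}
```

because the corpus frequencies only reflect documents processed so far, weightings for early documents can be a long way from their final tf/idf values.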

eg when the document is only the 6th to be seen its weightings are quite different...

[the article rendered with tf/icf weights when it was the 6th document processed]

but if 400 (or more) documents have been seen its values start to look more like those of the true tf/idf; here's another example, where the document was the 417th to be processed...

[the article rendered with tf/icf weights when it was the 417th document processed]

the values of tf/icf are never the same as those of tf/idf (even if this document is the last to be seen) but they look close enough to be workable...
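"close enough" is judged by eye from the renderings above; if you wanted to quantify it, one simple option would be the cosine similarity between the two term-weight vectors (a hypothetical check, not part of the original experiment):

```python
import math

def cosine_similarity(weights_a, weights_b):
    # treat the two term -> weight dicts as sparse vectors
    dot = sum(w * weights_b.get(term, 0.0) for term, w in weights_a.items())
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```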

jun 9 2010, see other stuff at matpalm.com