while doing some more research on trending algorithms i came across a cool little paper about term frequency normalisation for streaming data: TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams.
i'm finding streaming-related algorithms quite interesting lately and think they're the way forward for dealing with large amounts of constantly arriving data. it's just not feasible to use algorithms that expect you to have all the data available at once; they force you to reprocess everything you've ever seen each time new examples arrive. my thinking is that the best solutions are the ones that build a model of the data and fold in new examples in batches. anyway, i'm getting off topic already.
tf/icf as presented in the paper is a variant of the classic tf/idf term weighting, but instead of requiring the weights in all documents to be recalculated every time a new document comes along (as tf/idf strictly does), it approximates them based on a corpus that has already been seen.
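to make that concrete, here's a minimal sketch of the idea. i'm assuming the common formulation w(t, d) = tf(t, d) * log((N + 1) / (cf(t) + 1)), where cf(t) is how many documents in a fixed reference corpus of N documents contain term t; the `TfIcf` class name and the toy corpus are just made up for illustration:

```python
import math
from collections import Counter

class TfIcf:
    def __init__(self, reference_docs):
        # build the corpus-frequency table once, up front;
        # this is the only pass over the reference corpus
        self.n = len(reference_docs)
        self.cf = Counter()
        for doc in reference_docs:
            self.cf.update(set(doc))  # count documents, not occurrences

    def weight(self, doc):
        # weight an incoming document using only the frozen stats:
        # no previously seen document ever needs to be revisited
        tf = Counter(doc)
        return {t: count * math.log((self.n + 1) / (self.cf[t] + 1))
                for t, count in tf.items()}

reference = [["cat", "dog"], ["dog", "fish"], ["cat", "cat", "bird"]]
model = TfIcf(reference)
weights = model.weight(["cat", "whale"])
# an unseen term like "whale" still gets a weight (a high one,
# since its corpus frequency is zero)
```

the key property is in `weight`: it only reads the frozen `cf` table, so scoring a new document is O(length of that document) regardless of how many documents came before.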
so how does it do? quite well, actually. here's my experimental breakdown: