i've been poking around with a simple trending algorithm over the last few weeks and have uncovered a problem that, like most interesting ones, i'm not sure how to solve. the question revolves around discovering multi terms trends.
a sensible place to start when looking for trends is to consider single terms but what if though we ended up with three equally trending terms 'happy', 'new' and 'year'? it's pretty obvious that the actual trend is 'happy new year' but what is the best way to express this as a single trend in an algorithmic sense?
one approach i've been playing with is to collect unigrams, bigrams and trigrams (1,2,3 term 'phrases') and consider the cases where the terms overlap. basically if 'happy new year' is trending then, in some sense, we can ignore trends for 'happy new', 'new year', 'happy', 'new' and 'year'. but does this result in to many false positives? would we miss 'happy' as a trend if lots of people were chirpy about the change of year (as they usually are, on new years eve)
rather than outright ignore we could somehow reduce the weighting by removing the double counting.
eg if we had 3 trends; (free beer,11), (free,12) & (beer,25) we can take 11 (from the 2gram) off both 1grams to give (free beer,11), (free,1) & (beer,14) showing that 'beer', outside of the phrase 'free beer', is perhaps a trend in itself (as it should be)
this feels like it might work but would be non trivial (read: fun) to implement
another slightly different problem is around the handling of retweeting. my experiments have shown a huge amount of the 'trends' found are related to retweets, which is fine in itself, but it gives quite strange trends since the retweeted portion of the text is usually quite long.
for example; say lots of people are retweeting something and, as some people do, are adding various bits and pieces at the beginning and end; eg 'RT @bob omg i just found a peanut' or 'omg i just found a peanut; via @bob lucky him!!'
if we're considering bigrams (which i am in my current implementation) we end up with an odd selection of trends such as 'just found', 'a peanut', 'omg i', 'found a', 'i just' and in these cases it'd be great to be able to just stitch them together into the common retweeted element 'omg i just found a peanut'.
we could 'solve' this problem by not just considering 1,2 and 3 grams but considering all possible n-grams for each tweet and employing the technique we spoke of above of reducing the counts. it'd almost be feasible, since tweets are never that long, but feels uber clumsy and i'd hate to see the order statistic of that algorithm ;)
this seems more like a stitching problem of some kind; eg if we have 4 grams 'omg i just found', 'i just found a', 'just found a peanut' perhaps we can identify the non trivial overlap and stitch them together (?)
not sure, there are a number of things to try. was hoping that brain dumping some of this would help me see the light but nothing obvious jumps out :(