Posts Tagged ‘algorithms’

how many terms in a trend?

Tuesday, May 11th, 2010

i’ve been poking around with a simple trending algorithm over the last few weeks and have uncovered a problem that, like most interesting ones, i’m not sure how to solve. the question revolves around discovering multi terms trends.

a sensible place to start when looking for trends is to consider single terms but what if though we ended up with three equally trending terms ‘happy’, ‘new’ and ‘year’? it’s pretty obvious that the actual trend is ‘happy new year’ but what is the best way to express this as a single trend in an algorithmic sense?

one approach i’ve been playing with is to collect unigrams, bigrams and trigrams (1,2,3 term ‘phrases’) and consider the cases where the terms overlap. basically if ‘happy new year’ is trending then, in some sense, we can ignore trends for ‘happy new’, ‘new year’, ‘happy’, ‘new’ and ‘year’. but does this result in to many false positives? would we miss ‘happy’ as a trend if lots of people were chirpy about the change of year (as they usually are, on new years eve)

rather than outright ignore we could somehow reduce the weighting by removing the double counting.

eg if we had 3 trends; (free beer,11), (free,12) & (beer,25)
we can take 11 (from the 2gram) off both 1grams to give (free beer,11), (free,1) & (beer,14)
showing that ‘beer’, outside of the phrase ‘free beer’, is perhaps a trend in itself (as it should be)

this feels like it might work but would be non trivial (read: fun) to implement

another slightly different problem is around the handling of retweeting. my experiments have shown a huge amount of the ‘trends’ found are related to retweets, which is fine in itself, but it gives quite strange trends since the retweeted portion of the text is usually quite long.

for example; say lots of people are retweeting something and, as some people do, are adding various bits and pieces at the beginning and end; eg ‘RT @bob omg i just found a peanut’ or ‘omg i just found a peanut; via @bob lucky him!!’

if we’re considering bigrams (which i am in my current implementation) we end up with an odd selection of trends such as ‘just found’, ‘a peanut’, ‘omg i’, ‘found a’, ‘i just’ and in these cases it’d be great to be able to just stitch them together into the common retweeted element ‘omg i just found a peanut’.

we could ’solve’ this problem by not just considering 1,2 and 3 grams but considering _all_ possible n-grams for each tweet and employing the technique we spoke of above of reducing the counts. it’d almost be feasible, since tweets are never that long, but feels uber clumsy and i’d hate to see the order statistic of that algorithm ;)

this seems more like a stitching problem of some kind; eg if we have 4 grams ‘omg i just found’, ‘i just found a’, ‘just found a peanut’ perhaps we can identify the non trivial overlap and stitch them together (?)

not sure, there are a number of things to try. was hoping that brain dumping some of this would help me see the light but nothing obvious jumps out :(

e10.4 communities in social graphs

Tuesday, October 6th, 2009

social graphs, like twitter or facebook, often follow the pattern of having clusters of highly connected components with an occasional edge joining these clusters.

these connecting edges define the boundaries of communities in the social network and can be identified by algorithms that measure betweenness.

the girvan-newman algorithm can be used to decompose a graph hierarchically based on successive removal of the edges with the highest betweenness.

the algorithm is basically

  1. calculate the betweenness of each edge (using an all shortest paths algorithm)
  2. remove the edge(s) with the highest betweenness
  3. check for connected components (using tarjan’s algorithm)
  4. repeat for graph or subgraphs if graph was split

e10.3 twitter crawl progress

Tuesday, September 29th, 2009

since the twitter api is rate limited it’s quite slow to crawl twitter and after a most of a week i’ve still only managed to get info on 8,000 users. i probably should subscribe to get a 20,000 an hr limit instead of the 150 i’m on now. i’ll just let it chug along in the background of my pvr.

while the crawl has been going on i’ve been trying some things on the data to decide what to do with it.

i’ve managed to write a version of pagerank using pig which has been very interesting. (for those who haven’t seen it before pig is a query language that sits on top of hadoop’s mapreduce). my initial feel for pig is that it’s pretty awesome. it was much quicker to write this script than to write the statistically improbable phrases. in fact i’m reinspired to have another crack at the sip stuff using pig. my final result wasn’t great for the performance of hadoop and after some great feedback on the hadoop mailing list i’ve got a number of other things to try including writing my joins in pig.

anyways, here’s my pagerank in pig

(more…)

e10.1 crawling twitter

Saturday, September 19th, 2009

our first goal is to get some data and the twitter api makes getting the data trivial. i’m focused mainly on the friends stuff but because it only gives user ids i’ll also get the user info so i can put names to ids.

a depth first crawl makes no sense for this one experiment, we’re unlikely to get the entire graph and are more interested in following edges “close” to me. instead we’ll use a breadth first search.

since any call to twitter is expensive (in time that is, they rate limit their api calls) instead of a plain vanilla breadth first we’ll introduce a cost component to elements on the frontier so help decide what to grab next. this is especially important for a graph like twitter where the outdegree of a node is often in the hundreds. it turns the crawl into something that is not strictly depth first but it works out.

to explain the cost component consider the expected connectivity of nodes in the twitter friend graph. most nodes have an outdegree of the order 20-200. occasionally you see much larger (in the 1000’s) or much smaller (under 10).

we might, naively perhaps, say that having a large outdegree means the person is a bit less strict with her following criteria and that some of them are not really that important to her. if this is the case we should focus a little more on getting nodes with smaller outdegree.

the formula i’ve come up with is to not consider the depth but instead add 1 + Log10(1+the outdegree of the previous node). in this way we penalise large outdegress, but not by a huge amount. we always add 1 to counter the cases where there are no edges leaving a node.

e10.0 introducing tgraph

Saturday, September 19th, 2009

so e9 sip is on hold for a bit while i kick off e10 tgraph. was looking for another problem to try hadoop with and came across a classic graph one, pagerank. a well understood algorithm like page rank will be a  great chance to try pig, the query language that sits on top of hadoop mapreduce.

so we need a graph to work on. my first thoughts were using one of the wikipedia linkage dumps but it feels a bit sterile. instead it’s a good excuse to do a little crawl of the following graph of twitter.

this will also be a chance to try to document a project via a blog. skorks‘ incessant blog rambling has convinced me to give it a go.

bin packing

Sunday, December 14th, 2008

how to decide what next to backup onto a dvd?

when is brute force good enough? will a random walk get a good enough result faster?

http://www.matpalm.com/burn.it

the median of a trillion numbers

Saturday, November 15th, 2008

i got asked in an interview once “how would find the median of a trillion numbers across a thousand machines?”

the question has haunted me, until now.

here’s my ruby and erlang implementation with a bit of running amazon ec2 thrown in for good measure….. matpalm.com/median/

grab the code from github

fastmap and the jaccard distance

Friday, October 31st, 2008

given a set of pairwise distances how do you determine what points correspond to those distances?

my latest experiment considers this problem in relation to jaccard distances, a resemblance measure similar to jaccard coefficients used in a previous experiment

by using the fastmap algorithm we get points from distances and once you have points you have visualisation!

shingling and the jaccard index

Monday, October 6th, 2008

on the weekend i did another experiment using shingling and the jaccard index to try to determine if two sets of data were “duplicates”

it works quite well and includes a ruby and c++ version with low level bit operations.

project page is www.matpalm.com/resemblance

code at github.com/matpalm/resemblance

i was going to put the discussion here but the page ended up too long, next time i’ll break it into chunks, maybe.