

e10.1 crawling twitter

September 19, 2009 at 09:31 PM | categories: Uncategorized

our first goal is to get some data, and the twitter api makes that trivial. i'm mainly interested in the friends calls, but since they only return user ids i'll also fetch each user's info so i can put names to ids.
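something like the following is roughly the shape of those two calls. it's just a sketch, assuming the old unauthenticated v1 json endpoints (friends/ids and users/show) twitter offered at the time; the exact urls, auth and rate limits are assumptions on my part.

    import json
    import urllib.request

    BASE = "http://api.twitter.com/1"

    def friend_ids(user_id):
        # ids of everyone this user follows (one api call)
        url = "%s/friends/ids.json?user_id=%s" % (BASE, user_id)
        with urllib.request.urlopen(url) as response:
            return json.load(response)

    def user_info(user_id):
        # screen name etc, so we can put a name to an id (one api call)
        url = "%s/users/show.json?user_id=%s" % (BASE, user_id)
        with urllib.request.urlopen(url) as response:
            return json.load(response)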

a depth first crawl makes no sense for this experiment; we're unlikely to get the entire graph and we're more interested in following edges "close" to me. instead we'll use a breadth first search.

since any call to twitter is expensive (in time, that is; they rate limit their api calls), instead of plain vanilla breadth first we'll introduce a cost component to elements on the frontier to help decide what to grab next. this is especially important for a graph like twitter where the outdegree of a node is often in the hundreds. it turns the crawl into something that is not strictly breadth first, but it works out.
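as a sketch, the frontier becomes a min heap ordered by cost rather than a plain queue; the lowest cost node gets crawled next. the names here are mine, and cost_fn is the per-edge increment described below.

    import heapq

    def crawl(seed_id, fetch_friends, cost_fn, max_fetches=1000):
        # frontier holds (cost, user_id) pairs; the cheapest node is expanded next.
        # with a constant increment per edge this is exactly breadth first.
        frontier = [(0.0, seed_id)]
        seen = {seed_id}
        graph = {}
        while frontier and len(graph) < max_fetches:
            cost, user_id = heapq.heappop(frontier)
            friends = fetch_friends(user_id)  # one rate-limited api call
            graph[user_id] = friends
            for friend_id in friends:
                if friend_id not in seen:
                    seen.add(friend_id)
                    # children inherit the parent's cost plus a penalty
                    # based on the parent's outdegree
                    heapq.heappush(frontier, (cost + cost_fn(len(friends)), friend_id))
        return graph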

to explain the cost component, consider the expected connectivity of nodes in the twitter friend graph. most nodes have an outdegree on the order of 20-200; occasionally you see much larger (in the 1000's) or much smaller (under 10).

we might, naively perhaps, say that a large outdegree means the person is a bit less strict with her following criteria and that some of the people she follows aren't really that important to her. if that's the case we should focus a little more on getting nodes with smaller outdegree.

the formula i've come up with is to not consider the depth but instead add 1 + log10(1 + the outdegree of the previous node) to the cost each time we follow an edge. in this way we penalise large outdegrees, but not by a huge amount. we always add the 1 to cover the case where there are no edges leaving a node and the log term would be zero.
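to make the numbers concrete, here's a sketch of that increment (the constants are just my reading of the formula above).

    import math

    def cost_fn(outdegree):
        # 1 + log10(1 + outdegree); the +1 inside the log keeps things
        # defined at zero outdegree, the leading 1 keeps the cost growing
        # with every edge we cross.
        return 1 + math.log10(1 + outdegree)

so a parent with 9 friends adds 2.0 to each child's cost, while one with 999 friends adds 4.0; big accounts are penalised, but only gently.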