brain of mat kelcey...


e10.3 twitter crawl progress

September 29, 2009 at 08:43 PM | categories: Uncategorized

since the twitter api is rate limited it's quite slow to crawl twitter, and after most of a week i've still only managed to get info on 8,000 users. i should probably apply for a 20,000 requests an hour limit instead of the 150 i'm on now. in the meantime i'll just let it chug along in the background on my pvr.

while the crawl has been going on i've been trying some things on the data to decide what to do with it.

i've managed to write a version of pagerank using pig, which has been very interesting. (for those who haven't seen it before, pig is a query language that sits on top of hadoop's mapreduce.) my initial feel for pig is that it's pretty awesome; it was much quicker to write this script than it was to write the statistically improbable phrases job, and in fact i'm reinspired to have another crack at the sip stuff using pig. the hadoop performance of my final version wasn't great though, and after some great feedback on the hadoop mailing list i've got a number of other things to try, including writing my joins in pig.

anyways, here's my pagerank in pig

done once

edges = load 'edges' as (from:chararray, to:chararray);
-- group edges by source node so we can see each node's out-degree
nodes = group edges by from;
-- a node splits its rank evenly over its outgoing edges: contrib = 1 / out-degree
node_contribs = foreach nodes generate group, 1.0 / (double)SIZE(edges) as contrib;
store node_contribs into 'node_contribs';
-- a zero contribution per node, unioned in later so every node
-- appears in each iteration even with no incoming edges
zero_contribs = foreach nodes generate group, (double)0 as contrib;
store zero_contribs into 'zero_contribs';
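for what it's worth, this one-off step just works out each node's share, 1 / out-degree. a rough python sketch of the same idea (illustrative only; the edge list here is made up, not from the crawl):

```python
from collections import defaultdict

# toy edge list standing in for the 'edges' file in the pig script
edges = [("a", "b"), ("a", "c"), ("b", "c")]

# group by source node and take 1 / out-degree, mirroring
# the 1.0 / SIZE(edges) per group in the pig version
out_degree = defaultdict(int)
for frm, to in edges:
    out_degree[frm] += 1
node_contribs = {node: 1.0 / n for node, n in out_degree.items()}
```

so here "a" with two outgoing edges contributes half its rank along each one.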

done until convergence

page_rank = load '$input' as (node:chararray, rank:double);
node_contribs = load 'node_contribs' as (node:chararray, contrib:double);
-- a node's total contribution this iteration is its share times its current rank
nodes_page_rank = join node_contribs by node, page_rank by node;
contribs = foreach nodes_page_rank generate node_contribs::node as node, node_contribs::contrib * page_rank::rank as contrib;
-- pass each contribution along the node's outgoing edges
edges = load 'edges' as (from:chararray, to:chararray);
joined_divy_groups = join edges by from, contribs by node;
page_rank_contributions = foreach joined_divy_groups generate edges::to as node, contribs::contrib as contrib;
-- union in the zero contribs (with matching field names, so the
-- union keeps a schema) so nodes with no incoming edges still appear
zero_contribs = load 'zero_contribs' as (node:chararray, contrib:double);
page_rank_contributions_with_zero = union page_rank_contributions, zero_contribs;
group_page_ranks = group page_rank_contributions_with_zero by node;
-- standard damping: 0.15 + 0.85 * sum of incoming contributions
next_page_rank = foreach group_page_ranks generate group, 0.15 + 0.85 * SUM(page_rank_contributions_with_zero.contrib);
store next_page_rank into '$output';
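one iteration of that script boils down to a pretty small amount of logic once you strip away the joins. a python sketch of a single update step (again just illustrative, operating on in-memory dicts rather than hadoop):

```python
def pagerank_step(ranks, edges, node_contribs, damping=0.85):
    # every node starts at zero incoming contribution -- this plays the
    # role of the zero_contribs union, so nodes with no incoming edges
    # still end up with the base (1 - damping) rank
    incoming = {node: 0.0 for node in ranks}
    # pass contrib * rank along each outgoing edge
    for frm, to in edges:
        incoming[to] += node_contribs[frm] * ranks[frm]
    # damped update, matching 0.15 + 0.85 * SUM(...) in the pig script
    return {n: (1 - damping) + damping * total for n, total in incoming.items()}
```

running this repeatedly until the ranks stop changing is the "done until convergence" loop.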

as with all my projects, the code is on github