e11.2 aggregating tweets by time of day

October 24, 2009 at 01:02 PM | categories: e11, twitter, hadoop, pig

for v3 let's aggregate by time of the day, which should make for an interesting animation.

browsing the data, there are lots of other lat longs in there, not just the iPhone: and ÜT: ones; there are also ones tagged with Coppó:, Pre:, etc. perhaps i should just try to take anything that looks like a lat long.

furthermore let's switch to a bigger dataset again, 4.7e6 tweets from Oct 13 07:00 thru Oct 19 17:00. i've been streaming all my tweets (as previously discussed) and storing them in a directory, json_stream.

here are the steps...

use a streaming script to take a tweet in json...
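as a rough illustration of the "take anything that looks like a lat long" idea, here's a minimal python sketch of a streaming-style mapper. it is not the script from the post. it assumes each input line is one tweet in json, that the tagged coordinates live in `user.location` and that `created_at` looks like `Tue Oct 13 07:15:02 +0000 2009`; the names `extract_lat_long`, `map_tweet` and `run` are made up for this example.

```python
import json
import re

# Assumption: a lat long is any "number.number, number.number" pair,
# regardless of the prefix tag (iPhone:, ÜT:, Pre:, ...).
LAT_LONG = re.compile(r"(-?\d{1,3}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)")


def extract_lat_long(location):
    """Return (lat, lon) if the text contains a plausible pair, else None."""
    if not isinstance(location, str):
        return None
    match = LAT_LONG.search(location)
    if not match:
        return None
    lat, lon = float(match.group(1)), float(match.group(2))
    if -90 <= lat <= 90 and -180 <= lon <= 180:
        return lat, lon
    return None


def map_tweet(line):
    """Turn one json tweet into 'hour<TAB>lat<TAB>lon', or None to skip it."""
    try:
        tweet = json.loads(line)
    except ValueError:
        return None
    if not isinstance(tweet, dict):
        return None
    user = tweet.get("user")
    location = user.get("location") if isinstance(user, dict) else None
    coords = extract_lat_long(location)
    created_at = tweet.get("created_at")
    parts = created_at.split() if isinstance(created_at, str) else []
    if coords is None or len(parts) < 4:
        return None
    hour = parts[3].split(":")[0]  # "07:15:02" -> "07"
    return "%s\t%.4f\t%.4f" % (hour, coords[0], coords[1])


def run(lines, out):
    """Mapper loop: one tweet per input line, one record per located tweet."""
    for line in lines:
        record = map_tweet(line)
        if record is not None:
            out.write(record + "\n")
```

calling `run(sys.stdin, sys.stdout)` would make this usable as a line-in, line-out mapper; since the hour is the first column, a later step can group on it to build one frame per hour.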


e11.1 from bash scripts to hadoop

October 18, 2009 at 02:10 PM | categories: e11, maps, twitter, hadoop, pig

let's rewrite v1 using hadoop tooling; code is on github.

we'll run hadoop in non distributed standalone mode. in this mode everything runs in a single jvm so it's nice and simple to dev against.

in v1 it was

bzcat sample.bz2 | ./extract_locations.pl > locations

using the awesome hadoop streaming interface it's not too different. this interface allows you to specify any app as the mapper or reducer. the main difference is that it works on directories, not just files.

for the mapper we'll use exactly the same script as before, extract_locations.pl, and since there is no reduce component to this job we...
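to make the "any app as the mapper, directories not files, no reducer" shape concrete, here's a small python stand-in that runs locally. it is not hadoop and doesn't use any hadoop api; it just pipes every file in an input directory through an arbitrary command and treats the mapper output as the final output. `run_map_only` and `demo` are hypothetical names, and the `part-00000` file name only mimics the usual hadoop output naming.

```python
import os
import subprocess
import sys
import tempfile


def run_map_only(input_dir, mapper_cmd, output_dir):
    """Pipe each file in input_dir through mapper_cmd; no reduce step."""
    os.makedirs(output_dir, exist_ok=True)
    out_path = os.path.join(output_dir, "part-00000")
    with open(out_path, "wb") as out:
        for name in sorted(os.listdir(input_dir)):
            path = os.path.join(input_dir, name)
            if not os.path.isfile(path):
                continue
            with open(path, "rb") as src:
                # The mapper is any executable that reads stdin, writes stdout.
                subprocess.run(mapper_cmd, stdin=src, stdout=out, check=True)
    return out_path


def demo():
    """Run a toy upper-casing mapper over a two file input directory."""
    upper = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    with tempfile.TemporaryDirectory() as tmp:
        in_dir = os.path.join(tmp, "input")
        os.makedirs(in_dir)
        for name, text in [("a.txt", "one\n"), ("b.txt", "two\n")]:
            with open(os.path.join(in_dir, name), "w") as handle:
                handle.write(text)
        out_path = run_map_only(in_dir, upper, os.path.join(tmp, "output"))
        with open(out_path) as handle:
            return handle.read()
```

swapping the toy mapper for something like `["./extract_locations.pl"]` would give the same flow as the v1 pipeline, minus the bzcat step.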


e10.3 twitter crawl progress

September 29, 2009 at 08:43 PM | categories: e10, twitter, algorithms, hadoop

since the twitter api is rate limited it's quite slow to crawl twitter, and after most of a week i've still only managed to get info on 8,000 users. i probably should subscribe to get a 20,000 an hr limit instead of the 150 i'm on now. i'll just let it chug along in the background on my pvr.

while the crawl has been going on i've been trying some things on the data to decide what to do with it.

i've managed to write a version of pagerank using pig, which has been very interesting. (for those who haven't seen...


e10.0 introducing tgraph

September 19, 2009 at 02:41 PM | categories: big data, e10, twitter, hadoop, pig, algorithms

so e9 sip is on hold for a bit while i kick off e10 tgraph. i was looking for another problem to try hadoop with and came across a classic graph one, pagerank. a well understood algorithm like pagerank will be a great chance to try pig, the query language that sits on top of hadoop mapreduce.

so we need a graph to work on. my first thought was to use one of the wikipedia linkage dumps but it feels a bit sterile. instead it's a good excuse to do a little crawl of the following graph of twitter.

this will also be...
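for anyone who hasn't seen pagerank before, here's a minimal in-memory python sketch of the usual power iteration, not the pig version. it assumes the graph is a dict mapping each user to the list of users they follow, that rank flows from follower to followed, and that users who follow nobody spread their rank evenly over everyone; the damping factor of 0.85 and the name `pagerank` are choices made for this example.

```python
def pagerank(follows, damping=0.85, iterations=50):
    """Power iteration over a {user: [users they follow]} graph."""
    nodes = set(follows)
    for targets in follows.values():
        nodes.update(targets)
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by users with no outgoing edges is shared by everyone.
        dangling = sum(rank[node] for node in nodes if not follows.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node, targets in follows.items():
            if not targets:
                continue
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank
```

on a toy graph like `{"a": ["c"], "b": ["c"], "c": ["a"]}` the ranks sum to 1 and `c`, followed by both other users, ends up on top. a mapreduce or pig version has to express each iteration as a join plus a group, which is where it gets interesting.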


first hadoop experiment

September 16, 2009 at 07:26 PM | categories: ec2, big data, hadoop

just finished my first hadoop experiment.

matpalm.com/sip

not fantastic results but heaps of feedback from the hadoop mailing group.

more results coming soon...



brain of mat kelcey
