Posts Tagged ‘maps’

e11.1 from bash scripts to hadoop

Sunday, October 18th, 2009

let’s rewrite v1 using hadoop tooling, code is on github

we’ll run hadoop in non distributed standalone mode. in this mode everything runs in a single jvm so it’s nice and simple to dev against.

step 1: extract the locations strings from the json stream

in v1 it was

bzcat sample.bz2 | ./extract_locations.pl > locations

using the the awesome hadoop streaming interface it’s not too different. this interface allows you to specify any app as the mapper or reducer. the main difference is that it works on directories not just files.

for the mapper we’ll use exactly the same script as before; extract_locations.pl and since there is no reduce component of this job so we use an “identity” script, ie cat, as the reduce phase.

mkdir json_stream
bzcat sample.bz2 | gzip - > json_stream/input.gz
# hadoop supports gzip out of the bound but not bzip2 :(
export HADOOP_STREAMING_JAR=$HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar
hadoop jar $HADOOP_STREAMING_JAR \
  -mapper ./extract_locations.pl -reducer /bin/cat \
  -input json_stream -output locations

this gives us the locations in a single file locations/part-0000

(more…)

e11.0 tweets around the world

Friday, October 16th, 2009

was discussing the streaming twitter api with steve and though i knew about the private firehose i didn’t know there was a lighter weight public gardenhose interface!

since discovering this my pvr has basically been running

curl -u mat_kelcey:XXX http://stream.twitter.com/1/statuses/sample.json |\
   gzip -9 - > sample.json.gz

but what am i going to do with all this data?

while poking around i noticed there was a fair number of iPhone: and ÜT: lat long tagged locations (eg iPhone: 35.670086,139.740766) so as a first hack let’s do some work extracing lat longs and displaying them as heat map points on a map.

all the code is on github

as a test then let’s take a sample.bz2 of 1,300 tweets between Oct 14 22:01:41 and 22:03:24

from this let’s just extract the location part of the tweet

bzcat sample.bz2 | ./extract_locations.pl > locations

of these 1,300 there are 30 examples of iphone lat longs (eg iPhone: -23.492420,-46.846916)

cat locations | ./extract_lat_longs_from_locations.rb iphone > locations.iphone

and 36 examples of ut lat longs (eg UT: 51.503212,5.478329)

cat locations | ./extract_lat_longs_from_locations.rb ut > locations.ut

on a side note, does anyone have any idea what ÜT is ? a phone type, maybe a carrier?

we need to convert these lat/longs to x/y points so we can plot onto a map and we’ll use the standard mercator projection to do this

cat locations.{ut,iphone} | ./lat_long_to_merc.rb > x_y_points

for the heat map we want to aggregate into buckets so the pixels are nice and big. finally we’ll output some simple javascript we can cut and paste into some map html

cat x_y_points | ./bucket.rb | sort | uniq -c | ./as_draw_square.rb

the final result is this map !

a good start. next to do the same over a much larger sample using hadoop streaming and pig and then work towards an animation by aggregating on time slices