Posts Tagged ‘e11’

e11.3 at what time does the world tweet?

Wednesday, October 28th, 2009

consider the graph below which shows the proportion of tweets per 10 min slot of the day (GMT0)

it compares 4.7e6 tweets with any location vs 320e3 tweets with identifiable lat lons

timeslices_freq.comparison

some interesting observations with unanswered questions…

  1. the ebb and flow is not just a result of the time of day in high twitter traffic areas. the reduction between 06:00 and 10:00 comes close to zero, which can’t reflect real usage; there is never a time when worldwide internet traffic hits zero. does twitter turn down its gardenhose for capacity reasons?
  2. the number of tweets with lat lons is correlated with the number without EXCEPT past 17:00, where the lat lon cases drop drastically. i have a couple of ideas banging around my head as to why this is the case but nothing concrete. any ideas?

speaking of correlation, here’s a scatterplot of tweets with lat lons vs without. the time slots past 17:00, where the correlation breaks down, show up as a quite obvious cluster.

timeslices_freq.scatter

and here is the R code for these graphs
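
separately from that R code, here is a minimal ruby sketch of how the correlation between the two series could be quantified with a pearson coefficient. the function name and the idea of passing in one proportion per time slot are assumptions for illustration, not something from the original analysis.

```ruby
# pearson correlation between two equal length numeric series,
# eg proportion of tweets per slot with and without lat lons.
# (hypothetical helper, not part of the original R code)
def pearson(xs, ys)
  if xs.empty? || xs.size != ys.size
    raise ArgumentError, 'series must be the same non-zero length'
  end

  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n

  # sum of co-deviations and of squared deviations from the means
  cov   = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  var_x = xs.sum { |x| (x - mean_x)**2 }
  var_y = ys.sum { |y| (y - mean_y)**2 }

  # note: a constant series has zero variance, which gives NaN here
  cov / Math.sqrt(var_x * var_y)
end
```

running this once over all slots and once over only the slots before 17:00 would be one way to put a number on how much that cluster hurts the overall correlation.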

e11.2 aggregating tweets by time of day

Saturday, October 24th, 2009

for v3 let’s aggregate by time of the day, should make for an interesting animation
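
as a rough illustration of what aggregating by time of day means, here is a minimal ruby sketch that maps a tweet timestamp to a slot of the day and turns the counts into proportions. the 10 minute slot width matches the graphs in e11.3; the function names and the assumption that timestamps arrive as strings `Time.parse` can read are mine, not taken from the actual code.

```ruby
require 'time'

# slot width in minutes; 10 gives 144 slots per day
SLOT_MINUTES = 10

# map a timestamp string to its slot of the day in UTC
# (hypothetical helper; assumes Time.parse understands the format)
def slot_of_day(created_at)
  t = Time.parse(created_at).utc
  (t.hour * 60 + t.min) / SLOT_MINUTES
end

# count tweets per slot and normalise the counts to proportions
def slot_proportions(timestamps)
  counts = Hash.new(0)
  timestamps.each { |ts| counts[slot_of_day(ts)] += 1 }
  total = timestamps.size.to_f
  counts.transform_values { |c| c / total }
end
```

because the slot ignores the date, tweets from every day of the sample land in the same 144 buckets, which is what lets a week of data be shown as a single day.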

browsing the data there are lots of other lat longs in there, not just the iPhone: and ÜT: ones. there are also ones tagged with Coppó:, Pre:, etc. perhaps i should just try to take anything that looks like a lat long.
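
one way to take anything that looks like a lat long is to ignore the prefix entirely and match a pair of comma separated decimals. this is only a sketch under that assumption; the regex and the function name are illustrative, and it will happily accept any two plausible decimals, whatever they were meant to be.

```ruby
# a signed decimal with at least one fractional digit, eg -23.492420
DECIMAL  = /-?\d{1,3}\.\d+/
# two such decimals separated by a comma, with optional whitespace
LAT_LONG = /(#{DECIMAL})\s*,\s*(#{DECIMAL})/

# return [lat, lon] as floats, or nil if the location string has no
# plausible pair (hypothetical helper, prefix such as iPhone: is ignored)
def extract_lat_long(location)
  m = LAT_LONG.match(location)
  return nil unless m

  lat = m[1].to_f
  lon = m[2].to_f
  # reject pairs outside the valid coordinate ranges
  return nil unless lat.between?(-90, 90) && lon.between?(-180, 180)

  [lat, lon]
end
```

requiring a fractional part keeps things like street numbers or "1,2" from being mistaken for coordinates.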

furthermore let’s switch to a bigger dataset again, 4.7e6 tweets from Oct 13 07:00 thru Oct 19 17:00.

i’ve been streaming all my tweets ( as previously discussed ) and been storing them in a directory json_stream

here are the steps…

(more…)

e11.1 from bash scripts to hadoop

Sunday, October 18th, 2009

let’s rewrite v1 using hadoop tooling, code is on github

we’ll run hadoop in non distributed standalone mode. in this mode everything runs in a single jvm so it’s nice and simple to dev against.

step 1: extract the locations strings from the json stream

in v1 it was

bzcat sample.bz2 | ./extract_locations.pl > locations

using the awesome hadoop streaming interface it’s not too different. this interface allows you to specify any app as the mapper or reducer. the main difference is that it works on directories not just files.

for the mapper we’ll use exactly the same script as before, extract_locations.pl, and since there is no reduce component to this job we use an “identity” script, ie cat, as the reduce phase.

mkdir json_stream
bzcat sample.bz2 | gzip - > json_stream/input.gz
# hadoop supports gzip out of the box but not bzip2 :(
export HADOOP_STREAMING_JAR=$HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar
hadoop jar $HADOOP_STREAMING_JAR \
  -mapper ./extract_locations.pl -reducer /bin/cat \
  -input json_stream -output locations

this gives us the locations in a single file locations/part-00000

(more…)

e11.0 tweets around the world

Friday, October 16th, 2009

was discussing the streaming twitter api with steve and though i knew about the private firehose i didn’t know there was a lighter weight public gardenhose interface!

since discovering this my pvr has basically been running

curl -u mat_kelcey:XXX http://stream.twitter.com/1/statuses/sample.json |\
   gzip -9 - > sample.json.gz

but what am i going to do with all this data?

while poking around i noticed there was a fair number of iPhone: and ÜT: lat long tagged locations (eg iPhone: 35.670086,139.740766) so as a first hack let’s do some work extracting lat longs and displaying them as heat map points on a map.

all the code is on github

as a test then let’s take a sample.bz2 of 1,300 tweets between Oct 14 22:01:41 and 22:03:24

from this let’s just extract the location part of the tweet

bzcat sample.bz2 | ./extract_locations.pl > locations
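
the real extract_locations.pl is perl and lives on github; purely as an illustration of the idea, here is a ruby sketch of pulling the location out of one line of the json stream. it assumes each line is a json object with the location under `user` then `location`, which is my assumption about the stream format, and the function name is made up.

```ruby
require 'json'

# return the user's location string from one line of the json stream,
# or nil if the line isn't a tweet with a non-empty location
# (hypothetical helper; the user/location field names are assumed)
def location_of(line)
  tweet = JSON.parse(line)
  return nil unless tweet.is_a?(Hash)

  user = tweet['user']
  return nil unless user.is_a?(Hash)

  location = user['location']
  return nil unless location.is_a?(String)

  location = location.strip
  location.empty? ? nil : location
rescue JSON::ParserError
  # skip lines that are truncated or otherwise not valid json
  nil
end

# usage as a stdin to stdout filter would look something like:
#   STDIN.each_line { |l| loc = location_of(l); puts loc if loc }
```

one line in, at most one line out is also the shape a hadoop streaming mapper needs, so a filter like this can be reused unchanged.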

of these 1,300 there are 30 examples of iphone lat longs (eg iPhone: -23.492420,-46.846916)

cat locations | ./extract_lat_longs_from_locations.rb iphone > locations.iphone

and 36 examples of ut lat longs (eg ÜT: 51.503212,5.478329)

cat locations | ./extract_lat_longs_from_locations.rb ut > locations.ut

on a side note, does anyone have any idea what ÜT is? a phone type, maybe a carrier?

we need to convert these lat/longs to x/y points so we can plot onto a map and we’ll use the standard mercator projection to do this

cat locations.{ut,iphone} | ./lat_long_to_merc.rb > x_y_points
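
for reference, the projection itself is only a few lines. this is a sketch of the spherical mercator formula scaled onto a square map, not the contents of lat_long_to_merc.rb; the 256 pixel default size, the latitude clamp and the function signature are assumptions for illustration.

```ruby
# latitude at which a square mercator map is cut off; beyond this
# the projection stretches towards infinity
MAX_LAT = 85.05112878

# project lat/lon in degrees to x/y on a size x size map, with
# x growing eastwards and y growing downwards from the top left
# (hypothetical signature; size in pixels is an assumed parameter)
def lat_long_to_merc(lat, lon, size = 256)
  lat = [[lat, -MAX_LAT].max, MAX_LAT].min
  x = (lon + 180.0) / 360.0 * size

  lat_rad = lat * Math::PI / 180.0
  # mercator y in the range -pi..pi for the clamped latitudes
  merc_n = Math.log(Math.tan(Math::PI / 4 + lat_rad / 2))
  y = (1.0 - merc_n / Math::PI) / 2.0 * size

  [x, y]
end
```

lat 0, lon 0 lands in the centre of the map, and northern latitudes get a smaller y than southern ones because image coordinates start at the top.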

for the heat map we want to aggregate into buckets so the pixels are nice and big. finally we’ll output some simple javascript we can cut and paste into some map html

cat x_y_points | ./bucket.rb | sort | uniq -c | ./as_draw_square.rb
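
the bucket then sort | uniq -c part of that pipeline amounts to snapping each point to a grid and counting points per cell. here is a minimal ruby sketch of the same idea done in one place; the 8 pixel bucket size and the function names are assumptions, not what bucket.rb actually uses.

```ruby
# width and height of a heat map cell in pixels (assumed value)
BUCKET_SIZE = 8

# snap a point to the top left corner of the cell containing it
def bucket(x, y, size = BUCKET_SIZE)
  [(x / size).floor * size, (y / size).floor * size]
end

# count how many points fall into each cell; the result maps
# [cell_x, cell_y] to a count, like the output of sort | uniq -c
def bucket_counts(points, size = BUCKET_SIZE)
  counts = Hash.new(0)
  points.each { |x, y| counts[bucket(x, y, size)] += 1 }
  counts
end
```

the count per cell is what would drive the colour or opacity of each square drawn on the map.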

the final result is this map !

a good start. next up is to do the same over a much larger sample using hadoop streaming and pig, and then work towards an animation by aggregating on time slices