brain of mat kelcey...

e11.2 aggregating tweets by time of day

October 24, 2009 at 01:02 PM | categories: Uncategorized

for v3 let's aggregate by time of day; should make for an interesting animation

browsing the data, there are lots of other lat longs in there, not just the ones tagged iPhone: and ÜT:. there are also ones tagged with Coppó:, Pre:, etc. perhaps i should just try to take anything that looks like a lat long.

furthermore let's switch to a bigger dataset again: 4.7e6 tweets from Oct 13 07:00 thru Oct 19 17:00.

i've been streaming all my tweets (as previously discussed) and storing them in a directory, json_stream

here are the steps...

1. extract locations

use a streaming script to take a tweet in json form and emit the tweet time and location string

export HADOOP_STREAMING_JAR=$HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar
hadoop jar $HADOOP_STREAMING_JAR \
 -mapper ./extract_locations.rb -reducer /bin/cat \
 -input json_stream -output locations

sample output (4.7e6 tuples) { time, location string }

Wed Oct 14 22:01:41 +0000 2009    iPhone: -23.492420,-46.846916
Wed Oct 14 22:01:41 +0000 2009    Ottawa
Wed Oct 14 22:01:41 +0000 2009    DA HOOD
Wed Oct 14 22:01:42 +0000 2009    Earth
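
extract_locations.rb itself isn't shown here, so the following is only a rough sketch of what a mapper like it could look like. it assumes each input line is one tweet in the twitter streaming api's json form, with the time in created_at and the free-text location in user.location; the function names are made up for illustration.

```ruby
require 'json'

# Turn one tweet (a single line of json) into "time<TAB>location".
# Returns nil for lines that don't parse or that carry no location.
# ASSUMPTION: field names created_at and user.location, as in the
# twitter streaming api; location_record is a hypothetical name.
def location_record(json_line)
  tweet = JSON.parse(json_line)
  return nil unless tweet.is_a?(Hash)
  time = tweet['created_at']
  user = tweet['user']
  location = user.is_a?(Hash) ? user['location'] : nil
  return nil if time.nil? || location.nil? || location.to_s.strip.empty?
  # tabs or newlines inside the location would break the tab separated output
  "#{time}\t#{location.to_s.gsub(/\s+/, ' ').strip}"
rescue JSON::ParserError
  nil
end

# Streaming mapper loop: one record out per usable tweet in.
def run_mapper(input, output)
  input.each_line do |line|
    record = location_record(line)
    output.puts record if record
  end
end
```

to use it as a streaming mapper you'd call run_mapper($stdin, $stdout) at the bottom of the script. non-tweet lines in the stream (delete notices and the like) have no created_at and so are simply dropped.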

2. pluck lat longs from locations

make another pass and extract possible lat lons from the location strings

hadoop jar $HADOOP_STREAMING_JAR \
 -mapper ./extract_lat_longs_from_locations.rb -reducer /bin/cat \
 -input locations -output lat_lons

sample output (reduces down to 320e3 data points) { time, lat, lon }

Wed Oct 14 22:01:41 +0000 2009    -23.49242    -46.846916
Wed Oct 14 22:05:25 +0000 2009    35.670086    139.740766
Wed Oct 14 22:11:35 +0000 2009    41.37731257    -74.68153942
Wed Oct 14 22:15:18 +0000 2009    51.503212    5.478329
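
as a sketch of the "take anything that looks like a lat long" idea: the pattern below is my guess at a reasonable one, not necessarily what extract_lat_longs_from_locations.rb does. it looks for two signed decimals separated by a comma or whitespace and throws away pairs that fall outside valid lat / lon ranges. the names LAT_LON and extract_lat_lon are hypothetical.

```ruby
# Two signed decimal numbers separated by a comma and/or whitespace.
# The lookbehind stops the match starting in the middle of a longer number.
# ASSUMPTION: this pattern is illustrative, not the original script's.
LAT_LON = /(?<![\d.])(-?\d{1,3}\.\d+)\s*[,\s]\s*(-?\d{1,3}\.\d+)/

# Returns [lat, lon] as floats, or nil if nothing plausible is found.
def extract_lat_lon(location)
  match = LAT_LON.match(location)
  return nil unless match
  lat = match[1].to_f
  lon = match[2].to_f
  return nil unless lat.between?(-90.0, 90.0) && lon.between?(-180.0, 180.0)
  [lat, lon]
end
```

a pattern this loose will also happily accept pairs of small numbers that aren't coordinates at all, so expect some false positives.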

3. bucket data into timeslices and points for a map

we need to project the times into 10min slots; ie 00:05 will be slot 0, 00:12 will be slot 1.
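
the slot arithmetic is just minutes since midnight, integer divided by 10, which gives 144 slots per day. a minimal sketch, assuming the timestamps are in twitter's created_at format and that slots are counted in UTC (the original script may well do it differently); time_slot and SLOT_MINUTES are names i've made up here.

```ruby
require 'time'

SLOT_MINUTES = 10

# Map a timestamp like "Wed Oct 14 22:01:41 +0000 2009" to its slot of the day.
# ASSUMPTION: slots are based on UTC time of day.
def time_slot(timestamp)
  t = Time.strptime(timestamp, '%a %b %d %H:%M:%S %z %Y').utc
  (t.hour * 60 + t.min) / SLOT_MINUTES # integer division
end
```

so 00:05 lands in slot 0, 00:12 in slot 1 and 23:59 in slot 143.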

the same step also projects the lat lons to x and y coords (0->1) using a simple mercator projection

hadoop jar $HADOOP_STREAMING_JAR \
 -mapper ./lat_long_to_merc_and_bucket.rb -reducer /bin/cat \
 -cmdenv BUCKET_SIZE=0.005 \
 -input lat_lons -output x_y_points

sample output { timeslice, normalised x position, normalised y position }

122     0.48    0.205
122     0.295   0.26
122     0.29    0.26
123     0.265   0.265
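
for reference, here's a sketch of one common way to do the projection and bucketing. it uses the standard (web style) mercator formula normalised to 0->1, with y running from 0 at the top of the map, and clamps latitude at roughly 85 degrees. that's an assumption on my part: the exact scaling in lat_long_to_merc_and_bucket.rb isn't shown, so this won't necessarily reproduce the sample numbers above. mercator_xy, bucket and MAX_LAT are illustrative names.

```ruby
# ASSUMPTION: web mercator style latitude cut-off, so the map comes out square.
MAX_LAT = 85.0511

# Project lat / lon in degrees to [x, y], both normalised to 0->1.
# x grows eastwards from -180; y grows southwards from the top of the map.
def mercator_xy(lat, lon)
  lat = [[lat, -MAX_LAT].max, MAX_LAT].min
  x = (lon + 180.0) / 360.0
  lat_rad = lat * Math::PI / 180.0
  merc = Math.log(Math.tan(Math::PI / 4 + lat_rad / 2))
  y = 0.5 - merc / (2 * Math::PI)
  [x, y]
end

# Snap a 0->1 value down to the start of its bucket, e.g. 0.2871 -> 0.285.
# The round just tidies up floating point noise in the output.
def bucket(value, bucket_size)
  ((value / bucket_size).floor * bucket_size).round(6)
end
```

since the streaming job passes the bucket size with -cmdenv, a mapper can pick it up from the environment, e.g. ENV['BUCKET_SIZE'].to_f. a bucket size of 0.005 gives a 200 x 200 grid.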

as a slight digression before we move onto aggregating per timeslice here's a pic of all 320e3 tweets on a heatmap.

some interesting noise on the greenwich meridian, must be incorrectly identified lat lons during the ./extract_lat_longs_from_locations.rb step.

log10 tweet location (click for a hires version)

4. aggregate (x,y) pairs per timeslice

next we aggregate, per timeslice, the frequency of each (x,y) point. we'll do this with a pig script, aggregate_per_timeslice.pig

-- aggregating per timeslice
pts = load 'x_y_points/part-00000' as (timeslice:int, x:float, y:float);
pts2 = group pts by (timeslice,x,y);
pts3 = foreach pts2 generate $0, COUNT($1) ;
pts4 = foreach pts3 generate $0.$0, $0.$1, $0.$2, $1 as freq;
pts5 = order pts4 by timeslice;
store pts5 into 'aggregated_freqs';

results in the tuples in 'aggregated_freqs' { timeslice, normalised x position, normalised y position, frequency }

0    0.0    0.32    1
0    0.06    0.325    9
0    0.065    0.33    1
0    0.08    0.17    2
0    0.155    0.225    8

we need to normalise each frequency value for drawing on the map. i would have liked to do this in pig as well, but it turns out there isn't a log function in v0.3 of pig (??)

will have to do scaling when generating the images. isn't such a big deal since the dataset is quite small at this stage but was trying to use this whole thing as an excuse to learn pig :(
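
the scaling itself is tiny. a minimal sketch, assuming a log10 scale normalised against the largest frequency seen (the +1 is my own tweak so that cells with a single tweet don't scale to zero); log_scale is a hypothetical name.

```ruby
# Scale a frequency to 0->1 on a log10 scale, relative to the max frequency.
# ASSUMPTION: the +1 offset keeps freq == 1 visible; without it log10(1) is 0.
def log_scale(freq, max_freq)
  return 0.0 if freq < 1 || max_freq < 1
  Math.log10(freq + 1) / Math.log10(max_freq + 1)
end
```

with a max frequency of 9, a cell seen 9 times scales to 1.0 and a cell seen once to about 0.3.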

5. take aggregated_freqs and make 144 heat map images

use a simple script to read through the aggregated_freqs and generate a heat map for each frame

heat_maps.rb aggregated_freqs 0.005 frames
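
heat_maps.rb isn't listed here, so purely as an illustration of the idea: group the tuples by timeslice, then turn each group into a grid of log scaled intensities. to stay dependency free this sketch emits plain text PGM (greyscale) rather than a coloured heat map, which is a stand-in and not what the real script produces. frames_from and frame_to_pgm are made up names.

```ruby
# Parse "timeslice x y freq" lines into { timeslice => { [col, row] => freq } }.
def frames_from(lines, bucket_size)
  frames = Hash.new { |h, k| h[k] = Hash.new(0) }
  lines.each do |line|
    timeslice, x, y, freq = line.split
    next if freq.nil? # skip short or blank lines
    col = (x.to_f / bucket_size).round
    row = (y.to_f / bucket_size).round
    frames[timeslice.to_i][[col, row]] += freq.to_i
  end
  frames
end

# Render one frame as a plain text PGM image of size x size pixels.
# ASSUMPTION: greyscale PGM stands in for the real heat map colouring.
def frame_to_pgm(cells, size, max_freq)
  max_freq = [max_freq, 1].max # avoid dividing by log10(1) == 0
  rows = Array.new(size) { Array.new(size, 0) }
  cells.each do |(col, row), freq|
    next unless col.between?(0, size - 1) && row.between?(0, size - 1)
    rows[row][col] = (255 * Math.log10(freq + 1) / Math.log10(max_freq + 1)).round
  end
  "P2\n#{size} #{size}\n255\n" + rows.map { |r| r.join(' ') }.join("\n") + "\n"
end
```

with a bucket size of 0.005 the grid size would be 200, and max_freq would be the largest frequency across all frames so that brightness is comparable from one frame to the next.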

6. convert to animation

next bundle stills into an animation and upload to youtube

mencoder "mf://frames/*" -mf fps=25 -o rtw_tweet_v3.avi -ovc x264 -x264encopts bitrate=750

7. conclusions

- didn't really end up using hadoop's power that much; streaming jobs that use just cat as a reducer are just a parallel way of doing 1:1 string mapping
- aggregation was really easy in pig but the lack of a log function is annoying; could have written a UDF, and there probably already is one but i couldn't find it
- this visualisation came out pretty lame; funny to see how the really swish visualisations rely far more on pretty colours and smooth lines than the data itself. there are a bundle of things i could do with this one but it's time to move on to something else.