let’s rewrite v1 using hadoop tooling; the code is on github.
we’ll run hadoop in non-distributed standalone mode. in this mode everything runs in a single jvm, so it’s nice and simple to develop against.
step 1: extract the location strings from the json stream
in v1 it was
bzcat sample.bz2 | ./extract_locations.pl > locations
using the awesome hadoop streaming interface it’s not too different. this interface allows you to specify any app as the mapper or reducer. the main difference is that it works on directories, not just files.
for the mapper we’ll use exactly the same script as before, extract_locations.pl. since this job has no reduce component, we use an “identity” script, i.e. cat, as the reduce phase.
mkdir json_stream
bzcat sample.bz2 | gzip - > json_stream/input.gz # hadoop supports gzip out of the box but not bzip2 :(
export HADOOP_STREAMING_JAR=$HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar
hadoop jar $HADOOP_STREAMING_JAR \
  -mapper ./extract_locations.pl -reducer /bin/cat \
  -input json_stream -output locations
this gives us the locations in a single file, locations/part-00000
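to sanity check a mapper / reducer pair without starting hadoop at all, the streaming data flow can be roughly imitated with a plain pipeline: read every file in the input directory, pipe it through the mapper, sort, then pipe it through the reducer. this is only a sketch. simulate_streaming is a made-up helper name, not part of hadoop; it assumes a gzip whose -cdf passes uncompressed files through unchanged (gnu gzip does), and sorting whole lines only approximates hadoop’s sort by key.

```shell
#!/bin/sh
# hypothetical helper: imitate a hadoop streaming job with a local pipeline
#   input dir -> mapper -> sort (stand-in for the shuffle) -> reducer
simulate_streaming() {
    mapper=$1      # command to run as the map phase
    reducer=$2     # command to run as the reduce phase
    input_dir=$3   # directory of input files, like hadoop's -input

    # gzip -cdf decompresses .gz files and copies plain files through as-is
    # LC_ALL=C gives a byte-order sort, independent of the local locale
    gzip -cdf "$input_dir"/* | sh -c "$mapper" | LC_ALL=C sort | sh -c "$reducer"
}
```

with the names from above, this would be run as simulate_streaming ./extract_locations.pl /bin/cat json_stream > locations.local, and the result should hold the same lines as the hadoop output (assuming the mapper behaves the same outside hadoop).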