but does it scale?

running locally

remember when all that anyone would ever ask about a technology was "but does it scale?" those were the days...

hadoop is fun, but it's complex. thinking in map reduce requires a mind shift and having to handle the joins yourself is a pain in the arse. why would anyone bother? because of the promise of linear scalability for insanely large datasets. so my question is how large is insanely large?

we've been working on a (far from insanely large) corpus of 8 docs. running on my home machine (a modest quad core) it takes 9 min to run the 13 map reduce tasks required to build the sips, and for this size of data that's ludicrous. i wrote another version in plain ruby and it runs in 18 secs, and it's just ruby for god's sake (fun to write, not so fun to wait for). imagine how fast (and painful) it would be in c++ with openmp?

most of the time the cpu is not at 100%, as a lot of the time goes on framework "stuff": stopping and starting instances and the like. as an illustration consider the case of just a single document. ruby <1s, hadoop 2m 40s. bam.

so at what size do we switch over to hadoop being worth it? and when do we have to move from one machine to a cluster?

let's start with some runs using hadoop in pseudo distributed mode. i'll leave its default config for my quad core box at 2 mappers and 2 reducers.

num docs  size (gz'd)  total tokens  unique tokens  ruby time  hadoop local
1         22 kb        9e3           2e3            <1s        2m 40s
8         800 kb       408e3         27e3           18s        7m 20s
20        3 mb         1.5e6         66e3           1m 8s      16m
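
to put a number on "not so good", here's a minimal sketch that turns the timings in the table into slowdown factors. the helper names (to_secs, slowdowns) are made up for this example, and "<1s" is entered as "1s", so the first ratio is only a lower bound.

```ruby
# parse strings like "2m 40s", "18s" or "16m" into seconds
def to_secs(str)
  mins = str[/(\d+)m/, 1].to_i   # nil.to_i is 0 when there's no minutes part
  secs = str[/(\d+)s/, 1].to_i
  mins * 60 + secs
end

# runs is a list of [num docs, ruby time, hadoop time];
# returns [num docs, hadoop time / ruby time] rounded to one decimal place
def slowdowns(runs)
  runs.map do |docs, ruby, hadoop|
    [docs, (to_secs(hadoop).to_f / to_secs(ruby)).round(1)]
  end
end

# timings copied from the table; "<1s" is treated as 1s (an assumption)
runs = [
  [1,  "1s",    "2m 40s"],
  [8,  "18s",   "7m 20s"],
  [20, "1m 8s", "16m"],
]

slowdowns(runs).each do |docs, ratio|
  puts "#{docs} docs: hadoop takes #{ratio}x as long as ruby"
end
```

the factors come out at roughly 160x, 24x and 14x, so the gap is closing as the data grows, which is what you'd expect if most of the hadoop time is fixed overhead.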

so far, not so good. but to be fair we're still only running on a single machine, this stuff was made for a cluster...

running on ec2

setup

ec2 is a great way to get a bunch of machines for a bit and hadoop is well supported with scripts provided to manage an ec2 cluster.
the first thing to do is prep the data and get it into the cloud. the complete steps then are...

remove the project gutenberg header/footer with the clean text scripts; unzipped this is 2.8gb

run rake prepare_input to reformat each file as a single line; gzipped, the result is 980mb

use chunky.rb to chunk these 7,900 files into 98 files of 10mb each

upload from home machine to s3 with s3cmd

fire up ec2 instances

download from s3 to hdfs using the hadoop distcp tool

go nuts
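
the chunking step can be sketched as below. this is not chunky.rb itself, just one plausible way to do it under my own assumptions: walk the files in order and start a new chunk whenever adding the next file would push the chunk over the limit. the method name chunk_by_size and the file names and sizes are made up for illustration.

```ruby
# group [name, size_in_bytes] pairs, in order, into chunks whose total size
# stays at or under limit. a single file bigger than limit gets a chunk of
# its own rather than being split.
def chunk_by_size(files, limit)
  chunks = [[]]
  used = 0
  files.each do |name, size|
    # start a new chunk if this file won't fit in the current non-empty one
    if used + size > limit && !chunks.last.empty?
      chunks << []
      used = 0
    end
    chunks.last << name
    used += size
  end
  chunks.reject(&:empty?)
end

mb = 1024 * 1024
# hypothetical inputs, not the real corpus
files = [["a.txt", 4 * mb], ["b.txt", 5 * mb], ["c.txt", 3 * mb],
         ["d.txt", 12 * mb], ["e.txt", 1 * mb]]

chunk_by_size(files, 10 * mb).each_with_index do |names, i|
  puts "chunk #{i}: #{names.join(' ')}"
end
```

keeping the files in their original order means a chunk is just a concatenation of consecutive one-line documents, which suits the one-document-per-line format from prepare_input. it won't pack chunks as tightly as sorting by size first, but for thousands of small files the difference should be minor.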

the main pro of this approach is that it minimises the time 'wasted' between firing the ec2 instances up and having the data in hdfs. moving the data from s3 to hdfs is fast (an intracloud transfer) and distcp works in parallel.
a minor con is that the optimal file size in hdfs is 64mb, not 10mb, but i don't want to make the smallest chunk 64mb since that would make smaller scale tests more awkward.

results

so how does it perform? let's use a 10 node cluster running amazon's medium cpu instances.
these 10 nodes provide a 30/30 map/reduce capability.
since the data size is quite small i had to up mapred.map.tasks and mapred.reduce.tasks to 30/30.

#chunks  num docs  total tokens  unique tokens  ruby time  cluster time
1        56        5.6e6         75e3           4m 32s     10m 37s
2        110       11e6          98e3           8m 48s
3        177       16e6          117e3          12m 17s    17m 50s
4        233       22e6          134e3          16m 45s
5        306       27e6          150e3          20m 15s    24m 49s
6        378       33e6          162e3          24m 13s
7        450       38e6          173e3          28m (40m)  33m 12s
8        531       44e6          189e3          32m 24s
9        599       50e6          197e3          36m 49s
10       686       55e6          212e3          41m 29s    45m 22s

this is quite interesting...

extrapolation says hadoop will overtake the ruby one at a runtime of 2h 5min (310mb), though with a sample size this small such extrapolation needs to be taken with a grain of salt. roughly, hadoop is 6min of fixed overhead plus a bit under 4min per 10mb processed, while ruby is a bit over 4min per 10mb; it's only that small difference in slope that lets hadoop catch up at all.
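
for anyone wanting to redo the extrapolation, here's a minimal sketch using an ordinary least squares line through each set of timings from the table. the method names (mins, fit_line, crossover) are mine, i've used 28m rather than the bracketed 40m for the 7 chunk ruby run, and i'm assuming both runtimes really are linear in the number of chunks. the crossover is very sensitive to the small difference between the two slopes, so a different choice of points or fit will move it around a lot; don't expect this to land exactly on the figure quoted above.

```ruby
# minutes and seconds as fractional minutes
def mins(m, s = 0)
  m + s / 60.0
end

# ordinary least squares fit of y = intercept + slope * x,
# returns [intercept, slope]
def fit_line(points)
  n = points.size.to_f
  mean_x = points.sum { |x, _| x } / n
  mean_y = points.sum { |_, y| y } / n
  sxy = points.sum { |x, y| (x - mean_x) * (y - mean_y) }
  sxx = points.sum { |x, _| (x - mean_x)**2 }
  slope = sxy / sxx
  [mean_y - slope * mean_x, slope]
end

# x at which two [intercept, slope] lines meet (slopes must differ)
def crossover(a, b)
  (a[0] - b[0]) / (b[1] - a[1])
end

# [chunks, minutes] copied from the table; one chunk is 10mb
ruby_runs = [[1, mins(4, 32)], [2, mins(8, 48)], [3, mins(12, 17)],
             [4, mins(16, 45)], [5, mins(20, 15)], [6, mins(24, 13)],
             [7, mins(28)], [8, mins(32, 24)], [9, mins(36, 49)],
             [10, mins(41, 29)]]
cluster_runs = [[1, mins(10, 37)], [3, mins(17, 50)], [5, mins(24, 49)],
                [7, mins(33, 12)], [10, mins(45, 22)]]

ruby_line = fit_line(ruby_runs)
cluster_line = fit_line(cluster_runs)
chunks = crossover(cluster_line, ruby_line)

puts format("ruby:    %.2f min + %.2f min per chunk", *ruby_line)
puts format("cluster: %.2f min + %.2f min per chunk", *cluster_line)
puts format("lines cross at about %.0f chunks", chunks)
```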

so what conclusions can we draw? for me these results raise more questions than they answer.

is it that hadoop is scalable but not that performant? possible since the infrastructure cost is large. i would do some longer tests on ec2 but it's not free...
have i made a fundamental mistake regarding my map reduce implementation? are my joins terribly implemented?
could i have implemented the algorithm using fewer passes over the data?
is streaming that much slower than native java? i can't see how it would make that much of a difference
how would pig go? especially for the expensive join steps?

hmm. some more investigation is required!


sept 2009
