
but does it scale?

running locally

remember when all that anyone would ever ask about a technology was "but does it scale?" those were the days...

hadoop is fun, but it's complex. thinking in map reduce requires a mind shift and having to handle the joins yourself is a pain in the arse. why would anyone bother? because of the promise of linear scalability for insanely large datasets. so my question is: how large is insanely large?
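to make the "mind shift" concrete, here's the shape of the simplest possible streaming job, a plain word count written as a mapper/reducer pair of ruby scripts. this is just a sketch to show the style, not the actual sip-building code, and the file names are made up.

    # mapper.rb -- hadoop streaming mapper: read raw text on stdin,
    # emit one "token <tab> 1" line per token
    STDIN.each_line do |line|
      line.downcase.scan(/[a-z']+/).each { |token| puts "#{token}\t1" }
    end

    # reducer.rb -- streaming reducer: hadoop hands us the mapper output
    # sorted by key, so summing a token's counts is just a matter of
    # watching for the key to change
    current, count = nil, 0
    STDIN.each_line do |line|
      token, n = line.chomp.split("\t")
      if token != current
        puts "#{current}\t#{count}" if current
        current, count = token, 0
      end
      count += n.to_i
    end
    puts "#{current}\t#{count}" if current

outside hadoop the same pair can be sanity-checked with a plain unix pipeline: cat docs/*.txt | ruby mapper.rb | sort | ruby reducer.rb. the join pain shows up when a job needs two kinds of record in the same reducer and you have to tag and sort them yourself.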

we've been working on a (far from insanely large) corpus of 8 docs. running on my home machine (modest quadcore) it takes 9 min to run the 13 map reduce tasks required to build the sips, and for this size of data that's ludicrous. i wrote another version in plain ruby and it runs in 18 secs, and that's just ruby for god's sake (fun to write, not so fun to wait for). imagine how fast (and painful) it would be in c++ using openmp?
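for contrast, the core of a single-process version is nothing more than a hash. again this is a token-count sketch rather than the real sip code, and the corpus path is made up.

    # plain ruby: count tokens across the whole corpus in one process
    require 'zlib'

    counts = Hash.new(0)
    Dir.glob('corpus/*.gz').each do |path|
      Zlib::GzipReader.open(path) do |gz|
        gz.each_line do |line|
          line.downcase.scan(/[a-z']+/).each { |token| counts[token] += 1 }
        end
      end
    end

    # dump the 20 most common tokens
    counts.sort_by { |_, n| -n }.first(20).each { |t, n| puts "#{t}\t#{n}" }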

most of the time the cpu isn't at 100% because a lot of the run is spent on framework "stuff": stopping and starting task instances and the like. as an illustration, consider the case of a single document: ruby <1s, hadoop 2m 40s. bam.

so at what size do we switch over to hadoop being worth it? and when do we have to move from one machine to a cluster?

let's start with some runs using hadoop in pseudo-distributed mode. i'll leave its default config for my quad core box at 2 mappers and 2 reducers.

num docs | size (gz'd) | total tokens | unique tokens | ruby time | hadoop (local) time
       1 | 22 kb       | 9e3          | 2e3           | <1s       | 2m 40s
       8 | 800 kb      | 408e3        | 27e3          | 18s       | 7m 20s
      20 | 3 mb        | 1.5e6        | 66e3          | 1m 8s     | 16m

so far, not so good. but to be fair we're still only running on a single machine, this stuff was made for a cluster...

running on ec2

setup

ec2 is a great way to get a bunch of machines for a bit and hadoop is well supported with scripts provided to manage an ec2 cluster.

the first thing to do is prep the data and get it into the cloud. the complete steps then are...

  1. remove the prj gut header/footer with the clean text scripts; unzipped this is 2.8gb
  2. run rake prepare_input to reformat each file as a single line; gzipped, this comes to 980mb
  3. use chunky.rb to chunk these 7,900 files into 98 files of 10mb each (roughly as sketched after this list)
  4. upload from home machine to s3 with s3cmd
  5. fire up ec2 instances
  6. download from s3 to hdfs using the hadoop distcp tool
  7. go nuts
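step 3 is simple enough to sketch. the real chunky.rb may differ, and the directory names here are invented, but the gist is just to keep appending docs to a chunk until it's big enough.

    # pack ~7,900 small gzipped one-line-per-doc files into ~10mb gzipped
    # chunks so hdfs (and the mappers) see fewer, bigger files
    require 'zlib'

    CHUNK_BYTES = 10 * 1024 * 1024    # rough compressed size to aim for

    chunk_num = 0
    bytes_in_chunk = CHUNK_BYTES      # forces a new chunk for the first doc
    writer = nil

    Dir.glob('prepared/*.gz').sort.each do |path|
      if bytes_in_chunk >= CHUNK_BYTES
        writer.close if writer
        chunk_num += 1
        writer = Zlib::GzipWriter.open(format('chunks/chunk-%03d.gz', chunk_num))
        bytes_in_chunk = 0
      end
      # each input file is one doc on a single line; recompress it into the chunk
      doc = Zlib::GzipReader.open(path) { |gz| gz.read }
      doc << "\n" unless doc.end_with?("\n")    # keep one doc per line in the chunk
      writer.write(doc)
      bytes_in_chunk += File.size(path)         # input is already gzipped, so a fair estimate
    end
    writer.close if writer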

the main pro of this approach is that it minimises the time 'wasted' between firing up the ec2 instances and having the data in hdfs. moving the data from s3 to hdfs is fast (an intra-cloud transfer) and distcp works in parallel.
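wrapped up rake-style, steps 4 and 6 come down to something like this. the bucket and paths are placeholders, and the aws credentials for the s3n filesystem live in the hadoop config.

    # Rakefile sketch for the two transfer steps (hypothetical names)
    BUCKET = 'my-corpus-bucket'

    desc 'upload the 10mb chunks from the home machine to s3'
    task :upload do
      sh "s3cmd put chunks/*.gz s3://#{BUCKET}/chunks/"
    end

    desc 'pull the chunks from s3 into hdfs from inside the cluster'
    task :to_hdfs do
      # distcp runs as a map-only job, so the copy is spread across the nodes
      sh "hadoop distcp s3n://#{BUCKET}/chunks input"
    end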

a minor con is that the optimal file size in hdfs is 64mb (the default block size), not 10mb, but i don't want to make the chunks 64mb since that would make smaller scale tests more awkward.

results

so how does it perform? let's use a 10 node cluster running amazon's medium cpu instances. these 10 nodes provide a 30/30 map/reduce capacity. since the data size is quite small i had to up mapred.map.tasks and mapred.reduce.tasks to 30/30.
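for the record, bumping those settings was just a matter of passing them to the job. the jar and class names below are placeholders for the real sip-building job, and this form assumes the driver goes through hadoop's ToolRunner so the generic -D options get picked up.

    # hypothetical rake task for the cluster run
    task :run_cluster do
      # mapred.reduce.tasks is honoured exactly; mapred.map.tasks is only a
      # hint the framework uses when splitting such a small input
      sh 'hadoop jar sips.jar SipDriver ' \
         '-D mapred.map.tasks=30 -D mapred.reduce.tasks=30 ' \
         'input output'
    end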

#chunks | num docs | total tokens | unique tokens | ruby time | cluster time
      1 |       56 | 5.6e6        | 75e3          | 4m 32s    | 10m 37s
      2 |      110 | 11e6         | 98e3          | 8m 48s    | -
      3 |      177 | 16e6         | 117e3         | 12m 17s   | 17m 50s
      4 |      233 | 22e6         | 134e3         | 16m 45s   | -
      5 |      306 | 27e6         | 150e3         | 20m 15s   | 24m 49s
      6 |      378 | 33e6         | 162e3         | 24m 13s   | -
      7 |      450 | 38e6         | 173e3         | 28m (40m) | 33m 12s
      8 |      531 | 44e6         | 189e3         | 32m 24s   | -
      9 |      599 | 50e6         | 197e3         | 36m 49s   | -
     10 |      686 | 55e6         | 212e3         | 41m 29s   | 45m 22s

this is quite interesting...

extrapolating from these runs, hadoop overtakes the ruby version at a runtime of around 2h 5min (roughly 310mb of data), though with a sample size this small such extrapolation needs to be taken with a grain of salt. roughly speaking, hadoop costs about 6min of fixed overhead plus 4min per 10mb processed, while ruby costs about 4min per 10mb with almost no overhead; hadoop's per-10mb rate is in fact slightly lower, which is what eventually lets it win.
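for the curious, the extrapolation is nothing fancier than fitting a straight line through each time column (using only the rows that have a cluster time, and the 28m figure for the 7-chunk ruby run) and solving for where the two lines cross. something like:

    # least-squares fit of (chunks, seconds) for each column of the table above,
    # then the intersection of the two lines. the exact crossover moves around a
    # little depending on which points go into the fit, but it stays in the
    # couple-of-hours / few-hundred-mb region quoted above.
    def fit(points)
      n   = points.size.to_f
      sx  = points.inject(0) { |s, (x, _)| s + x }
      sy  = points.inject(0) { |s, (_, y)| s + y }
      sxx = points.inject(0) { |s, (x, _)| s + x * x }
      sxy = points.inject(0) { |s, (x, y)| s + x * y }
      slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
      [slope, (sy - slope * sx) / n]          # seconds per chunk, fixed overhead
    end

    ruby_pts   = [[1, 272], [3, 737], [5, 1215], [7, 1680], [10, 2489]]
    hadoop_pts = [[1, 637], [3, 1070], [5, 1489], [7, 1992], [10, 2722]]

    rs, ri = fit(ruby_pts)
    hs, hi = fit(hadoop_pts)
    x = (hi - ri) / (rs - hs)                 # chunks at the crossover
    printf("crossover at ~%.0f chunks (~%.0fmb), ~%.0f min\n", x, x * 10, (ri + rs * x) / 60)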

so what conclusions can we draw? for me these results raise more questions than they answer.

hmm. some more investigation is required!


sept 2009