<< take 3: markov chains index
remember when all that anyone would every ask about a technology was "but does it scale?" those were the days...
hadoop is fun, but it's complex. thinking in map reduce requires a mind shift and having to handle the joins yourself is a pain in the arse. why would anyone bother? because of the promise of linear scalability for insanely large datasets. so my question is how large is insanely large?
we've been working on a (far from insanely large) corpus of 8 docs. running on my home machine (modest quadcore) it takes 9 min to run the 13 map reduce tasks required to build the sips and for this size of data that's ludicrous. i wrote another version in plain ruby and it runs in 18 secs and it's just ruby for god sakes (fun to write, not so fun to wait for). imagine how fast (and painful) it would be with c++ using open-mp?
most of the time the cpu is not at 100% as a lot of the time is framework "stuff"; stopping and starting instances and the like. as an illustration consider ths case of just a single document. ruby <1s, hadoop 2m 40s. bam.
so at what size do we switch over to hadoop being worth it? and when do we have to move from one machine to a cluster?
lets start with some runs using hadoop in pseudo distributed mode, i'll leave it's default config for my quad core box at 2 mappers and reducers.
|num docs||size (gz'd)||total tokens||unique tokens||ruby time||hadoop local|
|1||22 kb||9e3||2e3||<1s||2m 40s|
|8||800 kb||408e3||27e3||18s||7m 20s|
|20||3 mb||1.5e6||66e3||1m 8s||16m|
|#chunks||num docs||total tokens||unique tokens||ruby time||cluster time|
|1||56||5.6e6||75e3||4m 32s||10m 37s|
|3||177||16e6||117e3||12m 17s||17m 50s|
|5||306||27e6||150e3||20m 15s||24m 49s|
|7||450||38e6||173e3||28m (40m)||33m 12s|
|10||686||55e6||212e3||41m 29s||45m 22s|
this is quite interesting...
extrapolation says hadoop will overtake the ruby one at a runtime of 2h 5min (310mb). though with a sample size this small though such extrapolation needs to be taken with a grain of salt. so hadoop is roughly 6min + 4min per 10mb processed, ruby is 4min per 10mb.
so what conclusions can we draw? for me these results raises more questions than they answer.
hmm. some more investigation is required!