index    the base algorithm  >>

the question

i had an interview questions once which was "how would you find the median of a trillion integers spread across a thousand machines"

though i managed to bumble my way to an answer that was generally on the right track i've never gone through my solution in detail,
so here goes....

firstly; what's a median?
informally the median of a list of elements is the "middle" element after the list is sorted.

eg for a sorted list with 11 elements [1 1 2 3 3 4 5 5 5 6 9]
the median of this list is the value in the "middle" with 5 elements on either side
ie. the value 4

so what makes this question about medians worthy of an interview question?

doing some back of envelope calculations we see that a trillion (1012) ints would consume, at 4 bytes per int,
4x1012 bytes = 4x109 kb = 4x106 mb = 4x103 gb.
spread over 1,000 machines thats 4gb worth of ints per machine.

not so easy to just sort...

index    the base algorithm  >>

nov 2008