i had an interview questions once which was "how would you find the median of a trillion integers spread across a thousand machines"
though i managed to bumble my way to an answer that was generally on the right track i've never gone through my solution in detail, so here goes....
firstly; what's a median? informally the median of a list of elements is the "middle" element after the list is sorted.
eg for a sorted list with 11 elements [1 1 2 3 3 4 5 5 5 6 9] the median of this list is the value in the "middle" with 5 elements on either side ie. the value 4
so what makes this question about medians worthy of an interview question?
doing some back of envelope calculations we see that a trillion (1012) ints would consume, at 4 bytes per int, 4x1012 bytes = 4x109 kb = 4x106 mb = 4x103 gb. spread over 1,000 machines thats 4gb worth of ints per machine.
not so easy to just sort...
nov 2008