<< single process erlang implementation index performance comparisons >>
single [1,2,3,4,5,6,7,8,9] multi [[1,2,3,4],[5,6],[7,8,9]]
recall there a number of operations required in distributed case
includes things like determining total number of elements, determining minimum value, etc
each of these operations can be done against each sub list individually with the results aggregated
eg to determine total number of elements
single -> length( [1,2,3,4,5,6,7,8,9] ) = 9 multi -> sum ( length([1,2,3,4]), length([5,6]), length([7,8,9]) ) = sum([4,2,3]) = 9
eg to determine minimum value
single -> min( [1,2,3,4,5,6,7,8,9] ) = 1 multi -> min( min([1,2,3,4]), min([5,6]), min([7,8,9]) ) = min([1,5,7]) = 1
in the single list case we pick the pivot as the first value.
in the multi list case we pick the pivot as the first value of the first list.
recall in the algorithm that rotation is sometimes required.
this is to ensure all potential values for a pivot are explored.
so in the multi list case the rotation needs to operate at two levels; rotate the first list and then rotate the list of lists
before [[1,2,3],[4,5,6],[7,8,9]] after [[4,5,6],[7,8,9],[2,3,1]]
the algorithm needs to reject values less than or greater than a particular value.
in the single list case it's never the case that the list becomes empty though this is possible in the multi list case.
consider the single list case for
[3,1,2,2,2,4,5] pivot 3, num less than 3 = 4, so 3 is 5rd order stat, discard all over 4 result [3,1,2,2,2]
now consider the multi list case for
[[3,1,2],[2,2],[4,5]] pivot 3, num less than 3 = 4, so 3 is 5rd order stat, discard all over 4 result [[3,1,2],[2,2],[]]in this case we can exclude this empty list from further processing
the multi process erlang version splits the work across two modules
a controller orchestrates the process
processing of lists is done by a number of spawned workers
if a worker ends up with no elements is it terminated
try it out with
bash> ./generate_test_data.rb 0 314 1000 10e3 | ./spread_across_files.rb numbers 4 bash> erl -sname bob -noshell -run controller init worker numbers.[0-3] 314
in fact if you have two (or more) machines, you can try it distributed style!
bash> erl -sname worker -noshell -setcookie 123
bash> erl -sname master -noshell -setcookie 123 -run controller init worker numbers.[0-3]
there is another optimisation we can make if we make an assumption about the data.
say we have a list of N ints where for each value e; 1 <= e <= M.
the list can be represented in two ways...
the N elements directly
[e1, e2, ... ,en]or a lookup take of the M distinct values and their frequencies
[{1,f1}, {2,f2}, ... , {M,fm}]the first case uses N ints, the second case uses 2M ints.
in the case that N >>> M we can make huge savings, in both time and space, by using the second.
eg consider a trillion ints with distinct values 1 to a billion
storing the list uses 1e12 ints
storing the lookup when all values are represented uses 2e9 ints
500 times less
this requires some minor mods to the worker module. we'll call it worker_freq
it also requires a different input format, instead of parsing the numbers directly (eg numbers.0)
we can convert to a erlang binary format of the dictionary (eg numbers.0.dict) that can be read by the workers
we use generate_binary_dicts to do this
try it out with
bash> ./generate_test_data.rb 0 314 1000 10e3 | ./spread_across_files.rb numbers 4 bash> erl -noshell -run generate_binary_dicts main numbers.[0-3] bash> erl -sname bob -noshell -run controller init worker_freq numbers.[0-3].dict 314
<< single process erlang implementation index performance comparisons >>
nov 2008