<< performance comparisons index conclusion >>
the original spec was a trillion ints (1e12) over a thousand (1e3) machines, which is equivalent to 1e9 per machine. but let's say 3e9 over 3 machines. to prove the concurrency aspects we need some big boxes to run against, so let's rent some from amazon ec2!
using 3 of the high-cpu extra large instances (8 virtual cores each), we'll need the data spread across 24 files.
bash> time ./generate_test_data.rb 1 3141592 10e6 3e9 | split -l 125000000 -d

this is taking too much space. each of the 24 files is 930mb, or 400mb compressed at the highest level (a test showed bzip2 gave slightly better compression than lzma). that's almost 10gb compressed, and i'm not going to scp that to amazon.
could we just generate the data on the box? maybe. the app is taking 10min per file and i'd prefer not to waste the rental time generating data... perhaps storing these ints as strings isn't going to cut it for this much data.
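some rough arithmetic backs up that hunch. the 8-bytes-per-value figure below is my assumption: values range up to 10e6, so mostly 7 digits plus a newline.

```ruby
# back-of-envelope size of the ints-as-strings encoding.
# assumption: values in 1..10e6 are mostly 7 digits, plus a newline.
bytes_per_value = 8
values          = 3e9
files           = 24

total_gb = values * bytes_per_value / 1e9
puts total_gb                 # ~24 gb raw in total
puts total_gb * 1000 / files  # ~1000 mb per file, in line with the 930mb seen
```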
let's work through an example of 32 bytes
bash> hexdump -C sample_test_data
00000000  31 32 34 31 38 37 30 0a  31 34 38 37 36 35 38 0a  |1241870.1487658.|
00000010  31 30 30 35 30 34 37 0a  33 31 34 30 30 35 32 0a  |1005047.3140052.|
how much smaller would it be in a lower-level erlang format? let's dump it and see what we get
erl> {ok,F} = file:open("test2",write).
erl> B = term_to_binary(parse_file:to_list("sample_test_data")).
erl> file:write(F, B).
erl> file:close(F).
slightly better at 27 bytes (a version byte, a 5-byte list header, 5 bytes per integer, and a trailing nil), but hardly worth the effort
bash> hexdump -C test2
00000000  83 6c 00 00 00 04 62 00  12 f3 0e 62 00 16 b3 2a  |.l....b....b...*|
00000010  62 00 0f 55 f7 62 00 2f  e9 d4 6a                 |b..U.b./..j|

i think we're going to have to generate the dict format (ie {value,freq} pairs) from the start. it's worth noting that with the dict format it doesn't really make much difference how many elements there are; it's the range of elements that dictates the size.
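a quick ruby sketch of that point (`freqs` here is my hypothetical stand-in for parse_file:to_dict, not the real code):

```ruby
# build {value => freq} pairs; the dict never has more keys than
# the range of values, however many ints we pour into it
def freqs(ints)
  ints.each_with_object(Hash.new(0)) { |i, h| h[i] += 1 }
end

range        = 100
ten_thousand = freqs(Array.new(10_000)    { rand(range) + 1 })
a_million    = freqs(Array.new(1_000_000) { rand(range) + 1 })

puts ten_thousand.size  # at most 100 entries
puts a_million.size     # still at most 100: 100x the data, same size dict
```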
let's downgrade to 100e6 total with a range of 1 to 100e3
bash> ./generate_test_data.rb 1 31415 100e3 100e6 > test.100e6
bash> cat test.100e6 | split -d -l 4166714

a 556mb file total; 24 files of 24mb each. each 24mb file reduces to 1.3mb through parse_file:to_dict, and further to 300kb after bzipping. that's about 7mb total, much more reasonable
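the arithmetic behind those numbers, as i make it out (the 12-byte figure is my rough estimate of one {value,freq} pair in the external term format, not a measured value):

```ruby
lines = 100e6
files = 24
puts (lines / files).ceil  # 4166667 lines per file, hence split -l 4166714

range          = 100e3     # distinct values, so at most 100e3 dict entries
bytes_per_pair = 12        # assumed cost of one {value,freq} tuple on disk
puts range * bytes_per_pair / 1e6  # ~1.2 mb per dict, near the 1.3mb seen

puts 300 * files / 1024.0  # 24 files of ~300kb bzipped is ~7mb total
```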
ec2 is very well documented, follow the getting started guide if you want to have a crack.
i wanted to use one of the newer high-cpu instances (8 cores, 7gb ram) and found some notes in the dev guide recommending a 2.6.18 xen-compatible kernel. grepping the output of ec2-describe-images -a turns up such a kernel, and a machine image that uses it
IMAGE aki-9800e5f1 ec2-public-images/vmlinuz-2.6.18-xenU-ec2-v1.0.x86_64.aki.manifest.xml
IMAGE ami-332cc85a gentoo-c1.xlarge-nocona-1223748124/image.manifest.xml x86_64 machine aki-9800e5f1

might as well use this image (and i love gentoo for servers, how could i NOT use it)
so let's boot one up
mats> ec2run ami-332cc85a -k gsg-keypair -t c1.xlarge

then get on it and get an erlang environment installed! don't forget to emerge with smp enabled (i forgot the first time); a single-threaded erlang vm might be a bit underwhelming on an 8 proc box...
mats> ssh -i ~/dev/ec2/id_rsa-gsg-keypair root@ec2-67-202-25-26.compute-1.amazonaws.com
ec2> USE="smp" emerge erlang
a quick check to see the code is ok
ec2> erl -noshell -sname i1 -setcookie 123 -run controller init worker_freq data/x00.dict

all good. in fact, a bit too good. it ran super fast, and running all the data files x00.dict to x23.dict is done in a few seconds. it's hardly worth running up another instance at all. since the code works across multiple machines (i tested it between my main dev box and the pvr), i think i might wrap this experiment up and not bother firing up another instance.
mats> ec2-terminate-instances i-6534820c
nov 2008