brain of mat kelcey...


xargs parallel execution

November 06, 2009 at 09:57 PM | categories: Uncategorized

just recently discovered xargs has a parallelise option!

i have 20 files, sample.01.gz to sample.20.gz, each ~100mb in size that i need to run a script over

one option is

zcat sample*gz | ./script.rb > output
but this will process the files sequentially on a single core.

to get some parallel action going i could generate a temp script that produces

zcat sample.01.gz | ./script.rb > sample.01.out &
zcat sample.02.gz | ./script.rb > sample.02.out &
...
zcat sample.20.gz | ./script.rb > sample.20.out &
and run that, but this will have all 20 running at the same time, contending for cpu and disk

(though with only 20 files this might not be a problem)
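the launch-everything-at-once approach can also be sketched as a loop instead of a generated script. this is a minimal sketch only: the three tiny sample files are made up for the demo, `tr` is a stand-in for ./script.rb, and `gzip -dc` is used in place of zcat (it does the same thing).

```shell
#!/bin/sh
set -eu
cd "$(mktemp -d)"

# 3 tiny stand-in samples (the real case is 20 files of ~100mb each)
for i in 01 02 03; do
  printf 'sample %s\n' "$i" | gzip > "sample.$i.gz"
done

# launch one background pipeline per file; nothing limits how many run at once
# (tr is a hypothetical stand-in for ./script.rb)
for f in sample.*.gz; do
  gzip -dc "$f" | tr a-z A-Z > "${f%.gz}.out" &
done

# block until every background job has finished
wait

cat sample.*.out
```

the `wait` matters: without it the script carries on (or exits) while the pipelines are still writing their .out files.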

instead i can make a temp script, parse.sh

zcat "$1" | ./script.rb > "$1.out"
and run
find sample*gz | xargs -n1 -P4 sh parse.sh
cat *out > output
what is this xargs command doing?
  • -n1 passes one arg at a time to the command being run (instead of the xargs default of passing as many args as fit on one command line)
  • -P4 says have at most 4 commands running at the same time
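to see the two flags working together without a separate parse.sh, here's a minimal self-contained sketch. assumptions for the demo: five tiny made-up sample files, `tr` standing in for ./script.rb, `gzip -dc` in place of zcat, and the body of parse.sh passed inline via `sh -c` (the `_` just fills `$0` so the file name lands in `$1`).

```shell
#!/bin/sh
set -eu
cd "$(mktemp -d)"

# 5 tiny stand-in samples, sample.01.gz .. sample.05.gz
for i in 01 02 03 04 05; do
  printf 'line from %s\n' "$i" | gzip > "sample.$i.gz"
done

# -n1: one file name per invocation; -P4: at most 4 invocations running at once
# (tr is a hypothetical stand-in for ./script.rb)
ls sample.*.gz | xargs -n1 -P4 sh -c 'gzip -dc "$1" | tr a-z A-Z > "$1.out"' _

# the glob expands in sorted order, so output ends up in file order
cat sample.*.out > output
cat output
```

each job writes its own .out file, so the parallel jobs never interleave their output; the order only gets fixed at the final cat.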
the result: 100% on all cores (and only because the disk can keep up)

awesome!