xargs parallel execution

just recently discovered xargs has a parallelise option!

i have 20 files, sample.01.gz to sample.20.gz, each ~100MB in size, that i need to run a script over

one option is

zcat sample*gz | ./script.rb > output

but this will process the files sequentially on a single core.

to get some parallel action going i could generate a temp script that produces

zcat sample.01.gz | ./script.rb > sample.01.out &
zcat sample.02.gz | ./script.rb > sample.02.out &
...
zcat sample.20.gz | ./script.rb > sample.20.out &

and run that but this will have all 20 running at the same time and produce contention

(though with only 20 files this might not be a problem)
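the same all-at-once idea can be sketched as a loop instead of a generated script. this is only a sketch under assumptions: three tiny generated files stand in for the real samples, `tr a-z A-Z` stands in for ./script.rb, and `gzip -dc` is used as the portable spelling of zcat. the `wait` is the bit that's easy to forget; without it nothing says the jobs have finished.

```shell
# stand-in data; "tr a-z A-Z" stands in for ./script.rb (both hypothetical)
workdir=$(mktemp -d)
cd "$workdir" || exit 1
for i in 01 02 03; do
  printf 'line from %s\n' "$i" | gzip > "sample.$i.gz"
done

# launch one background job per file, all at once
for f in sample.*.gz; do
  gzip -dc "$f" | tr a-z A-Z > "${f%.gz}.out" &
done

# block until every background job has finished
wait

ls sample.*.out
```

nothing here limits how many jobs run at once, which is exactly the contention problem.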

instead i can make a temp script, parse.sh

zcat "$1" | ./script.rb > "$1.out"

and run

find sample*gz | xargs -n1 -P4 sh parse.sh
cat sample*.out > output

what is this xargs command doing?

  • -n1 passes one arg at a time to the command being run (instead of the xargs default of packing as many args as fit into a single invocation)
  • -P4 says have at most 4 commands running at the same time
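a toy way to see both flags in action. echo and sleep are just stand-ins here (not part of the job above), and -P is assumed to be available, which it is in GNU and BSD xargs.

```shell
# -n1: one argument per invocation, so echo runs three times
one_at_a_time=$(printf '%s\n' a b c | xargs -n1 echo got)
printf '%s\n' "$one_at_a_time"

# no -n: xargs packs all three arguments into a single echo
all_at_once=$(printf '%s\n' a b c | xargs echo got)
printf '%s\n' "$all_at_once"

# -P4: four "sleep 1" jobs run side by side instead of back to back
start=$(date +%s)
printf '%s\n' 1 1 1 1 | xargs -n1 -P4 sleep
end=$(date +%s)
echo "elapsed: $((end - start))s"
```

the first echo prints three lines (got a, got b, got c), the second prints one (got a b c), and the four sleeps should take roughly one second in total rather than four.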

the result: 100% on all cores (though only because the disk can keep up)

awesome!
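if a separate parse.sh feels like overkill, the same pipeline can probably be inlined with sh -c. a minimal sketch, with assumptions: three tiny generated files stand in for the real samples, `tr a-z A-Z` stands in for ./script.rb, and `gzip -dc` is used as the portable spelling of zcat.

```shell
# stand-in data; "tr a-z A-Z" stands in for ./script.rb (both hypothetical)
workdir=$(mktemp -d)
cd "$workdir" || exit 1
for i in 01 02 03; do
  printf 'line from %s\n' "$i" | gzip > "sample.$i.gz"
done

# same pipeline as parse.sh, inlined; "_" fills $0 so the file name lands in $1
printf '%s\n' sample.*.gz |
  xargs -n1 -P4 sh -c 'gzip -dc "$1" | tr a-z A-Z > "$1.out"' _

# the glob expands in sorted order, so the combined file has a stable order
cat sample.*.gz.out > output
cat output
```

each job still writes to its own .out file, so output from different jobs never gets interleaved.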

2 Responses to “xargs parallel execution”

  1. Ole Tange says:

    As you have discovered you need to make temporary files to avoid mixing xargs’ output for different jobs.

    Parallel https://savannah.nongnu.org/projects/parallel/ does not suffer from this:

    ls sample*gz | parallel -j+0 -k 'zcat {} | ./script.rb' > output

    -k makes sure the output is in the same order as input.

    -j+0 runs number_of_cores jobs in parallel.

  2. matpalm says:

    thanks ole,
    had never heard of parallel before
    will have to add it to my must-install list
    mat
