just recently discovered xargs has a parallelise option!
i have 20 files, sample.01.gz to sample.20.gz, each ~100mb in size that i need to run a script over
one option is
zcat sample*gz | ./script.rb > output
but this will process the files sequentially on a single core.
to get some parallel action going i could generate a temp script that produces
zcat sample.01.gz | ./script.rb > sample.01.out & zcat sample.02.gz | ./script.rb > sample.02.out & ... zcat sample.20.gz | ./script.rb > sample.20.out &
and run that but this will have all 20 running at the same time and produce contention
(though with only 20 files this might not be a problem)
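the "everything in the background" idea can also be written as a loop with & and wait instead of a generated script. this is only a rough sketch under assumptions: the scratch dir bg_demo, the three tiny generated inputs, and tr standing in for ./script.rb are all made up for illustration, and gzip -dc is used as the portable spelling of zcat.

```shell
set -eu
demo=./bg_demo                      # hypothetical scratch directory
mkdir -p "$demo"

# build three small gzipped inputs so the sketch is self-contained
for i in 01 02 03; do
  printf 'row %s\n' "$i" | gzip > "$demo/sample.$i.gz"
done

# start one background pipeline per file; nothing limits how many run at once
for f in "$demo"/sample.*.gz; do
  # tr is a stand-in for ./script.rb
  gzip -dc "$f" | tr '[:lower:]' '[:upper:]' > "${f%.gz}.out" &
done

# block until every background job has finished before reading the outputs
wait
cat "$demo"/sample.*.out > "$demo/output"
cat "$demo/output"
```

the wait matters: without it the final cat can run while some of the .out files are still being written.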
instead i can make a temp script, parse.sh
zcat "$1" | ./script.rb > "$1.out"
and run
find sample*gz | xargs -n1 -P4 sh parse.sh
cat *out > output
what is this xargs command doing?
- -n1 passes one arg at a time to the command being run (instead of the xargs default of packing as many args as fit onto one command line)
- -P4 says have at most 4 commands running at the same time
100% on all cores (and only because the disk can keep up)
awesome!
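the same xargs pattern can be tried end to end without a separate parse.sh by handing xargs an inline sh -c script. this is a sketch under assumptions, not the setup from the post: the scratch dir xargs_demo, the four tiny generated inputs, -P2 instead of -P4, and tr standing in for ./script.rb are all hypothetical, and gzip -dc is used as the portable spelling of zcat.

```shell
set -eu
demo=./xargs_demo                   # hypothetical scratch directory
mkdir -p "$demo"

# build four small gzipped inputs so the sketch is self-contained
for i in 01 02 03 04; do
  printf 'line from sample %s\n' "$i" | gzip > "$demo/sample.$i.gz"
done

# -n1: one file name per invocation; -P2: at most two invocations at once.
# the trailing "sh" fills $0 of the inline script, so the file name lands in $1.
# tr is a stand-in for ./script.rb
printf '%s\n' "$demo"/sample.*.gz |
  xargs -n1 -P2 sh -c 'gzip -dc "$1" | tr "[:lower:]" "[:upper:]" > "$1.out"' sh

# every job wrote its own file, so concatenating in glob order is deterministic
cat "$demo"/sample.*.gz.out > "$demo/output"
cat "$demo/output"
```

writing one .out file per input and concatenating afterwards is what keeps the results from different jobs from getting mixed together.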
As you have discovered you need to make temporary files to avoid mixing xargs’ output for different jobs.
Parallel https://savannah.nongnu.org/projects/parallel/ does not suffer from this:
ls sample*gz | parallel -j+0 -k 'zcat {} | ./script.rb' > output
-k makes sure the output is in the same order as input.
-j+0 runs number_of_cores jobs in parallel.
thanks ole,
had never heard of parallel before
will have to add it to my must-install list
mat