when working with larger datasets (i.e. more than can fit in memory) there are two important resources to juggle:
- cpu: how quickly you can process the data.
- disk io: how quickly you can get the data to the cpu.
i remember reading once that, depending on your situation, you might be better off keeping your data compressed on disk. why? because the extra cpu time spent decompressing it can be worth the time saved pulling fewer bytes off disk.
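the trade-off is easy to sketch. this is a hypothetical illustration (file names and the repetitive sample data are made up, not from the app described below) using python's stdlib `gzip`: the same data is written raw and compressed, and the compressed copy is streamed back transparently — the consumer pays some cpu to decompress, but reads far fewer bytes from disk.

```python
import gzip
import os
import tempfile

# hypothetical sample: highly repetitive records compress well,
# much like many real number-crunching input files.
data = b"some,repetitive,record\n" * 200_000

tmpdir = tempfile.mkdtemp()
raw_path = os.path.join(tmpdir, "data.csv")
gz_path = os.path.join(tmpdir, "data.csv.gz")

# write the same payload uncompressed and gzip-compressed.
with open(raw_path, "wb") as f:
    f.write(data)
with gzip.open(gz_path, "wb") as f:
    f.write(data)

raw_size = os.path.getsize(raw_path)
gz_size = os.path.getsize(gz_path)
print(f"raw: {raw_size} bytes, gzip: {gz_size} bytes")

# reading the compressed copy looks the same to the consumer;
# decompression happens inside the file object.
with gzip.open(gz_path, "rb") as f:
    restored = f.read()
assert restored == data
```

whether this wins overall depends on how compressible your data is and how fast your disks are relative to your cpus — the only way to know is to time both, which is what the experiment below does.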
i’ve recently been working with a number-crunching app (it pegs all four cores of a quad-core machine at 100% for about an hour over a 7.2gb working dataset) and thought it’d be a good chance to test this theory.
quite surprisingly it actually worked; the 7.2gb dataset came down to 1.3gb and the runtime was reduced from 1hr 5m to 56m. cool.