brain of mat kelcey
how using compressed data can make your app faster
June 28, 2009 at 11:32 AM | categories: gzip, big data, sys admin
when working with larger data sets (ie more than can fit in memory) there are two important resources to juggle…

- cpu: how quickly can you process the data.
- disk io: how quickly can you get the data to the cpu.

i remember reading once that, depending on your situation, you might be better off keeping your data compressed on disk. why? because the extra cpu time spent decompressing it can be worth the time saved getting it off disk (see the sketch below).

i’ve recently been working with a number crunching app (it burns 100% cpu of a quadcore machine for an hour over a 7gb working dataset) and thought...
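the idea is easy to sanity check. here's a minimal python sketch (working_set.csv and working_set.csv.gz are made-up file names standing in for your data) that times streaming through a plain file versus a gzipped copy of the same data:

```python
# time how long it takes to stream every line of a file, plain vs gzipped.
# working_set.csv / working_set.csv.gz are hypothetical stand-ins for real data.
import gzip
import time

def time_read(open_fn, path):
    start = time.time()
    with open_fn(path) as f:
        for line in f:  # stream line by line, as a scanning app would
            pass
    return time.time() - start

plain = time_read(open, 'working_set.csv')           # uncompressed on disk
zipped = time_read(gzip.open, 'working_set.csv.gz')  # same data, gzipped

print('plain read took %.1fs' % plain)
print('gzip read took  %.1fs' % zipped)
```

note that for a fair comparison you'd want to drop the os page cache between runs (on linux, `echo 3 > /proc/sys/vm/drop_caches` as root) so the second read isn't served straight from memory. if the read is io bound you'd expect the gzipped version to win, since text data often compresses several fold and so needs far fewer bytes pulled off disk.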