brain of mat kelcey
fastmap and the jaccard distance
October 31, 2008 at 11:31 AM | categories: algorithms, deduplication, c++
    
  given a set of pairwise distances how do you determine what points correspond to those distances? my latest experiment considers this problem in relation to jaccard distances, a resemblance measure similar to the jaccard coefficients used in a previous experiment. by using the fastmap algorithm we get points from distances and once you have points you have visualisation!...
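  a rough sketch of the fastmap idea, not the code from the experiment: pick two far apart pivot objects a and b, project every object onto the line through them using x_i = (d(a,i)^2 + d(a,b)^2 - d(b,i)^2) / (2 d(a,b)), subtract that axis from the squared distances and repeat for the next coordinate. the names `Matrix`, `farthest` and `fastmap` are made up for illustration, and the pivot choice (start at object 0, hop to the farthest object twice) is just one simple heuristic. distances that are not exactly euclidean can leave negative residuals, so the sketch clamps them to zero.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // hypothetical alias

// index of the object farthest from `from` under the squared distances d2
static std::size_t farthest(const Matrix& d2, std::size_t from) {
  std::size_t best = from;
  for (std::size_t i = 0; i < d2.size(); ++i)
    if (d2[from][i] > d2[from][best]) best = i;
  return best;
}

// takes an n x n symmetric distance matrix (e.g. jaccard distances)
// and returns n points with k coordinates each
Matrix fastmap(const Matrix& dist, std::size_t k) {
  const std::size_t n = dist.size();
  Matrix points(n, std::vector<double>(k, 0.0));
  if (n == 0) return points;

  // work on squared distances so each axis can be subtracted directly
  Matrix d2(n, std::vector<double>(n, 0.0));
  for (std::size_t i = 0; i < n; ++i) {
    assert(dist[i].size() == n);  // matrix must be square
    for (std::size_t j = 0; j < n; ++j) d2[i][j] = dist[i][j] * dist[i][j];
  }

  for (std::size_t axis = 0; axis < k; ++axis) {
    // pivot heuristic: farthest from object 0, then farthest from that
    const std::size_t b = farthest(d2, 0);
    const std::size_t a = farthest(d2, b);
    const double dab2 = d2[a][b];
    if (dab2 <= 1e-12) break;  // nothing left to explain, rest stay 0
    const double dab = std::sqrt(dab2);

    // cosine law projection onto the line through a and b
    for (std::size_t i = 0; i < n; ++i)
      points[i][axis] = (d2[a][i] + dab2 - d2[b][i]) / (2.0 * dab);

    // residual squared distances in the remaining dimensions
    for (std::size_t i = 0; i < n; ++i)
      for (std::size_t j = 0; j < n; ++j) {
        const double diff = points[i][axis] - points[j][axis];
        d2[i][j] = std::max(0.0, d2[i][j] - diff * diff);
      }
  }
  return points;
}
```

  calling `fastmap(dist, 2)` on a matrix of pairwise distances gives one (x, y) pair per object, which is enough to throw at a scatter plot.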
  
openmp = easy multi threading
October 13, 2008 at 11:30 AM | categories: openmp, multicore, c++
    
  openmp is a set of compiler directives plus a runtime library, available in gcc since v4.2, for giving hints to a compiler about where code can be parallelized.

  say we have some code

    for(int i=0; i<HUGE_NUMBER; ++i)
      deadHardCalculation(i);

  we can make this run on multiple threads by simply adding some pragmas

    #pragma omp parallel num_threads(4)
    {
      #pragma omp for
      for(int i=0; i<HUGE_NUMBER; ++i)
        deadHardCalculation(i);
    }

  compiling with -fopenmp will generate an app that splits the work of the for loop across 4 threads.

  there’s support for dynamic / static scheduling, accumulators (reductions), all sorts

  this tute is awesome.

  it increased the speed of my shingling code by 350% on a quad...
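  a small sketch of the scheduling and accumulator support mentioned above, assuming the per-item work returns a number we want to sum. `sumAll` and the body of `deadHardCalculation` are hypothetical stand-ins, not the shingling code. the `reduction(+:total)` clause gives each thread a private copy of `total` and adds the copies together at the end, and `schedule(dynamic, 1024)` hands out the loop in chunks of 1024 iterations as threads become free.

```cpp
#include <cassert>
#include <cstdint>

// hypothetical stand-in for an expensive per-item calculation
static std::uint64_t deadHardCalculation(int i) {
  const std::uint64_t x = static_cast<std::uint64_t>(i);
  return (x * x) % 7;
}

// sums the per-item results across 4 threads
std::uint64_t sumAll(int n) {
  assert(n >= 0);
  std::uint64_t total = 0;
  // each thread accumulates into its own copy of total,
  // openmp combines the copies with + when the loop finishes
  #pragma omp parallel for num_threads(4) schedule(dynamic, 1024) reduction(+:total)
  for (int i = 0; i < n; ++i)
    total += deadHardCalculation(i);
  return total;
}
```

  built without -fopenmp the pragma is ignored and the loop runs serially with the same result, which makes it easy to check the parallel version against the plain one.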
  
shingling and the jaccard index
October 06, 2008 at 11:30 AM | categories: ruby, algorithms, deduplication, c++
    
  on the weekend i did another experiment using shingling and the jaccard index to try to determine if two sets of data were “duplicates”. it works quite well and includes a ruby and c++ version with low level bit operations.
  project page is www.matpalm.com/resemblance
  code at github.com/matpalm/resemblance...
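  for a feel of the technique, here is a minimal sketch using plain `std::set` rather than the bit operations of the project's c++ version. it assumes character n-gram shingles, and the function names `shingles` and `jaccardIndex` are made up for illustration. the jaccard index is the size of the intersection of the two shingle sets divided by the size of their union, and the jaccard distance is one minus that.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// all distinct substrings of length n (character shingles) in text
std::set<std::string> shingles(const std::string& text, std::size_t n) {
  assert(n > 0);
  std::set<std::string> out;
  for (std::size_t i = 0; i + n <= text.size(); ++i)
    out.insert(text.substr(i, n));
  return out;
}

// |a intersect b| / |a union b|, in the range [0, 1]
double jaccardIndex(const std::set<std::string>& a,
                    const std::set<std::string>& b) {
  if (a.empty() && b.empty()) return 1.0;  // convention: two empty sets match
  std::size_t common = 0;
  for (const auto& s : a) common += b.count(s);
  const std::size_t unionSize = a.size() + b.size() - common;
  return static_cast<double>(common) / static_cast<double>(unionSize);
}
```

  two strings whose index is close to 1 are candidate duplicates; where to put the cutoff depends on the data and the shingle length.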
  
java is the new c++
October 05, 2008 at 11:29 AM | categories: rant, java, c++
    
  this year would have been my ten year anniversary of commercially coding in java. it’s not going to be though since the last six months have been ruby. even with my huge investment in java i’d be quite happy to never write a line of it again. i remember when java was first moving in. it was not as performant as c/c++ but it was much easier to write good clean code. and who really cares about performance? scalability is what matters and it's decided by design and architecture, not language choice. as a new language java made sure it had...
  
old projects...
- latent semantic analysis via the singular value decomposition (for dummies)
- semi supervised naive bayes
- statistical synonyms
- round the world tweets
- decomposing social graphs on twitter
- do it yourself statistically improbable phrases
- should i burn it?
- the median of a trillion numbers
- deduping with resemblance metrics
- simple supervised learning / should i read it?
- audioscrobbler experiments
- chaoscope experiment