# brain of mat kelcey

## visualising the consistent hash

September 26, 2010 at 04:00 PM | categories: algorithms | View Comments

consider the problem of allocating N resources across M servers (N >> M)a common approach is the straight forward modulo hash...if we have 4 servers; servers = [server0, server1, server2, server3] we can allocate a resource to a server by simplyhashing the resource hash(resource) = 54reducing modulo 4 54 % 4 = 2allocating to that numbered server servers[2] = server2we can visualise how this scheme maps resources to servers by allocating a colour to each server;server0 server1 server2 server3 and, assuming we are hashing to a value between 0 and 99, draw the following chart ...... where...

## how many terms in a trend?

May 11, 2010 at 07:46 PM | categories: e15, trending, algorithms, puzzled | View Comments

i've been poking around with a simple trending algorithm over the last few weeks and have uncovered a problem that, like most interesting ones, i'm not sure how to solve. the question revolves around discovering multi terms trends. a sensible place to start when looking for trends is to consider single terms but what if though we ended up with three equally trending terms 'happy', 'new' and 'year'? it's pretty obvious that the actual trend is 'happy new year' but what is the best way to express this as a single trend in an algorithmic sense?one approach i've been playing...

## e10.4 communities in social graphs

October 06, 2009 at 08:05 PM | categories: e10, twitter, social network, betweenness, algorithms, graph | View Comments

social graphs, like twitter or facebook, often follow the pattern of having clusters of highly connected components with an occasional edge joining these clusters.these connecting edges define the boundaries of communities in the social network and can be identified by algorithms that measure betweenness.the girvan-newman algorithm can be used to decompose a graph hierarchically based on successive removal of the edges with the highest betweenness.the algorithm is basicallycalculate the betweenness of each edge (using an all shortest paths algorithm)remove the edge(s) with the highest betweennesscheck for connected components (using tarjan's algorithm)repeat for graph or subgraphs if graph was split...

## e10.3 twitter crawl progress

September 29, 2009 at 08:43 PM | categories: e10, twitter, algorithms, hadoop | View Comments

since the twitter api is rate limited it's quite slow to crawl twitter and after a most of a week i've still only managed to get info on 8,000 users. i probably should subscribe to get a 20,000 an hr limit instead of the 150 i'm on now. i'll just let it chug along in the background of my pvr.while the crawl has been going on i've been trying some things on the data to decide what to do with it.i've managed to write a version of pagerank using pig which has been very interesting. (for those who haven't seen...

## e10.1 crawling twitter

September 19, 2009 at 09:31 PM | categories: e10, twitter, algorithms, graph | View Comments

our first goal is to get some data and the twitter api makes getting the data trivial. i'm focused mainly on the friends stuff but because it only gives user ids i'll also get the user info so i can put names to ids.a depth first crawl makes no sense for this one experiment, we're unlikely to get the entire graph and are more interested in following edges "close" to me. instead we'll use a breadth first search.since any call to twitter is expensive (in time that is, they rate limit their api calls) instead of a plain vanilla breadth...

## e10.0 introducing tgraph

September 19, 2009 at 02:41 PM | categories: big data, e10, twitter, hadoop, pig, algorithms | View Comments

so e9 sip is on hold for a bit while i kick off e10 tgraph. was looking for another problem to try hadoop with and came across a classic graph one, pagerank. a well understood algorithm like page rank will be a great chance to try pig, the query language that sits on top of hadoop mapreduce.so we need a graph to work on. my first thoughts were using one of the wikipedia linkage dumps but it feels a bit sterile. instead it's a good excuse to do a little crawl of the following graph of twitter.this will also be...

## bin packing

December 14, 2008 at 11:31 AM | categories: algorithms | View Comments

how to decide what next to backup onto a dvd?when is brute force good enough? will a random walk get a good enough result faster?matpalm.com/burn.it...

## the median of a trillion numbers

November 15, 2008 at 11:31 AM | categories: erlang, algorithms, ec2 | View Comments

i got asked in an interview once “how would find the median of a trillion numbers across a thousand machines?”the question has haunted me, until now.here’s my ruby and erlang implementation with a bit of running amazon ec2 thrown in for good measure….. matpalm.com/median/grab the code from github...

## fastmap and the jaccard distance

October 31, 2008 at 11:31 AM | categories: algorithms, deduplication, c++ | View Comments

given a set of pairwise distances how do you determine what points correspond to those distances?my latest experiment considers this problem in relation to jaccard distances, a resemblance measure similar to jaccard coefficients used in a previous experimentby using the fastmap algorithm we get points from distances and once you have points you have visualisation!...

## shingling and the jaccard index

October 06, 2008 at 11:30 AM | categories: ruby, algorithms, deduplication, c++ | View Comments

on the weekend i did another experiment using shingling and the jaccard index to try to determine if two sets of data were “duplicates”it works quite well and includes a ruby and c++ version with low level bit operations.project page is www.matpalm.com/resemblancecode at github.com/matpalm/resemblance...

old projects...

- latent semantic analysis via the singular value decomposition (for dummies)
- semi supervised naive bayes
- statistical synonyms
- round the world tweets
- decomposing social graphs on twitter
- do it yourself statistically improbable phrases
- should i burn it?
- the median of a trillion numbers
- deduping with resemblance metrics
- simple supervised learning / should i read it?
- audioscrobbler experiments
- chaoscope experiment