me on twitter

brain of mat kelcey

theano and the curse of GpuFromHost

February 22, 2015 at 10:00 PM | View Comments

i've been reviving some old theano code recently and in case you haven't seen it theano is a pretty awesome python library that reads a lot like numpy but provides two particularly interesting features.symbolic differentiation; not something i'll talk about here, but super useful if you're tinkering with new models and you're using a gradient descent method for learning (and these days, who's not..)the ability to run transparently on a gpu; well, almost transparently, this'll be the main focus of this post...let's work through a very simple model that's kinda like a system of linear equations. we'll compare 1) numpy...
Read and Post Comments

dead simple pymc

December 27, 2012 at 09:00 PM | View Comments

PyMC is a python library for working with bayesian statistical models, primarily using MCMCmethods. as a software engineer who has only just scratched the surface of statistics this whole MCMC business is blowing my mind so i've got to share some examples.let's start with the simplest thing possible, fitting a simple distribution.say we have a thousand values, 87.27, 67.98, 119.56, ... and we want to build a model of them.a common first step might be to generate a histogram.if i had to a make a guess i'd say this data looks normally distributed.somewhat unsurprising, not just because normal distributions...
Read and Post Comments

smoothing low support cases using confidence intervals

December 08, 2012 at 10:50 PM | View Comments

say you have three items; item1, item2 and item3 and you've somehow associated a count for each against one of five labels; A, B, C, D, E> data A B C D Eitem1 23700 20 1060 11 4item2 1581 889 20140 1967 200item3 1 0 1 76 0depending on what you're doing it'd be reasonable to normalise these values...
Read and Post Comments

item similarity by bipartite graph dispersion

August 20, 2012 at 08:00 PM | View Comments

the basis of most recommendation systems is the ability to rate similarity between items. there are lots of different ways to do this. one model is based the idea of an interest graph where the nodes of the graph are users and items and the edges of the graph represent an interest, whatever that might mean for the domain. if we only allow edges between users and items the graph is bipartite.let's consider a simple example of 3 users and 3 items; user1 likes item1, user2 likes all three items and user3 likes just item3.fig1user / item interest graphone way...
Read and Post Comments

finding names in common crawl

August 18, 2012 at 08:00 PM | View Comments

the central offering from common crawl is the raw bytes they've downloaded and, though this is useful for some people, a lot of us just wantthe visible text of web pages. luckily they've done this extraction as a part of post processing the crawl and it's freely available too!the first thing we need to do is determine which segments of the crawl are valid and ready for use (the crawl is always ongoing)$ s3cmd get s3://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt$ head -n3 valid_segments.txt134169014725313416901482981341690149519given these segment ids we can lookup the related textData objects.if you just want one grab it's name using something like ...$...
Read and Post Comments

fuzzy jaccard

July 31, 2012 at 08:00 PM | View Comments

the jaccard coefficient is one of the fundamental measures for doing set similarity. ( recall jaccard(set1, set2) = |intersection| / |union|. when set1 == set2 this evaluates to 1.0 and when set1 and set2 have no intersection it evaluates to 0.0 )one thing that's always annoyed me about it though is that is loses any sense of partial similarity. as a set based measure it's all or consider the sets set1 = {i1, i2, i3} and set2 = {i1, i2, i4}jaccard(set1, set2) = 2/4 = 0.5 which is fine given you have no prior info about the relationship between...
Read and Post Comments

ggplot posixct cheat sheet

March 18, 2012 at 08:00 PM | View Comments

after having to google this stuff three times in the last few months i'm writing it down here so i can just cut and paste next time...> d = read.delim('data.tsv',header=F,,col.names=c('dts_str','freq'))> # YEAR MONTH DAY HOUR> head(d,3) dts_str freq1 2012_01_01_00 183932 2012_01_01_01 205363 2012_01_01_02 91840> tail(d,3) dts_str freq732 2012_01_31_21 103107733 2012_01_31_22 108921734 2012_01_31_23 78629> summary(d$freq) Min. 1st Qu. Median Mean 3rd Qu. Max. 10590 63620 82680...
Read and Post Comments

collocations in wikipedia, part 1

January 01, 2012 at 08:00 PM | View Comments

hmmm. did you mean collocations in wikipedia?...
Read and Post Comments

tokenising the visible english text of common crawl

December 10, 2011 at 04:00 PM | View Comments

Common crawl is a publically available 30TB web crawl taken between September 2009 and September 2010. As a small project I decided to extract and tokenised the visible text of the web pages in this dataset. All the code to do this is on github.The first thing was to get the data into a hadoop cluster. It's made up of 300,000 100mb gzipped arc files stored in S3.I wrote a dead simple distributed copy to do this.Only a few things of note about this job...The data in S3 is marked as requester payswhich, even though it's a no-op if you're...
Read and Post Comments

finding phrases with mutual information

November 15, 2011 at 11:00 PM | View Comments

continuing on with my series of mutual information experiments how might we extend the technique to identity sequences longer than just two terms?one novel way is to identify the bigrams of interest, replace them with a single token and simply repeat the entire process. (thanks ted for the idea)so say we had the 6 term sentence i went to new york cityit has 5 bigrams; ('i went', 'went to', 'to new', 'new york', 'york city')running the mutual information algorithm over this might identify new york as a bigram of interest. we can swap the two terms with a single token...
Read and Post Comments

collocations in wikipedia, part 2

November 05, 2011 at 05:00 PM | View Comments

in my last post we went through mutual information as a way of finding collocations.the astute reader may have noticed that for the list of top bigrams i onlyshowed ones that had a frequency above 5,000. why this cutoff? well it turns outthat one of the criticisms of this definition of mutual information is that it gives whacky results for low support cases. if we purely just sort by the mutual information score we find that the top 250,000 all have the same score and correpond to bigrams that occur only once in the corpus (and whose terms only appear...
Read and Post Comments

collocations in wikipedia, part 1

October 19, 2011 at 08:00 PM | View Comments

collocations are combinations of terms that occur together more frequently thanyou'd expect by chance. they can include proper noun phrases like 'Darth Vader'stock/colloquial phrases like 'flora and fauna' or 'old as the hills'common adjectives/noun pairs (notice how 'strong coffee' sounds ok but 'powerful coffee' doesn't?)let's go through a couple of techniques for finding collocations taken from the exceptional nlp text "foundations of statistical natural language processing" by manning and schutze.the first technique we'll try is mututal information, it's a wayof scoring terms based on how often they appear together vs how often they appear separately. the intuition is that if...
Read and Post Comments

an exercise in handling mislabelled training data

October 03, 2011 at 08:00 PM | View Comments

as part of my diy twitter client project i've been using the twitter sample streams as a sourceof unlabelled data for some mutual information analysis. these streams are a great source of random tweets but include a lot of non english content. extracting the english tweets would be pretty straight forward if the ['user']['lang'] field of a tweet was 100% representative of the tweet's language but a lot of the times it's not; can we usethese values at least as a starting point?one approach to seeing how consistent the relationship between user_lang and the tweet language is totrain a classifier...
Read and Post Comments

do all first links on wikipedia lead to philosophy?

August 13, 2011 at 03:00 PM | View Comments

(update: like all interesting things it turns out someone else had already done this :D)a recent xkcd posed the idea...wikipedia trivia: if you take any article, click on the first link in the article text not in parentheses or italics, and then repeat, you will eventually end up at Philosophy.this raises a number of questionsQ: though i wouldn't be surprised if it's true for most articles it can't be true for all articles. can it?Q: what's the distribution of distances (measured in "number of clicks away") from 'Philosophy'?Q: by this same measure what's the furthest article from 'Philosophy'?Q: are there...
Read and Post Comments

dimensionality reduction using random projections.

May 10, 2011 at 08:31 PM | View Comments

previously i've discussed dimensionality reduction using SVD and PCA but another interesting technique is using a random a random projection we project A (a NxM matrix) to A' (a NxO, O < M) by the transform AP=A' where P is a MxO matrix with random values.( well not totally random, each column must have unit length (ie entries in each column must add to 1) )though the results of this reduction are generally not as good as the results from SVD or PCA it has two huge benefitscan be done without needing to hold P in memory (since it's...
Read and Post Comments

pseudocounts and the good-turing estimation (part1)

April 03, 2011 at 03:04 PM | View Comments

say we are running the bar at a soldout bad religion concert. the bar serves beer, scotch and water and we decide to record orders over the night so that we can know how much to order for tomorrow's gig...drink#salesbeer1000scotch300water200using these numbers we can predict a number of things..what is the chance the next person will order a beer?it's a pretty simple probability; 1000 beers / 1500 total drinks = 0.66 or 66%what is the chance the next person will order a water?also straightforward; 200 waters / 1500 total drinks = 0.14 or 14%now say we run the t-shirt stand...
Read and Post Comments

visualising the consistent hash

September 26, 2010 at 04:00 PM | View Comments

consider the problem of allocating N resources across M servers (N >> M)a common approach is the straight forward modulo hash...if we have 4 servers; servers = [server0, server1, server2, server3] we can allocate a resource to a server by simplyhashing the resource hash(resource) = 54reducing modulo 4 54 % 4 = 2allocating to that numbered server servers[2] = server2we can visualise how this scheme maps resources to servers by allocating a colour to each server;server0 server1 server2 server3 and, assuming we are hashing to a value between 0 and 99, draw the following chart ...... where...
Read and Post Comments

simple text search in ruby using ferret

September 12, 2010 at 09:28 PM | View Comments

ferret is a lightweight text search engine for ruby, a bit like lucene but with less (ie no) java.i've been looking at it today as part of my named entity extraction prototype which needs to be able to fuzzily match one short string against a list of other short strings.let's go through an example, it's the only way my brain works sorry.moremaking a ferret index is simple; we'll just make a memory based index for this demo.12require 'ferret'index = we'll add a handful of places in africa and europe to our index.each document we add is simply a hash...
Read and Post Comments

my list of cool machine learning books

August 06, 2010 at 06:35 PM | View Comments

for the last month or so i've had my head down and have been focusing more on theory (ie reading) than on practice (ie coding)so rather than write no blog post here's mats-list-of-cool-machine-learning-books in the order i think you should consider reading them...moreif you know nothing about machine learning and haven't done maths since high school then this is the book for's a fantastically accesible introduction to the field. includes almost no theory and explains algorithms using actual python implementations.this book covers quite a bit more than programming c.i. while still being extremely practical (ie very few formula).about a...
Read and Post Comments

brutally short intro to weka

July 03, 2010 at 05:35 PM | View Comments

weka is a java based machine learning workbench that i've found useful to playing with to help understand some standard machine learning algorithms. in this quick demo i show how to build a classifier for three simple datasets; two of which address the basics of text classificationbrutally short intro to weka from Mat Kelcey on Vimeo....
Read and Post Comments

friend clustering by term usage

June 25, 2010 at 11:39 PM | View Comments

recently signed up to the infochimps api and wanted to do something quick and dirty to get a feel for here's a little experimentget the people i follow on twitterlook up the words that "represent" them according to the infochimps word bag apibuild a similiarity matrix based on the common use of those termsplot the connectivity for the top 30 or so pairingsit's basically partitioned into three groups...veztek (my boss john) and smcinnes (steve from the lonely planet community team) in the top righta big clump of nosqlness with mongodb - hbase - jpatanooga - kevinweil in the bottom...
Read and Post Comments

country codes in world cup tweets - viz1

June 21, 2010 at 07:43 PM | View Comments

#worldcup tweet viz1 from Mat Kelcey on's a simple visualisation of the use of official country codes (eg #aus) in a week's worth of tweets from the search stream for #worldcup.rate is about 2hours of tweets per sec. orb size denotes relative frequency of that country code. edges denote that those two countries feature a lot in the same tweets. movement is based on gravitational like attraction along edges.the quiet period at about 0:17 is a twitter outage :)here's the original processing applet version with a bit more discussion...
Read and Post Comments

moving average of a time series in R

June 15, 2010 at 04:15 PM | View Comments

in this a sliding window of 3 elements123456789> x = c(3,1,4,1,5,9,2,6,5,3,5,8)> ra_x = filter(x, rep(1,3)/3)> ra_xTime Series:Start = 1 End = 12 Frequency = 1 [1] NA 2.666667 2.000000 3.333333 5.000000 5.333333 5.666667 4.333333 [9] 4.666667 4.333333 5.333333 NA...
Read and Post Comments

#worldcup twitter analytics

June 14, 2010 at 10:06 PM | View Comments

since the world cup started i've spent more time looking at twitter data about the games than the actual games themselves. what a sad data nerd i am!anyways, here's the first few days analysis based the use of official country tags (eg #aus) in the search stream for #worldcup.tomorrow i might look in more detail at one of the games, wondering how many variants of 'goooooooal' i'll find :D...
Read and Post Comments

a quick study in tf/icf

June 09, 2010 at 09:58 PM | View Comments

while doing some more research on trending algorithms i came across a cool little paper about term frequency normalisation for streaming data: TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams.i'm finding streaming related algorithms quite interesting lately and think are the way forward in terms of dealing with large amounts of constant data. it's just not feasible to use algorithms that expect you to have all the data at any given time; it forces you to reprocess all the data you've ever seen as you get new examples. my thinking is the best solutions are the ones...
Read and Post Comments

5 minute ggobi demo

June 04, 2010 at 11:12 PM | View Comments

brutally short demo of ggobi from Mat Kelcey on Vimeo.note: non embedded version has higher res at full screen...
Read and Post Comments

how many terms in a trend?

May 11, 2010 at 07:46 PM | View Comments

i've been poking around with a simple trending algorithm over the last few weeks and have uncovered a problem that, like most interesting ones, i'm not sure how to solve. the question revolves around discovering multi terms trends. a sensible place to start when looking for trends is to consider single terms but what if though we ended up with three equally trending terms 'happy', 'new' and 'year'? it's pretty obvious that the actual trend is 'happy new year' but what is the best way to express this as a single trend in an algorithmic sense?one approach i've been playing...
Read and Post Comments

trending topics in tweets about cheese; part2

May 01, 2010 at 04:54 PM | View Comments

prototyping in ruby was a great way to prove the concept but my main motivation for this project was to play some more with pig.the main approach will bemaintain a relation with one record per tokenfold 1 hours worth of new data at a time into the modelcheck the entries for the latest hour for any trendsthe full version is on github. read on for a line by line walkthrough!the ruby impl used the simplest approach possible for calculating mean and standard deviation; just keep a record of all the values seen so far and recalculate for each new value.for...
Read and Post Comments

trending topics in tweets about cheese; part1

April 27, 2010 at 11:42 PM | View Comments

what does it mean for a topic to be 'trending'? consider the following time series (430e3 tweets containing cheese collected over a month period bucketed into hourly timeslots)without a formal definition we can just look at this and say that the series was trending where there was a spike just before time 600. as a start then let's just define a trend as a value that was greater than was 'expected'.one really nice simple algorithm for detecting a trend is to say a value, v, is trending if v > mean + 3 * standard deviation of the data seen...
Read and Post Comments

latent semantic analysis via the singular value decomposition (for dummies)

April 19, 2010 at 08:50 PM | View Comments

i've been trying to get a deeper understanding of latent semantic analysis for awhile now.last week i came to the conclusion the other way to truly understand would be to start from the ground upso here goes; mat's guide to latent semantic analysis via the singular value decomposition (for dummies)...
Read and Post Comments

cool bash stuff; mkfifo

April 15, 2010 at 09:33 PM | View Comments

mkfifo is one of those shell commands provided as part of coreutils that not many people seem to know's an (semi contrived) example close to something i did the other day to show how awesome it issay you have a number of largish presorted files; run-00 to run-03; and you want to find the most frequent lines. you could do something like the following...sort -m run-* | uniq -c | sort -nr | headhowever you'll know that from previous posts i just loooove keeping all my data compressed on disk so instead i've got run-00.gz to run-03.gzwithout having to...
Read and Post Comments

e10.6 community detection for my twitter network

April 04, 2010 at 12:58 PM | View Comments

last night i applied my network decomposition algorithm to a graph of some of the people near me in twitter.first i build a friend graph for 100 people 'around' me (taken from a crawl i did last year). by 'friend' i mean that if alice follows bob then bob also follows the graph, some things to note though; it was an unfinished crawl (can a crawl of twitter EVER be finished) and was done october last year so is a bit out of date.moreand here is the dendrogram decompositionsome interesting clusterings come out..right at the bottom we have a...
Read and Post Comments

e10.5 revisiting community detection

March 30, 2010 at 08:42 PM | View Comments

i've decided to switch back to some previous work i did on community detection in (social) graphsthe last chunk of code i wrote which tried to deal with weighted directed graphs was terribly, terribly, broken but it seems that simplifying to undirected graphs is giving me much saner results. yay!here's an example of my work in progress generated from the new version of the codeconsider the graphand it's corresponding decompositionthe results are reasonable; the initial breaking of clusters [1,2,3,4,5,6] and [7,8,9,10,11,12] is the most obvious but some of the others are not as intuitive[1,2,5] and [7,8,10] remain as unbreakable cliques...
Read and Post Comments

brutally short intro to collaborative filtering

March 18, 2010 at 08:38 PM | View Comments

my favourite recommendations system is the collaborative filter; it gives good resultsand is easy to understand and extend as works on the intuition thatif i like coffee, chocolate and ice creamand you like coffee and chocolateyou might also like ice creamso we need a little bit of terminology; users (me and you), items (coffee, chocolate and ice cream)in a user based collaborative filter the process isto calculate recommendation for user1 for each other user (user2) calculate user_similarity_score between user1 and user2 (0 -> 1 value ) if the user_similarity_score is non zero for each item user2...
Read and Post Comments

sentiment analysis training data using mechanical turk

March 12, 2010 at 09:57 PM | View Comments

want to try doing some sentiment analysis work on tweets but i need some good training data.i could label a heap of tweets myself as being positive, neutral or negative but instead this seems to be the perfect job for mechanical turkso i put up 100 'cream cheese' tweets on mechanical turk, asked for 3 opinions per tweet and offered $0.01 per opinion. took under 30 minutes to get back all 300 opinions and only cost $4.50 ($3 for the work, $1.50 admin fee)the results are interesting in themselves...mostly they are consistent;for example all three sentiments for bagels and cream...
Read and Post Comments

mongodb + twitter + yahoo term extractor = fun!

March 07, 2010 at 09:38 PM | View Comments

ran a little experiment in using yahoo term extraction yesterday and it worked well enough. here's some code to pass some text to yahoo and get back an array of termsi've got to say mongodb is such an easy tool for working with json data. these 20 odd lines insert a text json tweet stream into mongo. so simple, why can't all code be this easy......
Read and Post Comments

what to do with a week off?

February 22, 2010 at 06:42 PM | View Comments

this week i'm between jobs so i have (a little) more time than usual to hack.i've got a list of pending things to do but can't decide what to do next, here's my list in (sort of) priority order...fix up my numerical underflow / overflow problems in my recent semi supervised classification through the exerecises from the first few chapters to introductory statistics with r and all of statistics. i'm particularly keen to write a intro stats blog post about statistical this mongdb tute i found; shouldn't take too a weka screencast. i did some little talks...
Read and Post Comments

semi supervised naive bayes for text classification

February 14, 2010 at 09:46 PM | View Comments

experiment 13; a test of semi supervised naive bayes for text classification is complete.semi supervised algorithms seem to work pretty well and i can see how they are a huge benefit for text classification where you can have an enormous corpus but not enough time to label it all......
Read and Post Comments

e12.3 stat syns FAIL!

February 05, 2010 at 08:31 PM | View Comments

after quite a bit of hacking the statistical synonyms idea doesn't seem to give terribly interesting results so i'm going onto do something else.for the record here's what I did do though....generate 3grams from 800e3 tweetscollect n-grams together that share the same first and last term; eg 'the blue cat', 'the green cat', 'the red cat'for each set generate all the combos of the middle terms; eg 'blue green', 'blue red', 'green red'count the occurrences of each pairdraw a graph of the 150 top occurring pairsviola! click this image for a bigger versionsome interesting result. few of the more complex...
Read and Post Comments

an intro to semi supervised document classification

January 31, 2010 at 02:02 PM | View Comments

here's a great lecture from tom mitchell about document classification using a semi supervised version of naive bayes.semi supervised algorithms only require some of the training examples to be labeled and are able to make use of any unlabelled ones, very common when we have a huge corpus.i've started an experiment brewing to test this out by porting some previous naive bayes work i did to use this semi supervised scheme and will published it when it's stuff!!...
Read and Post Comments

e12.2 entity set expansion

January 28, 2010 at 08:18 PM | View Comments

i've been doing some reading for my statistical synonyms project and have uncovered a heap of cool papers. most of them are around an idea (from the 1950's!) called the distributional hypothesis that simply states that words that appear in similar contexts often have similar meanings.the coolest paper so far is 'Web-Scale Distributional Similarity and Entity Set Expansion' by Pantel,Crestan,Borkovsky et al which has introduced me to an area of research i didn't really know existed; entity set expansion.entity set expansion is a bit like thesaurus building for proper nouns; given a seed set of related items can you expand...
Read and Post Comments

e12.1 statistical synonyms

January 23, 2010 at 12:54 PM | View Comments

i've had an idea brewing in my head for awhile now seeded by a great talk by peter norvig about statistically approaches to find patterns in thing he alludes to is the generation of synoyms based on n-gram models.the basic intuition is this; if a corpus contains occurrences of the phrases 'a x b' and 'a y b' then to some degree x and y are synonymous.the question becomes how do we calculate the strength of the relationship? how is it a function of the frequencies of a, b, x, y, 'a x b', 'a y b', 'a ?...
Read and Post Comments

a pig screencast

January 17, 2010 at 02:22 PM | View Comments

pig demo from Mat Kelcey on Vimeo.based on a talk i gave at work recently...
Read and Post Comments

tweets about cheese

November 15, 2009 at 08:45 PM | View Comments

people tweet about all sorts of stuff.sometimes it's really important ground breaking world changing stuff...but most of the time it's ridiculous waste of time stuff like 'i ate some cheese'in fact how much do people actually tweet about cheese?and when they do, what are the most important cheese related topics?lets gather some data...bash> curl -s -u user:pasword's poke around, but first some l33t hax0r bash aliases for the sake of brevityalias t='tail'alias h='head'alias s='sort'alias u='uniq'alias g='grep'let's start with a sample, the first 10 tweets...bash> ./parse_cheese_out.rb < cheese.out | hPasta with pesto and cheese. Some watermelon but alas did not...
Read and Post Comments

xargs parallel execution

November 06, 2009 at 09:57 PM | View Comments

just recently discovered xargs has a parallelise option!i have 20 files, sample.01.gz to sample.20.gz, each ~100mb in size that i need to run a script overone option iszcat sample*gz | ./script.rb > outputbut this will process the files sequentially on a single get some parallel action going i could generate a temp script that produceszcat sample.01.gz | ./script.rb > sample.01.out &zcat sample.02.gz | ./script.rb > sample.02.out &...zcat sample.20.gz | ./script.rb > sample.20.out &and run that but this will have all 20 running at the same time and produce contention(though with only 20 files this might not be a problem)instead...
Read and Post Comments

e11.3 at what time does the world tweet?

October 28, 2009 at 09:22 PM | View Comments

consider the graph below which shows the proportion of tweets per 10 min slot of the day (GMT0)it compares 4.7e6 tweets with any location vs  320e3 tweets with identifiable lat lonssome interesting observations with unanswered questions...the ebb and flow is not just a result of the time of day for high twitter traffic areas. the reduction between 06:00 and 10:00 comes close to zero. this is false, there is never a worldwide time when internet traffic hits zero. does twitter turn down it's gatdenhose for capacity reasons?the number of tweets with lat lons are correlated to those without EXCEPT past...
Read and Post Comments

e11.2 aggregating tweets by time of day

October 24, 2009 at 01:02 PM | View Comments

for v3 lets aggregate by time of the day, should make for an interesting animationbrowsing the data there are lots of other lat longs in data, not just iPhone: and ÜT: there are also one tagged with Coppó:, Pre:, etc perhaps should just try to take anything that looks like a lat long.furthermore lets switch to a bigger dataset again, 4.7e6 tweets from Oct 13 07:00 thru Oct 19 17:00,i've been streaming all my tweets ( as previously discussed ) and been storing them in a directory json_streamhere are the steps...use a streaming script to take a tweet in json...
Read and Post Comments

e11.1 from bash scripts to hadoop

October 18, 2009 at 02:10 PM | View Comments

let's rewrite v1 using hadoop tooling, code is on githubwe'll run hadoop in non distributed standalone mode. in this mode everything runs in a single jvm so it's nice and simple to dev v1 it wasbzcat sample.bz2 | ./ > locationsusing the the awesome hadoop streaming interface it's not too different. this interface allows you to specify any app as the mapper or reducer. the main difference is that it works on directories not just files.for the mapper we'll use exactly the same script as before; and since there is no reduce component of this job so we...
Read and Post Comments

e11.0 tweets around the world

October 16, 2009 at 08:47 PM | View Comments

was discussing the streaming twitter api with steve and though i knew about the private firehose i didn't know there was a lighter weight public gardenhose interface!since discovering this my pvr has basically been runningcurl -u mat_kelcey:XXX |\  gzip -9 - > sample.json.gzbut what am i going to do with all this data?while poking around i noticed there was a fair number of iPhone: and ÜT: lat long tagged locations (eg iPhone: 35.670086,139.740766) so as a first hack let's do some work extracing lat longs and displaying them as heat map points on a map.all the code is...
Read and Post Comments

e10.4 communities in social graphs

October 06, 2009 at 08:05 PM | View Comments

social graphs, like twitter or facebook, often follow the pattern of having clusters of highly connected components with an occasional edge joining these clusters.these connecting edges define the boundaries of communities in the social network and can be identified by algorithms that measure betweenness.the girvan-newman algorithm can be used to decompose a graph hierarchically based on successive removal of the edges with the highest betweenness.the algorithm is basicallycalculate the betweenness of each edge (using an all shortest paths algorithm)remove the edge(s) with the highest betweennesscheck for connected components (using tarjan's algorithm)repeat for graph or subgraphs if graph was split...
Read and Post Comments

simple statistics with R

October 03, 2009 at 03:43 PM | View Comments

i'm learning a new statistics language called R and it's pretty cool.make a vector ...12> c(3,1,4,1,5,9,2,6,5,3,5,8) [1] 3 1 4 1 5 9 2 6 5 3 5 8turn it into a frequency table ...123> table(c(3,1,4,1,5,9,2,6,5,3,5,8))1 2 3 4 5 6 8 92 1 2 1 3 1 1 1sort by frequency ...123> sort(table(c(3,1,4,1,5,9,2,6,5,3,5,8)))2 4 6 8 9 1 3 51 1 1 1 1 2 2 3and plot!1> barplot(sort(table(c(3,1,4,1,5,9,2,6,5,3,5,8))))so simple!...
Read and Post Comments

do a degree via youtube

October 01, 2009 at 08:40 PM | View Comments

i'm amazed by how much great content is on youtube, how could you NOT learn something!?13 x 1hr Statistical Aspects of Data Mining (Stats 202)20 x 1hr Machine Learning...
Read and Post Comments

e10.3 twitter crawl progress

September 29, 2009 at 08:43 PM | View Comments

since the twitter api is rate limited it's quite slow to crawl twitter and after a most of a week i've still only managed to get info on 8,000 users. i probably should subscribe to get a 20,000 an hr limit instead of the 150 i'm on now. i'll just let it chug along in the background of my pvr.while the crawl has been going on i've been trying some things on the data to decide what to do with it.i've managed to write a version of pagerank using pig which has been very interesting. (for those who haven't seen...
Read and Post Comments

e10.2 tgraph crawl order example

September 21, 2009 at 09:58 PM | View Comments

let's consider an example of the crawl order for tgraph...we seed our frontier with 'a' and bootstrap cost of 0.fetching the info for 'a' shows 2 outedges to 'b' and 'c', from our cost formula these all have cost 0 + 1 + Log10(2+1) = 1.6our frontier becomes [ {b,1.6}, {c,1.6} ]next is 'b' and see it has an outdegree of 3, these nodes, b1 -> b3, all have a cost of 1.6 + 1 + Log10(3+1) = 3.2our frontier becomes [ {c,1.6}, {b1,3.2}, {b2,3.2}, {b3,3.2} ]next is 'c' with an outdegree of 15. these 15 nodes, c1 -> c15,...
Read and Post Comments

e10.1 crawling twitter

September 19, 2009 at 09:31 PM | View Comments

our first goal is to get some data and the twitter api makes getting the data trivial. i'm focused mainly on the friends stuff but because it only gives user ids i'll also get the user info so i can put names to ids.a depth first crawl makes no sense for this one experiment, we're unlikely to get the entire graph and are more interested in following edges "close" to me. instead we'll use a breadth first search.since any call to twitter is expensive (in time that is, they rate limit their api calls) instead of a plain vanilla breadth...
Read and Post Comments

e10.0 introducing tgraph

September 19, 2009 at 02:41 PM | View Comments

so e9 sip is on hold for a bit while i kick off e10 tgraph. was looking for another problem to try hadoop with and came across a classic graph one, pagerank. a well understood algorithm like page rank will be a  great chance to try pig, the query language that sits on top of hadoop we need a graph to work on. my first thoughts were using one of the wikipedia linkage dumps but it feels a bit sterile. instead it's a good excuse to do a little crawl of the following graph of twitter.this will also be...
Read and Post Comments

first hadoop experiment

September 16, 2009 at 07:26 PM | View Comments

just finished my first hadoop fantastic results but heaps of of feedback from hadoop mailing groupmore results coming soon...
Read and Post Comments

how using compressed data can make you app faster

June 28, 2009 at 11:32 AM | View Comments

when working with larger data sets (ie more than can fit in memory) there are two important resources to juggle…cpu. how quickly can you process the data.disk io. how quickly can you get data to the cpu.i remember reading once that depending on your situation you might be better off using data compressed on disk. why? because the extra cpu time used decompressing it is worth it for the time saved getting it off disk.i’ve recently been working with a number crunching app (burns 100% cpu of a quadcore machine for an hour over a 7gb working dataset) and thought...
Read and Post Comments

erlang profiling

April 22, 2009 at 11:32 AM | View Comments

i just found fprof, the erlang profiler by randoming clicking around the erlang man page listtry123fprof:apply(Module, Function, Args).fprof:profile().fprof:analyse().for an interesting breakdown of a call...
Read and Post Comments

bin packing

December 14, 2008 at 11:31 AM | View Comments

how to decide what next to backup onto a dvd?when is brute force good enough? will a random walk get a good enough result faster?
Read and Post Comments

the median of a trillion numbers

November 15, 2008 at 11:31 AM | View Comments

i got asked in an interview once “how would find the median of a trillion numbers across a thousand machines?”the question has haunted me, until’s my ruby and erlang implementation with a bit of running amazon ec2 thrown in for good measure….. the code from github...
Read and Post Comments

fastmap and the jaccard distance

October 31, 2008 at 11:31 AM | View Comments

given a set of pairwise distances how do you determine what points correspond to those distances?my latest experiment considers this problem in relation to jaccard distances, a resemblance measure similar to jaccard coefficients used in a previous experimentby using the fastmap algorithm we get points from distances and once you have points you have visualisation!...
Read and Post Comments

openmp = easy multi threading

October 13, 2008 at 11:30 AM | View Comments

openmp is a compiler library, available in gcc since v4.2, for giving hints to a compiler about where code can be parallelized.say we have some code12for(int i=0; i<HUGE_NUMBER; ++i) deadHardCalculation(i)we can make this run on multi threaded by simply adding some pragmas123456#pragma omp parallel num_threads(4){ #pragma omp for for(int i=0; i&lt;HUGE_NUMBER; ++i) deadHardCalculation(i);}compiling with -fopenmp will generate an app that splits the work of the for loop across 4 threads.there’s support for dynamic / static scheduling, accumulators, all sortsthis tute is increased the speed of my shingling code by 350% on a quad...
Read and Post Comments

shingling and the jaccard index

October 06, 2008 at 11:30 AM | View Comments

on the weekend i did another experiment using shingling and the jaccard index to try to determine if two sets of data were “duplicates”it works quite well and includes a ruby and c++ version with low level bit operations.project page is at
Read and Post Comments

java is the new c++

October 05, 2008 at 11:29 AM | View Comments

this year would have been my ten year anniversary of commercially coding in java. it’s not going to be though since the last six months have been ruby. even with my huge investment in java i’d be quite happy to never write a line of it again.i remember when java was first moving in. it was not as performant as c/c++ but it was much easier to write good clean code. and who really cares about performance? scalability is what matters and it's decided by design and architecture, not language choice. as a new language java made sure it had...
Read and Post Comments

old projects...