brain of mat kelcey


brutally short intro to theano word embeddings

March 28, 2015 | View Comments

one thing in theano i couldn't immediately find examples for was a simple embedding lookup table, a critical component for anything with NLP. turns out that it's just one of those things that's so simple no one bothers writing it down...

hallucinating softmaxs

March 15, 2015 | View Comments

language modelling is a classic problem in NLP; given a sequence of words such as "my cat likes to ..." what's the next word? this problem is related to all sorts of things, everything from autocomplete to speech to text. the...

theano and the curse of GpuFromHost

February 22, 2015 | View Comments

i've been reviving some old theano code recently and in case you haven't seen it theano is a pretty awesome python library that reads a lot like numpy but provides two particularly interesting features. symbolic differentiation; not something i'll talk about...

dead simple pymc

December 27, 2012 | View Comments

PyMC is a python library for working with bayesian statistical models, primarily using MCMC methods. as a software engineer who has only just scratched the surface of statistics this whole MCMC business is blowing my mind so i've got to share...

smoothing low support cases using confidence intervals

December 08, 2012 | View Comments

say you have three items; item1, item2 and item3 and you've somehow associated a count for each against one of five labels; A, B, C, D, E
> data A ...

item similarity by bipartite graph dispersion

August 20, 2012 | View Comments

the basis of most recommendation systems is the ability to rate similarity between items. there are lots of different ways to do this. one model is based on the idea of an interest graph where the nodes of the graph are...

finding names in common crawl

August 18, 2012 | View Comments

the central offering from common crawl is the raw bytes they've downloaded and, though this is useful for some people, a lot of us just want the visible text of web pages. luckily they've done this extraction as a part of...

fuzzy jaccard

July 31, 2012 | View Comments

the jaccard coefficient is one of the fundamental measures for doing set similarity. (recall jaccard(set1, set2) = |intersection| / |union|. when set1 == set2 this evaluates to 1.0 and when set1 and set2 have no intersection it evaluates to...
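
a minimal python sketch of the formula quoted in this excerpt, using the built in set type. the function name and the handling of two empty sets are assumptions made for illustration, not something taken from the post.

```python
def jaccard(set1, set2):
    """jaccard(set1, set2) = |intersection| / |union|"""
    union = set1 | set2
    if not union:
        # both sets empty: the ratio is undefined, returning 1.0 here
        # is an assumption (the two sets are trivially identical)
        return 1.0
    return len(set1 & set2) / len(union)


# identical sets share every element
same = jaccard({"a", "b"}, {"a", "b"})

# 2 shared elements (b, c) out of 4 distinct elements (a, b, c, d)
partial = jaccard({"a", "b", "c"}, {"b", "c", "d"})
```

the second call shares two of four distinct elements, so it sits halfway between the two extremes the excerpt describes.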

ggplot posixct cheat sheet

March 18, 2012 | View Comments

after having to google this stuff three times in the last few months i'm writing it down here so i can just cut and paste next time...
> d = read.delim('data.tsv',header=F,as.is=T,col.names=c('dts_str','freq'))
> # YEAR MONTH DAY HOUR
> head(d,3) ...

collocations in wikipedia, part 1

January 01, 2012 | View Comments

hmmm. did you mean collocations in wikipedia?...

tokenising the visible english text of common crawl

December 10, 2011 | View Comments

Common crawl is a publicly available 30TB web crawl taken between September 2009 and September 2010. As a small project I decided to extract and tokenise the visible text of the web pages in this dataset. All the code to...

finding phrases with mutual information

November 15, 2011 | View Comments

continuing on with my series of mutual information experiments, how might we extend the technique to identify sequences longer than just two terms? one novel way is to identify the bigrams of interest, replace them with a single token and simply...

collocations in wikipedia, part 2

November 05, 2011 | View Comments

in my last post we went through mutual information as a way of finding collocations. the astute reader may have noticed that for the list of top bigrams i only showed ones that had a frequency above 5,000. why this cutoff? well...

collocations in wikipedia, part 1

October 19, 2011 | View Comments

collocations are combinations of terms that occur together more frequently than you'd expect by chance. they can include proper noun phrases like 'Darth Vader'; stock/colloquial phrases like 'flora and fauna' or 'old as the hills'; common adjective/noun pairs (notice how 'strong coffee' sounds...

an exercise in handling mislabelled training data

October 03, 2011 | View Comments

as part of my diy twitter client project i've been using the twitter sample streams as a source of unlabelled data for some mutual information analysis. these streams are a great source of random tweets but include a lot of non...

do all first links on wikipedia lead to philosophy?

August 13, 2011 | View Comments

(update: like all interesting things it turns out someone else had already done this :D) a recent xkcd posed the idea... wikipedia trivia: if you take any article, click on the first link in the article text not in parentheses or italics,...

dimensionality reduction using random projections.

May 10, 2011 | View Comments

previously i've discussed dimensionality reduction using SVD and PCA but another interesting technique is using a random projection. in a random projection we project A (a NxM matrix) to A' (a NxO, O < M) by the transform AP=A' where P...
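
a rough python sketch of the transform AP=A' from this excerpt. the excerpt cuts off before saying how P is built, so the gaussian entries scaled by 1/sqrt(O) used here are an assumption (one common choice), and the function names are made up for illustration.

```python
import random


def random_projection_matrix(m, o, seed=0):
    # ASSUMPTION: entries drawn from N(0, 1) and scaled by 1/sqrt(o);
    # the excerpt does not say how P is actually constructed
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) / o ** 0.5 for _ in range(o)]
            for _ in range(m)]


def project(a, p):
    # plain matrix product: (N x M) times (M x O) gives (N x O)
    return [[sum(row[k] * p[k][j] for k in range(len(p)))
             for j in range(len(p[0]))]
            for row in a]


# A is N=3 rows by M=4 columns, projected down to O=2 columns
a = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 3.0, 1.0, 0.0],
     [2.0, 1.0, 0.0, 1.0]]
p = random_projection_matrix(4, 2)
a_prime = project(a, p)
```

a_prime keeps one row per row of A but with only O columns each.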

pseudocounts and the good-turing estimation (part1)

April 03, 2011 | View Comments

say we are running the bar at a soldout bad religion concert. the bar serves beer, scotch and water and we decide to record orders over the night so that we can know how much to order for tomorrow's gig...
drink   #sales
beer    1000
scotch  300
water   200
using...

visualising the consistent hash

September 26, 2010 | View Comments

consider the problem of allocating N resources across M servers (N >> M). a common approach is the straightforward modulo hash... if we have 4 servers; servers = [server0, server1, server2, server3] we can allocate a resource to a server by...
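
a small python sketch of the modulo hash approach named in this excerpt. the excerpt is cut off before the actual allocation rule, so the use of zlib.crc32 as the hash function and the function name allocate are assumptions for illustration only.

```python
import zlib

servers = ["server0", "server1", "server2", "server3"]


def allocate(resource, servers):
    # ASSUMPTION: crc32 stands in for whatever hash the post uses;
    # it is stable across runs, unlike python's builtin hash() on str
    digest = zlib.crc32(resource.encode("utf-8"))
    # modulo maps the hash onto one of the M server slots
    return servers[digest % len(servers)]


# the same resource always lands on the same server
first = allocate("resource-42", servers)
second = allocate("resource-42", servers)
```

the slot depends on len(servers), which is worth keeping in mind when reading a post about the consistent hash.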

simple text search in ruby using ferret

September 12, 2010 | View Comments

ferret is a lightweight text search engine for ruby, a bit like lucene but with less (ie no) java. i've been looking at it today as part of my named entity extraction prototype which needs to be able to fuzzily match...

my list of cool machine learning books

August 06, 2010 | View Comments

for the last month or so i've had my head down and have been focusing more on theory (ie reading) than on practice (ie coding). so rather than write no blog post here's mats-list-of-cool-machine-learning-books in the order i think you should...

brutally short intro to weka

July 03, 2010 | View Comments

weka is a java based machine learning workbench that i've found useful to play with to help understand some standard machine learning algorithms. in this quick demo i show how to build a classifier for three simple datasets; two of...

friend clustering by term usage

June 25, 2010 | View Comments

recently signed up to the infochimps api and wanted to do something quick and dirty to get a feel for it. so here's a little experiment: get the people i follow on twitter; look up the words that "represent" them according to the...

country codes in world cup tweets - viz1

June 21, 2010 | View Comments

#worldcup tweet viz1 from Mat Kelcey on Vimeo. here's a simple visualisation of the use of official country codes (eg #aus) in a week's worth of tweets from the search stream for #worldcup. rate is about 2 hours of tweets per sec. orb...

moving average of a time series in R

June 15, 2010 | View Comments

in this a sliding window of 3 elements
> x = c(3,1,4,1,5,9,2,6,5,3,5,8)
> ra_x = filter(x, rep(1,3)/3)
> ra_x
Time Series:
Start = 1 End = 12 Frequency = 1
[1] NA 2.666667 2.000000 3.333333 5.000000 5.333333 5.666667...

#worldcup twitter analytics

June 14, 2010 | View Comments

since the world cup started i've spent more time looking at twitter data about the games than the actual games themselves. what a sad data nerd i am! anyways, here's the first few days analysis based on the use of official country...

a quick study in tf/icf

June 09, 2010 | View Comments

while doing some more research on trending algorithms i came across a cool little paper about term frequency normalisation for streaming data: TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams. i'm finding streaming related algorithms quite interesting lately...

5 minute ggobi demo

June 04, 2010 | View Comments

brutally short demo of ggobi from Mat Kelcey on Vimeo. note: non embedded version has higher res at full screen...

how many terms in a trend?

May 11, 2010 | View Comments

i've been poking around with a simple trending algorithm over the last few weeks and have uncovered a problem that, like most interesting ones, i'm not sure how to solve. the question revolves around discovering multi term trends. a sensible...

trending topics in tweets about cheese; part2

May 01, 2010 | View Comments

prototyping in ruby was a great way to prove the concept but my main motivation for this project was to play some more with pig. the main approach will be: maintain a relation with one record per token; fold 1 hours worth of...

trending topics in tweets about cheese; part1

April 27, 2010 | View Comments

what does it mean for a topic to be 'trending'? consider the following time series (430e3 tweets containing cheese collected over a month period bucketed into hourly timeslots). without a formal definition we can just look at this and say that...

latent semantic analysis via the singular value decomposition (for dummies)

April 19, 2010 | View Comments

i've been trying to get a deeper understanding of latent semantic analysis for awhile now. last week i came to the conclusion the only way to truly understand would be to start from the ground up, so here goes; mat's guide to...

cool bash stuff; mkfifo

April 15, 2010 | View Comments

mkfifo is one of those shell commands provided as part of coreutils that not many people seem to know about. here's a (semi contrived) example close to something i did the other day to show how awesome it is. say you have...

e10.6 community detection for my twitter network

April 04, 2010 | View Comments

last night i applied my network decomposition algorithm to a graph of some of the people near me in twitter. first i build a friend graph for 100 people 'around' me (taken from a crawl i did last year). by 'friend'...

e10.5 revisiting community detection

March 30, 2010 | View Comments

i've decided to switch back to some previous work i did on community detection in (social) graphs. the last chunk of code i wrote which tried to deal with weighted directed graphs was terribly, terribly, broken but it seems that simplifying...

brutally short intro to collaborative filtering

March 18, 2010 | View Comments

my favourite recommendations system is the collaborative filter; it gives good results and is easy to understand and extend as required. it works on the intuition that if i like coffee, chocolate and ice cream and you like coffee and chocolate you might also like...

sentiment analysis training data using mechanical turk

March 12, 2010 | View Comments

want to try doing some sentiment analysis work on tweets but i need some good training data. i could label a heap of tweets myself as being positive, neutral or negative but instead this seems to be the perfect job for...

mongodb + twitter + yahoo term extractor = fun!

March 07, 2010 | View Comments

ran a little experiment in using yahoo term extraction yesterday and it worked well enough. here's some code to pass some text to yahoo and get back an array of terms. i've got to say mongodb is such an easy tool...

what to do with a week off?

February 22, 2010 | View Comments

this week i'm between jobs so i have (a little) more time than usual to hack. i've got a list of pending things to do but can't decide what to do next, here's my list in (sort of) priority order... fix up...

semi supervised naive bayes for text classification

February 14, 2010 | View Comments

experiment 13; a test of semi supervised naive bayes for text classification is complete. semi supervised algorithms seem to work pretty well and i can see how they are a huge benefit for text classification where you can have an enormous...

e12.3 stat syns FAIL!

February 05, 2010 | View Comments

after quite a bit of hacking the statistical synonyms idea doesn't seem to give terribly interesting results so i'm going on to do something else. for the record here's what i did do though... generate 3grams from 800e3 tweets; collect n-grams together that share...

an intro to semi supervised document classification

January 31, 2010 | View Comments

here's a great lecture from tom mitchell about document classification using a semi supervised version of naive bayes. semi supervised algorithms only require some of the training examples to be labeled and are able to make use of any unlabelled ones,...

e12.2 entity set expansion

January 28, 2010 | View Comments

i've been doing some reading for my statistical synonyms project and have uncovered a heap of cool papers. most of them are around an idea (from the 1950's!) called the distributional hypothesis that simply states that words that appear in...

e12.1 statistical synonyms

January 23, 2010 | View Comments

i've had an idea brewing in my head for awhile now seeded by a great talk by peter norvig about statistical approaches to find patterns in data. one thing he alludes to is the generation of synonyms based on n-gram models. the...

a pig screencast

January 17, 2010 | View Comments

pig demo from Mat Kelcey on Vimeo. based on a talk i gave at work recently...

tweets about cheese

November 15, 2009 | View Comments

people tweet about all sorts of stuff. sometimes it's really important ground breaking world changing stuff... but most of the time it's ridiculous waste of time stuff like 'i ate some cheese'. in fact how much do people actually tweet about cheese? and when...

xargs parallel execution

November 06, 2009 | View Comments

just recently discovered xargs has a parallelise option! i have 20 files, sample.01.gz to sample.20.gz, each ~100mb in size that i need to run a script over. one option is
zcat sample*gz | ./script.rb > output
but this will process the files sequentially on...

e11.3 at what time does the world tweet?

October 28, 2009 | View Comments

consider the graph below which shows the proportion of tweets per 10 min slot of the day (GMT0). it compares 4.7e6 tweets with any location vs 320e3 tweets with identifiable lat lons. some interesting observations with unanswered questions... the ebb and flow is...

e11.2 aggregating tweets by time of day

October 24, 2009 | View Comments

for v3 let's aggregate by time of the day, should make for an interesting animation. browsing the data there are lots of other lat longs in data, not just iPhone: and ÜT: there are also ones tagged with Coppó:, Pre:, etc...

e11.1 from bash scripts to hadoop

October 18, 2009 | View Comments

let's rewrite v1 using hadoop tooling, code is on github. we'll run hadoop in non distributed standalone mode. in this mode everything runs in a single jvm so it's nice and simple to dev against. in v1 it was
bzcat sample.bz2 | ./extract_locations.pl...

e11.0 tweets around the world

October 16, 2009 | View Comments

was discussing the streaming twitter api with steve and though i knew about the private firehose i didn't know there was a lighter weight public gardenhose interface! since discovering this my pvr has basically been running
curl -u mat_kelcey:XXX http://stream.twitter.com/1/statuses/sample.json |\ ...

e10.4 communities in social graphs

October 06, 2009 | View Comments

social graphs, like twitter or facebook, often follow the pattern of having clusters of highly connected components with an occasional edge joining these clusters. these connecting edges define the boundaries of communities in the social network and can be identified by...

simple statistics with R

October 03, 2009 | View Comments

i'm learning a new statistics language called R and it's pretty cool.
make a vector ...
> c(3,1,4,1,5,9,2,6,5,3,5,8)
[1] 3 1 4 1 5 9 2 6 5 3 5 8
turn it into a frequency table ...
> table(c(3,1,4,1,5,9,2,6,5,3,5,8))
1 2 3 4 5...

do a degree via youtube

October 01, 2009 | View Comments

i'm amazed by how much great content is on youtube, how could you NOT learn something!?
13 x 1hr Statistical Aspects of Data Mining (Stats 202)
20 x 1hr Machine Learning...

e10.3 twitter crawl progress

September 29, 2009 | View Comments

since the twitter api is rate limited it's quite slow to crawl twitter and after most of a week i've still only managed to get info on 8,000 users. i probably should subscribe to get a 20,000 an hr...

e10.2 tgraph crawl order example

September 21, 2009 | View Comments

let's consider an example of the crawl order for tgraph... we seed our frontier with 'a' and bootstrap cost of 0. fetching the info for 'a' shows 2 outedges to 'b' and 'c', from our cost formula these all have cost 0...

e10.1 crawling twitter

September 19, 2009 | View Comments

our first goal is to get some data and the twitter api makes getting the data trivial. i'm focused mainly on the friends stuff but because it only gives user ids i'll also get the user info so i can...

e10.0 introducing tgraph

September 19, 2009 | View Comments

so e9 sip is on hold for a bit while i kick off e10 tgraph. was looking for another problem to try hadoop with and came across a classic graph one, pagerank. a well understood algorithm like page rank will...

first hadoop experiment

September 16, 2009 | View Comments

just finished my first hadoop experiment, matpalm.com/sip; not fantastic results but heaps of feedback from the hadoop mailing group. more results coming soon...

how using compressed data can make your app faster

June 28, 2009 | View Comments

when working with larger data sets (ie more than can fit in memory) there are two important resources to juggle… cpu: how quickly can you process the data. disk io: how quickly can you get data to the cpu. i remember reading once...

erlang profiling

April 22, 2009 | View Comments

i just found fprof, the erlang profiler, by randomly clicking around the erlang man page list. try
fprof:apply(Module, Function, Args).
fprof:profile().
fprof:analyse().
for an interesting breakdown of a call...

bin packing

December 14, 2008 | View Comments

how to decide what next to backup onto a dvd? when is brute force good enough? will a random walk get a good enough result faster? matpalm.com/burn.it...

the median of a trillion numbers

November 15, 2008 | View Comments

i got asked in an interview once “how would you find the median of a trillion numbers across a thousand machines?” the question has haunted me, until now. here’s my ruby and erlang implementation with a bit of running amazon ec2 thrown in...

fastmap and the jaccard distance

October 31, 2008 | View Comments

given a set of pairwise distances how do you determine what points correspond to those distances? my latest experiment considers this problem in relation to jaccard distances, a resemblance measure similar to jaccard coefficients used in a previous experiment. by using the...

openmp = easy multi threading

October 13, 2008 | View Comments

openmp is a compiler library, available in gcc since v4.2, for giving hints to a compiler about where code can be parallelized. say we have some code
for(int i=0; i<HUGE_NUMBER; ++i)
  deadHardCalculation(i);
we can make this run multi threaded by simply...

shingling and the jaccard index

October 06, 2008 | View Comments

on the weekend i did another experiment using shingling and the jaccard index to try to determine if two sets of data were “duplicates”. it works quite well and includes a ruby and c++ version with low level bit operations. project page...

old projects...