brain of mat kelcey
friend clustering by term usage
June 25, 2010 at 11:39 PM | categories: twitter, network, infochimps
recently signed up to the infochimps api and wanted to do something quick and dirty to get a feel for it. so here's a little experiment
- get the people i follow on twitter
- look up the words that "represent" them according to the infochimps word bag api
- build a similarity matrix based on the common use of those terms
- plot the connectivity for the top 30 or so pairings
it's basically partitioned into three groups...
- veztek (my boss john) and smcinnes (steve from the lonely planet community team) in the top right
- a big clump of nosqlness with mongodb - hbase - jpatanooga - kevinweil in the bottom...
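a minimal ruby sketch of the kind of pairwise scoring described above, assuming each person has already been reduced to a bag of terms; the handles and terms are made up and jaccard overlap stands in for whatever scoring the original code used:

require 'set'

# jaccard similarity between two term sets: |intersection| / |union|
def jaccard(terms_a, terms_b)
  a, b = terms_a.to_set, terms_b.to_set
  return 0.0 if (a | b).empty?
  (a & b).size.to_f / (a | b).size
end

# made up handles and term bags standing in for the word bag api output
term_bags = {
  'alice' => %w[hadoop pig mapreduce nosql],
  'bob'   => %w[pig hadoop cheese],
  'carol' => %w[knitting cheese],
}

# pairwise similarity matrix, strongest pairings first
pairs = term_bags.keys.combination(2).map do |p, q|
  [[p, q], jaccard(term_bags[p], term_bags[q])]
end
pairs.sort_by { |_, score| -score }.first(30).each do |(p, q), score|
  puts "#{p} #{q} #{'%.3f' % score}"
end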
country codes in world cup tweets - viz1
June 21, 2010 at 07:43 PM | categories: worldcup, twitter, visualisation
#worldcup tweet viz1 from Mat Kelcey on Vimeo.
here's a simple visualisation of the use of official country codes (eg #aus) in a week's worth of tweets from the search stream for #worldcup. rate is about 2 hours of tweets per sec. orb size denotes relative frequency of that country code. edges denote that those two countries feature a lot in the same tweets. movement is based on gravitational-like attraction along edges. the quiet period at about 0:17 is a twitter outage :)
here's the original processing applet version with a bit more discussion...
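not the processing code behind the video, just a tiny ruby sketch of the movement rule it describes, assuming each orb has an x,y position and each edge pulls its two endpoints slightly closer every frame:

# one animation step: every edge pulls its endpoints towards each other
# by a small fraction of the distance between them
def attract_along_edges(positions, edges, strength = 0.05)
  edges.each do |a, b|
    ax, ay = positions[a]
    bx, by = positions[b]
    dx, dy = bx - ax, by - ay
    positions[a] = [ax + dx * strength, ay + dy * strength]
    positions[b] = [bx - dx * strength, by - dy * strength]
  end
  positions
end

# made up country codes and co-occurrence edges
positions = { 'aus' => [0.0, 0.0], 'ger' => [10.0, 5.0], 'eng' => [3.0, 8.0] }
edges = [%w[aus ger], %w[ger eng]]
10.times { attract_along_edges(positions, edges) }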
#worldcup twitter analytics
June 14, 2010 at 10:06 PM | categories: worldcup, twitter
since the world cup started i've spent more time looking at twitter data about the games than the actual games themselves. what a sad data nerd i am! anyways, here's the first few days' analysis based on the use of official country tags (eg #aus) in the search stream for #worldcup. tomorrow i might look in more detail at one of the games, wondering how many variants of 'goooooooal' i'll find :D...
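a throwaway ruby sketch of that variant count, assuming a hypothetical tweets.txt with one tweet per line; any run of g's, o's, a's and l's counts as a spelling of 'goal':

# count distinct spellings like goal, gooooal, goooaaal across a file of tweets
variants = Hash.new(0)
File.foreach('tweets.txt') do |line|
  line.downcase.scan(/\bg+o+a+l+\b/) { |v| variants[v] += 1 }
end
variants.sort_by { |_, n| -n }.each { |v, n| puts "#{n}\t#{v}" }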
trending topics in tweets about cheese; part2
May 01, 2010 at 04:54 PM | categories: twitter, trending, e15, pig
prototyping in ruby was a great way to prove the concept but my main motivation for this project was to play some more with pig. the main approach will be
- maintain a relation with one record per token
- fold 1 hour's worth of new data at a time into the model
- check the entries for the latest hour for any trends
the full version is on github. read on for a line by line walkthrough! the ruby impl used the simplest approach possible for calculating mean and standard deviation; just keep a record of all the values seen so far and recalculate for each new value. for...
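that simplest approach, as a small ruby sketch (not the code from the post): keep every value seen so far and recompute the mean and standard deviation from scratch on each update:

# naive running stats: store all values, recalculate on every update
class NaiveStats
  def initialize
    @values = []
  end

  def add(v)
    @values << v
  end

  def mean
    @values.sum.to_f / @values.size
  end

  def stddev
    m = mean
    Math.sqrt(@values.map { |v| (v - m)**2 }.sum / @values.size)
  end
end

stats = NaiveStats.new
[3, 4, 5, 2, 40].each { |count| stats.add(count) }   # made up hourly counts
puts "mean=#{stats.mean} sd=#{stats.stddev}"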
trending topics in tweets about cheese; part1
April 27, 2010 at 11:42 PM | categories: cheese, twitter, trending, e15
what does it mean for a topic to be 'trending'? consider the following time series (430e3 tweets containing cheese, collected over a month and bucketed into hourly timeslots). without a formal definition we can just look at this and say that the series was trending where there was a spike just before time 600. as a start then let's just define a trend as a value that was greater than what was 'expected'. one really nice simple algorithm for detecting a trend is to say a value, v, is trending if v > mean + 3 * standard deviation of the data seen...
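the rule from the excerpt as a runnable ruby sketch; the hourly counts are made up:

# a value v is 'trending' if v > mean + 3 * standard deviation of the data seen before it
def trending?(v, history)
  return false if history.size < 2
  mean = history.sum.to_f / history.size
  sd = Math.sqrt(history.map { |x| (x - mean)**2 }.sum / history.size)
  v > mean + 3 * sd
end

counts = [10, 12, 9, 11, 10, 13, 48, 11]   # made up hourly 'cheese' tweet counts
counts.each_with_index do |v, i|
  puts "hour #{i}: #{v} #{'<-- trending' if trending?(v, counts[0...i])}"
end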
e10.6 community detection for my twitter network
April 04, 2010 at 12:58 PM | categories: e10, twitter, betweenness, social network, graph
last night i applied my network decomposition algorithm to a graph of some of the people near me on twitter. first i built a friend graph for 100 people 'around' me (taken from a crawl i did last year). by 'friend' i mean that if alice follows bob then bob also follows alice. here's the graph; some things to note though: it was an unfinished crawl (can a crawl of twitter EVER be finished?) and was done october last year so is a bit out of date. and here is the dendrogram decomposition. some interesting clusterings come out... right at the bottom we have a...
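a small ruby sketch of that 'friend' definition, assuming follows maps each user to the set of people they follow; names are made up and this isn't the crawl code:

require 'set'

# keep only mutual follows: alice and bob are friends iff each follows the other
def friend_graph(follows)
  friends = Hash.new { |h, k| h[k] = Set.new }
  follows.each do |user, followees|
    followees.each do |other|
      friends[user] << other if follows.fetch(other, Set.new).include?(user)
    end
  end
  friends
end

follows = {
  'alice' => Set['bob', 'carol'],
  'bob'   => Set['alice'],
  'carol' => Set['alice'],
}
p friend_graph(follows)   # both directions of each mutual follow are kept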
sentiment analysis training data using mechanical turk
March 12, 2010 at 09:57 PM | categories: twitter, mechanical turk, analysis, sentiment
want to try doing some sentiment analysis work on tweets but i need some good training data. i could label a heap of tweets myself as being positive, neutral or negative but instead this seems to be the perfect job for mechanical turk. so i put up 100 'cream cheese' tweets on mechanical turk, asked for 3 opinions per tweet and offered $0.01 per opinion. took under 30 minutes to get back all 300 opinions and only cost $4.50 ($3 for the work, $1.50 admin fee). the results are interesting in themselves... mostly they are consistent; for example all three sentiments for bagels and cream...
mongodb + twitter + yahoo term extractor = fun!
March 07, 2010 at 09:38 PM | categories: term extraction, mongodb, twitter, json, yahoo
ran a little experiment in using yahoo term extraction yesterday and it worked well enough. here's some code to pass some text to yahoo and get back an array of terms. i've got to say mongodb is such an easy tool for working with json data. these 20 odd lines insert a text json tweet stream into mongo. so simple, why can't all code be this easy...
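roughly what those 20 odd lines do, sketched here with the current ruby mongo driver rather than whatever the post originally used: read a stream of json tweets on stdin and insert each one as a document:

require 'json'
require 'mongo'

# one document per tweet, straight from the stream into a 'tweets' collection
client = Mongo::Client.new(['127.0.0.1:27017'], database: 'twitter')
tweets = client[:tweets]

STDIN.each_line do |line|
  begin
    tweets.insert_one(JSON.parse(line))
  rescue JSON::ParserError
    next   # skip keep-alive newlines and partial lines
  end
end

it could be fed by piping the curl stream from the other posts straight into it.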
tweets about cheese
November 15, 2009 at 08:45 PM | categories: ngrams, cheese, twitter
people tweet about all sorts of stuff. sometimes it's really important ground breaking world changing stuff... but most of the time it's ridiculous waste of time stuff like 'i ate some cheese'. in fact how much do people actually tweet about cheese? and when they do, what are the most important cheese related topics? let's gather some data...
bash> curl -s -u user:password http://stream.twitter.com/1/statuses/filter.json?track=cheese
let's poke around, but first some l33t hax0r bash aliases for the sake of brevity
alias t='tail'
alias h='head'
alias s='sort'
alias u='uniq'
alias g='grep'
let's start with a sample, the first 10 tweets...
bash> ./parse_cheese_out.rb < cheese.out | h
Pasta with pesto and cheese. Some watermelon but alas did not...
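parse_cheese_out.rb isn't shown in the excerpt; a plausible minimal version, assuming cheese.out is the raw json from the filter stream, would just print the text field of each tweet:

#!/usr/bin/env ruby
require 'json'

# emit just the text of each tweet, one per line, ignoring anything unparseable
STDIN.each_line do |line|
  begin
    tweet = JSON.parse(line)
    puts tweet['text'] if tweet['text']
  rescue JSON::ParserError
    next
  end
end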
e11.3 at what time does the world tweet?
October 28, 2009 at 09:22 PM | categories: e11, twitter, r
consider the graph below which shows the proportion of tweets per 10 min slot of the day (GMT+0). it compares 4.7e6 tweets with any location vs 320e3 tweets with identifiable lat lons. some interesting observations with unanswered questions... the ebb and flow is not just a result of the time of day for high twitter traffic areas. the reduction between 06:00 and 10:00 comes close to zero, which can't be right; there is never a worldwide time when internet traffic hits zero. does twitter turn down its gardenhose for capacity reasons? the number of tweets with lat lons is correlated to those without EXCEPT past...
e11.2 aggregating tweets by time of day
October 24, 2009 at 01:02 PM | categories: e11, twitter, hadoop, pig
for v3 let's aggregate by time of the day; should make for an interesting animation. browsing the data there are lots of other lat longs in the data, not just iPhone: and ÜT:, there are also ones tagged with Coppó:, Pre:, etc. perhaps i should just try to take anything that looks like a lat long. furthermore let's switch to a bigger dataset again, 4.7e6 tweets from Oct 13 07:00 thru Oct 19 17:00. i've been streaming all my tweets (as previously discussed) and storing them in a directory json_stream. here are the steps... use a streaming script to take a tweet in json...
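a minimal ruby sketch of 'anything that looks like a lat long': grab the first signed decimal pair out of the location string, whatever prefix it has (not the streaming script from the post):

# matches things like "iPhone: 35.670086,139.740766", "ÜT: -33.87,151.21", "Pre: 48.85, 2.35"
LAT_LON = /(-?\d+\.\d+)\s*,\s*(-?\d+\.\d+)/

def extract_lat_lon(location)
  m = LAT_LON.match(location.to_s)
  return nil unless m
  lat, lon = m[1].to_f, m[2].to_f
  return nil unless lat.between?(-90, 90) && lon.between?(-180, 180)
  [lat, lon]
end

p extract_lat_lon('iPhone: 35.670086,139.740766')   # => [35.670086, 139.740766]
p extract_lat_lon('melbourne, australia')            # => nil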
e11.1 from bash scripts to hadoop
October 18, 2009 at 02:10 PM | categories: e11, maps, twitter, hadoop, pig
let's rewrite v1 using hadoop tooling, code is on github. we'll run hadoop in non-distributed standalone mode. in this mode everything runs in a single jvm so it's nice and simple to dev against. in v1 it was
bzcat sample.bz2 | ./extract_locations.pl > locations
using the awesome hadoop streaming interface it's not too different. this interface allows you to specify any app as the mapper or reducer. the main difference is that it works on directories, not just files. for the mapper we'll use exactly the same script as before, extract_locations.pl, and since there is no reduce component of this job we...
e11.0 tweets around the world
October 16, 2009 at 08:47 PM | categories: e11, maps, twitter
was discussing the streaming twitter api with steve and though i knew about the private firehose i didn't know there was a lighter weight public gardenhose interface! since discovering this my pvr has basically been running
curl -u mat_kelcey:XXX http://stream.twitter.com/1/statuses/sample.json | gzip -9 - > sample.json.gz
but what am i going to do with all this data? while poking around i noticed there were a fair number of iPhone: and ÜT: lat long tagged locations (eg iPhone: 35.670086,139.740766) so as a first hack let's do some work extracting lat longs and displaying them as heat map points on a map. all the code is...
e10.4 communities in social graphs
October 06, 2009 at 08:05 PM | categories: e10, twitter, social network, betweenness, algorithms, graph
social graphs, like twitter or facebook, often follow the pattern of having clusters of highly connected components with an occasional edge joining these clusters. these connecting edges define the boundaries of communities in the social network and can be identified by algorithms that measure betweenness. the girvan-newman algorithm can be used to decompose a graph hierarchically based on successive removal of the edges with the highest betweenness. the algorithm is basically
- calculate the betweenness of each edge (using an all shortest paths algorithm)
- remove the edge(s) with the highest betweenness
- check for connected components (using tarjan's algorithm)
- repeat for the graph, or the subgraphs if the graph was split...
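a small self contained ruby sketch of that loop, assuming an undirected graph stored as a hash of node => neighbour list; the betweenness here is the exact shortest-path-count version and the component check is a plain depth first search rather than tarjan's, so it's an illustration of the idea, not the post's code:

# breadth first search from s, returning distances and shortest path counts
def bfs_counts(graph, s)
  dist  = { s => 0 }
  sigma = Hash.new(0)
  sigma[s] = 1
  queue = [s]
  until queue.empty?
    u = queue.shift
    graph[u].each do |v|
      unless dist.key?(v)
        dist[v] = dist[u] + 1
        queue << v
      end
      sigma[v] += sigma[u] if dist[v] == dist[u] + 1
    end
  end
  [dist, sigma]
end

# betweenness of each undirected edge: for every pair (s,t), the fraction of
# shortest s-t paths that pass through the edge
def edge_betweenness(graph)
  info = graph.keys.map { |n| [n, bfs_counts(graph, n)] }.to_h
  bet  = Hash.new(0.0)
  graph.keys.combination(2) do |s, t|
    dist_s, sig_s = info[s]
    dist_t, sig_t = info[t]
    next unless dist_s.key?(t)
    graph.each do |u, nbrs|
      nbrs.each do |v|
        next unless dist_s.key?(u) && dist_t.key?(v)
        next unless dist_s[u] + 1 + dist_t[v] == dist_s[t]
        bet[[u, v].sort] += sig_s[u] * sig_t[v] / sig_s[t].to_f
      end
    end
  end
  bet
end

# connected components via a simple depth first search (stands in for tarjan's)
def components(graph)
  seen, comps = {}, []
  graph.each_key do |start|
    next if seen[start]
    comp, stack = [], [start]
    seen[start] = true
    until stack.empty?
      u = stack.pop
      comp << u
      graph[u].each { |v| (seen[v] = true; stack << v) unless seen[v] }
    end
    comps << comp
  end
  comps
end

# girvan-newman: repeatedly remove the highest betweenness edge and report the splits
def girvan_newman(graph)
  graph = graph.transform_values(&:dup)
  while graph.values.any? { |nbrs| !nbrs.empty? }
    (u, v), _score = edge_betweenness(graph).max_by { |_, b| b }
    graph[u].delete(v)
    graph[v].delete(u)
    puts "removed #{u}-#{v}, components: #{components(graph).inspect}"
  end
end

# two made up triangles joined by a single bridge edge c-d
graph = {
  'a' => %w[b c], 'b' => %w[a c], 'c' => %w[a b d],
  'd' => %w[c e f], 'e' => %w[d f], 'f' => %w[d e],
}
girvan_newman(graph)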
e10.3 twitter crawl progress
September 29, 2009 at 08:43 PM | categories: e10, twitter, algorithms, hadoop
since the twitter api is rate limited it's quite slow to crawl twitter and after most of a week i've still only managed to get info on 8,000 users. i probably should subscribe to get a 20,000 an hr limit instead of the 150 i'm on now. i'll just let it chug along in the background on my pvr. while the crawl has been going on i've been trying some things on the data to decide what to do with it. i've managed to write a version of pagerank using pig which has been very interesting. (for those who haven't seen...
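not the pig version mentioned above, just a few lines of ruby showing the iteration pagerank boils down to, assuming links maps each user to the users they follow (made up graph):

# damped pagerank by repeated iteration over a follower graph
def pagerank(links, iterations: 20, damping: 0.85)
  nodes = (links.keys + links.values.flatten).uniq
  rank  = nodes.to_h { |n| [n, 1.0 / nodes.size] }
  iterations.times do
    nxt = nodes.to_h { |n| [n, (1 - damping) / nodes.size] }
    links.each do |src, outs|
      next if outs.empty?
      share = damping * rank[src] / outs.size
      outs.each { |dst| nxt[dst] += share }
    end
    rank = nxt
  end
  rank
end

links = { 'a' => %w[b c], 'b' => %w[c], 'c' => %w[a], 'd' => %w[c] }
pagerank(links).sort_by { |_, r| -r }.each { |n, r| puts "#{n} #{'%.3f' % r}" }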
e10.1 crawling twitter
September 19, 2009 at 09:31 PM | categories: e10, twitter, algorithms, graph
our first goal is to get some data and the twitter api makes getting the data trivial. i'm focused mainly on the friends stuff but because it only gives user ids i'll also get the user info so i can put names to ids. a depth first crawl makes no sense for this experiment; we're unlikely to get the entire graph and are more interested in following edges "close" to me, so instead we'll use a breadth first search. since any call to twitter is expensive (in time that is, they rate limit their api calls) instead of a plain vanilla breadth...
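a minimal ruby sketch of the plain vanilla breadth first version, before whatever tweak the post goes on to describe; fetch_friends is a stand in for a twitter api wrapper, not real client code:

# breadth first crawl out from a seed user; fetch_friends is any callable
# returning the ids a given user follows (eg a rate limited api wrapper)
def breadth_first_crawl(seed, fetch_friends, max_users: 1000)
  graph = {}
  seen  = { seed => true }
  queue = [seed]
  until queue.empty? || graph.size >= max_users
    user = queue.shift
    friends = fetch_friends.call(user)
    graph[user] = friends
    friends.each do |f|
      next if seen[f]
      seen[f] = true
      queue << f
    end
  end
  graph
end

# a fake fetcher just to show the shape of the call
fake = ->(user) { { 1 => [2, 3], 2 => [1], 3 => [1, 4] }.fetch(user, []) }
p breadth_first_crawl(1, fake, max_users: 10)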
e10.0 introducing tgraph
September 19, 2009 at 02:41 PM | categories: big data, e10, twitter, hadoop, pig, algorithms
so e9 sip is on hold for a bit while i kick off e10 tgraph. i was looking for another problem to try hadoop with and came across a classic graph one, pagerank. a well understood algorithm like pagerank will be a great chance to try pig, the query language that sits on top of hadoop mapreduce. so we need a graph to work on. my first thought was to use one of the wikipedia linkage dumps but that feels a bit sterile. instead it's a good excuse to do a little crawl of the following graph of twitter. this will also be...
old projects...
- latent semantic analysis via the singular value decomposition (for dummies)
- semi supervised naive bayes
- statistical synonyms
- round the world tweets
- decomposing social graphs on twitter
- do it yourself statistically improbable phrases
- should i burn it?
- the median of a trillion numbers
- deduping with resemblance metrics
- simple supervised learning / should i read it?
- audioscrobbler experiments
- chaoscope experiment