brain of mat kelcey
trending topics in tweets about cheese; part2
May 01, 2010 at 04:54 PM | categories: twitter, trending, e15, pig
prototyping in ruby was a great way to prove the concept but my main motivation for this project was to play some more with pig.

the main approach will be
- maintain a relation with one record per token
- fold 1 hour's worth of new data at a time into the model
- check the entries for the latest hour for any trends

the full version is on github. read on for a line by line walkthrough!

the ruby impl used the simplest approach possible for calculating mean and standard deviation; just keep a record of all the values seen so far and recalculate for each new value. for...
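as a rough illustration of that simplest approach, here is a minimal sketch in ruby. it is not the code from the github repo; the class name `NaiveStats` and its methods are made up for this example, and it assumes the population standard deviation (the excerpt doesn't say which variant the original used).

```ruby
# naive stats: remember every value seen so far and recompute
# mean and standard deviation from scratch whenever they are asked for.
# NaiveStats is a hypothetical name, not taken from the original project.
class NaiveStats
  def initialize
    @values = []
  end

  # record a new observation (e.g. a token's count for one hour)
  def add(value)
    @values << value.to_f
    self
  end

  def mean
    return 0.0 if @values.empty?
    @values.sum / @values.size
  end

  # population standard deviation (an assumption); 0.0 when there is no data
  def std_dev
    return 0.0 if @values.empty?
    m = mean
    Math.sqrt(@values.sum { |v| (v - m)**2 } / @values.size)
  end
end

stats = NaiveStats.new
[2, 4, 4, 4, 5, 5, 7, 9].each { |v| stats.add(v) }
puts "mean=#{stats.mean} std_dev=#{stats.std_dev}"
```

the cost is obvious: memory grows with every value and each recalculation walks the whole list, which is fine for a prototype but awkward once there is one record per token.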
a pig screencast
January 17, 2010 at 02:22 PM | categories: screencast, hadoop, pig
pig demo from Mat Kelcey on Vimeo.

based on a talk i gave at work recently...
e11.2 aggregating tweets by time of day
October 24, 2009 at 01:02 PM | categories: e11, twitter, hadoop, pig
for v3 let's aggregate by time of the day, should make for an interesting animation.

browsing the data there are lots of other lat longs in the data, not just iPhone: and ÜT: ones; there are also ones tagged with Coppó:, Pre:, etc. perhaps should just try to take anything that looks like a lat long.

furthermore let's switch to a bigger dataset again, 4.7e6 tweets from Oct 13 07:00 thru Oct 19 17:00. i've been streaming all my tweets (as previously discussed) and have been storing them in a directory json_stream

here are the steps...

use a streaming script to take a tweet in json...
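to make "anything that looks like a lat long" concrete, here is a small ruby sketch. it is an assumption about what such a matcher could look like, not the script from the post; the regex `LAT_LONG` and the helper `extract_lat_long` are hypothetical names. it ignores the prefix (iPhone:, ÜT:, Pre:, ...) entirely and just looks for two decimal numbers separated by a comma, then sanity checks the ranges.

```ruby
# hypothetical pattern: optional sign, 1 to 3 digits, a decimal point, more digits,
# then a comma and a second number of the same shape.
LAT_LONG = /(-?\d{1,3}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)/

# returns [lat, long] as floats, or nil if nothing plausible is found.
# extract_lat_long is an illustrative helper, not part of the original code.
def extract_lat_long(location)
  match = LAT_LONG.match(location.to_s)
  return nil unless match

  lat = match[1].to_f
  long = match[2].to_f
  # reject pairs that can't be coordinates
  return nil unless lat.between?(-90, 90) && long.between?(-180, 180)

  [lat, long]
end

p extract_lat_long("iPhone: 37.774929,-122.419415")
p extract_lat_long("ÜT: -37.8136,144.9631")
p extract_lat_long("melbourne, australia")
```

matching on the shape of the numbers rather than a list of known prefixes means new client tags get picked up for free, at the cost of the occasional false positive.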
e11.1 from bash scripts to hadoop
October 18, 2009 at 02:10 PM | categories: e11, maps, twitter, hadoop, pig
let's rewrite v1 using hadoop tooling, code is on github.

we'll run hadoop in non distributed standalone mode. in this mode everything runs in a single jvm so it's nice and simple to dev against.

in v1 it was

bzcat sample.bz2 | ./extract_locations.pl > locations

using the awesome hadoop streaming interface it's not too different. this interface allows you to specify any app as the mapper or reducer. the main difference is that it works on directories not just files.

for the mapper we'll use exactly the same script as before; extract_locations.pl and since there is no reduce component of this job so we...
e10.0 introducing tgraph
September 19, 2009 at 02:41 PM | categories: big data, e10, twitter, hadoop, pig, algorithms
so e9 sip is on hold for a bit while i kick off e10 tgraph. was looking for another problem to try hadoop with and came across a classic graph one, pagerank. a well understood algorithm like pagerank will be a great chance to try pig, the query language that sits on top of hadoop mapreduce.

so we need a graph to work on. my first thoughts were using one of the wikipedia linkage dumps but it feels a bit sterile. instead it's a good excuse to do a little crawl of the following graph of twitter.

this will also be...
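for anyone who hasn't met pagerank before, here is a tiny in-memory sketch in ruby. it is not the pig implementation the post goes on to build; the function name `pagerank`, the damping factor of 0.85, the fixed iteration count and the toy graph are all assumptions picked for illustration.

```ruby
# graph: node => list of nodes it points at (e.g. the users someone follows).
# damping and iterations are assumed defaults, not values from the post.
def pagerank(graph, damping: 0.85, iterations: 50)
  nodes = (graph.keys + graph.values.flatten).uniq
  n = nodes.size.to_f
  rank = nodes.to_h { |node| [node, 1.0 / n] }

  iterations.times do
    # rank held by nodes with no outgoing edges is shared evenly across all nodes
    dangling = nodes.sum { |node| graph.fetch(node, []).empty? ? rank[node] : 0.0 }
    base = (1.0 - damping) / n + damping * dangling / n
    next_rank = nodes.to_h { |node| [node, base] }

    # every other node splits its damped rank equally among the nodes it points at
    graph.each do |node, targets|
      next if targets.empty?
      share = damping * rank[node] / targets.size
      targets.each { |target| next_rank[target] += share }
    end
    rank = next_rank
  end
  rank
end

# toy following graph: a and b both follow c, c follows nobody
ranks = pagerank({ "a" => ["c"], "b" => ["c"], "c" => [] })
ranks.sort_by { |_, score| -score }.each { |node, score| puts "#{node} #{score.round(4)}" }
```

each iteration is a join between the current ranks and the edge list followed by a group and sum, which is the shape of computation that maps naturally onto mapreduce.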
old projects...
- latent semantic analysis via the singular value decomposition (for dummies)
- semi supervised naive bayes
- statistical synonyms
- round the world tweets
- decomposing social graphs on twitter
- do it yourself statistically improbable phrases
- should i burn it?
- the median of a trillion numbers
- deduping with resemblance metrics
- simple supervised learning / should i read it?
- audioscrobbler experiments
- chaoscope experiment