Posts Tagged ‘twitter’

friend clustering by term usage

Friday, June 25th, 2010

recently signed up to the infochimps api and wanted to do something quick and dirty to get a feel for it.

so here’s a little experiment

  1. get the people i follow on twitter
  2. look up the words that “represent” them according to the infochimps word bag api
  3. build a similarity matrix based on the common use of those terms
  4. plot the connectivity for the top 30 or so pairings
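as a rough idea of what steps 3 and 4 could look like, here's a minimal python sketch. it assumes the word bags have already been fetched into a dict of user to terms (the infochimps call itself isn't shown), and it uses jaccard similarity, which is my assumption here and not necessarily the measure the real code uses. the names `jaccard`, `top_pairs` and the sample users are all made up for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of terms two users have in common, out of all terms either uses."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_pairs(term_bags, n=30):
    """Score every pair of users and return the n most similar pairs."""
    scored = [
        (jaccard(term_bags[u], term_bags[v]), u, v)
        for u, v in combinations(sorted(term_bags), 2)
    ]
    # highest similarity first, ties broken alphabetically
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return scored[:n]


# made-up term bags, purely for illustration
term_bags = {
    "alice": ["hadoop", "hbase", "nosql"],
    "bob": ["hbase", "nosql", "mongodb"],
    "carol": ["travel", "books"],
}
```

the pairs coming out of `top_pairs` are what you'd hand to whatever graph plotting tool you like.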

it’s basically partitioned into three groups…

  1. veztek (my boss john) and smcinnes (steve from the lonely planet community team) in the top right
  2. a big clump of nosqlness with mongodb – hbase – jpatanooga – kevinweil in the bottom left
  3. everyone else

an interesting enough result given the time taken; the code’s on github

country codes in world cup tweets – viz1

Monday, June 21st, 2010

#worldcup tweet viz1 from Mat Kelcey on Vimeo.

here’s a simple visualisation of the use of official country codes (eg #aus) in a week’s worth of tweets from the search stream for #worldcup.

rate is about 2 hours of tweets per sec. orb size denotes the relative frequency of that country code. an edge denotes that those two countries feature a lot in the same tweets. movement is based on gravity-like attraction along edges.
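the counting behind the orbs and edges can be sketched in a few lines of python. this is only an illustration under assumptions: `COUNTRY_CODES` is a made-up subset of the official tags, and `codes_in` / `tally` are hypothetical names, not the code behind the video.

```python
import re
from collections import Counter
from itertools import combinations

# hypothetical subset of codes; the real list would be longer
COUNTRY_CODES = {"aus", "ger", "bra", "eng", "usa"}


def codes_in(tweet):
    """Return the sorted, de-duplicated country codes hashtagged in a tweet."""
    tags = re.findall(r"#(\w+)", tweet.lower())
    return sorted(set(tags) & COUNTRY_CODES)


def tally(tweets):
    """Count each code (orb size) and each co-occurring pair (edge weight)."""
    freq, edges = Counter(), Counter()
    for tweet in tweets:
        codes = codes_in(tweet)
        freq.update(codes)
        edges.update(combinations(codes, 2))
    return freq, edges
```

running `tally` over a sliding window of tweets would give the changing orb sizes and edge weights for each frame.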

the quiet period at about 0:17 is a twitter outage :)

here’s the original processing applet version with a bit more discussion

#worldcup twitter analytics

Monday, June 14th, 2010

since the world cup started i’ve spent more time looking at twitter data about the games than the actual games themselves. what a sad data nerd i am!

anyways, here’s the first few days’ analysis based on the use of official country tags (eg #aus) in the search stream for #worldcup.

tomorrow i might look in more detail at one of the games, wondering how many variants of ‘goooooooal’ i’ll find :D

trending topics in tweets about cheese; part2

Saturday, May 1st, 2010

prototyping in ruby was a great way to prove the concept but my main motivation for this project was to play some more with pig.

the main approach will be

  1. maintain a relation with one record per ngram we want to monitor for trends
  2. fold 1 hour’s worth of new data at a time into the model
  3. check the entries for the latest hour for any trends
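the three steps above can be sketched in python (not pig, this is just to show the shape of the idea). assumptions: each ngram's record holds a count, a sum and a sum of squares, the trend test is the mean + 3 * std dev rule from part1, and `min_history` is a made-up guard so brand new ngrams don't trend straight away. as a simplification, hours where an ngram doesn't appear aren't folded in as zeros.

```python
import math
from collections import defaultdict


class RunningStats:
    """Count, sum and sum of squares: enough to recover mean and std dev."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, value):
        self.n += 1
        self.total += value
        self.total_sq += value * value

    def mean(self):
        return self.total / self.n

    def std_dev(self):
        variance = self.total_sq / self.n - self.mean() ** 2
        return math.sqrt(max(variance, 0.0))  # guard against float rounding


def fold_hour(model, hour_counts, min_history=3):
    """Check this hour's counts against history, then fold them into the model."""
    trending = []
    for ngram, count in hour_counts.items():
        stats = model[ngram]
        if stats.n >= min_history and count > stats.mean() + 3 * stats.std_dev():
            trending.append(ngram)
        stats.add(count)
    return sorted(trending)
```

usage would be something like `model = defaultdict(RunningStats)` followed by one `fold_hour(model, counts)` call per hour of data.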

the full version is on github. read on for a line by line walkthrough

(more…)

trending topics in tweets about cheese; part1

Tuesday, April 27th, 2010

trending topics

what does it mean for a topic to be ‘trending’? consider the following time series (430e3 tweets containing cheese, collected over a one month period and bucketed into hourly timeslots)

without a formal definition we can just look at this and say that the series was trending where there was a spike just before time 600. as a start then let’s just define a trend as a value that was greater than was ‘expected’.

how can we calculate trending?

one really nice simple algorithm for detecting a trend is to say a value, v, is trending if v > mean + 3 * standard deviation of the data seen so far. (thanks @peteskomoroch for the suggestion, works a treat)
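here's a minimal python sketch of that rule. two assumptions of mine: 'data seen so far' is taken to mean the values before v (not including v itself), and the population standard deviation is used. `trending_points` and `warmup` are made-up names. it recomputes the stats at every step, so it's fine for a small series but not how you'd do it at scale.

```python
from statistics import mean, pstdev


def trending_points(series, warmup=2):
    """Indexes where a value exceeds mean + 3 * std dev of all earlier values."""
    hits = []
    for i in range(warmup, len(series)):
        seen = series[:i]
        threshold = mean(seen) + 3 * pstdev(seen)
        if series[i] > threshold:
            hits.append(i)
    return hits
```

for example `trending_points([10, 12, 11, 10, 12, 40, 11])` flags only the spike at index 5.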

let’s consider the same time series as before but this time with some overlaid data;
green – the mean
red – minimum trend value ( = mean + 3 * std dev )
blue – instances where the value > minimum trend value

(more…)

e10.6 community detection for my twitter network

Sunday, April 4th, 2010

last night i applied my network decomposition algorithm to a graph of some of the people near me in twitter.

first i built a friend graph for 100 people ‘around’ me (taken from a crawl i did last year). by ‘friend’ i mean that if alice follows bob then bob also follows alice.
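that definition of ‘friend’ boils down to keeping only the reciprocal follows. a small python sketch, assuming the crawl has been loaded into a dict of user to the set of users they follow (`friend_edges` and the sample data are made up for illustration):

```python
def friend_edges(follows):
    """Keep only reciprocal follows: an undirected edge when a->b and b->a both exist."""
    edges = set()
    for a, followed in follows.items():
        for b in followed:
            if a != b and a in follows.get(b, ()):
                edges.add(tuple(sorted((a, b))))
    return sorted(edges)


# made-up crawl data
follows = {
    "alice": {"bob", "carol"},
    "bob": {"alice"},
    "carol": {"dave"},
}
```

here only alice and bob end up as friends; carol doesn't follow alice back and dave was never crawled.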

here’s the graph. some things to note though; it was an unfinished crawl (can a crawl of twitter EVER be finished?) and was done in october last year so it’s a bit out of date.

friends (more…)

sentiment analysis training data using mechanical turk

Friday, March 12th, 2010

want to try doing some sentiment analysis work on tweets but i need some good training data.

i could label a heap of tweets myself as being positive, neutral or negative but instead this seems to be the perfect job for mechanical turk

so i put up 100 ‘cream cheese’ tweets on mechanical turk, asked for 3 opinions per tweet and offered $0.01 per opinion. took under 30 minutes to get back all 300 opinions and only cost $4.50 ($3 for the work, $1.50 admin fee)

the results are interesting in themselves…

mostly they are consistent;

for example all three sentiments for ‘bagels and cream cheese for breakfast. very original’ were neutral

and all three sentiments for ‘very few things are as good as a warm nyc bagel with cream cheese first thing in the am’ were positive.

but occasionally they aren’t consistent;

the tweet ‘developing a recipe for orange cream cheese swirled cardamom brownies… that’s too long a name. hmm… suggestions?’ had one positive, one neutral and one negative

interestingly there was no case of a tweet having all opinions being negative; even ‘bad idea. dont eat bagel with mixed berry cream cheese, right after u washed ur mouth with listerine. .’ ended up with two negatives and one positive (?)

hmmmm
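one way to turn three opinions per tweet into a training label is a simple majority vote, keeping track of whether the workers all agreed. this is a sketch of that idea, not something the experiment above actually did; `summarise` is a made-up name.

```python
from collections import Counter


def summarise(opinions):
    """Return (majority label or None, whether all workers agreed)."""
    counts = Counter(opinions)
    label, votes = counts.most_common(1)[0]
    unanimous = votes == len(opinions)
    majority = label if votes > len(opinions) / 2 else None
    return majority, unanimous
```

a three way split like the brownies tweet comes back as `(None, False)`, which is a handy signal to drop the tweet or ask for more opinions.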

mongodb + twitter + yahoo term extractor = fun!

Sunday, March 7th, 2010

ran a little experiment in using yahoo term extraction yesterday and it worked well enough. here’s some code to pass some text to yahoo and get back an array of terms

i’ve got to say mongodb is such an easy tool for working with json data. these 20 odd lines insert a text json tweet stream into mongo. so simple, why can’t all code be this easy…

tweets about cheese

Sunday, November 15th, 2009

people tweet about all sorts of stuff.

sometimes it’s really important ground breaking world changing stuff…
but most of the time it’s ridiculous waste of time stuff like ‘i ate some cheese’

in fact how much do people actually tweet about cheese?
and when they do, what are the most important cheese related topics?

let’s gather some data…

(more…)

e11.3 at what time does the world tweet?

Wednesday, October 28th, 2009

consider the graph below which shows the proportion of tweets per 10 min slot of the day (GMT0)

it compares 4.7e6 tweets with any location vs 320e3 tweets with identifiable lat lons
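the bucketing behind a graph like this is simple enough to sketch in python (the graphs themselves were done in R). it assumes the tweet timestamps have already been parsed into datetimes in GMT; `slot_of` and `slot_proportions` are made-up names.

```python
from collections import Counter
from datetime import datetime, timezone

SLOTS_PER_DAY = 24 * 6  # 144 ten-minute slots


def slot_of(ts):
    """Index (0..143) of the ten-minute slot a timestamp falls in."""
    return ts.hour * 6 + ts.minute // 10


def slot_proportions(timestamps):
    """Fraction of all tweets landing in each ten-minute slot of the day."""
    counts = Counter(slot_of(ts) for ts in timestamps)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * SLOTS_PER_DAY
    return [counts[s] / total for s in range(SLOTS_PER_DAY)]
```

computing this once for all tweets and once for the lat lon subset gives two series of 144 proportions that can be plotted against each other.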

timeslices_freq.comparison

some interesting observations with unanswered questions…

  1. the ebb and flow is not just a result of the time of day in high twitter traffic areas. between 06:00 and 10:00 the volume drops close to zero. that can’t reflect real usage; there is never a time when worldwide internet traffic hits zero. does twitter turn down its gardenhose for capacity reasons?
  2. the number of tweets with lat lons is correlated with the number without EXCEPT past 17:00 where the lat lon cases drop drastically. have a couple of ideas banging around my head why this is the case but nothing concrete. any ideas?

speaking of correlation here’s a scatterplot of tweets with lat lons vs without. the period past 17:00, where the two stop being correlated, shows up as a quite obvious cluster.

timeslices_freq.scatter

and here is the R code for these graphs