<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>brain of matpalm &#187; twitter</title>
	<atom:link href="http://matpalm.com/blog/tag/twitter/feed/" rel="self" type="application/rss+xml" />
	<link>http://matpalm.com/blog</link>
	<description>thoughts from a data scientist wannabe</description>
	<lastBuildDate>Mon, 16 Aug 2010 11:38:22 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>friend clustering by term usage</title>
		<link>http://matpalm.com/blog/2010/06/25/friend-clustering-by-term-usage/</link>
		<comments>http://matpalm.com/blog/2010/06/25/friend-clustering-by-term-usage/#comments</comments>
		<pubDate>Fri, 25 Jun 2010 13:39:08 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[infochimps]]></category>
		<category><![CDATA[network]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=662</guid>
		<description><![CDATA[recently signed up to the infochimps api and wanted to do something quick and dirty to get a feel for it.
so here&#8217;s a little experiment

get the people i follow on twitter
look up the words that &#8220;represent&#8221; them according to the infochimps word bag api
build a similarity matrix based on the common use of those terms
plot [...]]]></description>
			<content:encoded><![CDATA[<p>recently signed up to the <a href="http://api.infochimps.com/">infochimps api</a> and wanted to do something quick and dirty to get a feel for it.</p>
<p>so here&#8217;s a little experiment</p>
<ol>
<li>get the people i follow on twitter</li>
<li>look up the words that &#8220;represent&#8221; them according to the <a href="http://api.infochimps.com/describe/soc/net/tw/wordbag">infochimps word bag api</a></li>
<li>build a similarity matrix based on the common use of those terms</li>
<li>plot the connectivity for the top 30 or so pairings</li>
</ol>
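<p>a rough ruby sketch of step 3, assuming each person&#8217;s word bag is just a set of terms and using jaccard overlap as the similarity. the post doesn&#8217;t say which measure was actually used, and the names and terms here are made up for illustration.</p>

```ruby
require 'set'

# hypothetical word bags: person => the terms that "represent" them
WORD_BAGS = {
  'alice' => Set['hadoop', 'pig', 'nosql'],
  'bob'   => Set['nosql', 'hbase', 'hadoop'],
  'carol' => Set['travel', 'community']
}

# jaccard similarity (an assumed measure): shared terms / all terms used by either
def jaccard(a, b)
  union = (a | b).size
  union.zero? ? 0.0 : (a & b).size.to_f / union
end

# similarity for every unordered pair of people, strongest pairing first
def top_pairings(bags, limit)
  bags.keys.combination(2)
      .map { |p, q| [p, q, jaccard(bags[p], bags[q])] }
      .sort_by { |_, _, sim| -sim }
      .first(limit)
end

top_pairings(WORD_BAGS, 30).each { |p, q, sim| puts format('%s %s %.2f', p, q, sim) }
```

<p>the pairings that come out of something like this are the edges to plot.</p>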
<p><a href="http://matpalm.com/blog/wp-content/uploads/2010/06/top35.png"><img src="http://matpalm.com/blog/wp-content/uploads/2010/06/top35-300x178.png" alt="" title="top35" width="300" height="178" class="aligncenter size-medium wp-image-666" /></a></p>
<p>it&#8217;s basically partitioned into three groups&#8230;</p>
<ol>
<li>veztek (my boss john) and smcinnes (steve from the lonely planet community team) in the top right</li>
<li>a big clump of nosqlness with mongodb &#8211; hbase &#8211; jpatanooga &#8211; kevinweil in the bottom left</li>
<li>everyone else</li>
</ol>
<p>an interesting enough result given the time taken; the code is <a href="http://github.com/matpalm/twitter/tree/master/friend_cluster/">on github</a></p>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/06/25/friend-clustering-by-term-usage/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>country codes in world cup tweets &#8211; viz1</title>
		<link>http://matpalm.com/blog/2010/06/21/country-codes-in-world-cup-tweets-viz1/</link>
		<comments>http://matpalm.com/blog/2010/06/21/country-codes-in-world-cup-tweets-viz1/#comments</comments>
		<pubDate>Mon, 21 Jun 2010 09:43:32 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[twitter]]></category>
		<category><![CDATA[visualisation]]></category>
		<category><![CDATA[worldcup]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=654</guid>
		<description><![CDATA[
#worldcup tweet viz1 from Mat Kelcey on Vimeo.
here&#8217;s a simple visualisation of the use of official country codes (eg #aus) in a week&#8217;s worth of tweets from the search stream for #worldcup.
rate is about 2 hours of tweets per second. orb size denotes the relative frequency of that country code. edges denote that those two countries feature [...]]]></description>
			<content:encoded><![CDATA[<p><object width="400" height="300"><param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="movie" value="http://vimeo.com/moogaloop.swf?clip_id=12710800&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=&amp;fullscreen=1" /><embed src="http://vimeo.com/moogaloop.swf?clip_id=12710800&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=&amp;fullscreen=1" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="400" height="300"></embed></object>
<p><a href="http://vimeo.com/12710800">#worldcup tweet viz1</a> from <a href="http://vimeo.com/user2935988">Mat Kelcey</a> on <a href="http://vimeo.com">Vimeo</a>.</p>
<p>here&#8217;s a simple visualisation of the use of official country codes (eg #aus) in a week&#8217;s worth of tweets from the search stream for #worldcup.</p>
<p>rate is about 2 hours of tweets per second. orb size denotes the relative frequency of that country code. edges denote that those two countries feature a lot in the same tweets. movement is based on gravity-like attraction along edges.</p>
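<p>the numbers behind the orbs and edges could be counted along these lines. this is only a ruby sketch and not the code behind the video; the tag list and tweets are made up.</p>

```ruby
# hypothetical subset of the official country tags
COUNTRY_TAGS = %w[#aus #ger #eng #usa]

# returns [tag frequencies, pair co-occurrence counts] for a list of tweets
def tag_stats(tweets)
  freq = Hash.new(0)
  pairs = Hash.new(0)
  tweets.each do |tweet|
    # distinct country tags in this tweet, sorted so each pair has one key
    tags = tweet.downcase.split.select { |w| COUNTRY_TAGS.include?(w) }.uniq.sort
    tags.each { |t| freq[t] += 1 }
    tags.combination(2).each { |pair| pairs[pair] += 1 }
  end
  [freq, pairs]
end

freq, pairs = tag_stats([
  '#worldcup #aus v #ger tonight',
  'come on #AUS #worldcup',
  '#ger #aus what a game #worldcup'
])
p freq   # would drive orb size
p pairs  # would drive edges
```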
<p>the quiet period at about 0:17 is a twitter outage :)</p>
<p><a href="http://matpalm.com/world_cup/viz1/">here&#8217;s the original processing applet version</a> with a bit more discussion</p>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/06/21/country-codes-in-world-cup-tweets-viz1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>#worldcup twitter analytics</title>
		<link>http://matpalm.com/blog/2010/06/14/worldcup-twitter-analytics/</link>
		<comments>http://matpalm.com/blog/2010/06/14/worldcup-twitter-analytics/#comments</comments>
		<pubDate>Mon, 14 Jun 2010 12:06:49 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[twitter]]></category>
		<category><![CDATA[worldcup]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=644</guid>
		<description><![CDATA[since the world cup started i&#8217;ve spent more time looking at twitter data about the games than the actual games themselves. what a sad data nerd i am!
anyways, here&#8217;s the first few days&#8217; analysis based on the use of official country tags (eg #aus) in the search stream for #worldcup.
tomorrow i might look in more detail [...]]]></description>
			<content:encoded><![CDATA[<p>since the world cup started i&#8217;ve spent more time looking at twitter data about the games than the actual games themselves. what a sad data nerd i am!</p>
<p>anyways, here&#8217;s the <a href="http://bit.ly/dkR46o">first few days&#8217; analysis</a> based on the use of official country tags (eg <a href="http://twitter.com/#search?q=%23aus">#aus</a>) in the search stream for <a href="http://twitter.com/#search?q=%23worldcup">#worldcup</a>.</p>
<p>tomorrow i might look in more detail at one of the games, wondering how many variants of &#8216;goooooooal&#8217; i&#8217;ll find :D</p>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/06/14/worldcup-twitter-analytics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>trending topics in tweets about cheese; part2</title>
		<link>http://matpalm.com/blog/2010/05/01/trending-topics-in-tweets-about-cheese-part2/</link>
		<comments>http://matpalm.com/blog/2010/05/01/trending-topics-in-tweets-about-cheese-part2/#comments</comments>
		<pubDate>Sat, 01 May 2010 06:54:53 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[e15]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[trending]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=559</guid>
		<description><![CDATA[prototyping in ruby was a great way to prove the concept but my main motivation for this project was to play some more with pig.
the main approach will be

maintain a relation with one record per ngram we want to monitor for trending
fold 1 hour&#8217;s worth of new data at a time into the model
check the [...]]]></description>
			<content:encoded><![CDATA[<p>prototyping in ruby was a great way to prove the concept but my main motivation for this project was to play some more with pig.</p>
<p>the main approach will be</p>
<ol>
<li>maintain a relation with one record per ngram we want to monitor for trending</li>
<li>fold 1 hour&#8217;s worth of new data at a time into the model</li>
<li>check the entries for the latest hour for any trends</li>
</ol>
<p>the <a href="http://github.com/matpalm/trending/blob/master/pig/trending.pig">full version is on github</a>. read on for a line by line walkthrough</p>
<p><span id="more-559"></span></p>
<p>the ruby impl used the simplest approach possible for calculating mean and stddev; maintain a record of all the values seen so far and recalculate for each new value.</p>
<p>for our pig version we&#8217;ll take a fixed space approach. rather than keep <em>all</em> the values for each time series it turns out we can get away with storing just 3&#8230;</p>
<ol>
<li>n: the number of values</li>
<li>m: the current mean of all values</li>
<li>ms: the current mean of the squares of all values</li>
</ol>
<p>the idea is that the mean<sub>n+1</sub> = ( n * mean<sub>n</sub> + new value ) / ( n + 1 )<br />
the same update applied to the square of the new value gives the mean of the squares<sub>n+1</sub>, and the standard deviation<sub>n+1</sub> is then sqrt( mean of the squares<sub>n+1</sub> - ( mean<sub>n+1</sub> )<sup>2</sup> )</p>
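<p>here&#8217;s a small ruby sketch of the fixed space idea, checked against the naive keep-everything calculation. the struct and method names are mine, not from the repo.</p>

```ruby
# fixed space running stats: n values seen, mean m, mean of squares ms
RunningStats = Struct.new(:n, :m, :ms) do
  # fold one new value in without keeping any of the earlier values
  def add(value)
    self.m  = (n * m + value) / (n + 1).to_f
    self.ms = (n * ms + value * value) / (n + 1).to_f
    self.n += 1
    self
  end

  # population std dev derived from the two means
  # (clamped at zero to guard against rounding error)
  def std_dev
    Math.sqrt([ms - m * m, 0.0].max)
  end
end

values = [1, 2, 1, 4]
stats = values.reduce(RunningStats.new(0, 0.0, 0.0)) { |s, v| s.add(v) }

# the naive approach: keep every value and recalculate from scratch
naive_mean = values.sum / values.size.to_f
naive_sd = Math.sqrt(values.sum { |v| (v - naive_mean)**2 } / values.size)

puts stats.m, naive_mean
puts stats.std_dev, naive_sd
```

<p>both approaches should agree up to floating point rounding, but the first only ever stores three numbers per time series.</p>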
<p>let&#8217;s go over the pig script one command at a time.</p>
<p>we&#8217;ll assume we&#8217;ve already run it 6 times and we&#8217;re now folding in the 7th hour</p>
<p>the first thing is to load the existing version of the model, in this case stored in the file &#8216;model.006&#8217;<br />
it contains everything we need for checking the trending for each ngram</p>
<pre><span>
&gt; raw_model = load 'model.006' as (key:chararray, n:int, m:double, ms:double);

&gt; describe raw_model;
raw_model: {key: chararray, n: int, m: double, ms: double}

&gt; dump raw_model;
(a b,6,1.3333333333333333,2.0)
(a a,3,1.3333333333333333,2.0)
(a c,4,1.25,1.75)
(a d,1,2.0,4.0)
(b a,3,1.0,1.0)
(b d,1,2.0,4.0)
(b c,6,1.5,2.5)
(d c,1,2.0,4.0)
(c a,4,1.0,1.0)
(d e,1,1.0,1.0)
(c d,4,2.0,4.0)
(d a,2,1.0,1.0)
</span></pre>
<p>next we tag each entry from the loaded model with a zero frequency. we&#8217;ll see later how this makes it easier to fold in the new data.</p>
<pre><span>
&gt; model = foreach raw_model generate key, n, m, ms, 0 as f;

&gt; describe model;
model: {key: chararray, n: int, m: double, ms: double, f: int}

&gt; dump model;
(a b,6,1.3333333333333333,2.0,0)
(a a,3,1.3333333333333333,2.0,0)
(a c,4,1.25,1.75,0)
(a d,1,2.0,4.0,0)
(b a,3,1.0,1.0,0)
(b d,1,2.0,4.0,0)
(b c,6,1.5,2.5,0)
(d c,1,2.0,4.0,0)
(c a,4,1.0,1.0,0)
(d e,1,1.0,1.0,0)
(c d,4,2.0,4.0,0)
(d a,2,1.0,1.0,0)
</span></pre>
<p>now that we&#8217;ve loaded the existing version of the model we can load the next hour of data, in this case contained in &#8216;chunks/006&#8217;.</p>
<pre><span>
&gt; next_chunk = load 'chunks/006';

&gt; dump next_chunk;
(a b a b)
(c d a b)
(a b c)
(a d d d)
</span></pre>
<p>from the text we want to get the frequency of the ngrams.<br />
the breaking apart of each line into its 2-grams is handled by a simple ruby script; <a href="http://github.com/matpalm/trending/blob/master/pig/ngram.rb">ngram.rb</a></p>
<pre><span>
&gt; define ngramer `ngram.rb` ship('ngram.rb');
&gt; ngrams = stream next_chunk through ngramer as (key:chararray);

&gt; describe ngrams;
ngrams: {key: chararray}

&gt; dump ngrams;
(a b)
(b a)
(a b)
(c d)
(d a)
(a b)
(a b)
(b c)
(a d)
(d d)
(d d)
</span></pre>
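<p>the real ngram.rb is in the repo linked above. as a hypothetical stand in, splitting each line into its 2-grams could be as small as this, run here over the example chunk rather than stdin.</p>

```ruby
# split a line of text into its 2-grams, eg "a b c" gives ["a b", "b c"]
def bigrams(line)
  line.split.each_cons(2).map { |pair| pair.join(' ') }
end

# a streaming script would read lines from stdin and emit one ngram per line;
# here we just feed it the four lines of the example chunk
chunk = ['a b a b', 'c d a b', 'a b c', 'a d d d']
chunk.each { |line| puts bigrams(line) }
```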
<p>calculating the frequencies of the ngrams is a simple two step process of first grouping by the key&#8230;</p>
<pre><span>
&gt; ngrams_grouped = group ngrams by key;

&gt; describe ngrams_grouped;
ngrams_grouped: {group: chararray, ngrams: {key: chararray}}

&gt; dump ngrams_grouped;
(a b,{(a b),(a b),(a b),(a b)})
(a d,{(a d)})
(b a,{(b a)})
(b c,{(b c)})
(c d,{(c d)})
(d a,{(d a)})
(d d,{(d d),(d d)})
</span></pre>
<p>&#8230;and then generating the key, frequency pairs</p>
<pre><span>
&gt; ngram_freq = foreach ngrams_grouped generate group as key, SIZE(ngrams) as f;

&gt; describe ngram_freq;
ngram_freq: {key: chararray, f: long}

&gt; dump ngram_freq;
(a b,4L)
(a d,1L)
(b a,1L)
(b c,1L)
(c d,1L)
(d a,1L)
(d d,2L)
</span></pre>
<p>from this we know all the distinct 2grams that are contained in the next chunk we&#8217;re analysing<br />
for each of these 2grams one of two things is true;</p>
<ol>
<li>either the ngram has been seen before (thus it has an entry in the model)</li>
<li>this is the first time we&#8217;ve seen it, in which case we need to add a new entry to the model</li>
</ol>
<p>the easiest way i&#8217;ve worked out in pig to handle this is to generate a &#8216;seed&#8217; model just for this chunk and fold it into the real model by unioning the relations</p>
<p>(i&#8217;ve been using pig 0.3 to keep in line with the current version of elastic map reduce but it might be easier with the various extra joins that are in later versions of pig)</p>
<p>so first we generate the &#8216;seed&#8217; relation&#8230;</p>
<pre><span>
&gt; seed_values = foreach ngram_freq generate key, 0 as n, 0.0 as m, 0.0 as ms, f;

&gt; describe seed_values;
seed_values: {key: chararray, n: int, m: double, ms: double, f: long}

&gt; dump seed_values;
(a b,0,0.0,0.0,4L)
(a d,0,0.0,0.0,1L)
(b a,0,0.0,0.0,1L)
(b c,0,0.0,0.0,1L)
(c d,0,0.0,0.0,1L)
(d a,0,0.0,0.0,1L)
(d d,0,0.0,0.0,2L)
</span></pre>
<p>&#8230;and fold it in with a 3 step process; unioning with the original model, grouping and collapsing</p>
<p>first the union&#8230;</p>
<pre><span>
&gt; model_plus_seed = union model, seed_values;

&gt; describe model_plus_seed;
model_plus_seed: {key: chararray, n: int, m: double, ms: double, f: long}

&gt; dump model_plus_seed;
(a b,0,0.0,0.0,4L)
(a b,6,1.3333333333333333,2.0,0L)
(a d,0,0.0,0.0,1L)
(a a,3,1.3333333333333333,2.0,0L)
(b a,0,0.0,0.0,1L)
(a c,4,1.25,1.75,0L)
(b c,0,0.0,0.0,1L)
(a d,1,2.0,4.0,0L)
(c d,0,0.0,0.0,1L)
(b a,3,1.0,1.0,0L)
(d a,0,0.0,0.0,1L)
(b d,1,2.0,4.0,0L)
(d d,0,0.0,0.0,2L)
(b c,6,1.5,2.5,0L)
(d c,1,2.0,4.0,0L)
(c a,4,1.0,1.0,0L)
(d e,1,1.0,1.0,0L)
(c d,4,2.0,4.0,0L)
(d a,2,1.0,1.0,0L)
</span></pre>
<p>then the grouping&#8230;</p>
<pre><span>
&gt; model_plus_seed2 = group model_plus_seed by key;

&gt; describe model_plus_seed2;
model_plus_seed2: {group: chararray, model_plus_seed: {key: chararray, n: int, m: double, ms: double, f: long}}

&gt; dump model_plus_seed2;
(a a,{(a a,3,1.3333333333333333,2.0,0L)})
(a b,{(a b,0,0.0,0.0,4L),(a b,6,1.3333333333333333,2.0,0L)})
(a c,{(a c,4,1.25,1.75,0L)})
(a d,{(a d,0,0.0,0.0,1L),(a d,1,2.0,4.0,0L)})
(b a,{(b a,0,0.0,0.0,1L),(b a,3,1.0,1.0,0L)})
(b c,{(b c,0,0.0,0.0,1L),(b c,6,1.5,2.5,0L)})
(b d,{(b d,1,2.0,4.0,0L)})
(c a,{(c a,4,1.0,1.0,0L)})
(c d,{(c d,0,0.0,0.0,1L),(c d,4,2.0,4.0,0L)})
(d a,{(d a,0,0.0,0.0,1L),(d a,2,1.0,1.0,0L)})
(d c,{(d c,1,2.0,4.0,0L)})
(d d,{(d d,0,0.0,0.0,2L)})
(d e,{(d e,1,1.0,1.0,0L)})
</span></pre>
<p>and finally the collapsing using MAX&#8230;</p>
<pre><span>
&gt; model_n =
     foreach model_plus_seed2 generate
        group as key,
        MAX(model_plus_seed.n) as n,
        MAX(model_plus_seed.m) as m,
        MAX(model_plus_seed.ms) as ms,
        MAX(model_plus_seed.f) as f;

&gt; describe model_n;
model_n: {key: chararray, n: int, m: double, ms: double, f: long}

&gt; dump model_n;
(a a,3,1.3333333333333333,2.0,0L)
(a b,6,1.3333333333333333,2.0,4L)
(a c,4,1.25,1.75,0L)
(a d,1,2.0,4.0,1L)
(b a,3,1.0,1.0,1L)
(b c,6,1.5,2.5,1L)
(b d,1,2.0,4.0,0L)
(c a,4,1.0,1.0,0L)
(c d,4,2.0,4.0,1L)
(d a,2,1.0,1.0,1L)
(d c,1,2.0,4.0,0L)
(d d,0,0.0,0.0,2L)
(d e,1,1.0,1.0,0L)
</span></pre>
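<p>why does MAX work for the collapse? every placeholder is zero and no real value is negative, so the max of each column is the real value whenever there is one. the same union, group and collapse in a few lines of ruby (a sketch with rows as plain arrays, not how pig does it internally):</p>

```ruby
# rows are [key, n, m, ms, f]
# model rows carry f = 0; seed rows carry n = m = ms = 0
model = [['a b', 6, 1.3333333333333333, 2.0, 0], ['a a', 3, 1.3333333333333333, 2.0, 0]]
seeds = [['a b', 0, 0.0, 0.0, 4], ['d d', 0, 0.0, 0.0, 2]]

# union the two relations, group by key and take the max of each column
def fold(model, seeds)
  (model + seeds).group_by(&:first).map do |key, rows|
    [key] + (1..4).map { |col| rows.map { |row| row[col] }.max }
  end
end

fold(model, seeds).sort.each { |row| p row }
```

<p>&#8216;a b&#8217; keeps its stats and picks up f=4, &#8216;a a&#8217; is untouched and &#8216;d d&#8217; arrives as a fresh all zero entry with f=2.</p>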
<p>at this stage we have the original model woven in with the new data but still need to update the mean and mean of squares for the values from the latest hour.</p>
<p>we can do this by first separating out the values we need to update based on whether the frequency is non zero<br />
(recall non zero frequencies represent ngrams from the latest hour)</p>
<pre><span>
&gt; split model_n into to_update if f&gt;0, not_to_update if f==0;

&gt; dump to_update;
(a b,6,1.3333333333333333,2.0,4L)
(a d,1,2.0,4.0,1L)
(b a,3,1.0,1.0,1L)
(b c,6,1.5,2.5,1L)
(c d,4,2.0,4.0,1L)
(d a,2,1.0,1.0,1L)
(d d,0,0.0,0.0,2L)

&gt; dump not_to_update;
(a a,3,1.3333333333333333,2.0,0L)
(a c,4,1.25,1.75,0L)
(b d,1,2.0,4.0,0L)
(c a,4,1.0,1.0,0L)
(d c,1,2.0,4.0,0L)
(d e,1,1.0,1.0,0L)
</span></pre>
<p>we can now update the mean and mean of squares based on the new frequency values</p>
<pre><span>
&gt; model_n1 =
     foreach to_update {
         m2  = ((n*m)+f)/(n+1);
         ms2 = ((n*ms)+(f*f))/(n+1);
         generate key, n+1 as n, m2 as m, ms2 as ms, f;
     }

&gt; describe model_n1;
model_n1: {key: chararray, n: int, m: double, ms: double, f: long}

&gt; dump model_n1;
(a b,7,1.7142857142857142,4.0,4L)
(a d,2,1.5,2.5,1L)
(b a,4,1.0,1.0,1L)
(b c,7,1.4285714285714286,2.2857142857142856,1L)
(c d,5,1.8,3.4,1L)
(d a,3,1.0,1.0,1L)
(d d,1,2.0,4.0,2L)
</span></pre>
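<p>as a sanity check, here is the same per row update in ruby (a sketch; the array layout just mirrors the pig tuples) applied to two of the rows above.</p>

```ruby
# fold this hour's frequency f into a row's stats
# a row is [key, n, m, ms, f] as in the pig relations
def update(row)
  key, n, m, ms, f = row
  denom = (n + 1).to_f
  [key, n + 1, (n * m + f) / denom, (n * ms + f * f) / denom, f]
end

p update(['a d', 1, 2.0, 4.0, 1])  # prints ["a d", 2, 1.5, 2.5, 1]
p update(['d d', 0, 0.0, 0.0, 2])  # prints ["d d", 1, 2.0, 4.0, 2]
```

<p>note how the all zero seed row for &#8216;d d&#8217; needs no special case; with n=0 the update just yields the first value and its square.</p>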
<p>these new rows, along with the rows we didn&#8217;t update, can be stored as the model at time n+1 ready for the next hour&#8217;s chunk</p>
<pre><span>
&gt; to_store = union model_n1, not_to_update;
&gt; store to_store into 'model.007';

&gt; dump to_store;
(a b,7,1.7142857142857142,4.0,4L)
(a a,3,1.3333333333333333,2.0,0L)
(a d,2,1.5,2.5,1L)
(a c,4,1.25,1.75,0L)
(b a,4,1.0,1.0,1L)
(b d,1,2.0,4.0,0L)
(b c,7,1.4285714285714286,2.2857142857142856,1L)
(c a,4,1.0,1.0,0L)
(c d,5,1.8,3.4,1L)
(d c,1,2.0,4.0,0L)
(d a,3,1.0,1.0,1L)
(d e,1,1.0,1.0,0L)
(d d,1,2.0,4.0,2L)
</span></pre>
<p>now that we&#8217;ve updated the model we can start making the trending check!</p>
<p>first step is to filter out entries that correspond to ngrams we are seeing for the first time<br />
( a new item can&#8217;t be trending )</p>
<pre><span>
&gt; requiring_trending_check = filter model_n1 by n&gt;1;

&gt; dump requiring_trending_check;
(a b,7,1.7142857142857142,4.0,4L)
(a d,2,1.5,2.5,1L)
(b a,4,1.0,1.0,1L)
(b c,7,1.4285714285714286,2.2857142857142856,1L)
(c d,5,1.8,3.4,1L)
(d a,3,1.0,1.0,1L)
</span></pre>
<p>and finally we can make the trending calculation!<br />
we can calculate the minimum trending value, based on mean + twice std dev&#8230;</p>
<pre><span>
&gt; calc_min_trending =
     foreach requiring_trending_check {
        sd_lhs = n * ms;
        sd_rhs = n * (m*m);
        sd = org.apache.pig.piggybank.evaluation.math.SQRT((sd_lhs-sd_rhs)/n);
        min_trend_value = m + (2*sd);
        generate key, f, m as mean, sd as std_dev,
                 min_trend_value as min_trend_value,
                 f / min_trend_value as percent_over_trend;
    }

&gt; describe calc_min_trending;
calc_min_trending: {key: chararray, f: long, mean: double, std_dev: double, min_trend_value: double, percent_over_trend: double}

&gt; dump calc_min_trending;
(a b,4L,1.7142857142857142,1.0301575072754257,3.7746007288365657,1.059714732061981)
(a d,1L,1.5,0.5,2.5,0.4)
(b a,1L,1.0,0.0,1.0,1.0)
(b c,1L,1.4285714285714286,0.4948716593053934,2.4183147471822153,0.4135111036167584)
(c d,1L,1.8,0.4,2.6,0.3846153846153848)
(d a,1L,1.0,0.0,1.0,1.0)
</span></pre>
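<p>note that (n*ms - n*m*m)/n is just ms - m*m, so the std dev only needs the two means. a ruby sketch of the same calculation (the method names are mine, and the clamp at zero is my own guard against rounding, it is not in the pig script):</p>

```ruby
# min trend value = mean + 2 * std dev, with the std dev derived from
# the mean (m) and the mean of squares (ms)
def min_trend_value(m, ms)
  m + 2 * Math.sqrt([ms - m * m, 0.0].max)
end

# ratio of this hour's frequency to the min trend value; over 1 means trending
def percent_over_trend(f, m, ms)
  f / min_trend_value(m, ms)
end

puts percent_over_trend(1, 1.5, 2.5)                 # the 'a d' row
puts percent_over_trend(4, 1.7142857142857142, 4.0)  # the 'a b' row
```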
<p>&#8230; and any entries with a frequency over the min trending value are deemed trending!<br />
( for this example it&#8217;s only the one )</p>
<pre><span>
&gt; trending = filter calc_min_trending by percent_over_trend &gt; 1;

&gt; describe trending;
trending: {key: chararray, f: long, mean: double, std_dev: double, min_trend_value: double, percent_over_trend: double}

&gt; dump trending;
(a b,4L,1.7142857142857142,1.0301575072754257,3.7746007288365657,1.059714732061981)
</span></pre>
<p>as a normalisation step i&#8217;ve been playing with also factoring in the frequency itself,<br />
haven&#8217;t come to a conclusion on whether this is a better metric or not&#8230;</p>
<pre><span>
&gt; trending2 =
     foreach trending {
        normalised_trend_value = org.apache.pig.piggybank.evaluation.math.LOG10(f) * percent_over_trend;
        generate key, min_trend_value, percent_over_trend, normalised_trend_value as normalised_trend_value;
     }

&gt; describe trending2;
trending2: {key: chararray, min_trend_value: double, percent_over_trend: double, normalised_trend_value: double}

&gt; dump trending2;
(a b,3.7746007288365657,1.059714732061981,0.6380118423953504)
</span></pre>
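<p>the same weighting in ruby, as a sketch for poking at how the metric behaves:</p>

```ruby
# weight the ratio by log10 of the frequency so that bigger counts score higher
def normalised_trend_value(f, percent_over_trend)
  Math.log10(f) * percent_over_trend
end

puts normalised_trend_value(4, 1.059714732061981)
puts normalised_trend_value(1, 1.5)
```

<p>one thing this makes obvious: since log10(1) is 0, any ngram seen only once in the hour scores 0 no matter how far over its trend value it is.</p>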
<p>and finally store the top trending values for processing!</p>
<pre><span>
&gt; trending_sorted = order trending2 by normalised_trend_value desc;
&gt; top_50 = limit trending_sorted 50;
&gt; store top_50 into 'trending.model.006';
</span></pre>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/05/01/trending-topics-in-tweets-about-cheese-part2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>trending topics in tweets about cheese; part1</title>
		<link>http://matpalm.com/blog/2010/04/27/trending-topics-in-tweets-about-cheese-part1/</link>
		<comments>http://matpalm.com/blog/2010/04/27/trending-topics-in-tweets-about-cheese-part1/#comments</comments>
		<pubDate>Tue, 27 Apr 2010 13:42:20 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[cheese]]></category>
		<category><![CDATA[e15]]></category>
		<category><![CDATA[trending]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=499</guid>
		<description><![CDATA[trending topics
what does it mean for a topic to be &#8216;trending&#8217;? consider the following time series (430e3 tweets containing cheese collected over a month period bucketed into hourly timeslots)

without a formal definition we can just look at this and say that the series was trending where there was a spike just before time 600. as [...]]]></description>
			<content:encoded><![CDATA[<h3>trending topics</h3>
<p>what does it mean for a topic to be &#8216;trending&#8217;? consider the following time series (430e3 tweets containing cheese collected over a month period bucketed into hourly timeslots)</p>
<p><a href="http://www.matpalm.com/trending/tweets_over_day.60.nonaggregated.png"><img src="http://www.matpalm.com/trending/tweets_over_day.60.nonaggregated.png" alt="" /></a></p>
<p>without a formal definition we can just look at this and say that the series was trending where there was a spike just before time 600. as a start then let&#8217;s just define a trend as a value that was greater than was &#8216;expected&#8217;.</p>
<h3>how can we calculate trending?</h3>
<p>one really nice simple algorithm for detecting a trend is to say a value, v, is trending if v &gt; mean + 3 * standard deviation of the data seen so far. (thanks <a href="http://www.twitter.com/peteskomoroch">@peteskomoroch</a> for the suggestion, works a treat)</p>
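<p>a minimal ruby sketch of that check. two things here are my assumptions rather than part of the suggestion: each value is compared against the values seen before it, and nothing is flagged until there are a few values of history. the series is made up.</p>

```ruby
# flag each value that exceeds mean + 3 * std dev of the values seen before it
# min_history is an assumed guard so the first few values can't trend
def trending_indexes(series, min_history = 3)
  n = 0
  sum = 0.0
  sum_sq = 0.0
  flagged = []
  series.each_with_index do |v, i|
    if n >= min_history
      mean = sum / n
      sd = Math.sqrt([sum_sq / n - mean * mean, 0.0].max)
      flagged.push(i) if v > mean + 3 * sd
    end
    n += 1
    sum += v
    sum_sq += v * v
  end
  flagged
end

p trending_indexes([10, 12, 11, 9, 10, 50, 11])  # prints [5]
```

<p>the spike of 50 is flagged, and it also drags the mean and std dev up for everything after it.</p>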
<p>let&#8217;s consider the same time series as before but this time with some overlaid data;<br />
<span style="color: green;">green &#8211; the mean</span><br />
<span style="color: red;">red &#8211; minimum trend value ( = mean + 3 * std dev )</span><br />
<span style="color: blue;">blue &#8211; instances where the value &gt; minimum trend value</span></p>
<p><a href="http://www.matpalm.com/trending/tweets_over_day.60.trending.png"><img src="http://www.matpalm.com/trending/tweets_over_day.60.trending.png" alt="" /></a></p>
<p><span id="more-499"></span></p>
<p>here&#8217;s a zoom in on the last 200 values</p>
<p><a href="http://www.matpalm.com/trending/tweets_over_day.60.trending.zoom.png"><img src="http://www.matpalm.com/trending/tweets_over_day.60.trending.zoom.png" alt="" /></a></p>
<p>this works surprisingly well, the mean gives a solid expectation of the value with the standard deviation covering the daily periodic nature of the data.</p>
<p>it&#8217;s not perfect though as this system <em>only</em> ever allows a trend around the peaks of the cycle.</p>
<p>for example consider the troughs which have a frequency value around 250. if we had a value in one of those timeslots that was 1000, ie four times what was expected given that time of day, it would not be considered trending since the value has to be over 1500.</p>
<h3>facet by hour</h3>
<p>one way to handle this is to not have a single time series but instead maintain 24 time series, one for each hour of the day.</p>
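<p>in ruby the faceting might look something like this. it&#8217;s a sketch that assumes the series starts at hour 0 with no gaps, needs a few values of history per hour before flagging anything, and uses made up data.</p>

```ruby
# keep one set of running sums per hour of the day and compare each value
# only against earlier values from the same hour
def trending_by_hour(hourly_counts, min_history = 3)
  stats = Array.new(24) { { n: 0, sum: 0.0, sum_sq: 0.0 } }
  flagged = []
  hourly_counts.each_with_index do |v, i|
    s = stats[i % 24] # assumes the series starts at hour 0 with no gaps
    if s[:n] >= min_history
      mean = s[:sum] / s[:n]
      sd = Math.sqrt([s[:sum_sq] / s[:n] - mean * mean, 0.0].max)
      flagged.push(i) if v > mean + 3 * sd
    end
    s[:n] += 1
    s[:sum] += v
    s[:sum_sq] += v * v
  end
  flagged
end

# five made up days: a trough of about 250 at hour 3, about 1000 otherwise
series = (0...5).flat_map { |d| (0...24).map { |h| (h == 3 ? 250 : 1000) + d } }
series[4 * 24 + 3] = 1000 # a spike in the trough on the last day
p trending_by_hour(series)  # prints [99]
```

<p>a value of 1000 is nothing special for most hours but is flagged when it lands in the hour 3 trough.</p>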
<p>faceting in this way gives the following trending</p>
<p><a href="http://www.matpalm.com/trending/tweets_over_day.60.periodic_trending.png"><img src="http://www.matpalm.com/trending/tweets_over_day.60.periodic_trending.png" alt="" /></a><br />
<a href="http://www.matpalm.com/trending/tweets_over_day.60.periodic_trending.zoom.png"><img src="http://www.matpalm.com/trending/tweets_over_day.60.periodic_trending.zoom.png" alt="" /></a></p>
<p>and though this doesn&#8217;t present any cases of trends at a trough we can see it was prettttty close a number of times.</p>
<h3>facet by ngram</h3>
<p>one other interesting way to facet, and the main purpose of this project, is to maintain a separate time series for each ngram in the tweet.</p>
<p>the top 10 2-grams in my dataset are&#8230;</p>
<table>
<tbody>
<tr>
<td>freq</td>
<td>term1</td>
<td>term2</td>
</tr>
<tr>
<td>44389</td>
<td>and</td>
<td>cheese</td>
</tr>
<tr>
<td>33454</td>
<td>cheese</td>
<td>and</td>
</tr>
<tr>
<td>22815</td>
<td>mac</td>
<td>cheese</td>
</tr>
<tr>
<td>22532</td>
<td>grilled</td>
<td>cheese</td>
</tr>
<tr>
<td>18639</td>
<td>cream</td>
<td>cheese</td>
</tr>
<tr>
<td>15225</td>
<td>the</td>
<td>cheese</td>
</tr>
<tr>
<td>13592</td>
<td>mac</td>
<td>and</td>
</tr>
<tr>
<td>12967</td>
<td>chuck</td>
<td>cheese</td>
</tr>
<tr>
<td>12598</td>
<td>of</td>
<td>cheese</td>
</tr>
<tr>
<td>12296</td>
<td>cheese</td>
<td>on</td>
</tr>
</tbody>
</table>
<p>let&#8217;s look at the time series for a few of them.</p>
<h4>#4 grilled cheese</h4>
<p><a href="http://www.matpalm.com/trending/tweets_over_day.60.grilledcheese.trending.png"><img src="http://www.matpalm.com/trending/tweets_over_day.60.grilledcheese.trending.png" alt="" /></a></p>
<p>we get an interesting result from the very first spike at around 225&#8230; poor fangirl <a href="http://www.twitter.com/rachelljonas">@rachelljonas</a> spent 10 minutes tweeting like crazy trying to get the attention of <a href="http://www.twitter.com/nickjonas">@nickjonas</a> (some popstar i&#8217;ve never heard of) and bumped up &#8216;grilled cheese&#8217; for a single timeslot (here&#8217;s <a href="http://www.matpalm.com/trending/rachelljonas.html">her attempt</a> to get his attention&#8230;)</p>
<p>this raises an interesting point about spam and should possibly be my first pre processing data cleaning step. how should we disregard too many tweets from a single user in a timeslot?</p>
<p>the next spike at around 375 shows potentially my first true trending topic, a sudden increase in the discussion of making grilled cheese. <a href="http://www.matpalm.com/trending/grilled_cheese.html">the data</a> has no dups so looks like it was just grilled cheese time!</p>
<h4>#5 cream cheese</h4>
<p><a href="http://www.matpalm.com/trending/tweets_over_day.60.creamcheese.trending.png"><img src="http://www.matpalm.com/trending/tweets_over_day.60.creamcheese.trending.png" alt="" /></a></p>
<p>one major spike at about 376; looking at <a href="http://www.matpalm.com/trending/cream_cheese.html">the data</a> there might have been a competition being run relating to #gno #bagelfuls ?</p>
<h4>#412 goats cheese</h4>
<p>nothing uber interesting with the &#8216;goats cheese&#8217; time series but it does illustrate an interesting point. for all the examples we&#8217;ve looked at so far each timeslot of an hour has included at least one entry for the 2gram. by the time we&#8217;re getting to the less frequent ngrams we see as many timeslots with a zero frequency as with a non zero frequency.</p>
<p><a href="http://www.matpalm.com/trending/tweets_over_day.60.goatscheese.withzerofill.trending.png"><img src="http://www.matpalm.com/trending/tweets_over_day.60.goatscheese.withzerofill.trending.png" alt="" /></a></p>
<p>interestingly if you only consider the cases where the frequency values are non zero i think you get a better sense of where the values are trending.</p>
<p><a href="http://www.matpalm.com/trending/tweets_over_day.60.goatscheese.trending.png"><img src="http://www.matpalm.com/trending/tweets_over_day.60.goatscheese.trending.png" alt="" /></a></p>
<p>this also turns out to make things easier to process :)</p>
<h4>#1483 apple juice</h4>
<p>with &#8216;apple juice&#8217;, an even less frequent 2gram, the effect is even more noticeable if you ignore the zero frequency cases.</p>
<p><a href="http://www.matpalm.com/trending/tweets_over_day.60.applejuice.withzerofill.trending.png"><img src="http://www.matpalm.com/trending/tweets_over_day.60.applejuice.withzerofill.trending.png" alt="" /></a><br />
<a href="http://www.matpalm.com/trending/tweets_over_day.60.applejuice.trending.png"><img src="http://www.matpalm.com/trending/tweets_over_day.60.applejuice.trending.png" alt="" /></a></p>
<p>so with two ways of faceting the data, either timeslots or ngrams, the next step is porting the algorithm to pig so we can run it at scale. write up coming soon!</p>
<p>( code ( in a pretty raw form ) <a href="http://github.com/matpalm/trending">available at github</a> )</p>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/04/27/trending-topics-in-tweets-about-cheese-part1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>e10.6 community detection for my twitter network</title>
		<link>http://matpalm.com/blog/2010/04/04/375/</link>
		<comments>http://matpalm.com/blog/2010/04/04/375/#comments</comments>
		<pubDate>Sun, 04 Apr 2010 02:58:28 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[betweenness]]></category>
		<category><![CDATA[e10]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[social network]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=375</guid>
		<description><![CDATA[last night i applied my network decomposition algorithm to a graph of some of the people near me in twitter.
first i built a friend graph for 100 people &#8216;around&#8217; me (taken from a crawl i did last year). by &#8216;friend&#8217; i mean that if alice follows bob then bob also follows alice.
here&#8217;s the graph, some [...]]]></description>
			<content:encoded><![CDATA[<p>last night i applied my network decomposition algorithm to a graph of some of the people near me in twitter.</p>
<p>first i built a friend graph for 100 people &#8216;around&#8217; me (taken from a <a href="http://matpalm.com/blog/2009/09/29/e10-3-twitter-crawl-progress/">crawl</a> i did last year). by &#8216;friend&#8217; i mean that if alice follows bob then bob also follows alice.</p>
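<p>in code terms that means keeping only the reciprocal follow edges. a ruby sketch, assuming the crawl boils down to a list of [follower, followed] pairs (the names are placeholders):</p>

```ruby
require 'set'

# follows is a list of [follower, followed] pairs
# keep a pair only if the reverse pair exists too, once per friendship
def friend_edges(follows)
  seen = follows.to_set
  follows.select { |a, b| a != b && seen.include?([b, a]) }
         .map { |pair| pair.sort }
         .uniq
end

p friend_edges([%w[alice bob], %w[bob alice], %w[alice carol]])
# prints [["alice", "bob"]]
```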
<p>here&#8217;s the graph. some things to note though; it was an unfinished crawl (can a crawl of twitter EVER be finished?) and was done october last year so is a bit out of date.</p>
<p><a href="http://matpalm.com/blog/wp-content/uploads/2010/04/friends.jpg"><img class="aligncenter size-large wp-image-377" title="friends" src="http://matpalm.com/blog/wp-content/uploads/2010/04/friends-1024x204.jpg" alt="friends" width="1024" height="204" /></a><span id="more-375"></span></p>
<p>and here is the dendrogram decomposition</p>
<p><a href="http://matpalm.com/blog/wp-content/uploads/2010/04/dendrogram.vert_.600.jpg"><img class="aligncenter size-full wp-image-391" title="dendrogram.vert.600" src="http://matpalm.com/blog/wp-content/uploads/2010/04/dendrogram.vert_.600.jpg" alt="dendrogram.vert.600" width="600" height="1500" /></a>some interesting clusterings come out..</p>
<p>right at the bottom we have a small clique (ie everyone following everyone else) of people i&#8217;ve known from when i was in <em>sydney</em></p>
<p><a href="http://matpalm.com/blog/wp-content/uploads/2010/04/sydney.nokia_.jpg"><img class="aligncenter size-full wp-image-387" title="sydney.nokia" src="http://matpalm.com/blog/wp-content/uploads/2010/04/sydney.nokia_.jpg" alt="sydney.nokia" width="185" height="98" /></a></p>
<p>this small group connects to the group i&#8217;m in; <a href="http://twitter.com/tinybuddha">tinybuddha</a> down to <a href="http://twitter.com/evanbottcher">evanbottcher</a>; which roughly describes the group of people i&#8217;ve met in <em>melbourne</em>.</p>
<p>the order of the single breakaways in the melbourne group is pretty arbitrary. i get quite different ordering if i run the decomposition multiple times due to the random tie breaking involved. i could either run the decomposition multiple times and work out some kind of averaging or choose another more granular way of deciding how to break ties.</p>
<p>the next connector after <em>sydney</em> and <em>melbourne</em> are unified is <a href="http://twitter.com/deanemorrow">deanemorrow</a>, a coworker from when i was at <a href="http://twitter.com/distra">distra</a>. this one sticks out for me as being the biggest flaw in the clustering since it would have made more sense to have him placed near distra at the bottom.</p>
<p>another interesting clique is near me..</p>
<p><a href="http://matpalm.com/blog/wp-content/uploads/2010/04/twers.jpg"><img class="aligncenter size-full wp-image-393" title="twers" src="http://matpalm.com/blog/wp-content/uploads/2010/04/twers.jpg" alt="twers" width="115" height="123" /></a>it has four thoughtworkers; <a href="http://twitter.com/markryall">mark</a>, <a href="http://twitter.com/grillp">gill</a>, <a href="http://twitter.com/debbiecheong">debs</a> and <a href="http://twitter.com/evanbottcher">evan</a> and one sensiser; <a href="http://twitter.com/kornys">korny</a>. did korny perhaps work for thoughtworks in a previous life ;)</p>
<p>another interesting note is there exists a path from me to <a href="http://twitter.com/norvig">peter norvig</a> (who is too busy for twitter it seems) but only because of the huge connector nodes that exist in twitter. an example in this case is <a href="http://twitter.com/tuaw">TUAW</a> who follow 30,000+ people and have even more followers. these nodes cause a bit of noise in the system since they are slightly false representations of what a &#8216;friend&#8217; means in my mind. not sure how to take these numbers into account&#8230;</p>
<p>things to do&#8230;</p>
<ul>
<li>the biggest oversimplification in this system is how i break ties for deciding which edge to cut out next if multiple exist with the same betweenness. currently it chooses the one that would make the most even sized break (based on smallest standard deviation of the connected components). though this is good for breaking a group into even sizes it&#8217;s bad since it favours breaking a single element off a large group. this is what has caused the &#8216;laddering&#8217; we see in the melbourne group.</li>
<li>the shortest path algorithm used to calculate edge betweenness is stochastic and if multiple shortest paths exist only one of them is chosen. it&#8217;d be better if all were considered with a weighting scheme.</li>
<li>it might be better to consider vertex betweenness instead of edge betweenness since one person could exist in multiple groups. if i started down this path though i think i&#8217;d rather just rewrite the lot using something like the <a href="http://en.wikipedia.org/wiki/Clique_percolation_method">clique percolation method</a></li>
</ul>
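<p>the tie break described in the first item can be sketched roughly like this: for each edge tied on betweenness, cut it, measure the connected component sizes and keep the cut with the smallest standard deviation. the helper names and the toy graph are assumptions for illustration (the two &#8216;tied&#8217; edges are simply declared tied), and this is not the implementation in the repo.</p>

```ruby
# sizes of the connected components of an undirected graph (edge list form)
def component_sizes(nodes, edges)
  neighbours = Hash.new { |hash, key| hash[key] = [] }
  edges.each do |a, b|
    neighbours[a].push(b)
    neighbours[b].push(a)
  end
  seen = {}
  sizes = []
  nodes.each do |start|
    next if seen[start]

    # breadth first walk from every node not reached yet
    seen[start] = true
    queue = [start]
    size = 0
    until queue.empty?
      node = queue.shift
      size += 1
      neighbours[node].each do |n|
        next if seen[n]

        seen[n] = true
        queue.push(n)
      end
    end
    sizes.push(size)
  end
  sizes
end

# population standard deviation
def std_dev(values)
  mean = values.sum.to_f / values.size
  Math.sqrt(values.sum { |v| (v - mean)**2 } / values.size)
end

# among edges tied on betweenness, pick the cut that leaves the most even components
def pick_cut(nodes, edges, tied)
  tied.min_by { |edge| std_dev(component_sizes(nodes, edges - [edge])) }
end

# toy path graph a-b-c-d-e-f; pretend a-b and c-d came out tied
nodes = %i[a b c d e f]
edges = [%i[a b], %i[b c], %i[c d], %i[d e], %i[e f]]
tied = [%i[a b], %i[c d]]
p pick_cut(nodes, edges, tied)
```

<p>here cutting c-d leaves two groups of three while cutting a-b leaves one and five, so c-d wins.</p>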
<p><a href="http://github.com/matpalm/tgraph">all the code is on github</a></p>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/04/04/375/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>sentiment analysis training data using mechanical turk</title>
		<link>http://matpalm.com/blog/2010/03/12/sentiment-analysis-training-data-using-mechanical-turk/</link>
		<comments>http://matpalm.com/blog/2010/03/12/sentiment-analysis-training-data-using-mechanical-turk/#comments</comments>
		<pubDate>Fri, 12 Mar 2010 11:57:38 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[mechanical turk]]></category>
		<category><![CDATA[sentiment]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=331</guid>
		<description><![CDATA[want to try doing some sentiment analysis work on tweets but i need some good training data.
i could label a heap of tweets myself as being positive, neutral or negative but instead this seems to be the perfect job for mechanical turk
so i put up 100 &#8216;cream cheese&#8217; tweets on mechanical turk, asked for 3 [...]]]></description>
			<content:encoded><![CDATA[<p>want to try doing some <a href="http://en.wikipedia.org/wiki/Sentiment_analysis">sentiment analysis</a> work on tweets but i need some good training data.</p>
<p>i could label a heap of tweets myself as being positive, neutral or negative but instead this seems to be the perfect job for <a href="https://www.mturk.com/mturk/welcome">mechanical turk</a></p>
<p>so i put up 100 &#8216;cream cheese&#8217; tweets on mechanical turk, asked for 3 opinions per tweet and offered $0.01 per opinion. took under 30 minutes to get back all 300 opinions and only cost $4.50 ($3 for the work, $1.50 admin fee)</p>
<p>the <a href="http://matpalm.com/twitter/mturk_result.csv">results</a> are interesting in themselves&#8230;</p>
<p>mostly they are consistent;</p>
<p>for example all three sentiments for <strong>bagels and cream cheese for breakfast. very original</strong> were neutral</p>
<p>and all three sentiments for <strong>very few things are as good as a warm nyc bagel with cream cheese first thing in the am</strong> were positive.</p>
<p>but occasionally they aren&#8217;t consistent;</p>
<p>the tweet <strong>developing a recipe for orange cream cheese swirled cardamom brownies&#8230; that&#8217;s too long a name. hmm&#8230; suggestions?</strong> had one positive, one neutral and one negative</p>
<p>interestingly there was no case of a tweet having all opinions being negative; even <strong>bad idea. dont eat bagel with mixed berry cream cheese, right after u washed ur mouth with listerine. . </strong> ended up with two negatives and one positive (?)</p>
<p>hmmmm</p>
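<p>one simple way to turn three opinions into a single training label is a majority vote, leaving split tweets unlabelled. a small sketch; the keys below are made up, the label patterns mirror the examples above, and it doesn&#8217;t assume anything about the layout of the results csv.</p>

```ruby
# returns the label with a strict majority, or nil when the opinions split
def majority_label(labels)
  return nil if labels.empty?

  counts = Hash.new(0)
  labels.each { |label| counts[label] += 1 }
  label, count = counts.max_by { |_, n| n }
  count * 2 > labels.size ? label : nil
end

# hypothetical keys, opinions shaped like the examples in this post
opinions = {
  bagels: %w[neutral neutral neutral],
  brownies: %w[positive neutral negative],
  listerine: %w[negative negative positive]
}

opinions.each do |tweet, labels|
  puts "#{tweet}: #{majority_label(labels) || 'no majority'}"
end
```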
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/03/12/sentiment-analysis-training-data-using-mechanical-turk/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>mongodb + twitter + yahoo term extractor = fun!</title>
		<link>http://matpalm.com/blog/2010/03/07/mongodb-twitter-yahoo-term-extractor-fun/</link>
		<comments>http://matpalm.com/blog/2010/03/07/mongodb-twitter-yahoo-term-extractor-fun/#comments</comments>
		<pubDate>Sun, 07 Mar 2010 11:38:25 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[json]]></category>
		<category><![CDATA[mongodb]]></category>
		<category><![CDATA[term extraction]]></category>
		<category><![CDATA[twitter]]></category>
		<category><![CDATA[yahoo]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=325</guid>
		<description><![CDATA[ran a little experiment in using yahoo term extraction yesterday and it worked well enough. here&#8217;s some code to pass some text to yahoo and get back an array of terms
i&#8217;ve got to say mongodb is such an easy tool for working with json data. these 20 odd lines insert a text json tweet stream [...]]]></description>
			<content:encoded><![CDATA[<p>ran a little experiment in using <a href="http://developer.yahoo.com/search/content/V1/termExtraction.html">yahoo term extraction</a> yesterday and it worked well enough. here&#8217;s <a href="http://github.com/matpalm/twitter/blob/master/cheese_terms/extract_terms.rb">some code</a> to pass some text to yahoo and get back an array of terms</p>
<p>i&#8217;ve got to say mongodb is such an easy tool for working with json data. <a href="http://github.com/matpalm/twitter/blob/master/cheese_terms/insert_into_mongo.rb">these 20 odd lines</a> insert a text json tweet stream into mongo. so simple, why can&#8217;t all code be this easy&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/03/07/mongodb-twitter-yahoo-term-extractor-fun/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>tweets about cheese</title>
		<link>http://matpalm.com/blog/2009/11/15/tweets-about-cheese/</link>
		<comments>http://matpalm.com/blog/2009/11/15/tweets-about-cheese/#comments</comments>
		<pubDate>Sun, 15 Nov 2009 10:45:55 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[cheese]]></category>
		<category><![CDATA[ngrams]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=226</guid>
		<description><![CDATA[people tweet about all sorts of stuff.
sometimes it&#8217;s really important ground breaking world changing stuff&#8230;
but most of the time it&#8217;s ridiculous waste of time stuff like &#8216;i ate some cheese&#8217;
in fact how much do people actually tweet about cheese?
and when they do, what are the most important cheese related topics?
let&#8217;s gather some data&#8230;

using the twitter [...]]]></description>
			<content:encoded><![CDATA[<p>people tweet about all sorts of stuff.</p>
<p>sometimes it&#8217;s really important ground breaking world changing stuff&#8230;<br />
but most of the time it&#8217;s ridiculous waste of time stuff like &#8216;i ate some cheese&#8217;</p>
<p>in fact how much do people actually tweet about cheese?<br />
and when they do, what are the most important cheese related topics?</p>
<p>let&#8217;s gather some data&#8230;</p>
<p><span id="more-226"></span></p>
<p>using the twitter search api this is dead simple. here&#8217;s a <a href="http://github.com/matpalm/twitter/tree/master/cheese/">hacktastic script</a> which does a polling search for cheese. start it up and bake at 200deg C, i mean, run overnight.</p>
<pre>bash&gt; ./collect_cheese.rb &gt;&gt; cheese.out</pre>
<p>in the morning we&#8217;ve got some tweets about cheese, yay!</p>
<p style="padding-left: 30px;">UPDATE!</p>
<p style="padding-left: 30px;">even easier than a hacktastic script is the new filter streaming api</p>
<pre style="padding-left: 30px;">bash&gt; curl -s -u user:password http://stream.twitter.com/1/statuses/filter.json?track=cheese</pre>
<p style="padding-left: 30px;">yay!</p>
<p>let&#8217;s poke around, but first some l33t hax0r bash aliases for the sake of brevity</p>
<pre>alias t='tail'
alias h='head'
alias s='sort'
alias u='uniq'
alias g='grep'</pre>
<p>let&#8217;s start with a sample, the first 10 tweets&#8230;</p>
<pre>bash&gt; ./parse_cheese_out.rb &lt; cheese.out | h
Pasta with pesto and cheese. Some watermelon but alas did not get to the salad. At least not yet. http://twitpic.com/pf1rr
guess imma have a steak n cheese since a certain sum1 stunntin in bring me chines
@souljaboytellem Cheese and Bread XD
Great tips for awesome Pizza. Who loves Pizza. Cheese Matters http://bit.ly/42xkZS
Hey bring that cheese.
Milwaukee airport officially 10x better than Lambert. Not surprisingly, a large cheese selection. Kind of a cheese motif in the shops here.
http://twitpic.com/pf1pl - The Cheese Cake is served.
LA has amazing restaurants. had the best grilled cheese (w/sharp cheddar, gruyere and dijon mustard) at comme ca for lunch with @maxwanger
Yummy pepperoni and bleu cheese pizza consumed. Next up a glass of wine. Yay weekend!
Applebees take-out flow aka Dinner for 1- Grilled Chicken/Shrimp and cheddar jack mac &amp; cheese.</pre>
<p>to get some context on which words are being used let&#8217;s find the most frequent words, ie 1-grams (where an n-gram is a phrase of n terms), in the first 10 tweets.<br />
we&#8217;ll sanitize by removing weird characters, downcasing, removing urls, etc</p>
<pre>bash&gt; ./parse_cheese_out.rb &lt; cheese.out | h | ./sanitise.rb | ./cheese_grams.rb 1 | s | u -c | s -n | t
 2 in
 2 of
 2 with
 3 for
 3 not
 3 pizza
 4 the
 5 a
 5 and
11 cheese</pre>
<p>(this says &#8216;cheese&#8217; was mentioned 11 times while &#8216;with&#8217; was mentioned twice)</p>
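<p>for anyone without the scripts handy, here&#8217;s a rough ruby sketch of what the sanitise and n-gram counting steps might look like. these are stand-ins written for this example, not the actual sanitise.rb and cheese_grams.rb from the repo, and the cleaning rules (drop urls, drop apostrophes, turn other punctuation into spaces) are an assumption.</p>

```ruby
# hypothetical stand-in for the sanitise step
def sanitise(text)
  text.downcase
      .gsub(%r{https?://\S+}, ' ') # drop urls
      .delete("'")                 # wont rather than won t
      .gsub(/[^a-z0-9 ]/, ' ')     # everything else becomes a space
      .split
      .join(' ')
end

# all runs of n consecutive words in a sanitised tweet
def ngrams(text, n)
  text.split.each_cons(n).map { |words| words.join(' ') }
end

# the most frequent n-grams across a list of raw tweets
def top_ngrams(tweets, n, limit = 10)
  counts = Hash.new(0)
  tweets.each do |tweet|
    ngrams(sanitise(tweet), n).each { |gram| counts[gram] += 1 }
  end
  counts.sort_by { |gram, count| [-count, gram] }.first(limit)
end

tweets = [
  'Hey bring that cheese.',
  '@souljaboytellem Cheese and Bread XD'
]
top_ngrams(tweets, 1, 3).each { |gram, count| puts "#{count} #{gram}" }
```

<p>each_cons does the sliding window, so n-grams never run across two tweets.</p>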
<p>observations;</p>
<ol>
<li>it&#8217;s no surprise that cheese is the most frequent since it was our search term.</li>
<li>the most frequent words that aren&#8217;t cheese, &#8216;and&#8217;, &#8216;a&#8217; and &#8216;the&#8217;, are classic english constructs.</li>
<li>the next most frequent term is &#8216;pizza&#8217;. is pizza the biggest cheese use? well, in these 10 tweets perhaps&#8230;</li>
</ol>
<p>next let&#8217;s consider the frequency of bi-grams, that is 2 word phrases.</p>
<pre>bash&gt; ./parse_cheese_out.rb &lt; cheese.out | h | ./sanitise.rb | ./cheese_grams.rb 2 | s | u -c | s -n | t
1 to the
1 up a
1 watermelon but
1 who loves
1 wine yay
1 with maxwanger
1 with pesto
1 w sharp
1 yay weekend
1 yummy pepperoni</pre>
<p>this result is not too interesting, or surprising. seems there are no common two word tuples across the first 10 tweets. fair enough.</p>
<p>how about bigrams across 100 tweets instead of 10?</p>
<pre>bash&gt; ./parse_cheese_out.rb &lt; cheese.out | h -100 | ./sanitise.rb | ./cheese_grams.rb 2 | s | u -c | s -n | t
 4 mac and
 4 nacho cheese
 5 &amp; cheese
 5 cheese and
 5 grilled cheese
 5 i love
 6 cheese cake
 6 hot pocket
 6 with cheese
14 and cheese</pre>
<p>now we&#8217;re getting somewhere!</p>
<p>more (arguable) observations</p>
<ol>
<li>when people are making something with cheese they say &#8216;X and cheese&#8217; more than &#8216;cheese and X&#8217; ( from the raw frequencies )</li>
<li>people like cheese more than they dislike it; ( the only sentiment is &#8216;love&#8217; )</li>
<li>&#8216;with&#8217; is a reasonable synonym for &#8216;and&#8217; ( since they are the two most frequent bigrams ending in cheese )</li>
</ol>
<p>one that jumps out at me is &#8216;mac and&#8217;. mac ? is this macaroni?</p>
<p>let&#8217;s try 3grams&#8230;</p>
<pre>bash&gt; ./parse_cheese_out.rb &lt; cheese.out | h -100 | ./sanitise.rb | ./cheese_grams.rb 3 | s | u -c | s -n | t
3 mac &amp; cheese
3 not so delicious
3 pocket next time
3 pocket not so
3 so delicious as
3 sounds better go
3 the ham and
3 wine and cheese
3 you like cheese
4 mac and cheese</pre>
<p>hmmm. looks like mac <em>is</em> macaroni.</p>
<p>from <a href="http://en.wikipedia.org/wiki/Macaroni_and_cheese">wikipedia</a>&#8230; &#8220;Macaroni and cheese (also referred to as macaroni cheese in the United Kingdom and mac &#8216;n&#8217; cheese in parts of the United States and Canada)&#8221;<br />
so macaroni cheese is pretty popular! at least in the 3gram space, at this time of day, which is late afternoon in the united states timezone</p>
<p>does this pattern continue as we consider more tweets? say 1000?</p>
<pre>bash&gt; ./parse_cheese_out.rb &lt; cheese.out | h -1000 | ./sanitise.rb | ./cheese_grams.rb 3 | s | u -c | s -n | t
 7 macaroni and cheese
 7 philly cheese steak
 8 on the moon
 9 steak and cheese
10 ham and cheese
11 wine and cheese
28 chuck e cheese
33 mac and cheese
33 mac &amp; cheese
48 mac n cheese</pre>
<p>yep. how about 10,000 ?</p>
<pre>bash&gt; ./parse_cheese_out.rb &lt; cheese.out | h -10000 | ./sanitise.rb | ./cheese_grams.rb 3 | s | u -c | s -n | t
 74 wine and cheese
 77 a grilled cheese
 78 egg and cheese
 78 ham and cheese
 81 with cream cheese
 89 at chuck e
194 mac &amp; cheese
294 mac and cheese
301 mac n cheese
337 chuck e cheese</pre>
<p>bam! <a href="http://www.chuckecheese.com/">chuck e cheese</a>, one of the bastions of children&#8217;s health food in america, takes over!</p>
<p>there is an overlap between &#8216;at chuck e&#8217; and &#8216;chuck e cheese&#8217; so let&#8217;s try 4 grams&#8230;</p>
<pre>bash&gt; ./parse_cheese_out.rb &lt; cheese.out | h -10000 | ./sanitise.rb | ./cheese_grams.rb 4 | s | u -c | s -n | t
27 mac n cheese and
27 water on the moon
28 bacon egg &amp; cheese
28 mac and cheese and
29 call of duty modern
29 duty modern warfare 2
29 of duty modern warfare
32 bacon egg and cheese
58 to chuck e cheese
82 at chuck e cheese</pre>
<p>observations&#8230;</p>
<ol>
<li>more people tweet when they&#8217;re at chuck e cheese than when they&#8217;re planning to go there.</li>
<li>&#8216;call of duty modern warfare 2&#8217; wtf1?</li>
<li>&#8216;water on the moon&#8217; wtf2?</li>
</ol>
<p>let&#8217;s have a look at call of duty first.</p>
<pre>bash&gt; ./parse_cheese_out.rb &lt; cheese.out | ./sanitise.rb | g 'call of duty' | h | s | u -c
1 i unlocked the royale with cheese achievement on call of duty modern warfare 2
1 making stromboli with tomoto cheese and spinach then going to best buy to get the new call of duty
8 unlocked royale with cheese in call of duty modern warfare 2 xboxtweet</pre>
<p>so looks like one of those auto-tweet things (do they have a name yet?) people playing this on xbox have the option<br />
to tweet about how awesomely awesome they are when they unlock some secret level.</p>
<p>how best to get rid of them? let&#8217;s try exact deduping of a tweet and see if they &#8216;go away&#8217;</p>
<pre>bash&gt; ./parse_cheese_out.rb &lt; cheese.out | h -10000 | ./sanitise.rb | wc -l
10000
(sanity)
bash&gt; ./parse_cheese_out.rb &lt; cheese.out | h -10000 | ./sanitise.rb | sort | uniq | wc -l
9773</pre>
<p>so it&#8217;s getting rid of 227 tweets, not too destructive..</p>
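<p>the same exact dedup is a one liner in ruby. a tiny sketch on already sanitised lines (the dedup helper name is made up):</p>

```ruby
# exact dedup: keep the first copy of each line, like sort | uniq without the sorting
def dedup(lines)
  lines.uniq
end

lines = [
  'unlocked royale with cheese in call of duty modern warfare 2 xboxtweet',
  'i unlocked the royale with cheese achievement on call of duty modern warfare 2',
  'unlocked royale with cheese in call of duty modern warfare 2 xboxtweet'
]
unique = dedup(lines)
puts "#{lines.size} lines, #{unique.size} unique, #{lines.size - unique.size} dropped"
```

<p>only byte for byte copies collapse; a hand typed variation of the same tweet still counts as a separate line.</p>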
<p>what about wtf2, ie &#8216;water on the moon&#8217; ?</p>
<pre>bash&gt; ./parse_cheese_out.rb &lt; cheese.out | g -i 'water on the moon' | h
Water on the moon... what?!? No cheese?
Water on the moon...wow. Did they find any cheese? I hope not, cause then there'll be some big ass moon rats.
Water on the moon? They probably just didn't realize this "water" is actually cheese.
RT @jdickerson: The significant amount of water on the moon means, of course, that the cheese is mozzarella.
so the moons not made of cheese? google's logo is changed to celebrate the discovery of water on the moon
NASA finds water on the moon. "Moon River" composed in 1961. So, why is NASA surprised? Anyway, can't make cheese without water, DOH.
"Significant water on the Moon" - so it looks like grey Swiss cheese, but with a consistency closer to mozarella rather than Parmesan...
Forgive my lack of awe at the news of water on the moon. I mean, used to think it was cheese. Water's not as cool as cheese.
RT @Grundy: Forgive my lack of awe at the news of water on the moon. I mean, used to think it was cheese. Water's not as cool as cheese.
RT @berkun: Of course there is water on the moon. The moon is made of cheese, and cheese has water in it.</pre>
<p>ahha, i see. comical reactions to nasa&#8217;s recent lcross impact report.</p>
<p>hypothesis</p>
<ol>
<li>if i had more data and did a sliding window analysis of say a few days then &#8216;water on the moon&#8217; would appear and disappear as a trending topic when dealing with cheese</li>
<li>people would be tweeting less about chuck e cheese in the middle of the week. need more data to check this one&#8230;</li>
</ol>
<p>let&#8217;s try a humongous 10gram check</p>
<pre>bash&gt; ./parse_cheese_out.rb &lt; cheese.out | h -10000 | ./sanitise.rb | s | u | ./cheese_grams.rb 10 | s | u -c | s -n | t
8 is something that doesnt matter unless you are a cheese
8 mad ur man because he wont pull his dick out
8 man because he wont pull his dick out in chucky
8 ur man because he wont pull his dick out in
9 dick out in chucky cheese so u can give him
9 he wont pull his dick out in chucky cheese so
9 his dick out in chucky cheese so u can give
9 out in chucky cheese so u can give him some
9 pull his dick out in chucky cheese so u can
9 wont pull his dick out in chucky cheese so u</pre>
<p>this time. really. wtf?<br />
( i can&#8217;t believe i&#8217;m actually going to do this next grep&#8230; )</p>
<pre>bash&gt; ./parse_cheese_out.rb &lt; cheese.out | g -i 'pull his dick out' | h
RT @KevinHart4real #youKnowurahoeif U get mad @ur man becuz he won't pull his dick out n chucky cheese so u can give him head! Boom..oh my
RT @KevinHart4real: #youKnowurahoeif you get mad @ ur man because he won't pull his dick out in chucky cheese so u can give him some head!
RT @KevinHart4real: #youKnowurahoeif you get mad @ ur man because he won't pull his dick out in chucky cheese so u can give him some head!
RT :@KevinHart4real #youKnowurahoeif you get mad @ ur man because he won't pull his dick out in chucky cheese so u can give him some head!
#youKnowurahoeif you get mad @ ur man because he won't pull his dick out in chucky cheese so u can give him some head! (via @KevinHart4real)
RT @KevinHart4real #youKnowurahoeif you get mad @ ur man because he won't pull his dick out in chucky cheese so u can give him some head!
RT @KevinHart4real: #youKnowurahoeif you get mad @ ur man because he won't pull his dick out in chucky cheese so u can give him some hea ...
#youKnowurahoeif you get mad at cho boyfriend cuz he won't pull his dick out in chucky cheese so u can give him some head! BOOOOOOM !!!!!
RT @KevinHart4real #youKnowurahoeif you get mad @ ur man because he won't pull his dick out in chucky cheese so u can give him some head! Bo
RT @KevinHart4real: #youKnowurahoeif you get mad @ ur man because he won't pull his dick out in chucky cheese so u can give him some head!</pre>
<p>so retweets are a bit adhoc and it seems that they aren&#8217;t pulled out as part of the deduping</p>
<pre>bash&gt; ./parse_cheese_out.rb &lt; cheese.out | h -10000 | ./sanitise.rb | s | u | g 'pull his dick out'
gangstressb youknowurahoeif you get mad ur man because he wont pull his dick out in chucky cheese so u can give him some head boooom
ha rt kevinhart4real youknowurahoeif u get mad ur man because he wont pull his dick out in chucky cheese so u can give him some head
ha rt kevinhart4real youknowurahoeif u get mad ur man cause he wont pull his dick out n chucky cheese so u can give him sum head boom
kevinhart4real youknowurahoeif you get mad ur man because he wont pull his dick out in chucky cheese so u can give him some head ewww
rt kevinhart4real youknowurahoeif u get mad ur man becuz he wont pull his dick out n chucky cheese so u can give him head boom oh my
rt kevinhart4real youknowurahoeif you get mad ur man because he wont pull his dick out in chucky cheese so u can give him some hea
rt kevinhart4real youknowurahoeif you get mad ur man because he wont pull his dick out in chucky cheese so u can give him some head
rt kevinhart4real youknowurahoeif you get mad ur man because he wont pull his dick out in chucky cheese so u can give him some head bo
rt kevinhart4real youknowurahoeif you get mad ur man because he wont pull his dick out in chucky cheese so u can give him some head boooom
youknowurahoeif you get mad at cho boyfriend cuz he wont pull his dick out in chucky cheese so u can give him some head boooooom
youknowurahoeif you get mad ur man because he wont pull his dick out in chucky cheese so u can give him some head via kevinhart4real</pre>
<p>observations:</p>
<ol>
<li>people retweet in a quite adhoc fashion ( i guess this is why twitter has recently introduced the retweet feature? )</li>
<li>ppl abbrev however they can 2 get a tweet 2 140 chars. all the tweets have &#8216;so u can&#8217; instead of &#8216;so you can&#8217; so i&#8217;m assuming that was in the original tweet. but &#8216;because&#8217; has been shortened to &#8216;cause&#8217; or even &#8216;cuz&#8217; so that people&#8217;s additional insightful commentary (such as &#8216;ewww&#8217; or &#8216;booom&#8217;) could be appended&#8230; more statistical synonym building opportunities!</li>
<li>people left it as &#8216;chucky cheese&#8217; instead of correcting it back to &#8216;chuck e cheese&#8217;</li>
</ol>
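<p>the adhoc styles seen above (a leading &#8216;rt&#8217;, an &#8216;rt&#8217; buried after a comment, a trailing &#8216;via someone&#8217;) could be caught with a few patterns. a rough sketch on sanitised text; the patterns are guesses based only on the lines above and the example tweets are made up.</p>

```ruby
# rough retweet check on sanitised (downcased, punctuation free) text
def retweet?(line)
  line.start_with?('rt ') ||      # rt someuser ...
    line.include?(' rt ') ||      # ha rt someuser ...
    line.match?(/ via \w+\z/)     # ... via someuser
end

# hypothetical sanitised tweets, one per style
lines = [
  'rt someuser mac and cheese for dinner',
  'ha rt someuser mac and cheese for dinner',
  'mac and cheese for dinner via someuser',
  'mac and cheese for dinner yay'
]
lines.each { |line| puts "#{retweet?(line) ? 'retweet ' : 'original'} #{line}" }
```

<p>a plain starts with &#8216;rt &#8217; check only catches the first of these, and none of the patterns catch someone retyping the tweet without any credit.</p>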
<p>anyways seems like removing all retweets is arguable. it would depend on what you&#8217;re trying to analyse for&#8230;<br />
let&#8217;s give it a crack anyways and use a simple retweet check (ie starts with &#8216;rt &#8217;)</p>
<pre>bash&gt; ./parse_cheese_out.rb &lt; cheese.out | h -10000 | ./sanitise.rb | g -v "^rt " | s | u | ./cheese_grams.rb 10 | s | u -c | s -n | t
5 pull his dick out in chucky cheese so u can
5 savory pies ham and cheese breakfast quiche submitted by bobkat2000
5 the early bird may get the worm but the second
5 wont pull his dick out in chucky cheese so u
6 and savory pies ham and cheese breakfast quiche submitted by
6 clue thin slices of stuffed with cheese and ham and
6 get the worm but the second mouse gets the cheese
6 quiche and savory pies ham and cheese breakfast quiche submitted
6 slices of stuffed with cheese and ham and then sauteed
6 thin slices of stuffed with cheese and ham and then</pre>
<p>looks like people are retweeting some kind of recipe&#8230;</p>
<pre>bash&gt; ./parse_cheese_out.rb &lt; cheese.out | h -10000 | ./sanitise.rb | s | u | g 'thin slices of stuffed'
mystery phrase is v a b u clue thin slices of stuffed with cheese and ham and then sauteed
mystery phrase is v a c o b eu clue thin slices of stuffed with cheese and ham and then sauteed
mystery phrase is vea c do b eu clue thin slices of stuffed with cheese and ham and then sauteed
mystery phrase is vea cordo b eu clue thin slices of stuffed with cheese and ham and then sauteed
mystery phrase is veal cordo b eu clue thin slices of stuffed with cheese and ham and then sauteed
mystery phrase is veal cordo bleu clue thin slices of stuffed with cheese and ham and then sauteed</pre>
<p>??? maybe it&#8217;s not a retweet? mystery phrase? is this some kind of game???<br />
seems so, check out http://twitter.com/tweet_words</p>
<p>this type of duplication would require a more complex deduping; perhaps shingling or simhash bucketing. for another day&#8230;.</p>
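<p>to give a feel for the shingling idea, here&#8217;s a small sketch that compares two tweets by the overlap of their 3 word shingles (jaccard similarity). the shingle size and the 0.5 threshold are arbitrary assumptions, and it compares pairs directly rather than doing any clever bucketing.</p>

```ruby
require 'set'

# the set of n word shingles in a sanitised tweet
def shingles(text, n = 3)
  text.split.each_cons(n).map { |words| words.join(' ') }.to_set
end

# jaccard similarity: size of the overlap divided by size of the union
def jaccard(a, b)
  union = a.union(b).size
  union.zero? ? 0.0 : a.intersection(b).size.to_f / union
end

# threshold is a guess, tune it against real data
def near_duplicate?(text_a, text_b, threshold = 0.5)
  jaccard(shingles(text_a), shingles(text_b)) >= threshold
end

a = 'mystery phrase is veal cordo b eu clue thin slices of stuffed with cheese and ham and then sauteed'
b = 'mystery phrase is veal cordo bleu clue thin slices of stuffed with cheese and ham and then sauteed'
puts jaccard(shingles(a), shingles(b)).round(2)
puts near_duplicate?(a, b)
puts near_duplicate?(a, 'hey bring that cheese')
```

<p>comparing every pair gets expensive at 10,000 tweets, which is where something like simhash bucketing would come in.</p>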
<p>so in summary the final 1 to 4 grams are&#8230;.</p>
<pre>bash&gt; ./parse_cheese_out.rb &lt; cheese.out | h -10000 | ./sanitise.rb | g -v "^rt " | s | u | ./cheese_grams.rb 1 | s | u -c | s -n | t
1336 is
1465 my
1597 of
1619 with
2065 to
3157 a
3161 i
3408 the
4175 and
10213 cheese

bash&gt; ./parse_cheese_out.rb &lt; cheese.out | h -10000 | ./sanitise.rb | g -v "^rt " | s | u | ./cheese_grams.rb 2 | s | u -c | s -n | t
310 of cheese
359 the cheese
362 e cheese
375 chuck e
402 grilled cheese
409 n cheese
431 &amp; cheese
445 cream cheese
796 cheese and
1018 and cheese

bash&gt; ./parse_cheese_out.rb &lt; cheese.out | h -10000 | ./sanitise.rb | g -v "^rt " | s | u | ./cheese_grams.rb 3 | s | u -c | s -n | t
72 wine and cheese
75 ham and cheese
77 egg and cheese
81 a grilled cheese
84 with cream cheese
94 at chuck e
203 mac &amp; cheese
300 mac n cheese
305 mac and cheese
342 chuck e cheese

bash&gt; ./parse_cheese_out.rb &lt; cheese.out | h -10000 | ./sanitise.rb | g -v "^rt " | s | u | ./cheese_grams.rb 4 | s | u -c | s -n | t
21 with cream cheese frosting
22 chuck e cheese with
23 and mac and cheese
24 a grilled cheese sandwich
25 bacon egg &amp; cheese
28 mac n cheese and
31 mac and cheese and
32 bacon egg and cheese
57 to chuck e cheese
87 at chuck e cheese</pre>
<p>todos:</p>
<ol>
<li>collect some data and compare use over a weekend compared to a weekday compared to the entire week</li>
<li>employ a more advanced deduping algorithm, i&#8217;ve been looking for an excuse to try some modifications to my sketching algorithms</li>
</ol>
<p>to be continued, who knew cheese could be so much fun!!!</p>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2009/11/15/tweets-about-cheese/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>e11.3 at what time does the world tweet?</title>
		<link>http://matpalm.com/blog/2009/10/28/e11-3-at-what-time-does-the-world-tweet/</link>
		<comments>http://matpalm.com/blog/2009/10/28/e11-3-at-what-time-does-the-world-tweet/#comments</comments>
		<pubDate>Wed, 28 Oct 2009 11:22:51 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[e11]]></category>
		<category><![CDATA[r]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=186</guid>
		<description><![CDATA[consider the graph below which shows the proportion of tweets per 10 min slot of the day (GMT0)
it compares 4.7e6 tweets with any location vs  320e3 tweets with identifiable lat lons

some interesting observations with unanswered questions&#8230;

the ebb and flow is not just a result of the time of day for high twitter traffic areas. the [...]]]></description>
			<content:encoded><![CDATA[<p>consider the graph below which shows the proportion of tweets per 10 min slot of the day (GMT0)</p>
<p>it compares 4.7e6 tweets with any location vs  320e3 tweets with identifiable lat lons</p>
<p style="text-align: center;"><img class="aligncenter size-full wp-image-200" title="timeslices_freq.comparison" src="http://matpalm.com/blog/wp-content/uploads/2009/10/timeslices_freq.comparison2.jpg" alt="timeslices_freq.comparison" width="750" height="480" /></p>
<p>some interesting observations with unanswered questions&#8230;</p>
<ol>
<li>the ebb and flow is not just a result of the time of day for high twitter traffic areas. the reduction between 06:00 and 10:00 comes close to zero. that can&#8217;t be real; there is never a time when worldwide internet traffic hits zero. does twitter turn down its gardenhose for capacity reasons?</li>
<li>the number of tweets with lat lons is correlated with the number without EXCEPT past 17:00 where the lat lon cases drop drastically. have a couple of ideas banging around my head why this is the case but nothing concrete. any ideas?</li>
</ol>
<p>speaking of correlation here&#8217;s a scatterplot of tweets with lat lons vs without. the slots past 17:00, where the two stop being correlated, show up as a quite obvious cluster.</p>
<p><img class="aligncenter size-full wp-image-190" title="timeslices_freq.scatter" src="http://matpalm.com/blog/wp-content/uploads/2009/10/timeslices_freq.scatter.jpg" alt="timeslices_freq.scatter" width="400" height="480" /></p>
<p><a href="http://github.com/matpalm/rtw_tweet/blob/master/v3/timeslices_freq.graphs.r">and here is the R code for these graphs</a></p>
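<p>for the curious, the two calculations behind these graphs (proportion of tweets per 10 minute slot, and how well two such series line up) can be sketched in a few lines of ruby. this is not the R code linked above; the helper names, the iso8601 timestamp format and the sample timestamps are all assumptions for the example.</p>

```ruby
require 'time'

# index (0..143) of the 10 minute slot of the day, in utc
def slot_of(timestamp)
  t = Time.iso8601(timestamp).utc
  (t.hour * 60 + t.min) / 10
end

# proportion of tweets per slot, as an array of 144 floats
def slot_proportions(timestamps)
  counts = Array.new(144, 0)
  timestamps.each { |ts| counts[slot_of(ts)] += 1 }
  counts.map { |c| c.to_f / timestamps.size }
end

# pearson correlation between two equally long series
def pearson(xs, ys)
  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  cov = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  var_x = xs.sum { |x| (x - mean_x)**2 }
  var_y = ys.sum { |y| (y - mean_y)**2 }
  cov / Math.sqrt(var_x * var_y)
end

# hypothetical timestamps: all tweets vs the ones with lat lons
all_tweets = %w[2009-10-28T06:05:00Z 2009-10-28T11:22:51Z 2009-10-28T11:29:59Z 2009-10-28T17:40:00Z]
with_lat_lon = %w[2009-10-28T06:05:00Z 2009-10-28T11:22:51Z]
puts pearson(slot_proportions(all_tweets), slot_proportions(with_lat_lon)).round(3)
```

<p>running pearson separately on the slots before and after 17:00 would be one way to put a number on the cluster in the scatterplot.</p>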
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2009/10/28/e11-3-at-what-time-does-the-world-tweet/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
