<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>brain of matpalm</title>
	<atom:link href="http://matpalm.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://matpalm.com/blog</link>
	<description>thoughts from a data scientist wannabe</description>
	<lastBuildDate>Sat, 03 Jul 2010 07:37:32 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>brutally short intro to weka</title>
		<link>http://matpalm.com/blog/2010/07/03/brutally-short-intro-to-weka/</link>
		<comments>http://matpalm.com/blog/2010/07/03/brutally-short-intro-to-weka/#comments</comments>
		<pubDate>Sat, 03 Jul 2010 07:35:27 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[brutally short intro]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[weka]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=677</guid>
		<description><![CDATA[weka is a java based machine learning workbench that i&#8217;ve found useful to play with to help understand some standard machine learning algorithms. in this quick demo i show how to build a classifier for three simple datasets; two of which address the basics of text classification

brutally short intro to weka from Mat Kelcey on [...]]]></description>
			<content:encoded><![CDATA[<p>weka is a java based machine learning workbench that i&#8217;ve found useful to play with to help understand some standard machine learning algorithms. in this quick demo i show how to build a classifier for three simple datasets; two of which address the basics of text classification</p>
<p><object width="400" height="300"><param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="movie" value="http://vimeo.com/moogaloop.swf?clip_id=13051595&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=&amp;fullscreen=1" /><embed src="http://vimeo.com/moogaloop.swf?clip_id=13051595&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=&amp;fullscreen=1" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="400" height="300"></embed></object>
<p><a href="http://vimeo.com/13051595">brutally short intro to weka</a> from <a href="http://vimeo.com/user2935988">Mat Kelcey</a> on <a href="http://vimeo.com">Vimeo</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/07/03/brutally-short-intro-to-weka/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>friend clustering by term usage</title>
		<link>http://matpalm.com/blog/2010/06/25/friend-clustering-by-term-usage/</link>
		<comments>http://matpalm.com/blog/2010/06/25/friend-clustering-by-term-usage/#comments</comments>
		<pubDate>Fri, 25 Jun 2010 13:39:08 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[infochimps]]></category>
		<category><![CDATA[network]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=662</guid>
		<description><![CDATA[recently signed up to the infochimps api and wanted to do something quick and dirty to get a feel for it.
so here&#8217;s a little experiment

get the people i follow on twitter
look up the words that &#8220;represent&#8221; them according to the infochimps word bag api
build a similarity matrix based on the common use of those terms
plot [...]]]></description>
			<content:encoded><![CDATA[<p>recently signed up to the <a href="http://api.infochimps.com/">infochimps api</a> and wanted to do something quick and dirty to get a feel for it.</p>
<p>so here&#8217;s a little experiment</p>
<ol>
<li>get the people i follow on twitter</li>
<li>look up the words that &#8220;represent&#8221; them according to the <a href="http://api.infochimps.com/describe/soc/net/tw/wordbag">infochimps word bag api</a></li>
<li>build a similarity matrix based on the common use of those terms</li>
<li>plot the connectivity for the top 30 or so pairings</li>
</ol>
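<p>as a rough sketch of steps 3 and 4 (in python, and not the code from the repo): assume each person&#8217;s word bag is just a set of terms and score every pair of people by jaccard overlap. the function names, the choice of jaccard as the similarity and the sample word bags are all assumptions made for illustration.</p>

```python
from itertools import combinations

def jaccard(a, b):
    """size of the shared terms over the size of all terms used by either."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def top_pairs(word_bags, k):
    """score every pair of users and return the k most similar pairs."""
    scored = [
        (jaccard(word_bags[u], word_bags[v]), u, v)
        for u, v in combinations(sorted(word_bags), 2)
    ]
    return sorted(scored, reverse=True)[:k]

# hypothetical word bags, not real infochimps output
word_bags = {
    "alice": {"hadoop", "hbase", "nosql"},
    "bob": {"hbase", "nosql", "mongodb"},
    "carol": {"travel", "guidebook"},
}
pairs = top_pairs(word_bags, 2)
```

<p>the top scoring pairs are the edges you would hand to whatever is drawing the graph.</p>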
<p><a href="http://matpalm.com/blog/wp-content/uploads/2010/06/top35.png"><img src="http://matpalm.com/blog/wp-content/uploads/2010/06/top35-300x178.png" alt="" title="top35" width="300" height="178" class="aligncenter size-medium wp-image-666" /></a></p>
<p>it&#8217;s basically partitioned into three groups&#8230;</p>
<ol>
<li>veztek (my boss john) and smcinnes (steve from the lonely planet community team) in the top right</li>
<li>a big clump of nosqlness with mongodb &#8211; hbase &#8211; jpatanooga &#8211; kevinweil in the bottom left</li>
<li>everyone else</li>
</ol>
<p>an interesting enough result given the time taken; the code&#8217;s <a href="http://github.com/matpalm/twitter/tree/master/friend_cluster/">on github</a></p>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/06/25/friend-clustering-by-term-usage/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>country codes in world cup tweets &#8211; viz1</title>
		<link>http://matpalm.com/blog/2010/06/21/country-codes-in-world-cup-tweets-viz1/</link>
		<comments>http://matpalm.com/blog/2010/06/21/country-codes-in-world-cup-tweets-viz1/#comments</comments>
		<pubDate>Mon, 21 Jun 2010 09:43:32 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[twitter]]></category>
		<category><![CDATA[visualisation]]></category>
		<category><![CDATA[worldcup]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=654</guid>
		<description><![CDATA[
#worldcup tweet viz1 from Mat Kelcey on Vimeo.
here&#8217;s a simple visualisation of the use of official country codes (eg #aus) in a week&#8217;s worth of tweets from the search stream for #worldcup.
rate is about 2 hours of tweets per sec. orb size denotes relative frequency of that country code. edges denote that those two countries feature [...]]]></description>
			<content:encoded><![CDATA[<p><object width="400" height="300"><param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="movie" value="http://vimeo.com/moogaloop.swf?clip_id=12710800&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=&amp;fullscreen=1" /><embed src="http://vimeo.com/moogaloop.swf?clip_id=12710800&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=&amp;fullscreen=1" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="400" height="300"></embed></object>
<p><a href="http://vimeo.com/12710800">#worldcup tweet viz1</a> from <a href="http://vimeo.com/user2935988">Mat Kelcey</a> on <a href="http://vimeo.com">Vimeo</a>.</p>
<p>here&#8217;s a simple visualisation of the use of official country codes (eg #aus) in a week&#8217;s worth of tweets from the search stream for #worldcup.</p>
<p>rate is about 2 hours of tweets per sec. orb size denotes relative frequency of that country code. edges denote that those two countries feature a lot in the same tweets. movement is based on gravity-like attraction along edges.</p>
<p>the quiet period at about 0:17 is a twitter outage :)</p>
<p><a href="http://matpalm.com/world_cup/viz1/">here&#8217;s the original processing applet version</a> with a bit more discussion</p>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/06/21/country-codes-in-world-cup-tweets-viz1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>moving average of a time series in R</title>
		<link>http://matpalm.com/blog/2010/06/15/moving-average-of-a-time-series-in-r/</link>
		<comments>http://matpalm.com/blog/2010/06/15/moving-average-of-a-time-series-in-r/#comments</comments>
		<pubDate>Tue, 15 Jun 2010 06:15:10 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[r]]></category>
		<category><![CDATA[simple stuff i keep forgetting]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=649</guid>
		<description><![CDATA[in this case a sliding window of 3 elements

> x = c(3,1,4,1,5,9,2,6,5,3,5,8)
> ra_x = filter(x, rep(1,3)/3)
> ra_x
Time Series:
Start = 1
End = 12
Frequency = 1
 [1]       NA 2.666667 2.000000 3.333333 5.000000 5.333333 5.666667 4.333333
 [9] 4.666667 4.333333 5.333333       NA

]]></description>
			<content:encoded><![CDATA[<p>in this case a sliding window of 3 elements</p>
<pre>
> x = c(3,1,4,1,5,9,2,6,5,3,5,8)
> ra_x = filter(x, rep(1,3)/3)
> ra_x
Time Series:
Start = 1
End = 12
Frequency = 1
 [1]       NA 2.666667 2.000000 3.333333 5.000000 5.333333 5.666667 4.333333
 [9] 4.666667 4.333333 5.333333       NA
</pre>
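<p>for comparison, here&#8217;s a rough python sketch of what that filter call works out: the mean of each centred window of 3, with nothing at either end where the window would run off the series. the function name is made up and it assumes an odd window size.</p>

```python
def centred_moving_average(xs, window=3):
    """mean of each centred window; None where the window runs off either end."""
    half = window // 2
    out = []
    for i in range(len(xs)):
        lo, hi = i - half, i + half + 1
        if lo >= 0 and hi <= len(xs):
            out.append(sum(xs[lo:hi]) / window)
        else:
            out.append(None)
    return out

x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
ra_x = centred_moving_average(x)
```

<p>the None entries line up with the NA values in the R output above.</p>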
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/06/15/moving-average-of-a-time-series-in-r/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>#worldcup twitter analytics</title>
		<link>http://matpalm.com/blog/2010/06/14/worldcup-twitter-analytics/</link>
		<comments>http://matpalm.com/blog/2010/06/14/worldcup-twitter-analytics/#comments</comments>
		<pubDate>Mon, 14 Jun 2010 12:06:49 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[twitter]]></category>
		<category><![CDATA[worldcup]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=644</guid>
		<description><![CDATA[since the world cup started i&#8217;ve spent more time looking at twitter data about the games than the actual games themselves. what a sad data nerd i am!
anyways, here&#8217;s the first few days&#8217; analysis based on the use of official country tags (eg #aus) in the search stream for #worldcup.
tomorrow i might look in more detail [...]]]></description>
			<content:encoded><![CDATA[<p>since the world cup started i&#8217;ve spent more time looking at twitter data about the games than the actual games themselves. what a sad data nerd i am!</p>
<p>anyways, here&#8217;s the <a href="http://bit.ly/dkR46o">first few days&#8217; analysis</a> based on the use of official country tags (eg <a href="http://twitter.com/#search?q=%23aus">#aus</a>) in the search stream for <a href="http://twitter.com/#search?q=%23worldcup">#worldcup</a>.</p>
<p>tomorrow i might look in more detail at one of the games, wondering how many variants of &#8216;goooooooal&#8217; i&#8217;ll find :D</p>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/06/14/worldcup-twitter-analytics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>a quick study in tf/icf</title>
		<link>http://matpalm.com/blog/2010/06/09/a-quick-study-in-tficf/</link>
		<comments>http://matpalm.com/blog/2010/06/09/a-quick-study-in-tficf/#comments</comments>
		<pubDate>Wed, 09 Jun 2010 11:58:08 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=638</guid>
		<description><![CDATA[while doing some more research on trending algorithms i came across a cool little paper about term frequency normalisation for streaming data: TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams.
i&#8217;m finding streaming related algorithms quite interesting lately and think they are the way forward in terms of dealing with large amounts of constant [...]]]></description>
			<content:encoded><![CDATA[<p>while doing some more research on trending algorithms i came across a cool little paper about term frequency normalisation for streaming data: <a href="http://aser.ornl.gov/publications/ICMLA06.pdf">TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams</a>.</p>
<p>i&#8217;m finding streaming related algorithms quite interesting lately and think they are the way forward in terms of dealing with large amounts of constant data. it&#8217;s just not feasible to use algorithms that expect you to have all the data at any given time; they force you to reprocess all the data you&#8217;ve ever seen as you get new examples. my thinking is the best solutions are the ones that build models of the data and fold in new examples in batches. anyways, i&#8217;m getting off topic already.</p>
<p>tf/icf as presented in the paper is a variant on the classic <a href="http://en.wikipedia.org/wiki/Tf–idf">tf/idf</a> for term weighting, but instead of requiring the weights in all docs to be recalculated as a new document comes along (as tf/idf strictly does) it just approximates based on what has been seen before.</p>
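<p>to make the &#8216;weigh against what has been seen before&#8217; idea concrete, here&#8217;s a minimal python sketch. it is not the formula from the paper: the class name and the smoothed log weighting are assumptions picked for illustration. the only point is that each new doc is weighted using the document frequencies collected so far and earlier docs are never reweighted.</p>

```python
import math
from collections import Counter

class StreamingWeigher:
    """weights each new doc against the docs seen so far; old docs are never revisited."""

    def __init__(self):
        self.docs_seen = 0
        self.doc_freq = Counter()  # term -> number of earlier docs containing it

    def weigh(self, terms):
        tf = Counter(terms)
        # assumed weighting: damped term frequency times a smoothed inverse doc frequency
        weights = {
            t: math.log(1 + c) * math.log((self.docs_seen + 1) / (self.doc_freq[t] + 0.5))
            for t, c in tf.items()
        }
        # only now fold this doc into the running stats
        self.docs_seen += 1
        self.doc_freq.update(tf.keys())
        return weights

weigher = StreamingWeigher()
first = weigher.weigh(["cheese", "cheese", "toast"])
second = weigher.weigh(["cheese", "wine"])
```

<p>in the second doc &#8216;wine&#8217; outweighs &#8216;cheese&#8217; since &#8216;cheese&#8217; has already been seen in an earlier doc.</p>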
<p>so how does it do? actually quite well, <a href="http://www.matpalm.com/tf_icf">here&#8217;s my experimental breakdown</a></p>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/06/09/a-quick-study-in-tficf/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>5 minute ggobi demo</title>
		<link>http://matpalm.com/blog/2010/06/04/5-minute-ggobi-demo/</link>
		<comments>http://matpalm.com/blog/2010/06/04/5-minute-ggobi-demo/#comments</comments>
		<pubDate>Fri, 04 Jun 2010 13:12:53 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[5min]]></category>
		<category><![CDATA[brutally short intro]]></category>
		<category><![CDATA[ggobi]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=634</guid>
		<description><![CDATA[
brutally short demo of ggobi from Mat Kelcey on Vimeo.
note: non embedded version has higher res at full screen
]]></description>
			<content:encoded><![CDATA[<p><object width="400" height="300"><param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="movie" value="http://vimeo.com/moogaloop.swf?clip_id=12292239&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=&amp;fullscreen=1" /><embed src="http://vimeo.com/moogaloop.swf?clip_id=12292239&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=&amp;fullscreen=1" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="400" height="300"></embed></object>
<p><a href="http://vimeo.com/12292239">brutally short demo of ggobi</a> from <a href="http://vimeo.com/user2935988">Mat Kelcey</a> on <a href="http://vimeo.com">Vimeo</a>.</p>
<p>note: <a href="http://bit.ly/ctZmdA">non embedded version</a> has higher res at full screen</p>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/06/04/5-minute-ggobi-demo/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>how many terms in a trend?</title>
		<link>http://matpalm.com/blog/2010/05/11/how-many-terms-in-a-trend/</link>
		<comments>http://matpalm.com/blog/2010/05/11/how-many-terms-in-a-trend/#comments</comments>
		<pubDate>Tue, 11 May 2010 09:46:11 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[algorithms]]></category>
		<category><![CDATA[e15]]></category>
		<category><![CDATA[puzzled]]></category>
		<category><![CDATA[trending]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=627</guid>
		<description><![CDATA[i&#8217;ve been poking around with a simple trending algorithm over the last few weeks and have uncovered a problem that, like most interesting ones, i&#8217;m not sure how to solve. the question revolves around discovering multi term trends. 
a sensible place to start when looking for trends is to consider single terms but what if [...]]]></description>
			<content:encoded><![CDATA[<p>i&#8217;ve been poking around with a simple trending algorithm over the last few weeks and have uncovered a problem that, like most interesting ones, i&#8217;m not sure how to solve. the question revolves around discovering multi term trends.</p>
<p>a sensible place to start when looking for trends is to consider single terms but what if we ended up with three equally trending terms &#8216;happy&#8217;, &#8216;new&#8217; and &#8216;year&#8217;? it&#8217;s pretty obvious that the actual trend is &#8216;happy new year&#8217; but what is the best way to express this as a single trend in an algorithmic sense?</p>
<p>one approach i&#8217;ve been playing with is to collect unigrams, bigrams and trigrams (1,2,3 term &#8216;phrases&#8217;) and consider the cases where the terms overlap. basically if &#8216;happy new year&#8217; is trending then, in some sense, we can ignore trends for &#8216;happy new&#8217;, &#8216;new year&#8217;, &#8216;happy&#8217;, &#8216;new&#8217; and &#8216;year&#8217;. but does this result in too many false negatives? would we miss &#8216;happy&#8217; as a trend if lots of people were chirpy about the change of year (as they usually are, on new year&#8217;s eve)?</p>
<p>rather than outright ignore we could somehow reduce the weighting by removing the double counting.</p>
<p>eg if we had 3 trends;  (free beer,11), (free,12) &#038; (beer,25)<br />
we can take 11 (from the 2gram) off both 1grams to give  (free beer,11), (free,1) &#038; (beer,14)<br />
showing that &#8216;beer&#8217;, outside of the phrase &#8216;free beer&#8217;, is perhaps a trend in itself (as it should be)</p>
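<p>a minimal python sketch of that subtraction, assuming the trends arrive as a dict of phrase to count and only handling the 2gram case. the function name is made up, and it makes no attempt to deal with overlapping 2grams that share a term.</p>

```python
def remove_double_counting(counts):
    """subtract each bigram's count from the unigrams it is made of."""
    adjusted = dict(counts)
    for phrase, count in counts.items():
        words = phrase.split()
        if len(words) == 2:
            for word in words:
                if word in adjusted:
                    adjusted[word] -= count
    return adjusted

trends = {"free beer": 11, "free": 12, "beer": 25}
adjusted = remove_double_counting(trends)
# adjusted == {"free beer": 11, "free": 1, "beer": 14}
```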
<p>this feels like it might work but would be non trivial (read: fun) to implement</p>
<p>another slightly different problem is around the handling of retweeting. my experiments have shown a huge amount of the &#8216;trends&#8217; found are related to retweets, which is fine in itself, but it gives quite strange trends since the retweeted portion of the text is usually quite long.</p>
<p>for example; say lots of people are retweeting something and, as some people do, are adding various bits and pieces at the beginning and end; eg &#8216;RT @bob omg i just found a peanut&#8217; or &#8216;omg i just found a peanut; via @bob lucky him!!&#8217;</p>
<p>if we&#8217;re considering bigrams (which i am in my current implementation) we end up with an odd selection of trends such as &#8216;just found&#8217;, &#8216;a peanut&#8217;, &#8216;omg i&#8217;, &#8216;found a&#8217;, &#8216;i just&#8217; and in these cases it&#8217;d be great to be able to just stitch them together into the common retweeted element &#8216;omg i just found a peanut&#8217;. </p>
<p>we could &#8216;solve&#8217; this problem by not just considering 1, 2 and 3 grams but considering <em>all</em> possible n-grams for each tweet and employing the technique we spoke of above of reducing the counts. it&#8217;d almost be feasible, since tweets are never that long, but feels uber clumsy and i&#8217;d hate to see the big-O of that algorithm ;)</p>
<p>this seems more like a stitching problem of some kind;  eg if we have 4 grams &#8216;omg i just found&#8217;, &#8216;i just found a&#8217;, &#8216;just found a peanut&#8217; perhaps we can identify the non trivial overlap and stitch them together (?)</p>
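<p>one possible shape for the stitching, as a python sketch: join two n-grams on their longest overlap where the end of one matches the start of the next. this assumes the n-grams are already in the right order, which is the hard part in practice, and both function names are hypothetical.</p>

```python
def stitch(left, right):
    """join two word lists on their longest overlap, or return None if they don't overlap."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return None

def stitch_chain(ngrams):
    """stitch a list of n-grams (given in order) into one phrase."""
    phrase = ngrams[0].split()
    for ngram in ngrams[1:]:
        joined = stitch(phrase, ngram.split())
        if joined is None:
            return None
        phrase = joined
    return " ".join(phrase)

grams = ["omg i just found", "i just found a", "just found a peanut"]
stitched = stitch_chain(grams)
# stitched == "omg i just found a peanut"
```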
<p>not sure, there are a number of things to try. was hoping that brain dumping some of this would help me see the light but nothing obvious jumps out :(</p>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/05/11/how-many-terms-in-a-trend/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>trending topics in tweets about cheese; part2</title>
		<link>http://matpalm.com/blog/2010/05/01/trending-topics-in-tweets-about-cheese-part2/</link>
		<comments>http://matpalm.com/blog/2010/05/01/trending-topics-in-tweets-about-cheese-part2/#comments</comments>
		<pubDate>Sat, 01 May 2010 06:54:53 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[e15]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[trending]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=559</guid>
		<description><![CDATA[prototyping in ruby was a great way to prove the concept but my main motivation for this project was to play some more with pig.
the main approach will be

maintain a relation with one record per ngram we want to monitor for trending
fold 1 hour&#8217;s worth of new data at a time into the model
check the [...]]]></description>
			<content:encoded><![CDATA[<p>prototyping in ruby was a great way to prove the concept but my main motivation for this project was to play some more with pig.</p>
<p>the main approach will be</p>
<ol>
<li>maintain a relation with one record per ngram we want to monitor for trending</li>
<li>fold 1 hour&#8217;s worth of new data at a time into the model</li>
<li>check the entries for the latest hour for any trends</li>
</ol>
<p>the <a href="http://github.com/matpalm/trending/blob/master/pig/trending.pig">full version is on github</a>. read on for a line by line walkthrough</p>
<p><span id="more-559"></span></p>
<p>the ruby impl used the simplest approach possible for calculating mean and stddev; maintain a record of all the values seen so far and recalculate for each new value.</p>
<p>for our pig version we&#8217;ll take a fixed space approach. rather than keep <em>all</em> the values for each time series it turns out we can get away with storing just 3&#8230;</p>
<ol>
<li>n: the number of values</li>
<li>m: the current mean of all values</li>
<li>ms: the current mean of the squares of all values</li>
</ol>
<p>the idea is that the mean<sub>n+1</sub> = ( n * mean<sub>n</sub> + new value ) / ( n+1 )<br />
a similar update gives the mean of the squares<sub>n+1</sub>, and the standard deviation<sub>n+1</sub> can then be derived from the mean<sub>n+1</sub> and the mean of the squares<sub>n+1</sub></p>
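<p>here&#8217;s the fixed space update as a small python sketch, separate from the pig script. the function names are made up, and the standard deviation is the population one: the square root of the mean of the squares minus the square of the mean.</p>

```python
import math

def fold_in(n, m, ms, value):
    """return (n, m, ms) after adding one more value to the running stats."""
    n1 = n + 1
    m1 = (n * m + value) / n1
    ms1 = (n * ms + value * value) / n1
    return n1, m1, ms1

def std_dev(m, ms):
    """population std dev: sqrt(mean of squares - square of mean)."""
    # clamp at zero in case rounding pushes the difference slightly negative
    return math.sqrt(max(ms - m * m, 0.0))

values = [3, 1, 4, 1, 5]
n, m, ms = 0, 0.0, 0.0
for v in values:
    n, m, ms = fold_in(n, m, ms, v)
```

<p>after the loop n, m and ms describe all five values even though none of them were kept around.</p>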
<p>let&#8217;s go over the pig script one command at a time.</p>
<p>we&#8217;ll assume we&#8217;ve already run it 6 times and we&#8217;re now folding in the 7th hour</p>
<p>the first thing is to load the existing version of the model, in this case stored in the file &#8216;model.006&#8217;<br />
it contains everything we need for checking the trending for each ngram</p>
<pre><span>
&gt; raw_model = load 'model.006' as (key:chararray, n:int, m:double, ms:double);

&gt; describe raw_model;
raw_model: {key: chararray, n: int, m: double, ms: double}

&gt; dump raw_model;
(a b,6,1.3333333333333333,2.0)
(a a,3,1.3333333333333333,2.0)
(a c,4,1.25,1.75)
(a d,1,2.0,4.0)
(b a,3,1.0,1.0)
(b d,1,2.0,4.0)
(b c,6,1.5,2.5)
(d c,1,2.0,4.0)
(c a,4,1.0,1.0)
(d e,1,1.0,1.0)
(c d,4,2.0,4.0)
(d a,2,1.0,1.0)
</span></pre>
<p>next we tag each entry from the loaded model with a zero frequency. we&#8217;ll see later how this makes it easier to fold in the new data.</p>
<pre><span>
&gt; model = foreach raw_model generate key, n, m, ms, 0 as f;

&gt; describe model;
model: {key: chararray, n: int, m: double, ms: double, f: int}

&gt; dump model;
(a b,6,1.3333333333333333,2.0,0)
(a a,3,1.3333333333333333,2.0,0)
(a c,4,1.25,1.75,0)
(a d,1,2.0,4.0,0)
(b a,3,1.0,1.0,0)
(b d,1,2.0,4.0,0)
(b c,6,1.5,2.5,0)
(d c,1,2.0,4.0,0)
(c a,4,1.0,1.0,0)
(d e,1,1.0,1.0,0)
(c d,4,2.0,4.0,0)
(d a,2,1.0,1.0,0)
</span></pre>
<p>now that we&#8217;ve loaded the existing version of the model we can load the next hour of data, in this case contained in &#8216;chunks/006&#8217;.</p>
<pre><span>
&gt; next_chunk = load 'chunks/006';

&gt; dump next_chunk;
(a b a b)
(c d a b)
(a b c)
(a d d d)
</span></pre>
<p>from the text we want to get the frequency of the ngrams.<br />
the breaking apart of each line into its 2-grams is handled by a simple ruby script; <a href="http://github.com/matpalm/trending/blob/master/pig/ngram.rb">ngram.rb</a></p>
<pre><span>
&gt; define ngramer `ngram.rb` ship('ngram.rb');
&gt; ngrams = stream next_chunk through ngramer as (key:chararray);

&gt; describe ngrams;
ngrams: {key: chararray}

&gt; dump ngrams;
(a b)
(b a)
(a b)
(c d)
(d a)
(a b)
(a b)
(b c)
(a d)
(d d)
(d d)
</span></pre>
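<p>i haven&#8217;t reproduced ngram.rb here, but judging by the output above the splitting step amounts to something like this python sketch (the function name is an assumption): pair up each token with the one that follows it.</p>

```python
def bigrams(line):
    """all pairs of adjacent tokens in a whitespace separated line."""
    tokens = line.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

chunk = ["a b a b", "c d a b", "a b c", "a d d d"]
ngrams = [gram for line in chunk for gram in bigrams(line)]
```

<p>run over the four lines of the chunk this gives the same eleven 2grams, in the same order, as the dump above.</p>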
<p>calculating the frequencies of the ngrams is a simple two step process of first grouping by the key&#8230;</p>
<pre><span>
&gt; ngrams_grouped = group ngrams by key;

&gt; describe ngrams_grouped;
ngrams_grouped: {group: chararray, ngrams: {key: chararray}}

&gt; dump ngrams_grouped;
(a b,{(a b),(a b),(a b),(a b)})
(a d,{(a d)})
(b a,{(b a)})
(b c,{(b c)})
(c d,{(c d)})
(d a,{(d a)})
(d d,{(d d),(d d)})
</span></pre>
<p>&#8230;and then generating the key, frequency pairs</p>
<pre><span>
&gt; ngram_freq = foreach ngrams_grouped generate group as key, SIZE(ngrams) as f;

&gt; describe ngram_freq;
ngram_freq: {key: chararray, f: long}

&gt; dump ngram_freq;
(a b,4L)
(a d,1L)
(b a,1L)
(b c,1L)
(c d,1L)
(d a,1L)
(d d,2L)
</span></pre>
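<p>outside of pig the group then count steps collapse into a single counter. a tiny python sketch for comparison, with the 2grams from the chunk typed in by hand:</p>

```python
from collections import Counter

ngrams = ["a b", "b a", "a b", "c d", "d a", "a b",
          "a b", "b c", "a d", "d d", "d d"]
ngram_freq = Counter(ngrams)
```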
<p>from this we know all the distinct 2grams that are contained in the next chunk we&#8217;re analysing<br />
for each of these 2grams one of two things is true;</p>
<ol>
<li>either the ngram has been seen before (thus it has an entry in the model)</li>
<li>this is the first time we&#8217;ve seen it, in which case we need to add a new entry to the model</li>
</ol>
<p>the easiest way i&#8217;ve worked out in pig to handle this is to generate a &#8216;seed&#8217; model just for this chunk and fold it into the real model by unioning the relations</p>
<p>(i&#8217;ve been using pig 0.3 to keep in line with the current version of elastic map reduce but it might be easier with the various extra joins that are in later versions of pig)</p>
<p>so first we generate the &#8216;seed&#8217; relation&#8230;</p>
<pre><span>
&gt; seed_values = foreach ngram_freq generate key, 0 as n, 0.0 as m, 0.0 as ms, f;

&gt; describe seed_values;
seed_values: {key: chararray, n: int, m: double, ms: double, f: long}

&gt; dump seed_values;
(a b,0,0.0,0.0,4L)
(a d,0,0.0,0.0,1L)
(b a,0,0.0,0.0,1L)
(b c,0,0.0,0.0,1L)
(c d,0,0.0,0.0,1L)
(d a,0,0.0,0.0,1L)
(d d,0,0.0,0.0,2L)
</span></pre>
<p>&#8230;and fold it in with a 3 step process; unioning with the original model, grouping and collapsing</p>
<p>first the union&#8230;</p>
<pre><span>
&gt; model_plus_seed = union model, seed_values;

&gt; describe model_plus_seed;
model_plus_seed: {key: chararray, n: int, m: double, ms: double, f: long}

&gt; dump model_plus_seed;
(a b,0,0.0,0.0,4L)
(a b,6,1.3333333333333333,2.0,0L)
(a d,0,0.0,0.0,1L)
(a a,3,1.3333333333333333,2.0,0L)
(b a,0,0.0,0.0,1L)
(a c,4,1.25,1.75,0L)
(b c,0,0.0,0.0,1L)
(a d,1,2.0,4.0,0L)
(c d,0,0.0,0.0,1L)
(b a,3,1.0,1.0,0L)
(d a,0,0.0,0.0,1L)
(b d,1,2.0,4.0,0L)
(d d,0,0.0,0.0,2L)
(b c,6,1.5,2.5,0L)
(d c,1,2.0,4.0,0L)
(c a,4,1.0,1.0,0L)
(d e,1,1.0,1.0,0L)
(c d,4,2.0,4.0,0L)
(d a,2,1.0,1.0,0L)
</span></pre>
<p>then the grouping&#8230;</p>
<pre><span>
&gt; model_plus_seed2 = group model_plus_seed by key;

&gt; describe model_plus_seed2;
model_plus_seed2: {group: chararray, model_plus_seed: {key: chararray, n: int, m: double, ms: double, f: long}}

&gt; dump model_plus_seed2;
(a a,{(a a,3,1.3333333333333333,2.0,0L)})
(a b,{(a b,0,0.0,0.0,4L),(a b,6,1.3333333333333333,2.0,0L)})
(a c,{(a c,4,1.25,1.75,0L)})
(a d,{(a d,0,0.0,0.0,1L),(a d,1,2.0,4.0,0L)})
(b a,{(b a,0,0.0,0.0,1L),(b a,3,1.0,1.0,0L)})
(b c,{(b c,0,0.0,0.0,1L),(b c,6,1.5,2.5,0L)})
(b d,{(b d,1,2.0,4.0,0L)})
(c a,{(c a,4,1.0,1.0,0L)})
(c d,{(c d,0,0.0,0.0,1L),(c d,4,2.0,4.0,0L)})
(d a,{(d a,0,0.0,0.0,1L),(d a,2,1.0,1.0,0L)})
(d c,{(d c,1,2.0,4.0,0L)})
(d d,{(d d,0,0.0,0.0,2L)})
(d e,{(d e,1,1.0,1.0,0L)})
</span></pre>
<p>and finally the collapsing using MAX&#8230;</p>
<pre><span>
&gt; model_n =
     foreach model_plus_seed2 generate
        group as key,
        MAX(model_plus_seed.n) as n,
        MAX(model_plus_seed.m) as m,
        MAX(model_plus_seed.ms) as ms,
        MAX(model_plus_seed.f) as f;

&gt; describe model_n;
model_n: {key: chararray, n: int, m: double, ms: double, f: long}

&gt; dump model_n;
(a a,3,1.3333333333333333,2.0,0L)
(a b,6,1.3333333333333333,2.0,4L)
(a c,4,1.25,1.75,0L)
(a d,1,2.0,4.0,1L)
(b a,3,1.0,1.0,1L)
(b c,6,1.5,2.5,1L)
(b d,1,2.0,4.0,0L)
(c a,4,1.0,1.0,0L)
(c d,4,2.0,4.0,1L)
(d a,2,1.0,1.0,1L)
(d c,1,2.0,4.0,0L)
(d d,0,0.0,0.0,2L)
(d e,1,1.0,1.0,0L)
</span></pre>
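<p>the MAX trick works because each group holds at most one model row (with f = 0) and one seed row (with n, m and ms all 0) and none of the values are negative, so the max of each column always picks the real value. the same fold written as a python sketch, with a made up function name and dicts standing in for the relations, is effectively an outer join:</p>

```python
def fold(model, freqs):
    """merge {key: (n, m, ms)} with {key: f} into {key: (n, m, ms, f)}."""
    merged = {}
    for key in set(model) | set(freqs):
        n, m, ms = model.get(key, (0, 0.0, 0.0))  # unseen ngrams get the seed values
        merged[key] = (n, m, ms, freqs.get(key, 0))  # absent from this chunk means f = 0
    return merged

# a few rows from the example above, not the whole model
model = {"a b": (6, 4 / 3, 2.0), "a a": (3, 4 / 3, 2.0), "d e": (1, 1.0, 1.0)}
freqs = {"a b": 4, "d d": 2}
model_n = fold(model, freqs)
```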
<p>at this stage we have the original model woven in with the new data but still need to update the mean and mean of squares for the values from the latest hour.</p>
<p>we can do this by first separating out the values we need to update based on whether the frequency is non zero<br />
(recall non zero frequencies represent ngrams from the latest hour)</p>
<pre><span>
&gt; split model_n into to_update if f&gt;0, not_to_update if f==0;

&gt; dump to_update;
(a b,6,1.3333333333333333,2.0,4L)
(a d,1,2.0,4.0,1L)
(b a,3,1.0,1.0,1L)
(b c,6,1.5,2.5,1L)
(c d,4,2.0,4.0,1L)
(d a,2,1.0,1.0,1L)
(d d,0,0.0,0.0,2L)

&gt; dump not_to_update;
(a a,3,1.3333333333333333,2.0,0L)
(a c,4,1.25,1.75,0L)
(b d,1,2.0,4.0,0L)
(c a,4,1.0,1.0,0L)
(d c,1,2.0,4.0,0L)
(d e,1,1.0,1.0,0L)
</span></pre>
<p>we can now update the mean and mean of squares based on the new frequency values</p>
<pre><span>
&gt; model_n1 =
     foreach to_update {
         -- running mean of the frequencies
         m2  = ((n*m)+f)/(n+1);
         -- running mean of the squared frequencies
         ms2 = ((n*ms)+(f*f))/(n+1);
         generate key, n+1 as n, m2 as m, ms2 as ms, f;
     }

&gt; describe model_n1;
model_n1: {key: chararray, n: int, m: double, ms: double, f: long}

&gt; dump model_n1;
(a b,7,1.7142857142857142,4.0,4L)
(a d,2,1.5,2.5,1L)
(b a,4,1.0,1.0,1L)
(b c,7,1.4285714285714286,2.2857142857142856,1L)
(c d,5,1.8,3.4,1L)
(d a,3,1.0,1.0,1L)
(d d,1,2.0,4.0,2L)
</span></pre>
<p>these new rows, along with the rows we didn&#8217;t update, can be stored as the model at time n+1 ready for the next hour&#8217;s chunk</p>
<pre><span>
&gt; to_store = union model_n1, not_to_update;
&gt; store to_store into 'model.007';

&gt; dump to_store;
(a b,7,1.7142857142857142,4.0,4L)
(a a,3,1.3333333333333333,2.0,0L)
(a d,2,1.5,2.5,1L)
(a c,4,1.25,1.75,0L)
(b a,4,1.0,1.0,1L)
(b d,1,2.0,4.0,0L)
(b c,7,1.4285714285714286,2.2857142857142856,1L)
(c a,4,1.0,1.0,0L)
(c d,5,1.8,3.4,1L)
(d c,1,2.0,4.0,0L)
(d a,3,1.0,1.0,1L)
(d e,1,1.0,1.0,0L)
(d d,1,2.0,4.0,2L)
</span></pre>
<p>now that we&#8217;ve updated the model we can start making the trending check!</p>
<p>first step is to filter out entries that correspond to ngrams we are seeing for the first time<br />
( a new item can&#8217;t be trending )</p>
<pre><span>
&gt; requiring_trending_check = filter model_n1 by n&gt;1;

&gt; dump requiring_trending_check;
(a b,7,1.7142857142857142,4.0,4L)
(a d,2,1.5,2.5,1L)
(b a,4,1.0,1.0,1L)
(b c,7,1.4285714285714286,2.2857142857142856,1L)
(c d,5,1.8,3.4,1L)
(d a,3,1.0,1.0,1L)
</span></pre>
<p>and finally we can make the trending calculation!<br />
we can calculate the minimum trending value, based on mean + twice std dev&#8230;</p>
<pre><span>
&gt; calc_min_trending =
     foreach requiring_trending_check {
        sd_lhs = n * ms;
        sd_rhs = n * (m*m);
        sd = org.apache.pig.piggybank.evaluation.math.SQRT((sd_lhs-sd_rhs)/n);
        min_trend_value = m + (2*sd);
        generate key, f, m as mean, sd as std_dev,
                 min_trend_value as min_trend_value,
                 f / min_trend_value as percent_over_trend;
    }

&gt; describe calc_min_trending;
calc_min_trending: {key: chararray, f: long, mean: double, std_dev: double, min_trend_value: double, percent_over_trend: double}

&gt; dump calc_min_trending;
(a b,4L,1.7142857142857142,1.0301575072754257,3.7746007288365657,1.059714732061981)
(a d,1L,1.5,0.5,2.5,0.4)
(b a,1L,1.0,0.0,1.0,1.0)
(b c,1L,1.4285714285714286,0.4948716593053934,2.4183147471822153,0.4135111036167584)
(c d,1L,1.8,0.4,2.6,0.3846153846153848)
(d a,1L,1.0,0.0,1.0,1.0)
</span></pre>
<p>&#8230; and any entries with a frequency over the min trending value are deemed trending!<br />
( for this example it&#8217;s only the one )</p>
<pre><span>
&gt; trending = filter calc_min_trending by percent_over_trend &gt; 1;

&gt; describe trending;
trending: {key: chararray, f: long, mean: double, std_dev: double, min_trend_value: double, percent_over_trend: double}

&gt; dump trending;
(a b,4L,1.7142857142857142,1.0301575072754257,3.7746007288365657,1.059714732061981)
</span></pre>
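<p>to sanity check these numbers outside of pig, here&#8217;s a minimal python sketch of the same calculation. it assumes the model columns are n, mean ( m ), mean of squares ( ms ) and latest frequency ( f ), which is what the dumps above suggest; trend_check is just a name made up for this example. note that the n&#8217;s in the pig version cancel out, so the std dev is simply sqrt(ms - m*m)</p>

```python
import math

def trend_check(m, ms, f, num_std_devs=2.0):
    """m: mean, ms: mean of squares, f: latest frequency (assumed column meanings).
    returns (std dev, min trend value, f / min trend value)."""
    # population std dev from the two running means; clamp tiny negative
    # values that floating point rounding can produce
    sd = math.sqrt(max(ms - m * m, 0.0))
    min_trend_value = m + num_std_devs * sd
    return sd, min_trend_value, f / min_trend_value

# a few (m, ms, f) rows copied from the dump of requiring_trending_check
rows = {
    "a b": (1.7142857142857142, 4.0, 4),
    "a d": (1.5, 2.5, 1),
    "c d": (1.8, 3.4, 1),
}

for key, (m, ms, f) in rows.items():
    sd, min_trend_value, ratio = trend_check(m, ms, f)
    # only entries with ratio > 1 are deemed trending
    print(key, round(sd, 4), round(min_trend_value, 4), round(ratio, 4), ratio > 1)
```

<p>of these three rows only &#8216;a b&#8217; comes out with a ratio over 1, which agrees with the pig output.</p>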
<p>as a normalisation step i&#8217;ve also been playing with factoring in the frequency itself ( multiplying by log10 of the frequency ),<br />
though i haven&#8217;t come to a conclusion on whether this is a better metric or not&#8230;</p>
<pre><span>
&gt; trending2 =
     foreach trending {
        normalised_trend_value = org.apache.pig.piggybank.evaluation.math.LOG10(f) * percent_over_trend;
        generate key, min_trend_value, percent_over_trend, normalised_trend_value as normalised_trend_value;
     }

&gt; describe trending2;
trending2: {key: chararray, min_trend_value: double, percent_over_trend: double, normalised_trend_value: double}

&gt; dump trending2;
(a b,3.7746007288365657,1.059714732061981,0.6380118423953504)
</span></pre>
<p>and finally store the top trending values for processing!</p>
<pre><span>
&gt; trending_sorted = order trending2 by normalised_trend_value desc;
&gt; top_50 = limit trending_sorted 50;
&gt; store top_50 into 'trending.model.006';
</span></pre>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/05/01/trending-topics-in-tweets-about-cheese-part2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>trending topics in tweets about cheese; part1</title>
		<link>http://matpalm.com/blog/2010/04/27/trending-topics-in-tweets-about-cheese-part1/</link>
		<comments>http://matpalm.com/blog/2010/04/27/trending-topics-in-tweets-about-cheese-part1/#comments</comments>
		<pubDate>Tue, 27 Apr 2010 13:42:20 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[cheese]]></category>
		<category><![CDATA[e15]]></category>
		<category><![CDATA[trending]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=499</guid>
		<description><![CDATA[trending topics
what does it mean for a topic to be &#8216;trending&#8217;? consider the following time series (430e3 tweets containing cheese collected over a month period bucketed into hourly timeslots)

without a formal definition we can just look at this and say that the series was trending where there was a spike just before time 600. as [...]]]></description>
			<content:encoded><![CDATA[<h3>trending topics</h3>
<p>what does it mean for a topic to be &#8216;trending&#8217;? consider the following time series (430e3 tweets containing cheese, collected over a one month period and bucketed into hourly timeslots)</p>
<p><a href="http://www.matpalm.com/trending/tweets_over_day.60.nonaggregated.png"><img src="http://www.matpalm.com/trending/tweets_over_day.60.nonaggregated.png" alt="" /></a></p>
<p>without a formal definition we can just look at this and say that the series was trending where there was a spike just before time 600. as a start then let&#8217;s just define a trend as a value that was greater than what was &#8216;expected&#8217;.</p>
<h3>how can we calculate trending?</h3>
<p>one really nice simple algorithm for detecting a trend is to say a value, v, is trending if v &gt; mean + 3 * standard deviation of the data seen so far. (thanks <a href="http://www.twitter.com/peteskomoroch">@peteskomoroch</a> for the suggestion, works a treat)</p>
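<p>as a rough illustration, here&#8217;s a minimal python sketch of that rule. the function name and the hourly counts are made up for the example, and it makes one assumption the text leaves open: each value is compared against the mean and std dev of the values seen <em>before</em> it.</p>

```python
import math

def find_trending(values, num_std_devs=3.0):
    """return (index, value, threshold) for each value that is above
    mean + num_std_devs * std dev of the values seen before it."""
    trending = []
    n = 0            # number of values seen so far
    total = 0.0      # running sum of values
    total_sq = 0.0   # running sum of squared values
    for i, v in enumerate(values):
        if n > 1:  # need a little history before a threshold means anything
            mean = total / n
            variance = max(total_sq / n - mean * mean, 0.0)
            threshold = mean + num_std_devs * math.sqrt(variance)
            if v > threshold:
                trending.append((i, v, threshold))
        # only now fold the current value into the history
        n += 1
        total += v
        total_sq += v * v
    return trending

# made up hourly counts, not the real cheese data
counts = [10, 12, 11, 9, 10, 40, 11, 10]
for i, v, threshold in find_trending(counts):
    print(i, v, round(threshold, 2))
```

<p>keeping just a count, a sum and a sum of squares means nothing has to be recalculated from scratch when a new timeslot arrives.</p>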
<p>let&#8217;s consider the same time series as before but this time with some overlaid data;<br />
<span style="color: green;">green &#8211; the mean</span><br />
<span style="color: red;">red &#8211; minimum trend value ( = mean + 3 * std dev )</span><br />
<span style="color: blue;">blue &#8211; instances where the value &gt; minimum trend value</span></p>
<p><a href="http://www.matpalm.com/trending/tweets_over_day.60.trending.png"><img src="http://www.matpalm.com/trending/tweets_over_day.60.trending.png" alt="" /></a></p>
<p><span id="more-499"></span></p>
<p>here&#8217;s a zoom in on the last 200 values</p>
<p><a href="http://www.matpalm.com/trending/tweets_over_day.60.trending.zoom.png"><img src="http://www.matpalm.com/trending/tweets_over_day.60.trending.zoom.png" alt="" /></a></p>
<p>this works surprisingly well; the mean gives a solid expectation of the value, with the standard deviation covering the daily periodic nature of the data.</p>
<p>it&#8217;s not perfect though as this system <em>only</em> ever allows a trend around the peaks of the cycle.</p>
<p>for example consider the troughs, which have a frequency value around 250. if we had a value in one of those timeslots that was 1000, ie four times what was expected given that time of day, it would not be considered trending since the value has to be over 1500</p>
<h3>facet by hour</h3>
<p>one way to handle this is to not have a single time series but instead maintain 24 time series, one for each hour of the day.</p>
<p>faceting in this way gives the following trending</p>
<p><a href="http://www.matpalm.com/trending/tweets_over_day.60.periodic_trending.png"><img src="http://www.matpalm.com/trending/tweets_over_day.60.periodic_trending.png" alt="" /></a><br />
<a href="http://www.matpalm.com/trending/tweets_over_day.60.periodic_trending.zoom.png"><img src="http://www.matpalm.com/trending/tweets_over_day.60.periodic_trending.zoom.png" alt="" /></a></p>
<p>and though this doesn&#8217;t present any cases of trends at a trough we can see it was prettttty close a number of times.</p>
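<p>faceting by hour can be sketched in python along these lines. it&#8217;s a sketch only: RunningStats and find_trending_by_hour are names made up for the example, the data is synthetic, and it assumes timeslot 0 falls on hour 0 so that the hour of day is just the timeslot index mod 24.</p>

```python
import math
from collections import defaultdict

class RunningStats:
    """count, sum and sum of squares; enough to derive mean and std dev."""
    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, v):
        self.n += 1
        self.total += v
        self.total_sq += v * v

    def threshold(self, num_std_devs=3.0):
        mean = self.total / self.n
        variance = max(self.total_sq / self.n - mean * mean, 0.0)
        return mean + num_std_devs * math.sqrt(variance)

def find_trending_by_hour(counts, num_std_devs=3.0):
    """counts[i] is the count for hourly timeslot i.
    returns the timeslot indexes that trend relative to their own hour of day."""
    stats = defaultdict(RunningStats)  # hour of day -> stats for that hour only
    trending = []
    for i, v in enumerate(counts):
        s = stats[i % 24]
        if s.n > 1 and v > s.threshold(num_std_devs):
            trending.append(i)
        s.add(v)
    return trending

def made_up_count(day, hour):
    # synthetic daily cycle: a peak of about 1000 in the afternoon, 250 otherwise
    base = 1000 if hour in range(12, 18) else 250
    return base + 10 * (day % 2)

counts = [made_up_count(d, h) for d in range(4) for h in range(24)]
counts[3 * 24 + 3] = 1000  # a spike in a trough timeslot on the last day
print(find_trending_by_hour(counts))
```

<p>with this toy data the spike in the trough is picked up, because it is only ever compared against earlier values for the same hour of day.</p>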
<h3>facet by ngram</h3>
<p>one other interesting way to facet, and the main purpose of this project, is to maintain a separate time series for each ngram in the tweet.</p>
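<p>to make that concrete, here&#8217;s a small python sketch of building per timeslot 2-gram counts. the tokenisation ( lowercase, split on whitespace ) is an assumption for the example rather than what the real code does, and the function names and tweets are made up.</p>

```python
from collections import Counter

def bigrams(text):
    """lowercase, split on whitespace and pair up adjacent tokens."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def bigram_counts_by_timeslot(tweets):
    """tweets is an iterable of (timeslot, text).
    returns {timeslot: Counter of 2-gram -> frequency in that timeslot}."""
    counts = {}
    for timeslot, text in tweets:
        counts.setdefault(timeslot, Counter()).update(bigrams(text))
    return counts

# made up tweets, not from the dataset
tweets = [
    (0, "grilled cheese for lunch"),
    (0, "mac and cheese"),
    (1, "more grilled cheese"),
]
counts = bigram_counts_by_timeslot(tweets)
print(counts[0]["grilled cheese"], counts[1]["grilled cheese"])
```

<p>reading one 2-gram&#8217;s frequency across the timeslots, in order, gives the time series for that 2-gram.</p>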
<p>the top 10 2-grams in my dataset are&#8230;</p>
<table>
<tbody>
<tr>
<td>freq</td>
<td>term1</td>
<td>term2</td>
</tr>
<tr>
<td>44389</td>
<td>and</td>
<td>cheese</td>
</tr>
<tr>
<td>33454</td>
<td>cheese</td>
<td>and</td>
</tr>
<tr>
<td>22815</td>
<td>mac</td>
<td>cheese</td>
</tr>
<tr>
<td>22532</td>
<td>grilled</td>
<td>cheese</td>
</tr>
<tr>
<td>18639</td>
<td>cream</td>
<td>cheese</td>
</tr>
<tr>
<td>15225</td>
<td>the</td>
<td>cheese</td>
</tr>
<tr>
<td>13592</td>
<td>mac</td>
<td>and</td>
</tr>
<tr>
<td>12967</td>
<td>chuck</td>
<td>cheese</td>
</tr>
<tr>
<td>12598</td>
<td>of</td>
<td>cheese</td>
</tr>
<tr>
<td>12296</td>
<td>cheese</td>
<td>on</td>
</tr>
</tbody>
</table>
<p>let&#8217;s look at the time series for a few of them.</p>
<h4>#4 grilled cheese</h4>
<p><a href="http://www.matpalm.com/trending/tweets_over_day.60.grilledcheese.trending.png"><img src="http://www.matpalm.com/trending/tweets_over_day.60.grilledcheese.trending.png" alt="" /></a></p>
<p>we get an interesting result from the very first spike at around 225&#8230; poor fangirl <a href="http://www.twitter.com/rachelljonas">@rachelljonas</a> spent 10 minutes tweeting like crazy trying to get the attention of <a href="http://www.twitter.com/nickjonas">@nickjonas</a> (some popstar i&#8217;ve never heard of) and bumped up &#8216;grilled cheese&#8217; for a single timeslot (here&#8217;s <a href="http://www.matpalm.com/trending/rachelljonas.html">her attempt</a> to get his attention&#8230;)</p>
<p>this raises an interesting point about spam, and dealing with it should possibly be my first preprocessing / data cleaning step. how should we disregard too many tweets from a single user in a timeslot?</p>
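<p>one possible answer, sketched in python, is to cap how many tweets any one user can contribute to a timeslot. this is just an idea to try, not something the project does yet; cap_tweets_per_user, the default cap of 1 and the usernames are all made up for the example.</p>

```python
from collections import Counter

def cap_tweets_per_user(tweets, max_per_user=1):
    """tweets is an iterable of (timeslot, user, text).
    keeps at most max_per_user tweets per (timeslot, user), in input order."""
    seen = Counter()
    kept = []
    for timeslot, user, text in tweets:
        seen[(timeslot, user)] += 1
        if seen[(timeslot, user)] > max_per_user:
            continue  # this user has already used up their quota for the timeslot
        kept.append((timeslot, user, text))
    return kept

# made up tweets and usernames
tweets = [
    (225, "user_a", "grilled cheese 1"),
    (225, "user_a", "grilled cheese 2"),
    (225, "user_b", "grilled cheese"),
    (226, "user_a", "grilled cheese again"),
]
kept = cap_tweets_per_user(tweets)
print(len(kept))
```

<p>a single user can still show up in every timeslot; they just can&#8217;t dominate any one of them.</p>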
<p>the next spike at around 375 shows potentially my first true trending topic, a sudden increase in the discussion of making grilled cheese. <a href="http://www.matpalm.com/trending/grilled_cheese.html">the data</a> has no dups so looks like it was just grilled cheese time!</p>
<h4>#5 cream cheese</h4>
<p><a href="http://www.matpalm.com/trending/tweets_over_day.60.creamcheese.trending.png"><img src="http://www.matpalm.com/trending/tweets_over_day.60.creamcheese.trending.png" alt="" /></a></p>
<p>one major spike at about 376. looking at <a href="http://www.matpalm.com/trending/cream_cheese.html">the data</a>, there might have been a competition being run relating to #gno #bagelfuls ?</p>
<h4>#412 goats cheese</h4>
<p>nothing uber interesting with the &#8216;goats cheese&#8217; time series but it does illustrate an interesting point. for all the examples we&#8217;ve looked at so far each timeslot of an hour has included at least one entry for the 2gram. by the time we&#8217;re getting to the less frequent ngrams we see as many timeslots with a zero frequency as with a non zero frequency.</p>
<p><a href="http://www.matpalm.com/trending/tweets_over_day.60.goatscheese.withzerofill.trending.png"><img src="http://www.matpalm.com/trending/tweets_over_day.60.goatscheese.withzerofill.trending.png" alt="" /></a></p>
<p>interestingly if you only consider the cases where the frequency values are non zero i think you get a better sense of where the values are trending.</p>
<p><a href="http://www.matpalm.com/trending/tweets_over_day.60.goatscheese.trending.png"><img src="http://www.matpalm.com/trending/tweets_over_day.60.goatscheese.trending.png" alt="" /></a></p>
<p>this also turns out to make things easier to process :)</p>
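<p>the mechanics of the two variants can be sketched in python as below. the counts are made up and min_trend_value is a name invented for the example, so this only shows how the zero timeslots feed into the mean and threshold; it doesn&#8217;t reproduce the charts above.</p>

```python
import math

def min_trend_value(values, num_std_devs=3.0):
    """mean + num_std_devs * population std dev of the given values."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean + num_std_devs * math.sqrt(variance)

# made up hourly counts for an infrequent 2-gram
with_zero_fill = [0, 2, 0, 0, 1, 0, 3, 0, 0, 2]
non_zero_only = [v for v in with_zero_fill if v > 0]

for name, values in [("zero fill", with_zero_fill), ("non zero", non_zero_only)]:
    mean = sum(values) / len(values)
    print(name, round(mean, 2), round(min_trend_value(values), 2))
```

<p>in this toy case the zero fill drags the mean down from 2.0 to 0.8. dropping the zeros also means a timeslot where the ngram never appeared needs no row at all, which is where the processing saving comes from.</p>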
<h4>#1483 apple juice</h4>
<p>with &#8216;apple juice&#8217;, an even less frequent 2gram, the effect is even more noticeable if you ignore the zero frequency cases.</p>
<p><a href="http://www.matpalm.com/trending/tweets_over_day.60.applejuice.withzerofill.trending.png"><img src="http://www.matpalm.com/trending/tweets_over_day.60.applejuice.withzerofill.trending.png" alt="" /></a><br />
<a href="http://www.matpalm.com/trending/tweets_over_day.60.applejuice.trending.png"><img src="http://www.matpalm.com/trending/tweets_over_day.60.applejuice.trending.png" alt="" /></a></p>
<p>so with two ways of faceting the data, either timeslots or ngrams, the next step is porting the algorithm to pig so we can run it at scale. write up coming soon!</p>
<p>( code ( in a pretty raw form ) <a href="http://github.com/matpalm/trending">available at github</a> )</p>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/04/27/trending-topics-in-tweets-about-cheese-part1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
