<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>brain of matpalm &#187; Uncategorized</title>
	<atom:link href="http://matpalm.com/blog/category/uncategorized/feed/" rel="self" type="application/rss+xml" />
	<link>http://matpalm.com/blog</link>
	<description>thoughts from a data scientist wannabe</description>
	<lastBuildDate>Mon, 16 Aug 2010 11:38:22 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>my list of cool machine learning books</title>
		<link>http://matpalm.com/blog/2010/08/06/my-list-of-cool-machine-learning-books/</link>
		<comments>http://matpalm.com/blog/2010/08/06/my-list-of-cool-machine-learning-books/#comments</comments>
		<pubDate>Fri, 06 Aug 2010 08:35:20 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[books]]></category>
		<category><![CDATA[machine learning]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=746</guid>
		<description><![CDATA[for the last month or so i&#8217;ve had my head down and have been focusing more on theory (ie reading) than on practice (ie coding)
so rather than write no blog post here&#8217;s mats-list-of-cool-machine-learning-books in the order i think you should consider reading them&#8230;

1) &#8220;programming collective intelligence&#8221; by toby segaran




if you know nothing about machine learning [...]]]></description>
			<content:encoded><![CDATA[<p>for the last month or so i&#8217;ve had my head down and have been focusing more on theory (ie reading) than on practice (ie coding)</p>
<p>so rather than write no blog post here&#8217;s mats-list-of-cool-machine-learning-books in the order i think you should consider reading them&#8230;</p>
<p><span id="more-746"></span></p>
<h2>1) &#8220;<a href="http://amzn.to/a8iq8U" target="_blank">programming collective intelligence</a>&#8221; by toby segaran</h2>
<table>
<tbody style="vertical-align: top;">
<tr>
<td><img class="alignnone size-full wp-image-689" title="pci" src="http://matpalm.com/blog/wp-content/uploads/2010/08/pci.jpg" alt="" width="200" height="215" /></td>
<td><p>if you know nothing about machine learning and haven&#8217;t done maths since high school then this is the book for you.</p>
<p>it&#8217;s a fantastically accessible introduction to the field. it includes almost no theory and explains algorithms using actual python implementations.</p></td>
</tr>
</tbody>
</table>
<h2>2) &#8220;<a href="http://amzn.to/cvFi7t" target="_blank">data mining</a>&#8221; by witten and frank</h2>
<table>
<tbody style="vertical-align: top;">
<tr>
<td><img class="alignnone size-full wp-image-684" src="http://matpalm.com/blog/wp-content/uploads/2010/08/dm1.jpg" alt="" width="200" height="215" /></td>
<td><p>this book covers quite a bit more than programming c.i. while still being extremely practical (ie very few formulas).</p>
<p>about a fifth of the book is dedicated to weka, a machine learning workbench which was written by the authors. apart from the weka section this book has no code. i made <a href="http://vimeo.com/13051595">a little screencast on weka</a> a while back if you&#8217;re after a summary.</p></td>
</tr>
</tbody>
</table>
<h2>3) &#8220;<a href="http://amzn.to/b8gp6U" target="_blank">introduction to data mining</a>&#8221; by tan, steinbach and kumar</h2>
<table>
<tbody style="vertical-align: top;">
<tr>
<td><img class="alignnone size-full wp-image-687" src="http://matpalm.com/blog/wp-content/uploads/2010/08/itdm.jpg" alt="" width="200" height="215" /></td>
<td><p>covers almost the same material as the witten/frank text but delves a little bit deeper and with more rigour. includes no code (none of the books do from now on) with algorithms described by formulas.</p>
<p>has a number of appendices on linear algebra, probability, statistics etc so that you can read up if you&#8217;re a bit rusty or new to the fields (the witten/frank text lacks these).</p>
<p>some people might argue having both of these books is a waste since they cover so much of the same ground but i&#8217;ve always found multiple explanations from different authors to be a great way to help understand a topic. i read the witten/frank text first and am glad i did but if i could only keep one i&#8217;d keep this one.</p></td>
</tr>
</tbody>
</table>
<h2>intermission</h2>
<p>at this point you&#8217;ve probably got enough mental firepower to handle some of the uni level machine learning course notes that are floating about online.</p>
<p>if you&#8217;re keen to get a better foundation of the maths side of things it&#8217;d be worth working through <a href="http://www.youtube.com/watch?v=UzxYlbK2c7E">andrew ng&#8217;s lecture series on machine learning</a> (20 hours of a second year stanford course).</p>
<p>i also found <a href="http://www.cs.cmu.edu/~awm/">andrew moore&#8217;s lecture slides</a> really great. (they do though require a reasonable understanding of the basics)</p>
<h2>4) &#8220;<a href="http://amzn.to/atpHZ2" target="_blank">foundations of statistical natural language processing</a>&#8221; by manning and schutze</h2>
<table>
<tbody style="vertical-align: top;">
<tr>
<td><img class="alignnone size-full wp-image-686" title="fosnlp" src="http://matpalm.com/blog/wp-content/uploads/2010/08/fosnlp.jpg" alt="" width="200" height="215" /></td>
<td>not a machine learning book as such but great for learning to deal with one of the most common types of data around: text. since most of machine learning theory is about maths (ie numbers) this is awesome in helping to understand how to deal with text in a mathematical context.</td>
</tr>
</tbody>
</table>
<h2>5) &#8220;<a href="http://amzn.to/99UJfV" target="_blank">introduction to machine learning</a>&#8221; by ethem alpaydin</h2>
<table>
<tbody style="vertical-align: top;">
<tr>
<td><img class="alignnone size-full wp-image-686" src="http://matpalm.com/blog/wp-content/uploads/2010/08/itml.jpg" alt="" width="200" height="215" /></td>
<td><p>covers generally the same sort of topics as the data mining books but with much more rigour and theory (derivations, proofs, etc). i think this is a good thing though since understanding how things work at a low level gives you the ability to tweak and modify as required.</p>
<p>loads more formulas but again with appendices that introduce the basics in enough detail to get by.</p></td>
</tr>
</tbody>
</table>
<h2>6) &#8220;<a href="http://amzn.to/ap9Kgf" target="_blank">all of statistics</a>&#8221; by larry wasserman</h2>
<table>
<tbody style="vertical-align: top;">
<tr>
<td><img class="alignnone size-full wp-image-686" src="http://matpalm.com/blog/wp-content/uploads/2010/08/aos.jpg" alt="" /></td>
<td><p>by this stage you&#8217;ll probably have an appreciation of how important statistics is for this domain and it might be worth focusing on it for a bit.</p>
<p>personally i found this book to be a great read and though i&#8217;ve only read certain sections in depth i&#8217;m looking forward to when i get a chance to work through it cover to cover.</p></td>
</tr>
</tbody>
</table>
<h2>7) &#8220;the elements of statistical learning&#8221; by hastie, tibshirani and friedman.</h2>
<table>
<tbody style="vertical-align: top;">
<tr>
<td><img class="alignnone size-full wp-image-686" src="http://matpalm.com/blog/wp-content/uploads/2010/08/eosl.jpg" alt="" /></td>
<td><p>with a bit more stats under your belt you might have a chance of getting through this one; the most complex of the lot.</p>
<p>this book is absolutely beautifully presented and now that it&#8217;s <a href="http://www-stat.stanford.edu/~tibs/ElemStatLearn/">FREE to download</a> you&#8217;ve got no reason not to have a crack at it.</p>
<p>a remarkable piece of work and one i&#8217;ve yet to get through fully cover to cover. it&#8217;s quite hardcore and right on the border of my level of understanding (which makes it perfect for me :P)</p></td>
</tr>
</tbody>
</table>
<h2>ps. books i haven&#8217;t read that are in the mail</h2>
<h2>&#8220;<a href="http://amzn.to/dkcGxb" target="_blank">machine learning</a>&#8221; by tom mitchell</h2>
<table>
<tbody style="vertical-align: top;">
<tr>
<td><img class="alignnone size-full wp-image-686" src="http://matpalm.com/blog/wp-content/uploads/2010/08/ml.jpg" alt="" /></td>
<td><p>have been wanting to read this one for a while (i&#8217;m a big fan of <a href="http://www.cs.cmu.edu/~tom/">tom mitchell</a>) but couldn&#8217;t justify the cost.</p>
<p>however i just found out the other day that the paperback is a third of the price of the hardback i was looking at!! the book&#8217;s in the mail.</p></td>
</tr>
</tbody>
</table>
<h2>&#8220;<a href="http://amzn.to/9IzWtN" target="_blank">pattern recognition and machine learning</a>&#8221; by chris bishop</h2>
<table>
<tbody style="vertical-align: top;">
<tr>
<td><img class="alignnone size-full wp-image-686" src="http://matpalm.com/blog/wp-content/uploads/2010/08/prml.jpg" alt="" /></td>
<td>all of a sudden it seemed like everyone was reading this but me so it was time to jump on the bandwagon</td>
</tr>
</tbody>
</table>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/08/06/my-list-of-cool-machine-learning-books/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>brutally short intro to weka</title>
		<link>http://matpalm.com/blog/2010/07/03/brutally-short-intro-to-weka/</link>
		<comments>http://matpalm.com/blog/2010/07/03/brutally-short-intro-to-weka/#comments</comments>
		<pubDate>Sat, 03 Jul 2010 07:35:27 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[brutally short intro]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[weka]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=677</guid>
		<description><![CDATA[weka is a java based machine learning workbench that i&#8217;ve found useful to play with to help understand some standard machine learning algorithms. in this quick demo i show how to build a classifier for three simple datasets, two of which address the basics of text classification

brutally short intro to weka from Mat Kelcey on [...]]]></description>
			<content:encoded><![CDATA[<p>weka is a java based machine learning workbench that i&#8217;ve found useful to play with to help understand some standard machine learning algorithms. in this quick demo i show how to build a classifier for three simple datasets, two of which address the basics of text classification</p>
<p><object width="400" height="300"><param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="movie" value="http://vimeo.com/moogaloop.swf?clip_id=13051595&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=&amp;fullscreen=1" /><embed src="http://vimeo.com/moogaloop.swf?clip_id=13051595&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=&amp;fullscreen=1" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="400" height="300"></embed></object>
<p><a href="http://vimeo.com/13051595">brutally short intro to weka</a> from <a href="http://vimeo.com/user2935988">Mat Kelcey</a> on <a href="http://vimeo.com">Vimeo</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/07/03/brutally-short-intro-to-weka/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>friend clustering by term usage</title>
		<link>http://matpalm.com/blog/2010/06/25/friend-clustering-by-term-usage/</link>
		<comments>http://matpalm.com/blog/2010/06/25/friend-clustering-by-term-usage/#comments</comments>
		<pubDate>Fri, 25 Jun 2010 13:39:08 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[infochimps]]></category>
		<category><![CDATA[network]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=662</guid>
		<description><![CDATA[recently signed up to the infochimps api and wanted to do something quick and dirty to get a feel for it.
so here&#8217;s a little experiment

get the people i follow on twitter
look up the words that &#8220;represent&#8221; them according to the infochimps word bag api
build a similarity matrix based on the common use of those terms
plot [...]]]></description>
			<content:encoded><![CDATA[<p>recently signed up to the <a href="http://api.infochimps.com/">infochimps api</a> and wanted to do something quick and dirty to get a feel for it.</p>
<p>so here&#8217;s a little experiment</p>
<ol>
<li>get the people i follow on twitter</li>
<li>look up the words that &#8220;represent&#8221; them according to the <a href="http://api.infochimps.com/describe/soc/net/tw/wordbag">infochimps word bag api</a></li>
<li>build a similarity matrix based on the common use of those terms</li>
<li>plot the connectivity for the top 30 or so pairings</li>
</ol>
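<p>a minimal sketch of steps 3 and 4, assuming jaccard overlap between word bags as the similarity measure (an assumption; the post doesn&#8217;t say which measure was used). the user names, the word_bags data and the top_pairs helper are all hypothetical stand-ins.</p>
<pre>
```ruby
require "set"

# jaccard similarity: size of the shared vocabulary over the combined vocabulary
def jaccard(a, b)
  union_size = a.union(b).size
  return 0.0 if union_size.zero?
  a.intersection(b).size / union_size.to_f
end

# score every pair of users and keep the most similar ones
def top_pairs(word_bags, limit)
  word_bags.keys.combination(2)
           .map { |u, v| [u, v, jaccard(word_bags[u], word_bags[v])] }
           .sort_by { |_u, _v, score| -score }
           .first(limit)
end

# hypothetical word bags; real ones would come from the word bag api
word_bags = {
  "alice" => Set["hadoop", "hbase", "pig"],
  "bob"   => Set["hbase", "pig", "mongodb"],
  "carol" => Set["travel", "books"]
}

top_pairs(word_bags, 2).each { |u, v, score| puts "#{u} - #{v}: #{score}" }
```
</pre>
<p>the scored pairs are the weighted edges you&#8217;d hand to whatever does the plotting.</p>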
<p><a href="http://matpalm.com/blog/wp-content/uploads/2010/06/top35.png"><img src="http://matpalm.com/blog/wp-content/uploads/2010/06/top35-300x178.png" alt="" title="top35" width="300" height="178" class="aligncenter size-medium wp-image-666" /></a></p>
<p>it&#8217;s basically partitioned into three groups&#8230;</p>
<ol>
<li>veztek (my boss john) and smcinnes (steve from the lonely planet community team) in the top right</li>
<li>a big clump of nosqlness with mongodb &#8211; hbase &#8211; jpatanooga &#8211; kevinweil in the bottom left</li>
<li>everyone else</li>
</ol>
<p>an interesting enough result given the time taken; the code&#8217;s <a href="http://github.com/matpalm/twitter/tree/master/friend_cluster/">on github</a></p>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/06/25/friend-clustering-by-term-usage/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>country codes in world cup tweets &#8211; viz1</title>
		<link>http://matpalm.com/blog/2010/06/21/country-codes-in-world-cup-tweets-viz1/</link>
		<comments>http://matpalm.com/blog/2010/06/21/country-codes-in-world-cup-tweets-viz1/#comments</comments>
		<pubDate>Mon, 21 Jun 2010 09:43:32 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[twitter]]></category>
		<category><![CDATA[visualisation]]></category>
		<category><![CDATA[worldcup]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=654</guid>
		<description><![CDATA[
#worldcup tweet viz1 from Mat Kelcey on Vimeo.
here&#8217;s a simple visualisation of the use of official country codes (eg #aus) in a week&#8217;s worth of tweets from the search stream for #worldcup.
rate is about 2 hours of tweets per sec. orb size denotes relative frequency of that country code. edges denote that those two countries feature [...]]]></description>
			<content:encoded><![CDATA[<p><object width="400" height="300"><param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="movie" value="http://vimeo.com/moogaloop.swf?clip_id=12710800&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=&amp;fullscreen=1" /><embed src="http://vimeo.com/moogaloop.swf?clip_id=12710800&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=&amp;fullscreen=1" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="400" height="300"></embed></object>
<p><a href="http://vimeo.com/12710800">#worldcup tweet viz1</a> from <a href="http://vimeo.com/user2935988">Mat Kelcey</a> on <a href="http://vimeo.com">Vimeo</a>.</p>
<p>here&#8217;s a simple visualisation of the use of official country codes (eg #aus) in a week&#8217;s worth of tweets from the search stream for #worldcup.</p>
<p>rate is about 2 hours of tweets per sec. orb size denotes relative frequency of that country code. edges denote that those two countries feature a lot in the same tweets. movement is based on gravity-like attraction along edges.</p>
<p>the quiet period at about 0:17 is a twitter outage :)</p>
<p><a href="http://matpalm.com/world_cup/viz1/">here&#8217;s the original processing applet version</a> with a bit more discussion</p>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/06/21/country-codes-in-world-cup-tweets-viz1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>moving average of a time series in R</title>
		<link>http://matpalm.com/blog/2010/06/15/moving-average-of-a-time-series-in-r/</link>
		<comments>http://matpalm.com/blog/2010/06/15/moving-average-of-a-time-series-in-r/#comments</comments>
		<pubDate>Tue, 15 Jun 2010 06:15:10 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[r]]></category>
		<category><![CDATA[simple stuff i keep forgetting]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=649</guid>
		<description><![CDATA[in this case, a sliding window of 3 elements

> x = c(3,1,4,1,5,9,2,6,5,3,5,8)
> ra_x = filter(x, rep(1,3)/3)
> ra_x
Time Series:
Start = 1
End = 12
Frequency = 1
 [1]       NA 2.666667 2.000000 3.333333 5.000000 5.333333 5.666667 4.333333
 [9] 4.666667 4.333333 5.333333       NA

]]></description>
			<content:encoded><![CDATA[<p>in this case, a sliding window of 3 elements</p>
<pre>
> x = c(3,1,4,1,5,9,2,6,5,3,5,8)
> ra_x = filter(x, rep(1,3)/3)
> ra_x
Time Series:
Start = 1
End = 12
Frequency = 1
 [1]       NA 2.666667 2.000000 3.333333 5.000000 5.333333 5.666667 4.333333
 [9] 4.666667 4.333333 5.333333       NA
</pre>
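<p>for comparison, a rough ruby sketch of the same centred moving average. nil stands in for R&#8217;s NA at the two ends where the window doesn&#8217;t fit. the moving_average name is made up and the sketch only handles odd window sizes.</p>
<pre>
```ruby
# centred moving average; positions without a full window are left as nil
def moving_average(xs, window = 3)
  raise ArgumentError, "window must be odd" if window.even?
  pad = Array.new(window / 2) # nils for the edges
  averages = xs.each_cons(window).map { |slice| slice.sum / window.to_f }
  pad + averages + pad
end

x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
ra_x = moving_average(x)
puts ra_x.map { |v| v.nil? ? "NA" : format("%.6f", v) }.join(" ")
```
</pre>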
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/06/15/moving-average-of-a-time-series-in-r/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>#worldcup twitter analytics</title>
		<link>http://matpalm.com/blog/2010/06/14/worldcup-twitter-analytics/</link>
		<comments>http://matpalm.com/blog/2010/06/14/worldcup-twitter-analytics/#comments</comments>
		<pubDate>Mon, 14 Jun 2010 12:06:49 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[twitter]]></category>
		<category><![CDATA[worldcup]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=644</guid>
		<description><![CDATA[since the world cup started i&#8217;ve spent more time looking at twitter data about the games than the actual games themselves. what a sad data nerd i am!
anyways, here&#8217;s the first few days analysis based on the use of official country tags (eg #aus) in the search stream for #worldcup.
tomorrow i might look in more detail [...]]]></description>
			<content:encoded><![CDATA[<p>since the world cup started i&#8217;ve spent more time looking at twitter data about the games than the actual games themselves. what a sad data nerd i am!</p>
<p>anyways, here&#8217;s the <a href="http://bit.ly/dkR46o">first few days analysis</a> based on the use of official country tags (eg <a href="http://twitter.com/#search?q=%23aus">#aus</a>) in the search stream for <a href="http://twitter.com/#search?q=%23worldcup">#worldcup</a>.</p>
<p>tomorrow i might look in more detail at one of the games, wondering how many variants of &#8216;goooooooal&#8217; i&#8217;ll find :D</p>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/06/14/worldcup-twitter-analytics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>a quick study in tf/icf</title>
		<link>http://matpalm.com/blog/2010/06/09/a-quick-study-in-tficf/</link>
		<comments>http://matpalm.com/blog/2010/06/09/a-quick-study-in-tficf/#comments</comments>
		<pubDate>Wed, 09 Jun 2010 11:58:08 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=638</guid>
		<description><![CDATA[while doing some more research on trending algorithms i came across a cool little paper about term frequency normalisation for streaming data: TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams.
i&#8217;m finding streaming related algorithms quite interesting lately and think they are the way forward in terms of dealing with large amounts of constant [...]]]></description>
			<content:encoded><![CDATA[<p>while doing some more research on trending algorithms i came across a cool little paper about term frequency normalisation for streaming data: <a href="http://aser.ornl.gov/publications/ICMLA06.pdf">TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams</a>.</p>
<p>i&#8217;m finding streaming related algorithms quite interesting lately and think they are the way forward in terms of dealing with large amounts of constant data. it&#8217;s just not feasible to use algorithms that expect you to have all the data at any given time; that forces you to reprocess all the data you&#8217;ve ever seen as you get new examples. my thinking is the best solutions are the ones that build models of the data and fold in new examples in batches. anyways, i&#8217;m getting off topic already.</p>
<p>tf/icf as presented in the paper is a variant on the classic <a href="http://en.wikipedia.org/wiki/Tf–idf">tf/idf</a> for term weighting but instead of requiring the weights in all docs to be recalculated as a new document comes along (as tf/idf strictly does) it just approximates based on what has been seen before.</p>
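<p>a rough sketch of that idea: weight each new doc using only the counts seen so far and never revisit earlier docs. the formula here (log scaled term frequency times a smoothed inverse frequency) is a stand-in chosen for illustration and not necessarily the paper&#8217;s exact scheme; StreamingWeighter is a made up name.</p>
<pre>
```ruby
class StreamingWeighter
  def initialize
    @docs_seen = 0
    @doc_freq = Hash.new(0) # term -> number of docs seen so far containing it
  end

  # weight the terms of one new doc, then fold the doc into the running counts
  def weigh(tokens)
    tf = Hash.new(0)
    tokens.each { |term| tf[term] += 1 }
    weights = {}
    tf.each do |term, count|
      inverse_freq = Math.log((@docs_seen + 1.0) / (@doc_freq[term] + 1.0))
      weights[term] = Math.log(1 + count) * inverse_freq
    end
    @docs_seen += 1
    tf.each_key { |term| @doc_freq[term] += 1 }
    weights # weights already handed out for earlier docs are never recalculated
  end
end

weighter = StreamingWeighter.new
weighter.weigh(%w[the cat sat])
second = weighter.weigh(%w[the dog sat down])
second.each { |term, weight| puts "#{term}: #{weight.round(3)}" }
```
</pre>
<p>terms that appeared in every earlier doc get a weight of zero, while previously unseen terms get the highest weight.</p>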
<p>so how does it do? actually quite well, <a href="http://www.matpalm.com/tf_icf">here&#8217;s my experimental breakdown</a></p>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/06/09/a-quick-study-in-tficf/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>5 minute ggobi demo</title>
		<link>http://matpalm.com/blog/2010/06/04/5-minute-ggobi-demo/</link>
		<comments>http://matpalm.com/blog/2010/06/04/5-minute-ggobi-demo/#comments</comments>
		<pubDate>Fri, 04 Jun 2010 13:12:53 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[5min]]></category>
		<category><![CDATA[brutally short intro]]></category>
		<category><![CDATA[ggobi]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=634</guid>
		<description><![CDATA[
brutally short demo of ggobi from Mat Kelcey on Vimeo.
note: non embedded version has higher res at full screen
]]></description>
			<content:encoded><![CDATA[<p><object width="400" height="300"><param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="movie" value="http://vimeo.com/moogaloop.swf?clip_id=12292239&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=&amp;fullscreen=1" /><embed src="http://vimeo.com/moogaloop.swf?clip_id=12292239&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=&amp;fullscreen=1" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="400" height="300"></embed></object>
<p><a href="http://vimeo.com/12292239">brutally short demo of ggobi</a> from <a href="http://vimeo.com/user2935988">Mat Kelcey</a> on <a href="http://vimeo.com">Vimeo</a>.</p>
<p>note: <a href="http://bit.ly/ctZmdA">non embedded version</a> has higher res at full screen</p>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/06/04/5-minute-ggobi-demo/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>how many terms in a trend?</title>
		<link>http://matpalm.com/blog/2010/05/11/how-many-terms-in-a-trend/</link>
		<comments>http://matpalm.com/blog/2010/05/11/how-many-terms-in-a-trend/#comments</comments>
		<pubDate>Tue, 11 May 2010 09:46:11 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[algorithms]]></category>
		<category><![CDATA[e15]]></category>
		<category><![CDATA[puzzled]]></category>
		<category><![CDATA[trending]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=627</guid>
		<description><![CDATA[i&#8217;ve been poking around with a simple trending algorithm over the last few weeks and have uncovered a problem that, like most interesting ones, i&#8217;m not sure how to solve. the question revolves around discovering multi-term trends. 
a sensible place to start when looking for trends is to consider single terms but what if [...]]]></description>
			<content:encoded><![CDATA[<p>i&#8217;ve been poking around with a simple trending algorithm over the last few weeks and have uncovered a problem that, like most interesting ones, i&#8217;m not sure how to solve. the question revolves around discovering multi-term trends.</p>
<p>a sensible place to start when looking for trends is to consider single terms but what if we ended up with three equally trending terms &#8216;happy&#8217;, &#8216;new&#8217; and &#8216;year&#8217;? it&#8217;s pretty obvious that the actual trend is &#8216;happy new year&#8217; but what is the best way to express this as a single trend in an algorithmic sense?</p>
<p>one approach i&#8217;ve been playing with is to collect unigrams, bigrams and trigrams (1, 2 and 3 term &#8216;phrases&#8217;) and consider the cases where the terms overlap. basically if &#8216;happy new year&#8217; is trending then, in some sense, we can ignore trends for &#8216;happy new&#8217;, &#8216;new year&#8217;, &#8216;happy&#8217;, &#8216;new&#8217; and &#8216;year&#8217;. but does this result in too many false negatives? would we miss &#8216;happy&#8217; as a trend if lots of people were chirpy about the change of year (as they usually are on new year&#8217;s eve)?</p>
<p>rather than outright ignore we could somehow reduce the weighting by removing the double counting.</p>
<p>eg if we had 3 trends;  (free beer,11), (free,12) &#038; (beer,25)<br />
we can take 11 (from the 2gram) off both 1grams to give  (free beer,11), (free,1) &#038; (beer,14)<br />
showing that &#8216;beer&#8217;, outside of the phrase &#8216;free beer&#8217;, is perhaps a trend in itself (as it should be)</p>
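<p>a minimal sketch of that subtraction, handling only bigrams over unigrams (folding in trigrams would need more care to avoid discounting twice). the remove_double_counting name is made up.</p>
<pre>
```ruby
# subtract each bigram's count from the unigrams it contains
def remove_double_counting(counts)
  adjusted = counts.dup
  counts.each do |phrase, count|
    terms = phrase.split
    next unless terms.size == 2
    terms.uniq.each do |term|
      adjusted[term] -= count if adjusted.key?(term)
    end
  end
  adjusted
end

trends = { "free beer" => 11, "free" => 12, "beer" => 25 }
p remove_double_counting(trends)
```
</pre>
<p>for the counts above this leaves &#8216;free beer&#8217; at 11, &#8216;free&#8217; at 1 and &#8216;beer&#8217; at 14.</p>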
<p>this feels like it might work but would be non trivial (read: fun) to implement</p>
<p>another slightly different problem is around the handling of retweeting. my experiments have shown a huge amount of the &#8216;trends&#8217; found are related to retweets, which is fine in itself, but it gives quite strange trends since the retweeted portion of the text is usually quite long.</p>
<p>for example; say lots of people are retweeting something and, as some people do, are adding various bits and pieces at the beginning and end; eg &#8216;RT @bob omg i just found a peanut&#8217; or &#8216;omg i just found a peanut; via @bob lucky him!!&#8217;</p>
<p>if we&#8217;re considering bigrams (which i am in my current implementation) we end up with an odd selection of trends such as &#8216;just found&#8217;, &#8216;a peanut&#8217;, &#8216;omg i&#8217;, &#8216;found a&#8217;, &#8216;i just&#8217; and in these cases it&#8217;d be great to be able to just stitch them together into the common retweeted element &#8216;omg i just found a peanut&#8217;. </p>
<p>we could &#8216;solve&#8217; this problem by not just considering 1, 2 and 3-grams but considering <em>all</em> possible n-grams for each tweet and employing the technique we spoke of above of reducing the counts. it&#8217;d almost be feasible, since tweets are never that long, but feels uber clumsy and i&#8217;d hate to see the big-O of that algorithm ;)</p>
<p>this seems more like a stitching problem of some kind; eg if we have the 4-grams &#8216;omg i just found&#8217;, &#8216;i just found a&#8217; and &#8216;just found a peanut&#8217; perhaps we can identify the non trivial overlap and stitch them together?</p>
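<p>one naive way the stitching could go, as a sketch only: treat the ngrams as equal length pieces and greedily extend a chain wherever a piece overlaps it by all but one term. the stitch name is made up, and the sketch assumes every ngram has the same length and belongs to a single retweeted phrase.</p>
<pre>
```ruby
# greedily join same-length ngrams that overlap by all but one term
def stitch(ngrams)
  pieces = ngrams.map { |gram| gram.split }
  chain = pieces.shift
  overlap = chain.size - 1
  loop do
    right = pieces.find { |piece| piece.first(overlap) == chain.last(overlap) }
    left = pieces.find { |piece| piece.last(overlap) == chain.first(overlap) }
    if right
      chain += [right.last]         # extend the chain by one term on the right
      pieces.delete(right)
    elsif left
      chain = [left.first] + chain  # or by one term on the left
      pieces.delete(left)
    else
      break                         # nothing left that overlaps
    end
  end
  chain.join(" ")
end

puts stitch(["omg i just found", "i just found a", "just found a peanut"])
```
</pre>
<p>this ignores the counts entirely; a real version would presumably only stitch ngrams that are trending at similar rates.</p>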
<p>not sure, there are a number of things to try. was hoping that brain dumping some of this would help me see the light but nothing obvious jumps out :(</p>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/05/11/how-many-terms-in-a-trend/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>trending topics in tweets about cheese; part2</title>
		<link>http://matpalm.com/blog/2010/05/01/trending-topics-in-tweets-about-cheese-part2/</link>
		<comments>http://matpalm.com/blog/2010/05/01/trending-topics-in-tweets-about-cheese-part2/#comments</comments>
		<pubDate>Sat, 01 May 2010 06:54:53 +0000</pubDate>
		<dc:creator>matpalm</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[e15]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[trending]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://matpalm.com/blog/?p=559</guid>
		<description><![CDATA[prototyping in ruby was a great way to prove the concept but my main motivation for this project was to play some more with pig.
the main approach will be

maintain a relation with one record per ngram we want to monitor for trending
fold 1 hour&#8217;s worth of new data at a time into the model
check the [...]]]></description>
			<content:encoded><![CDATA[<p>prototyping in ruby was a great way to prove the concept but my main motivation for this project was to play some more with pig.</p>
<p>the main approach will be</p>
<ol>
<li>maintain a relation with one record per ngram we want to monitor for trending</li>
<li>fold 1 hour&#8217;s worth of new data at a time into the model</li>
<li>check the entries for the latest hour for any trends</li>
</ol>
<p>the <a href="http://github.com/matpalm/trending/blob/master/pig/trending.pig">full version is on github</a>. read on for a line by line walkthrough</p>
<p><span id="more-559"></span></p>
<p>the ruby impl used the simplest approach possible for calculating mean and stddev; maintain a record of all the values seen so far and recalculate for each new value.</p>
<p>for our pig version we&#8217;ll take a fixed space approach. rather than keep <em>all</em> the values for each time series it turns out we can get away with storing just 3&#8230;</p>
<ol>
<li>n: the number of values</li>
<li>m: the current mean of all values</li>
<li>ms: the current mean of the squares of all values</li>
</ol>
<p>the idea is that the mean<sub>n+1</sub> = ( n * mean<sub>n</sub> + new value ) / ( n + 1 )<br />
a similar update derives the standard deviation<sub>n+1</sub> from n, the mean<sub>n</sub>, the mean of the squares<sub>n</sub> and the new value</p>
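<p>a small ruby sketch of the fixed space update, assuming the population standard deviation (square root of the mean of the squares minus the square of the mean). RunningStats and its method names are made up for illustration and the sample values are invented; the pig script itself may arrange the arithmetic differently.</p>
<pre>
```ruby
RunningStats = Struct.new(:n, :m, :ms) do
  # fold one new value in, returning the updated (n, m, ms) triple
  def fold(value)
    total = n + 1.0
    self.class.new(n + 1, (n * m + value) / total, (n * ms + value * value) / total)
  end

  # population standard deviation from the two running means
  def stddev
    Math.sqrt([ms - m * m, 0.0].max)
  end
end

stats = [1, 2, 1, 1, 2, 1].reduce(RunningStats.new(0, 0.0, 0.0)) { |acc, v| acc.fold(v) }
puts "n=#{stats.n} m=#{stats.m} ms=#{stats.ms} stddev=#{stats.stddev}"
```
</pre>
<p>only three numbers are carried per time series no matter how many values have been folded in.</p>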
<p>let&#8217;s go over the pig script one command at a time.</p>
<p>we&#8217;ll assume we&#8217;ve already run it 6 times and we&#8217;re now folding in the 7th hour</p>
<p>the first thing is to load the existing version of the model, in this case stored in the file &#8216;model.006&#8217;<br />
it contains everything we need for checking the trending for each ngram</p>
<pre><span>
&gt; raw_model = load 'model.006' as (key:chararray, n:int, m:double, ms:double);

&gt; describe raw_model;
raw_model: {key: chararray, n: int, m: double, ms: double}

&gt; dump raw_model;
(a b,6,1.3333333333333333,2.0)
(a a,3,1.3333333333333333,2.0)
(a c,4,1.25,1.75)
(a d,1,2.0,4.0)
(b a,3,1.0,1.0)
(b d,1,2.0,4.0)
(b c,6,1.5,2.5)
(d c,1,2.0,4.0)
(c a,4,1.0,1.0)
(d e,1,1.0,1.0)
(c d,4,2.0,4.0)
(d a,2,1.0,1.0)
</span></pre>
<p>next we tag each entry from the loaded model with a zero frequency. we&#8217;ll see later how this makes it easier to fold in the new data.</p>
<pre><span>
&gt; model = foreach raw_model generate key, n, m, ms, 0 as f;

&gt; describe model;
model: {key: chararray, n: int, m: double, ms: double, f: int}

&gt; dump model;
(a b,6,1.3333333333333333,2.0,0)
(a a,3,1.3333333333333333,2.0,0)
(a c,4,1.25,1.75,0)
(a d,1,2.0,4.0,0)
(b a,3,1.0,1.0,0)
(b d,1,2.0,4.0,0)
(b c,6,1.5,2.5,0)
(d c,1,2.0,4.0,0)
(c a,4,1.0,1.0,0)
(d e,1,1.0,1.0,0)
(c d,4,2.0,4.0,0)
(d a,2,1.0,1.0,0)
</span></pre>
<p>now that we&#8217;ve loaded the existing version of the model we can load the next hour of data, in this case contained in &#8216;chunks/006&#8217;.</p>
<pre><span>
&gt; next_chunk = load 'chunks/006';

&gt; dump next_chunk;
(a b a b)
(c d a b)
(a b c)
(a d d d)
</span></pre>
<p>from the text we want to get the frequency of the ngrams.<br />
the breaking apart of each line into its 2-grams is handled by a simple ruby script; <a href="http://github.com/matpalm/trending/blob/master/pig/ngram.rb">ngram.rb</a></p>
<pre><span>
&gt; define ngramer `ngram.rb` ship('ngram.rb');
&gt; ngrams = stream next_chunk through ngramer as (key:chararray);

&gt; describe ngrams;
ngrams: {key: chararray}

&gt; dump ngrams;
(a b)
(b a)
(a b)
(c d)
(d a)
(a b)
(a b)
(b c)
(a d)
(d d)
(d d)
</span></pre>
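<p>the contents of ngram.rb aren&#8217;t shown here, so the following is only a guess at the idea written in python: split each line on whitespace and emit every adjacent pair of tokens. the function name <code>bigrams</code> is hypothetical.</p>

```python
def bigrams(line):
    """split a line on whitespace and return its adjacent token pairs."""
    tokens = line.split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

# the same four lines as the chunk dumped above
chunk = ["a b a b", "c d a b", "a b c", "a d d d"]
ngrams = [ng for line in chunk for ng in bigrams(line)]
```

<p>for these four lines the sketch produces the same eleven 2-grams, in the same order, as the dump of <code>ngrams</code>.</p>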
<p>calculating the frequencies of the ngrams is a simple two step process of first grouping by the key&#8230;</p>
<pre><span>
&gt; ngrams_grouped = group ngrams by key;

&gt; describe ngrams_grouped;
ngrams_grouped: {group: chararray, ngrams: {key: chararray}}

&gt; dump ngrams_grouped;
(a b,{(a b),(a b),(a b),(a b)})
(a d,{(a d)})
(b a,{(b a)})
(b c,{(b c)})
(c d,{(c d)})
(d a,{(d a)})
(d d,{(d d),(d d)})
</span></pre>
<p>&#8230;and then generating the key, frequency pairs</p>
<pre><span>
&gt; ngram_freq = foreach ngrams_grouped generate group as key, SIZE(ngrams) as f;

&gt; describe ngram_freq;
ngram_freq: {key: chararray, f: long}

&gt; dump ngram_freq;
(a b,4L)
(a d,1L)
(b a,1L)
(b c,1L)
(c d,1L)
(d a,1L)
(d d,2L)
</span></pre>
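<p>outside of pig the group-then-SIZE pair collapses into a single counting step. a small python sketch using the standard library <code>Counter</code>, with the 2-grams from this chunk typed in by hand:</p>

```python
from collections import Counter

# the 2-grams emitted for this chunk
ngrams = ["a b", "b a", "a b", "c d", "d a", "a b",
          "a b", "b c", "a d", "d d", "d d"]

# equivalent of: group ngrams by key, then SIZE of each bag
ngram_freq = Counter(ngrams)

for key in sorted(ngram_freq):
    print(key, ngram_freq[key])
```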
<p>from this we know all the distinct 2grams that are contained in the next chunk we&#8217;re analysing<br />
for each of these 2grams one of two things is true;</p>
<ol>
<li>either the ngram has been seen before (thus it has an entry in the model)</li>
<li>this is the first time we&#8217;ve seen it, in which case we need to add a new entry to the model</li>
</ol>
<p>the easiest way i&#8217;ve worked out in pig to handle this is to generate a &#8216;seed&#8217; model just for this chunk and fold it into the real model by unioning the relations</p>
<p>(i&#8217;ve been using pig 0.3 to keep in line with the current version of elastic map reduce but it might be easier with the various extra joins that are in later versions of pig)</p>
<p>so first we generate the &#8216;seed&#8217; relation&#8230;</p>
<pre><span>
&gt; seed_values = foreach ngram_freq generate key, 0 as n, 0.0 as m, 0.0 as ms, f;

&gt; describe seed_values;
seed_values: {key: chararray, n: int, m: double, ms: double, f: long}

&gt; dump seed_values;
(a b,0,0.0,0.0,4L)
(a d,0,0.0,0.0,1L)
(b a,0,0.0,0.0,1L)
(b c,0,0.0,0.0,1L)
(c d,0,0.0,0.0,1L)
(d a,0,0.0,0.0,1L)
(d d,0,0.0,0.0,2L)
</span></pre>
<p>&#8230;and fold it in with a 3 step process; unioning with the original model, grouping and collapsing</p>
<p>first the union&#8230;</p>
<pre><span>
&gt; model_plus_seed = union model, seed_values;

&gt; describe model_plus_seed;
model_plus_seed: {key: chararray, n: int, m: double, ms: double, f: long}

&gt; dump model_plus_seed;
(a b,0,0.0,0.0,4L)
(a b,6,1.3333333333333333,2.0,0L)
(a d,0,0.0,0.0,1L)
(a a,3,1.3333333333333333,2.0,0L)
(b a,0,0.0,0.0,1L)
(a c,4,1.25,1.75,0L)
(b c,0,0.0,0.0,1L)
(a d,1,2.0,4.0,0L)
(c d,0,0.0,0.0,1L)
(b a,3,1.0,1.0,0L)
(d a,0,0.0,0.0,1L)
(b d,1,2.0,4.0,0L)
(d d,0,0.0,0.0,2L)
(b c,6,1.5,2.5,0L)
(d c,1,2.0,4.0,0L)
(c a,4,1.0,1.0,0L)
(d e,1,1.0,1.0,0L)
(c d,4,2.0,4.0,0L)
(d a,2,1.0,1.0,0L)
</span></pre>
<p>then the grouping&#8230;</p>
<pre><span>
&gt; model_plus_seed2 = group model_plus_seed by key;

&gt; describe model_plus_seed2;
model_plus_seed2: {group: chararray, model_plus_seed: {key: chararray, n: int, m: double, ms: double, f: long}}

&gt; dump model_plus_seed2;
(a a,{(a a,3,1.3333333333333333,2.0,0L)})
(a b,{(a b,0,0.0,0.0,4L),(a b,6,1.3333333333333333,2.0,0L)})
(a c,{(a c,4,1.25,1.75,0L)})
(a d,{(a d,0,0.0,0.0,1L),(a d,1,2.0,4.0,0L)})
(b a,{(b a,0,0.0,0.0,1L),(b a,3,1.0,1.0,0L)})
(b c,{(b c,0,0.0,0.0,1L),(b c,6,1.5,2.5,0L)})
(b d,{(b d,1,2.0,4.0,0L)})
(c a,{(c a,4,1.0,1.0,0L)})
(c d,{(c d,0,0.0,0.0,1L),(c d,4,2.0,4.0,0L)})
(d a,{(d a,0,0.0,0.0,1L),(d a,2,1.0,1.0,0L)})
(d c,{(d c,1,2.0,4.0,0L)})
(d d,{(d d,0,0.0,0.0,2L)})
(d e,{(d e,1,1.0,1.0,0L)})
</span></pre>
<p>and finally the collapsing using MAX&#8230;</p>
<pre><span>
&gt; model_n =
     foreach model_plus_seed2 generate
        group as key,
        MAX(model_plus_seed.n) as n,
        MAX(model_plus_seed.m) as m,
        MAX(model_plus_seed.ms) as ms,
        MAX(model_plus_seed.f) as f;

&gt; describe model_n;
model_n: {key: chararray, n: int, m: double, ms: double, f: long}

&gt; dump model_n;
(a a,3,1.3333333333333333,2.0,0L)
(a b,6,1.3333333333333333,2.0,4L)
(a c,4,1.25,1.75,0L)
(a d,1,2.0,4.0,1L)
(b a,3,1.0,1.0,1L)
(b c,6,1.5,2.5,1L)
(b d,1,2.0,4.0,0L)
(c a,4,1.0,1.0,0L)
(c d,4,2.0,4.0,1L)
(d a,2,1.0,1.0,1L)
(d c,1,2.0,4.0,0L)
(d d,0,0.0,0.0,2L)
(d e,1,1.0,1.0,0L)
</span></pre>
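<p>the reason MAX works as the collapse is that every value is non negative and, for any key, each column is zero in one of the two rows: the seed row has zeros for n, m and ms while the model row has zero for f. a python sketch of the union / group / max steps on a few rows (the variable names mirror the pig relations but the code is only an illustration):</p>

```python
from collections import defaultdict

# rows are (key, n, m, ms, f); a subset of the relations dumped above
model = [("a a", 3, 1.3333333333333333, 2.0, 0),
         ("a b", 6, 1.3333333333333333, 2.0, 0)]
seed_values = [("a b", 0, 0.0, 0.0, 4),
               ("d d", 0, 0.0, 0.0, 2)]

# union the two relations, then group the rows by key
grouped = defaultdict(list)
for row in model + seed_values:
    grouped[row[0]].append(row[1:])

# collapse each group by taking the column-wise maximum
model_n = {key: tuple(max(col) for col in zip(*rows))
           for key, rows in grouped.items()}
```

<p>a key present only in the model keeps f = 0, a key present only in the seed keeps n = 0, and a key present in both picks up n, m, ms from the model and f from the seed.</p>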
<p>at this stage we have the original model weaved in with the new data but still need to update the mean and mean of squares for the values from the latest hour.</p>
<p>we can do this by first separating out the values we need to update based on whether the frequency is non zero<br />
(recall non zero frequencies represent ngrams from the latest hour)</p>
<pre><span>
&gt; split model_n into to_update if f&gt;0, not_to_update if f==0;

&gt; dump to_update;
(a b,6,1.3333333333333333,2.0,4L)
(a d,1,2.0,4.0,1L)
(b a,3,1.0,1.0,1L)
(b c,6,1.5,2.5,1L)
(c d,4,2.0,4.0,1L)
(d a,2,1.0,1.0,1L)
(d d,0,0.0,0.0,2L)

&gt; dump not_to_update;
(a a,3,1.3333333333333333,2.0,0L)
(a c,4,1.25,1.75,0L)
(b d,1,2.0,4.0,0L)
(c a,4,1.0,1.0,0L)
(d c,1,2.0,4.0,0L)
(d e,1,1.0,1.0,0L)
</span></pre>
<p>we can now update the mean and the mean of squares (from which the std deviation is derived) based on the new frequency values</p>
<pre><span>
&gt; model_n1 =
     foreach to_update {
         m2  = ((n*m)+f)/(n+1);
         ms2 = ((n*ms)+(f*f))/(n+1);
         generate key, n+1 as n, m2 as m, ms2 as ms, f;
     }

&gt; describe model_n1;
model_n1: {key: chararray, n: int, m: double, ms: double, f: long}

&gt; dump model_n1;
(a b,7,1.7142857142857142,4.0,4L)
(a d,2,1.5,2.5,1L)
(b a,4,1.0,1.0,1L)
(b c,7,1.4285714285714286,2.2857142857142856,1L)
(c d,5,1.8,3.4,1L)
(d a,3,1.0,1.0,1L)
(d d,1,2.0,4.0,2L)
</span></pre>
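<p>to check the arithmetic by hand, here is the split and the update replayed in python on four of the rows. this is only a sketch; the helper name <code>update</code> is made up for the example.</p>

```python
# rows are (key, n, m, ms, f), copied from the collapsed model above
model_n = [
    ("a a", 3, 1.3333333333333333, 2.0, 0),
    ("a b", 6, 1.3333333333333333, 2.0, 4),
    ("a d", 1, 2.0, 4.0, 1),
    ("d d", 0, 0.0, 0.0, 2),
]

# split on whether the ngram was seen in the latest hour
to_update = [row for row in model_n if row[4] > 0]
not_to_update = [row for row in model_n if row[4] == 0]

def update(row):
    """fold the latest frequency f into the running mean and mean of squares."""
    key, n, m, ms, f = row
    m2 = (n * m + f) / (n + 1)
    ms2 = (n * ms + f * f) / (n + 1)
    return (key, n + 1, m2, ms2, f)

updated_rows = [update(row) for row in to_update]
to_store = updated_rows + not_to_update
```

<p>the seed row for &#8216;d d&#8217; starts at n = 0, so its first update simply sets the mean to f and the mean of squares to f*f.</p>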
<p>these new rows, along with the rows we didn&#8217;t update, can be stored as the model at time n+1 ready for the next hour&#8217;s chunk</p>
<pre><span>
&gt; to_store = union model_n1, not_to_update;
&gt; store to_store into 'model.007';

&gt; dump to_store;
(a b,7,1.7142857142857142,4.0,4L)
(a a,3,1.3333333333333333,2.0,0L)
(a d,2,1.5,2.5,1L)
(a c,4,1.25,1.75,0L)
(b a,4,1.0,1.0,1L)
(b d,1,2.0,4.0,0L)
(b c,7,1.4285714285714286,2.2857142857142856,1L)
(c a,4,1.0,1.0,0L)
(c d,5,1.8,3.4,1L)
(d c,1,2.0,4.0,0L)
(d a,3,1.0,1.0,1L)
(d e,1,1.0,1.0,0L)
(d d,1,2.0,4.0,2L)
</span></pre>
<p>now that we&#8217;ve updated the model we can start making the trending check!</p>
<p>first step is to filter out entries that correspond to ngrams we are seeing for the first time<br />
( a new item can&#8217;t be trending )</p>
<pre><span>
&gt; requiring_trending_check = filter model_n1 by n&gt;1;

&gt; dump requiring_trending_check;
(a b,7,1.7142857142857142,4.0,4L)
(a d,2,1.5,2.5,1L)
(b a,4,1.0,1.0,1L)
(b c,7,1.4285714285714286,2.2857142857142856,1L)
(c d,5,1.8,3.4,1L)
(d a,3,1.0,1.0,1L)
</span></pre>
<p>and finally we can make the trending calculation!<br />
we can calculate the minimum trending value, based on mean + twice std dev&#8230;</p>
<pre><span>
&gt; calc_min_trending =
     foreach requiring_trending_check {
        sd_lhs = n * ms;
        sd_rhs = n * (m*m);
        sd = org.apache.pig.piggybank.evaluation.math.SQRT((sd_lhs-sd_rhs)/n);
        min_trend_value = m + (2*sd);
        generate key, f, m as mean, sd as std_dev,
                 min_trend_value as min_trend_value,
                 f / min_trend_value as percent_over_trend;
    }

&gt; describe calc_min_trending;
calc_min_trending: {key: chararray, f: long, mean: double, std_dev: double, min_trend_value: double, percent_over_trend: double}

&gt; dump calc_min_trending;
(a b,4L,1.7142857142857142,1.0301575072754257,3.7746007288365657,1.059714732061981)
(a d,1L,1.5,0.5,2.5,0.4)
(b a,1L,1.0,0.0,1.0,1.0)
(b c,1L,1.4285714285714286,0.4948716593053934,2.4183147471822153,0.4135111036167584)
(c d,1L,1.8,0.4,2.6,0.3846153846153848)
(d a,1L,1.0,0.0,1.0,1.0)
</span></pre>
<p>&#8230; and any entries with a frequency over the min trending value are deemed trending!<br />
( for this example it&#8217;s only the one )</p>
<pre><span>
&gt; trending = filter calc_min_trending by percent_over_trend &gt; 1;

&gt; describe trending;
trending: {key: chararray, f: long, mean: double, std_dev: double, min_trend_value: double, percent_over_trend: double}

&gt; dump trending;
(a b,4L,1.7142857142857142,1.0301575072754257,3.7746007288365657,1.059714732061981)
</span></pre>
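<p>the pig expression (n*ms - n*m*m)/n simplifies to ms - m*m, so the whole check can be sketched in a few lines of python. the function name <code>min_trend_value</code> and the clamp to zero are assumptions added for this example, not something taken from the pig script.</p>

```python
import math

def min_trend_value(m, ms):
    """mean plus twice the standard deviation derived from m and ms."""
    # clamp at zero in case rounding makes ms - m*m slightly negative
    sd = math.sqrt(max(ms - m * m, 0.0))
    return m + 2 * sd

# (key, n, m, ms, f) for three of the rows needing a trending check
rows = [("a b", 7, 12 / 7, 4.0, 4),
        ("a d", 2, 1.5, 2.5, 1),
        ("c d", 5, 1.8, 3.4, 1)]

trending = []
for key, n, m, ms, f in rows:
    ratio = f / min_trend_value(m, ms)  # same role as percent_over_trend
    if ratio > 1:
        trending.append((key, ratio))
```

<p>only &#8216;a b&#8217; clears the threshold here, matching the single row in the dump of <code>trending</code>.</p>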
<p>as a normalisation step i&#8217;ve been playing with also factoring in the frequency itself,<br />
haven&#8217;t come to a conclusion on whether this is a better metric or not&#8230;</p>
<pre><span>
&gt; trending2 =
     foreach trending {
        normalised_trend_value = org.apache.pig.piggybank.evaluation.math.LOG10(f) * percent_over_trend;
        generate key, min_trend_value, percent_over_trend, normalised_trend_value as normalised_trend_value;
     }

&gt; describe trending2;
trending2: {key: chararray, min_trend_value: double, percent_over_trend: double, normalised_trend_value: double}

&gt; dump trending2;
(a b,3.7746007288365657,1.059714732061981,0.6380118423953504)
</span></pre>
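<p>the normalised value is just log10 of the frequency multiplied by the ratio over the trend threshold. a tiny python sketch of the same calculation (the function name is made up for the example):</p>

```python
import math

def normalised_trend_value(f, percent_over_trend):
    """weight the ratio over the threshold by log10 of the raw frequency."""
    return math.log10(f) * percent_over_trend

# the single trending row from above: f = 4, ratio = 1.059714732061981
value = normalised_trend_value(4, 1.059714732061981)
```

<p>one consequence worth keeping in mind when judging the metric: log10(1) is 0, so an ngram seen only once in the hour scores 0 regardless of its ratio.</p>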
<p>and finally store the top trending values for processing!</p>
<pre><span>
&gt; trending_sorted = order trending2 by normalised_trend_value desc;
&gt; top_50 = limit trending_sorted 50;
&gt; store top_50 into 'trending.model.006';
</span></pre>
]]></content:encoded>
			<wfw:commentRss>http://matpalm.com/blog/2010/05/01/trending-topics-in-tweets-about-cheese-part2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
