i've been reviving some old theano code recently and, in case you haven't seen it, theano is a pretty awesome python library that reads a lot like numpy but provides two particularly interesting features.

- symbolic differentiation; not something i'll talk about here, but super useful if you're tinkering with new models and you're using a gradient descent method for learning (and these days, who's not..)
- the ability to run transparently on a gpu; well, almost transparently, this'll be the main focus of this post...

let's work through a very simple model that's kinda like a system of linear equations. we'll compare 1) numpy (our timing baseline) vs 2) theano on a cpu vs 3) theano on a gpu. keep in mind this model is contrived and doesn't really represent anything useful, it's more to demonstrate some matrix operations.

first consider the following numpy code (speed_test_numpy.py) which does a simple y=mx+b like calculation a number of times in a tight loop. this looping isn't just for benchmarking, lots of learning algorithms operate on a tight loop.

    import time
    import numpy as np

    # define data
    # square matrices will do for a demo
    np.random.seed(123)
    m = np.random.randn(1000, 1000).astype('float32')
    x = np.random.randn(1000, 1000).astype('float32')
    b = np.random.randn(1000, 1000).astype('float32')

    # run tight loop
    start = time.time()
    for i in range(500):
        y = np.add(np.dot(m, x), b)
    print("numpy %s sec" % (time.time() - start))

this code on a 6 core 3.8GHz AMD runs in a bit over 2 min

    $ python speed_test_numpy.py
    numpy 135.350140095 sec

now consider the same thing in theano (speed_test_theano.py)

    import time
    import numpy as np
    import theano
    import theano.tensor as T

    # define data
    np.random.seed(123)
    m = np.random.randn(1000, 1000).astype('float32')
    x = np.random.randn(1000, 1000).astype('float32')
    b = np.random.randn(1000, 1000).astype('float32')

    # define a symbolic expression of the equations in theano
    tm = T.matrix("m")
    tx = T.matrix("x")
    tb = T.matrix("b")
    ty = T.add(T.dot(tm, tx), tb)

    # and compile it; the input order has to match the call below
    line = theano.function(inputs=[tm, tx, tb], outputs=[ty])

    # then run same loop as before
    start = time.time()
    for i in range(500):
        y, = line(m, x, b)
    print("theano %s sec" % (time.time() - start))

hopefully it's clear enough what is happening here at a high level but just briefly the tm, tx, tb and ty variables represent a symbolic representation of what we want to do and the theano.function call compiles this into actual executable code. there is lots of gentle intro material that introduces this notation on the theano site.

when run on the cpu it takes about the same time as the numpy version

    $ THEANO_FLAGS=device=cpu python speed_test_theano.py
    theano 136.371109009 sec

but when "magically" run on the gpu it's quite a bit faster.

    $ THEANO_FLAGS=device=gpu python speed_test_theano.py
    Using gpu device 0: GeForce GTX 970
    theano 3.16091990471 sec

awesome! a x40 speed up! so we're done right? not quite, we can do better.

let's drill into what's actually happening; we can do this in two ways, debugging the compiled graph and theano profiling.

debugging allows us to see what a function has been compiled to. for the cpu case it's just a single blas gemm (general matrix multiplication) call. that's exactly what we'd want, so great!

    Gemm{no_inplace} [@A] ''   0
     |b [@B]
     |TensorConstant{1.0} [@C]
     |m [@D]
     |x [@E]
     |TensorConstant{1.0} [@C]

profiling allows us to see where time is spent. 100% in this single op, no surprise.

    $ THEANO_FLAGS=device=cpu,profile=True python speed_test_theano.py
    ...
    <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
      100.0%   100.0%   136.282s   2.73e-01s   500   0   Gemm{no_inplace}
    ...

looking at the gpu version though things are a little different...

    HostFromGpu [@A] ''   4
     |GpuGemm{inplace} [@B] ''   3
       |GpuFromHost [@C] ''   2
       | |b [@D]
       |TensorConstant{1.0} [@E]
       |GpuFromHost [@F] ''   1
       | |m [@G]
       |GpuFromHost [@H] ''   0
       | |x [@I]
       |TensorConstant{1.0} [@E]

we can see a GpuGemm operation, the gpu equivalent of Gemm, but now there's a bunch of GpuFromHost & HostFromGpu operations too? what are these?

i'll tell you what they are, they are the bane of your existence! these represent transferring data to/from the gpu which is slow and, if we're not careful, can add up to a non trivial amount. if we review the profiling output we can see that, though we're faster than the non gpu version, we're spending >70% of the time just moving data.

(though remember this example is contrived, we'd expect to be doing more in our overall computation than just a single general matrix multiply)

    $ THEANO_FLAGS=device=gpu,profile=True python speed_test_theano.py
    ...
    <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
       26.4%    26.4%   0.776s   1.55e-03s   500   3   GpuGemm{inplace}
       19.5%    45.9%   0.573s   1.15e-03s   500   0   GpuFromHost(x)
       19.5%    65.4%   0.572s   1.14e-03s   500   1   GpuFromHost(m)
       19.3%    84.7%   0.565s   1.13e-03s   500   2   GpuFromHost(b)
       15.3%   100.0%   0.449s   8.99e-04s   500   4   HostFromGpu(GpuGemm{inplace}.0)
    ...

ouch!
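to put a number on that, here's a quick back-of-the-envelope sketch in plain python that just adds up the apply times from the gpu profile. the dictionary is the profile rows typed in by hand and `transfer_fraction` is a made up helper name; nothing here touches theano.

```python
# apply times (seconds) copied by hand from the gpu profile output
apply_times = {
    "GpuGemm{inplace}": 0.776,
    "GpuFromHost(x)": 0.573,
    "GpuFromHost(m)": 0.572,
    "GpuFromHost(b)": 0.565,
    "HostFromGpu(GpuGemm{inplace}.0)": 0.449,
}

def transfer_fraction(times):
    """fraction of the total op time spent copying data between host and gpu."""
    total = sum(times.values())
    transfers = sum(t for op, t in times.items()
                    if op.startswith(("GpuFromHost", "HostFromGpu")))
    return transfers / total

print("%.1f%% of op time is data transfer" % (100 * transfer_fraction(apply_times)))
```

that comes out at roughly 74%, so only about a quarter of the measured op time is the actual matrix multiply.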

the crux of this problem is that we actually have two types of variables in this model; the parameterisation of the model (m & b) and those related to examples (x & y). so, though it's realistic to do a speed test with a tight loop over the same function many times, what is *not* realistic is that we are passing the model parameters to/from the gpu for each and every input example. this is a complete waste; it's much more sensible to send them over to the gpu once at the start of the loop and retrieve them once at the end. this is an important and very common pattern.

how do we fix this? it's actually pretty simple; shared variables. yay!

consider the following; speed_test_theano_shared.py

    import time
    import numpy as np
    import theano
    import theano.tensor as T

    # define data
    np.random.seed(123)
    m = np.random.randn(1000, 1000).astype('float32')
    x = np.random.randn(1000, 1000).astype('float32')
    b = np.random.randn(1000, 1000).astype('float32')

    # define a symbolic expression of the equations in theano
    tm = theano.shared(m)  # copy m over to gpu once explicitly
    tx = T.matrix("x")
    tb = theano.shared(b)  # copy b over to gpu once explicitly
    ty = T.add(T.dot(tm, tx), tb)
    line = theano.function(inputs=[tx], outputs=[ty])  # don't pass m & b each call

    # then run same loop as before
    start = time.time()
    for i in range(500):
        y, = line(x)
    print(tm.get_value().shape)  # note: we can get the value back at any time

reviewing the debug we can see this removes a stack of the GpuFromHost calls.

    HostFromGpu [@A] ''   2
     |GpuGemm{no_inplace} [@B] ''   1
       |[@C]
       |TensorConstant{1.0} [@D]
       |[@E]
       |GpuFromHost [@F] ''   0
       | |x [@G]
       |TensorConstant{1.0} [@D]

and we're down to < 2s

    $ THEANO_FLAGS=device=gpu,profile=True python speed_test_theano_shared.py
    Using gpu device 0: GeForce GTX 970
    theano 1.93515706062 sec
    ...
    <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
       44.7%    44.7%   0.804s   1.61e-03s   500   1   GpuGemm{no_inplace}
       30.2%    74.9%   0.543s   1.09e-03s   500   0   GpuFromHost(x)
       25.1%   100.0%   0.451s   9.01e-04s   500   2   HostFromGpu(GpuGemm{no_inplace}.0)
    ...

what's even crazier is we can go further by moving the x and y matrices onto the gpu too. it turns out this isn't *too* far fetched since if x and y were representing training examples we'd be iterating over them anyways (and if we could fit them all onto the gpu that'd be great)

    import time
    import numpy as np
    import theano
    import theano.tensor as T

    # define data
    np.random.seed(123)
    m = np.random.randn(1000, 1000).astype('float32')
    x = np.random.randn(1000, 1000).astype('float32')
    b = np.random.randn(1000, 1000).astype('float32')

    # define a symbolic expression of the equations in theano
    tm = theano.shared(m)
    tx = theano.shared(x)
    tb = theano.shared(b)
    ty = theano.shared(np.zeros((1000, 1000)).astype('float32'))  # we need a shared var for y now
    mx_b = T.add(T.dot(tm, tx), tb)

    # and compile it
    train = theano.function(inputs=[], updates={ty: mx_b})  # update y on gpu

    # then run same loop as before
    start = time.time()
    for i in range(500):
        train()  # now there's no input/output
    print(tm.get_value().shape)
    print("theano %s sec" % (time.time() - start))

the debug graph is like the cpu graph now, just one gemm call.

    GpuGemm{no_inplace} [@A] ''   0
     |[@B]
     |TensorConstant{1.0} [@C]
     |[@D]
     |[@E]
     |TensorConstant{1.0} [@C]

and runs in under a second. x150 the numpy version. nice! :)

    $ THEANO_FLAGS=device=gpu,profile=True python speed_test_theano_shared2.py
    theano 0.896003007889 sec
    ...
    <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
      100.0%   100.0%   0.800s   1.60e-03s   C   500   1   GpuGemm{no_inplace}
    ...

PyMC is a python library for working with bayesian statistical models, primarily using MCMC methods. as a software engineer who has only just scratched the surface of statistics this whole MCMC business is blowing my mind so i've got to share some examples.

let's start with the simplest thing possible, fitting a simple distribution.

say we have a thousand values, ` 87.27, 67.98, 119.56, ...`

and we want to build a model of them.

a common first step might be to generate a histogram.

if i had to make a guess i'd say this data looks normally distributed. somewhat unsurprising, not just because normal distributions are freakin everywhere, (this great khan academy video on the central limit theorem explains why) but because it was me who synthetically generated this data in the first place ;)

now a normal distribution is parameterised by two values; its *mean* (technically speaking, the "middle" of the curve) and its *standard deviation* (even more technically speaking, how "fat" it is) so let's use PyMC to figure out what these values are for this data.

*!!warning!! !!!total overkill alert!!!* there must be a bazillion simpler ways to fit a normal to this data but this post is about
dead-simple-PyMC not dead-simple-something-else.

first a definition of our model.

    # simple_normal_model.py
    from pymc import *
    data = map(float, open('data', 'r').readlines())
    mean = Uniform('mean', lower=min(data), upper=max(data))
    precision = Uniform('precision', lower=0.0001, upper=1.0)
    process = Normal('process', mu=mean, tau=precision, value=data, observed=True)

working *backwards* through this code ...

- line 6 says i am trying to model some `process` that i believe is `Normal`ly distributed, defined by variables `mean` and `precision`. (precision is just the inverse of the variance, which in turn is just the standard deviation squared). i've already `observed` this data and the `value`s are in the variable `data`
- line 5 says i don't know the `precision` for my `process` but my prior belief is its value is somewhere between 0.0001 and 1.0. since i don't favor any values in this range my belief is `uniform` across the values. note: assuming a uniform distribution for the precision is overly simplifying things quite a bit, but we can get away with it in this simple example and we'll come back to it.
- line 4 says i don't know the `mean` for my data but i think it's somewhere between the `min` and the `max` of the observed `data`. again this belief is `uniform` across the range.
- line 3 says the `data` for my unknown `process` comes from a local file (just-plain-python)
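since precision and standard deviation get swapped back and forth a lot, here's a tiny plain python sketch of the conversion (the two helper names are mine, not PyMC's). it also shows what the 0.0001 to 1.0 prior range on the precision means in standard deviation terms.

```python
import math

def precision_from_std_dev(std_dev):
    # precision is the inverse of the variance, and variance is std_dev squared
    return 1.0 / (std_dev ** 2)

def std_dev_from_precision(precision):
    # going the other way: invert, then take the square root
    return 1.0 / math.sqrt(precision)

# a standard deviation of 20 corresponds to a precision of 1/400
print(precision_from_std_dev(20.0))

# the prior bounds on precision, expressed as standard deviations
print(std_dev_from_precision(0.0001), std_dev_from_precision(1.0))
```

ie a uniform prior on the precision between 0.0001 and 1.0 is saying the standard deviation is somewhere between 1 and 100, though not uniformly so.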

the second part of the code runs the MCMC sampling.

    # run_mcmc.py
    from pymc import *
    import simple_normal_model
    model = MCMC(simple_normal_model)
    model.sample(iter=500)
    print(model.stats())

working *forwards* through this code ...

- line 4 says build a MCMC for the model from the `simple_normal_model` file
- line 5 says run a sample for 500 iterations
- line 6 says print some stats.

**and that's it!**

the output from our stats includes, among other things, estimates for the `mean` and `precision` we were trying to find

    {
     'mean': {'95% HPD interval': array([ 94.53688316, 102.53626478]) ... },
     'precision': {'95% HPD interval': array([ 0.00072487, 0.03671603]) ... },
     ...
    }

now i've brushed over a couple of things here (eg the use of uniform prior over the precision, see here for more details) but i can get away with it all because this problem is a trivial one and i'm not doing gibbs sampling in this case. the main point i'm trying to make is that it's dead simple to start writing these models.

one thing i do want to point out is that this estimation doesn't result in just one single value for mean and precision, it results in a distribution of the possible values. this is great since it gives us an idea of how confident we can be in the values as well as allowing this whole process to be iterative, ie the output values from this model can be fed easily into another.

all the code above parameterised the normal distribution with a mean and a precision. i've always thought of normals though in terms of means and standard deviations (precision is a more bayesian way to think of things... apparently...) so the first extension to my above example i want to make is to redefine the problem in terms of a prior on the standard deviation instead of the precision. mainly i want to do this to introduce the `deterministic` concept but it's also a subtle change in how the sampling search will be directed because it introduces a non linear transform.

    from pymc import *

    data = map(float, open('data', 'r').readlines())
    mean = Uniform('mean', lower=min(data), upper=max(data))
    std_dev = Uniform('std_dev', lower=0, upper=50)

    @deterministic(plot=False)
    def precision(std_dev=std_dev):
        return 1.0 / (std_dev * std_dev)

    process = Normal('process', mu=mean, tau=precision, value=data, observed=True)

our code is almost the same but instead of a prior on the `precision` we use a `deterministic` method to map from the parameter we're trying to fit (the `precision`) to a variable we're trying to estimate (the `std_dev`).

we fit the model using the same `run_mcmc.py` but this time get estimates for the `std_dev` not the `precision`

    {
     'mean': {'95% HPD interval': array([ 94.23147867, 101.76893808]), ...
     'std_dev': {'95% HPD interval': array([ 19.53993697, 21.1560098 ]), ...
     ...
    }

which all matches up to how i originally generated the data in the first place.. cool!

    from numpy.random import normal
    data = [normal(100, 20) for _i in xrange(1000)]

for this example let's now dive a bit deeper than just the stats object. to help understand how the sampler is converging on its results we can also dump a trace of its progress at the end of `run_mcmc.py`

    import numpy
    for p in ['mean', 'std_dev']:
        numpy.savetxt("%s.trace" % p, model.trace(p)[:])

plotting this we can see how quickly the sampled values converged.
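to get a feel for what a trace like this is made of, here's a rough sketch of a random-walk metropolis sampler for the same mean / std_dev model in plain python. to be clear this is *not* what PyMC does internally (it chooses its own step methods); the function names, step size, starting point and burn-in length are all things i've made up for illustration, and the data is regenerated with python's `random` rather than read from the file.

```python
import math
import random

def log_likelihood(data, mean, std_dev):
    """log of the normal likelihood of the data, dropping constant terms."""
    squared_error = sum((x - mean) ** 2 for x in data)
    return -len(data) * math.log(std_dev) - squared_error / (2.0 * std_dev ** 2)

def metropolis(data, iterations=2000, step=1.0, seed=0):
    rng = random.Random(seed)
    lower, upper = min(data), max(data)
    mean, std_dev = (lower + upper) / 2.0, 25.0   # arbitrary starting point
    current = log_likelihood(data, mean, std_dev)
    trace = []
    for _ in range(iterations):
        new_mean = mean + rng.gauss(0.0, step)
        new_std_dev = std_dev + rng.gauss(0.0, step)
        # uniform priors: anything outside their range has zero probability
        if lower <= new_mean <= upper and 0.0 < new_std_dev <= 50.0:
            proposed = log_likelihood(data, new_mean, new_std_dev)
            # always accept uphill moves, accept downhill ones with prob exp(difference)
            if math.log(1.0 - rng.random()) < proposed - current:
                mean, std_dev, current = new_mean, new_std_dev, proposed
        trace.append((mean, std_dev))
    return trace

data_rng = random.Random(123)
data = [data_rng.gauss(100, 20) for _ in range(1000)]
trace = metropolis(data)
kept = trace[1000:]                               # throw away the first half as burn-in
est_mean = sum(m for m, _ in kept) / len(kept)
est_std_dev = sum(s for _, s in kept) / len(kept)
print(est_mean, est_std_dev)
```

the `trace` list plays the same role as the `.trace` files; early entries wander in from the starting point and later ones hover around the values the data was generated with.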

let's consider a slightly more complex example.

again we have some data... `107.63, 207.43, 215.84, ...`

that plotted looks like this...

hmmm. looks like *two* normals this time with the one centered on 100 having a bit more data.

how could we model this one?

    from pymc import *

    data = map(float, open('data', 'r').readlines())
    theta = Uniform("theta", lower=0, upper=1)
    bern = Bernoulli("bern", p=theta, size=len(data))
    mean1 = Uniform('mean1', lower=min(data), upper=max(data))
    mean2 = Uniform('mean2', lower=min(data), upper=max(data))
    std_dev = Uniform('std_dev', lower=0, upper=50)

    @deterministic(plot=False)
    def mean(bern=bern, mean1=mean1, mean2=mean2):
        return bern * mean1 + (1 - bern) * mean2

    @deterministic(plot=False)
    def precision(std_dev=std_dev):
        return 1.0 / (std_dev * std_dev)

    process = Normal('process', mu=mean, tau=precision, value=data, observed=True)

reviewing the code again it's mostly the same, the big difference being the `deterministic` definition of the `mean`. it's now that we finally start to show off the awesome power of these non analytical approaches.

the `mean` function defines the mean not by one `mean` variable but instead as a mixture of two, `mean1` and `mean2`. for each value we're trying to model we pick either `mean1` or `mean2` based on *another* random variable `bern`. `bern` is described by a bernoulli distribution and so is either 1 or 0, with the chance of it being 1 given by the parameter `theta`.

ie the definition of our `mean` is that when `theta` is high, near 1.0, we pick `mean1` most of the time and when `theta` is low, near 0.0, we pick `mean2` most of the time.

what we are solving for then is not just `mean1` and `mean2` but how the values are split between them (described by `theta`) (and note for the sake of simplicity i made the two normals differ in their means but use a shared standard deviation. depending on what you're doing this might or might not make sense)
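one way to read the model is as a recipe for generating data, so here's a small plain python sketch of that recipe run forwards. `sample_mixture` is a hypothetical helper, and the parameter values (theta of 0.33, means of 200 and 100, std dev of 20) are just picked for illustration.

```python
import random

def sample_mixture(n, theta, mean1, mean2, std_dev, seed=0):
    """draw n values; each one uses mean1 with probability theta, else mean2."""
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        bern = 1 if rng.random() < theta else 0       # the bernoulli switch
        mean = bern * mean1 + (1 - bern) * mean2      # same expression as the model
        values.append(rng.gauss(mean, std_dev))
    return values

values = sample_mixture(1500, theta=0.33, mean1=200, mean2=100, std_dev=20)
# crude check: what fraction of points landed nearer the mean1 bump?
near_mean1 = sum(1 for v in values if v > 150) / float(len(values))
print(near_mean1)
```

the fraction printed should sit close to `theta`; the mcmc fit is doing the reverse of this, working back from the values to the parameters.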

reviewing the traces we can see the converged `mean`s are 100 & 200 with `std dev` 20. the mix (`theta`) is 0.33, which all agrees with the synthetic data i generated for this example...

    from numpy.random import normal
    import random
    data = [normal(100, 20) for _i in xrange(1000)]  # 2/3rds of the data
    data += [normal(200, 20) for _i in xrange(500)]  # 1/3rd of the data
    random.shuffle(data)

to me the awesome power of these methods is the ability in that function to pretty much write whatever i think best describes the process. too cool for school.

i also find it interesting to see how the convergence came along... the model starts in a local minimum of both normals having mean a bit below 150 (the midpoint of the two actual ones) with a mixing proportion of somewhere in the ballpark of 0.5 / 0.5. around iteration 1,500 it correctly splits them apart and starts to understand the mix is more like 0.3 / 0.7. finally by about iteration 2,500 it starts working on the standard deviation which in turn really helps narrow down the true means.

(thanks cam for helping me out with the formulation of this one..)

these are pretty simple examples thrown together to help me learn but i think they're still illustrative of the power of these methods (even when i completely ignore anything to do with conjugacy)

in general i've been working through an awesome book, doing bayesian data analysis, and can't recommend it enough.

i also found john's blog post on using jags in r was really helpful getting me going.

all the examples listed here are on github.

next is to rewrite everything in stan and do some comparison between pymc, stan and jags. fun times!

say you have three items; item1, item2 and item3 and you've somehow associated a count for each against one of five labels; A, B, C, D, E

    > data
              A   B     C    D   E
    item1 23700  20  1060   11   4
    item2  1581 889 20140 1967 200
    item3     1   0     1   76   0

depending on what you're doing it'd be reasonable to normalise these values and an l1-normalisation (ie rescale so they are the same proportion but add up to 1) gives us the following...

    > l1_norm = function(x) x / sum(x)
    > l1 = t(apply(data, 1, l1_norm))
    > l1
                 A          B        C          D          E
    item1 0.955838 0.00080661 0.042751 0.00044364 0.00016132
    item2 0.063809 0.03588005 0.812851 0.07938814 0.00807200
    item3 0.012821 0.00000000 0.012821 0.97435897 0.00000000
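for anyone who doesn't read R, the same l1-normalisation is a couple of lines of plain python. this is just a sketch of the calculation for the item3 row; the counts are the ones from the data table.

```python
def l1_norm(counts):
    """rescale counts so they keep their proportions but add up to 1."""
    total = float(sum(counts))
    return [c / total for c in counts]

item3 = [1, 0, 1, 76, 0]          # counts for labels A, B, C, D, E
item3_l1 = l1_norm(item3)
print([round(p, 6) for p in item3_l1])
```

with only 78 observations in total, the 76 counts for D end up as 0.97 of the mass, which is what sets up the problem discussed next.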

great... but you know it's fair enough if you think things don't feel right...

according to these normalised values item3 is "more of" a D (0.97) than item1 is an A (0.95) even though we've only collected 1/300th of the data for it. this just isn't right.

purely based on these numbers i'd think it's more sensible to expect item3 to be an A or a C (since that's what we've seen with item1 and item2) but we just haven't seen enough data for it yet. what makes sense then is to smooth the value of item3 out and make it more like some sort of population average.

so firstly what makes a sensible population average? ie if we didn't know anything at all about a new item what would we want the proportions of labels to be? alternatively we can ask what do we think item3 is likely to look like later on as we gather more data for it? i think an l1-norm of the sums of all the values makes sense ...

    > column_totals = apply(data, 2, sum)
    > population_average = l1_norm(column_totals)
    > population_average
            A         B         C         D         E
    0.5094218 0.0183000 0.4268199 0.0413513 0.0041069

... and it seems fair. without any other info it's reasonable to "guess" a new item is likely to be somewhere between an A (0.50) and a C (0.42)

so now we have our item3, and our population average, and we want to mix them together in some way... how might we do this?

                    A        B        C        D        E
    item3    0.012821 0.000000 0.012821 0.974358 0.000000
    pop_aver 0.509421 0.018300 0.426819 0.041351 0.004106

a linear weighted sum is nice and easy; ie a classic `item3 * alpha + pop_aver * (1-alpha)` but then how do we pick alpha?

if we were to do this reweighting for item1 or item2 we'd want alpha to be large, ie nearer 1.0, to reflect the confidence we have in their current values since we have lots of data for them. for item3 we'd want alpha to be small, ie nearer 0, to reflect the lack of confidence we have in it.

enter the confidence interval, a way of testing how confident we are in a set of values.

firstly, a slight diversion re: confidence intervals...

consider three values, 100, 100 and 200. running this goodness of fit test gives the following result.

    > library(NCStats)
    > gofCI(chisq.test(c(100, 100, 200)), conf.level=0.95)
         p.obs   p.LCI   p.UCI
    [1,]  0.25 0.21008 0.29468
    [2,]  0.25 0.21008 0.29468
    [3,]  0.50 0.45123 0.54877

you can read the first row of this table as "the count 100 was observed to be 0.25 (p.obs) of the total and i'm 95% confident (conf.level) that the *true* value is between 0.21 (p.LCI = lower confidence interval) and 0.29 (p.UCI = upper confidence interval)".

there are two important things to notice that can change the range of confidence interval...

1) upping the confidence level results in a wider confidence interval. ie "i'm 99.99% confident the true value is between 0.17 and 0.34, but only 1% confident it's between 0.2497 and 0.2503"

    > gofCI(chisq.test(c(100, 100, 200)), conf.level=0.9999)
         p.obs   p.LCI   p.UCI
    [1,]  0.25 0.17593 0.34230
    [2,]  0.25 0.17593 0.34230
    [3,]  0.50 0.40452 0.59548
    > gofCI(chisq.test(c(100, 100, 200)), conf.level=0.01)
         p.obs   p.LCI   p.UCI
    [1,]  0.25 0.24973 0.25027
    [2,]  0.25 0.24973 0.25027
    [3,]  0.50 0.49969 0.50031

2) getting more data results in a narrower confidence interval. ie "even though the proportions stay the same as i gather x10, then x100, my original data i can narrow my confidence interval around the observed value"

    > gofCI(chisq.test(c(10, 10, 20)), conf.level=0.95)
         p.obs   p.LCI   p.UCI
    [1,]  0.25 0.14187 0.40194
    [2,]  0.25 0.14187 0.40194
    [3,]  0.50 0.35200 0.64800
    > gofCI(chisq.test(c(100, 100, 200)), conf.level=0.95)
         p.obs   p.LCI   p.UCI
    [1,]  0.25 0.21008 0.29468
    [2,]  0.25 0.21008 0.29468
    [3,]  0.50 0.45123 0.54877
    > gofCI(chisq.test(c(1000, 1000, 2000)), conf.level=0.95)
         p.obs   p.LCI   p.UCI
    [1,]  0.25 0.23683 0.26365
    [2,]  0.25 0.23683 0.26365
    [3,]  0.50 0.48451 0.51549

so it turns out this confidence interval is exactly what we're after; a way of estimating a pessimistic value (the lower bound) that gets closer to the observed value as the size of the observed data grows.
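the numbers `gofCI` prints for these examples line up with a wilson score interval computed per proportion. that's my observation from the outputs, not something i've checked in the NCStats source, so treat the following plain python as a sketch under that assumption (`wilson_interval` is a made up name and `statistics.NormalDist` needs python 3.8+).

```python
from statistics import NormalDist

def wilson_interval(count, total, conf_level=0.95):
    """wilson score interval for the proportion count / total."""
    z = NormalDist().inv_cdf(0.5 + conf_level / 2.0)   # two sided z value
    p = count / float(total)
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2.0 * total)) / denom
    half_width = z * (p * (1.0 - p) / total + z * z / (4.0 * total * total)) ** 0.5 / denom
    return centre - half_width, centre + half_width

# same proportions (a quarter of the total) with x1, x10 and x100 the data
for total in (40, 400, 4000):
    lower, upper = wilson_interval(total // 4, total)
    print(total, round(lower, 5), round(upper, 5))
```

as the total grows the lower bound creeps up towards the observed 0.25, which is the "pessimistic estimate that improves with more data" behaviour we want.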

note: there's a lot of discussion on how best to do these calculations. there is a more "correct" and principled version of this calculation that is provided by MultinomialCI but i found its results weren't as good for my purposes.

awesome, so back to the problem at hand; how do we pick our mixing parameter alpha?

let's extract the lower bound of the confidence interval value for our items using a very large confidence (99.99%) (to enforce a wide interval). the original l1-normalised values are shown here again for comparison.

    > l1
                A       B       C       D       E
    item1 0.95583 0.00080 0.04275 0.00044 0.00016
    item2 0.06380 0.03588 0.81285 0.07938 0.00807
    item3 0.01282 0.00000 0.01282 0.97435 0.00000
    > library(NCStats)
    > gof_ci_lower = function(x) gofCI(chisq.test(x), conf.level=0.9999)[,2]
    > gof_chi_ci = t(apply(data, 1, gof_ci_lower))
    > gof_chi_ci
                A       B       C       D       E
    item1 0.95048 0.00035 0.03803 0.00015 0.00003
    item2 0.05803 0.03156 0.80302 0.07296 0.00614
    item3 0.00000 0.00000 0.00000 0.79725 0.00000

we see that item1, which had a lot of support data, has dropped its A value only slightly from 0.955 to 0.950 whereas item3, which had very little support, has had its D value drop drastically from 0.97 to 0.79. by using a conf.level closer and closer to 1.0 we make this drop more and more drastic.

because each of the values in the `gof_chi_ci` matrix are lower bounds the rows no longer add up to 1.0 (as they do in the l1-value matrix). we can calculate how much we've "lost" with `1 - sum(rows)` and it turns out this residual is pretty much exactly what we were after when we were looking for our mixing parameter alpha!

    > gof_chi_ci$residual = as.vector(1 - apply(gof_chi_ci, 1, sum))
    > gof_chi_ci
                A       B       C       D       E residual
    item1 0.95048 0.00035 0.03803 0.00015 0.00003  0.01096
    item2 0.05803 0.03156 0.80302 0.07296 0.00614  0.02829
    item3 0.00000 0.00000 0.00000 0.79725 0.00000  0.20275

in the case of item1 the residual is low, ie the confidence interval lower bound was close to the observed value so we shouldn't mix in much of the population average. but in the case of item3 the residual is high, we lost a lot by the confidence interval being very wide, so we might as well mix in more of the population average.

now what i've said here is completely unprincipled. i just made it up and the maths work because everything is normalised. but having said that the results are really good so i'm going with it :)

putting it all together then we have the following bits of data...

    > l1 # our original estimates
                A       B       C       D       E
    item1 0.95583 0.00080 0.04275 0.00044 0.00016
    item2 0.06380 0.03588 0.81285 0.07938 0.00807
    item3 0.01282 0.00000 0.01282 0.97435 0.00000
    > population_average # the population average
                A       B       C       D       E
    item1 0.50942 0.01830 0.42681 0.04135 0.00410
    > gof_chi_ci # lower bound of our confidences
                A       B       C       D       E
    item1 0.95048 0.00035 0.03803 0.00015 0.00003
    item2 0.05803 0.03156 0.80302 0.07296 0.00614
    item3 0.00000 0.00000 0.00000 0.79725 0.00000
    > gof_chi_ci_residual = as.vector(1 - apply(gof_chi_ci, 1, sum))
    > gof_chi_ci_residual # how much we should mix in the population average
    [1] 0.01096 0.02829 0.20275 0.40759

since there's lots of support for item1 the residual is small, only 0.01, so we smooth only a little of the population average in and end up not changing the values that much

    > l1[1,]
                A       B       C       D       E
    item1 0.95583 0.00080 0.04275 0.00044 0.00016
    > gof_chi_ci[1,] + population_average * gof_chi_ci_residual[1]
                A       B       C       D       E
    item1 0.95606 0.00055 0.04270 0.00060 0.00007

but item3 has a higher residual and so we smooth more of the population average in and it's shifted much more strongly from D towards A and C

    > l1[3,]
                A       B       C       D       E
    item3 0.01282 0.00000 0.01282 0.97435 0.00000
    > gof_chi_ci[3,] + population_average * gof_chi_ci_residual[3]
                A       B       C       D       E
    item3 0.10329 0.00371 0.08653 0.80563 0.00083
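the final mixing step is simple enough to redo in plain python as a sanity check. this sketch uses the rounded item3 lower bounds and population average printed in the R session (typed in by hand), and `smooth` is just a name i've picked.

```python
def smooth(lower_bounds, population_average):
    """lower bounds plus the 'lost' mass redistributed as population average."""
    residual = 1.0 - sum(lower_bounds)
    return [lb + pa * residual for lb, pa in zip(lower_bounds, population_average)]

# values for labels A, B, C, D, E copied from the session output
population_average = [0.50942, 0.01830, 0.42681, 0.04135, 0.00410]
item3_lower = [0.0, 0.0, 0.0, 0.79725, 0.0]

smoothed = smooth(item3_lower, population_average)
print([round(v, 5) for v in smoothed])
```

because the residual is exactly the mass missing from the lower bounds, the smoothed row adds back up to (very nearly) 1.0 without any extra renormalising.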

one model is based on the idea of an interest graph where the nodes of the graph are users and items and the edges of the graph represent an interest, whatever that might mean for the domain.

if we only allow edges between users and items the graph is bipartite.

let's consider a simple example of 3 users and 3 items; user1 likes item1, user2 likes all three items and user3 likes just item3.

fig1: user / item interest graph

one way to model similarity between items is as follows....

let's consider a token starting at item1. we're going to repeatedly "bounce" this token back and forth between the items and the users based on the interest edges.

so, since item1 is connected to user1 and user2 we'll pick one of them randomly and move the token across. it's 50/50 which of user1 or user2 we end up at (fig2).

next we bounce the token back to the items; if the token had gone to user1 then it has to go back to item1 since user1 has no other edges, but if it had gone to user2 it could go back to any of the three items with equal probability; 1/3rd.

the result of this is that the token has 0.66 chance of ending up back at item1 and equal 0.16 chance of ending up at either item2 or item3 (fig3)

fig2: dispersing from item1 to users; fig3: dispersing back from users to items

( note this is different than if we'd started at item2. in that case we'd have gone to user2 with certainty and then it would have been uniformly random which of the items we'd ended up at )

for illustration let's do another iteration...

bouncing back to the users, item1's 0.66 gets split 50/50 between user1 and user2. all of item2's 0.16 goes to user2 and item3 splits its 0.16 between user2 and user3. we end up with fig4 (no, not that figure 4). bouncing back to the items we get to fig5.

fig4; fig5

if we keep repeating things we converge on the values `{item1: 0.40, item2: 0.20, item3: 0.40}` and these represent the probabilities of ending up in a particular item if we bounced forever.

note since this is convergent it also doesn't actually matter which item we'd started at, it would always get the same result in the limit.
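the bouncing is easy to simulate with probabilities rather than an actual token. here's a minimal plain python sketch for the three user / three item example; the `likes` dictionary encodes the edges described earlier and `bounce` is a name i've made up for one items -> users -> items round trip.

```python
# interest edges: user1 likes item1, user2 likes all three, user3 likes item3
likes = {
    "user1": ["item1"],
    "user2": ["item1", "item2", "item3"],
    "user3": ["item3"],
}
items = ["item1", "item2", "item3"]
liked_by = {i: [u for u in sorted(likes) if i in likes[u]] for i in items}

def bounce(item_probs):
    """one round trip: split each item's mass over its users, then back again."""
    user_probs = dict.fromkeys(likes, 0.0)
    for item, p in item_probs.items():
        for user in liked_by[item]:
            user_probs[user] += p / len(liked_by[item])
    new_item_probs = dict.fromkeys(items, 0.0)
    for user, p in user_probs.items():
        for item in likes[user]:
            new_item_probs[item] += p / len(likes[user])
    return new_item_probs

probs = {"item1": 1.0, "item2": 0.0, "item3": 0.0}   # token starts at item1
first = bounce(probs)                                # the fig3 state
for _ in range(50):
    probs = bounce(probs)
print({item: round(p, 2) for item, p in probs.items()})
```

changing the starting item in `probs` changes the early iterations but the limit stays at 0.40 / 0.20 / 0.40.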

to people familiar with power methods this convergence is no surprise. you might also recognise a similarity between this and the most famous power method of them all, pagerank.

so what has this all got to do with item similarity?

well, the values of the probabilities might all converge to the same set regardless of which item we start at
**but** each item gets there in different ways.

most importantly we can capture this difference by taking away a bit of probability each iteration of the dispersion.

so, again, say we start at item1. after we go to users and back to items we are at fig3 again.

but this time, before we go back to the users side, let's take away a small proportion of the probability mass, say, 1/4. this would be 0.16 for item1 and 0.04 for item2 and item3. this leaves us with fig6.

fig3 (again); fig6

we can then repeat iteratively as before, items -> users -> items -> users. but each time we are on the items side we take away 1/4 of the mass until it's all gone.

| iteration | taken from item1 | taken from item2 | taken from item3 |
|-----------|------------------|------------------|------------------|
| 1         | 0.16             | 0.04             | 0.04             |
| 2         | 0.09             | 0.04             | 0.05             |
| 3         | 0.06             | 0.02             | 0.05             |
| ...       | ...              | ...              | ...              |
| final sum | 0.50             | 0.20             | 0.30             |

if we do the same for item2 and item3 we get different values...

| starting at | total taken from item1 | total taken from item2 | total taken from item3 |
|-------------|------------------------|------------------------|------------------------|
| item1       | 0.50                   | 0.20                   | 0.30                   |
| item2       | 0.38                   | 0.24                   | 0.38                   |
| item3       | 0.30                   | 0.20                   | 0.50                   |
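the decayed version can be sketched the same way in plain python. as before the `likes` edges are the three user / three item example, and `bounce` / `decayed_totals` are hypothetical names; the 200 iteration cap is an arbitrary choice that leaves a negligible amount of mass behind.

```python
likes = {
    "user1": ["item1"],
    "user2": ["item1", "item2", "item3"],
    "user3": ["item3"],
}
items = ["item1", "item2", "item3"]
liked_by = {i: [u for u in sorted(likes) if i in likes[u]] for i in items}

def bounce(item_probs):
    """one round trip: items -> users -> items."""
    user_probs = dict.fromkeys(likes, 0.0)
    for item, p in item_probs.items():
        for user in liked_by[item]:
            user_probs[user] += p / len(liked_by[item])
    new_item_probs = dict.fromkeys(items, 0.0)
    for user, p in user_probs.items():
        for item in likes[user]:
            new_item_probs[item] += p / len(likes[user])
    return new_item_probs

def decayed_totals(start_item, decay=0.25, iterations=200):
    """total mass taken from each item when we skim `decay` off every round."""
    probs = {item: 1.0 if item == start_item else 0.0 for item in items}
    totals = dict.fromkeys(items, 0.0)
    for _ in range(iterations):
        probs = bounce(probs)
        for item in items:
            taken = probs[item] * decay
            totals[item] += taken
            probs[item] -= taken
    return totals

for start in items:
    totals = decayed_totals(start)
    print(start, [round(totals[item], 2) for item in items])
```

each row of output is the feature vector for one starting item, and since all of the mass eventually gets skimmed off, each row adds up to 1.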

finally these totals can be used as features for a pairwise comparison of the items. intuitively we can see that for any row wise similarity function we might choose to use sim(item1, item3) > sim(item1, item2) or sim(item2, item3)

one last thing to consider is that the amount of decay, 1/4 in the above example, is of course configurable and we get different results using a value between 0.0 and 1.0.

a very low value, ~= 0.0, produces the limit value, all items being classed the same. a higher value, ~= 1.0, stops the iterations after only one "bounce" and represents the minimal amount of dispersal.

the first thing we need to do is determine which segments of the crawl are valid and ready for use (the crawl is always ongoing)

```
$ s3cmd get s3://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt
$ head -n3 valid_segments.txt
1341690147253
1341690148298
1341690149519
```

given these segment ids we can look up the related textData objects.

if you just want one, grab its name using something like ...

```
$ s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690147253/ 2>/dev/null \
| grep textData | head -n1 | awk '{print $4}'
s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690147253/textData-00000
```

but if you want the lot you can get the listing with ...

```
$ cat valid_segments.txt \
| xargs -I{} s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/segment/{}/ \
| grep textData | awk '{print $4}' > all_valid_segments.tsv
```

( note: this listing is roughly 200,000 textData files and takes awhile to fetch )

each textData file is a hadoop sequence file, the key being the crawled url and the value being the extracted visible text.

to have a quick look at one you can get hadoop to dump the sequence file contents with ...

```
$ hadoop fs -text textData-00000 | less
http://webprofessionals.org/intel-to-acquire-mcafee-moving-into-online-security-ny-times/ Web Professionals
Professional association for web designers, developers, marketers, analysts and other web professionals.
Home
...
The company’s share price has fallen about 20 percent in the last five years, closing on Wednesday at $19.59 a share.
Intel, however, has been bulking up its software arsenal. Last year, it bought Wind River for $884 million, giving it a software maker with a presence in the consumer electronics and wireless markets.
With McAfee, Intel will take hold of a company that sells antivirus software to consumers and businesses and a suite of more sophisticated security products and services aimed at corporations.
```

( note: the visible text is broken into *one line* per block element from the original html. as such the value in the key/value pairs includes carriage returns and, for something like less, is output as separate lines )

now that we have some text, what can we do with it? one thing is to look for noun phrases and the quickest simplest way is to use something like the python natural language toolkit. it's certainly not the fastest to run but for most people would be the quickest to get going.

extract_noun_phrases.py is an example of doing sentence then word tokenisation, pos tagging and finally noun chunk phrase extraction.

given the text ...

```
Last year, Microsoft bought Wind River for $884 million. This makes it the largest software maker with a presence in North Kanada.
```

it extracts the noun phrases ...

```
Microsoft
Wind River
North Kanada
```

to run this at larger scale we can wrap it in a simple streaming job

```
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input textDataFiles \
-output counts \
-mapper extract_noun_phrases.py \
-reducer aggregate \
-file extract_noun_phrases.py
```
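one detail worth spelling out: for `-reducer aggregate` to do the counting, the mapper has to prefix each key with the aggregator it wants, eg `LongValueSum:`. a minimal sketch of the output side of such a mapper follows; `aggregate_line` and `emit` are hypothetical helper names, not taken from extract_noun_phrases.py.

```python
import sys

def aggregate_line(phrase):
    # "LongValueSum:<key>\t<value>" asks the aggregate reducer to sum the values per key
    return "LongValueSum:%s\t1" % phrase

def emit(phrases, out=sys.stdout):
    # one output record per extracted phrase, each counting as 1
    for phrase in phrases:
        out.write(aggregate_line(phrase) + "\n")

emit(["Microsoft", "Wind River"])
```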

running it across a small 50mb sample of textData files, the top noun phrases extracted are ...

| rank | freq | phrase |
|---|---|---|
| 1 | 10094 | Posted |
| 2 | 9597 | November |
| 3 | 9553 | February |
| 4 | 8929 | Copyright |
| 5 | 8726 | September |
| 6 | 8709 | January |
| 7 | 8434 | April |
| 8 | 8307 | August |
| 9 | 7963 | October |
| 10 | 7963 | December |

this is not terribly interesting; the main thing going on here is that these are just being extracted from the boilerplate of the pages. one tough problem when dealing with visible text on a web page is that it might be visible but that doesn't mean it's relevant to the actual content of the page. here we see 'posted' and 'copyright', we're just extracting the chrome of the page.

check out the full list of values with freq >= 20 here; there are some more interesting ones a bit later

so it's fun to look at noun phrases but i've actually brushed over some key details here

- not filtering on english text first generates a *lot* of "noise". "G úûv ÝT M", "U ŠDú T" and "Y CKdñˆô" are not terribly interesting english noun phrases.
- running this at scale you'd probably want to change from streaming and start using an in process java library like the stanford parser
- when it comes to actually doing named entity recognition it's a bit more complex. there's a wavii blog post from manish that talks a bit more about it.

( recall jaccard(set1, set2) = |intersection| / |union|. when set1 == set2 this evaluates to 1.0 and when set1 and set2 have no intersection it evaluates to 0.0 )

one thing that's always annoyed me about it though is that it loses any sense of partial similarity. as a set based measure it's all or nothing.

so consider the sets *set1 = {i1, i2, i3}* and *set2 = {i1, i2, i4}*

jaccard(set1, set2) = 2/4 = 0.5 which is fine given you have *no* prior info about the relationship between i3 and i4.

but what if you have a similarity function, s, and s(i3, i4) ~= 1.0? in this case you don't want a jaccard of 0.5, you want something closer to 1.0. by saying i3 ~= i4 you're saying the sets are almost the same.

after lots of googling i couldn't find a jaccard variant that supports this idea so i rolled my own. the idea is that we want to count the values in the complement of the intersection not as 0.0 on the jaccard numerator but as some value ranging between 0.0 and 1.0 based on the similarity of the elements. after some experiments i found that just counting each as the root mean sqr value of the pairwise sims of them all works pretty well. i'd love to know the name of this technique (or any similar better one) so i can read some more about it.

    from math import sqrt

    def fuzzy_jaccard(s1, s2, sim):
        union = s1.union(s2)
        intersection = s1.intersection(s2)
        # elements that appear in only one of the two sets
        just_s1 = s1 - intersection
        just_s2 = s2 - intersection
        intersection_complement_size = len(just_s1) + len(just_s2)
        # calculate root mean square sims between elements in just s1 and just s2
        sims = [sim(i1, i2) for i1 in just_s1 for i2 in just_s2]
        sqr_sims = [s * s for s in sims]
        # no cross pairs (eg one set contains the other) means nothing to partially match
        root_mean_sqr_sim = sqrt(float(sum(sqr_sims)) / len(sqr_sims)) if sqr_sims else 0.0
        # use this root_mean_sqr_sim to count these values in the complement as,
        # in some way, being "partially" in the intersection
        return float(len(intersection) + (root_mean_sqr_sim * intersection_complement_size)) / len(union)

looking at our example of *{i1, i2, i3}* vs *{i1, i2, i4}*...

when s(i3, i4) = 0.0 it degenerates to normal jaccard and scores 0.5

print fuzzy_jaccard(set([1,2,3]), set([1,2,4]), lambda i1, i2: 0.0) # = 0.5 (2/4) ie normal jaccard

when s(i3, i4) = 1.0 it treats the values as the same and scores 1.0

print fuzzy_jaccard(set([1,2,3]), set([1,2,4]), lambda i1, i2: 1.0) # = 1.0 (4/4) treating i3 == i4

when s(i3, i4) = 0.8 it scores in between with 0.9

print fuzzy_jaccard(set([1,2,3]), set([1,2,4]), lambda i1, i2: 0.8) # = 0.9 (3.6/4)

this is great for me because now, given an appropriate similarity function, i'm able to get a lot more discrimination between sets.

after having to google this stuff three times in the last few months i'm writing it down here so i can just cut and paste next time...

    > d = read.delim('data.tsv',header=F,as.is=T,col.names=c('dts_str','freq'))
    > # YEAR MONTH DAY HOUR
    > head(d,3)
            dts_str  freq
    1 2012_01_01_00 18393
    2 2012_01_01_01 20536
    3 2012_01_01_02 91840
    > tail(d,3)
              dts_str   freq
    732 2012_01_31_21 103107
    733 2012_01_31_22 108921
    734 2012_01_31_23  78629
    > summary(d$freq)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      10590   63620   82680   86770  105700  169900

    > d$dts = as.POSIXct(d$dts_str, format="%Y_%m_%d_%H")
    > head(d,3)
            dts_str  freq                 dts
    1 2012_01_01_00 18393 2012-01-01 00:00:00
    2 2012_01_01_01 20536 2012-01-01 01:00:00
    3 2012_01_01_02 91840 2012-01-01 02:00:00
    > ggplot(d, aes(dts, freq)) + geom_point() + scale_x_datetime(major="10 days", minor="1 day", format="%d-%b-%Y")

    > d$dow = as.factor(format(d$dts, format="%a")) # day of week
    > head(d,3)
            dts_str  freq                 dts dow
    1 2012_01_01_00 18393 2012-01-01 00:00:00 Sun
    2 2012_01_01_01 20536 2012-01-01 01:00:00 Sun
    3 2012_01_01_02 91840 2012-01-01 02:00:00 Sun
    > ggplot(d,aes(dow,freq)) +
        geom_boxplot() +
        geom_smooth(aes(group=1)) +
        scale_x_discrete(limits=c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')) + # provide explicit factor ordering
        xlab('day of week') + ylab('freq') +
        opts(title='freq by day of week')

    > by_dow = ddply(d, "dow", summarize, freq=sum(freq))
    > ggplot(by_dow,aes(dow,freq)) +
        geom_bar() +
        scale_x_discrete(limits=c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')) +
        xlab('day of week') + ylab('freq') +
        opts(title='total freq by day of week')

    > d$hr = format(d$dts, format="%H")
    > head(d,3)
            dts_str  freq                 dts dow hr
    1 2012_01_01_00 18393 2012-01-01 00:00:00 Sun 00
    2 2012_01_01_01 20536 2012-01-01 01:00:00 Sun 01
    3 2012_01_01_02 91840 2012-01-01 02:00:00 Sun 02
    > ggplot(d,aes(hr,freq)) +
        geom_boxplot() +
        geom_smooth(aes(group=1)) +
        xlab('hr of day') + ylab('freq') +
        opts(title='freq by hr of day')

    > by_hr = ddply(d, "hr", summarize, freq=sum(freq))
    > ggplot(by_hr,aes(hr,freq)) +
        geom_bar() +
        xlab('hr of day') + ylab('freq') +
        opts(title='total freq by hr of day')

    d$weekend = 'weekday'
    d[d$dow=='Sat'|d$dow=='Sun',]$weekend = 'weekend' # terrible style :(
    ggplot(d,aes(hr,freq)) +
      geom_boxplot(aes(fill=weekend)) +
      geom_smooth(aes(group=weekend)) +
      xlab('hr of day') + ylab('freq') +
      opts(title='freq by hr of day')

Common crawl is a publicly available 30TB web crawl taken between September 2009 and September 2010. As a small project I decided to extract and tokenise the visible text of the web pages in this dataset. All the code to do this is on github.

The first thing was to get the data into a hadoop cluster. It's made up of 300,000 100mb gzipped arc files stored in S3. I wrote a dead simple distributed copy to do this.

Only a few things of note about this job...

- The data in S3 is marked as requester pays which, even though it's a no-op if you're accessing the data from EC2, needs the "x-amz-request-payer" header to be set.
- Pulling from S3 to EC2 is network bound so I ran using the MultithreadedMapRunner to ensure I could get as much throughput as possible.
- The code includes lots of retry logic but also sets mapred.max.map.failures.percent=100 to allow tasks to fail without killing the entire job (e.g. there was one s3 object which had bad ACLs that couldn't be read; no amount of retries would have helped).
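The retry-then-give-up behaviour in the last bullet could be sketched in Python as follows. This is not the actual job code; `fetch_with_retries`, `flaky_fetch` and `always_fails` are hypothetical names, and the backoff policy is an assumption made for illustration.

```python
import time

def fetch_with_retries(fetch, key, max_attempts=3, base_delay=0.0):
    """Call fetch(key), retrying on IOError; return None if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            return fetch(key)
        except IOError:
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    return None  # give up on this object but let the rest of the job carry on

# hypothetical fetcher: fails twice, then succeeds
calls = {"n": 0}
def flaky_fetch(key):
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient failure")
    return "contents of " + key

# hypothetical fetcher for an object that can never be read
def always_fails(key):
    raise IOError("bad ACL")

print(fetch_with_retries(flaky_fetch, "some/object"))
print(fetch_with_retries(always_fails, "unreadable/object"))
```

Returning `None` for a permanently unreadable object plays the same role as allowing map failures: one bad object is skipped rather than taking the whole copy down.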

The next step was to filter out everything that didn't have a mime type of 'text/html'. This is pretty straightforward since the arc file format specifies the mime type in a header. I used the ArcInputFormat from Apache Nutch to actually generate the hadoop map input records.

Across the 3,000,000,000 objects in the crawl there ended up being 2,000 distinct mime types, 700 of them occurring only once, with about 90% of them being nonsense.

The top six mime types were ...

| rank | mime type | freq | overall% |
|---|---|---|---|
| 1 | text/html | 2,970,000,000 | 91% |
| 2 | text/plain | 79,000,000 | 2% |
| 3 | text/xml | 52,000,000 | 1% |
| 4 | application/pdf | 48,000,000 | 1% |
| 5 | application/x-javascript | 26,000,000 | <1% |
| 6 | text/css | 25,000,000 | <1% |

Even though there's probably interesting content in the non text/html object types it seemed that just handling text/html would get me the biggest bang for my buck.

Initially I had some problems with encoding. Though http response headers include an encoding field that is *meant* to indicate what encoding the payload is in, I found it to be wrong about 30% of the time :( I just ignored what the header said and used the CharsetDetector provided in Apache Tika. CharsetDetector inspects a chunk of bytes, uses heuristics to guess the encoding, then decodes and re-encodes as UTF-8.
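A rough Python sketch of the decode-then-re-encode step is below. It does not reproduce Tika's heuristics; it simply tries a short, assumed list of candidate encodings in order. `to_utf8` and its candidate list are made up for illustration.

```python
def to_utf8(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Decode raw bytes with the first candidate that works, re-encode as UTF-8."""
    for encoding in candidates:
        try:
            return raw.decode(encoding).encode("utf-8")
        except UnicodeDecodeError:
            continue
    # only reached if the caller passes candidates that all fail;
    # latin-1 maps every byte so it never raises
    return raw.decode("latin-1").encode("utf-8")

print(to_utf8(b"caf\xe9"))      # single byte 0xe9 is not valid utf-8, falls through to cp1252
print(to_utf8(b"caf\xc3\xa9"))  # already utf-8, passes through unchanged
```

Trying strict UTF-8 first is a cheap way to accept already-valid input; a real detector scores byte patterns statistically instead of relying on decode errors.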

Next was to extract the visible text from this raw html. After playing with a few libraries I ended up going with boilerpipe. In particular I ended up using the KeepEverythingWithMinKWordsExtractor extractor. Boilerpipe, roughly, returns a single line per block element of the html.

I then used LanguageIdentifier, again a part of Tika, to filter out everything but english text. It didn't seem to have any false positives but looking at the top 5 languages something seems amiss...

| rank | language | freq |
|---|---|---|
| 1 | English (en) | 1,600,000,000 |
| 2 | Lithuanian (lt) | 270,000,000 |
| 3 | Norwegian (no) | 150,000,000 |
| 4 | Estonian (et) | 140,000,000 |
| 5 | French (fr) | 140,000,000 |

I never got around to sampling some of the Lithuanian ones to see what was actually going on but I'm a bit suspicious. I might have actually lost a bit of content in this step...

The final step was to tokenise the text. I used the stanford parser; in particular I modified their example DocumentPreprocessor to make this simplified SentenceTokeniser.

This tokeniser was wrapped in a TokeniseSentences hadoop job that did some additional sanity checking like ignoring one/two word sentences etc.

The final output was 92,000,000,000 sentences (3TB gzipped). Next will be to finish porting my near duplicate sketching algorithm to hadoop to run it across this data.

continuing on with my series of mutual information experiments, how might we extend the technique to identify sequences longer than just two terms?

one novel way is to identify the bigrams of interest, replace them with a single token and simply repeat the entire process. (thanks ted for the idea)

so say we had the 6 term sentence `i went to new york city`

it has 5 bigrams; `('i went', 'went to', 'to new', 'new york', 'york city')`

running the mutual information algorithm over this might identify `new york` as a bigram of interest.

we can swap the two terms with a single token `(new_york)` giving us a new sentence with 5 terms; `i went to (new_york) city`

this new sentence has 4 bigrams; `('i went', 'went to', 'to (new_york)', '(new_york) city')`

another run of mutual information might now identify the pair `(new_york) city` so we replace it with the token `((new_york)_city)` and just keep repeating.
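the replacement step is simple enough to sketch in a few lines of python. `merge_bigram` is a made up name for illustration, not the code used for these experiments; it assumes the sentence is already a list of tokens.

```python
def merge_bigram(tokens, first, second):
    """Replace each adjacent (first, second) pair with one '(first_second)' token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == first and tokens[i + 1] == second:
            merged.append("(%s_%s)" % (first, second))
            i += 2  # skip both halves of the pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

sentence = "i went to new york city".split()
once = merge_bigram(sentence, "new", "york")
twice = merge_bigram(once, "(new_york)", "city")
print(once)   # ['i', 'went', 'to', '(new_york)', 'city']
print(twice)  # ['i', 'went', 'to', '((new_york)_city)']
```

because the merged token is just another string, the second pass needs no special handling of composites; the nesting of the brackets records the order the merges happened in.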

let's run this over a small sample of 300,000 sentences taken from visible text of the freebase wiki dump after it's been tokenised by the stanford parser

(to speed things up a little i calculate mutual information and replace the top 10 bigrams in the text before recalculating)
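picking the top bigrams for one pass might look something like the sketch below. it assumes plain pointwise mutual information as the score (the exact variant used in this series may differ) and a minimum count cutoff, which is also an assumption; `top_bigrams_by_pmi` is a hypothetical name.

```python
from collections import Counter
from math import log

def top_bigrams_by_pmi(sentences, k=10, min_count=2):
    """Score adjacent pairs by pointwise mutual information and return the top k."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = float(sum(unigrams.values()))
    n_bi = float(sum(bigrams.values()))
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs get unreliable, inflated scores
        p_ab = count / n_bi
        scores[(a, b)] = log(p_ab / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
    return sorted(scores, key=scores.get, reverse=True)[:k]

sentences = [s.split() for s in [
    "i went to new york",
    "she went to new york",
    "he went to the shop",
    "they went to the park",
]]
print(top_bigrams_by_pmi(sentences, k=1))  # [('new', 'york')]
```

each pass would then merge the returned pairs into single tokens and recount, which is why replacing ten at a time saves a lot of recalculation.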

example starting sentences include...

    A solid and dependable performer Taylor held the record having played in games for the Phillies at second base t...
    A surface may also exhibit both specular and diffuse reflection as is the case for example of glossy paint as us...
    A variety of names have since been given to the Wandering Jew including Matathias Buttadeus Paul Marrane and Isa...
    A.D.A.M. has control over Eggman 's computer and therefore every robot he owns he can also spread to other compu...
    Absolute magnitude magazine cover Though this image is subject to copyright its use is covered by the U.S. fair ...

after the first iteration we get the bigrams we've seen before...

    Socorro LINEAR
    expr expr
    United States
    Los Angeles
    median income

but after the second iteration we get a mix of single term bigrams and immediately start seeing some new composite bigrams; in this case the trigram `'per square mile'`

    (expr_expr) (expr_expr)
    capita income
    (t_t) t
    per (square_mile)
    Las Vegas

unfortunately there's lots of noise too. `'expr expr expr expr'` comes from a single sentence, the term 'expr' repeated 450 times, that must have been poorly parsed originally. the `'t t t'` case is something similar.

by the 16th iteration we get our first 4gram phrase `'U.S. fair use laws'`

    had been
    U.S. ((fair_use)_laws)
    Rotten Tomatoes
    science fiction
    (New_York) City

and by the 70th iteration we get our first 5gram phrase `'United Nations Security Council Resolution'`. jujitsu fans out there will be pleased to see some grappling coming in too!

alas more rubbish as well with the align styling tags leaking in.

    (((United_Nations)_Security)_Council) Resolution
    Submission (rear_(naked_choke))
    Asian (Pacific_Islander)
    (UD_(align_left)) ((align_left)_((align_center)_(Win_(align_left))))
    lieutenant colonel

it's only two passes later that we get a big continuation of this one with `'United Nations Security Council Resolution adopted unanimously'`

    (((((United_Nations)_Security)_Council)_Resolution)_adopted) unanimously
    (United_States) ((align_left)_((align_center)_(Win_(align_left))))
    Flying Corps
    Saddam Hussein
    TKO punches

i was a bit suspicious of this one but grabbing the original text we can see how it makes for an interesting construct in the text...

    United Nations Security Council Resolution adopted unanimously on August after recalling Resolution the Council ...
    United Nations Security Council Resolution adopted unanimously on March after recalling all previous resolutions...
    United Nations Security Council Resolution adopted unanimously on February after noting that the Council had bee...
    United Nations Security Council Resolution adopted unanimously on December after reaffirming all resolutions on ...
    United Nations Security Council Resolution adopted unanimously on May after a complaint by Senegal against Portu...
    United Nations Security Council Resolution adopted unanimously on June after recalling resolutions and the Counc...
    United Nations Security Council Resolution adopted unanimously on July after noting the recent entry into force ...
    United Nations Security Council Resolution adopted unanimously on May after recalling all resolutions on the sit...
    United Nations Security Council Resolution adopted unanimously on January after recalling all previous resolutio...
    United Nations Security Council Resolution adopted unanimously on June after hearing representations from Botswa...
    United Nations Security Council Resolution adopted unanimously on May after reaffirming Resolution and all subse...
    United Nations Security Council Resolution adopted unanimously on August after reaffirming previous resolutions ...
    United Nations Security Council Resolution adopted unanimously on December after reaffirming all resolutions on ...
    United Nations Security Council Resolution was adopted unanimously on October after recalling resolutions and on...
    United Nations Security Council resolution adopted unanimously on March after reaffirming resolutions and on the...
    United Nations Security Council Resolution adopted unanimously on June after recalling all previous resolutions ...
    United Nations Security Council Resolution adopted unanimously on January after reaffirming Resolution on the si...
    United Nations Security Council Resolution adopted unanimously on February after reaffirming resolutions and in ...

interesting. i wonder has this come from a template perhaps? maybe just cut n paste? one author with fixed style?

even by the end of my run, 950 iterations, (aka last night) there continue to be valid short phrases being picked up

    (County_Kansas) (United_States)
    County Clare
    Sunday night
    Rift Valley
    Charlton Heston

during the processing we've been replacing these tokens in the original text. so how does it look by this time? well, not a whole lot has changed actually.

the following 3 random examples show how little the text differs (should have left it running much longer!!)

    (He_played) for a (short_time) with (Duke_Ellington) for (which_he) is (best_remembered)
    His (debut_single) Mi God Mi King topped the Jamaican (singles_chart) and a string of hits followed including Heel And Toe Monkey And Ape (Ghost_Rider) and Crucifixion although his best-remembered song is Mini Bus which lamented the demise (of_the) jolly bus and which (was_awarded) the title Song Of The Year in (from_the) Jamaica (Broadcasting_Corporation)
    However this number is certainly an improvement (from_the) cars it averaged yearly ((during_the)_1980s)

the top three are noise alas...

| rank | num underscores | phrase |
|---|---|---|
| 1 | 127 | expr_expr_expr_..... (128 times) |
| 2 | 95 | September_Socorro_LINEAR_September_Socorro_LINEAR_... (32 times) |
| 3 | 63 | t_t_t_... (64 times) |

it's not until the 77th we get something that isn't (arguably) just a repeated pattern or noisy parsing

| rank | num underscores | phrase |
|---|---|---|
| 77 | 10 | At_the_census_there_were_people_households_and_families_residing_in |

which was identified as a phrase due to the large number of occurrences of variations of the following...

| freq | original phrase |
|---|---|
| 72 | As of the census of there were people households and families residing in the city |
| 60 | As of the census of there were people households and families residing in the town |
| 46 | As of the census of there were people households and families residing in the CDP |
| 32 | As of the census of there were people households and families residing in the village |
| 26 | As of the census of there were people households and families residing in the township |
| 15 | As of the census of there were people households and families residing in the borough |
| 3 | At the census there were people households and families residing in the city |
| 2 | At the census there were people households and families residing in the village |
| 2 | As of the census of there were people households and families residing on the base |

fascinating stuff!!

- use patterns found in this experiment to clean up noise and rerun
- work out a way to fold the composition into the scoring
- work on larger dataset
- approach to dealing with duplicates? don't want to just uniq since they represent something