one thing in theano i couldn't immediately find examples for was a simple embedding lookup table, a critical component for anything with NLP. turns out that it's just one of those things that's so simple no one bothers writing it down :/

tl;dr : you can just use numpy indexing and everything just works.

consider the following theano.tensor example of 2d embeddings for 5 items. each row represents a seperate embeddable item.

>>> E = np.asarray(np.random.randn(5, 2)) >>> t_E = theano.shared(E) >>> t_E.eval() array([[-0.72310919, -1.81050727], [ 0.2272197 , -1.23468159], [-0.59782901, -1.20510837], [-0.55842279, -1.57878187], [ 0.63385967, -0.35352725]])

to pick a subset of the embeddings it's as simple as just using indexing. for example to get the third & first embeddings it's ...

>>> idxs = np.asarray([2, 0]) >>> t_E[idxs].eval() array([[-0.59782901, -1.20510837], # third row of E [-0.72310919, -1.81050727]]) # first row of E

if we want to concatenate them into a single vector (a common operation when we're feeding up to, say, a densely connected hidden layer), it's a reshape

>>> t_E[idxs].reshape((1, -1)).eval() array([[-0.59782901, -1.20510837, -0.72310919, -1.81050727]]) # third & first row concatenated

all the required multi dimensional operations you need for batching just work too..

eg. if we wanted to run a batch of size 2 with the first batch item being the third & first embeddings and the second batch item being the fourth & fourth embeddings we'd do the following...

>>> idxs = np.asarray([[2, 0], [3, 3]]) # batch of size 2, first example in batch is pair [2, 0], second example in batch is [3, 3] >>> t_E[idxs].eval() array([[[-0.59782901, -1.20510837], # third row of E [-0.72310919, -1.81050727]], # first row of E [[-0.55842279, -1.57878187], # fourth row of E [-0.55842279, -1.57878187]]]) # fourth row of E >>> t_E[idxs].reshape((idxs.shape[0], -1)).eval() array([[-0.59782901, -1.20510837, -0.72310919, -1.81050727], # first item in batch; third & first row concatenated [-0.55842279, -1.57878187, -0.55842279, -1.57878187]]) # second item in batch; fourth row duplicated

this type of packing of the data into matrices is crucial to enable linear algebra libs and GPUs to really fire up.

consider the following as-simple-as-i-can-think-up "network" that uses embeddings;

given 6 items we want to train 2d embeddings such that the first two items have the same embeddings, the third and fourth have the same embeddings and the last two have the same embeddings. additionally we want all other combos to have different embeddings.

the *entire* theano code (sans imports) is the following..

first we initialise the embedding matrix as before

E = np.asarray(np.random.randn(6, 2), dtype='float32') t_E = theano.shared(E)

the "network" is just a dot product of two embeddings ...

t_idxs = T.ivector() t_embedding_output = t_E[t_idxs] t_dot_product = T.dot(t_embedding_output[0], t_embedding_output[1])

... where the training cost is a an L1 penality against the "label" of 1.0 for the pairs we want to have the same embeddings and 0.0 for the ones we want to have a different embeddings.

t_label = T.iscalar() gradient = T.grad(cost=abs(t_label - t_dot_product), wrt=t_E) updates = [(t_E, t_E - 0.01 * gradient)] train = theano.function(inputs=[t_idxs, t_label], outputs=[], updates=updates)

we can generate training examples by randomly picking two elements and assigning label 1.0 for the pairs 0 & 1, 2 & 3 and 4 & 6 (and 0.0 otherwise) and every once in awhile write them out to a file.

print "i n d0 d1" for i in range(0, 10000): v1, v2 = random.randint(0, 5), random.randint(0, 5) label = 1.0 if (v1/2 == v2/2) else 0.0 train([v1, v2], label) if i % 100 == 0: for n, embedding in enumerate(t_E.get_value()): print i, n, embedding[0], embedding[1]

plotting this shows the convergence of the embeddings (labels denote initial embedding location)...

easy peasy.

it's interesting to observe the effect of this (somewhat) arbitrary cost function i picked.

for the pairs where we wanted the embeddings to be same the cost function, \( |1 - a \cdot b | \), is minimised when the dotproduct is 1, this happens when the vectors
are the same and have unit length. you can see this is case for pairs 0 & 1 and 4 & 5 which have both ended up on the unit circle. but what about 2 & 3?
they've gone to the origin and the dotproduct of the origin with itself is 0, so it's *maximising* the cost, not minimising it! why?

it's because of the other constraint we added. for all the pairs we wanted the embeddings to be different the cost function, \( |0 - a \cdot b | \), is minimised when
the dotproduct is 0. this happens when the vectors are orthogonal. both 0 & 1 and 4 & 5 can be on the unit sphere and orthogonal but for them to be both orthogonal
to 2 & 3 *they* have to be at the origin. since my loss is an L1 loss (instead of, say, a L2 squared loss) the pair 2 & 3 is overall better at the origin because it
gets more from minimising this constraint than worrying about the first.

the pair 2 & 3 has come together not because we were training embeddings to be the same but because we were also training them to be different. this wouldn't be a problem if we were using 3d embeddings since they could all be both one the unit sphere and orthogonal at the same time.

]]>language modelling is a classic problem in NLP; given a sequence of words such as "my cat likes to ..." what's the next word? this problem is related to all sorts of things, everything from autocomplete to speech to text.

the classic solution to language modelling is based on just counting. if a speech to text system is sure its heard "my cat likes to" but then can't decide if the next word if "sleep" or "beep" we can just look at relative counts; if we've observed in a large corpus how cats like to sleep more than they like to beep we can say "sleep" is more likely. (note: this would be different if it was "my roomba likes to ...")

the first approach i saw to solving this problem with neural nets is from bengio et al. "a neural probabilistic language model" (2003). this paper was a huge eye opener for me and was the first case i'd seen of using a distributed, rather than purely symbolic, representation of text. definitely "word embeddings" are all the rage these days!

bengio takes the approach of using a softmax to estimate the distribution of possible words given the two previous words. ie \( P({w}_3 | {w}_1, {w}_2) \). depending on your task though it might make more sense to instead estimate the likelihood of the triple directly ie \( P({w}_1, {w}_2, {w}_3) \).

let's work through a empirical comparison of these two on a synthetic problem. we'll call the first the *softmax* approach and the second the *logisitic_regression* approach.

rather than use real text data let's work on a simpler synthetic dataset with a vocab of only 6 tokens; "A", "B", "C", "D", "E" & "F". be warned: a vocab this small is so contrived that it's hard to generalise any result from it. in particular a normal english vocab in the hundreds of thousands would be soooooooo much sparser.

we'll use random walks on the following erdos renyi graph as a generating grammar. egs "phrases" include "D C", "A F A F A A", "A A", "E D C" & "E A A"

the main benefit of such a contrived small vocab is that it's feasible to analyse all 6^{3} = 216 trigrams.
let's consider the distributions associated with a couple of specific (w1, w2) pairs.

there are only 45 trigrams that this grammar generates and the most frequent one is FAA. FAF is also possible but the other FA? cases can never occur.

F A A 0.20 # the most frequent trigram generated F A B 0.0 # never generated F A C 0.0 F A D 0.0 F A E 0.0 F A F 0.14 # the 4th most frequent trigram

if we train a simple softmax based neural probabilistic language model (nplm) we see the distribution of \( P({w}_3 | {w}_1=F, {w}_2=A ) \) converge to what we expect; FAA has a likelihood of 0.66, FAF has 0.33 and the others 0.0

this is a good illustration of the convergence we expect to see with a softmax. each observed positive example of FAA is also an implicit negative example for FAB, FAC, FAD, FAE & FAF and as such each FAA causes the likelihood of FAA to go up while pushing the others down. since we observe FAA twice as much as FAF it gets twice the likelihood and since we never see FAB, FAC, FAD or FAE they only ever get pushed down and converge to 0.0

since the implementation behind this is (overly) simple we can run a couple of times to ensure things are converging consistently. here's 6 runs, from random starting parameters, and we can see each converges to the same result..

now consider the logisitic model where instead of learning the distribution of w3 given (w1, w2) we instead model the likelihood of the triple directly \( P({w}_1, {w}_2, {w}_3) \). in this case we're modelling whether a specific example is true or not, not how it relates to others, so one big con is that there are no implicit negatives like the softmax. we need explicit negative examples and for this experiment i've generated them by random sampling the trigrams that don't occur in the observed set. ( the generation of "good" negatives is a surprisingly hard problem )

if we do 6 runs again instead of learning the distribution we have FAA and FAF converging to 1.0 and the others converge to 0.0. run4 actually has FAB tending to 1.0 too but i wouldn't be surprised at all if it dropped later; these graphs in general are what i'd expect given i'm just using a fixed global learning rate (ie nothing at all special about adapative learning rate)

now insteading considering the most frequent w1, w2 trigrams let's consider the least frequent.

C B A 0.003 C B B 0.07 # 28th most frequent (of 45 possible trigrams) C B C 0.0 C B D 0.003 C B E 0.002 C B F 0.001 # the least frequent trigram generated

as before the softmax learns the distribution; CBB is the most frequent, CBC has 0.0 probability and the others are roughly equal. these examples are far less frequent in the dataset so the model, quite rightly, allocates less of the models complexity to getting them right.

the logisitic model as before has, generally, everything converging to 1.0 except CBC which converges to 0.0

finally consider the case of C -> C -> ?. this one is interesting since C -> C never actually occurs in the grammar.

first let's consider the logistic case. CC only ever occurs in the training data as an explicit negative so we see all of them converging to 0.0 ( amusingly in run4 CCC alllllmost made it )

now consider the softmax. recall that the softmax learns by explicit positives and implicit negatives, but, since there are no cases of observed CC?, the softmax would not have seen any CC? cases.

so what is going on here? the convergence is all over the place! run2 and run6 seems to suggest CCA is the only likely case whereas run3 and run4 oscillate between CCB and CCF ???

it turns out these are artifacts of the training. there was no pressure in any way to get CC? "right" so these are just the side effects of how the embeddings for tokens, particularly C in this case, are being used for the other actual observed examples. we call these hallucinations.

another slightly different way to view this is to run the experiment 100 times and just consider the converged state (or at least the final state after a fixed number of iterations)

if we consider FA again we can see its low variance convergence of FAA to 0.66 and FAF to 0.33.

if we consider CB again we can see its higher variance convergence to the numbers we reviewe before; CBB ~= 0.4, CBC = 0.0 and the others around 0.15

considering CC though we see CCA and CCB have a bimodal distribution between 0.0 and 1.0 unlike any of the others. fascinating!

this is interesting but i'm unsure how much of it is just due to an overly simple model. this implementation just uses a simple fixed global learning rate (no per weight adaptation at all), uses very simple weight initialisation and has no regularisation at all :/

all the code can be found on github

]]>i've been reviving some old theano code recently and in case you haven't seen it theano is a pretty awesome python library that reads a lot like numpy but provides two particularly interesting features.

- symbolic differentiation; not something i'll talk about here, but super useful if you're tinkering with new models and you're using a gradient descent method for learning (and these days, who's not..)
- the ability to run transparently on a gpu; well, almost transparently, this'll be the main focus of this post...

let's work through a very simple model that's kinda like a system of linear equations. we'll compare 1) numpy (our timing baseline) vs 2) theano on a cpu vs 3) theano on a gpu. keep in mind this model is contrived and doesn't really represent anything useful, it's more to demonstrate some matrix operations.

first consider the following numpy code (speed_test_numpy.py) which does a simple y=mx+b like calculation a number of times in a tight loop. this looping isn't just for benchmarking, lots of learning algorithms operate on a tight loop.

# define data # square matrices will do for a demo np.random.seed(123) m = np.random.randn(1000, 1000).astype('float32') x = np.random.randn(1000, 1000).astype('float32') b = np.random.randn(1000, 1000).astype('float32') # run tight loop start = time.time() for i in range(500): y = np.add(np.dot(m, x), b) print "numpy", time.time()-start, "sec"

this code on a 6 core 3.8Ghz AMD runs in a bit over 2min

$ python speed_test_numpy.py numpy 135.350140095 sec

now consider the same thing in theano (speed_test_theano.py)

import theano import theano.tensor as T # define data np.random.seed(123) m = np.random.randn(1000, 1000).astype('float32') x = np.random.randn(1000, 1000).astype('float32') b = np.random.randn(1000, 1000).astype('float32') # define a symbolic expression of the equations in theano tm = T.matrix("m") tx = T.matrix("x") tb = T.matrix("b") ty = T.add(T.dot(tm, tx), tb) # and compile it line = theano.function(inputs=[tx, tm, tb], outputs=[ty]) # then run same loop as before start = time.time() for i in range(500): y, = line(m, x, b) print "theano", time.time()-start, "sec"

hopefully it's clear enough what is happening here at a high level but just briefly the tm, tx, tb and ty variables represent a symbolic representation of what we want to do and the theano.function call compiles this into actual executable code. there is lots of gentle intro material that introduces this notation on the theano site.

when run on the cpu it takes about the same time as the numpy version

$ THEANO_FLAGS=device=cpu python speed_test_theano.py numpy 136.371109009 sec

but when "magically" run on the gpu it's quite a bit faster.

$ THEANO_FLAGS=device=gpu python speed_test_theano.py Using gpu device 0: GeForce GTX 970 theano 3.16091990471 sec

awesome! a x40 speed up! so we're done right? not quite, we can do better.

let's drill into what's actually happening; we can do this in two ways, debugging the compiled graph and theano profiling.

debugging allows us to see what a function has been compiled to. for the cpu case it's just a single blas gemm (general matrix mulitplication) call. that's exactly what'd we want, so great!

Gemm{no_inplace} [@A] '' 0 |b [@B] |TensorConstant{1.0} [@C] |m [@D] |x [@E] |TensorConstant{1.0} [@C]

profiling allows to see where time is spent. 100% in this single op, no surprise.

$ THEANO_FLAGS=device=cpu,profile=True python speed_test_theano.py ... <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name> 100.0% 100.0% 136.282s 2.73e-01s 500 0 Gemm{no_inplace} ...

looking at the gpu version though things are a little different...

HostFromGpu [@A] '' 4 |GpuGemm{inplace} [@B] '' 3 |GpuFromHost [@C] '' 2 | |b [@D] |TensorConstant{1.0} [@E] |GpuFromHost [@F] '' 1 | |m [@G] |GpuFromHost [@H] '' 0 | |x [@I] |TensorConstant{1.0} [@E]

we can see a GpuGemm operation, the gpu equivalent of Gemm, but now there's a bunch of GpuFromHost & HostFromGpu operations too? what are these?

i'll tell you what they are, they are the bane of your existence! these represent transferring data to/from the gpu which is slow and, if we're not careful, can add up to a non trivial amount. if we review the profiling output we can see that, though we're faster than the non gpu version, we're spending >70% of the time just moving data.

(though remember this example is contrived, we'd expect to be doing more in our overall computation that just a single general matrix mulitply)

$ THEANO_FLAGS=device=gpu,profile=True python speed_test_theano.py ... <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name> 26.4% 26.4% 0.776s 1.55e-03s 500 3 GpuGemm{inplace} 19.5% 45.9% 0.573s 1.15e-03s 500 0 GpuFromHost(x) 19.5% 65.4% 0.572s 1.14e-03s 500 1 GpuFromHost(m) 19.3% 84.7% 0.565s 1.13e-03s 500 2 GpuFromHost(b) 15.3% 100.0% 0.449s 8.99e-04s 500 4 HostFromGpu(GpuGemm{inplace}.0) ...

ouch!

the crux of this problem is that we actually have two types of variables in this model; the parameterisation of the model (m & b) and
those related to examples (x & y). so, though it's realistic to do a speed test with a tight loop over the same function many times,
what is *not* realistic is that we are passing the model parameters to/from the gpu
each and every input example. this is a complete waste; it's much more sensible to send them over to the gpu once at the
start of the loop and retreive them once at the end. this is an important and very common pattern.

how do we fix this? it's actually pretty simple; shared variables. yay!

consider the following; speed_test_theano_shared.py

# define data np.random.seed(123) m = np.random.randn(1000, 1000).astype('float32') x = np.random.randn(1000, 1000).astype('float32') b = np.random.randn(1000, 1000).astype('float32') # define a symbolic expression of the equations in theano tm = theano.shared(m) # copy m over to gpu once explicitly tx = T.matrix("x") tb = theano.shared(b) # copy b over to gpu once explicitly ty = T.add(T.dot(tm, tx), tb) line = theano.function(inputs=[tx], outputs=[ty]) # don't pass m & b each call # then run same loop as before start = time.time() for i in range(500): y, = line(x) print tm.get_value().shape # note: we can get the value back at any time

reviewing the debug we can see this removes a stack of the GpuFromHost calls.

HostFromGpu [@A] '' 2 |GpuGemm{no_inplace} [@B] '' 1 |[@C] |TensorConstant{1.0} [@D] | [@E] |GpuFromHost [@F] '' 0 | |x [@G] |TensorConstant{1.0} [@D]

and we're down to < 2s

$ THEANO_FLAGS=device=gpu,profile=True python speed_test_theano_shared.py Using gpu device 0: GeForce GTX 970 theano 1.93515706062 sec ... <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name> 44.7% 44.7% 0.804s 1.61e-03s 500 1 GpuGemm{no_inplace} 30.2% 74.9% 0.543s 1.09e-03s 500 0 GpuFromHost(x) 25.1% 100.0% 0.451s 9.01e-04s 500 2 HostFromGpu(GpuGemm{no_inplace}.0) ...

what's even crazier is we can go further by moving the x and y matrices onto the gpu too. it turns out this isn't *too*
far fetched since if x and y were representing training examples we'd be iterating over them anyways (and if we could fit them
all onto the gpu that'd be great )

#define data np.random.seed(123) m = np.random.randn(1000, 1000).astype('float32') x = np.random.randn(1000, 1000).astype('float32') b = np.random.randn(1000, 1000).astype('float32') # define a symbolic expression of the equations in theano tm = theano.shared(m) tx = theano.shared(x) tb = theano.shared(b) ty = theano.shared(np.zeros((1000, 1000)).astype('float32')) # we need a shared var for y now mx_b = T.add(T.dot(tm, tx), tb) # and compile it train = theano.function(inputs=[], updates={ty: mx_b}) # update y on gpu # then run same loop as before start = time.time() for i in range(500): train() # now there's no input/output print tm.get_value().shape print "theano", time.time()-start, "sec"

the debug graph is like the cpu graph now, just one gemm call.

GpuGemm{no_inplace} [@A] '' 0 |[@B] |TensorConstant{1.0} [@C] | [@D] | [@E] |TensorConstant{1.0} [@C]

and runs in under a second. x150 the numpy version. nice! :)

$ THEANO_FLAGS=device=gpu,profile=True python speed_test_theano_shared2.py theano 0.896003007889 sec ... <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name> 100.0% 100.0% 0.800s 1.60e-03s C 500 1 GpuGemm{no_inplace} ...]]>

PyMC is a python library for working with bayesian statistical models, primarily using MCMC methods. as a software engineer who has only just scratched the surface of statistics this whole MCMC business is blowing my mind so i've got to share some examples.

let's start with the simplest thing possible, fitting a simple distribution.

say we have a thousand values, ` 87.27, 67.98, 119.56, ...`

and we want to build a model of them.

a common first step might be to generate a histogram.

if i had to a make a guess i'd say this data looks normally distributed. somewhat unsurprising, not just because normal distributions are freakin everywhere, (this great khan academy video on the central limit theorem explains why) but because it was me who synthetically generated this data in the first place ;)

now a normal distribution is parameterised by two values; it's *mean* (technically speaking, the "middle" of the curve) and it's *standard deviation* (even more technically speaking, how "fat" it is) so let's use PyMC to figure out what these values are for this data.

*!!warning!! !!!total overkill alert!!!* there must be a bazillion simpler ways to fit a normal to this data but this post is about
dead-simple-PyMC not dead-simple-something-else.

first a definition of our model.

# simple_normal_model.py from pymc import * data = map(float, open('data', 'r').readlines()) mean = Uniform('mean', lower=min(data), upper=max(data)) precision = Uniform('precision', lower=0.0001, upper=1.0) process = Normal('process', mu=mean, tau=precision, value=data, observed=True)

working *backwards* through this code ...

- line 6 says i am trying to model some
`process`

that i believe is`Normal`

ly distributed defined by variables`mean`

and`precision`

. (precision is just the inverse of the variance, which in turn is just the standard deviation squared). i've already`observed`

this data and the`value`

s are in the variable`data`

- line 5 says i don't know the
`precision`

for my`process`

but my prior belief is it's value is somewhere between 0.0001 and 1.0. since i don't favor any values in this range my belief is`uniform`

across the values. note: assuming a uniform distribution for the precision is overly simplifying things quite a bit, but we can get away with it in this simple example and we'll come back to it. - line 4 says i don't know the
`mean`

for my data but i think it's somewhere between the`min`

and the`max`

of the observed`data`

. again this belief is`uniform`

across the range. - line 3 says the
`data`

for my unknown`process`

comes from a local file (just-plain-python)

the second part of the code runs the MCMC sampling.

# run_mcmc.py from pymc import * import simple_normal_model model = MCMC(simple_normal_model) model.sample(iter=500) print(model.stats())

working *forwards* through this code ...

- line 4 says build a MCMC for the model from the
`simple_normal_model`

file - line 5 says run a sample for 500 iterations
- line 6 says print some stats.

**and that's it!**

the output from our stats includes among other things estimates for the `mean`

and `precision`

we were trying to find

{ 'mean': {'95% HPD interval': array([ 94.53688316, 102.53626478]) ... }, 'precision': {'95% HPD interval': array([ 0.00072487, 0.03671603]) ... }, ... }

now i've brushed over a couple of things here (eg the use of uniform prior over the precision, see here for more details) but i can get away with it all because this problem is a trivial one and i'm not doing gibbs sampling in this case. the main point i'm trying to make is that it's dead simple to start writing these models.

one thing i do want to point out is that this estimation doesn't result in just one single value for mean and precision, it results in a distribution of the possible values. this is great since it gives us an idea of how confident we can be in the values as well as allowing this whole process to be iterative, ie the output values from this model can be feed easily into another.

all the code above parameterised the normal distribution with a mean and a precision. i've always thought of normals though in terms of means and standard deviations
(precision is a more bayesian way to think of things... apparently...) so the first extension to my above example i want to make is to redefine the problem
in terms of a prior on the standard deviation instead of the precision. mainly i want to do this to introduce the `deterministic`

concept
but it's also a subtle change in how the sampling search will be directed because it introduces a non linear transform.

data = map(float, open('data', 'r').readlines()) mean = Uniform('mean', lower=min(data), upper=max(data)) std_dev = Uniform('std_dev', lower=0, upper=50) @deterministic(plot=False) def precision(std_dev=std_dev): return 1.0 / (std_dev * std_dev) process = Normal('process', mu=mean, tau=precision, value=data, observed=True)

our code is almost the same but instead of a prior on the `precision`

we use a `deterministic`

method to map from the parameter we're
trying to fit (the `precision`

) to a variable we're trying to estimate (the `std_dev`

).

we fit the model using the same `run_mcmc.py`

but this time get estimates for the `std_dev`

not the `precision`

{ 'mean': {'95% HPD interval': array([ 94.23147867, 101.76893808]), ... 'std_dev': {'95% HPD interval': array([ 19.53993697, 21.1560098 ]), ... ... }

which all matches up to how i originally generated the data in the first place.. cool!

from numpy.random import normal data = [normal(100, 20) for _i in xrange(1000)]

for this example let's now dive a bit deeper than just the stats object.
to help understand how the sampler is converging on it's results we can also dump
a trace of it's progress at the end of `run_mcmc.py`

import numpy for p in ['mean', 'std_dev']: numpy.savetxt("%s.trace" % p, model.trace(p)[:])

plotting this we can see how quickly the sampled values converged.

let's consider a slightly more complex example.

again we have some data... `107.63, 207.43, 215.84, ...`

that plotted looks like this...

hmmm. looks like *two* normals this time with the one centered on 100 having a bit more data.

how could we model this one?

data = map(float, open('data', 'r').readlines()) theta = Uniform("theta", lower=0, upper=1) bern = Bernoulli("bern", p=theta, size=len(data)) mean1 = Uniform('mean1', lower=min(data), upper=max(data)) mean2 = Uniform('mean2', lower=min(data), upper=max(data)) std_dev = Uniform('std_dev', lower=0, upper=50) @deterministic(plot=False) def mean(bern=bern, mean1=mean1, mean2=mean2): return bern * mean1 + (1 - ber) * mean2 @deterministic(plot=False) def precision(std_dev=std_dev): return 1.0 / (std_dev * std_dev) process = Normal('process', mu=mean, tau=precision, value=data, observed=True)

reviewing the code again it's mostly the same the big difference being the `deterministic`

definition of the `mean`

.
it's now that we finally start to show off the awesome power of these non analytical approaches.

line 12 defines the mean not by one `mean`

variable
but instead as a mixture of two, `mean1`

and `mean2`

. for each value we're trying to model we pick either `mean1`

or `mean2`

based on *another* random variable `bern`

.
`bern`

is described by a
bernoulli distribution
and so is either 1 or 0, proportional to the parameter `theta`

.

ie the definition of our `mean`

is that when `theta`

is high, near 1.0, we pick `mean1`

most of the time and
when `theta`

is low, near 0.0, we pick `mean2`

most of the time.

what we are solving for then is not just `mean1`

and `mean2`

but how the values are split between them (described by `theta`

)
(and note for the sake of simplicity i made the two normal differ in their means but use a shared standard deviation. depending on what you're doing this
might or might not make sense)

reviewing the traces we can see the converged `mean`

s are 100 & 200 with `std dev`

20. the mix (`theta`

) is 0.33, which all agrees
with the synthetic data i generated for this example...

from numpy.random import normal import random data = [normal(100, 20) for _i in xrange(1000)] # 2/3rds of the data data += [normal(200, 20) for _i in xrange(500)] # 1/3rd of the data random.shuffle(data)

to me the awesome power of these methods is the ability in that function to pretty much write whatever i think best describes the process. too cool for school.

i also find it interesting to see how the convergence came along... the model starts in a local minima of both normals having mean a bit below 150 (the midpoint of the two actual ones) with a mixing proportion of somewhere in the ballpark of 0.5 / 0.5. around iteration 1500 it correctly splits them apart and starts to understand the mix is more like 0.3 / 0.7. finally by about iteration 2,500 it starts working on the standard deviation which in turn really helps narrow down the true means.

(thanks cam for helping me out with the formulation of this one..)

these are pretty simple examples thrown together to help me learn but i think they're still illustrative of the power of these methods (even when i'm completely ignore anything to do with conjugacy)

in general i've been working through an awesome book, doing bayesian data analysis, and can't recommend it enough.

i also found john's blog post on using jags in r was really helpful getting me going.

all the examples listed here are on github.

next is to rewrite everything in stan and do some comparision between pymc, stan and jags. fun times!

]]>say you have three items; item1, item2 and item3 and you've somehow associated a count for each against one of five labels; A, B, C, D, E

> data A B C D E item1 23700 20 1060 11 4 item2 1581 889 20140 1967 200 item3 1 0 1 76 0

depending on what you're doing it'd be reasonable to normalise these values and an l1-normalisation (ie rescale so they are the same proportion but add up to 1) gives us the following...

> l1_norm = function(x) x / sum(x) > l1 = t(apply(data, 1, l1_norm)) > l1 A B C D E item1 0.955838 0.00080661 0.042751 0.00044364 0.00016132 item2 0.063809 0.03588005 0.812851 0.07938814 0.00807200 item3 0.012821 0.00000000 0.012821 0.97435897 0.00000000

great... but you know it's fair enough if you think things don't feel right...

according to these normalised values item3 is "more of" a D (0.97) than item1 is an A (0.95) even though we've only collected 1/300th of the data for it. this just isn't right.

purely based on these numbers i'd think it's more sensible to expect item3 to be A or a C (since that's what we've seen with item1 and item2) but we just haven't seen enough data for it yet. what makes sense then is to smooth the value of item3 out and make it more like some sort of population average.

so firstly what makes a sensible population average? ie if we didn't know anything at all about a new item what would we want the proportions of labels to be? alternatively we can ask what do we think item3 is likely to look like later on as we gather more data for it? i think an l1-norm of the sums of all the values makes sense ...

> column_totals = apply(data, 2, sum) > population_average = l1_norm(column_totals) > population_average A B C D E 0.5094218 0.0183000 0.4268199 0.0413513 0.0041069

... and it seems fair. without any other info it's reasonable to "guess" a new item is likely to be somewhere between an A (0.50) and a C (0.42)

so now we have our item3, and our population average, and we want to mix them together in some way... how might we do this?

A B C D E item3 0.012821 0.000000 0.012821 0.974358 0.000000 pop_aver 0.509421 0.018300 0.426819 0.041351 0.004106

a linear weighted sum is nice and easy; ie a classic `item3 * alpha + pop_aver * (1-alpha)`

but then how do we pick alpha?

if we were to do this reweighting for item1 or item2 we'd want alpha to be large, ie nearer 1.0, to reflect the confidence we have in their current values since we have lots of data for them. for item3 we'd want alpha to be small, ie nearer 0, to reflect the lack of confidence we have in it.

enter the confidence interval, a way of testing how confident we are in a set of values.

firstly, a slight diversion re: confidence intervals...

consider three values, 100, 100 and 200. running this goodness of fit test gives the following result.

> library(NCStats) > gofCI(chisq.test(c(100, 100, 200)), conf.level=0.95) p.obs p.LCI p.UCI [1,] 0.25 0.21008 0.29468 [2,] 0.25 0.21008 0.29468 [3,] 0.50 0.45123 0.54877

you can read the first row of this table as "the count 100 was observed to be 0.25 (p.obs) of the total and i'm 95%
confident (conf.level) that the *true* value is between 0.21 (p.LCI = lower confidence interval) and 0.29 (p.UCI = upper confidence interval).

there are two important things to notice that can change the range of confidence interval...

1) upping the confidence level results in a wider confidence interval. ie "i'm 99.99% confident the value is true value is between 0.17 and 0.34, but only 1% confident it's between 0.249 and 0.2502"

> gofCI(chisq.test(c(100, 100, 200)), conf.level=0.9999) p.obs p.LCI p.UCI [1,] 0.25 0.17593 0.34230 [2,] 0.25 0.17593 0.34230 [3,] 0.50 0.40452 0.59548 > gofCI(chisq.test(c(100, 100, 200)), conf.level=0.01) p.obs p.LCI p.UCI [1,] 0.25 0.24973 0.25027 [2,] 0.25 0.24973 0.25027 [3,] 0.50 0.49969 0.50031

2) getting more data results in a narrower confidence interval. ie "even though the proportions stay the same as i gather x10, then x100, my original data i can narrow my confidence interval around the observed value"

> gofCI(chisq.test(c(10, 10, 20)), conf.level=0.95) p.obs p.LCI p.UCI [1,] 0.25 0.14187 0.40194 [2,] 0.25 0.14187 0.40194 [3,] 0.50 0.35200 0.64800 > gofCI(chisq.test(c(100, 100, 200)), conf.level=0.95) p.obs p.LCI p.UCI [1,] 0.25 0.21008 0.29468 [2,] 0.25 0.21008 0.29468 [3,] 0.50 0.45123 0.54877 > gofCI(chisq.test(c(1000, 1000, 2000)), conf.level=0.95) p.obs p.LCI p.UCI p.exp [1,] 0.25 0.23683 0.26365 [2,] 0.25 0.23683 0.26365 [3,] 0.50 0.48451 0.51549

so it turns out this confidence interval is exactly what we're after; a way of estimating a pessimistic value (the lower bound) that gets closer to the observed value as the size of the observed data grows.

note: there's a lot of discussion on how best to do these calculations. there is a more "correct" and principled version of this calculation that is provided by MultinomialCI but i found it's results weren't as good for my purposes.

awesome, so back to the problem at hand; how do we pick our mixing parameter alpha?

let's extract the lower bound of the confidence interval value for our items using a very large confidence (99.99%) (to enforce a wide interval). the original l1-normalised values are shown here again for comparison.

> l1 A B C D E item1 0.95583 0.00080 0.04275 0.00044 0.00016 item2 0.06380 0.03588 0.81285 0.07938 0.00807 item3 0.01282 0.00000 0.01282 0.97435 0.00000 > library(NCStats) > gof_ci_lower = function(x) gofCI(chisq.test(x), conf.level=0.9999)[,2] > gof_chi_ci = t(apply(data, 1, gof_ci_lower)) > gof_chi_ci A B C D E item1 0.95048 0.00035 0.03803 0.00015 0.00003 item2 0.05803 0.03156 0.80302 0.07296 0.00614 item3 0.00000 0.00000 0.00000 0.79725 0.00000

we see that item1, which had a lot of support data, has dropped it's A value only slightly from 0.955 to 0.950 whereas item3 which had very little support, has had it's D value drop drastically from 0.97 to 0.79. by using a conf.level closer and closer 1.0 we see make this drop more and more drastic.

because each of the values in the `gof_chi_ci matrix`

are lower bounds the rows no longer add up to 1.0 (as they do in the l1-value
matrix). we can calculate how much we've "lost" with `1 - sum(rows)`

and it turns out this residual is pretty much
exactly what we were after when we were for our mixing parameter alpha!

> gof_chi_ci$residual = as.vector(1 - apply(gof_chi_ci, 1, sum)) > gof_chi_ci A B C D E residual item1 0.95048 0.00035 0.03803 0.00015 0.00003 0.01096 item2 0.05803 0.03156 0.80302 0.07296 0.00614 0.02829 item3 0.00000 0.00000 0.00000 0.79725 0.00000 0.20275

in the case of item1 the residual is low, ie the confidence interval lower bound was close to the observed value so we shouldn't mix in much of the population average. but in the case of item3 the residual is high, we lost a lost by the confidence interval being very wide, so we might as well mix in more of the population average.

now what i've said here is completely unprincipled. i just made it up and the maths work because everything is normalised. but having said that the results are really good so i'm going with it :)

putting it all together then we have the following bits of data...

> l1 # our original estimates A B C D E item1 0.95583 0.00080 0.04275 0.00044 0.00016 item2 0.06380 0.03588 0.81285 0.07938 0.00807 item3 0.01282 0.00000 0.01282 0.97435 0.00000 > population_average # the population average A B C D E item1 0.50942 0.01830 0.42681 0.04135 0.00410 > gof_chi_ci # lower bound of our confidences A B C D E item1 0.95048 0.00035 0.03803 0.00015 0.00003 item2 0.05803 0.03156 0.80302 0.07296 0.00614 item3 0.00000 0.00000 0.00000 0.79725 0.00000 > gof_chi_ci_residual = as.vector(1 - apply(gof_chi_ci, 1, sum)) > gof_chi_ci_residual # how much we should mix in the population average [1] 0.01096 0.02829 0.20275 0.40759

since there's lots of support for item1 the residual is small, only 0.01, so we smooth only a little of the population average in and end up not changing the values that much

> l1[1,] A B C D E item1 0.95583 0.00080 0.04275 0.00044 0.00016 > gof_chi_ci[1,] + population_average * gof_chi_ci_residual[1] A B C D E item1 0.95606 0.00055 0.04270 0.00060 0.00007

but item3 has a higher residual and so we smooth more of the population average in and it's shifted more much strongly from D towards A and B

> l1[3,] A B C D E item3 0.01282 0.00000 0.01282 0.97435 0.00000 > gof_chi_ci[3,] + population_average * gof_chi_ci_residual[3] A B C D E item3 0.10329 0.00371 0.08653 0.80563 0.00083]]>

one model is based the idea of an interest graph where the nodes of the graph are users and items and the edges of the graph represent an interest, whatever that might mean for the domain.

if we only allow edges between users and items the graph is bipartite.

let's consider a simple example of 3 users and 3 items; user1 likes item1, user2 likes all three items and user3 likes just item3.

fig1 user / item interest graph |

one way to model similiarity between items is as follows....

let's consider a token starting at item1. we're going to repeatedly "bounce" this token back and forth between the items and the users based on the interest edges.

so, since item1 is connected to user1 and user2 we'll pick one of them randomly and move the token across. it's 50/50 which of user1 or user2 we end up at (fig2).

next we bounce the token back to the items; if the token had gone to user1 then it has to go back to item1 since user1 has no other edges, but if it had gone to user2 it could back to any of the three items with equal probability; 1/3rd.

the result of this is that the token has 0.66 chance of ending up back at item1 and equal 0.16 chance of ending up at either item2 or item3 (fig3)

fig2 dispersing from item1 to users | fig3 dispersing back from users to items |

( note this is different than if we'd started at item2. in that case we'd have gone to user2 with certainity and then it would have been uniformly random which of the items we'd ended up at )

for illustration let's do another iteration...

bouncing back to the users item1's 0.66 gets split 50/50 between user1 and user2. all of item2's 0.16 goes to user2 and item3 splits it's 0.16 between user2 and user3. we end up with fig4 (no, not that figure 4). bouncing back to the items we get to fig5.

fig4 | fig5 |

if we keep repeating things we converge on the values

{item1: 0.40, item2: 0.20, item3: 0.40}and these represent the probabilities of ending up in a particular item if we bounced forever.

note since this is convergent it also doesn't actually matter which item we'd started at, it would always get the same result in the limit.

to people familiar with power methods this convergence is no surprise. you might also recognise a similiarity between this and the most famous power method of them all, pagerank.

so what has this all got to do with item similiarity?

well, the values of the probabilities might all converge to the same set regardless of which item we start at
**but** each item gets there in different ways.

most importantly we can capture this difference by taking away a bit of probability each iteration of the dispersion.

so, again, say we start at item1. after we go to users and back to items we are at fig3 again.

but this time, before we got back to the users side, let's take away a small proportion of the probability mass, say, 1/4. this would be 0.16 for item1 and 0.04 for item2 and item3. this leaves us with fig6.

fig3 (again) | fig6 |

we can then repeat iteratively as before, items -> users -> items -> users. but each time we are on the items side we take away 1/4 of the mass until it's all gone.

iteration | taken from item1 | taken from item2 | taken from item3 |

1 | 0.16 | 0.04 | 0.04 |

2 | 0.09 | 0.04 | 0.05 |

3 | 0.06 | 0.02 | 0.05 |

... | ... | ... | ... |

final sum | 0.50 | 0.20 | 0.30 |

if we do the same for item2 and item3 we get different values...

starting at | total taken from item1 | total taken from item2 | total taken from item3 |

item1 | 0.50 | 0.20 | 0.30 |

item2 | 0.38 | 0.24 | 0.38 |

item3 | 0.30 | 0.20 | 0.50 |

finally these totals can be used as features for a pairwise comparison of the items. intuitively we can see that for any row wise similarity function we might choose to use sim(item1, item3) > sim(item1, item2) or sim(item2, item3)

one last thing to consider is that the amount of decay, 1/4 in the above example, is of course configurable and we get different results using a value between 0.0 and 1.0.

a very low value, ~= 0.0, produces the limit value, all items being classed the same. a higher value, ~= 1.0, stops the iterations after only one "bounce" and represents the minimal amount of dispersal.

]]>the first thing we need to do is determine which segments of the crawl are valid and ready for use (the crawl is always ongoing)

```
$ s3cmd get s3://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt
$ head -n3 valid_segments.txt
1341690147253
1341690148298
1341690149519
```

given these segment ids we can lookup the related textData objects.

if you just want one grab it's name using something like ...

```
$ s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690147253/ 2>/dev/null \
| grep textData | head -n1 | awk '{print $4}'
s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690147253/textData-00000
```

but if you want the lot you can get the listing with ...

```
$ cat valid_segments.txt \
| xargs -I{} s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/segment/{}/ \
| grep textData | awk '{print $4}' > all_valid_segments.tsv
```

( note: this listing is roughly 200,000 textData files and takes awhile to fetch )

each textData file is a hadoop sequence files, the key being the crawled url and the value being the extracted visible text.

to have a quick look at one you can get hadoop to dump the sequence file contents with ...

```
$ hadoop fs -text textData-00000 | less
http://webprofessionals.org/intel-to-acquire-mcafee-moving-into-online-security-ny-times/ Web Professionals
Professional association for web designers, developers, marketers, analysts and other web professionals.
Home
...
The company’s share price has fallen about 20 percent in the last five years, closing on Wednesday at $19.59 a share.
Intel, however, has been bulking up its software arsenal. Last year, it bought Wind River for $884 million, giving it a software maker with a presence in the consumer electronics and wireless markets.
With McAfee, Intel will take hold of a company that sells antivirus software to consumers and businesses and a suite of more sophisticated security products and services aimed at corporations.
```

( note: the visible text is broken into *one line* per block element from the original html. as such the value in the key/value pairs includes carriage returns and, for something like less, gets
outputted as being seperate lines )

now that we have some text, what can we do with it? one thing is to look for noun phrases and the quickest simplest way is to use something like the python natural language toolkit. it's certainly not the fastest to run but for most people would be the quickest to get going.

extract_noun_phrases.py is an example of doing sentence then word tokenisation, pos tagging and finally noun chunk phrase extraction.

given the text ...

```
Last year, Microsoft bought Wind River for $884 million. This makes it the largest software maker with a presence in North Kanada.
```

it extract noun phrases ...

```
Microsoft
Wind River
North Kanada
```

to run this at larger scale we can wrap it in a simple streaming job

```
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input textDataFiles \
-output counts \
-mapper extract_noun_phrases.py \
-reducer aggregate \
-file extract_noun_phrases.py
```

run it across a small 50mb sample of textData files the top noun phrases extracted ...

rank | phrase | freq |

1 | 10094 | Posted |

2 | 9597 | November |

3 | 9553 | February |

4 | 8929 | Copyright |

5 | 8726 | September |

6 | 8709 | January |

7 | 8434 | April |

8 | 8307 | August |

9 | 7963 | October |

10 | 7963 | December |

this is not terribly interesting and the main thing that's going on here is that this is just being extracted from the boiler plate of the pages. one tough problem when dealing with visible text on a web page is that it might be visible but that doesn't mean it's interesting to the actual content of the page. here we see 'posted' and 'copyright', we're just extracting the chrome of the page.

check out the full list of values with freq >= 20 here there are some more interesting ones a bit later

so it's fun to look at noun phrases but i've actually brushed over some key details here

- not filtering on english text first generates a
*lot*of "noise". "G úûv ÝT M", "U ŠDú T" and "Y CKdñˆô" are not terribly interesting english noun phrases. - running this at scale you'd probably want to change from streaming and start using an in process java library like the stanford parser
- when it comes to actually doing named entity recognition it's a bit more complex. there's a wavii blog post from manish that talks a bit more about it.

( recall jaccard(set1, set2) = |intersection| / |union|. when set1 == set2 this evaluates to 1.0 and when set1 and set2 have no intersection it evaluates to 0.0 )

one thing that's always annoyed me about it though is that is loses any sense of partial similarity. as a set based measure it's all or nothing.

so consider the sets *set1 = {i1, i2, i3}* and *set2 = {i1, i2, i4}*

jaccard(set1, set2) = 2/4 = 0.5 which is fine given you have *no* prior info about the relationship between i3 and i4.

but what if you have a similarity function, s, and s(i3, i4) ~= 1.0? in this case you don't want a jaccard of 0.5, you want something closer to 1.0. by saying i3 ~= i4 you're saying the sets are almost the same.

after lots of googling i couldn't find a jaccard variant that supports this idea so i rolled my own. the idea is that we want to count the values in the complement of the intersection not as 0.0 on the jaccard numerator but as some value ranging between 0.0 and 1.0 based on the similarity of the elements. after some experiments i found that just counting each as the root mean sqr value of the pairwise sims of them all works pretty well. i'd love to know the name of this technique (or any similar better one) so i can read some more about it.

def fuzzy_jaccard(s1, s2, sim): union = s1.union(s2) intersection = s1.intersection(s2) # calculate root mean square sims between elements in just s1 and just s2 just_s1 = s1 - intersection just_s2 = s2 - intersection sims = [sim(i1, i2) for i1 in just_s1 for i2 in just_s2] sqr_sims = [s * s for s in sims] root_mean_sqr_sim = sqrt(float(sum(sqr_sims)) / len(sqr_sims)) # use this root_mean_sqr_sim to count these values in the complement as, in some way, being "partially" in the intersection return float(len(intersection) + (root_mean_sqr_sim * intersection_complement_size)) / len(union)

looking at our example of *{i1, i2, i3}* vs *{i1, i2, i4}*...

when s(i3, i4) = 0.0 it degenerates to normal jaccard and scores 0.5

print fuzzy_jaccard(set([1,2,3]), set([1,2,4]), lambda i1, i2: 0.0) # = 0.5 (2/4) ie normal jaccard

when s(i3, i4) = 1.0 it treats the values as the same and scores 1.0

print fuzzy_jaccard(set([1,2,3]), set([1,2,4]), lambda i1, i2: 1.0) # = 1.0 (4/4) treating i3 == i4

when s(i3, i4) = 0.9 it scores inbetween with 0.8

print fuzzy_jaccard(set([1,2,3]), set([1,2,4]), lambda i1, i2: 0.8) # = 0.9 (3.6/4)

this is great for me because now given an appropriate similiarity function i'm able to get a lot more discrimination between sets.

]]>after having to google this stuff three times in the last few months i'm writing it down here so i can just cut and paste next time...

> d = read.delim('data.tsv',header=F,as.is=T,col.names=c('dts_str','freq')) > # YEAR MONTH DAY HOUR > head(d,3) dts_str freq 1 2012_01_01_00 18393 2 2012_01_01_01 20536 3 2012_01_01_02 91840 > tail(d,3) dts_str freq 732 2012_01_31_21 103107 733 2012_01_31_22 108921 734 2012_01_31_23 78629 > summary(d$freq) Min. 1st Qu. Median Mean 3rd Qu. Max. 10590 63620 82680 86770 105700 169900

> d$dts = as.POSIXct(d$dts_str, format="%Y_%m_%d_%H") > head(d,3) dts_str freq dts 1 2012_01_01_00 18393 2012-01-01 00:00:00 2 2012_01_01_01 20536 2012-01-01 01:00:00 3 2012_01_01_02 91840 2012-01-01 02:00:00 > ggplot(d, aes(dts, freq)) + geom_point() + scale_x_datetime(major="10 days", minor="1 day", format="%d-%b-%Y")

> d$dow = as.factor(format(d$dts, format="%a")) # day of week > head(d,3) dts_str freq dts dow 1 2012_01_01_00 18393 2012-01-01 00:00:00 Sun 2 2012_01_01_01 20536 2012-01-01 01:00:00 Sun 3 2012_01_01_02 91840 2012-01-01 02:00:00 Sun > ggplot(d,aes(dow,freq)) + geom_boxplot() + geom_smooth(aes(group=1)) + scale_x_discrete(limits=c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')) # provide explicit factor ordering + xlab('day of week') + ylab('freq') + opts(title='freq by day of week')

> by_dow = ddply(d, "dow", summarize, freq=sum(freq)) > ggplot(by_dow,aes(dow,freq)) + geom_bar() + scale_x_discrete(limits=c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')) + xlab('day of week') + ylab('freq') + opts(title='total freq by day of week')

> d$hr = format(d$dts, format="%H") > head(d,3) dts_str freq dts dow hr 1 2012_01_01_00 18393 2012-01-01 00:00:00 Sun 00 2 2012_01_01_01 20536 2012-01-01 01:00:00 Sun 01 3 2012_01_01_02 91840 2012-01-01 02:00:00 Sun 02 > ggplot(d,aes(hr,freq)) + geom_boxplot() + geom_smooth(aes(group=1)) + xlab('hr of day') + ylab('freq') + opts(title='freq by hr of day')

> by_hr = ddply(d, "hr", summarize, freq=sum(freq)) > ggplot(by_hr,aes(hr,freq)) + geom_bar() + xlab('hr of day') + ylab('freq') + opts(title='total freq by hr of day')

d$weekend = 'weekday' d[d$dow=='Sat'|d$dow=='Sun',]$weekend = 'weekend' # terrible style :( ggplot(d,aes(hr,freq)) + geom_boxplot(aes(fill=weekend)) + geom_smooth(aes(group=weekend)) + xlab('hr of day') + ylab('freq') + opts(title='freq by hr of day')]]>