it's a simple enough sounding problem; given a cart in 2d, with a pole balancing on it, move the cart left and right to keep the pole balanced (i.e. within a small angle of vertical).

the entire problem can be described by 4 input variables: the cart position, the cart velocity, the pole angle and the pole angular velocity. even so it's surprisingly difficult to solve. there are loads of implementations of it and if you want to tinker i'd *highly* recommend starting with openai's gym version

as soon as you've solved it though you might want to play with a more complex physics environment and for that you need a serious physics simulator, e.g. the industry standard (and fully open sourced) bullet physics engine.

the simplest cartpole i could make includes 1) the ground 2) a cart (red) and 3) a pole (green). the blue line here denotes the z-axis. ( all code on github )

a couple of key differences to the standard cartpole are

- the cart and pole move in 3d, not 1d.
- the pole is *not* connected to the cart (and since it's relatively light it makes for some crazy dynamics...)
- each run is started with a push of the cart in a random direction.
- the simulation state is 28d, not 4d like the vanilla cartpole
  - 7d pose of the pole (3d position + 4d quaternion orientation)
  - 7d pose of the cart (required since the cart isn't connected to the pole)
  - 7d pole pose at the last step (can be compared to the current pose as a form of "velocity")
  - 7d cart pose at the last step

- for the discrete control version there are 5 actions (don't push, push up, down, left, right) as opposed to 2 (left, right)
  - ( i've included a "don't move" action since, once the pole is balanced, the best thing to do is stop )

- for the continuous control version the action is a 2d tuple; the forces to apply in the x & y directions.
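to make the 28d state concrete, here's a minimal sketch of how it could be assembled from the four 7d poses (the function names and field ordering are my assumptions, not the actual repo code):

```python
# hypothetical helpers; names and ordering are assumptions, not the repo's code
def pose_to_7d(position, orientation):
    # position is an (x, y, z) tuple, orientation a quaternion (x, y, z, w)
    return list(position) + list(orientation)

def build_state(pole_pose, cart_pose, last_pole_pose, last_cart_pose):
    # concatenate the four 7d poses into a single 28d state vector
    state = []
    for pose in (pole_pose, cart_pose, last_pole_pose, last_cart_pose):
        state.extend(pose_to_7d(*pose))
    return state

identity = ((0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 1.0))
assert len(build_state(identity, identity, identity, identity)) == 28
```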

running this cartpole simulation with random actions gives pretty terrible performance with either random discrete or continuous control. we're lucky to get 5 or 10 steps (of a maximum 200 for our episode). in the video, each time the cart skips back to the center represents the pole falling out of bounds and the sim resetting.

- 5 actions; go left, right, up, down, do nothing
- +1 reward for each step pole is up.

after training a vanilla dqn using keras-rl we get a reasonable controller

the training curve is what we'd expect; a terrible number of steps at the start, then gradually a reasonable number of full runs (steps=200). it still never gets perfect runs 100% of the time though. there's also an interesting blip around episode 1,000 where it looked like it was doing OK, then diverged, and recovered by about episode 7,500.

training with a baseline likelihood ratio policy gradient algorithm works well too... after 12 hrs it's getting 70% success rate keeping the pole balanced.

seems very sensitive to the initialisation though; here are three runs i started at the same time. it's interesting how much quicker the green one got good results....

what's *really* interesting is looking at what the blue (and actually red) models are doing....

whereas the green one (in the video above) is constantly making adjustments, these other two models (red and blue) are much happier trying to stay still, i.e. long sequences of the 0 action. if they manage to get the pole balanced, or even nearly so, they just stop. this prior towards 0 means though that if it's *not* balanced and they wait too long they haven't got time to recover. that's really cool! digging a bit further we can see that early in training there were cases where they managed to balance the pole very quickly and then just stopped for the rest of the episode. these runs were very successful compared to others in the same batch, and hence this prior formed; do nothing and get a great (relative) reward! it's taking a loooong time for them to recover from these early successes but, eventually, they'll have an arguably better model at convergence.

here are the proportions of actions per run for the cases where the episode resulted in a reward of 200 (i.e. it stayed balanced). notice how the red and blue ones don't do well for particular initial starts; these correspond to cases where the "no action" behaviour overwhelms particular starting pushes.

run | stop | left | right | up | down |
---|---|---|---|---|---|
green | 0.32 | 0.16 | 0.15 | 0.22 | 0.13 |
red | 0.65 | 0.12 | 0.00 | 0.11 | 0.10 |
blue | 0.80 | 0.11 | 0.00 | 0.00 | 0.08 |

- 2d action; force to apply on cart in x & y directions
- +1 base reward for each step the pole is up. up to an additional +4 as the applied force tends to 0.
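the shaped reward could look something like the following sketch; the force cap and the linear falloff are my guesses at the shaping described above, not the actual code.

```python
import math

MAX_FORCE = 50.0  # assumed cap on force magnitude; not from the original code

def shaped_reward(pole_up, fx, fy):
    """+1 base reward while the pole is up, plus up to +4 extra as the
    applied force tends to zero (a guess at the shaping described above)."""
    if not pole_up:
        return 0.0
    effort = min(math.hypot(fx, fy) / MAX_FORCE, 1.0)
    return 1.0 + 4.0 * (1.0 - effort)

assert shaped_reward(True, 0.0, 0.0) == 5.0   # balanced and still: max reward
assert shaped_reward(True, 50.0, 0.0) == 1.0  # balanced but pushing hard
assert shaped_reward(False, 0.0, 0.0) == 0.0  # pole down
```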

Continuous control with deep reinforcement learning introduced a continuous control version of deep q networks using an actor/critic model. my implementation for this problem is ddpg_cartpole.py and it learns a reasonable policy but is super tricky to train :/ my fav run is at about 40s where the cart plays the classic table-cloth trick :)

anyways maybe balancing on a cart is too easy; what about on a kuka arm :)

how can we train a simulated bot to drive around the following track using reinforcement learning and neural networks?

let's build something using

- the simple two dimensional robot simulator (STDR) as a general framework for simulating and controlling the bot ( built on top of the robot operating system (ROS) )
- tensorflow for the RL and NN side of things.

all the code can be found on github

our bot has 3 sonars; one that points forward, one left and another right. like most things in ROS these sonars are made available via a simple pub/sub model. each sonar publishes to a topic and our code subscribes to these topics building a 3 tuple as shown below. the elements of the tuple are (the range forward, range left, range right).

the bot is controlled by three simple commands; step forward, rotate clockwise or rotate anticlockwise. for now these are distinct, i.e. while rotating the bot isn't moving forward. again this is trivial in ROS; we just publish a message representing the movement to a topic that the bot subscribes to.

we'll use simple forward movement as the signal of whether the bot is doing well or not;

- choosing to move forward and hitting a wall scores the bot -1
- choosing to move forward and not hitting a wall scores the bot +1
- turning left or right scores 0

the reinforcement learning task is to learn a decision making process where given some *state* of the world an *agent*
chooses some *action* with the goal of maximising some *reward*.

for our drivebot

- the *state* is based on the current range values of the sonar; this is all the bot knows
- the *agent* is the bot itself
- the *actions* are 'go forward', 'turn left' or 'turn right'
- the *reward* is the score based on progress

we'll call a tuple of \( (current\ state, action, reward, new\ state) \) an *event* and a sequence of events an *episode*.

each episode will be run as 1) place the bot in a random location and 2) let it run until either a) it's not received a positive reward in 30 events (i.e. it's probably stuck) or b) we've recorded 1,000 events.
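as a sketch, the episode loop described above might look like the following; the `env` interface here is a stand-in for illustration, not the actual simulation code.

```python
def run_episode(env, policy, max_events=1000, stuck_after=30):
    # env.reset() places the bot at a random location; env.step(action)
    # returns (new_state, reward). this interface is assumed for illustration.
    state = env.reset()
    events = []
    since_positive = 0
    while len(events) < max_events:
        action = policy(state)
        new_state, reward = env.step(action)
        events.append((state, action, reward, new_state))
        since_positive = 0 if reward > 0 else since_positive + 1
        if since_positive >= stuck_after:  # no positive reward in 30 events; probably stuck
            break
        state = new_state
    return events
```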

before we do anything too fancy we need to run a baseline..

    if the largest sonar reading is from the forward one:
      go forward
    elif the largest sonar reading is from the left one:
      turn left
    else:
      turn right
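in python the baseline is a tiny function (the argument names and the tie-breaking towards forward are my choices):

```python
def baseline_action(forward, left, right):
    # forward, left, right are the three sonar ranges; head for the most
    # open space. ties break towards going forward (an arbitrary choice).
    if forward >= left and forward >= right:
        return 'forward'
    elif left > right:
        return 'turn_left'
    return 'turn_right'

assert baseline_action(3.0, 1.0, 2.0) == 'forward'
```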

an example episode looks like this...

if we run for 200 episodes we can

- plot the total reward for each episode and
- build a frequency table of the (action, reward) pairs we observed during all episodes.

    freq  [action, reward]
    47327 [F, 1]   # moved forward and made progress; +1
    10343 [R, 0]   # turned right
    8866  [L, 0]   # turned left
    200   [F, 0]   # noise (see below)
    93    [L, 1]   # noise (see below)
    79    [R, 1]   # noise (see below)
    36    [F, -1]  # moved forward and hit wall; -1

this baseline, like all good baselines, is pretty good! we can see that ~750 is about the limit of what we expect to be able to get as a total reward over 1,000 events (recall turning gives no reward).

there are some entries that i've marked as 'noise' in the frequency table, eg [R, 1], which are cases that shouldn't be possible. these come about from how the simulation is run; stdr simulates the environment asynchronously to the bot, so it's possible to take an action (like turning) and at the next 'tick' see relative movement from the action *before* (e.g. going straight). it's not a big deal, just the kind of thing you have to handle with async messaging (more noise).

also note that the baseline isn't perfect and it doesn't get a high score most of the time. there are two main causes of getting stuck.

- it's possible to get locked into a left, right, left, right oscillation loop when coming out of a corner.
- if the bot tries to take a corner too tightly the front sonar can have a high reading but the bot can collide with the corner. (note the "blind spot" between the front and left/right sonar cones)

our first non baseline will be based on discrete q learning.
in q learning we learn a Q(uality) function which, given a state and action, returns the *expected total reward* until
the end of the episode (if that action was taken).

\( Q(state, action) = expected\ reward\ until\ end\ of\ episode \)

in the case that both the set of states and the set of actions are discrete we can represent the Q function as a table.

even though our sonar readings are a continuous state (three float values) we've already been talking about a discretised version of them; the mapping to one of furthest_sonar_forward, furthest_sonar_left, furthest_sonar_right.
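that discretisation is simple to write down; a sketch in python (tie-breaking between equal readings is arbitrary here):

```python
def discretise(forward, left, right):
    # map the three continuous sonar readings to one of three discrete states
    readings = [(forward, 'furthest_sonar_forward'),
                (left, 'furthest_sonar_left'),
                (right, 'furthest_sonar_right')]
    return max(readings)[1]  # the label of the largest reading
```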

a q table that corresponds to the baseline policy then might look something like ...

state | go_forward | turn_left | turn_right |
---|---|---|---|
furthest_sonar_forward | 100 | 99 | 99 |
furthest_sonar_left | 95 | 99 | 90 |
furthest_sonar_right | 95 | 90 | 99 |

once we have such a table we can make optimal decisions by running \( Q(state, action) \) for all actions and choosing the highest Q value.

but how would we populate such a table in the first place?

we'll use the idea of *value iteration*. recall that the Q function returns the expected *total* reward until
the end of the episode and as such it can be defined iteratively.

given an event \( (state_1, action, reward, state_2) \) we can define \( Q(state_1, action) \) as \( reward \) plus the maximum reward it's possible to get from \( state_2 \)

\( Q(s_1, a) = r + \max_{a'} Q(s_2, a') \)

if a process is stochastic we can introduce a discount on future rewards, gamma, that reflects that immediate rewards are weighted more than potential future rewards. (note: our simulation is totally deterministic but it's still useful to use a discount to avoid snowballing sums)

\( Q(s_1, a) = r + \gamma \, \max_{a'} Q(s_2, a') \)

given this definition we can learn the q table by populating it randomly and then updating incrementally based on observed events.

\( Q(s_1, a) \leftarrow \alpha \, Q(s_1, a) + (1 - \alpha) \left( r + \gamma \, \max_{a'} Q(s_2, a') \right) \)
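in python the entire tabular learner is just a dict and a one-line update; this is a sketch of the idea, not the repo's actual code.

```python
import random

STATES = ['furthest_sonar_forward', 'furthest_sonar_left', 'furthest_sonar_right']
ACTIONS = ['go_forward', 'turn_left', 'turn_right']

def make_q_table():
    # randomly initialised q table keyed by (state, action)
    return {(s, a): random.random() for s in STATES for a in ACTIONS}

def update(q, event, alpha=0.1, gamma=0.9):
    # incremental value iteration update from a single observed event,
    # blending the old entry with the discounted bootstrapped target
    s1, a, r, s2 = event
    target = r + gamma * max(q[(s2, a2)] for a2 in ACTIONS)
    q[(s1, a)] = alpha * q[(s1, a)] + (1 - alpha) * target
```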

running this policy over 200 episodes ( \( \gamma = 0.9, \alpha = 0.1 \) ) from a randomly initialised table we get the following total rewards per episode.

    freq  [action, reward]
    52162 [F, 1]
    10399 [R, 0]
    7473  [L, 0]
    1983  [R, 1]   # noise
    1112  [L, 1]   # noise
    221   [F, -1]
    191   [F, 0]   # noise

this looks very much like the baseline which is not surprising since the actual q table entries end up defining the same behaviour.

state | go_forward | turn_left | turn_right |
---|---|---|---|
furthest_sonar_forward | 8.3 | 4.5 | 4.3 |
furthest_sonar_left | 2.5 | 2.5 | 5.9 |
furthest_sonar_right | 2.5 | 6.1 | 2.4 |

also note how quickly this policy was learnt. for the first few episodes the bot wasn't doing great but it only took ~10 episodes to converge.

comparing the baseline to this run it's maybe fractionally better, but not by a whole lot...

using a q table meant we needed to map our continuous state space (three sonar readings) to a discrete
one (furthest_sonar_forward, etc) and though this mapping *seems* like a good idea maybe we can
learn better representations directly from the raw sonars? and what better tool to do this than a neural network! #buzzword

first let's consider the decision making side of things; given a state (the three sonar readings) which action shall we take? an initial
thought might be to build a network representing \( Q(s, a) \) and then run it forward once for each
action and pick the max Q value. this would work but we're actually going to represent things a little differently
and have the network output the Q values for *all* actions every time. we can simply run an arg_max over all
the Q values to pick the best action.

how then can we train such a network? consider again our original value update equation ...

\( Q(s_1, a) = r + \gamma . max_{a'} Q(s_2, a') \)

we want to use our Q network in two ways.

- for the left hand side, s1, we want to calculate the Q value for a particular action \( a \)
- for the right hand side, s2, we want to calculate the maximum Q value across all actions.

our network is already set up to do s2 well, but for s1 we actually only want the Q value for one action, not all of them. to pull out the one we want we can use a one hot mask followed by a sum to reduce to a single value. it may seem a clumsy way to calculate it but having the network set up like this is worth it for the inference case and the calculation for s2 (both of which require all values)
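the mask-and-sum reduction is easy to see in plain python (numpy/tensorflow do the same thing elementwise over a batch):

```python
def q_for_action(q_values, action_idx):
    # one hot mask over the per-action Q values followed by a sum,
    # reducing the network's full output to the single value for one action
    mask = [1.0 if i == action_idx else 0.0 for i in range(len(q_values))]
    return sum(q * m for q, m in zip(q_values, mask))

assert q_for_action([6.8, 7.7, 3.9], 1) == 7.7
```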

once we have these two values it's just a matter of including the reward and discount ( \(\gamma\) ) and minimising the difference between the two sides (called the temporal difference). squared loss seems to work fine for this problem.

graphically the training graph looks like the following. recall that a training example is a single event \( (state_1, action, reward, state_2) \) where in this network the action is represented by a one hot mask over all the actions.

training this network (the Q network being just a single layer MLP with 3 nodes) gives us the following results.

    freq  [action, reward]
    62644 [F, 1]
    36530 [R, 0]
    6217  [F, -1]
    3636  [R, 1]   # noise

this network doesn't score as high as the discrete q table and the main reason is that it's gotten stuck in a behavioural local minimum.

looking at the frequency of actions we can see this network never bothered with turning left and you can actually get a semi reasonable result if you're ok just doing 180 degree turns all the time...

even still this approach is (maybe) doing slightly better than the discrete case...

note there's nothing in the reward system that heavily punishes going backwards; you might "waste" some time turning around but the reward is for *any* forward movement, not just movement in the right direction around the track. we'll come back to this when we address a slightly harder problem, but for now this brings up a common problem in all decision making processes: the explore/exploit tradeoff.

there are lots of ways of handling explore vs exploit and we'll use a simple approach that has worked well for me in the past..

given a set of Q values for actions, say, [ 6.8, 7.7, 3.9 ], instead of just picking the max, 7.7, we'll do a weighted pick by sampling from the distribution we get by normalising the values: [ 0.37, 0.42, 0.21 ]

further to this we'll either squash (or stretch) the values by raising them to a power \( \rho \) before normalising them.

- when \(\rho\) is low, e.g. \(\rho\)=0.01, values are squashed and the weighted pick is more uniform. [ 6.8, 7.7, 3.9 ] -> [ 0.3338, 0.3342, 0.3319 ] resulting in an explore like behaviour.
- when \(\rho\) is high, e.g. \(\rho\)=20, values are stretched and the weighted pick is more like picking the maximum. [ 6.8, 7.7, 3.9 ] -> [ 0.0768, 0.9231, 0.0000 ] resulting in an exploit like behaviour

annealing \(\rho\) from a low value to a high one over training gives a smooth explore -> exploit transition.
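a sketch of the weighted pick in pure python (this assumes the Q values are positive so that powering and normalising makes sense):

```python
import random

def action_probabilities(q_values, rho):
    # raise each value to the power rho then normalise; low rho squashes
    # towards uniform (explore), high rho stretches towards argmax (exploit)
    powered = [q ** rho for q in q_values]
    total = sum(powered)
    return [p / total for p in powered]

def weighted_pick(q_values, rho):
    probs = action_probabilities(q_values, rho)
    return random.choices(range(len(q_values)), weights=probs)[0]
```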

trying this with our network gives the following result.

first 200:

    freq  [action, reward]
    4598  [R, 0]
    4564  [L, 0]
    2862  [F, -1]
    1735  [F, 1]
    177   [R, 1]   # noise
    175   [L, 1]   # noise
    69    [F, 0]   # noise

last 200:

    freq  [action, reward]
    45789 [F, 1]
    33986 [R, 0]
    4596  [F, -1]
    1274  [R, 1]   # noise
    558   [L, 0]
    36    [L, 1]   # noise
    1     [F, 0]   # noise

this run actually kept \(\rho\) low for a couple of hundred iterations before annealing it from 0.01 to 50. we can see for the first 200 episodes we have an equal mix of F, L & R (so are definitely exploring) but by the end of the run we're back to favoring just turning right again :/ let's take a closer look.

the following graphs show the proportions of actions taken over time for two runs. the first is for the baseline case and shows a pretty constant ratio of F/L/R over time. the second plot is quite different though. here we have three distinct parts; 1) a burnin period of equal F/L/R when the bot was running 100% explore 2) a period of ramping up towards exploit where we do get higher scores related to a high F amount and finally 3) where we get locked into just favoring R again.

what went wrong? and what can we do about it?

there are actually two important things happening and two clever approaches to avoiding them. you can read a lot more about these two approaches in deepmind's epic Playing Atari paper [1]

the first problem is related to the instability of training the Q network with *two* updates per example.

recall that each training example updates both Q(s1) and Q(s2) and it turns out it can be unstable to train both of these at the same time. a simple enough workaround is to keep a full copy of the network (called the "target network") and use it for evaluating Q(s2). we don't backpropagate updates to the target network and instead take a fresh copy from the Q(s1) network every n training steps. (it's called the "target" network since it provides a more stationary target for Q(s1) to learn against)
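the bookkeeping for a target network is tiny; something like this (a sketch of the idea, not any particular library's API):

```python
class TargetNetwork:
    # keep a frozen copy of the q network's weights and refresh it every
    # `copy_every` training steps; no gradients ever flow into this copy
    def __init__(self, weights, copy_every=100):
        self.weights = list(weights)
        self.copy_every = copy_every
        self.step = 0

    def maybe_refresh(self, live_weights):
        self.step += 1
        if self.step % self.copy_every == 0:
            self.weights = list(live_weights)
```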

the second problem is related to the order of examples we are training with.

the core sequence of an episode is \( ( state_1,\ action_1,\ reward_1,\ state_2,\ action_2,\ reward_2,\ state_3,\ ... ) \) which for training gets broken down into individual events, i.e. \( ( state_1,\ action_1,\ reward_1,\ state_2 ) \) followed by \( ( state_2,\ action_2,\ reward_2,\ state_3 ) \) etc. as such each event's \( state_2 \) is going to be the \( state_1 \) of the next event. this type of correlation between successive examples is bad news for any iterative optimizer.

the solution is to use 'experience replay' [2] where we simply keep old events in a memory and replay them back as training examples in a random order. it's very similar to the ideas behind why we shuffle input data for any learning problem.
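a minimal experience replay memory (the capacity and the interface are my choices, not the original code's):

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)  # oldest events fall off the end

    def add(self, event):
        self.memory.append(event)

    def sample(self, batch_size):
        # replay old events in a random order to break the correlation
        # between successive examples
        return random.sample(self.memory, min(batch_size, len(self.memory)))
```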

adding these two techniques gives the best result ...

    freq   [action, reward]
    116850 [F, 1]
    18594  [R, 0]
    18418  [L, 0]
    2050   [L, 1]   # noise
    2026   [R, 1]   # noise
    1681   [F, -1]
    1      [F, 0]   # noise

this run used a bot with a high \(\rho\) value (i.e. maximum exploit) that was fed 1,000 events/second randomly from the explore/exploit job. we can see a few episodes of doing poorly before converging quickly (note that experience replay provides events a *lot* quicker than just normal simulation)

overall this approach nails it compared to the previous runs.

- a harder reward function
  - instead of a reward per movement we can give a reward only at discrete points on the track.
- continuous control
  - instead of three discrete actions we should try to learn continuous control actions ([3]) e.g. acceleration & steering.
  - will most probably require an actor/critic implementation
- add an adversarial net [4] as a way to transfer learn between this simulated robot and the raspberry pi powered whippersnapper rover i'm building.

- [1] Playing Atari with Deep Reinforcement Learning (pdf)
- [2] Reinforcement Learning for Robots Using Neural Networks (pdf)
- [3] Continuous control with deep reinforcement learning (arxiv)
- [4] Domain-Adversarial Training of Neural Networks (arxiv)

follow along further with this project by reading my google+ robot0 stream

see more of what i'm generally reading on my google+ reading stream

one thing in theano i couldn't immediately find examples for was a simple embedding lookup table, a critical component for anything with NLP. turns out that it's just one of those things that's so simple no one bothered writing it down :/

tl;dr : you can just use numpy indexing and everything just works.

consider the following theano.tensor example of 2d embeddings for 5 items. each row represents a separate embeddable item.

    >>> E = np.random.randn(5, 2)
    >>> t_E = theano.shared(E)
    >>> t_E.eval()
    array([[-0.72310919, -1.81050727],
           [ 0.2272197 , -1.23468159],
           [-0.59782901, -1.20510837],
           [-0.55842279, -1.57878187],
           [ 0.63385967, -0.35352725]])

to pick a subset of the embeddings it's as simple as just using indexing. for example to get the third & first embeddings it's ...

    >>> idxs = np.asarray([2, 0])
    >>> t_E[idxs].eval()
    array([[-0.59782901, -1.20510837],   # third row of E
           [-0.72310919, -1.81050727]])  # first row of E

if we want to concatenate them into a single vector (a common operation when we're feeding up to, say, a densely connected hidden layer), it's a reshape

    >>> t_E[idxs].reshape((1, -1)).eval()
    array([[-0.59782901, -1.20510837, -0.72310919, -1.81050727]])  # third & first rows concatenated

all the required multi dimensional operations you need for batching just work too..

eg. if we wanted to run a batch of size 2 with the first batch item being the third & first embeddings and the second batch item being the fourth & fourth embeddings we'd do the following...

    >>> idxs = np.asarray([[2, 0], [3, 3]])  # batch of size 2; first example is the pair [2, 0], second is [3, 3]
    >>> t_E[idxs].eval()
    array([[[-0.59782901, -1.20510837],    # third row of E
            [-0.72310919, -1.81050727]],   # first row of E
           [[-0.55842279, -1.57878187],    # fourth row of E
            [-0.55842279, -1.57878187]]])  # fourth row of E
    >>> t_E[idxs].reshape((idxs.shape[0], -1)).eval()
    array([[-0.59782901, -1.20510837, -0.72310919, -1.81050727],   # first item in batch; third & first rows concatenated
           [-0.55842279, -1.57878187, -0.55842279, -1.57878187]])  # second item in batch; fourth row duplicated

this type of packing of the data into matrices is crucial to enable linear algebra libs and GPUs to really fire up.

consider the following as-simple-as-i-can-think-up "network" that uses embeddings;

given 6 items we want to train 2d embeddings such that the first two items have the same embeddings, the third and fourth have the same embeddings and the last two have the same embeddings. additionally we want all other combos to have different embeddings.

the *entire* theano code (sans imports) is the following..

first we initialise the embedding matrix as before

    E = np.asarray(np.random.randn(6, 2), dtype='float32')
    t_E = theano.shared(E)

the "network" is just a dot product of two embeddings ...

    t_idxs = T.ivector()
    t_embedding_output = t_E[t_idxs]
    t_dot_product = T.dot(t_embedding_output[0], t_embedding_output[1])

... where the training cost is an L1 penalty against the "label" of 1.0 for the pairs we want to have the same embeddings and 0.0 for the ones we want to have different embeddings.

    t_label = T.iscalar()
    gradient = T.grad(cost=abs(t_label - t_dot_product), wrt=t_E)
    updates = [(t_E, t_E - 0.01 * gradient)]
    train = theano.function(inputs=[t_idxs, t_label], outputs=[], updates=updates)

we can generate training examples by randomly picking two elements and assigning label 1.0 for the pairs 0 & 1, 2 & 3 and 4 & 5 (and 0.0 otherwise), and every once in awhile write the embeddings out to a file.

    print "i n d0 d1"
    for i in range(0, 10000):
        v1, v2 = random.randint(0, 5), random.randint(0, 5)
        label = 1.0 if (v1/2 == v2/2) else 0.0
        train([v1, v2], label)
        if i % 100 == 0:
            for n, embedding in enumerate(t_E.get_value()):
                print i, n, embedding[0], embedding[1]

plotting this shows the convergence of the embeddings (labels denote initial embedding location)...

0 & 1 come together, as do 2 & 3 and 4 & 5. ta da!

it's interesting to observe the effect of this (somewhat) arbitrary cost function i picked.

for the pairs where we wanted the embeddings to be the same the cost function, \( |1 - a \cdot b| \), is minimised when the dot product is 1, and this happens when the vectors are the same and have unit length. you can see this is the case for pairs 0 & 1 and 4 & 5 which have come together and ended up on the unit circle. but what about 2 & 3? they've gone to the origin, and the dot product of the origin with itself is 0, so it's *maximising* the cost, not minimising it! why?

it's because of the other constraint we added. for all the pairs we wanted the embeddings to be different the cost function, \( |0 - a \cdot b| \), is minimised when the dot product is 0. this happens when the vectors are orthogonal. both 0 & 1 and 4 & 5 can be on the unit circle and orthogonal, but for them to both be orthogonal to 2 & 3 *they* have to be at the origin. since my loss is an L1 loss (instead of, say, an L2 squared loss) the pair 2 & 3 is overall better off at the origin because it gains more from minimising this constraint than from worrying about the first.

the pair 2 & 3 has come together not because we were training embeddings to be the same but because we were also training them to be different. this wouldn't be a problem if we were using 3d embeddings since they could all be both on the unit sphere and orthogonal at the same time.
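we can sanity check this geometry numerically; this is pure python for illustration, nothing to do with the original training code.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

same_pair_cost = lambda a, b: abs(1 - dot(a, b))  # pairs we want to be the same
diff_pair_cost = lambda a, b: abs(0 - dot(a, b))  # pairs we want to be different

unit_x, unit_y, origin = [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]

assert same_pair_cost(unit_x, unit_x) == 0.0  # identical unit vectors: cost 0
assert diff_pair_cost(unit_x, unit_y) == 0.0  # orthogonal unit vectors: cost 0
assert same_pair_cost(origin, origin) == 1.0  # a pair stuck at the origin pays 1 ...
assert diff_pair_cost(origin, unit_x) == 0.0  # ... but is "different" from everyone for free
```

in 2d the only point orthogonal to both unit_x and unit_y is the origin, which is exactly the trade-off the pair 2 & 3 ends up making.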

you can also see how the points never fully converge. in 2d with this loss it's impossible to get the cost down to 0 so they continue to get bumped around. in 3d, as just mentioned, the cost can be 0 and the points would converge.

there's one non trivial optimisation you can do regarding your embeddings that relates to how sparse the embedding update is.
in the above example we have 6 embeddings in total and, even though we only update 2 of them at a time, we are calculating the
gradient with respect to the *entire* t_E matrix. the end result is that we calculate (and apply) a gradient that for the majority of rows is just zeros.

    ...
    gradient = T.grad(cost=abs(t_label - t_dot_product), wrt=t_E)
    updates = [(t_E, t_E - 0.01 * gradient)]
    ...
    print gradient.eval({t_idxs: [1, 2], t_label: 0})
    [[ 0.00000000e+00  0.00000000e+00]
     [ 9.60363150e-01  2.22545816e-04]
     [ 1.00614786e+00 -3.63630615e-03]
     [ 0.00000000e+00  0.00000000e+00]
     [ 0.00000000e+00  0.00000000e+00]
     [ 0.00000000e+00  0.00000000e+00]]

you can imagine how much sparser things are when you've got 1M embeddings and are only updating <10 per example :/

rather than do all this wasted work we can be a bit more explicit about both how we want the gradient calculated and updated by using inc_subtensor

    ...
    t_embedding_output = t_E[t_idxs]
    ...
    gradient = T.grad(cost=abs(t_label - t_dot_product), wrt=t_embedding_output)
    updates = [(t_E, T.inc_subtensor(t_embedding_output, -0.01 * gradient))]
    ...
    print gradient.eval({t_idxs: [1, 2], t_label: 0})
    [[ 9.60363150e-01  2.22545816e-04]
     [ 1.00614786e+00 -3.63630615e-03]]

and of course you should only do this once you've proven it's the slow part...

language modelling is a classic problem in NLP; given a sequence of words such as "my cat likes to ..." what's the next word? this problem is related to all sorts of things, everything from autocomplete to speech to text.

the classic solution to language modelling is based on just counting. if a speech to text system is sure it's heard "my cat likes to" but can't decide if the next word is "sleep" or "beep" we can just look at relative counts; if we've observed in a large corpus that cats like to sleep more than they like to beep we can say "sleep" is more likely. (note: this would be different if it was "my roomba likes to ...")
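the counting approach fits in a few lines; here's a toy version (the corpus sentences are invented for illustration):

```python
from collections import Counter, defaultdict

# a made-up toy corpus, just to illustrate counting
corpus = [
    'my cat likes to sleep',
    'my cat likes to sleep',
    'my cat likes to beep',
]

# count next-word frequencies for every observed prefix
next_word = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(1, len(words)):
        next_word[tuple(words[:i])][words[i]] += 1

def most_likely(prefix):
    return next_word[tuple(prefix.split())].most_common(1)[0][0]

assert most_likely('my cat likes to') == 'sleep'  # seen twice vs once for 'beep'
```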

the first approach i saw to solving this problem with neural nets is from bengio et al. "a neural probabilistic language model" (2003). this paper was a huge eye opener for me and was the first case i'd seen of using a distributed, rather than purely symbolic, representation of text. definitely "word embeddings" are all the rage these days!

bengio takes the approach of using a softmax to estimate the distribution of possible words given the two previous words. ie \( P({w}_3 | {w}_1, {w}_2) \). depending on your task though it might make more sense to instead estimate the likelihood of the triple directly ie \( P({w}_1, {w}_2, {w}_3) \).

let's work through an empirical comparison of these two on a synthetic problem. we'll call the first the *softmax* approach and the second the *logistic regression* approach.

rather than use real text data let's work on a simpler synthetic dataset with a vocab of only 6 tokens; "A", "B", "C", "D", "E" & "F". be warned: a vocab this small is so contrived that it's hard to generalise any result from it. in particular a normal english vocab in the hundreds of thousands would be soooooooo much sparser.

we'll use random walks on the following erdos renyi graph as a generating grammar. eg "phrases" include "D C", "A F A F A A", "A A", "E D C" & "E A A"

the main benefit of such a contrived small vocab is that it's feasible to analyse all 6^{3} = 216 trigrams.
let's consider the distributions associated with a couple of specific (w1, w2) pairs.
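with itertools it's one line to enumerate the full trigram space:

```python
from itertools import product

vocab = ['A', 'B', 'C', 'D', 'E', 'F']
trigrams = [''.join(t) for t in product(vocab, repeat=3)]
assert len(trigrams) == 216  # 6^3; small enough to inspect exhaustively
assert 'FAA' in trigrams
```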

there are only 45 trigrams that this grammar generates and the most frequent one is FAA. FAF is also possible but the other FA? cases can never occur.

    F A A 0.20  # the most frequent trigram generated
    F A B 0.0   # never generated
    F A C 0.0
    F A D 0.0
    F A E 0.0
    F A F 0.14  # the 4th most frequent trigram

if we train a simple softmax based neural probabilistic language model (nplm) we see the distribution of \( P({w}_3 | {w}_1=F, {w}_2=A ) \) converge to what we expect; FAA has a likelihood of 0.66, FAF has 0.33 and the others 0.0

this is a good illustration of the convergence we expect to see with a softmax. each observed positive example of FAA is also an implicit negative example for FAB, FAC, FAD, FAE & FAF and as such each FAA causes the likelihood of FAA to go up while pushing the others down. since we observe FAA twice as much as FAF it gets twice the likelihood and since we never see FAB, FAC, FAD or FAE they only ever get pushed down and converge to 0.0

since the implementation behind this is (overly) simple we can run it a couple of times to check things are converging consistently. here are 6 runs, from random starting parameters, and we can see each converges to the same result..

now consider the logistic model where instead of learning the distribution of w3 given (w1, w2) we model the likelihood of the triple directly, \( P({w}_1, {w}_2, {w}_3) \). in this case we're modelling whether a specific example is true or not, not how it relates to others, so one big con is that there are no implicit negatives as in the softmax case. we need explicit negative examples and for this experiment i've generated them by randomly sampling the trigrams that don't occur in the observed set. ( the generation of "good" negatives is a surprisingly hard problem )

if we do 6 runs again, instead of learning the distribution we have FAA and FAF converging to 1.0 and the others converging to 0.0. run4 actually has FAB tending to 1.0 too but i wouldn't be surprised at all if it dropped later; these graphs are in general what i'd expect given i'm just using a fixed global learning rate (ie nothing adaptive about the learning rate at all)

now instead of considering the most frequent (w1, w2) trigrams let's consider the least frequent.

    C B A 0.003
    C B B 0.07   # 28th most frequent (of 45 possible trigrams)
    C B C 0.0
    C B D 0.003
    C B E 0.002
    C B F 0.001  # the least frequent trigram generated

as before the softmax learns the distribution; CBB is the most frequent, CBC has 0.0 probability and the others are roughly equal. these examples are far less frequent in the dataset so the model, quite rightly, allocates less of its complexity to getting them right.

the logistic model, as before, generally has everything converging to 1.0 except CBC which converges to 0.0

finally consider the case of C -> C -> ?. this one is interesting since C -> C never actually occurs in the grammar.

first let's consider the logistic case. CC only ever occurs in the training data as an explicit negative so we see all of them converging to 0.0 ( amusingly in run4 CCC alllllmost made it )

now consider the softmax. recall that the softmax learns from explicit positives and implicit negatives but, since there are no observed cases of CC?, it has never seen these trigrams as either.

so what is going on here? the convergence is all over the place! run2 and run6 seem to suggest CCA is the only likely case whereas run3 and run4 oscillate between CCB and CCF ???

it turns out these are artifacts of the training. there was no pressure in any way to get CC? "right" so these are just the side effects of how the embeddings for tokens, particularly C in this case, are being used for the other actual observed examples. we call these hallucinations.

another slightly different way to view this is to run the experiment 100 times and just consider the converged state (or at least the final state after a fixed number of iterations)

if we consider FA again we can see the low variance convergence of FAA to 0.66 and FAF to 0.33.

if we consider CB again we can see its higher variance convergence to the numbers we reviewed before; CBB ~= 0.4, CBC = 0.0 and the others around 0.15

considering CC though we see CCA and CCB have a bimodal distribution between 0.0 and 1.0 unlike any of the others. fascinating!

this is interesting but i'm unsure how much of it is just due to an overly simple model. this implementation just uses a simple fixed global learning rate (no per weight adaptation at all), uses very simple weight initialisation and has no regularisation at all :/

all the code can be found on github

i've been reviving some old theano code recently and in case you haven't seen it theano is a pretty awesome python library that reads a lot like numpy but provides two particularly interesting features.

- symbolic differentiation; not something i'll talk about here, but super useful if you're tinkering with new models and you're using a gradient descent method for learning (and these days, who's not..)
- the ability to run transparently on a gpu; well, almost transparently, this'll be the main focus of this post...

let's work through a very simple model that's kinda like a system of linear equations. we'll compare 1) numpy (our timing baseline) vs 2) theano on a cpu vs 3) theano on a gpu. keep in mind this model is contrived and doesn't really represent anything useful, it's more to demonstrate some matrix operations.

first consider the following numpy code (speed_test_numpy.py) which does a simple y=mx+b like calculation a number of times in a tight loop. this looping isn't just for benchmarking, lots of learning algorithms operate on a tight loop.

```python
import time
import numpy as np

# define data
# square matrices will do for a demo
np.random.seed(123)
m = np.random.randn(1000, 1000).astype('float32')
x = np.random.randn(1000, 1000).astype('float32')
b = np.random.randn(1000, 1000).astype('float32')

# run tight loop
start = time.time()
for i in range(500):
    y = np.add(np.dot(m, x), b)
print "numpy", time.time()-start, "sec"
```

this code on a 6 core 3.8GHz AMD runs in a bit over 2 minutes

```
$ python speed_test_numpy.py
numpy 135.350140095 sec
```
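as a back-of-envelope sanity check on that number (my arithmetic, not from the post): a 1000x1000 gemm is roughly 2n^3 floating point operations, so this loop sustains around 7.4 gflop/s on the cpu.

```python
# back-of-envelope flop rate for the numpy loop above
n, iters = 1000, 500
flops = 2 * n ** 3 * iters   # one n x n gemm is ~2n^3 flops; the add is negligible
secs = 135.35                # the timing reported above
print("%.1f gflop/s" % (flops / secs / 1e9))  # -> 7.4 gflop/s
```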

now consider the same thing in theano (speed_test_theano.py)

```python
import time
import numpy as np
import theano
import theano.tensor as T

# define data
np.random.seed(123)
m = np.random.randn(1000, 1000).astype('float32')
x = np.random.randn(1000, 1000).astype('float32')
b = np.random.randn(1000, 1000).astype('float32')

# define a symbolic expression of the equations in theano
tm = T.matrix("m")
tx = T.matrix("x")
tb = T.matrix("b")
ty = T.add(T.dot(tm, tx), tb)

# and compile it
line = theano.function(inputs=[tx, tm, tb], outputs=[ty])

# then run same loop as before
start = time.time()
for i in range(500):
    y, = line(m, x, b)
print "theano", time.time()-start, "sec"
```

hopefully it's clear enough what is happening here at a high level but, just briefly, the tm, tx, tb and ty variables form a symbolic representation of what we want to compute and the theano.function call compiles this into actual executable code. there is lots of gentle intro material that introduces this notation on the theano site.

when run on the cpu it takes about the same time as the numpy version

```
$ THEANO_FLAGS=device=cpu python speed_test_theano.py
theano 136.371109009 sec
```

but when "magically" run on the gpu it's quite a bit faster.

```
$ THEANO_FLAGS=device=gpu python speed_test_theano.py
Using gpu device 0: GeForce GTX 970
theano 3.16091990471 sec
```

awesome! a x40 speed up! so we're done right? not quite, we can do better.

let's drill into what's actually happening; we can do this in two ways, debugging the compiled graph and theano profiling.

debugging allows us to see what a function has been compiled to. for the cpu case it's just a single blas gemm (general matrix multiplication) call. that's exactly what we'd want, so great!

```
Gemm{no_inplace} [@A] ''   0
 |b [@B]
 |TensorConstant{1.0} [@C]
 |m [@D]
 |x [@E]
 |TensorConstant{1.0} [@C]
```

profiling allows us to see where time is spent. 100% in this single op, no surprise.

```
$ THEANO_FLAGS=device=cpu,profile=True python speed_test_theano.py
...
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  100.0%  100.0%     136.282s      2.73e-01s    500   0   Gemm{no_inplace}
...
```

looking at the gpu version though things are a little different...

```
HostFromGpu [@A] ''   4
 |GpuGemm{inplace} [@B] ''   3
   |GpuFromHost [@C] ''   2
   | |b [@D]
   |TensorConstant{1.0} [@E]
   |GpuFromHost [@F] ''   1
   | |m [@G]
   |GpuFromHost [@H] ''   0
   | |x [@I]
   |TensorConstant{1.0} [@E]
```

we can see a GpuGemm operation, the gpu equivalent of Gemm, but now there's a bunch of GpuFromHost & HostFromGpu operations too? what are these?

i'll tell you what they are; the bane of your existence! they represent transferring data to/from the gpu, which is slow and, if we're not careful, can add up to a non-trivial amount. if we review the profiling output we can see that, though we're faster than the non gpu version, we're spending >70% of the time just moving data.

(though remember this example is contrived, we'd expect to be doing more in our overall computation than just a single general matrix multiply)

```
$ THEANO_FLAGS=device=gpu,profile=True python speed_test_theano.py
...
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
   26.4%   26.4%       0.776s      1.55e-03s    500   3   GpuGemm{inplace}
   19.5%   45.9%       0.573s      1.15e-03s    500   0   GpuFromHost(x)
   19.5%   65.4%       0.572s      1.14e-03s    500   1   GpuFromHost(m)
   19.3%   84.7%       0.565s      1.13e-03s    500   2   GpuFromHost(b)
   15.3%  100.0%       0.449s      8.99e-04s    500   4   HostFromGpu(GpuGemm{inplace}.0)
...
```

ouch!

the crux of this problem is that we actually have two types of variables in this model; the parameterisation of the model (m & b) and those related to examples (x & y). so, though it's realistic to do a speed test with a tight loop over the same function many times, what is *not* realistic is passing the model parameters to/from the gpu with each and every input example. this is a complete waste; it's much more sensible to send them over to the gpu once at the start of the loop and retrieve them once at the end. this is an important and very common pattern.

how do we fix this? it's actually pretty simple; shared variables. yay!

consider the following; speed_test_theano_shared.py

```python
import time
import numpy as np
import theano
import theano.tensor as T

# define data
np.random.seed(123)
m = np.random.randn(1000, 1000).astype('float32')
x = np.random.randn(1000, 1000).astype('float32')
b = np.random.randn(1000, 1000).astype('float32')

# define a symbolic expression of the equations in theano
tm = theano.shared(m)  # copy m over to gpu once explicitly
tx = T.matrix("x")
tb = theano.shared(b)  # copy b over to gpu once explicitly
ty = T.add(T.dot(tm, tx), tb)
line = theano.function(inputs=[tx], outputs=[ty])  # don't pass m & b each call

# then run same loop as before
start = time.time()
for i in range(500):
    y, = line(x)
print "theano", time.time()-start, "sec"
print tm.get_value().shape  # note: we can get the value back at any time
```

reviewing the debug we can see this removes a stack of the GpuFromHost calls.

```
HostFromGpu [@A] ''   2
 |GpuGemm{no_inplace} [@B] ''   1
   |[@C]
   |TensorConstant{1.0} [@D]
   | [@E]
   |GpuFromHost [@F] ''   0
   | |x [@G]
   |TensorConstant{1.0} [@D]
```

and we're down to < 2s

```
$ THEANO_FLAGS=device=gpu,profile=True python speed_test_theano_shared.py
Using gpu device 0: GeForce GTX 970
theano 1.93515706062 sec
...
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
   44.7%   44.7%       0.804s      1.61e-03s    500   1   GpuGemm{no_inplace}
   30.2%   74.9%       0.543s      1.09e-03s    500   0   GpuFromHost(x)
   25.1%  100.0%       0.451s      9.01e-04s    500   2   HostFromGpu(GpuGemm{no_inplace}.0)
...
```

what's even crazier is we can go further by moving the x and y matrices onto the gpu too. it turns out this isn't *too* far fetched since if x and y were representing training examples we'd be iterating over them anyway (and if we could fit them all onto the gpu that'd be great)

```python
import time
import numpy as np
import theano
import theano.tensor as T

# define data
np.random.seed(123)
m = np.random.randn(1000, 1000).astype('float32')
x = np.random.randn(1000, 1000).astype('float32')
b = np.random.randn(1000, 1000).astype('float32')

# define a symbolic expression of the equations in theano
tm = theano.shared(m)
tx = theano.shared(x)
tb = theano.shared(b)
ty = theano.shared(np.zeros((1000, 1000)).astype('float32'))  # we need a shared var for y now
mx_b = T.add(T.dot(tm, tx), tb)

# and compile it
train = theano.function(inputs=[], updates={ty: mx_b})  # update y on gpu

# then run same loop as before
start = time.time()
for i in range(500):
    train()  # now there's no input/output
print "theano", time.time()-start, "sec"
print tm.get_value().shape
```

the debug graph is like the cpu graph now, just one gemm call.

```
GpuGemm{no_inplace} [@A] ''   0
 |[@B]
 |TensorConstant{1.0} [@C]
 | [@D]
 | [@E]
 |TensorConstant{1.0} [@C]
```

and runs in under a second. x150 the numpy version. nice! :)

```
$ THEANO_FLAGS=device=gpu,profile=True python speed_test_theano_shared2.py
theano 0.896003007889 sec
...
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  100.0%  100.0%       0.800s      1.60e-03s   C  500   1   GpuGemm{no_inplace}
...
```

PyMC is a python library for working with bayesian statistical models, primarily using MCMC methods. as a software engineer who has only just scratched the surface of statistics this whole MCMC business is blowing my mind so i've got to share some examples.

let's start with the simplest thing possible, fitting a simple distribution.

say we have a thousand values, ` 87.27, 67.98, 119.56, ...`

and we want to build a model of them.

a common first step might be to generate a histogram.

if i had to make a guess i'd say this data looks normally distributed. somewhat unsurprising, not just because normal distributions are freakin everywhere (this great khan academy video on the central limit theorem explains why) but because it was me who synthetically generated this data in the first place ;)

now a normal distribution is parameterised by two values; its *mean* (technically speaking, the "middle" of the curve) and its *standard deviation* (even more technically speaking, how "fat" it is) so let's use PyMC to figure out what these values are for this data.

*!!warning!! !!!total overkill alert!!!* there must be a bazillion simpler ways to fit a normal to this data but this post is about dead-simple-PyMC not dead-simple-something-else.

first a definition of our model.

```python
# simple_normal_model.py
from pymc import *
data = map(float, open('data', 'r').readlines())
mean = Uniform('mean', lower=min(data), upper=max(data))
precision = Uniform('precision', lower=0.0001, upper=1.0)
process = Normal('process', mu=mean, tau=precision, value=data, observed=True)
```

working *backwards* through this code ...

- line 6 says i am trying to model some `process` that i believe is `Normal`ly distributed, defined by variables `mean` and `precision` (precision is just the inverse of the variance, which in turn is just the standard deviation squared). i've already `observed` this data and the `value`s are in the variable `data`
- line 5 says i don't know the `precision` for my `process` but my prior belief is its value is somewhere between 0.0001 and 1.0. since i don't favor any values in this range my belief is `uniform` across the values. note: assuming a uniform distribution for the precision is overly simplifying things quite a bit, but we can get away with it in this simple example and we'll come back to it.
- line 4 says i don't know the `mean` for my data but i think it's somewhere between the `min` and the `max` of the observed `data`. again this belief is `uniform` across the range.
- line 3 says the `data` for my unknown `process` comes from a local file (just-plain-python)

the second part of the code runs the MCMC sampling.

```python
# run_mcmc.py
from pymc import *
import simple_normal_model
model = MCMC(simple_normal_model)
model.sample(iter=500)
print(model.stats())
```

working *forwards* through this code ...

- line 4 says build a MCMC model from the `simple_normal_model` file
- line 5 says run a sample for 500 iterations
- line 6 says print some stats.

**and that's it!**

the output from our stats includes, among other things, estimates for the `mean` and `precision` we were trying to find

```
{
 'mean':      {'95% HPD interval': array([ 94.53688316, 102.53626478]) ... },
 'precision': {'95% HPD interval': array([ 0.00072487, 0.03671603]) ... },
 ...
}
```

now i've brushed over a couple of things here (eg the use of a uniform prior over the precision, see here for more details) but i can get away with it all because this problem is a trivial one and i'm not doing gibbs sampling in this case. the main point i'm trying to make is that it's dead simple to start writing these models.

one thing i do want to point out is that this estimation doesn't result in just one single value for mean and precision, it results in a distribution of the possible values. this is great since it gives us an idea of how confident we can be in the values as well as allowing this whole process to be iterative, ie the output values from this model can be fed easily into another.

all the code above parameterised the normal distribution with a mean and a precision. i've always thought of normals though in terms of means and standard deviations (precision is a more bayesian way to think of things... apparently...) so the first extension to my above example i want to make is to redefine the problem in terms of a prior on the standard deviation instead of the precision. mainly i want to do this to introduce the `deterministic` concept but it's also a subtle change in how the sampling search will be directed because it introduces a non linear transform.

```python
data = map(float, open('data', 'r').readlines())
mean = Uniform('mean', lower=min(data), upper=max(data))
std_dev = Uniform('std_dev', lower=0, upper=50)

@deterministic(plot=False)
def precision(std_dev=std_dev):
    return 1.0 / (std_dev * std_dev)

process = Normal('process', mu=mean, tau=precision, value=data, observed=True)
```

our code is almost the same but instead of a prior on the `precision` we use a `deterministic` method to map from the variable we're now estimating (the `std_dev`) to the parameter the normal needs (the `precision`).

we fit the model using the same `run_mcmc.py` but this time get estimates for the `std_dev` not the `precision`
```
{
 'mean':    {'95% HPD interval': array([ 94.23147867, 101.76893808]), ...
 'std_dev': {'95% HPD interval': array([ 19.53993697, 21.1560098 ]), ...
 ...
}
```

which all matches up to how i originally generated the data in the first place.. cool!

```python
from numpy.random import normal
data = [normal(100, 20) for _i in xrange(1000)]
```
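a quick numpy check (mine, not from the post) that data generated this way really does have sample moments near the values the sampler recovered:

```python
import numpy as np

# regenerate comparable synthetic data and inspect its sample moments;
# they should sit near the generating parameters mean=100, std=20
rng = np.random.default_rng(0)
data = rng.normal(100, 20, size=1000)

print(round(data.mean(), 1), round(data.std(), 1))  # close to 100 and 20
```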

for this example let's now dive a bit deeper than just the stats object. to help understand how the sampler is converging on its results we can also dump a trace of its progress at the end of `run_mcmc.py`

```python
import numpy
for p in ['mean', 'std_dev']:
    numpy.savetxt("%s.trace" % p, model.trace(p)[:])
```

plotting this we can see how quickly the sampled values converged.

let's consider a slightly more complex example.

again we have some data... `107.63, 207.43, 215.84, ...`

that plotted looks like this...

hmmm. looks like *two* normals this time with the one centered on 100 having a bit more data.

how could we model this one?

```python
data = map(float, open('data', 'r').readlines())

theta = Uniform("theta", lower=0, upper=1)
bern = Bernoulli("bern", p=theta, size=len(data))

mean1 = Uniform('mean1', lower=min(data), upper=max(data))
mean2 = Uniform('mean2', lower=min(data), upper=max(data))
std_dev = Uniform('std_dev', lower=0, upper=50)

@deterministic(plot=False)
def mean(bern=bern, mean1=mean1, mean2=mean2):
    return bern * mean1 + (1 - bern) * mean2

@deterministic(plot=False)
def precision(std_dev=std_dev):
    return 1.0 / (std_dev * std_dev)

process = Normal('process', mu=mean, tau=precision, value=data, observed=True)
```

reviewing the code again it's mostly the same, the big difference being the `deterministic` definition of the `mean`. it's now that we finally start to show off the awesome power of these non analytical approaches.

line 12 defines the mean not by one `mean` variable but instead as a mixture of two, `mean1` and `mean2`. for each value we're trying to model we pick either `mean1` or `mean2` based on *another* random variable `bern`. `bern` is described by a bernoulli distribution and so is either 1 or 0, proportional to the parameter `theta`.

ie the definition of our `mean` is that when `theta` is high, near 1.0, we pick `mean1` most of the time and when `theta` is low, near 0.0, we pick `mean2` most of the time.

what we are solving for then is not just `mean1` and `mean2` but also how the values are split between them (described by `theta`). ( note for the sake of simplicity i made the two normals differ in their means but use a shared standard deviation. depending on what you're doing this might or might not make sense )
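the same mixture construction can be sketched outside pymc in plain numpy (the parameter values here are for illustration only, chosen to match the synthetic data):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, mean1, mean2, std_dev = 0.33, 200.0, 100.0, 20.0

bern = (rng.random(1500) < theta).astype(float)  # 1 with probability theta
mean = bern * mean1 + (1 - bern) * mean2         # pick mean1 or mean2 per datapoint
data = rng.normal(mean, std_dev)                 # shared std_dev, as in the model

print(round(bern.mean(), 2), round(data.mean()))  # mix near 0.33, overall mean near 133
```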

reviewing the traces we can see the converged `mean`s are 100 & 200 with `std_dev` 20. the mix (`theta`) is 0.33, which all agrees with the synthetic data i generated for this example...

```python
from numpy.random import normal
import random
data = [normal(100, 20) for _i in xrange(1000)]   # 2/3rds of the data
data += [normal(200, 20) for _i in xrange(500)]   # 1/3rd of the data
random.shuffle(data)
```

to me the awesome power of these methods is the ability in that function to pretty much write whatever i think best describes the process. too cool for school.

i also find it interesting to see how the convergence came along... the model starts in a local minimum of both normals having a mean a bit below 150 (the midpoint of the two actual ones) with a mixing proportion of somewhere in the ballpark of 0.5 / 0.5. around iteration 1,500 it correctly splits them apart and starts to understand the mix is more like 0.3 / 0.7. finally by about iteration 2,500 it starts working on the standard deviation which in turn really helps narrow down the true means.

(thanks cam for helping me out with the formulation of this one..)

these are pretty simple examples thrown together to help me learn but i think they're still illustrative of the power of these methods (even when i completely ignore anything to do with conjugacy)

in general i've been working through an awesome book, doing bayesian data analysis, and can't recommend it enough.

i also found john's blog post on using jags in r was really helpful getting me going.

all the examples listed here are on github.

next is to rewrite everything in stan and do some comparison between pymc, stan and jags. fun times!

say you have three items; item1, item2 and item3 and you've somehow associated a count for each against one of five labels; A, B, C, D, E

```
> data
          A   B     C    D   E
item1 23700  20  1060   11   4
item2  1581 889 20140 1967 200
item3     1   0     1   76   0
```

depending on what you're doing it'd be reasonable to normalise these values and an l1-normalisation (ie rescale so they keep the same proportions but add up to 1) gives us the following...

```
> l1_norm = function(x) x / sum(x)
> l1 = t(apply(data, 1, l1_norm))
> l1
             A          B        C          D          E
item1 0.955838 0.00080661 0.042751 0.00044364 0.00016132
item2 0.063809 0.03588005 0.812851 0.07938814 0.00807200
item3 0.012821 0.00000000 0.012821 0.97435897 0.00000000
```

great... but it's fair enough if you think things don't feel right...

according to these normalised values item3 is "more of" a D (0.97) than item1 is an A (0.95) even though we've only collected 1/300th of the data for it. this just isn't right.

purely based on these numbers i'd think it more sensible to expect item3 to be an A or a C (since that's what we've seen with item1 and item2) but we just haven't seen enough data for it yet. what makes sense then is to smooth the values of item3 out and make them more like some sort of population average.

so firstly what makes a sensible population average? ie if we didn't know anything at all about a new item what would we want the proportions of labels to be? alternatively we can ask what do we think item3 is likely to look like later on as we gather more data for it? i think an l1-norm of the sums of all the values makes sense ...

```
> column_totals = apply(data, 2, sum)
> population_average = l1_norm(column_totals)
> population_average
        A         B         C         D         E
0.5094218 0.0183000 0.4268199 0.0413513 0.0041069
```

... and it seems fair. without any other info it's reasonable to "guess" a new item is likely to be somewhere between an A (0.50) and a C (0.42)

so now we have our item3, and our population average, and we want to mix them together in some way... how might we do this?

```
                A        B        C        D        E
item3    0.012821 0.000000 0.012821 0.974358 0.000000
pop_aver 0.509421 0.018300 0.426819 0.041351 0.004106
```

a linear weighted sum is nice and easy; ie a classic `item3 * alpha + pop_aver * (1-alpha)`

but then how do we pick alpha?

if we were to do this reweighting for item1 or item2 we'd want alpha to be large, ie nearer 1.0, to reflect the confidence we have in their current values since we have lots of data for them. for item3 we'd want alpha to be small, ie nearer 0, to reflect the lack of confidence we have in it.
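the mix itself is trivial; here it is in python with the post's item3 numbers and a hand-picked placeholder alpha (picking alpha sensibly is the real problem):

```python
import numpy as np

# the post's item3 row and the population average
item3 = np.array([0.012821, 0.000000, 0.012821, 0.974358, 0.000000])
pop_aver = np.array([0.509421, 0.018300, 0.426819, 0.041351, 0.004106])

alpha = 0.8  # placeholder; a high alpha trusts the observed proportions
smoothed = item3 * alpha + pop_aver * (1 - alpha)
print(np.round(smoothed, 3))  # still sums to ~1 since both inputs do
```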

enter the confidence interval, a way of testing how confident we are in a set of values.

firstly, a slight diversion re: confidence intervals...

consider three values, 100, 100 and 200. running this goodness of fit test gives the following result.

```
> library(NCStats)
> gofCI(chisq.test(c(100, 100, 200)), conf.level=0.95)
     p.obs   p.LCI   p.UCI
[1,]  0.25 0.21008 0.29468
[2,]  0.25 0.21008 0.29468
[3,]  0.50 0.45123 0.54877
```

you can read the first row of this table as "the count 100 was observed to be 0.25 (p.obs) of the total and i'm 95% confident (conf.level) that the *true* value is between 0.21 (p.LCI = lower confidence interval) and 0.29 (p.UCI = upper confidence interval)".

there are two important things to notice that can change the range of the confidence interval...

1) upping the confidence level results in a wider confidence interval. ie "i'm 99.99% confident the true value is between 0.17 and 0.34, but only 1% confident it's between 0.249 and 0.2502"

```
> gofCI(chisq.test(c(100, 100, 200)), conf.level=0.9999)
     p.obs   p.LCI   p.UCI
[1,]  0.25 0.17593 0.34230
[2,]  0.25 0.17593 0.34230
[3,]  0.50 0.40452 0.59548
> gofCI(chisq.test(c(100, 100, 200)), conf.level=0.01)
     p.obs   p.LCI   p.UCI
[1,]  0.25 0.24973 0.25027
[2,]  0.25 0.24973 0.25027
[3,]  0.50 0.49969 0.50031
```

2) getting more data results in a narrower confidence interval. ie "even though the proportions stay the same as i gather x10, then x100, my original data i can narrow my confidence interval around the observed value"

```
> gofCI(chisq.test(c(10, 10, 20)), conf.level=0.95)
     p.obs   p.LCI   p.UCI
[1,]  0.25 0.14187 0.40194
[2,]  0.25 0.14187 0.40194
[3,]  0.50 0.35200 0.64800
> gofCI(chisq.test(c(100, 100, 200)), conf.level=0.95)
     p.obs   p.LCI   p.UCI
[1,]  0.25 0.21008 0.29468
[2,]  0.25 0.21008 0.29468
[3,]  0.50 0.45123 0.54877
> gofCI(chisq.test(c(1000, 1000, 2000)), conf.level=0.95)
     p.obs   p.LCI   p.UCI
[1,]  0.25 0.23683 0.26365
[2,]  0.25 0.23683 0.26365
[3,]  0.50 0.48451 0.51549
```

so it turns out this confidence interval is exactly what we're after; a way of estimating a pessimistic value (the lower bound) that gets closer to the observed value as the size of the observed data grows.
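the lower bounds printed above appear to match the standard wilson score interval, so a rough pure-python equivalent (my translation, not NCStats code) is only a few lines:

```python
from math import sqrt

def wilson_lower(successes, n, z=1.96):
    """lower bound of the wilson score interval for a binomial proportion"""
    p = float(successes) / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom

# same behaviour as the gofCI runs: more data pulls the lower bound toward 0.25
print(round(wilson_lower(10, 40), 5))    # -> 0.14187
print(round(wilson_lower(100, 400), 5))  # -> 0.21008
```

note the two printed values reproduce the 95% p.LCI numbers from the gofCI output above.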

note: there's a lot of discussion on how best to do these calculations. there is a more "correct" and principled version of this calculation that is provided by MultinomialCI but i found its results weren't as good for my purposes.

awesome, so back to the problem at hand; how do we pick our mixing parameter alpha?

let's extract the lower bound of the confidence interval for our items using a very large confidence level (99.99%) to enforce a wide interval. the original l1-normalised values are shown here again for comparison.

```
> l1
            A       B       C       D       E
item1 0.95583 0.00080 0.04275 0.00044 0.00016
item2 0.06380 0.03588 0.81285 0.07938 0.00807
item3 0.01282 0.00000 0.01282 0.97435 0.00000
> library(NCStats)
> gof_ci_lower = function(x) gofCI(chisq.test(x), conf.level=0.9999)[,2]
> gof_chi_ci = t(apply(data, 1, gof_ci_lower))
> gof_chi_ci
            A       B       C       D       E
item1 0.95048 0.00035 0.03803 0.00015 0.00003
item2 0.05803 0.03156 0.80302 0.07296 0.00614
item3 0.00000 0.00000 0.00000 0.79725 0.00000
```

we see that item1, which had a lot of support data, has dropped its A value only slightly from 0.955 to 0.950 whereas item3, which had very little support, has had its D value drop drastically from 0.97 to 0.79. by using a conf.level closer and closer to 1.0 we make this drop more and more drastic.

because each of the values in the `gof_chi_ci` matrix are lower bounds the rows no longer add up to 1.0 (as they do in the l1-value matrix). we can calculate how much we've "lost" with `1 - sum(rows)` and it turns out this residual is pretty much exactly what we were after for our mixing parameter alpha!

```
> gof_chi_ci$residual = as.vector(1 - apply(gof_chi_ci, 1, sum))
> gof_chi_ci
            A       B       C       D       E residual
item1 0.95048 0.00035 0.03803 0.00015 0.00003  0.01096
item2 0.05803 0.03156 0.80302 0.07296 0.00614  0.02829
item3 0.00000 0.00000 0.00000 0.79725 0.00000  0.20275
```

in the case of item1 the residual is low, ie the confidence interval lower bound was close to the observed value, so we shouldn't mix in much of the population average. but in the case of item3 the residual is high, we lost a lot by the confidence interval being very wide, so we might as well mix in more of the population average.

now what i've said here is completely unprincipled. i just made it up and the maths work because everything is normalised. but having said that the results are really good so i'm going with it :)

putting it all together then we have the following bits of data...

```
> l1  # our original estimates
            A       B       C       D       E
item1 0.95583 0.00080 0.04275 0.00044 0.00016
item2 0.06380 0.03588 0.81285 0.07938 0.00807
item3 0.01282 0.00000 0.01282 0.97435 0.00000
> population_average  # the population average
      A       B       C       D       E
0.50942 0.01830 0.42681 0.04135 0.00410
> gof_chi_ci  # lower bound of our confidences
            A       B       C       D       E
item1 0.95048 0.00035 0.03803 0.00015 0.00003
item2 0.05803 0.03156 0.80302 0.07296 0.00614
item3 0.00000 0.00000 0.00000 0.79725 0.00000
> gof_chi_ci_residual = as.vector(1 - apply(gof_chi_ci, 1, sum))
> gof_chi_ci_residual  # how much we should mix in the population average
[1] 0.01096 0.02829 0.20275 0.40759
```

since there's lots of support for item1 the residual is small, only 0.01, so we smooth only a little of the population average in and end up not changing the values that much

```
> l1[1,]
            A       B       C       D       E
item1 0.95583 0.00080 0.04275 0.00044 0.00016
> gof_chi_ci[1,] + population_average * gof_chi_ci_residual[1]
            A       B       C       D       E
item1 0.95606 0.00055 0.04270 0.00060 0.00007
```

but item3 has a higher residual and so we smooth more of the population average in and it's shifted much more strongly from D towards A and B

```
> l1[3,]
            A       B       C       D       E
item3 0.01282 0.00000 0.01282 0.97435 0.00000
> gof_chi_ci[3,] + population_average * gof_chi_ci_residual[3]
            A       B       C       D       E
item3 0.10329 0.00371 0.08653 0.80563 0.00083
```

one model is based on the idea of an interest graph where the nodes of the graph are users and items and the edges of the graph represent an interest, whatever that might mean for the domain.

if we only allow edges between users and items the graph is bipartite.

let's consider a simple example of 3 users and 3 items; user1 likes item1, user2 likes all three items and user3 likes just item3.

fig1 user / item interest graph

one way to model similarity between items is as follows....

let's consider a token starting at item1. we're going to repeatedly "bounce" this token back and forth between the items and the users based on the interest edges.

so, since item1 is connected to user1 and user2 we'll pick one of them randomly and move the token across. it's 50/50 which of user1 or user2 we end up at (fig2).

next we bounce the token back to the items; if the token had gone to user1 then it has to go back to item1 since user1 has no other edges, but if it had gone to user2 it could go back to any of the three items with equal probability; 1/3rd.

the result of this is that the token has a 0.66 chance of ending up back at item1 and an equal 0.16 chance of ending up at either item2 or item3 (fig3)

fig2 dispersing from item1 to users | fig3 dispersing back from users to items

( note this is different than if we'd started at item2. in that case we'd have gone to user2 with certainty and then it would have been uniformly random which of the items we'd ended up at )

for illustration let's do another iteration...

bouncing back to the users, item1's 0.66 gets split 50/50 between user1 and user2. all of item2's 0.16 goes to user2 and item3 splits its 0.16 between user2 and user3. we end up with fig4 (no, not that figure 4). bouncing back to the items we get to fig5.

fig4 | fig5

if we keep repeating things we converge on the values `{item1: 0.40, item2: 0.20, item3: 0.40}` and these represent the probabilities of ending up at a particular item if we bounced forever.

note since this is convergent it also doesn't actually matter which item we started at; we would always get the same result in the limit.
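this is easy to check numerically. here's a small numpy sketch of the bounce (my reconstruction of fig1's graph as an adjacency matrix):

```python
import numpy as np

# interest edges from fig1: rows are items, columns are users
A = np.array([[1, 1, 0],   # item1: liked by user1, user2
              [0, 1, 0],   # item2: liked by user2
              [0, 1, 1]],  # item3: liked by user2, user3
             dtype=float)

item_to_user = A / A.sum(axis=1, keepdims=True)  # item -> user transition probs
user_to_item = (A / A.sum(axis=0)).T             # user -> item transition probs
bounce = item_to_user @ user_to_item             # one full items -> users -> items bounce

for start in range(3):
    mass = np.zeros(3)
    mass[start] = 1.0
    for _ in range(100):
        mass = mass @ bounce
    print(np.round(mass, 2))  # -> [0.4 0.2 0.4] from every start
```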

to people familiar with power methods this convergence is no surprise. you might also recognise a similarity between this and the most famous power method of them all, pagerank.

so what has this all got to do with item similarity?

well, the values of the probabilities might all converge to the same set regardless of which item we start at **but** each item gets there in different ways.

most importantly we can capture this difference by taking away a bit of probability each iteration of the dispersion.

so, again, say we start at item1. after we go to users and back to items we are at fig3 again.

but this time, before we go back to the users side, let's take away a small proportion of the probability mass, say, 1/4. this would be 0.16 for item1 and 0.04 for item2 and item3. this leaves us with fig6.

fig3 (again) | fig6

we can then repeat iteratively as before, items -> users -> items -> users. but each time we are on the items side we take away 1/4 of the mass until it's all gone.

| iteration | taken from item1 | taken from item2 | taken from item3 |
|-----------|------------------|------------------|------------------|
| 1         | 0.16             | 0.04             | 0.04             |
| 2         | 0.09             | 0.04             | 0.05             |
| 3         | 0.06             | 0.02             | 0.05             |
| ...       | ...              | ...              | ...              |
| final sum | 0.50             | 0.20             | 0.30             |

if we do the same for item2 and item3 we get different values...

| starting at | total taken from item1 | total taken from item2 | total taken from item3 |
|-------------|------------------------|------------------------|------------------------|
| item1       | 0.50                   | 0.20                   | 0.30                   |
| item2       | 0.38                   | 0.24                   | 0.38                   |
| item3       | 0.30                   | 0.20                   | 0.50                   |

finally these totals can be used as features for a pairwise comparison of the items. intuitively we can see that, for whatever row-wise similarity function we might choose, sim(item1, item3) > sim(item1, item2) or sim(item2, item3)
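the whole taken-with-decay scheme fits in a few lines of numpy (again my reconstruction using the same fig1 graph; the totals come out near the 0.50 / 0.20 / 0.30 row for item1):

```python
import numpy as np

# same bipartite graph as fig1: rows are items, columns are users
A = np.array([[1, 1, 0],
              [0, 1, 0],
              [0, 1, 1]], dtype=float)
item_to_user = A / A.sum(axis=1, keepdims=True)
user_to_item = (A / A.sum(axis=0)).T
bounce = item_to_user @ user_to_item

def taken_totals(start_item, decay=0.25, iters=200):
    mass = np.zeros(3)
    mass[start_item] = 1.0
    taken = np.zeros(3)
    for _ in range(iters):
        mass = mass @ bounce     # items -> users -> items
        taken += decay * mass    # siphon off a quarter on the item side
        mass *= 1.0 - decay      # the rest keeps bouncing
    return taken

print(np.round(taken_totals(0), 2))  # starting at item1: ~[0.5 0.19 0.3]
```

by symmetry of the graph, starting at item3 gives the item1 row reversed, which is exactly what the table shows.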

one last thing to consider is that the amount of decay, 1/4 in the above example, is of course configurable, and values between 0.0 and 1.0 give different results.

a very low value, ~= 0.0, reproduces the limit behaviour, with all items ending up classed the same. a high value, ~= 1.0, stops the iterations after only one "bounce" and represents the minimal amount of dispersal.
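the whole take-some-mass-away-each-bounce procedure is only a few lines of numpy. a minimal sketch, again assuming a hypothetical one-bounce transition matrix P (rows summing to 1) rather than the actual user/item graph from the figures:

```
import numpy as np

def decayed_dispersion(P, start, decay=0.25, tol=1e-12):
    # P[i][j] describes where the mass at item i ends up after one
    # items -> users -> items bounce
    n = len(P)
    mass = np.zeros(n)
    mass[start] = 1.0
    taken = np.zeros(n)
    while mass.sum() > tol:
        mass = mass @ P          # disperse: items -> users -> items
        removed = decay * mass   # take away a proportion of each item's mass
        taken += removed
        mass = mass - removed
    return taken

# hypothetical transition matrix for a three item example
P = np.array([[0.6, 0.2, 0.2],
              [0.4, 0.2, 0.4],
              [0.2, 0.2, 0.6]])
```

the totals always sum to 1.0 ( we keep removing a fixed fraction until it's all gone ) but different starting items give different vectors, which is exactly what makes them usable as features.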

the first thing we need to do is determine which segments of the crawl are valid and ready for use (the crawl is always ongoing)

```
$ s3cmd get s3://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt
$ head -n3 valid_segments.txt
1341690147253
1341690148298
1341690149519
```

given these segment ids we can look up the related textData objects.

if you just want one, grab its name using something like ...

```
$ s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690147253/ 2>/dev/null \
| grep textData | head -n1 | awk '{print $4}'
s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690147253/textData-00000
```

but if you want the lot you can get the listing with ...

```
$ cat valid_segments.txt \
| xargs -I{} s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/segment/{}/ \
| grep textData | awk '{print $4}' > all_valid_segments.tsv
```

( note: this listing is roughly 200,000 textData files and takes a while to fetch )

each textData file is a hadoop sequence file, the key being the crawled url and the value being the extracted visible text.

to have a quick look at one you can get hadoop to dump the sequence file contents with ...

```
$ hadoop fs -text textData-00000 | less
http://webprofessionals.org/intel-to-acquire-mcafee-moving-into-online-security-ny-times/ Web Professionals
Professional association for web designers, developers, marketers, analysts and other web professionals.
Home
...
The company’s share price has fallen about 20 percent in the last five years, closing on Wednesday at $19.59 a share.
Intel, however, has been bulking up its software arsenal. Last year, it bought Wind River for $884 million, giving it a software maker with a presence in the consumer electronics and wireless markets.
With McAfee, Intel will take hold of a company that sells antivirus software to consumers and businesses and a suite of more sophisticated security products and services aimed at corporations.
```

( note: the visible text is broken into *one line* per block element of the original html. as such the value in each key/value pair includes carriage returns and, in something like less, is output as separate lines )

now that we have some text, what can we do with it? one thing is to look for noun phrases, and the simplest way to get started is something like the python natural language toolkit. it's certainly not the fastest to run but for most people it's the quickest to get going.

extract_noun_phrases.py is an example of doing sentence then word tokenisation, pos tagging and finally noun chunk phrase extraction.
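extract_noun_phrases.py isn't reproduced in full here, but a minimal sketch of the chunking step with nltk might look like the following ( the regexp grammar is just an illustrative guess, not necessarily the one the script uses ):

```
import nltk

# simple noun phrase grammar: optional determiner, any number of
# adjectives, then one or more nouns (NN, NNS, NNP, NNPS)
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

def noun_phrases(tagged_sentence):
    # tagged_sentence is a list of (word, pos_tag) pairs, e.g. the
    # output of nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = chunker.parse(tagged_sentence)
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == "NP")]
```

the full pipeline runs nltk.sent_tokenize over the text first, then word tokenisation and pos tagging per sentence, then this chunker over the tagged tokens.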

given the text ...

```
Last year, Microsoft bought Wind River for $884 million. This makes it the largest software maker with a presence in North Kanada.
```

it extracts the noun phrases ...

```
Microsoft
Wind River
North Kanada
```

to run this at larger scale we can wrap it in a simple streaming job

```
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input textDataFiles \
-output counts \
-mapper extract_noun_phrases.py \
-reducer aggregate \
-file extract_noun_phrases.py
```
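a detail worth knowing: `-reducer aggregate` above is hadoop streaming's built-in aggregate package, which expects the mapper to emit records with an aggregator prefix. a sketch of the mapper's output side ( the function name is just illustrative; the phrase extraction itself is elided ):

```
def to_aggregate_records(phrases):
    # hadoop streaming's built-in `aggregate` reducer sums up records
    # emitted by the mapper in the form LongValueSum:<key>\t<count>
    return ["LongValueSum:%s\t1" % phrase for phrase in phrases]

# the real mapper reads lines from sys.stdin, extracts the noun phrases
# from each and prints one of these records per phrase
```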

running it across a small 50mb sample of textData files, the top noun phrases extracted were ...

| rank | phrase | freq |
|---|---|---|
| 1 | Posted | 10094 |
| 2 | November | 9597 |
| 3 | February | 9553 |
| 4 | Copyright | 8929 |
| 5 | September | 8726 |
| 6 | January | 8709 |
| 7 | April | 8434 |
| 8 | August | 8307 |
| 9 | October | 7963 |
| 10 | December | 7963 |

this is not terribly interesting, and the main thing going on here is that these phrases are just being extracted from the boilerplate of the pages. one tough problem when dealing with visible text on a web page is that it might be visible but that doesn't mean it's relevant to the actual content of the page. seeing 'posted' and 'copyright' here means we're just extracting the chrome of the page.

check out the full list of values with freq >= 20 here; there are some more interesting ones a bit later in the list.

so it's fun to look at noun phrases but i've actually brushed over some key details here

- not filtering on english text first generates a *lot* of "noise". "G úûv ÝT M", "U ŠDú T" and "Y CKdñˆô" are not terribly interesting english noun phrases.
- running this at scale you'd probably want to move away from streaming and use an in-process java library like the stanford parser
- when it comes to actually doing named entity recognition it's a bit more complex. there's a wavii blog post from manish that talks a bit more about it.

( recall jaccard(set1, set2) = |intersection| / |union|. when set1 == set2 this evaluates to 1.0 and when set1 and set2 have no intersection it evaluates to 0.0 )

one thing that's always annoyed me about it though is that it loses any sense of partial similarity. as a set based measure it's all or nothing.

so consider the sets *set1 = {i1, i2, i3}* and *set2 = {i1, i2, i4}*

jaccard(set1, set2) = 2/4 = 0.5 which is fine given you have *no* prior info about the relationship between i3 and i4.
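as a reminder, plain jaccard is tiny:

```
def jaccard(s1, s2):
    # |intersection| / |union|
    return len(s1 & s2) / float(len(s1 | s2))

# jaccard(set([1, 2, 3]), set([1, 2, 4])) = 2/4 = 0.5
```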

but what if you have a similarity function, s, and s(i3, i4) ~= 1.0? in this case you don't want a jaccard of 0.5, you want something closer to 1.0. by saying i3 ~= i4 you're saying the sets are almost the same.

after lots of googling i couldn't find a jaccard variant that supports this idea so i rolled my own. the idea is that we want to count the values in the complement of the intersection not as 0.0 on the jaccard numerator but as some value between 0.0 and 1.0 based on the similarity of the elements. after some experiments i found that just counting each as the root mean square of the pairwise sims between them all works pretty well. i'd love to know the name of this technique (or any similar better one) so i can read some more about it.

```
from math import sqrt

def fuzzy_jaccard(s1, s2, sim):
    union = s1.union(s2)
    intersection = s1.intersection(s2)
    # calculate root mean square sims between elements in just s1 and just s2
    just_s1 = s1 - intersection
    just_s2 = s2 - intersection
    sims = [sim(i1, i2) for i1 in just_s1 for i2 in just_s2]
    sqr_sims = [s * s for s in sims]
    root_mean_sqr_sim = sqrt(float(sum(sqr_sims)) / len(sqr_sims)) if sqr_sims else 0.0
    # use this root_mean_sqr_sim to count the values in the complement as,
    # in some way, being "partially" in the intersection
    intersection_complement_size = len(just_s1) + len(just_s2)
    return float(len(intersection) + (root_mean_sqr_sim * intersection_complement_size)) / len(union)
```

looking at our example of *{i1, i2, i3}* vs *{i1, i2, i4}*...

when s(i3, i4) = 0.0 it degenerates to normal jaccard and scores 0.5

```
print fuzzy_jaccard(set([1,2,3]), set([1,2,4]), lambda i1, i2: 0.0) # = 0.5 (2/4) ie normal jaccard
```

when s(i3, i4) = 1.0 it treats the values as the same and scores 1.0

```
print fuzzy_jaccard(set([1,2,3]), set([1,2,4]), lambda i1, i2: 1.0) # = 1.0 (4/4) treating i3 == i4
```

when s(i3, i4) = 0.8 it scores in between with 0.9

```
print fuzzy_jaccard(set([1,2,3]), set([1,2,4]), lambda i1, i2: 0.8) # = 0.9 (3.6/4)
```

this is great for me because now, given an appropriate similarity function, i'm able to get a lot more discrimination between sets.
