the first thing i thought when we set up our bee hive was "i wonder how you could count the number of bees coming and going?"

after a little research i discovered that no one seems to have a good non-intrusive system for doing it yet. apparently it can be useful for all sorts of hive health checking.

the first thing to do was collect some sample data. a raspberry pi, a standard pi camera and a solar panel are a pretty simple rig to get going, and at 1 frame every 10 seconds you get 5,000+ images over a day (6am to 9pm).

here's an example image... how many bees can you count?

the second thing was to decide exactly what i was trying to get the neural net to do. if the task is "count bees in an image" you could arguably try to regress directly to the number but it didn't feel like the easiest thing to start with and it doesn't allow any fun tracking of individual bees over frames. instead i decided to focus on localising every bee in the image.

a quick sanity check of an off the shelf single shot multi box detector didn't give great results. kinda not surprisingly, especially given the density of bees around the hive entrance. (protip: transfer learning is not the answer to everything) but that's ok; i have a super constrained image, only have 1 class to detect and don't actually care about a bounding box as such, just whether a bee is there or not. can we do something simpler?

my first quick experiment was a patch based "bee / no bee in image" detector; i.e. given an image patch, what's the probability there is at least 1 bee in it? doing this as a fully convolutional net on very small patches meant it could easily run on full res data. this approach kinda worked but was failing around the hive entrance where there is a much denser region of bees.

i quickly realised this could easily be framed instead as an image to image translation problem. the input is the RGB camera image and the output is a single channel image where a "white" pixel denotes the center of a bee.

RGB input (cropped) | single channel output (cropped) |

step three was labelling. it wasn't too hard to roll a little TkInter app for selecting / deselecting bees on an image and saving the results in a sqlite database. i spent quite a bit of time getting this tool right; anyone who's done a reasonable amount of hand labelling knows the difference it can make :/ as we'll see later, having access to a lot of samples meant i could get quite a good result with semi supervised approaches.

the architecture of the network is a very vanilla u-net.

- a fully convolutional network trained on half resolution patches but run against full resolution images
- encoding is a sequence of 4 3x3 convolutions with stride 2
- decoding is a sequence of nearest neighbours resizes + 3x3 convolution (stride 1) + skip connection from the encoders
- final layer is a 1x1 convolution (stride 1) with sigmoid activation (i.e. binary bee / no bee choice per pixel)

after some empirical experiments i chose to only decode back to half the resolution of the input. it was good enough.

i did the decoding using a nearest neighbour resize instead of a deconvolution pretty much out of habit.
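as a concrete sketch, nearest neighbour upsampling is just pixel repetition; here's a throwaway numpy version (my own illustration, not the actual training code):

```python
import numpy as np

def nn_upsample(x, factor=2):
    # nearest neighbour resize of a (H, W, C) volume: repeat each
    # pixel `factor` times along both spatial axes.
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

x = np.arange(4, dtype=np.float32).reshape(2, 2, 1)
y = nn_upsample(x)
print(y.shape)  # (4, 4, 1)
```

in the real network this resize is followed by a stride 1 convolution, which together play the role a deconvolution would.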

the network was trained with Adam and it's small enough that batch norm didn't seem to help much. i was surprised how simple and how few filters i could get away with.

i applied the standard sort of data augmentation you'd expect; random rotation & image colour distortion. the patch based training approach means we inherently get a form of random cropping. i didn't flip the images since i've always got the camera on the same side of the hive.

one subtle aspect was the need to post process the output predictions. with a probabilistic output we get a blurry cloud around where bees might be. to convert this into a hard one-bee-one-pixel decision i added thresholding + connected components + centroid detection using the skimage measure module. this bit was hand rolled and tuned purely by eyeballing results; it could totally be included in the end to end stack as a learnt component though. TODO :)
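the actual post processing used the skimage measure module, but the idea fits in a few lines; a pure numpy sketch (hypothetical `centroids` helper, with a hand rolled BFS in place of skimage's connected components):

```python
import numpy as np
from collections import deque

def centroids(prob, threshold=0.5):
    # threshold the per-pixel bee probabilities, find 4-connected
    # components, and return one (row, col) centroid per component.
    mask = prob > threshold
    seen = np.zeros_like(mask, dtype=bool)
    found = []
    h, w = mask.shape
    for r in range(h):
        for c in range(w):
            if mask[r, c] and not seen[r, c]:
                # BFS to collect this connected component's pixels
                queue, pixels = deque([(r, c)]), []
                seen[r, c] = True
                while queue:
                    pr, pc = queue.popleft()
                    pixels.append((pr, pc))
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        nr, nc = pr + dr, pc + dc
                        if 0 <= nr < h and 0 <= nc < w and mask[nr, nc] and not seen[nr, nc]:
                            seen[nr, nc] = True
                            queue.append((nr, nc))
                found.append(tuple(np.mean(pixels, axis=0)))
    return found

prob = np.zeros((6, 6))
prob[1:3, 1:3] = 0.9   # one blob of high probability
prob[4, 4] = 0.8       # a lone pixel
print(centroids(prob))  # [(1.5, 1.5), (4.0, 4.0)]
```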

input | raw model output | cluster centroids |

my initial experiments were with images over a short period of a single day. it was very easy to get a model running extremely well on this data with a small number of labelled images (~30).

day 1 sample 1 | day 1 sample 2 | day 1 sample 3 |

things got more complicated when i started to include longer periods over multiple days. one key aspect was the lighting difference (time of day as well as different weather). another was that i was putting the camera out manually each day and just sticking it in roughly the same spot with blu-tack. a third, more surprising, difference was that with the grass growing the tops of dandelions apparently look a lot like bees (i.e. the first round of trained models hadn't seen them and when they appeared they were a constant false positive)

most of this was already handled by data augmentation and none of it was a show stopper. in general the data doesn't have much variation, which is great since that allows us to get away with a simple network and training scheme.

day 1 sample | day 2 sample | day 3 sample |

this image shows an example of the predictions. it's interesting to note this is a case where there were many more bees than any image i labelled, a great validation that the fully convolutional approach trained on smaller patches works.

it does ok across a range of detections; i imagine the lack of diversity in the background is a biiiiig help here and that running this network on some arbitrary hive wouldn't be anywhere near as good.

high density around entrance | varying bee sizes | high speed bees! |

the ability to get a large number of images makes this a great candidate for semi supervised learning.

a very simple approach of ...

- capture 10,000 images
- label 100 images & train `model_1`
- use `model_1` to label the other 9,900 images
- train `model_2` with all 10,000 "labelled" images

... results in a `model_2` that does better than `model_1`.

here's an example. note that model_1 has some false positives (far left center & blade of grass) and some false negatives (bees around the hive entrance)

model_1 | model_2 |

this kind of data is also a great example of when correcting a bad model is quicker than labelling from scratch...

- label 10 images & train model
- use model to label next 100 images
- use labelling tool to *correct* the labels of these 100 images
- retrain model with 110 images
- repeat ....

this is a very common pattern i've seen and it sometimes makes you rethink your labelling tool a bit..

being able to locate bees means you can count them! this makes for fun graphs like this that show the number of bees over a day. i love how they all get busy and race home around 4pm :)

running a model for inference on the pi was a big part of this project.

the first baseline was to freeze the tensorflow graph and just run it directly on the pi. this works without any problem, it's just the pi can only do 1 image / second :/

i was very excited about the possibility of getting this model to run on the pi using a movidius neural compute stick. they are an awesome bit of kit.

sadly it doesn't work :/ their API for converting a tensorflow graph to their internal model format doesn't support the way i was doing decoding, so i had to switch the upsizing from nearest neighbour resampling to a deconvolution. that's not a big deal, except it still doesn't work; there are a bunch of little problems i've got bug reproduction cases for. once they're fixed i can revisit...

this led me to the third version of my model; can we regress directly from the RGB input to a count of the bees? if we do this we can avoid any problems relating to unsupported ops & kernel bugs on the neural compute stick, though it's unlikely this will be as good as the centroids approach of v2.

i was originally wary of trying this since i expected it would take a lot more labelling (it's no longer a patch based system). however! given a model that does pretty well at locating bees, and a lot of unlabelled data, we can make a pretty good synthetic dataset by applying the v2 rgb image -> bee location model and just counting the number of detections.

this model was pretty easy to train and gives reasonable results... (though it's not as good as just counting the centroids detected by the v2 model)

sample actual vs predicted for some test data ...

actual | 40 | 19 | 16 | 15 | 13 | 12 | 11 | 10 | 8 | 7 | 6 | 4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
v2 (centroids) predicted | 39 | 19 | 16 | 13 | 13 | 14 | 11 | 8 | 8 | 7 | 6 | 4 |
v3 (raw count) predicted | 33.1 | 15.3 | 12.3 | 12.5 | 13.3 | 10.4 | 9.3 | 8.7 | 6.3 | 7.1 | 5.9 | 4.2 |

... but unfortunately *still* doesn't work on the NCS (it runs, i just can't get it to give anything but random results). i've generated some more bug reproduction cases and again will come back to it... eventually...

as always there's still a million things to tinker with...

- get things running on the neural compute stick; pending some work on their side...
- get the entire thing ported to the je vois embedded camera. i've done a bit of tinkering with this but wanted to have the NCS working as a baseline first. i want 120fps bee detection!!!
- tracking bees over multiple frames / with multiple cameras for optical flow visualisation
- more detailed study of benefit of semi supervised approach and training a larger model to label for a smaller model
- investigate power usage of the NCS; how to factor that into hyperparam tuning?
- switch to building a small version of farm.bot for doing some cnc controlled seedling genetic experiments (i.e. something completely different)

all the code for this is on github

the most familiar form of a convolutional network to most people is the type used for classifying images.

we can think of these types of networks as being made up of two halves.

the first is a sequence of convolutional layers with some form of spatial downsampling; e.g. pooling or having a stride > 1 ...

layer | output shape |
---|---|
some input | (64, 64, 3) |
a convolution; stride 2, 8 kernels | (32, 32, 8) |
and another (16 kernels) | (16, 16, 16) |
and another (32 kernels) | (8, 8, 32) |

... followed by a second half which is a sequence of fully connected layers ...

layer | output shape |
---|---|
output from convolutions | (8, 8, 32) |
flattened | (2048) |
fully connected to 128 | (128) |
fully connected to 10 | (10) |

(note: here, and following, we're going to ignore any leading batch dimension)

in these networks the first half "squeezes" spatial information into depth information while the second half acts as a standard classifier.

one property of any fully connected layer is that the number of parameters is dictated by the input size; in this example of a classifier it's the flattened size of the final volume of the first half (the 2048d vector)

this idea of the number of parameters being linked to the input size is **not** the case for the layers in
the first half though; there the number of parameters is not dictated by the input size but instead by the kernel size
and number of output channels. specifically the spatial size of the input doesn't matter.

e.g. using pooling for downsampling for any arbitrary (H, W) ...

layer | output shape |
---|---|
input | (H, W, 3) |
convolution, stride=1, #kernels=5 | (H, W, 5) |
pooling, size=2 | (H//2, W//2, 5) |

... vs stride > 1 for downsampling.

layer | output shape |
---|---|
input | (H, W, 3) |
convolution, stride=2, #kernels=5 | (H//2, W//2, 5) |
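the shape arithmetic here is easy to sanity check with a throwaway helper (mine, purely for illustration):

```python
def conv_out(size, kernel, stride, pad):
    # spatial output size of a convolution:
    # floor((size + 2*pad - kernel) / stride) + 1
    return (size + 2 * pad - kernel) // stride + 1

# a stride=1 conv (padding preserving size) followed by size-2 pooling,
# vs a single stride=2 conv; both take H to H//2 for even H
H = 64
print(conv_out(H, kernel=3, stride=1, pad=1) // 2)  # 32 (conv then pool)
print(conv_out(H, kernel=3, stride=2, pad=1))       # 32 (strided conv)
```

note that nothing in this arithmetic depends on the kernel weights themselves; the parameter count is fixed while the spatial size just flows through.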

so in our original example the first half of the network going from `(64, 64, 3)` to `(8, 8, 32)` *could* actually be applied to an input of any spatial size. if for example we gave it an input of `(128, 128, 3)` we'd get an output of `(16, 16, 32)`. **but!** you wouldn't be able to run this `(16, 16, 32)` through the second classifier half, since the flattened tensor would now be the wrong size (8,192 instead of 2,048).

now let's consider the common architecture for an image to image network. again it has two halves.

the first half is like the prior example; some convolutions with a form of downsampling as a way of trading spatial information for channel information.

but the second half isn't a classifier, it's the reverse of the first half; a sequence of convolutions with some form of upsampling. this upsampling can be either deconvolutions with a stride > 1 or something like nearest neighbour upsampling.

e.g.

layer | output shape |
---|---|
some input | (64, 64, 3) |
convolution | (64, 64, 8) |
pooling | (32, 32, 8) |
convolution | (32, 32, 16) |
pooling | (16, 16, 16) |
nearest neighbour upsample resize | (32, 32, 16) |
convolution | (32, 32, 8) |
nearest neighbour upsample resize | (64, 64, 8) |
convolution | (64, 64, 3) |

we can see that *none* of these operations require a fixed spatial size, so it's fine to apply them to an image of whatever size, even something like `(128000, 128000, 3)`, which would produce an output of `(128000, 128000, 3)`. this ability to run on enormous inputs is a great trick when you're dealing with huge image data like medical scans.

so what does it mean then for a network to be "fully convolutional"? for me it's basically not using any operations that require a fixed tensor size as input.

in the above example we'd say we're training on "patches" of `(64, 64)`. these would probably be random crops within a larger image and, note, that means each training image doesn't even need to be the same resolution or aspect ratio (as long as it's larger than 64x64).
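to make the "no fixed size" point concrete, here's a tiny (and deliberately slow) numpy convolution, my own sketch rather than anything from the actual code, run with the same weights at two different input sizes:

```python
import numpy as np

def conv2d(x, kernels, stride=2):
    # minimal valid-padding convolution; x is (H, W, Cin) and
    # kernels is (kh, kw, Cin, Cout). the parameter count depends
    # only on `kernels`, never on the spatial size (H, W).
    kh, kw, cin, cout = kernels.shape
    h = (x.shape[0] - kh) // stride + 1
    w = (x.shape[1] - kw) // stride + 1
    out = np.zeros((h, w, cout))
    for i in range(h):
        for j in range(w):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.tensordot(patch, kernels, axes=3)
    return out

kernels = np.random.randn(3, 3, 3, 8)  # 3x3 kernels, RGB in, 8 channels out
print(conv2d(np.random.randn(64, 64, 3), kernels).shape)    # (31, 31, 8)
print(conv2d(np.random.randn(128, 128, 3), kernels).shape)  # (63, 63, 8)
```

the same 216 weights happily process either input; only the output's spatial size changes.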

a 1x1 kernel in a convolutional layer at first appears a bit strange. why would you bother?

consider a volume of `(1, 1, 3)` that we apply a 1x1 convolution to. with 5 kernels we'd end up with a volume of `(1, 1, 5)`. an interesting interpretation of this is that it's exactly equivalent to having a fully connected layer between 3 inputs and 5 outputs.

a volume then of, say, `(10, 20, 3)` that we apply this same convolution to gives a volume of `(10, 20, 5)`, so what we're doing is equivalent to applying that same fully connected "layer" *per pixel* across the `(10, 20)` input.
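this equivalence is easy to verify numerically; a small numpy check (my illustration, ignoring biases and activations):

```python
import numpy as np

# a 1x1 convolution over (10, 20, 3) with 5 output channels is just
# the same 3 -> 5 fully connected layer applied at every pixel
w = np.random.randn(3, 5)          # the "fully connected" weights
x = np.random.randn(10, 20, 3)

conv_1x1 = np.einsum('hwc,cd->hwd', x, w)  # as a 1x1 convolution
per_pixel_fc = np.array([[x[i, j] @ w for j in range(20)]
                         for i in range(10)])  # fc layer per pixel

print(conv_1x1.shape)                       # (10, 20, 5)
print(np.allclose(conv_1x1, per_pixel_fc))  # True
```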

tie this in with the idea of the fully convolutional network...

layer | output shape |
---|---|
some input | (64, 64, 3) |
some convolutions + downsampling | (32, 32, 8) |
more convolutions + downsampling | (16, 16, 16) |
more convolutions + downsampling | (8, 8, 32) |
a 1x1 convolution, stride=1, 10 kernels | (8, 8, 10) |
a 1x1 convolution, stride=1, 1 kernel & sigmoid activation | (8, 8, 1) |

what we've got is like our original classifier example but operating in a fully convolutional way; the final 1x1 convolutions are the same as a sequence of fully connected layers going from 32 to 10 to 1 output, but applied across an 8x8 grid in parallel.

and as before we'd be able to train this on an input of `(64, 64, 3)` with an output of `(8, 8, 1)` but apply it to whatever multiple of the input size we'd like. e.g. an input of `(640, 320, 3)` would result in an output of `(80, 40, 1)`.

we can think of this final `(80, 40, 1)` as kind of similar to a 10x5 grid of the `(8, 8, 1)` outputs; a heat map across whatever the network is detecting.

the papers where i first saw these ideas were OverFeat (Sermanet et al) & Network in Network (Lin et al)

one really useful visualisation you can do while training a network is to visualise the norms of the variables and gradients.

how are they useful? some random things that immediately come to mind include the fact that...

- diverging norms of variables might mean you haven't got enough regularisation.
- zero norm gradient means learning has somehow stopped.
- exploding gradient norms means learning is unstable and you might need to clip (hellloooo deep reinforcement learning).

let's consider a simple bounding box regression conv net (the specifics aren't important, i just grabbed this from another project, just needed something for illustration) ...

```python
# (256, 320, 3) input image
model = slim.conv2d(images, num_outputs=8, kernel_size=3, stride=2,
                    weights_regularizer=l2(0.01), scope="c0")
# (128, 160, 8)
model = slim.conv2d(model, num_outputs=16, kernel_size=3, stride=2,
                    weights_regularizer=l2(0.01), scope="c1")
# (64, 80, 16)
model = slim.conv2d(model, num_outputs=32, kernel_size=3, stride=2,
                    weights_regularizer=l2(0.01), scope="c2")
# (32, 40, 32)
model = slim.conv2d(model, num_outputs=4, kernel_size=1, stride=1,
                    weights_regularizer=l2(0.01), scope="c3")
# (32, 40, 4)  1x1 bottleneck to get the number of params down between c2 & h0
model = slim.dropout(model, keep_prob=0.5, is_training=is_training)
# (5120,)  32x40x4 -> 32 is where the majority of params are,
# so it's going to be most prone to overfitting
model = slim.fully_connected(model, num_outputs=32,
                             weights_regularizer=l2(0.01), scope="h0")
# (32,)
model = slim.fully_connected(model, num_outputs=4, activation_fn=None, scope="out")
# (4,) = bounding box (x1, y1, dx, dy)
```

a simple training loop using feed_dict would be something along the lines of ...

```python
optimiser = tf.train.AdamOptimizer()
train_op = optimiser.minimize(loss=some_loss)
with tf.Session() as sess:
    while True:
        _ = sess.run(train_op, feed_dict=blah)
```

but if we want to get access to gradients we need to do things a little differently and call `compute_gradients` and `apply_gradients` ourselves ...

```python
optimiser = tf.train.AdamOptimizer()
gradients = optimiser.compute_gradients(loss=some_loss)
train_op = optimiser.apply_gradients(gradients)
with tf.Session() as sess:
    while True:
        _ = sess.run(train_op, feed_dict=blah)
```

with access to the gradients we can inspect them and create tensorboard summaries for them ...

```python
optimiser = tf.train.AdamOptimizer()
gradients = optimiser.compute_gradients(loss=some_loss)

l2_norm = lambda t: tf.sqrt(tf.reduce_sum(tf.pow(t, 2)))
for gradient, variable in gradients:
    tf.summary.histogram("gradients/" + variable.name, l2_norm(gradient))
    tf.summary.histogram("variables/" + variable.name, l2_norm(variable))

train_op = optimiser.apply_gradients(gradients)
with tf.Session() as sess:
    summaries_op = tf.summary.merge_all()
    summary_writer = tf.summary.FileWriter("/tmp/tb", sess.graph)
    for step in itertools.count():
        _, summary = sess.run([train_op, summaries_op], feed_dict=blah)
        summary_writer.add_summary(summary, step)
```

( though we may only want to run the expensive `summaries_op` once in awhile... )

with logging like this we get 8 histogram summaries per layer; the cross product of

- layer weights vs layer biases
- variable vs gradients
- norms vs values

e.g. for conv layer c3 in the above model we get the summaries shown below. note: nothing terribly interesting in this example, but a couple of things

- red : very large magnitude of gradient very early in training; this is classic variable rescaling.
- blue: non zero gradients at end of training, so stuff still happening at this layer in terms of the balance of l2 regularisation vs loss. (note: no bias regularisation means it'll continue to drift)

sometimes the histograms aren't enough and you need to do some more serious plotting. in these cases i hackily wrap the gradient calc in tf.Print and plot with ggplot

e.g. here's some gradient norms from an old actor / critic model (cartpole++)

on a related note you can also explicitly write summaries which is sometimes easier to do than generating the summaries through the graph.

i find this especially true for image summaries where there are many pure python options for post processing with, say, PIL

e.g. explicit scalar values

```python
summary_writer = tf.summary.FileWriter("/tmp/blah")
summary = tf.Summary(value=[
    tf.Summary.Value(tag="foo", simple_value=1.0),
    tf.Summary.Value(tag="bar", simple_value=2.0),
])
summary_writer.add_summary(summary, step)
```

e.g. explicit image summaries using PIL post processing

```python
summary_values = []  # (note: could already contain simple_values like above)
for i in range(6):
    # wrap np array with PIL image and canvas
    img = Image.fromarray(some_np_array_probably_output_of_network[i])
    canvas = ImageDraw.Draw(img)
    # draw a box in the top left
    canvas.line([0,0, 0,10, 10,10, 10,0, 0,0], fill="white")
    # write some text
    canvas.text(xy=(0,0), text="some string to add to image", fill="black")
    # serialise out to an image summary
    sio = StringIO.StringIO()
    img.save(sio, format="png")
    image = tf.Summary.Image(height=256, width=320,
                             colorspace=3,  # RGB
                             encoded_image_string=sio.getvalue())
    summary_values.append(tf.Summary.Value(tag="img/%d" % i, image=image))
summary_writer.add_summary(tf.Summary(value=summary_values), step)
```

as crazy as it might sound, careerwise at least, i'm leaving google brain and we're moving back to australia. #sad~~Panda~~Kangaroo. it's been a super fun 6 years in the US but our move was never going to be permanent and it feels like now is the right time for the family. believe me it's hard to leave a joint google brain / X robotics project involving deep reinforcement learning robots. hard i say!

where will we be going? back to melbourne where we lived for the 6 years prior to coming here. we're keen to try something different so we've bought a farm about an hour out of the city. my wife and i both grew up in semi rural settings so we have some idea of what to expect. our kids are excited their backyard is about to grow by a factor of x250.

what will i be doing? i actually have no idea. i'm a pretty applied person, as opposed to a hardcore researcher, and have experience in a range of areas so my resume looks ok ( even if my linkedin avatar is the i-have-no-idea-what-im-doing-dog ) some recent robotics experience + lots of machine learning + moving to a farm might result in some interesting ideas. remote work is also a strong possibility; i think there is value i could add to a number of US companies even from across the ocean. to be honest i haven't thought about it too much yet, want to focus on getting everyone home as smoothly as possible first.

what's the tech scene like in melbourne? seemed fun when i was there, lots of smart people and i think interest in machine learning has only been growing. the tech talk i did at our little data science group just before moving here was half a dozen people, the tech talk i did in melbourne about 1 year ago was hundreds of people. when i did neural networks at uni in the late 90s it was embarrassing for the next 10 years to talk about it but these days it seems everyone is wanting to use them in some form.

we have about 7 weeks before we leave the bay area so i hope i get to catch up with everyone before we go! beers in the city sometime soon!

it's a simple enough sounding problem; given a cart in 2d, with a pole balancing on it, move the cart left and right to keep the pole balanced (i.e. within a small angle of vertical).

the entire problem can be described by just 4 input variables: the cart position, the cart velocity, the pole angle and the pole angular velocity, but even still it's surprisingly difficult to solve. there are loads of implementations of it and if you want to tinker i'd *highly* recommend starting with openai's gym version

as soon as you've solved it though you might want to play with a more complex physics environment and for that you need a serious physics simulator, e.g. the industry standard (and fully open sourced) bullet physics engine.

the simplest cartpole i could make includes 1) the ground 2) a cart (red) and 3) a pole (green). the blue line here denotes the z-axis. ( all code and repro instructions are on github )

- the cart and pole move in 3d, not 1d.
- the pole is *not* connected to the cart (and since it's relatively light it makes for some crazy dynamics...)
- each run is started with a push of the cart in a random direction.

there are two state representations available; a low dimensional one based on the cart & pole pose and a high dimensional one based on raw pixels.

in both representations we use the idea of action repeats; per env.step we apply the chosen action 5 times, take a state snapshot, apply the action another 5 times and take another snapshot. the delta between these two snapshots provides enough information to infer velocity (if the learning algorithm finds that useful to do)

- the low dimensional state is (2, 2, 7)
- axis=0 represents the two snapshots; 0=first, 1=second
- axis=1 represents the object; 0=cart, 1=pole
- axis=2 is the 7d pose; 3d position + 4d quaternion orientation
- this representation is usually just flattened to (28,) when used

- the high dimensional state is (50, 50, 6)
- `[:,:,0:3]` (the first three channels) is the RGB of a 50x50 render at the first snapshot
- `[:,:,3:6]` (the second three channels) is the RGB of a 50x50 render at the second snapshot
- ( TODO: i concatted in the channel axis for ease of use with conv2d but conv3d is available and i should switch )

an example of the sequence of 50x50 renderings as seen by the network is the following (though the network doesn't see the overlaid debugging info)

there are two basic methods for control; discrete and continuous

- the discrete control version uses 5 discrete actions; don't push, push up, down, left, right
- ( i've included a "dont move" action since, once the pole is balanced, the best thing to do is stop )

- the continuous control version uses a 2d action; the forces to apply in the x & y directions.

reward is just +1 for each step the pole is still upright

running this cartpole simulation with random actions gives pretty terrible performance with either discrete or continuous control. we're lucky to get 5 or 10 steps (of a maximum 200 for our episode). in the video each time the cart skips back to the center represents the pole falling out of bounds and the sim resetting.

- 5 actions; go left, right, up, down, do nothing
- state is cart & pole poses

after training a vanilla dqn using keras-rl we get a reasonable controller

the training curve is what we'd expect; a terrible number of steps at the start, then gradually a reasonable number of full runs (steps=200). it still never gets perfect runs 100% of the time though. there's also an interesting blip around episode 1,000 where it looked like it was doing OK, then diverged and recovered by about episode 7,500.

training with a baseline likelihood ratio policy gradient algorithm works well too... after 12 hrs it's getting 70% success rate keeping the pole balanced.

it seems very sensitive to the initialisation though; here are three runs i started at the same time. it's interesting how much quicker the green one got good results....

what's *really* interesting is looking at what the blue (and actually red) models are doing....

whereas the green one (in the video above) is constantly making adjustments, these other two models (red and blue) are much happier trying to stay still, i.e. long sequences of a 0 action. if they manage to get it balanced, or even nearly so, they just stop. this prior of 0 means though that if it's *not* balanced and they wait too long they haven't got time to recover. that's really cool! digging a bit further we can see that early in training there were cases where they managed to balance the pole very quickly and then just stopped for the rest of the episode. these were very successful compared to other runs in the same batch, and hence this prior formed; do nothing and get a great (relative) reward! it's taking a loooong time for them to recover from these early successes but, eventually, they'll arguably have a better model at convergence.

here's the proportions of actions per run for the cases where the episode resulted in a reward of 200 (i.e. it's balanced). notice how the red and blue ones don't do well for particular initial starts, these correspond to cases where the behaviour of "no action" overwhelms particular starting pushes.

run | stop | left | right | up | down |
---|---|---|---|---|---|
green | 0.32 | 0.16 | 0.15 | 0.22 | 0.13 |
red | 0.65 | 0.12 | 0.00 | 0.11 | 0.10 |
blue | 0.80 | 0.11 | 0.00 | 0.00 | 0.08 |

- 2d action; force to apply on cart in x & y directions
- state is cart & pole poses

Continuous control with deep reinforcement learning introduced a continuous control version of deep q networks using an actor/critic model.

my implementation for this problem is ddpg_cartpole.py and it learns a reasonable policy, though for the few long runs i've done it diverges after a while. (i've also yet to have this run stably with raw pixels; probably bugs in my code no doubt)

- 2d action; force to apply on cart in x & y directions
- state is 2 50x50 RGB images

Continuous Deep Q-Learning with Model-based Acceleration introduced normalised advantage functions.

my implementation is naf_cartpole.py and i've found NAF to be a lot easier/stable to train than DDPG.

based on raw pixels i haven't yet got a model that can balance most of the time. it's definitely getting somewhere though if we look at episode length over time. (caps out at 200 which is the max episode length)

here's an example of NAF at the start of training. note: these are the 50x50 images *as seen by the conv nets*

here's some examples of eval after training for 24hrs. i can still see mistakes so it should be doing better :/

anyways maybe balancing on a cart is too boring; what about on a kuka arm :)

how can we train a simulated bot to drive around the following track using reinforcement learning and neural networks?

let's build something using

- the standard 2d robot simulator (STDR) as a general framework for simulating and controlling the bot ( built on top of the robot operating system (ROS) )
- tensorflow for the RL and NN side of things.

all the code can be found on github

our bot has 3 sonars; one pointing forward, one left and one right. like most things in ROS these sonars are made available via a simple pub/sub model; each sonar publishes to a topic and our code subscribes to these topics, building a 3 tuple as shown below. the elements of the tuple are (range forward, range left, range right).

the bot is controlled by three simple commands; step forward, rotate clockwise or rotate anti clockwise. for now these are distinct, i.e. while rotating the bot isn't moving forward. again this is trivial in ROS, we just publish a msg representing this movement to a topic that the bot subscribes to.

we'll use simple forward movement as a signal the bot is doing well or not;

- choosing to move forward and hitting a wall scores the bot -1
- choosing to move forward and not hitting a wall scores the bot +1
- turning left or right scores 0
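this scoring can be sketched in a couple of lines of python (a hypothetical `score` helper for illustration, not the actual ROS code):

```python
def score(action, hit_wall):
    # +1 for moving forward without hitting a wall, -1 for moving
    # forward into a wall, 0 for turning either way
    if action == 'forward':
        return -1 if hit_wall else 1
    return 0

print(score('forward', hit_wall=False))    # 1
print(score('forward', hit_wall=True))     # -1
print(score('turn_left', hit_wall=False))  # 0
```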

the reinforcement learning task is to learn a decision making process where given some *state* of the world an *agent*
chooses some *action* with the goal of maximising some *reward*.

for our drivebot

- the *state* is based on the current range values of the sonars; this is all the bot knows
- the *agent* is the bot itself
- the *actions* are 'go forward', 'turn left' or 'turn right'
- the *reward* is the score based on progress

we'll call the a tuple of \( (current\ state, action, reward, new\ state) \) an *event* and a sequence of events an *episode*.

each episode will be run as 1) place the bot in a random location and 2) let it run until either a) it's not received a positive reward in 30 events (i.e. it's probably stuck) or b) we've recorded 1,000 events.

before we do anything too fancy we need to run a baseline..

```
if the largest sonar reading is from the forward one:
    go forward
elif the largest sonar reading is from the left one:
    turn left
else:
    turn right
```

an example episode looks like this...

if we run for 200 episodes we can

- plot the total reward for each episode and
- build a frequency table of the (action, reward) pairs we observed during all episodes.

```
 freq  [action, reward]
47327  [F, 1]   # moved forward and made progress; +1
10343  [R, 0]   # turned right
 8866  [L, 0]   # turned left
  200  [F, 0]   # noise (see below)
   93  [L, 1]   # noise (see below)
   79  [R, 1]   # noise (see below)
   36  [F, -1]  # moved forward and hit wall; -1
```

this baseline, like all good baselines, is pretty good! we can see that ~750 is about the limit of what we expect to be able to get as a total reward over 1,000 events (recall turning gives no reward).

there are some entries that i've marked as 'noise' in the frequency table, eg [R, 1], which are cases which shouldn't be possible. these
come about from how the simulation is run; stdr simulating the environment is run asynchronously to the bot
so it's possible to take an action (like turning)
and at the next 'tick' we've had relative movement from the action *before* (e.g. going straight). it's not a big deal and just the kind
of thing you have to handle with async messaging (more noise).

also note that the baseline isn't perfect and it doesn't get a high score most of the time. there are two main causes of getting stuck.

- it's possible to get locked into a left, right, left, right oscillation loop when coming out of a corner.
- if the bot tries to take a corner too tightly the front sonar can have a high reading but the bot can collide with the corner. (note the "blind spot" between the front and left/right sonar cones)

our first non baseline will be based on discrete q learning.
in q learning we learn a Q(uality) function which, given a state and action, returns the *expected total reward* until
the end of the episode (if that action was taken).

\( Q(state, action) = expected\ reward\ until\ end\ of\ episode \)

in the case that both the set of states and the set of actions are discrete we can represent the Q function as a table.

even though our sonar readings are a continuous state (three float values) we've already been talking about a discretised version of them; the mapping to one of furthest_sonar_forward, furthest_sonar_left, furthest_sonar_right.
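a sketch of that discretisation; it's just an argmax over the three readings (assuming they come ordered [forward, left, right]).

```python
def discretise(sonar_readings):
    # sonar_readings assumed ordered [forward, left, right];
    # the discrete state is just "which sonar sees furthest"
    states = ['furthest_sonar_forward',
              'furthest_sonar_left',
              'furthest_sonar_right']
    furthest = max(range(3), key=lambda i: sonar_readings[i])
    return states[furthest]
```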

a q table that corresponds to the baseline policy then might look something like ...

    state                   go_forward  turn_left  turn_right
    furthest_sonar_forward  100         99         99
    furthest_sonar_left     95          99         90
    furthest_sonar_right    95          90         99

once we have such a table we can make optimal decisions by running \( Q(state, action) \) for all actions and choosing the highest Q value.

but how would we populate such a table in the first place?

we'll use the idea of *value iteration*. recall that the Q function returns the expected *total* reward until
the end of the episode and as such it can be defined iteratively.

given an event \( (state_1, action, reward, state_2) \) we can define \( Q(state_1, action) \) as \( reward \) + the maximum reward it's possible to get from \( state_2 \)

\( Q(s_1, a) = r + \max_{a'} Q(s_2, a') \)

if a process is stochastic we can introduce a discount on future rewards, gamma, reflecting that immediate rewards are weighted more heavily than potential future rewards. (note: our simulation is totally deterministic but it's still useful to use a discount to avoid snowballing sums)

\( Q(s_1, a) = r + \gamma \cdot \max_{a'} Q(s_2, a') \)

given this definition we can learn the q table by populating it randomly and then updating incrementally based on observed events.

\( Q(s_1, a) = \alpha \cdot Q(s_1, a) + (1 - \alpha) \cdot (r + \gamma \cdot \max_{a'} Q(s_2, a')) \)
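the whole discrete learner is only a few lines; here's a minimal sketch with the q table as a dict of dicts (note that, matching the update equation here, \(\alpha\) weights the *old* estimate).

```python
import random

ACTIONS = ['go_forward', 'turn_left', 'turn_right']
STATES = ['furthest_sonar_forward', 'furthest_sonar_left', 'furthest_sonar_right']

def random_q_table():
    # populate the table randomly to start
    return {s: {a: random.random() for a in ACTIONS} for s in STATES}

def update(q, event, alpha=0.1, gamma=0.9):
    # one incremental value iteration step; alpha here weights the
    # *old* estimate, as in the update equation above
    s1, a, r, s2 = event
    q[s1][a] = alpha * q[s1][a] + (1 - alpha) * (r + gamma * max(q[s2].values()))

def best_action(q, state):
    # optimal decision: the action with the highest Q value for this state
    return max(ACTIONS, key=lambda a: q[state][a])
```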

running this policy over 200 episodes ( \( \gamma = 0.9, \alpha = 0.1 \) ) from a randomly initialised table we get the following total rewards per episode.

    freq   [action, reward]
    52162  [F, 1]
    10399  [R, 0]
    7473   [L, 0]
    1983   [R, 1]    # noise
    1112   [L, 1]    # noise
    221    [F, -1]
    191    [F, 0]    # noise

this looks very much like the baseline which is not surprising since the actual q table entries end up defining the same behaviour.

    state                   go_forward  turn_left  turn_right
    furthest_sonar_forward  8.3         4.5        4.3
    furthest_sonar_left     2.5         2.5        5.9
    furthest_sonar_right    2.5         6.1        2.4

also note how quickly this policy was learnt. for the first few episodes the bot wasn't doing great but it only took ~10 episodes to converge.

comparing the baseline to this run it's maybe fractionally better, but not by a whole lot...

using a q table meant we needed to map our continuous state space (three sonar readings) to a discrete
one (furthest_sonar_forward, etc) and though this mapping *seems* like a good idea maybe we can
learn better representations directly from the raw sonars? and what better tool to do this than a neural network! #buzzword

first let's consider the decision making side of things; given a state (the three sonar readings) which action shall we take? an initial
thought might be to build a network representing \( Q(s, a) \) and then run it forward once for each
action and pick the max Q value. this would work but we're actually going to represent things a little differently
and have the network output the Q values for *all* actions every time. we can simply run an arg_max over all
the Q values to pick the best action.

how then can we train such a network? consider again our original value update equation ...

\( Q(s_1, a) = r + \gamma \cdot \max_{a'} Q(s_2, a') \)

we want to use our Q network in two ways.

- for the left hand side, s1, we want to calculate the Q value for a particular action \( a \)
- for the right hand side, s2, we want to calculate the maximum Q value across all actions.

our network is already setup to do s2 well but for s1 we actually only want the Q value for one action not all of them. to pull out the one we want we can use a one hot mask followed by a sum to reduce to a single value. it may seem like a clumsy way to calculate it but having the network set up like this is worth it for the inference case and the calculation for s2 (both of which require all values)
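in numpy terms the mask-and-sum trick is just an elementwise multiply followed by a sum; a toy example with made up Q values for a batch of 2.

```python
import numpy as np

# network output: Q values for all 3 actions, for a batch of 2 states
q_values = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]])

# the actions actually taken, as a one hot mask
action_mask = np.array([[0.0, 1.0, 0.0],   # took action 1
                        [1.0, 0.0, 0.0]])  # took action 0

# s1 side: mask then sum reduces to the single Q(s, a) we want
q_s1 = (q_values * action_mask).sum(axis=1)

# s2 side: max over all actions
q_s2_max = q_values.max(axis=1)
```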

once we have these two values it's just a matter of including the reward and discount ( \(\gamma\) ) and minimising the difference between the two sides (called the temporal difference). squared loss seems to work fine for this problem.

graphically the training graph looks like the following. recall that a training example is a single event \( (state_1, action, reward, state_2) \) where in this network the action is represented by a one hot mask over all the actions.

training this network (the Q network being just a single layer MLP with 3 nodes) gives us the following results.

    freq   [action, reward]
    62644  [F, 1]
    36530  [R, 0]
    6217   [F, -1]
    3636   [R, 1]    # noise

this network doesn't score as high as the discrete q table and the main reason is that it's gotten stuck in a behavioural local minimum.

looking at the frequency of actions we can see this network never bothered with turning left and you can actually get a semi reasonable result if you're ok just doing 180 degree turns all the time...

even still this approach is (maybe) doing slightly better than the discrete case...

note there's nothing in the reward system that punishes backtracking; you might "waste" some time turning around, but the reward is for *any* forward movement, not just forward movement over new ground. we'll come back to this when we address a slightly harder
problem, but for now it brings up a common problem in all decision making processes:
the explore/exploit tradeoff.

there are lots of ways of handling explore vs exploit and we'll use a simple approach that has worked well for me in the past..

given a set of Q values for actions, say, [ 6.8, 7.7, 3.9 ], instead of just picking the max, 7.7, we'll do a weighted pick by sampling from the distribution we get by normalising the values: [ 0.37, 0.42, 0.21 ]

further to this we'll either squash (or stretch) the values by raising them to a power \( \rho \) before normalising them.

- when \(\rho\) is low, e.g. \(\rho\)=0.01, values are squashed and the weighted pick is more uniform. [ 6.8, 7.7, 3.9 ] -> [ 0.3338, 0.3342, 0.3319 ] resulting in an explore like behaviour.
- when \(\rho\) is high, e.g. \(\rho\)=20, values are stretched and the weighted pick is more like picking the maximum. [ 6.8, 7.7, 3.9 ] -> [ 0.0768, 0.9231, 0.0000 ] resulting in an exploit like behaviour

annealing \(\rho\) from a low value to a high one over training gives a smooth explore -> exploit transition.
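a sketch of this weighted pick in numpy (it assumes the Q values are non negative, as in the example above).

```python
import numpy as np

def weighted_pick(q_values, rho):
    # raise to the power rho then normalise; low rho squashes towards
    # uniform (explore), high rho stretches towards argmax (exploit).
    # assumes non negative q values.
    stretched = np.power(q_values, rho)
    probabilities = stretched / stretched.sum()
    return np.random.choice(len(q_values), p=probabilities)
```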

trying this with our network gives the following result.

    first 200:
    freq  [action, reward]
    4598  [R, 0]
    4564  [L, 0]
    2862  [F, -1]
    1735  [F, 1]
    177   [R, 1]    # noise
    175   [L, 1]    # noise
    69    [F, 0]    # noise

    last 200:
    freq   [action, reward]
    45789  [F, 1]
    33986  [R, 0]
    4596   [F, -1]
    1274   [R, 1]   # noise
    558    [L, 0]
    36     [L, 1]   # noise
    1      [F, 0]   # noise

this run actually kept \(\rho\) low for a couple of hundred iterations before annealing it from 0.01 to 50. we can see for the first 200 episodes we have an equal mix of F, L & R (so are definitely exploring) but by the end of the run we're back to favoring just turning right again :/ let's take a closer look.

the following graphs show the proportions of actions taken over time for two runs. the first is for the baseline case and shows a pretty constant ratio of F/L/R over time. the second plot is quite different though. here we have three distinct parts; 1) a burnin period of equal F/L/R when the bot was running 100% explore 2) a period of ramping up towards exploit where we do get higher scores related to a high F amount and finally 3) where we get locked into just favoring R again.

what went wrong? and what can we do about it?

there are actually two important things happening and two clever approaches to avoiding them. you can read a lot more about these two approaches in deepmind's epic Playing Atari paper [1]

the first problem is related to the instability of training the Q network with *two* updates per example.

recall that each training example updates both Q(s1) and Q(s2) and it turns out it can be unstable to train both of these at the same time. a simple enough workaround is to keep a full copy of the network (called the "target network") and use it for evaluating Q(s2). we don't backpropagate updates to the target network and instead take a fresh copy from the Q(s1) network every n training steps. (it's called the "target" network since it provides a more stationary target for Q(s1) to learn against)

the second problem is related to the order of examples we are training with.

the core sequence of an episode is \( ( state_1,\ action_1,\ reward_1,\ state_2,\ action_2,\ reward_2,\ state_3,\ ... ) \) which for training gets broken down to individual events i.e. \( ( state_1,\ action_1,\ reward_1,\ state_2 ) \) followed by \( ( state_2,\ action_2,\ reward_2,\ state_3 ) \) etc. as such each event's \( state_2 \) is going to be the \( state_1 \) for the next event. this type of correlation between successive examples is bad news for any iterative optimizer.

the solution is to use 'experience replay' [2] where we simply keep old events in a memory and replay them back as training examples in a random order. it's very similar to the ideas behind why we shuffle input data for any learning problem.
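the replay memory itself is only a few lines; a sketch (the memory size here is made up).

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, max_size=50000):
        # a bounded deque; the oldest events fall off the end
        self.memory = deque(maxlen=max_size)

    def add(self, event):
        self.memory.append(event)

    def batch(self, n):
        # replaying events in a random order breaks the correlation
        # between successive examples
        return random.sample(list(self.memory), min(n, len(self.memory)))
```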

adding these two gives the best result ...

    freq    [action, reward]
    116850  [F, 1]
    18594   [R, 0]
    18418   [L, 0]
    2050    [L, 1]   # noise
    2026    [R, 1]   # noise
    1681    [F, -1]
    1       [F, 0]   # noise

this run used a bot with a high \(\rho\) value (i.e. maximum exploit) that was fed 1,000 events/second randomly
from the explore/exploit job. we can see a few episodes of doing poorly before converging quickly (note that
this experience replay provides events a *lot* quicker than just normal simulation)

overall this approach nails it compared to the previous runs.

some ideas for next steps...

- a harder reward function
  - instead of a reward per movement we could give a reward only at discrete points on the track.
- continuous control
  - instead of three discrete actions we should try to learn continuous control actions ([3]) e.g. acceleration & steering.
  - will most probably require an actor/critic implementation
- add an adversarial net [4] as a way to transfer learn between this simulated robot and the raspberry pi powered whippersnapper rover i'm building.

- [1] Playing Atari with Deep Reinforcement Learning (pdf)
- [2] Reinforcement Learning for Robots Using Neural Networks (pdf)
- [3] Continuous control with deep reinforcement learning (arxiv)
- [4] Domain-Adversarial Training of Neural Networks (arxiv)

follow along further with this project by reading my google+ robot0 stream

see more of what i'm generally reading on my google+ reading stream

one thing in theano i couldn't immediately find examples for was a simple embedding lookup table, a critical component for anything with NLP. turns out that it's just one of those things that's so simple no one bothered writing it down :/

tl;dr : you can just use numpy indexing and everything just works.

consider the following theano.tensor example of 2d embeddings for 5 items. each row represents a separate embeddable item.

    >>> E = np.random.randn(5, 2)
    >>> t_E = theano.shared(E)
    >>> t_E.eval()
    array([[-0.72310919, -1.81050727],
           [ 0.2272197 , -1.23468159],
           [-0.59782901, -1.20510837],
           [-0.55842279, -1.57878187],
           [ 0.63385967, -0.35352725]])

to pick a subset of the embeddings it's as simple as just using indexing. for example to get the third & first embeddings it's ...

    >>> idxs = np.asarray([2, 0])
    >>> t_E[idxs].eval()
    array([[-0.59782901, -1.20510837],   # third row of E
           [-0.72310919, -1.81050727]])  # first row of E

if we want to concatenate them into a single vector (a common operation when we're feeding up to, say, a densely connected hidden layer), it's a reshape

    >>> t_E[idxs].reshape((1, -1)).eval()
    array([[-0.59782901, -1.20510837, -0.72310919, -1.81050727]])  # third & first row concatenated

all the required multi dimensional operations you need for batching just work too..

eg. if we wanted to run a batch of size 2 with the first batch item being the third & first embeddings and the second batch item being the fourth & fourth embeddings we'd do the following...

    >>> idxs = np.asarray([[2, 0], [3, 3]])  # batch of size 2; first example is pair [2, 0], second is [3, 3]
    >>> t_E[idxs].eval()
    array([[[-0.59782901, -1.20510837],    # third row of E
            [-0.72310919, -1.81050727]],   # first row of E
           [[-0.55842279, -1.57878187],    # fourth row of E
            [-0.55842279, -1.57878187]]])  # fourth row of E
    >>> t_E[idxs].reshape((idxs.shape[0], -1)).eval()
    array([[-0.59782901, -1.20510837, -0.72310919, -1.81050727],   # first item in batch; third & first row concatenated
           [-0.55842279, -1.57878187, -0.55842279, -1.57878187]])  # second item in batch; fourth row duplicated

this type of packing of the data into matrices is crucial to enable linear algebra libs and GPUs to really fire up.

consider the following as-simple-as-i-can-think-up "network" that uses embeddings;

given 6 items we want to train 2d embeddings such that the first two items have the same embeddings, the third and fourth have the same embeddings and the last two have the same embeddings. additionally we want all other combos to have different embeddings.

the *entire* theano code (sans imports) is the following..

first we initialise the embedding matrix as before

    E = np.asarray(np.random.randn(6, 2), dtype='float32')
    t_E = theano.shared(E)

the "network" is just a dot product of two embeddings ...

    t_idxs = T.ivector()
    t_embedding_output = t_E[t_idxs]
    t_dot_product = T.dot(t_embedding_output[0], t_embedding_output[1])

... where the training cost is an L1 penalty against the "label" of 1.0 for the pairs we want to have the same embeddings and 0.0 for the ones we want to have different embeddings.

    t_label = T.iscalar()
    gradient = T.grad(cost=abs(t_label - t_dot_product), wrt=t_E)
    updates = [(t_E, t_E - 0.01 * gradient)]
    train = theano.function(inputs=[t_idxs, t_label], outputs=[], updates=updates)

we can generate training examples by randomly picking two elements and assigning label 1.0 for the pairs 0 & 1, 2 & 3 and 4 & 5 (and 0.0 otherwise) and every once in awhile write the embeddings out to a file.

    print "i n d0 d1"
    for i in range(0, 10000):
        v1, v2 = random.randint(0, 5), random.randint(0, 5)
        label = 1.0 if (v1/2 == v2/2) else 0.0
        train([v1, v2], label)
        if i % 100 == 0:
            for n, embedding in enumerate(t_E.get_value()):
                print i, n, embedding[0], embedding[1]

plotting this shows the convergence of the embeddings (labels denote initial embedding location)...

0 & 1 come together, as do 2 & 3 and 4 & 5. ta da!

it's interesting to observe the effect of this (somewhat) arbitrary cost function i picked.

for the pairs where we wanted the embeddings to be the same, the cost function, \( |1 - a \cdot b| \), is minimised when the dot product is 1, and this happens when the vectors
are the same and have unit length. you can see this is the case for pairs 0 & 1 and 4 & 5, which have come together and ended up on the unit circle. but what about 2 & 3?
they've gone to the origin, and the dot product of the origin with itself is 0, so it's *maximising* the cost, not minimising it! why?

it's because of the other constraint we added. for all the pairs we wanted the embeddings to be different, the cost function, \( |0 - a \cdot b| \), is minimised when
the dot product is 0. this happens when the vectors are orthogonal. both 0 & 1 and 4 & 5 can be on the unit circle and orthogonal to each other, but for them both to also be orthogonal
to 2 & 3 *they* have to be at the origin. since my loss is an L1 loss (instead of, say, an L2 squared loss) the pair 2 & 3 is overall better off at the origin because it
gains more from minimising this constraint than from worrying about the first.

the pair 2 & 3 has come together not because we were training embeddings to be the same but because we were also training them to be different. this wouldn't be a problem if we were using 3d embeddings since they could all be both on the unit sphere and orthogonal at the same time.
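we can sanity check that 3d claim directly; three orthonormal embeddings give zero cost on both the "same" and the "different" constraints.

```python
import numpy as np

# three orthonormal 3d embeddings, one shared per pair of items
e = np.array([[1.0, 0.0, 0.0],   # shared by items 0 & 1
              [0.0, 1.0, 0.0],   # shared by items 2 & 3
              [0.0, 0.0, 1.0]])  # shared by items 4 & 5

# "same" pairs: want dot product 1 -> cost |1 - a.b|
same_costs = [abs(1.0 - np.dot(e[i], e[i])) for i in range(3)]

# "different" pairs: want dot product 0 -> cost |0 - a.b|
diff_costs = [abs(np.dot(e[i], e[j]))
              for i in range(3) for j in range(3) if i != j]
```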

you can also see how the points never fully converge. in 2d with this loss it's impossible to get the cost down to 0 so they continue to get bumped around. in 3d, as just mentioned, the cost can be 0 and the points would converge.

there's one non trivial optimisation you can do regarding your embeddings that relates to how sparse the embedding update is.
in the above example we have 6 embeddings in total and, even though we only update 2 of them at a time, we are calculating the
gradient with respect to the *entire* t_E matrix. the end result is that we calculate (and apply) a gradient that for the majority of rows is just zeros.

    ...
    gradient = T.grad(cost=abs(t_label - t_dot_product), wrt=t_E)
    updates = [(t_E, t_E - 0.01 * gradient)]
    ...
    print gradient.eval({t_idxs: [1, 2], t_label: 0})
    [[  0.00000000e+00   0.00000000e+00]
     [  9.60363150e-01   2.22545816e-04]
     [  1.00614786e+00  -3.63630615e-03]
     [  0.00000000e+00   0.00000000e+00]
     [  0.00000000e+00   0.00000000e+00]
     [  0.00000000e+00   0.00000000e+00]]

you can imagine how much sparser things are when you've got 1M embeddings and are only updating <10 per example :/

rather than do all this wasted work we can be a bit more explicit about both how we want the gradient calculated and updated by using inc_subtensor

    ...
    t_embedding_output = t_E[t_idxs]
    ...
    gradient = T.grad(cost=abs(t_label - t_dot_product), wrt=t_embedding_output)
    updates = [(t_E, T.inc_subtensor(t_embedding_output, -0.01 * gradient))]
    ...
    print gradient.eval({t_idxs: [1, 2], t_label: 0})
    [[  9.60363150e-01   2.22545816e-04]
     [  1.00614786e+00  -3.63630615e-03]]

and of course you should only do this once you've proven it's the slow part...

language modelling is a classic problem in NLP; given a sequence of words such as "my cat likes to ..." what's the next word? this problem is related to all sorts of things, everything from autocomplete to speech to text.

the classic solution to language modelling is based on just counting. if a speech to text system is sure it's heard "my cat likes to" but then can't decide if the next word is "sleep" or "beep" we can just look at relative counts; if we've observed in a large corpus that cats like to sleep more than they like to beep we can say "sleep" is more likely. (note: this would be different if it was "my roomba likes to ...")

the first approach i saw to solving this problem with neural nets is from bengio et al. "a neural probabilistic language model" (2003). this paper was a huge eye opener for me and was the first case i'd seen of using a distributed, rather than purely symbolic, representation of text. these days "word embeddings" are definitely all the rage!

bengio takes the approach of using a softmax to estimate the distribution of possible words given the two previous words. ie \( P({w}_3 | {w}_1, {w}_2) \). depending on your task though it might make more sense to instead estimate the likelihood of the triple directly ie \( P({w}_1, {w}_2, {w}_3) \).

let's work through an empirical comparison of these two on a synthetic problem. we'll call the first the *softmax* approach and the second the *logistic_regression* approach.
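to make the distinction concrete, here's a sketch of the two output heads in numpy; shapes only, the weights are random and nothing here is trained.

```python
import numpy as np

VOCAB = 6          # "A" .. "F"
EMBEDDING_DIM = 3  # arbitrary for the sketch

rng = np.random.RandomState(0)
embeddings = rng.randn(VOCAB, EMBEDDING_DIM)
W_softmax = rng.randn(2 * EMBEDDING_DIM, VOCAB)
W_logistic = rng.randn(3 * EMBEDDING_DIM)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# softmax approach: given (w1, w2), a distribution over every possible w3
def p_w3_given_w1_w2(w1, w2):
    h = np.concatenate([embeddings[w1], embeddings[w2]])
    return softmax(h.dot(W_softmax))  # shape (VOCAB,), sums to 1

# logistic approach: a single likelihood for the specific triple (w1, w2, w3)
def p_triple(w1, w2, w3):
    h = np.concatenate([embeddings[w1], embeddings[w2], embeddings[w3]])
    return 1.0 / (1.0 + np.exp(-h.dot(W_logistic)))  # scalar in (0, 1)
```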

rather than use real text data let's work on a simpler synthetic dataset with a vocab of only 6 tokens; "A", "B", "C", "D", "E" & "F". be warned: a vocab this small is so contrived that it's hard to generalise any result from it. in particular a normal english vocab in the hundreds of thousands would be soooooooo much sparser.

we'll use random walks on the following erdos renyi graph as a generating grammar. e.g. "phrases" include "D C", "A F A F A A", "A A", "E D C" & "E A A"

the main benefit of such a contrived small vocab is that it's feasible to analyse all 6^{3} = 216 trigrams.
let's consider the distributions associated with a couple of specific (w1, w2) pairs.

there are only 45 trigrams that this grammar generates and the most frequent one is FAA. FAF is also possible but the other FA? cases can never occur.

    F A A  0.20   # the most frequent trigram generated
    F A B  0.0    # never generated
    F A C  0.0
    F A D  0.0
    F A E  0.0
    F A F  0.14   # the 4th most frequent trigram

if we train a simple softmax based neural probabilistic language model (nplm) we see the distribution of \( P({w}_3 | {w}_1=F, {w}_2=A ) \) converge to what we expect; FAA has a likelihood of 0.66, FAF has 0.33 and the others 0.0

this is a good illustration of the convergence we expect to see with a softmax. each observed positive example of FAA is also an implicit negative example for FAB, FAC, FAD, FAE & FAF and as such each FAA causes the likelihood of FAA to go up while pushing the others down. since we observe FAA twice as much as FAF it gets twice the likelihood and since we never see FAB, FAC, FAD or FAE they only ever get pushed down and converge to 0.0
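the implicit negative effect falls straight out of the softmax cross entropy gradient, which is just (predicted distribution - one hot target); a quick numpy check.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

logits = np.zeros(6)   # logits for FA? over w3 = A..F, uniform to start
p = softmax(logits)

target = np.zeros(6)
target[0] = 1.0        # one observed positive example of FAA

# cross entropy gradient w.r.t. the logits; after a gradient *descent*
# step FAA's logit goes up while all the others go down
gradient = p - target
```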

since the implementation behind this is (overly) simple we can run a couple of times to ensure things are converging consistently. here's 6 runs, from random starting parameters, and we can see each converges to the same result..

now consider the logistic model where instead of learning the distribution of w3 given (w1, w2) we instead model the likelihood of the triple directly, \( P({w}_1, {w}_2, {w}_3) \). in this case we're modelling whether a specific example is true or not, not how it relates to others, so one big con is that there are no implicit negatives like there are with the softmax. we need explicit negative examples and for this experiment i've generated them by randomly sampling trigrams that don't occur in the observed set. ( the generation of "good" negatives is a surprisingly hard problem )

if we do 6 runs again, instead of learning the distribution we have FAA and FAF converging to 1.0 and the others converging to 0.0. run4 actually has FAB tending to 1.0 too but i wouldn't be surprised at all if it dropped later; these graphs in general are what i'd expect given i'm just using a fixed global learning rate (ie nothing adaptive about the learning rate at all)

now instead of considering the most frequent (w1, w2) trigrams let's consider the least frequent.

    C B A  0.003
    C B B  0.07    # 28th most frequent (of 45 possible trigrams)
    C B C  0.0
    C B D  0.003
    C B E  0.002
    C B F  0.001   # the least frequent trigram generated

as before the softmax learns the distribution; CBB is the most frequent, CBC has 0.0 probability and the others are roughly equal. these examples are far less frequent in the dataset so the model, quite rightly, allocates less of its capacity to getting them right.

the logistic model as before has, generally, everything converging to 1.0 except CBC which converges to 0.0

finally consider the case of C -> C -> ?. this one is interesting since C -> C never actually occurs in the grammar.

first let's consider the logistic case. CC only ever occurs in the training data as an explicit negative so we see all of them converging to 0.0 ( amusingly in run4 CCC alllllmost made it )

now consider the softmax. recall that the softmax learns by explicit positives and implicit negatives, but, since there are no cases of observed CC?, the softmax would not have seen any CC? cases.

so what is going on here? the convergence is all over the place! run2 and run6 seems to suggest CCA is the only likely case whereas run3 and run4 oscillate between CCB and CCF ???

it turns out these are artifacts of the training. there was no pressure in any way to get CC? "right" so these are just the side effects of how the embeddings for tokens, particularly C in this case, are being used for the other actual observed examples. we call these hallucinations.

another slightly different way to view this is to run the experiment 100 times and just consider the converged state (or at least the final state after a fixed number of iterations)

if we consider FA again we can see its low variance convergence of FAA to 0.66 and FAF to 0.33.

if we consider CB again we can see its higher variance convergence to the numbers we reviewed before; CBB ~= 0.4, CBC = 0.0 and the others around 0.15

considering CC though we see CCA and CCB have a bimodal distribution between 0.0 and 1.0 unlike any of the others. fascinating!

this is interesting but i'm unsure how much of it is just due to an overly simple model. this implementation just uses a simple fixed global learning rate (no per weight adaptation at all), uses very simple weight initialisation and has no regularisation at all :/

all the code can be found on github

i've been reviving some old theano code recently and in case you haven't seen it theano is a pretty awesome python library that reads a lot like numpy but provides two particularly interesting features.

- symbolic differentiation; not something i'll talk about here, but super useful if you're tinkering with new models and you're using a gradient descent method for learning (and these days, who's not..)
- the ability to run transparently on a gpu; well, almost transparently, this'll be the main focus of this post...

let's work through a very simple model that's kinda like a system of linear equations. we'll compare 1) numpy (our timing baseline) vs 2) theano on a cpu vs 3) theano on a gpu. keep in mind this model is contrived and doesn't really represent anything useful, it's more to demonstrate some matrix operations.

first consider the following numpy code (speed_test_numpy.py) which does a simple y=mx+b like calculation a number of times in a tight loop. this looping isn't just for benchmarking, lots of learning algorithms operate on a tight loop.

    # define data
    # square matrices will do for a demo
    np.random.seed(123)
    m = np.random.randn(1000, 1000).astype('float32')
    x = np.random.randn(1000, 1000).astype('float32')
    b = np.random.randn(1000, 1000).astype('float32')

    # run tight loop
    start = time.time()
    for i in range(500):
        y = np.add(np.dot(m, x), b)
    print "numpy", time.time()-start, "sec"

this code on a 6 core 3.8Ghz AMD runs in a bit over 2min

    $ python speed_test_numpy.py
    numpy 135.350140095 sec

now consider the same thing in theano (speed_test_theano.py)

    import theano
    import theano.tensor as T

    # define data
    np.random.seed(123)
    m = np.random.randn(1000, 1000).astype('float32')
    x = np.random.randn(1000, 1000).astype('float32')
    b = np.random.randn(1000, 1000).astype('float32')

    # define a symbolic expression of the equations in theano
    tm = T.matrix("m")
    tx = T.matrix("x")
    tb = T.matrix("b")
    ty = T.add(T.dot(tm, tx), tb)

    # and compile it
    line = theano.function(inputs=[tx, tm, tb], outputs=[ty])

    # then run same loop as before
    start = time.time()
    for i in range(500):
        y, = line(m, x, b)
    print "theano", time.time()-start, "sec"

hopefully it's clear enough what is happening here at a high level but just briefly the tm, tx, tb and ty variables represent a symbolic representation of what we want to do and the theano.function call compiles this into actual executable code. there is lots of gentle intro material that introduces this notation on the theano site.

when run on the cpu it takes about the same time as the numpy version

    $ THEANO_FLAGS=device=cpu python speed_test_theano.py
    theano 136.371109009 sec

but when "magically" run on the gpu it's quite a bit faster.

    $ THEANO_FLAGS=device=gpu python speed_test_theano.py
    Using gpu device 0: GeForce GTX 970
    theano 3.16091990471 sec

awesome! a x40 speed up! so we're done right? not quite, we can do better.

let's drill into what's actually happening; we can do this in two ways, debugging the compiled graph and theano profiling.

debugging allows us to see what a function has been compiled to. for the cpu case it's just a single blas gemm (general matrix multiplication) call. that's exactly what we'd want, so great!

    Gemm{no_inplace} [@A] ''   0
     |b [@B]
     |TensorConstant{1.0} [@C]
     |m [@D]
     |x [@E]
     |TensorConstant{1.0} [@C]

profiling allows us to see where time is spent. 100% in this single op, no surprise.

    $ THEANO_FLAGS=device=cpu,profile=True python speed_test_theano.py
    ...
    <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
     100.0%  100.0%  136.282s    2.73e-01s       500     0    Gemm{no_inplace}
    ...

looking at the gpu version though things are a little different...

    HostFromGpu [@A] ''   4
     |GpuGemm{inplace} [@B] ''   3
       |GpuFromHost [@C] ''   2
       | |b [@D]
       |TensorConstant{1.0} [@E]
       |GpuFromHost [@F] ''   1
       | |m [@G]
       |GpuFromHost [@H] ''   0
       | |x [@I]
       |TensorConstant{1.0} [@E]

we can see a GpuGemm operation, the gpu equivalent of Gemm, but now there's a bunch of GpuFromHost & HostFromGpu operations too? what are these?

i'll tell you what they are, they are the bane of your existence! these represent transferring data to/from the gpu which is slow and, if we're not careful, can add up to a non trivial amount. if we review the profiling output we can see that, though we're faster than the non gpu version, we're spending >70% of the time just moving data.

(though remember this example is contrived; we'd expect to be doing more in our overall computation than just a single general matrix multiply)

    $ THEANO_FLAGS=device=gpu,profile=True python speed_test_theano.py
    ...
    <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
      26.4%   26.4%  0.776s      1.55e-03s       500     3    GpuGemm{inplace}
      19.5%   45.9%  0.573s      1.15e-03s       500     0    GpuFromHost(x)
      19.5%   65.4%  0.572s      1.14e-03s       500     1    GpuFromHost(m)
      19.3%   84.7%  0.565s      1.13e-03s       500     2    GpuFromHost(b)
      15.3%  100.0%  0.449s      8.99e-04s       500     4    HostFromGpu(GpuGemm{inplace}.0)
    ...

ouch!

the crux of this problem is that we actually have two types of variables in this model; the parameterisation of the model (m & b) and
those related to examples (x & y). so, though it's realistic to do a speed test with a tight loop over the same function many times,
what is *not* realistic is that we are passing the model parameters to/from the gpu
each and every input example. this is a complete waste; it's much more sensible to send them over to the gpu once at the
start of the loop and retrieve them once at the end. this is an important and very common pattern.

how do we fix this? it's actually pretty simple; shared variables. yay!

consider the following; speed_test_theano_shared.py

    # define data
    np.random.seed(123)
    m = np.random.randn(1000, 1000).astype('float32')
    x = np.random.randn(1000, 1000).astype('float32')
    b = np.random.randn(1000, 1000).astype('float32')

    # define a symbolic expression of the equations in theano
    tm = theano.shared(m)  # copy m over to gpu once explicitly
    tx = T.matrix("x")
    tb = theano.shared(b)  # copy b over to gpu once explicitly
    ty = T.add(T.dot(tm, tx), tb)
    line = theano.function(inputs=[tx], outputs=[ty])  # don't pass m & b each call

    # then run same loop as before
    start = time.time()
    for i in range(500):
        y, = line(x)
    print tm.get_value().shape  # note: we can get the value back at any time

reviewing the debug we can see this removes a stack of the GpuFromHost calls.

```
HostFromGpu [@A] ''   2
 |GpuGemm{no_inplace} [@B] ''   1
   | [@C]
   |TensorConstant{1.0} [@D]
   | [@E]
   |GpuFromHost [@F] ''   0
   | |x [@G]
   |TensorConstant{1.0} [@D]
```

and we're down to < 2s

```
$ THEANO_FLAGS=device=gpu,profile=True python speed_test_theano_shared.py
Using gpu device 0: GeForce GTX 970
theano 1.93515706062 sec
...
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  44.7%   44.7%   0.804s   1.61e-03s   500   1   GpuGemm{no_inplace}
  30.2%   74.9%   0.543s   1.09e-03s   500   0   GpuFromHost(x)
  25.1%  100.0%   0.451s   9.01e-04s   500   2   HostFromGpu(GpuGemm{no_inplace}.0)
...
```

what's even crazier is we can go further by moving the x and y matrices onto the gpu too. it turns out this isn't *too* far fetched since, if x and y were representing training examples, we'd be iterating over them anyway (and if we could fit them all onto the gpu that'd be great)

```python
# define data
np.random.seed(123)
m = np.random.randn(1000, 1000).astype('float32')
x = np.random.randn(1000, 1000).astype('float32')
b = np.random.randn(1000, 1000).astype('float32')

# define a symbolic expression of the equations in theano
tm = theano.shared(m)
tx = theano.shared(x)
tb = theano.shared(b)
ty = theano.shared(np.zeros((1000, 1000)).astype('float32'))  # we need a shared var for y now
mx_b = T.add(T.dot(tm, tx), tb)

# and compile it
train = theano.function(inputs=[], updates={ty: mx_b})  # update y on gpu

# then run same loop as before
start = time.time()
for i in range(500):
    train()  # now there's no input/output
print tm.get_value().shape
print "theano", time.time() - start, "sec"
```

the debug graph is like the cpu graph now, just one gemm call.

```
GpuGemm{no_inplace} [@A] ''   0
 | [@B]
 |TensorConstant{1.0} [@C]
 | [@D]
 | [@E]
 |TensorConstant{1.0} [@C]
```

and runs in under a second; 150x faster than the numpy version. nice! :)

```
$ THEANO_FLAGS=device=gpu,profile=True python speed_test_theano_shared2.py
theano 0.896003007889 sec
...
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
 100.0%  100.0%   0.800s   1.60e-03s   C   500   1   GpuGemm{no_inplace}
...
```

PyMC is a python library for working with bayesian statistical models, primarily using MCMC methods. as a software engineer who has only just scratched the surface of statistics this whole MCMC business is blowing my mind so i've got to share some examples.

let's start with the simplest thing possible, fitting a simple distribution.

say we have a thousand values, `87.27, 67.98, 119.56, ...`

and we want to build a model of them.

a common first step might be to generate a histogram.

if i had to make a guess i'd say this data looks normally distributed. somewhat unsurprising, not just because normal distributions are freakin everywhere (this great khan academy video on the central limit theorem explains why) but because it was me who synthetically generated this data in the first place ;)

now a normal distribution is parameterised by two values: its *mean* (technically speaking, the "middle" of the curve) and its *standard deviation* (even more technically speaking, how "fat" it is). so let's use PyMC to figure out what these values are for this data.

*!!warning!! !!!total overkill alert!!!* there must be a bazillion simpler ways to fit a normal to this data but this post is about
dead-simple-PyMC not dead-simple-something-else.
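(for the record, here's one of those simpler ways, sketched in numpy on synthetic stand-in data since the actual `data` file isn't included here; the maximum likelihood estimates are just the sample mean and standard deviation)

```python
import numpy as np

np.random.seed(123)
data = np.random.normal(100, 20, 1000)  # stand-in for the values in the data file

mean = data.mean()              # ML estimate of the mean
std_dev = data.std()            # ML estimate of the standard deviation
precision = 1.0 / std_dev ** 2  # the parameterisation pymc's Normal uses

print(mean, std_dev, precision)
```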

first a definition of our model.

```python
# simple_normal_model.py
from pymc import *
data = map(float, open('data', 'r').readlines())
mean = Uniform('mean', lower=min(data), upper=max(data))
precision = Uniform('precision', lower=0.0001, upper=1.0)
process = Normal('process', mu=mean, tau=precision, value=data, observed=True)
```

working *backwards* through this code ...

- line 6 says i am trying to model some `process` that i believe is `Normal`ly distributed, defined by variables `mean` and `precision` (precision is just the inverse of the variance, which in turn is just the standard deviation squared). i've already `observed` this data and the `value`s are in the variable `data`
- line 5 says i don't know the `precision` for my `process` but my prior belief is its value is somewhere between 0.0001 and 1.0. since i don't favor any values in this range my belief is `uniform` across the values. note: assuming a uniform distribution for the precision is overly simplifying things quite a bit, but we can get away with it in this simple example and we'll come back to it.
- line 4 says i don't know the `mean` for my data but i think it's somewhere between the `min` and the `max` of the observed `data`. again this belief is `uniform` across the range.
- line 3 says the `data` for my unknown `process` comes from a local file (just-plain-python)

the second part of the code runs the MCMC sampling.

```python
# run_mcmc.py
from pymc import *
import simple_normal_model
model = MCMC(simple_normal_model)
model.sample(iter=500)
print(model.stats())
```

working *forwards* through this code ...

- line 4 says build a MCMC for the model from the `simple_normal_model` file
- line 5 says run a sample for 500 iterations
- line 6 says print some stats.

**and that's it!**

the output from our stats includes, among other things, estimates for the `mean` and `precision` we were trying to find

```
{
 'mean': {'95% HPD interval': array([ 94.53688316, 102.53626478]) ... },
 'precision': {'95% HPD interval': array([ 0.00072487, 0.03671603]) ... },
 ...
}
```

now i've brushed over a couple of things here (eg the use of a uniform prior over the precision, see here for more details) but i can get away with it all because this problem is a trivial one and i'm not doing gibbs sampling in this case. the main point i'm trying to make is that it's dead simple to start writing these models.

one thing i do want to point out is that this estimation doesn't result in just one single value for mean and precision, it results in a distribution of the possible values. this is great since it gives us an idea of how confident we can be in the values as well as allowing this whole process to be iterative, ie the output values from this model can be fed easily into another.
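to make that concrete, here's a sketch (numpy only, with a synthetic stand-in for `model.trace('mean')[:]` since no real trace is included) of turning a vector of posterior samples into a point estimate and an interval. note pymc's HPD interval is computed differently; this is a plain central percentile interval:

```python
import numpy as np

np.random.seed(123)
samples = np.random.normal(98.5, 2.0, 5000)  # stand-in for model.trace('mean')[:]

point_estimate = samples.mean()
lo, hi = np.percentile(samples, [2.5, 97.5])  # central 95% interval
print(point_estimate, lo, hi)
```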

all the code above parameterised the normal distribution with a mean and a precision. i've always thought of normals though in terms of means and standard deviations (precision is a more bayesian way to think of things... apparently...) so the first extension to my above example i want to make is to redefine the problem in terms of a prior on the standard deviation instead of the precision. mainly i want to do this to introduce the `deterministic` concept but it's also a subtle change in how the sampling search will be directed because it introduces a non linear transform.

```python
data = map(float, open('data', 'r').readlines())
mean = Uniform('mean', lower=min(data), upper=max(data))
std_dev = Uniform('std_dev', lower=0, upper=50)

@deterministic(plot=False)
def precision(std_dev=std_dev):
    return 1.0 / (std_dev * std_dev)

process = Normal('process', mu=mean, tau=precision, value=data, observed=True)
```

our code is almost the same but instead of a prior on the `precision` we use a `deterministic` method to map from the parameter we're now trying to estimate (the `std_dev`) to the variable the normal is defined in terms of (the `precision`).

we fit the model using the same `run_mcmc.py` but this time get estimates for the `std_dev` not the `precision`
```
{
 'mean': {'95% HPD interval': array([ 94.23147867, 101.76893808]), ...
 'std_dev': {'95% HPD interval': array([ 19.53993697, 21.1560098 ]), ...
 ...
}
```

which all matches up with how i originally generated the data in the first place... cool!

```python
from numpy.random import normal
data = [normal(100, 20) for _i in xrange(1000)]
```

for this example let's now dive a bit deeper than just the stats object. to help understand how the sampler is converging on its results we can also dump a trace of its progress at the end of `run_mcmc.py`

```python
import numpy
for p in ['mean', 'std_dev']:
    numpy.savetxt("%s.trace" % p, model.trace(p)[:])
```

plotting this we can see how quickly the sampled values converged.
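one crude non-plotting alternative (my own sketch, not a pymc feature) is to compare summary stats of the two halves of the trace; here a synthetic trace stands in for the saved `.trace` file since the real one isn't included:

```python
import numpy as np

np.random.seed(123)
# synthetic stand-in for numpy.loadtxt('mean.trace'): drifts in, then settles
burn_in = np.linspace(150, 100, 100) + np.random.normal(0, 2, 100)
settled = np.random.normal(100, 2, 400)
trace = np.concatenate([burn_in, settled])

first_half, second_half = trace[:250], trace[250:]
# a converged chain gives similar answers in both halves; a big difference
# suggests the early samples are burn-in and should be dropped
print(first_half.mean(), second_half.mean())
```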

let's consider a slightly more complex example.

again we have some data... `107.63, 207.43, 215.84, ...`

that plotted looks like this...

hmmm. looks like *two* normals this time with the one centered on 100 having a bit more data.

how could we model this one?

```python
data = map(float, open('data', 'r').readlines())

theta = Uniform("theta", lower=0, upper=1)
bern = Bernoulli("bern", p=theta, size=len(data))

mean1 = Uniform('mean1', lower=min(data), upper=max(data))
mean2 = Uniform('mean2', lower=min(data), upper=max(data))
std_dev = Uniform('std_dev', lower=0, upper=50)

@deterministic(plot=False)
def mean(bern=bern, mean1=mean1, mean2=mean2):
    return bern * mean1 + (1 - bern) * mean2

@deterministic(plot=False)
def precision(std_dev=std_dev):
    return 1.0 / (std_dev * std_dev)

process = Normal('process', mu=mean, tau=precision, value=data, observed=True)
```

reviewing the code again it's mostly the same, the big difference being the `deterministic` definition of the `mean`. it's now that we finally start to show off the awesome power of these non analytical approaches.

line 12 defines the mean not by one `mean` variable but instead as a mixture of two, `mean1` and `mean2`. for each value we're trying to model we pick either `mean1` or `mean2` based on *another* random variable `bern`. `bern` is described by a bernoulli distribution and so is either 1 or 0, proportional to the parameter `theta`.

ie the definition of our `mean` is that when `theta` is high, near 1.0, we pick `mean1` most of the time and when `theta` is low, near 0.0, we pick `mean2` most of the time.

what we are solving for then is not just `mean1` and `mean2` but also how the values are split between them (described by `theta`). (and note, for the sake of simplicity, i made the two normals differ in their means but use a shared standard deviation. depending on what you're doing this might or might not make sense)
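to build intuition for what line 12 is doing, here's the same mixture construction written generatively in plain numpy (not pymc; the numbers are just illustrative):

```python
import numpy as np

np.random.seed(123)
theta = 0.66                  # probability of picking mean1
mean1, mean2 = 100.0, 200.0
std_dev = 20.0

bern = np.random.binomial(1, theta, size=10000)  # one 0/1 flag per value
mean = bern * mean1 + (1 - bern) * mean2         # same expression as line 12
samples = np.random.normal(mean, std_dev)
print(samples.mean())
```

the sample mean lands near `theta * mean1 + (1 - theta) * mean2`, which is exactly the relationship the sampler exploits when solving for `theta`.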

reviewing the traces we can see the converged `mean`s are 100 & 200 with `std_dev` 20. the mix (`theta`) is 0.33, which all agrees with the synthetic data i generated for this example...

```python
from numpy.random import normal
import random
data = [normal(100, 20) for _i in xrange(1000)]  # 2/3rds of the data
data += [normal(200, 20) for _i in xrange(500)]  # 1/3rd of the data
random.shuffle(data)
```

to me the awesome power of these methods is the ability in that function to pretty much write whatever i think best describes the process. too cool for school.

i also find it interesting to see how the convergence came along... the model starts in a local minimum with both normals having a mean a bit below 150 (the midpoint of the two actual ones) and a mixing proportion somewhere in the ballpark of 0.5 / 0.5. around iteration 1,500 it correctly splits them apart and starts to understand the mix is more like 0.3 / 0.7. finally, by about iteration 2,500, it starts working on the standard deviation which in turn really helps narrow down the true means.

(thanks cam for helping me out with the formulation of this one..)

these are pretty simple examples thrown together to help me learn but i think they're still illustrative of the power of these methods (even when i completely ignore anything to do with conjugacy)

in general i've been working through an awesome book, doing bayesian data analysis, and can't recommend it enough.

i also found john's blog post on using jags in r was really helpful getting me going.

all the examples listed here are on github.

next is to rewrite everything in stan and do some comparison between pymc, stan and jags. fun times!
