brain of mat kelcey...

theano and the curse of GpuFromHost

February 22, 2015 at 10:00 PM | categories: Uncategorized

brutally short intro to theano

i've been reviving some old theano code recently and, in case you haven't seen it, theano is a pretty awesome python library that reads a lot like numpy but provides two particularly interesting features:

symbolic differentiation; not something i'll talk about here, but super useful if you're tinkering with new models and you're using a gradient descent method for learning (and these days, who's not..)
the ability to run transparently on a gpu; well, almost transparently, this'll be the main focus of this post...

multiplying matrices

let's work through a very simple model that's kinda like a system of linear equations. we'll compare 1) numpy (our timing baseline) vs 2) theano on a cpu vs 3) theano on a gpu. keep in mind this model is contrived and doesn't really represent anything useful, it's more to demonstrate some matrix operations.

in numpy

first consider the following numpy code (speed_test_numpy.py) which does a simple y=mx+b like calculation a number of times in a tight loop. this looping isn't just for benchmarking, lots of learning algorithms operate on a tight loop.

import time
import numpy as np

# define data
# square matrices will do for a demo
np.random.seed(123)
m = np.random.randn(1000, 1000).astype('float32')
x = np.random.randn(1000, 1000).astype('float32')
b = np.random.randn(1000, 1000).astype('float32')

# run tight loop
start = time.time()
for i in range(500):
    y = np.add(np.dot(m, x), b)
print "numpy", time.time()-start, "sec"

this code on a 6 core 3.8GHz AMD runs in a bit over 2 min

$ python speed_test_numpy.py
numpy 135.350140095 sec
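
to put that number in context, here's a rough bit of arithmetic on the timing above. it's only a sketch: it assumes the usual 2 * n^3 flop count for a dense n x n matrix product, and the variable names are just mine for illustration.

```python
# rough arithmetic on the numpy timing quoted above
n = 1000                      # matrices are n x n
iterations = 500              # loop count from the script
elapsed_sec = 135.350140095   # measured wall time from the run above

# assumption: a dense n x n matrix product costs about 2 * n^3 flops
# (n^3 multiplies + n^3 adds); the "+ b" adds only n^2 more
flops_per_iter = 2 * n**3 + n**2
sec_per_iter = elapsed_sec / iterations
gflops = flops_per_iter / sec_per_iter / 1e9

print("%.3f sec per iteration" % sec_per_iter)
print("%.1f GFLOP/s" % gflops)
```

so each pass through the loop costs a bit over a quarter of a second, which is the per iteration figure the later variants have to beat.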

in theano

now consider the same thing in theano (speed_test_theano.py)

import time
import numpy as np
import theano
import theano.tensor as T

# define data
np.random.seed(123)
m = np.random.randn(1000, 1000).astype('float32')
x = np.random.randn(1000, 1000).astype('float32')
b = np.random.randn(1000, 1000).astype('float32')

# define a symbolic expression of the equations in theano
tm = T.matrix("m")
tx = T.matrix("x")
tb = T.matrix("b")
ty = T.add(T.dot(tm, tx), tb)
# and compile it; the input order must match the call line(m, x, b) below
line = theano.function(inputs=[tm, tx, tb], outputs=[ty])

# then run same loop as before
start = time.time()
for i in range(500):
    y, = line(m, x, b)
print "theano", time.time()-start, "sec"

hopefully it's clear enough what's happening here at a high level, but briefly: the tm, tx, tb and ty variables are a symbolic representation of what we want to compute, and the theano.function call compiles this into actual executable code. there's lots of gentle intro material covering this notation on the theano site.

when run on the cpu it takes about the same time as the numpy version

$ THEANO_FLAGS=device=cpu python speed_test_theano.py
theano 136.371109009 sec

but when "magically" run on the gpu it's quite a bit faster.

$ THEANO_FLAGS=device=gpu python speed_test_theano.py
Using gpu device 0: GeForce GTX 970
theano 3.16091990471 sec

awesome! a x40 speed up! so we're done right? not quite, we can do better.

profiling

let's drill into what's actually happening; we can do this in two ways, debugging the compiled graph and theano profiling.

debugging allows us to see what a function has been compiled to (e.g. by passing it to theano.printing.debugprint). for the cpu case it's just a single blas gemm (general matrix multiplication) call. that's exactly what we'd want, so great!

Gemm{no_inplace} [@A] ''   0
 |b [@B]
 |TensorConstant{1.0} [@C]
 |m [@D]
 |x [@E]
 |TensorConstant{1.0} [@C]
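
in case the gemm argument list looks odd, here's a minimal numpy sketch of what that single op works out. gemm_reference is a hypothetical helper of mine, not theano's implementation; it just assumes the usual gemm form, b*z + a*dot(x, y), with the arguments in the same order as the debug output.

```python
import numpy as np

def gemm_reference(z, a, x, y, b):
    # hypothetical helper, not theano's code:
    # general matrix multiply with accumulation, b*z + a*dot(x, y)
    return b * z + a * np.dot(x, y)

# small matrices are enough to check the algebra
rng = np.random.RandomState(123)
m = rng.randn(4, 4).astype('float32')
x = rng.randn(4, 4).astype('float32')
b = rng.randn(4, 4).astype('float32')

# same argument order as the debug output: b, 1.0, m, x, 1.0
fused = gemm_reference(b, 1.0, m, x, 1.0)
unfused = np.add(np.dot(m, x), b)
print(np.allclose(fused, unfused))
```

with both scalars set to 1.0 the fused call is the same thing as our dot followed by add, which is why the two ops in the script collapse into one.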

profiling allows us to see where time is spent. 100% in this single op, no surprise.

$ THEANO_FLAGS=device=cpu,profile=True python speed_test_theano.py
...
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  100.0%   100.0%     136.282s       2.73e-01s    500     0   Gemm{no_inplace}
...

looking at the gpu version though things are a little different...

HostFromGpu [@A] ''   4
 |GpuGemm{inplace} [@B] ''   3
   |GpuFromHost [@C] ''   2
   | |b [@D]
   |TensorConstant{1.0} [@E]
   |GpuFromHost [@F] ''   1
   | |m [@G]
   |GpuFromHost [@H] ''   0
   | |x [@I]
   |TensorConstant{1.0} [@E]

we can see a GpuGemm operation, the gpu equivalent of Gemm, but now there's a bunch of GpuFromHost & HostFromGpu operations too? what are these?

i'll tell you what they are, they are the bane of your existence! these represent transferring data to/from the gpu which is slow and, if we're not careful, can add up to a non trivial amount. if we review the profiling output we can see that, though we're faster than the non gpu version, we're spending >70% of the time just moving data.

(though remember this example is contrived, we'd expect to be doing more in our overall computation than just a single general matrix multiply)

$ THEANO_FLAGS=device=gpu,profile=True python speed_test_theano.py
...
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  26.4%    26.4%       0.776s       1.55e-03s    500     3   GpuGemm{inplace}
  19.5%    45.9%       0.573s       1.15e-03s    500     0   GpuFromHost(x)
  19.5%    65.4%       0.572s       1.14e-03s    500     1   GpuFromHost(m)
  19.3%    84.7%       0.565s       1.13e-03s    500     2   GpuFromHost(b)
  15.3%   100.0%       0.449s       8.99e-04s    500     4   HostFromGpu(GpuGemm{inplace}.0)
...

ouch!
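
here's the arithmetic behind that, as a small sketch. the times are copied from the profile above; the implied transfer rate assumes each matrix is 1000 x 1000 float32 (4 bytes per element), and it's only a rough figure since the op time includes more than the raw copy.

```python
# (op name, total seconds over 500 calls), copied from the gpu profile above
profile = [
    ("GpuGemm", 0.776),
    ("GpuFromHost(x)", 0.573),
    ("GpuFromHost(m)", 0.572),
    ("GpuFromHost(b)", 0.565),
    ("HostFromGpu", 0.449),
]
total = sum(sec for _, sec in profile)
# every op with "Host" in its name is a copy to or from the gpu
transfer = sum(sec for name, sec in profile if "Host" in name)
print("transfer share: %.1f%%" % (100 * transfer / total))

# assumption: each matrix is 1000 x 1000 float32, so 4 bytes per element
bytes_per_matrix = 1000 * 1000 * 4
calls = 500
for name, sec in profile:
    if "Host" in name:
        gb_per_sec = bytes_per_matrix * calls / sec / 1e9
        print("%s ~%.1f GB/s" % (name, gb_per_sec))
```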

shared variables

the crux of this problem is that we actually have two types of variables in this model; the parameterisation of the model (m & b) and those related to examples (x & y). so, though it's realistic to do a speed test with a tight loop over the same function many times, what is not realistic is that we are passing the model parameters to/from the gpu for each and every input example. this is a complete waste; it's much more sensible to send them over to the gpu once at the start of the loop and retrieve them once at the end. this is an important and very common pattern.
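
to get a feel for how much is at stake, here's a back of the envelope estimate using the per call times from the gpu profile above. it's a deliberately naive sketch: the estimate helper and the op labels are hypothetical, and it assumes the op times simply add up and don't change when some transfers go away.

```python
# per call times (seconds) taken from the gpu profile above
per_call = {
    "gemm": 1.55e-03,
    "x_to_gpu": 1.15e-03,
    "m_to_gpu": 1.14e-03,
    "b_to_gpu": 1.13e-03,
    "y_to_host": 8.99e-04,
}
calls = 500

def estimate(ops):
    # hypothetical cost model: total time is just the sum of the op times
    return calls * sum(per_call[op] for op in ops)

everything = estimate(["gemm", "x_to_gpu", "m_to_gpu", "b_to_gpu", "y_to_host"])
params_on_gpu = estimate(["gemm", "x_to_gpu", "y_to_host"])
print("transfer everything each call: %.2f sec" % everything)
print("m & b kept on the gpu:         %.2f sec" % params_on_gpu)
```

under those assumptions, leaving m & b on the gpu should shave over a second off the loop.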

how do we fix this? it's actually pretty simple; shared variables. yay!

consider the following; speed_test_theano_shared.py

# define data
np.random.seed(123)
m = np.random.randn(1000, 1000).astype('float32')
x = np.random.randn(1000, 1000).astype('float32')
b = np.random.randn(1000, 1000).astype('float32')

# define a symbolic expression of the equations in theano
tm = theano.shared(m)  # copy m over to gpu once explicitly
tx = T.matrix("x")
tb = theano.shared(b)  # copy b over to gpu once explicitly
ty = T.add(T.dot(tm, tx), tb)
line = theano.function(inputs=[tx], outputs=[ty])  # don't pass m & b each call

# then run same loop as before
start = time.time()
for i in range(500):
    y, = line(x)
print "theano", time.time()-start, "sec"

print tm.get_value().shape  # note: we can get the value back at any time

reviewing the debug output we can see this removes a stack of the GpuFromHost calls.

HostFromGpu [@A] ''   2
 |GpuGemm{no_inplace} [@B] ''   1
   |<CudaNdarrayType(float32, matrix)> [@C]
   |TensorConstant{1.0} [@D]
   |<CudaNdarrayType(float32, matrix)> [@E]
   |GpuFromHost [@F] ''   0
   | |x [@G]
   |TensorConstant{1.0} [@D]

and we're down to < 2s

$ THEANO_FLAGS=device=gpu,profile=True python speed_test_theano_shared.py
Using gpu device 0: GeForce GTX 970
theano 1.93515706062 sec
...
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  44.7%    44.7%       0.804s       1.61e-03s    500     1   GpuGemm{no_inplace}
  30.2%    74.9%       0.543s       1.09e-03s    500     0   GpuFromHost(x)
  25.1%   100.0%       0.451s       9.01e-04s    500     2   HostFromGpu(GpuGemm{no_inplace}.0)
...

what's even crazier is we can go further by moving the x and y matrices onto the gpu too. it turns out this isn't too far fetched since if x and y were representing training examples we'd be iterating over them anyways (and if we could fit them all onto the gpu that'd be great).

# define data
np.random.seed(123)
m = np.random.randn(1000, 1000).astype('float32')
x = np.random.randn(1000, 1000).astype('float32')
b = np.random.randn(1000, 1000).astype('float32')

# define a symbolic expression of the equations in theano
tm = theano.shared(m)
tx = theano.shared(x)
tb = theano.shared(b)
ty = theano.shared(np.zeros((1000, 1000)).astype('float32'))  # we need a shared var for y now
mx_b = T.add(T.dot(tm, tx), tb)
# and compile it
train = theano.function(inputs=[], updates={ty: mx_b})  # update y on gpu

# then run same loop as before
start = time.time()
for i in range(500):
    train()  # now there's no input/output
print tm.get_value().shape
print "theano", time.time()-start, "sec"

the debug graph is like the cpu graph now, just one gemm call.

GpuGemm{no_inplace} [@A] ''   0
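 |<CudaNdarrayType(float32, matrix)> [@B]
 |TensorConstant{1.0} [@C]
 |<CudaNdarrayType(float32, matrix)> [@D]
 |<CudaNdarrayType(float32, matrix)> [@E]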
 |TensorConstant{1.0} [@C]

and runs in under a second. x150 the numpy version. nice! :)

$ THEANO_FLAGS=device=gpu,profile=True python speed_test_theano_shared2.py
theano 0.896003007889 sec
...
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  100.0%   100.0%       0.800s       1.60e-03s     C      500        1   GpuGemm{no_inplace}
...
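
to wrap up, here's a tiny sketch that lines up all the wall clock times reported in this post against the numpy baseline. the labels are mine; the numbers are copied straight from the runs above.

```python
# wall clock times (seconds) reported by the runs in this post
timings = [
    ("numpy", 135.350140095),
    ("theano cpu", 136.371109009),
    ("theano gpu", 3.16091990471),
    ("theano gpu, m & b shared", 1.93515706062),
    ("theano gpu, everything shared", 0.896003007889),
]
baseline = timings[0][1]  # numpy is the timing baseline
speedups = [(name, baseline / sec) for name, sec in timings]
for name, speedup in speedups:
    print("%-30s x%.1f" % (name, speedup))
```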