brain of mat kelcey...
simple tensorboard visualisation for gradient norms
June 27, 2017 at 09:45 PM | categories: Uncategorized

( i've had three people recently ask me about how i was visualising gradient norms in tensorboard so, according to my three strikes rule, i now have to "automate" it by writing a blog post about it )
one really useful visualisation you can do while training a network is to visualise the norms of the variables and gradients.
how are they useful? some random things that immediately come to mind include the fact that...
- diverging norms of variables might mean you haven't got enough regularisation.
- zero norm gradient means learning has somehow stopped.
- exploding gradient norms means learning is unstable and you might need to clip (hellloooo deep reinforcement learning).
let's consider a simple bounding box regression conv net (the specifics aren't important, i just grabbed this from another project, just needed something for illustration) ...
import tensorflow as tf
slim = tf.contrib.slim
l2 = slim.l2_regularizer

# (256, 320, 3) input image
model = slim.conv2d(images, num_outputs=8, kernel_size=3, stride=2,
                    weights_regularizer=l2(0.01), scope="c0")
# (128, 160, 8)
model = slim.conv2d(model, num_outputs=16, kernel_size=3, stride=2,
                    weights_regularizer=l2(0.01), scope="c1")
# (64, 80, 16)
model = slim.conv2d(model, num_outputs=32, kernel_size=3, stride=2,
                    weights_regularizer=l2(0.01), scope="c2")
# (32, 40, 32)
model = slim.conv2d(model, num_outputs=4, kernel_size=1, stride=1,
                    weights_regularizer=l2(0.01), scope="c3")
# (32, 40, 4) 1x1 bottleneck to get num of params down between c2 & h0
model = slim.dropout(model, keep_prob=0.5, is_training=is_training)
model = slim.flatten(model)  # flatten the (32, 40, 4) conv output for the fully connected layers
# (5120,) 32x40x4 -> 32 is where the majority of params are so going to be most prone to overfitting.
model = slim.fully_connected(model, num_outputs=32,
                             weights_regularizer=l2(0.01), scope="h0")
# (32,)
model = slim.fully_connected(model, num_outputs=4, activation_fn=None, scope="out")
# (4,) = bounding box (x1, y1, dx, dy)
a simple training loop using feed_dict would be something along the lines of ...
optimiser = tf.train.AdamOptimizer()
train_op = optimiser.minimize(loss=some_loss)

with tf.Session() as sess:
    while True:
        _ = sess.run(train_op, feed_dict=blah)
but if we want to get access to the gradients we need to do things a little differently and call compute_gradients and apply_gradients ourselves ...
optimiser = tf.train.AdamOptimizer()
gradients = optimiser.compute_gradients(loss=some_loss)
train_op = optimiser.apply_gradients(gradients)

with tf.Session() as sess:
    while True:
        _ = sess.run(train_op, feed_dict=blah)
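( as an aside, this split is also the natural place to do the gradient clipping mentioned above. a minimal sketch, using an arbitrary clip norm of 5.0, might be ... )

optimiser = tf.train.AdamOptimizer()
gradients = optimiser.compute_gradients(loss=some_loss)
# unzip into gradients and variables, clip the gradients by their global norm,
# then zip them back up for apply_gradients. the 5.0 threshold is just a placeholder.
grads, variables = zip(*gradients)
clipped_grads, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)
train_op = optimiser.apply_gradients(zip(clipped_grads, variables))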
with access to the gradients we can inspect them and create tensorboard summaries for them ...
optimiser = tf.train.AdamOptimizer()
gradients = optimiser.compute_gradients(loss=some_loss)

l2_norm = lambda t: tf.sqrt(tf.reduce_sum(tf.pow(t, 2)))

for gradient, variable in gradients:
    tf.summary.histogram("gradients/" + variable.name, l2_norm(gradient))
    tf.summary.histogram("variables/" + variable.name, l2_norm(variable))

train_op = optimiser.apply_gradients(gradients)

with tf.Session() as sess:
    summaries_op = tf.summary.merge_all()
    summary_writer = tf.summary.FileWriter("/tmp/tb", sess.graph)
    for step in itertools.count():
        _, summary = sess.run([train_op, summaries_op], feed_dict=blah)
        summary_writer.add_summary(summary, step)
( though we may only want to run the expensive summaries_op once in a while ... )
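e.g. the training loop above could become something like the following (just a sketch; running the summaries every 100 steps is an arbitrary choice) ...

for step in itertools.count():
    if step % 100 == 0:
        # include the summaries op only every 100th step
        _, summary = sess.run([train_op, summaries_op], feed_dict=blah)
        summary_writer.add_summary(summary, step)
    else:
        _ = sess.run(train_op, feed_dict=blah)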
with logging like this we get 8 histogram summaries per layer; the cross product of
- layer weights vs layer biases
- variable vs gradients
- norms vs values (the snippet above only logs the norms; see the sketch after this list for the values)
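to log the raw value histograms as well as the norms, the loop would be something like this (a sketch only; the tag names are my own choice) ...

for gradient, variable in gradients:
    # norms, as before
    tf.summary.histogram("gradient_norms/" + variable.name, l2_norm(gradient))
    tf.summary.histogram("variable_norms/" + variable.name, l2_norm(variable))
    # raw values
    tf.summary.histogram("gradients/" + variable.name, gradient)
    tf.summary.histogram("variables/" + variable.name, variable)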
e.g. for conv layer c3 in the above model we get the summaries shown below. there's nothing terribly interesting in this example, but a couple of things stand out ...
- red : very large magnitude of gradient very early in training; this is classic variable rescaling.
- blue: non zero gradients at end of training, so stuff still happening at this layer in terms of the balance of l2 regularisation vs loss. (note: no bias regularisation means it'll continue to drift)
gradient norms with ggplot
sometimes the histograms aren't enough and you need to do some more serious plotting. in these cases i hackily wrap the gradient calc in tf.Print and plot with ggplot
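( the wrapping is roughly something like the following sketch, assuming the gradients list and l2_norm helper from the earlier snippet; the message string is just whatever makes the lines easy to grep out of the training log for plotting )

# replace each gradient with a tf.Print wrapped version that dumps its norm
# to stderr every time it's evaluated.
for i, (gradient, variable) in enumerate(gradients):
    gradients[i] = (tf.Print(gradient, [l2_norm(gradient)],
                             message="gradient_norm %s " % variable.name),
                    variable)
train_op = optimiser.apply_gradients(gradients)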
e.g. here's some gradient norms from an old actor / critic model (cartpole++)
related: explicit simple_value and image summaries
on a related note you can also explicitly write summaries, which is sometimes easier than generating them through the graph.
i find this especially true for image summaries, where there are many pure python options for post processing with, say, PIL.
e.g. explicit scalar values
summary_writer = tf.summary.FileWriter("/tmp/blah")
summary = tf.Summary(value=[
    tf.Summary.Value(tag="foo", simple_value=1.0),
    tf.Summary.Value(tag="bar", simple_value=2.0),
])
summary_writer.add_summary(summary, step)
e.g. explicit image summaries using PIL post processing
from PIL import Image, ImageDraw
import StringIO  # python2

summary_values = []  # (note: could already contain simple_values like above)
for i in range(6):
    # wrap np array with PIL image and canvas
    img = Image.fromarray(some_np_array_probably_output_of_network[i])
    canvas = ImageDraw.Draw(img)
    # draw a box in the top left
    canvas.line([0, 0, 0, 10, 10, 10, 10, 0, 0, 0], fill="white")
    # write some text
    canvas.text(xy=(0, 0), text="some string to add to image", fill="black")
    # serialise out to an image summary
    sio = StringIO.StringIO()
    img.save(sio, format="png")
    image = tf.Summary.Image(height=256, width=320,
                             colorspace=3,  # RGB
                             encoded_image_string=sio.getvalue())
    summary_values.append(tf.Summary.Value(tag="img/%d" % i, image=image))
summary_writer.add_summary(tf.Summary(value=summary_values), step)