brain of mat kelcey...

simple tensorboard visualisation for gradient norms

June 27, 2017 at 09:45 PM | categories: Uncategorized

( i've had three people recently ask me about how i was visualising gradient norms in tensorboard so, according to my three strikes rule, i now have to "automate" it by writing a blog post about it )

one really useful visualisation you can do while training a network is visualise the norms of the variables and gradients.

how are they useful? some random things that immediately come to mind include the fact that...

  • diverging norms of variables might mean you haven't got enough regularisation.
  • zero norm gradient means learning has somehow stopped.
  • exploding gradient norms mean learning is unstable and you might need to clip (hellloooo deep reinforcement learning).
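
as a rough illustration of the clipping mentioned in the last point, here's a minimal pure python sketch of what clipping by norm does to a single gradient. the helper names (l2_norm, clip_by_norm) and the numbers are made up for illustration; in tensorflow 1.x you'd more likely reach for tf.clip_by_norm or tf.clip_by_global_norm on the gradients before calling apply_gradients.

```python
import math

def l2_norm(values):
    """l2 norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))

def clip_by_norm(values, max_norm):
    """rescale values so their l2 norm is at most max_norm."""
    norm = l2_norm(values)
    if norm <= max_norm:
        # already small enough; leave the direction and size alone
        return list(values)
    scale = max_norm / norm
    return [v * scale for v in values]

# hypothetical "exploding" gradient with norm 50
exploding = [30.0, -40.0]
clipped = clip_by_norm(exploding, max_norm=5.0)
print(l2_norm(exploding), l2_norm(clipped))
```

the direction of the gradient is kept; only its length is capped, which is why the norm is the thing worth watching.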

let's consider a simple bounding box regression conv net (the specifics aren't important, i just grabbed this from another project, just needed something for illustration) ...

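import tensorflow as tf  # tensorflow 1.x
slim = tf.contrib.slim
l2 = slim.l2_regularizer  # assumption: l2 is an alias for slim's l2 regulariser

# (256, 320, 3)  input image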

model = slim.conv2d(images, num_outputs=8, kernel_size=3, stride=2, weights_regularizer=l2(0.01), scope="c0")
# (128, 160, 8)

model = slim.conv2d(model, num_outputs=16, kernel_size=3, stride=2, weights_regularizer=l2(0.01), scope="c1")
# (64, 80, 16)

model = slim.conv2d(model, num_outputs=32, kernel_size=3, stride=2, weights_regularizer=l2(0.01), scope="c2")
# (32, 40, 32)

model = slim.conv2d(model, num_outputs=4, kernel_size=1, stride=1, weights_regularizer=l2(0.01), scope="c3")
# (32, 40, 4)  1x1 bottleneck to get num of params down between c2 & h0

model = slim.dropout(model, keep_prob=0.5, is_training=is_training)
model = slim.flatten(model)  # flatten so h0 sees one vector per example
# (5120,)  32x40x4 -> 32 is where the majority of params are so going to be most prone to overfitting.

model = slim.fully_connected(model, num_outputs=32, weights_regularizer=l2(0.01), scope="h0")
# (32,)

model = slim.fully_connected(model, num_outputs=4, activation_fn=None, scope="out")
# (4,) = bounding box (x1, y1, dx, dy)

a simple training loop using feed_dict would be something along the lines of ...

optimiser = tf.train.AdamOptimizer()
train_op = optimiser.minimize(loss=some_loss)

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  while True:
    _ = sess.run(train_op, feed_dict=blah)

but if we want to get access to gradients we need to do things a little differently and call compute_gradients and apply_gradients ourselves ...

optimiser = tf.train.AdamOptimizer()
gradients = optimiser.compute_gradients(loss=some_loss)
train_op = optimiser.apply_gradients(gradients)

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  while True:
    _ = sess.run(train_op, feed_dict=blah)

with access to the gradients we can inspect them and create tensorboard summaries for them ...

import itertools

optimiser = tf.train.AdamOptimizer()
# list of (gradient, variable) pairs
gradients = optimiser.compute_gradients(loss=some_loss)
l2_norm = lambda t: tf.sqrt(tf.reduce_sum(tf.pow(t, 2)))
for gradient, variable in gradients:
  # one summary for the gradient norm and one for the variable norm
  tf.summary.histogram("gradients/" + variable.name, l2_norm(gradient))
  tf.summary.histogram("variables/" + variable.name, l2_norm(variable))
train_op = optimiser.apply_gradients(gradients)

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  summaries_op = tf.summary.merge_all()
  summary_writer = tf.summary.FileWriter("/tmp/tb", sess.graph)
  for step in itertools.count():
    _, summary = sess.run([train_op, summaries_op], feed_dict=blah)
    summary_writer.add_summary(summary, step)

( though we may only want to run the expensive summaries_op once in a while... )
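
one way to only pay for the summaries every so often is to branch on the step count. this is a framework-free sketch, not the code i actually run: run_train, run_train_with_summaries and write_summary are hypothetical stand-ins for sess.run(train_op, ...), sess.run([train_op, summaries_op], ...) and summary_writer.add_summary, and summary_every=100 is an arbitrary choice.

```python
import itertools

def training_loop(run_train, run_train_with_summaries, write_summary,
                  summary_every=100, max_steps=None):
    """run training steps, only computing summaries every summary_every steps."""
    steps = itertools.count() if max_steps is None else range(max_steps)
    for step in steps:
        if step % summary_every == 0:
            # expensive path: train and also evaluate the merged summaries
            summary = run_train_with_summaries()
            write_summary(summary, step)
        else:
            # cheap path: just the train op
            run_train()

# stand-in callables so the sketch runs without tensorflow
written = []
training_loop(
    run_train=lambda: None,
    run_train_with_summaries=lambda: "serialised summary",
    write_summary=lambda summary, step: written.append(step),
    summary_every=100,
    max_steps=250)
print(written)  # → [0, 100, 200]
```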

with logging like this we get 8 histogram summaries per layer; the cross product of

  • layer weights vs layer biases
  • variable vs gradients
  • norms vs values

e.g. for conv layer c3 in the above model we get the summaries shown below. note: nothing terribly interesting in this example, but a couple of things

  • red : very large magnitude of gradient very early in training; this is classic variable rescaling.
  • blue: non zero gradients at end of training, so stuff still happening at this layer in terms of the balance of l2 regularisation vs loss. (note: no bias regularisation means it'll continue to drift)

gradient norms with ggplot

sometimes the histograms aren't enough and you need to do some more serious plotting. in these cases i hackily wrap the gradient calc in tf.Print and plot with ggplot

e.g. here's some gradient norms from an old actor / critic model (cartpole++)
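
the glue between the printed norms and the plotting is just some log munging. below is a minimal sketch of that step, under a loud assumption: each interesting log line ends with a tag followed by a single bracketed value, e.g. "actor_grad_norm [0.5]". that shape, the tag names and the example lines are all made up for illustration, so check what your own tf.Print calls actually emit before trusting the regex. the output is a long-format csv (step, tag, norm) which is convenient to feed to ggplot.

```python
import csv
import io
import re

# assumed line shape: "<anything> <tag> [<float>]" at the end of the line
LINE_RE = re.compile(r"(\w+)\s*\[([-+0-9.eE]+)\]\s*$")

def norms_to_rows(lines):
    """extract (step, tag, norm) rows; step counts occurrences per tag."""
    rows = []
    counts = {}
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a norm line
        try:
            value = float(match.group(2))
        except ValueError:
            continue  # bracketed text that isn't a number
        tag = match.group(1)
        step = counts.get(tag, 0)
        counts[tag] = step + 1
        rows.append((step, tag, value))
    return rows

def rows_to_csv(rows):
    """serialise rows as csv text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["step", "tag", "norm"])
    writer.writerows(rows)
    return out.getvalue()

# made up example log lines
example_log = [
    "some unrelated log line",
    "actor_grad_norm [0.5]",
    "critic_grad_norm [12.25]",
    "actor_grad_norm [0.25]",
]
rows = norms_to_rows(example_log)
print(rows_to_csv(rows))
```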

related: explicit simple_value and image summaries

on a related note, you can also write summaries explicitly, which is sometimes easier than generating them through the graph.

i find this especially true for image summaries where there are many pure python options for post processing with, say, PIL

e.g. explicit scalar values

summary_writer = tf.summary.FileWriter("/tmp/blah")
# build the summary proto by hand instead of running a summary op
summary = tf.Summary(value=[
  tf.Summary.Value(tag="foo", simple_value=1.0),
  tf.Summary.Value(tag="bar", simple_value=2.0),
])
summary_writer.add_summary(summary, step)

e.g. explicit image summaries using PIL post processing

import io
from PIL import Image, ImageDraw

summary_values = []  # (note: could already contain simple_values like above)
for i in range(6):
  # wrap np array with PIL image and canvas
  img = Image.fromarray(some_np_array_probably_output_of_network[i])
  canvas = ImageDraw.Draw(img)
  # draw a box in the top left
  canvas.line([0,0, 0,10, 10,10, 10,0, 0,0], fill="white")
  # write some text
  canvas.text(xy=(0,0), text="some string to add to image", fill="black")
  # serialise out to an image summary
  # (io.BytesIO rather than StringIO.StringIO so this also works on python 3)
  sio = io.BytesIO()
  img.save(sio, format="png")
  image = tf.Summary.Image(height=256, width=320, colorspace=3,  # RGB
                           encoded_image_string=sio.getvalue())
  summary_values.append(tf.Summary.Value(tag="img/%d" % i, image=image))
summary_writer.add_summary(tf.Summary(value=summary_values), step)