brain of mat kelcey

fully convolutional networks

April 06, 2018

the standard convolutional classifier

the most familiar form of a convolutional network to most people is the type used for classifying images.

we can think of these types of networks as being made up of two main parts;

the first is a sequence of convolutional layers with some form of spatial downsampling; e.g. pooling or having a stride > 1 ...

some input (64, 64, 3)
a convolution; stride 2, kernel size 8 (32, 32, 8)
and another (kernel size 16) (16, 16, 16)
and another (kernel size 32) (8, 8, 32)

... followed by the second part which is a sequence of fully connected layers ...

output from convolutions (8, 8, 32)
flattened (2048)
fully connected to 128 (128)
fully connected to 10 (10)

(note: here, and following, we're going to ignore any leading batch dimension)

in these networks the first part "squeezes" spatial information into depth information while the second part acts as a standard classifier.

one property of any fully connected layer is that the number of parameters is dictated by the input size; in this type of classifier it's,usually the flattened size of the final volume of the first part.

this property is not true the for the layers in the first part though, there the number of parameters is not dictated by the input size. the kernel size and number of channels are important but the actual spatial size of the input doesn't matter.

e.g. using pooling for downsampling for whatever (H, W)

input (H, W, 3)
convolution, stride=1, #kernels=5 (H, W, 5)
pooling, size=2 (H//2, W//2, 5)

... vs stride > 1 for downsampling.

input (H, W, 3)
convolution, stride=2, #kernels=5 (H//2, W//2, 5)

so in our original example the first part of the network going from (64, 64, 4) to (8, 8, 32) could be applied to an input of any spatial size. if for example we gave an input of (128, 128, 4) we'd get an output of (16, 16, 32). you wouldn't though be able to run this (16, 16, 32) through the second classifier part, since the size of the flattened tensor would now be the wrong.

fully convolutional networks

now let's consider the common architecture for an image to image network. again it has two parts.

the first part is like the prior example; some convolutions with a form of downsampling as a way of trading spatial information for channel information.

but the second part isn't a classifier, it's the reverse of the first part; a sequence of convolutions with some form of upsampling;

this upsampling can can either deconvolutions with a stride>1 or something like nearest neighbour upsampling


some input (64, 64, 3)
convolution (64, 64, 8)
pooling (32, 32, 8)
convolution (32, 32, 16)
pooling (16, 16, 16)
nearest neigbour upsample resize (32, 32, 16)
convolution (32, 32, 8)
nearest neigbour upsample resize (64, 64, 8)
convolution (64, 64, 3)

we can see that non of these operations require a fixed spatial size, so it's fine to apply them to an image of whatever size, even something like (128000, 128000, 3) which would produce an output of (128000, 128000, 3)

so what does it mean then for a network to be "fully convolutional"? for me it's basically not using any operations that require a fixed tensor size as input.

for this case we'd say we're training on "patches" on (64, 64) and the ability to apply to huge images is a great trick for when you're dealing with huge image data like medical scans.

1x1 convolutions

a 1x1 kernel in a convolutional layer at first appears a bit strange. why would you bother?

consider a volume of (1, 1, 3) that we apply a 1x1 convolution to. with a kernel size of 5 we'd end up getting a volume of (1, 1, 5). an interesting interpretation of this is that it's exactly equivalent to having a fully connected layer between 3 inputs and 5 outputs.

a volume then of, say, (10, 20, 3) that we apply this same convolution to gives a volume of (10, 20, 5) so what what we're doing is equivalent to applying fully connected "layer" per pixel of the (10, 20) input.

tie this in with the idea of the fully convolutional network...

some input (64, 64, 3)
some convolutions + downsampling (32, 32, 8)
more convolutions + downsampling (16, 16, 16)
more convolutions + downsampling (8, 8, 32)
a 1x1 convolution, stride=1 & kernel size 10 (8, 8, 10)
a 1x1 convolution, stride=1, kernel size 1 & sigmoid activation (8, 8, 1)

what we've got is like our original classifier example but that's operating in a fully convolutional way across a (8, 8) grid. and as before we'd be able to train this on an input (64, 64, 3) to and output (8, 8, 1) but apply it to whatever multiple of in the input size we'd like. e.g. input (640, 320, 3) which would just result in output of (80, 40, 1)

we can think of this final (80, 40, 1) as kind of similar to a heat map across whatever is being captured by the (8, 8, 1) output.

the papers were i first saw these ideas were OverFeat (Sermanet et al) & Network in Network (Lin et al)