April 06, 2018

# the standard convolutional classifier

the most familiar form of a convolutional network to most people is the type used for classifying images.

we can think of these types of networks as being made up of two halves.

the first is a sequence of convolutional layers with some form of spatial downsampling; e.g. pooling or having a stride > 1 ...

    some input                              (64, 64, 3)
    a convolution; stride 2, 8 kernels      (32, 32, 8)
    and another (16 kernels)                (16, 16, 16)
    and another (32 kernels)                (8, 8, 32)

... followed by a second half which is a sequence of fully connected layers ...

    output from convolutions    (8, 8, 32)
    flattened                   (2048)
    fully connected to 128      (128)
    fully connected to 10       (10)

(note: here, and following, we're going to ignore any leading batch dimension)

in these networks the first half "squeezes" spatial information into depth information while the second half acts as a standard classifier.
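the shapes in the two halves above can be traced with a bit of arithmetic. this is only a minimal sketch: it assumes the convolutions are padded so that a stride of 2 takes H to H // 2, and `conv_shape`, `flatten_shape` and `dense_shape` are hypothetical helper names, not functions from any library.

```python
def conv_shape(shape, num_kernels, stride=1):
    """output shape of a padded convolution; the padding is an assumption."""
    h, w, _ = shape
    return (h // stride, w // stride, num_kernels)

def flatten_shape(shape):
    h, w, c = shape
    return (h * w * c,)

def dense_shape(shape, units):
    (_,) = shape  # a fully connected layer expects an already flattened input
    return (units,)

# first half: trade spatial size for depth
x = (64, 64, 3)
for num_kernels in (8, 16, 32):
    x = conv_shape(x, num_kernels, stride=2)
    print(x)  # (32, 32, 8) then (16, 16, 16) then (8, 8, 32)

# second half: a standard classifier on the flattened volume
flat = flatten_shape(x)
hidden = dense_shape(flat, 128)
logits = dense_shape(hidden, 10)
print(flat, hidden, logits)  # (2048,) (128,) (10,)
```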

one property of any fully connected layer is that the number of parameters is dictated by the input size; in this example of a classifier it's the flattened size of the final volume of the first half (the 2048d vector)

this link between the number of parameters and the input size doesn't hold for the layers in the first half though; there the number of parameters is dictated by the kernel size and the number of input and output channels. specifically, the spatial size of the input doesn't matter.
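to put numbers on this, here's a small sketch of the two parameter counts. the post doesn't give the spatial kernel size, so the 3x3 kernel here is purely an assumption, as is having one bias per output; `conv_params` and `dense_params` are hypothetical helpers.

```python
def conv_params(kernel_h, kernel_w, in_channels, num_kernels):
    # each kernel has kernel_h * kernel_w * in_channels weights plus one bias;
    # note that the input's H and W appear nowhere in this count
    return kernel_h * kernel_w * in_channels * num_kernels + num_kernels

def dense_params(num_inputs, num_outputs):
    # one weight per (input, output) pair plus one bias per output
    return num_inputs * num_outputs + num_outputs

first_conv = conv_params(3, 3, in_channels=3, num_kernels=8)
dense_for_64 = dense_params(8 * 8 * 32, 128)     # flattened (8, 8, 32)
dense_for_128 = dense_params(16 * 16 * 32, 128)  # flattened (16, 16, 32)

print(first_conv)     # 224, whatever the image size
print(dense_for_64)   # 262272
print(dense_for_128)  # 1048704
```

the convolution's count stays at 224 for any image, while the fully connected layer's count changes as soon as the flattened size does.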

e.g. using pooling for downsampling for any arbitrary (H, W) ...

    input                                 (H, W, 3)
    convolution, stride=1, #kernels=5     (H, W, 5)
    pooling, size=2                       (H//2, W//2, 5)

... vs stride > 1 for downsampling.

    input                                 (H, W, 3)
    convolution, stride=2, #kernels=5     (H//2, W//2, 5)

so in our original example the first half of the network, going from (64, 64, 3) to (8, 8, 32), could actually be applied to an input of any spatial size. if, for example, we gave it an input of (128, 128, 3) we'd get an output of (16, 16, 32). but! we wouldn't be able to run this (16, 16, 32) through the second classifier half, since the flattened tensor would now be the wrong size (8192 instead of 2048).
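the mismatch can be sketched with shapes alone. `conv_half` and `classifier_half` are hypothetical stand-ins for the two halves of the network, and the halving per layer assumes even sizes.

```python
def conv_half(shape):
    """three stride-2 convolutions ending in 32 channels; works for any (H, W)."""
    h, w, _ = shape
    for _ in range(3):
        h, w = h // 2, w // 2
    return (h, w, 32)

def classifier_half(shape, expected_flat=2048):
    """flatten then fully connected layers; the first layer's weights fix the input size."""
    h, w, c = shape
    flat = h * w * c
    if flat != expected_flat:
        raise ValueError(f"expected {expected_flat} inputs but got {flat}")
    return (10,)

small = conv_half((64, 64, 3))
print(small, classifier_half(small))  # (8, 8, 32) (10,)

big = conv_half((128, 128, 3))
print(big)  # (16, 16, 32)
try:
    classifier_half(big)
except ValueError as error:
    print("can't classify:", error)
```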

# fully convolutional networks

now let's consider the common architecture for an image to image network. again it has two halves.

the first half is like the prior example; some convolutions with a form of downsampling as a way of trading spatial information for channel information.

but the second half isn't a classifier, it's the reverse of the first half; a sequence of convolutions with some form of upsampling.

this upsampling can either be deconvolutions (transposed convolutions) with a stride > 1 or something like nearest neighbour upsampling.

e.g.

    some input                            (64, 64, 3)
    convolution                           (64, 64, 8)
    pooling                               (32, 32, 8)
    convolution                           (32, 32, 16)
    pooling                               (16, 16, 16)
    nearest neighbour upsample resize     (32, 32, 16)
    convolution                           (32, 32, 8)
    nearest neighbour upsample resize     (64, 64, 8)
    convolution                           (64, 64, 3)

we can see that none of these operations require a fixed spatial size, so it's fine to apply them to an image of whatever size, even something like (128000, 128000, 3) which would produce an output of (128000, 128000, 3). this ability to apply to huge images is a great trick for when you're dealing with huge image data like medical scans.
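here's the same image to image network as a shape trace, to show that nothing in it depends on a fixed size. it's a sketch with hypothetical helpers (`conv`, `pool`, `upsample`), assuming stride-1 convolutions padded to keep the spatial size and size-2 pooling that rounds down.

```python
def conv(shape, num_kernels):
    h, w, _ = shape
    return (h, w, num_kernels)  # stride 1 with padding (assumed) keeps H and W

def pool(shape, size=2):
    h, w, c = shape
    return (h // size, w // size, c)

def upsample(shape, factor=2):
    h, w, c = shape
    return (h * factor, w * factor, c)

def image_to_image(shape):
    x = conv(shape, 8)
    x = pool(x)
    x = conv(x, 16)
    x = pool(x)
    x = upsample(x)
    x = conv(x, 8)
    x = upsample(x)
    return conv(x, 3)

print(image_to_image((64, 64, 3)))          # (64, 64, 3)
print(image_to_image((128000, 128000, 3)))  # (128000, 128000, 3)
print(image_to_image((66, 66, 3)))          # (64, 64, 3)
```

the last line shows a caveat of this particular sketch: with pooling that rounds down, a size that isn't a multiple of 4 comes back slightly smaller than it went in.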

so what does it mean then for a network to be "fully convolutional"? for me it's basically not using any operations that require a fixed tensor size as input.

in the above example we'd say we're training on "patches" of (64, 64). these would probably be random crops from within a larger image and, note, that means the training images don't even need to share the same resolution or aspect ratio (as long as each is at least 64x64).
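picking those patches could look something like the following. this is a sketch only; `random_crop_box` is a hypothetical helper that returns crop coordinates and doesn't touch any pixel data.

```python
import random

def random_crop_box(image_h, image_w, patch=64, rng=random):
    """pick a random (top, left, bottom, right) patch inside an image."""
    if image_h < patch or image_w < patch:
        raise ValueError("image is smaller than the patch size")
    top = rng.randint(0, image_h - patch)    # randint includes both endpoints
    left = rng.randint(0, image_w - patch)
    return (top, left, top + patch, left + patch)

# training images of different resolutions and aspect ratios all yield 64x64 patches
rng = random.Random(0)
for image_h, image_w in [(480, 640), (1024, 1024), (64, 200)]:
    top, left, bottom, right = random_crop_box(image_h, image_w, rng=rng)
    print((image_h, image_w), "->", (bottom - top, right - left))
```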

## 1x1 convolutions

a 1x1 kernel in a convolutional layer at first appears a bit strange. why would you bother?

consider a volume of (1, 1, 3) that we apply a 1x1 convolution to. with 5 kernels we'd end up getting a volume of (1, 1, 5). an interesting interpretation of this is that it's exactly equivalent to having a fully connected layer between 3 inputs and 5 outputs.

a volume then of, say, (10, 20, 3) that we apply this same convolution to gives a volume of (10, 20, 5), so what we're doing is equivalent to applying the same fully connected "layer" per pixel to the (10, 20) input.
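this equivalence can be checked directly. the sketch below uses plain python lists with hypothetical helpers: `conv2d` is a generic stride-1, unpadded convolution and `dense` is a fully connected layer; the weights are small random integers so the comparison is exact.

```python
import random

def conv2d(volume, kernel, biases):
    """stride-1 unpadded convolution; kernel is indexed [dy][dx][in_channel][out_channel]."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    height, width = len(volume), len(volume[0])
    channels_in, channels_out = len(kernel[0][0]), len(biases)
    output = []
    for y in range(height - kernel_h + 1):
        row = []
        for x in range(width - kernel_w + 1):
            pixel = []
            for o in range(channels_out):
                total = biases[o]
                for dy in range(kernel_h):
                    for dx in range(kernel_w):
                        for i in range(channels_in):
                            total += volume[y + dy][x + dx][i] * kernel[dy][dx][i][o]
                pixel.append(total)
            row.append(pixel)
        output.append(row)
    return output

def dense(vector, weights, biases):
    """fully connected layer; weights[i][o] connects input i to output o."""
    return [biases[o] + sum(vector[i] * weights[i][o] for i in range(len(vector)))
            for o in range(len(biases))]

rng = random.Random(0)
weights = [[rng.randint(-3, 3) for _ in range(5)] for _ in range(3)]  # 3 inputs, 5 outputs
biases = [rng.randint(-3, 3) for _ in range(5)]
volume = [[[rng.randint(0, 9) for _ in range(3)] for _ in range(20)] for _ in range(10)]

kernel_1x1 = [[weights]]  # the same weights viewed as a (1, 1, 3, 5) kernel
conv_out = conv2d(volume, kernel_1x1, biases)
dense_out = [[dense(pixel, weights, biases) for pixel in row] for row in volume]

print(len(conv_out), len(conv_out[0]), len(conv_out[0][0]))  # 10 20 5
print(conv_out == dense_out)  # True
```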

tie this in with the idea of the fully convolutional network...

    some input                                                        (64, 64, 3)
    some convolutions + downsampling                                  (32, 32, 8)
    more convolutions + downsampling                                  (16, 16, 16)
    more convolutions + downsampling                                  (8, 8, 32)
    a 1x1 convolution, stride=1, #kernels=10                          (8, 8, 10)
    a 1x1 convolution, stride=1, #kernels=1 & sigmoid activation      (8, 8, 1)

what we've got is like our original classifier example but operating in a fully convolutional way; the last two layers are the same as a sequence of fully connected layers going from 32 to 10 to 1 output, but applied across an 8x8 grid in parallel.

and as before we'd be able to train this on an input of (64, 64, 3) with an output of (8, 8, 1) but apply it to whatever multiple of the input size we'd like. e.g. an input of (640, 320, 3) would result in an output of (80, 40, 1).

we can think of this final (80, 40, 1) as a kind of heat map of whatever is being captured by the (8, 8, 1) output; it covers the same area as a 10x5 grid of those (8, 8, 1) outputs.
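the shape arithmetic behind those numbers, as a sketch: `fully_conv_classifier_shape` is a hypothetical helper that assumes three downsampling steps of 2 followed by 1x1 convolutions, which only change the channel count.

```python
def fully_conv_classifier_shape(shape, num_downsamples=3):
    h, w, _ = shape
    for _ in range(num_downsamples):
        h, w = h // 2, w // 2
    # the 1x1 convolutions take channels from 32 to 10 to 1 and leave (h, w) alone
    return (h, w, 1)

train_out = fully_conv_classifier_shape((64, 64, 3))
apply_out = fully_conv_classifier_shape((640, 320, 3))
print(train_out)  # (8, 8, 1)
print(apply_out)  # (80, 40, 1)

# how many training-sized outputs fit across the larger output
grid = (apply_out[0] // train_out[0], apply_out[1] // train_out[1])
print(grid)  # (10, 5)
```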

the papers where i first saw these ideas were OverFeat (Sermanet et al) & Network in Network (Lin et al)