Convolutional Autoencoders

The convolution operator allows filtering an input signal in order to extract some part of its content. Autoencoders in their traditional formulation do not take into account the fact that a signal can be seen as a sum of other signals. Convolutional Autoencoders, instead, use the convolution operator to exploit this observation. They learn to encode the input in a set of simple signals and then try to reconstruct the input from them.


Figure 1. A convolution with no padding and unit stride between a 4x4x1 input and a 3x3x1 convolutional filter. The result is a 2x2x1 activation map.

In the general continuous case, a convolution is defined as the integral of the product of two functions (signals) after one is reversed and shifted:

$$ f(t) * g(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau $$

As a result, a convolution produces a new function (signal). Convolution is a commutative operation, therefore:

$$ f(t) * g(t) = g(t) * f(t) $$
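Commutativity is easy to check numerically in the discrete, finite case. The sketch below is only an illustration: `convolve1d` is a hypothetical helper written in plain Python for this example, and the two sequences are arbitrary.

```python
def convolve1d(f, g):
    """Full discrete convolution of two finite sequences."""
    out = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            # Each pair (i, j) contributes to the output sample i + j.
            out[i + j] += fi * gj
    return out


f = [1, 2, 3]
g = [0, 1, 2]

print(convolve1d(f, g))  # [0, 1, 4, 7, 6]
print(convolve1d(g, f))  # [0, 1, 4, 7, 6]
```

Swapping the arguments only changes the order in which the products are accumulated, not the result.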

Autoencoders can, in principle, be trained on inputs living in a generic $n$-dimensional space. In practice, AEs are often used to extract features from 2D, finite and discrete input signals, such as digital images.

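In the 2D discrete space, the convolution operation is defined as:

$$ O(i, j) = \sum_{u=-\infty}^{\infty} \sum_{v=-\infty}^{\infty} F(u, v)\, I(i - u, j - v) $$

In the image domain, where the signals are finite, this formula becomes:

$$ O(i, j) = \sum_{u=-k}^{k} \sum_{v=-k}^{k} F(u, v)\, I(i - u, j - v) $$

where:

  • $O(i, j)$ is the output pixel, in position $(i, j)$
  • $2k + 1$ is the side of a square, odd convolutional filter
  • $F$ is the convolutional filter
  • $I$ is the input image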

This operation (single convolutional step) is done for every location of the input image that completely overlaps with the convolutional filter as shown in Figure 1.
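The single convolutional step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not an efficient implementation: `conv2d_valid` is a hypothetical name, images and filters are nested lists, the filter is square with odd side, and the stride is 1. The filter is flipped, as the convolution definition requires; many deep learning libraries compute the unflipped variant (cross-correlation) instead.

```python
def conv2d_valid(image, kernel):
    """2D discrete convolution, evaluated only where the kernel fully overlaps the image."""
    k = len(kernel) // 2  # the kernel side is 2k + 1
    rows, cols = len(image), len(image[0])
    out = []
    for i in range(k, rows - k):
        out_row = []
        for j in range(k, cols - k):
            acc = 0
            for u in range(-k, k + 1):
                for v in range(-k, k + 1):
                    # F(u, v) * I(i - u, j - v): the minus signs flip the kernel.
                    acc += kernel[u + k][v + k] * image[i - u][j - v]
            out_row.append(acc)
        out.append(out_row)
    return out


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
# A kernel with a single 1 in the top-left corner makes the flip visible.
corner = [
    [1, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
]
ones = [[1, 1, 1] for _ in range(3)]

print(conv2d_valid(image, corner))  # [[11, 12], [15, 16]]
print(conv2d_valid(image, ones))    # [[54, 63], [90, 99]]
```

As in Figure 1, a 4x4 input and a 3x3 filter give a 2x2 activation map.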

Figure 2. The convolution of an image with a hand-crafted filter (also called a kernel) for edge detection extracts the edges from the input image.

As Figure 2 shows, the result of a convolution depends on the values of the convolutional filter. There are many manually engineered convolutional filters, each used for an image processing task such as denoising or blurring.
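To make the dependence on the filter values concrete, the sketch below convolves the same synthetic image with two hand-crafted kernels. Everything here is an assumption made for illustration: the image is a vertical step from 0 to 10, the edge detector is the common 4-neighbour Laplacian kernel (not necessarily the one used in Figure 2), and the second kernel is an unnormalized box filter standing in for a blur.

```python
def conv2d_valid(image, kernel):
    """2D convolution restricted to positions where the kernel fully overlaps the image."""
    k = len(kernel) // 2
    return [
        [
            sum(
                kernel[u + k][v + k] * image[i - u][j - v]
                for u in range(-k, k + 1)
                for v in range(-k, k + 1)
            )
            for j in range(k, len(image[0]) - k)
        ]
        for i in range(k, len(image) - k)
    ]


# Synthetic 5x6 image: dark left half, bright right half.
image = [[0, 0, 0, 10, 10, 10] for _ in range(5)]

laplacian = [
    [0, 1, 0],
    [1, -4, 1],
    [0, 1, 0],
]
box = [[1, 1, 1] for _ in range(3)]

edges = conv2d_valid(image, laplacian)
smoothed = conv2d_valid(image, box)

print(edges[0])     # [0, 10, -10, 0]
print(smoothed[0])  # [0, 30, 60, 90]
```

The Laplacian responds only around the step, while the box filter turns the sharp step into a ramp.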

The discrete 2D convolution has two additional parameters: the horizontal and vertical strides. They are the number of pixels to skip along each dimension of $I$ after performing a single convolutional step. Usually, the horizontal and vertical strides are equal and are denoted by $S$.

The result of a 2D discrete convolution of a square image with side $I_w$ (for simplicity, but it’s easy to generalize to a generic rectangular image) with a square convolutional filter with side $2k + 1$ is a square image $O$ with side:

$$ O_w = 1 + \frac{I_w - (2k + 1)}{S} \tag{1} $$
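The output side can be computed with a small helper. This sketch assumes the usual no-padding relation, output side = 1 + (input side - filter side) / stride; `output_side` is a hypothetical name, and rejecting strides that do not tile the input evenly is a design choice made here, not something the text prescribes.

```python
def output_side(input_side, filter_side, stride=1):
    """Side of the square activation map produced without padding."""
    span = input_side - filter_side
    if span < 0:
        raise ValueError("the filter is larger than the input")
    if span % stride != 0:
        raise ValueError("the stride does not tile the input evenly")
    return 1 + span // stride


print(output_side(4, 3))     # 2, the case of Figure 1
print(output_side(7, 3, 2))  # 3
```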

So far we have considered a gray-scale (single-channel) image convolved with a single convolutional filter. If the input image has more than one channel, say $D$ channels, the convolution operator spans all of these channels.

The general rule is that a convolutional filter must have the same number of channels as the image it is convolved with. It’s possible to generalize the concept of discrete 2D convolution by treating stacks of 2D signals as volumes.

Convolution among volumes

A volume is a rectangular parallelepiped completely defined by the triple $(W, H, D)$, where:

  • $W$ is its width
  • $H$ is its height
  • $D$ is its depth

A gray-scale image can be seen as a volume with $D = 1$, whilst an RGB image can be seen as a volume with $D = 3$.

A convolutional filter can also be seen as a volume of filters with depth $D$. In particular, we can think of the image and the filter as sets (the order doesn’t matter) of single-channel images/filters, $I = \{I_1, \dots, I_D\}$ and $F = \{F_1, \dots, F_D\}$.

It’s possible to generalize the previous convolution formula to take the depth into account:

$$ O(i, j) = \sum_{d=1}^{D} \sum_{u=-k}^{k} \sum_{v=-k}^{k} F_d(u, v)\, I_d(i - u, j - v) $$

The result of a convolution among volumes is called an activation map. The activation map is a volume with $D = 1$.

It may sound strange that a 2D convolution is performed among volumes, which are 3D objects. In reality, for an input volume with depth $D$, exactly $D$ 2D discrete convolutions are performed. Summing (collapsing) the $D$ activation maps produced is a way to treat this set of 2D convolutions as a single 2D convolution. In this way, every position of the resulting activation map contains the information extracted from the same input location through its whole depth.

Intuitively, one can think of this operation as a way to take into account the relations that exist among the RGB channels of a single input pixel.
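The "one 2D convolution per channel, then sum" description can be sketched as follows. The names `conv2d_valid` and `conv_volume` are hypothetical, and a volume is assumed to be stored channels-first, as a list of 2D nested lists.

```python
def conv2d_valid(image, kernel):
    """2D convolution restricted to positions where the kernel fully overlaps the image."""
    k = len(kernel) // 2
    return [
        [
            sum(
                kernel[u + k][v + k] * image[i - u][j - v]
                for u in range(-k, k + 1)
                for v in range(-k, k + 1)
            )
            for j in range(k, len(image[0]) - k)
        ]
        for i in range(k, len(image) - k)
    ]


def conv_volume(volume, filters):
    """Convolve each channel with its own filter slice, then sum over the depth."""
    if len(volume) != len(filters):
        raise ValueError("the filter depth must match the input depth")
    maps = [conv2d_valid(channel, f) for channel, f in zip(volume, filters)]
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[i][j] for m in maps) for j in range(cols)] for i in range(rows)]


# A 3x3 "RGB" volume whose channels are filled with 1, 2 and 3.
rgb = [[[value] * 3 for _ in range(3)] for value in (1, 2, 3)]
# A 3x3x3 filter made of ones.
ones = [[[1] * 3 for _ in range(3)] for _ in range(3)]

print(conv_volume(rgb, ones))  # [[54]]
```

The result is a single 2D activation map (depth 1), regardless of the input depth.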

Convolutional AutoEncoders

Convolutional AutoEncoders (CAEs) approach the filter definition task from a different perspective: instead of manually engineering convolutional filters, we let the model learn the filters that minimize the reconstruction error. These filters can then be used in any other computer vision task.

CAEs are the state-of-the-art tools for unsupervised learning of convolutional filters. Once these filters have been learned, they can be applied to any input in order to extract features. These features can then be used for any task that requires a compact representation of the input, like classification.

CAEs are a type of Convolutional Neural Network (CNN). The main difference between the common interpretation of a CNN and a CAE is that the former is trained end-to-end to learn filters and combine features with the aim of classifying its input; in fact, CNNs are usually referred to as supervised learning algorithms. The latter, instead, is trained only to learn filters able to extract features that can be used to reconstruct the input.

CAEs, due to their convolutional nature, scale well to realistic-sized, high-dimensional images, because the number of parameters required to produce an activation map is always the same, no matter what the size of the input is. Therefore, CAEs are general-purpose feature extractors, unlike AEs, which completely ignore the 2D image structure. In fact, in AEs the image must be unrolled into a single vector and the network must be built following the constraint on the number of inputs. In other words, AEs introduce redundancy in the parameters, forcing each feature to be global (i.e., to span the entire visual field) [1], while CAEs do not.


A single convolutional filter can’t learn to extract the great variety of patterns that compose an image. For this reason, every convolutional layer is composed of $n$ (a hyper-parameter) convolutional filters, each with depth $D$, where $D$ is the input depth.

Therefore, a convolution between an input volume $I = \{I_1, \dots, I_D\}$ and a set of $n$ convolutional filters $\{F^{(1)}_1, \dots, F^{(1)}_n\}$, each with depth $D$, produces a set of $n$ activation maps or, equivalently, a volume of activation maps with depth $n$:

$$ O_m(i, j) = \sum_{d=1}^{D} \sum_{u=-k}^{k} \sum_{v=-k}^{k} F^{(1)}_{m_d}(u, v)\, I_d(i - u, j - v), \quad m = 1, \dots, n $$

To improve the generalization capabilities of the network, every convolution is wrapped by a non-linear function (activation) $a$; in this way the training procedure can learn to represent the input by combining non-linear functions:

$$ z_m = a(I * F^{(1)}_m + b^{(1)}_m), \quad m = 1, \dots, n $$

where $b^{(1)}_m$ is the bias (a single real value for every activation map) for the $m$-th feature map. The term $z_m$ has been introduced to use the same variable name as the latent variable used in AEs.

The produced activation maps are the encoding of the input $I$ in a low-dimensional space. This dimension is not the spatial extent (width and height) of the activation maps, but the number of parameters used to build every feature map $z_m$; in other words, the number of parameters to learn.
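The encoding step described above (several filters, one bias per feature map, a non-linearity on top) can be sketched as below. The text does not say which activation is used, so ReLU is an assumption; `conv_layer` and the other helper names are hypothetical, and volumes are stored channels-first as nested lists.

```python
def conv2d_valid(image, kernel):
    """2D convolution restricted to positions where the kernel fully overlaps the image."""
    k = len(kernel) // 2
    return [
        [
            sum(
                kernel[u + k][v + k] * image[i - u][j - v]
                for u in range(-k, k + 1)
                for v in range(-k, k + 1)
            )
            for j in range(k, len(image[0]) - k)
        ]
        for i in range(k, len(image) - k)
    ]


def conv_volume(volume, filters):
    """Sum over the depth of the per-channel 2D convolutions."""
    maps = [conv2d_valid(channel, f) for channel, f in zip(volume, filters)]
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[i][j] for m in maps) for j in range(cols)] for i in range(rows)]


def relu(x):
    """Assumed activation; any other non-linearity could be passed instead."""
    return max(0, x)


def conv_layer(volume, filter_bank, biases, activation=relu):
    """Apply every filter volume in the bank: one feature map per filter."""
    feature_maps = []
    for filters, bias in zip(filter_bank, biases):
        summed = conv_volume(volume, filters)
        feature_maps.append([[activation(v + bias) for v in row] for row in summed])
    return feature_maps


# Single-channel 3x3 input and two filters (n = 2), each with depth 1.
volume = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
filter_bank = [
    [[[1] * 3 for _ in range(3)]],   # sums the window: 45
    [[[-1] * 3 for _ in range(3)]],  # negated sum: -45
]
biases = [1, 1]

z = conv_layer(volume, filter_bank, biases)
print(len(z))  # 2 feature maps
print(z)       # [[[46]], [[0]]]
```

The output depth equals the number of filters, not the input depth.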

Since our objective is to reconstruct the input from the produced feature maps, we want a decoding operation capable of doing this. Convolutional autoencoders are fully convolutional networks, therefore the decoding operation is again a convolution.

A careful reader could argue that a convolution reduces the output’s spatial extent, and therefore it is not possible to use a convolution to reconstruct a volume with the same spatial extent as its input.

This is true, but we can work around the issue using input padding. If we pad the input volume $I$ with zeros, then the result of the first convolution can have a spatial extent greater than that of $I$, and thus the second convolution can produce a volume with the original spatial extent of $I$.

Therefore, the amount of zeros we want to pad the input with is such that:

$$ \dim(I) = \dim(\text{decode}(\text{encode}(I))) $$

It follows from equation 1 that we want to pad $I$ by $2(2k + 1) - 2$ zeros ($(2k + 1) - 1$ per side); in that way, with unit stride, the encoding convolution will produce a volume with width and height equal to:

$$ O_w = O_h = (I_w + 2(2k + 1) - 2) - (2k + 1) + 1 = I_w + (2k + 1) - 1 $$
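The padding arithmetic can be checked with a few lines of code. This sketch assumes unit stride, the same filter side for the encoding and decoding convolutions, and a padding of (filter side - 1) zeros per side; the function names are hypothetical.

```python
def valid_output_side(input_side, filter_side):
    """Output side of a stride-1 convolution without padding."""
    return input_side - filter_side + 1


def padded_sizes(input_side, filter_side):
    """Sizes seen along the encode/decode pipeline when padding filter_side - 1 per side."""
    pad_per_side = filter_side - 1
    padded = input_side + 2 * pad_per_side
    encoded = valid_output_side(padded, filter_side)
    decoded = valid_output_side(encoded, filter_side)
    return pad_per_side, padded, encoded, decoded


print(padded_sizes(28, 5))  # (4, 36, 32, 28)
print(padded_sizes(4, 3))   # (2, 8, 6, 4)
```

In both cases the last value equals the original input side, which is the condition the padding was chosen to satisfy.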


The produced feature maps (latent representations) will be used as input to the decoder, in order to reconstruct the input image from this reduced representation.

The hyper-parameters of the decoding convolution are fixed by the encoding architecture. In fact:

  • Filters volume $F^{(2)}$ with dimensions $(2k + 1, 2k + 1, n)$, because the convolution should span every feature map and produce a volume with the same spatial extent as $I$
  • Number of filters to learn: $D$, because we’re interested in reconstructing the input image, which has depth $D$

Therefore, the reconstructed image $\tilde{I}$ is the result of the convolution between the volume of feature maps $Z = \{z_1, \dots, z_n\}$ and this convolutional filters volume $F^{(2)}$:

$$ \tilde{I} = a(Z * F^{(2)} + b^{(2)}) $$

Padding $I$ with the previously found amount of zeros leads the decoding convolution to produce a volume with dimensions:

$$ O_w = O_h = (I_w + (2k + 1) - 1) - (2k + 1) + 1 = I_w = I_h $$

Since the input’s dimensions equal the output’s dimensions, it is possible to relate input and output using any loss function, like the MSE (written here up to a constant factor):

$$ \mathcal{L}(I, \tilde{I}) = \frac{1}{2} \lVert I - \tilde{I} \rVert_2^2 $$
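Putting the pieces together, the sketch below runs an untrained forward pass (pad, encode, decode) and measures the reconstruction error. It is only a shape and loss illustration under several assumptions: random, untrained weights, ReLU as the activation, zero biases, unit stride, and the MSE computed as a plain mean of squared differences (other normalizations differ by a constant factor). All function names are hypothetical.

```python
import random


def conv2d_valid(image, kernel):
    """2D convolution restricted to positions where the kernel fully overlaps the image."""
    k = len(kernel) // 2
    return [
        [
            sum(
                kernel[u + k][v + k] * image[i - u][j - v]
                for u in range(-k, k + 1)
                for v in range(-k, k + 1)
            )
            for j in range(k, len(image[0]) - k)
        ]
        for i in range(k, len(image) - k)
    ]


def zero_pad(channel, pad):
    """Surround a 2D channel with `pad` rows and columns of zeros."""
    width = len(channel[0]) + 2 * pad
    body = [[0.0] * pad + list(row) + [0.0] * pad for row in channel]
    top = [[0.0] * width for _ in range(pad)]
    bottom = [[0.0] * width for _ in range(pad)]
    return top + body + bottom


def conv_layer(volume, filter_bank, biases):
    """One feature map per filter volume: sum over depth, add bias, apply ReLU (assumed)."""
    out = []
    for filters, bias in zip(filter_bank, biases):
        maps = [conv2d_valid(c, f) for c, f in zip(volume, filters)]
        out.append([
            [max(0.0, sum(m[i][j] for m in maps) + bias) for j in range(len(maps[0][0]))]
            for i in range(len(maps[0]))
        ])
    return out


def mse(a, b):
    """Mean of the squared element-wise differences between two volumes."""
    diffs = [
        (x - y) ** 2
        for ca, cb in zip(a, b)
        for ra, rb in zip(ca, cb)
        for x, y in zip(ra, rb)
    ]
    return sum(diffs) / len(diffs)


def random_bank(rng, n_filters, depth, side):
    """Untrained filters with small random values, standing in for learned ones."""
    return [
        [[[rng.uniform(-0.1, 0.1) for _ in range(side)] for _ in range(side)]
         for _ in range(depth)]
        for _ in range(n_filters)
    ]


rng = random.Random(0)
D, n, side = 3, 4, 3  # input depth, number of feature maps, filter side
image = [[[rng.random() for _ in range(6)] for _ in range(6)] for _ in range(D)]

padded = [zero_pad(channel, side - 1) for channel in image]
Z = conv_layer(padded, random_bank(rng, n, D, side), [0.0] * n)
reconstruction = conv_layer(Z, random_bank(rng, D, n, side), [0.0] * D)
loss = mse(image, reconstruction)

print(len(Z), len(Z[0]), len(Z[0][0]))  # 4 8 8
print(len(reconstruction), len(reconstruction[0]), len(reconstruction[0][0]))  # 3 6 6
```

Training would consist of adjusting both filter banks and the biases to minimize `loss`; that part is out of scope for this sketch.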


In the next post, I’ll show how to build, train and use a convolutional autoencoder with TensorFlow. Later posts will guide the reader deeper into the deep learning architectures for CAEs: stacked convolutional autoencoders.
