Introduction to Autoencoders

Autoencoders are neural network models whose aim is to reproduce their input: this is trivial if the network has no constraints, but if the network is constrained the learning process becomes more interesting.

Simple Autoencoder

The simplest AutoEncoder (AE) has an MLP-like (Multi Layer Perceptron) structure:

  • One input layer
  • One hidden layer
  • One output layer

The main difference between the AE and the MLP is that the former’s output layer has the same cardinality as its input layer, whilst the latter’s output layer cardinality is the number of classes the perceptron should be capable of classifying. Moreover, the AE belongs to the family of unsupervised learning algorithms because it learns to represent unlabeled data; the MLP instead requires labeled data to be trained on.

The most important part of an AE is its hidden layer. In fact, this layer learns to encode the input whilst the output layer learns to decode it.

The hidden layer plays a fundamental role because a common application of AEs is dimensionality reduction: after the training phase, the output layer is usually thrown away and the AE is used to build a new dataset of samples with lower dimensions.


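  • $x \in \mathbb{R}^{d_1}$: input vector
  • $W^{(i)} \in \mathbb{R}^{d_i \times d_{i-1}}$: parameters matrix of the $i$-th layer, in charge of projecting a $d_{i-1}$-D input in a $d_i$-D space
  • $b^{(i)} \in \mathbb{R}^{d_i}$: bias vector of the $i$-th layer
  • $f^{(i)}$: activation function applied to every neuron of the layer $i$.

The simplest AE can therefore be summarized as:

$$
a^{(2)} = f^{(2)}\left(W^{(2)} x + b^{(2)}\right), \qquad \hat{x} = f^{(3)}\left(W^{(3)} a^{(2)} + b^{(3)}\right)
$$

where layer 1 is the input layer, $a^{(2)}$ is the output of the hidden layer (layer 2) and $\hat{x}$ is the output of the output layer (layer 3).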

The AE is the model that tries to minimize the reconstruction error between the input value $x$ and the reconstructed value $\hat{x}$: the training process is, therefore, the minimization of a distance (like the $L_2$ distance) or some other chosen metric.
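The encode, decode and measure steps described above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: it assumes sigmoid activations for both layers, the squared $L_2$ distance as reconstruction error, and randomly initialised (untrained) parameters. The names `init_autoencoder`, `forward` and `squared_l2` are hypothetical.

```python
import math
import random


def sigmoid(v):
    """Logistic activation applied to a single pre-activation value."""
    return 1.0 / (1.0 + math.exp(-v))


def dense(weights, bias, inputs, activation):
    """One fully connected layer: activation(W * inputs + bias)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, bias)
    ]


def squared_l2(x, x_hat):
    """Squared L2 distance between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def init_autoencoder(n_inputs, n_hidden, seed=0):
    """Random, untrained parameters for the hidden and output layers."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_enc": matrix(n_hidden, n_inputs), "b_enc": [0.0] * n_hidden,
        "W_dec": matrix(n_inputs, n_hidden), "b_dec": [0.0] * n_inputs,
    }


def forward(params, x):
    """Encode x with the hidden layer, then decode it with the output layer."""
    code = dense(params["W_enc"], params["b_enc"], x, sigmoid)
    x_hat = dense(params["W_dec"], params["b_dec"], code, sigmoid)
    return code, x_hat


params = init_autoencoder(n_inputs=4, n_hidden=2)
x = [0.0, 0.25, 0.5, 1.0]
code, x_hat = forward(params, x)
loss = squared_l2(x, x_hat)  # the quantity that training would minimize
```

The output layer has as many units as the input (4), while the hidden code has only 2 values: that code is what would be kept for dimensionality reduction once the output layer is discarded.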

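From the information theory perspective, the loss can be seen as the negative log-likelihood of the input $x$ given its reconstruction $\hat{x}$:

$$
\mathcal{L}(x, \hat{x}) \propto -\log p(x \mid \hat{x})
$$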

Constraints are everything

It can be easily noticed that if the number of units in the hidden layer is greater than or equal to the number of input units, the network will learn the identity function easily.
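A tiny worked example makes the point concrete. The sketch below assumes linear activations (no non-linearity), as many hidden units as input units, identity weight matrices and zero biases. These choices are made only for illustration and the helper names are hypothetical.

```python
def linear_layer(weights, bias, inputs):
    """Fully connected layer with a linear (identity) activation."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, bias)
    ]


def identity_matrix(n):
    """n x n matrix with ones on the diagonal and zeros elsewhere."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


n_units = 3  # same number of input and hidden units
W_enc, b_enc = identity_matrix(n_units), [0.0] * n_units
W_dec, b_dec = identity_matrix(n_units), [0.0] * n_units

x = [0.2, 0.7, 0.1]
hidden = linear_layer(W_enc, b_enc, x)       # the input is copied as-is
x_hat = linear_layer(W_dec, b_dec, hidden)   # and copied again to the output
reconstruction_error = sum((a - b) ** 2 for a, b in zip(x, x_hat))
```

The reconstruction error is zero, yet the hidden layer is just a copy of the input: nothing about the structure of the data has been learned.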

Learning the identity function alone is useless because the network will never learn to extract useful features but, instead, it will simply pass forward the input data to the output layer. In order to learn useful features, constraints must be added to the network: in this way the neurons cannot simply learn the identity function and instead learn to project the inputs into a lower-dimensional space.

AEs can extract the so-called latent variables from the input data. These variables are an aggregated view of the input data, which can be used to manage and understand the input data more easily.

Dimensionality reduction using AEs can lead to better results than classical linear dimensionality reduction techniques such as PCA, thanks to the non-linearities and the type of constraints applied.

From the information theory point of view, the constraints force the network to learn a lossy representation of the input data.

Number of hidden units

Figure: constraining the number of hidden units to be less than the number of input units.

The simplest idea is to constrain the number of hidden units to be less than the number of input units.

In this way, the identity function can’t be learned; a compressed representation of the data is learned instead.


Sparse AE architecture

Do not constrain the number of hidden units; make the network learn to turn off the right ones.

To impose a sparsity constraint means to force some hidden units to be inactive most of the time. The inactivity of a neuron depends on the chosen activation function $f$.

If the activation function is the sigmoid, $f(t) = \frac{1}{1 + e^{-t}}$, being close to $0$ means to be inactive.

Sparsity is a desired characteristic for an autoencoder, because it allows using a greater number of hidden units (even more than the input ones) and therefore gives the network the ability to learn different connections and extract different features (w.r.t. the features extracted with only the constraint on the number of hidden units). Moreover, sparsity can be used together with the constraint on the number of hidden units: the combination of these hyper-parameters has to be tuned to achieve better performance.

Sparsity can be forced by adding a term to the loss function. Since we want most of the neurons in the hidden layer to be inactive, we can extract the average activation value for every neuron of the hidden layer (averaged over the whole training set $\{x^{(1)}, \dots, x^{(m)}\}$) and force it to be under a threshold.

If the threshold value is low, the neurons will adapt their parameters (and thus their outputs) to respect this constraint. To do this, most of them will be inactive. Let:

$$
\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} a_j^{(2)}\left(x^{(i)}\right)
$$

be the average activation value of the $j$-th neuron of the hidden (number 2) layer, where $a_j^{(2)}(x^{(i)})$ is the output of that neuron for the $i$-th training sample.

Defining the sparsity parameter $\rho$ as the desired average activation value for every hidden neuron and initializing it to a value close to zero, we can enforce the sparsity:

$$
\hat{\rho}_j \le \rho \quad \forall j
$$

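To achieve this, various penalization terms can be added to the loss function. The most common one is based on the Kullback-Leibler (KL) divergence: a measure of how much one probability distribution differs from another.

In this case, the KL divergence is measured $d_2$ times (once per hidden neuron, with $d_2$ the number of hidden units) between a Bernoulli random variable with mean $\rho$, the desired average activation, and a Bernoulli random variable with mean $\hat{\rho}_j$, the measured average activation, used to model a single neuron:

$$
\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j}
$$

In short, the divergence increases as the difference between $\rho$ and $\hat{\rho}_j$ increases. Therefore, this is a good candidate to be added as a penalization term to the loss: in that way, the learning process via gradient descent will try to minimize the divergence while minimizing the reconstruction error.

The final form of the loss thus is:

$$
\mathcal{L}_{\text{sparse}} = \mathcal{L}(x, \hat{x}) + \beta \sum_{j=1}^{d_2} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)
$$

where $\mathcal{L}(x, \hat{x})$ is the reconstruction error and $\beta$ weights the sparsity penalty.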
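The sparsity penalty can be sketched in a few lines of Python. This is only an illustration under some assumptions: hidden activations lie in $(0, 1)$ (as with a sigmoid), the penalty weight is called `beta`, and the function names are hypothetical. The small `eps` clamp is an implementation detail added here to avoid taking the logarithm of zero.

```python
import math


def mean_activations(hidden_activations):
    """Average activation of each hidden unit over the dataset.

    `hidden_activations` has one row per sample and one column per hidden unit.
    """
    n_samples = len(hidden_activations)
    n_hidden = len(hidden_activations[0])
    return [
        sum(row[j] for row in hidden_activations) / n_samples
        for j in range(n_hidden)
    ]


def bernoulli_kl(rho, rho_hat, eps=1e-12):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat), 0 < rho < 1."""
    rho_hat = min(max(rho_hat, eps), 1.0 - eps)  # keep the logs finite
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))


def sparsity_penalty(hidden_activations, rho):
    """Sum of the KL terms, one per hidden unit."""
    return sum(bernoulli_kl(rho, r) for r in mean_activations(hidden_activations))


def sparse_loss(reconstruction_error, hidden_activations, rho=0.05, beta=1.0):
    """Reconstruction error plus the weighted sparsity penalty."""
    return reconstruction_error + beta * sparsity_penalty(hidden_activations, rho)


# Two samples, two hidden units: the first unit is mostly off, the second mostly on.
activations = [[0.05, 0.9], [0.05, 0.8]]
penalty = sparsity_penalty(activations, rho=0.05)
total = sparse_loss(0.3, activations, rho=0.05, beta=2.0)
```

In this toy case the first unit already matches the target average activation and contributes nothing, so the whole penalty comes from the second unit, which is active on both samples.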

Adding noise: DAE

Instead of forcing the hidden layer to learn to extract features from the input data $x$, we can train the AE to reconstruct the input from a corrupted version of it, $\tilde{x}$.

This allows the AE to discover more robust features w.r.t. the ones that could be learned from the original uncorrupted data.

This kind of constraint gave rise to the Denoising AutoEncoder (DAE) field.

DAEs have the same architecture as the AEs presented above, with just two differences:

  1. Input corruption step
  2. Loss function

Input corruption

For every element $x$ in the training set, a corrupted version $\tilde{x}$ must be generated:

$$
\tilde{x} = \mathrm{noise}(x)
$$

$\mathrm{noise}$ can be any function that corrupts the input data, and it depends on the data type. For example, in the computer vision field, Gaussian noise or salt-and-pepper noise can be added.
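The two corruption functions mentioned above could look like the following sketch. It assumes the input is a flat list of values in $[0, 1]$ (for example, normalised pixel intensities), which is why the Gaussian version clips its result. The function names and default parameters are hypothetical.

```python
import random


def add_gaussian_noise(x, std, rng):
    """Add zero-mean Gaussian noise to every value, clipping to [0, 1]."""
    return [min(1.0, max(0.0, v + rng.gauss(0.0, std))) for v in x]


def salt_and_pepper(x, prob, rng):
    """With probability `prob`, replace a value with 0 (pepper) or 1 (salt)."""
    corrupted = []
    for v in x:
        if rng.random() < prob:
            corrupted.append(rng.choice([0.0, 1.0]))
        else:
            corrupted.append(v)
    return corrupted


rng = random.Random(42)  # seeded only to make the example repeatable
x = [0.1, 0.4, 0.6, 0.9]
x_gauss = add_gaussian_noise(x, std=0.1, rng=rng)
x_sp = salt_and_pepper(x, prob=0.25, rng=rng)
```

In a training loop the corruption would typically be re-sampled every time a sample is presented, so that the network never sees exactly the same corrupted input twice.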

Moreover, AEs have been used to introduce Dropout. Dropout is a simple technique to prevent neural networks from overfitting and it’s closely related to input corruption: in fact, it consists of randomly dropping out neurons (setting their output to $0$) with a specified probability.

Dropout is, therefore, an input corruption method and can be applied to improve the quality of the learned features of the hidden layer.
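Seen as a corruption function, dropout takes only a few lines. The sketch below follows the description in the text literally (outputs are set to zero with a given probability). Practical implementations usually also rescale the surviving values by `1 / (1 - drop_prob)`, which is omitted here for simplicity. The function name is hypothetical.

```python
import random


def dropout(values, drop_prob, rng):
    """Set each value to 0 with probability `drop_prob`, keep it otherwise."""
    return [0.0 if rng.random() < drop_prob else v for v in values]


rng = random.Random(0)  # seeded only to make the example repeatable
x = [0.3, 0.8, 0.5, 0.1, 0.9]
x_dropped = dropout(x, drop_prob=0.5, rng=rng)
```

Every element of `x_dropped` is either the original value or zero, which is exactly a corrupted version of `x` in the sense used by the DAE.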

Loss function

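Any loss function can be used; the only thing to pay attention to is which values are being compared.

In fact, we’re interested in minimizing the reconstruction error between the original input $x$ and the decoded output obtained from its corrupted version $\tilde{x}$.

Therefore the minimization process should be:

$$
\min \mathcal{L}\left(x, \hat{x}(\tilde{x})\right)
$$

where $\hat{x}(\tilde{x})$ is the output of the AE when it is fed the corrupted input $\tilde{x}$.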
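The following sketch shows which two values enter the loss. To keep it self-contained it uses deliberately artificial stand-ins: `zero_first` is a toy corruption function and `copy_input` is a placeholder "autoencoder" that simply returns what it receives. Both names, and the use of the squared $L_2$ distance, are assumptions made for illustration.

```python
def squared_l2(a, b):
    """Squared L2 distance between two equally sized vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def denoising_loss(reconstruct, corrupt, x):
    """Feed the corrupted input to the model, but compare against the ORIGINAL x."""
    x_tilde = corrupt(x)
    x_hat = reconstruct(x_tilde)
    return squared_l2(x, x_hat)


def zero_first(x):
    """Toy corruption: wipe out the first component."""
    return [0.0] + list(x[1:])


def copy_input(x_tilde):
    """Placeholder model that has learned nothing but the identity."""
    return list(x_tilde)


x = [0.5, 0.2, 0.9]
dae_loss = denoising_loss(copy_input, zero_first, x)

# Comparing against the corrupted input instead would reward plain copying.
wrong_loss = squared_l2(zero_first(x), copy_input(zero_first(x)))
```

With the original input as the target, the copying model is penalised for the component it failed to restore; with the corrupted input as the target its error would be zero, so nothing would push it to learn how to undo the corruption.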

Common usage

As previously mentioned, autoencoders are commonly used to reduce the inputs’ dimensionality and not to decode the encoded value. The extracted compressed representation can be used for:

  • Statistical analysis on the data distribution. For this purpose, it is possible to visualize a 2D representation using t-SNE
  • Classification: classifiers often work better with lower-dimensional data
  • One-Class Classification (OCC): if the AE has been trained on a single class only, it’s possible to find a threshold for the reconstruction error such that elements with a reconstruction error greater than this threshold differ from the learned model. It can be seen as an outlier detection procedure.
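The thresholding step of the OCC use case can be sketched as follows. The text does not say how the threshold should be chosen, so taking a high quantile of the reconstruction errors measured on the training class is an assumption made here for illustration. The error values are made up and the function names are hypothetical.

```python
def fit_threshold(train_errors, quantile=0.95):
    """Pick the threshold as a high quantile of the training reconstruction errors."""
    ordered = sorted(train_errors)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]


def is_outlier(error, threshold):
    """Flag a sample whose reconstruction error exceeds the threshold."""
    return error > threshold


# Made-up reconstruction errors of samples from the single training class.
train_errors = [0.010, 0.020, 0.015, 0.030, 0.012,
                0.025, 0.018, 0.022, 0.011, 0.050]
threshold = fit_threshold(train_errors)

familiar = is_outlier(0.02, threshold)   # error typical of the learned class
unusual = is_outlier(0.20, threshold)    # error far above anything seen in training
```

A lower quantile makes the detector stricter (more samples are flagged), so the choice is a trade-off between false alarms and missed outliers.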

Moreover, if the decoder has not been thrown away, it can be used to perform data denoising: a trained DAE has the ability to remove (some kinds of) noise from the input, so it can be used to preprocess a noisy source of data.


Autoencoders have been successfully applied to different tasks, and different architectures have been defined.

In the next posts, I’ll introduce stacked autoencoders and convolutional autoencoders and I’ll mix them together to build a stacked convolutional autoencoder in Tensorflow.
