Introduction to Autoencoders


Autoencoders are neural network models whose aim is to reproduce their input: this is trivial if the network has no constraints, but when the network is constrained the learning process becomes more interesting.

Simple Autoencoder

The simplest AutoEncoder (AE) has an MLP-like (Multi-Layer Perceptron) structure:

  • One input layer
  • One hidden layer
  • One output layer

The main difference between the AE and the MLP is that the former’s output layer has the same cardinality as its input layer, whilst the latter’s output layer cardinality is the number of classes the perceptron should be able to classify. Moreover, the AE belongs to the family of unsupervised learning algorithms, because it learns to represent unlabeled data; the MLP, instead, requires labeled data to be trained on.

The most important part of an AE is its hidden layer. In fact, this layer learns to encode the input whilst the output layer learns to decode it.

The hidden layer plays a fundamental role because a common application of AEs is dimensionality reduction: after the training phase, the output layer is usually thrown away and the AE is used to build a new dataset of lower-dimensional samples.

Formally:

  • \(x \in [0,1]^{d}\) input vector
  • \(W_i \in \mathbb{R}^{I_{d_i} \times O_{d_i}}\) parameter matrix of the \(i\)-th layer, in charge of projecting an \(I_{d_i}\)-dimensional input into an \(O_{d_i}\)-dimensional space
  • \(b_i \in \mathbb{R}^{O_{d_i}}\) bias vector
  • \(a(h)\) activation function, applied to every neuron of layer \(h\).

The simplest AE can therefore be summarized as:

\[\begin{align} z &= a(xW_1 + b_1) \\ x' &= a(zW_2 + b_2) \end{align}\]

The AE is the model that tries to minimize the reconstruction error between the input value \(x\) and the reconstructed value \(x'\): the training process is, therefore, the minimization of an \(L_p\) distance (like the \(L_2\)) or some other chosen metric.

\[\min \mathcal{L} = \min E(x, x') \stackrel{e.g.}{=} \min || x - x' ||_p\]

From the information theory perspective, the loss can be seen as:

\[\min \mathcal{L} = \min || x - \text{decode}(\text{encode}(x)) ||_p\]
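To make this concrete, here is a minimal sketch of such an autoencoder in TensorFlow/Keras (the framework used later in this series). The layer sizes and the sigmoid activations are illustrative assumptions, the latter chosen because \(x \in [0,1]^{d}\):

```python
import tensorflow as tf

d = 784            # input dimensionality, e.g. flattened 28x28 images
hidden_units = 64  # dimension of the hidden (encoding) layer

# Encoder: z = a(x W1 + b1)
encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(hidden_units, activation="sigmoid",
                          input_shape=(d,)),
])

# Decoder: x' = a(z W2 + b2)
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(d, activation="sigmoid",
                          input_shape=(hidden_units,)),
])

autoencoder = tf.keras.Sequential([encoder, decoder])

# Training minimizes the L2 reconstruction error E(x, x') = ||x - x'||_2:
# note that the inputs are also the targets.
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=128)
```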

Constraints are everything

It can easily be noticed that if the number of units in the hidden layer is greater than or equal to the number of input units, the network can trivially learn the identity function.

Learning the identity function alone is useless, because the network never learns to extract useful features: it simply passes the input data forward to the output layer. In order to learn useful features, constraints must be added to the network: in this way, no neuron can learn the identity function; instead, the network learns to project the inputs into a lower-dimensional space.

AEs can extract the so-called latent variables from the input data. These variables are an aggregated view of the input that can be used to manage and understand it more easily.

Dimensionality reduction using AEs can lead to better results than classical dimensionality reduction techniques such as PCA, thanks to the non-linearities and the type of constraints applied.

From the information theory point of view, the constraints force the network to learn a lossy representation of the input data.

Number of hidden units

[Figure: constraining the number of hidden units to be less than the number of input units]

The simplest idea is to constrain the number of hidden units to be less than the number of input units.

\[|z| < d\]

In this way, the identity function can’t be learned; instead, a compressed representation of the data should be.
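This is exactly what the sketch above does (hidden_units = 64 while d = 784): once trained, the decoder can be discarded and the encoder alone maps every sample to its compressed representation (x_train is an assumed dataset):

```python
# Continuing the sketch above: the decoder is thrown away and the
# encoder is used to build a lower-dimensional version of the dataset.
codes = encoder.predict(x_train)  # shape: (num_samples, hidden_units)
```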

Sparsity

[Figure: sparse AE architecture]

Do not constrain the number of hidden units; make the network learn to turn off the right ones.

Imposing a sparsity constraint means forcing some hidden units to be inactive most of the time. The inactivity of a neuron \(i\) depends on the chosen activation function \(a_i(h)\).

If the activation function is the \(tanh\)

\[a_i(x) = \text{tanh}(x)\]

being close to \(-1\) means being inactive.

Sparsity is a desired characteristic for an autoencoder because it allows the use of a greater number of hidden units (even more than the input units), and therefore gives the network the ability to learn different connections and extract different features (w.r.t. the features extracted with the only constraint being the number of hidden units). Moreover, sparsity can be used together with the constraint on the number of hidden units: an optimization process over the combination of these hyper-parameters is required to achieve better performance.

Sparsity can be enforced by adding a term to the loss function. Since we want most of the neurons in the hidden layer to be inactive, we can compute the average activation value of every neuron of the hidden layer (averaged over the whole training set \(TS\)) and force it to stay under a threshold.

If the threshold value is low, the neurons will adapt their parameters (and thus their outputs) to respect this constraint; to do so, most of them will be inactive. Let:

\[\hat{\rho_{j}} = \frac{1}{|TS|} \sum_{i=1}^{|TS|}{a^{(2)}_j(x_i)}\]

be the average activation value of the \(j\)-th neuron of the hidden layer (layer 2, hence the superscript).

Defining the sparsity parameter \(\rho\) as the desired average activation value for every hidden neuron and setting it to a value close to zero, we can enforce the sparsity:

\[\hat{\rho}_{j} \le \rho \quad \forall j\]

To achieve this, various penalization terms can be added to the loss function. The most common one is based on the Kullback-Leibler (KL) divergence: a measure of how different two distributions are.

\[\sum_{j=1}^{O_{d_2}} \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j} = \sum_{j=1}^{O_{d_2}} KL(\rho||\hat{\rho_j})\]

In this case, the KL divergence is measured \(O_{d_2}\) times between a Bernoulli random variable with mean \(\rho\) and a Bernoulli random variable with mean \(\hat{\rho_j}\) used to model a single neuron.

In short, the \(KL\) divergence increases as the difference between \(\rho\) and \(\hat{\rho_j}\) increases. It is, therefore, a good candidate to be added as a penalization term to the loss: in this way, the learning process via gradient descent will try to minimize the divergence while minimizing the reconstruction error.

The final form of the loss thus is:

\[\min \mathcal{L} = \min \left( E(x, x') + \sum_{j=1}^{O_{d_2}} KL(\rho||\hat{\rho_j}) \right)\]
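A possible implementation of this penalized loss in TensorFlow/Keras is sketched below, using the batch mean as an estimate of \(\hat{\rho}_j\) and a sigmoid hidden layer (so that inactivity means being close to \(0\)); the values of \(\rho\), the penalty weight, and the layer sizes are illustrative assumptions:

```python
import tensorflow as tf

rho = 0.05   # desired average activation, close to zero
beta = 1.0   # weight of the sparsity penalty (illustrative)

def kl_sparsity(activations):
    """Sum over the hidden units of KL(rho || rho_hat_j).

    rho_hat_j is estimated as the mean activation of unit j over the
    batch; activations are assumed in (0, 1), e.g. sigmoid outputs.
    """
    rho_hat = tf.reduce_mean(activations, axis=0)
    rho_hat = tf.clip_by_value(rho_hat, 1e-7, 1.0 - 1e-7)  # numerical safety
    kl = rho * tf.math.log(rho / rho_hat) + (1.0 - rho) * tf.math.log(
        (1.0 - rho) / (1.0 - rho_hat))
    return beta * tf.reduce_sum(kl)

d, hidden_units = 784, 1024  # more hidden units than inputs is now allowed
sparse_ae = tf.keras.Sequential([
    tf.keras.layers.Dense(hidden_units, activation="sigmoid",
                          input_shape=(d,),
                          activity_regularizer=kl_sparsity),
    tf.keras.layers.Dense(d, activation="sigmoid"),
])
# Keras adds the regularization term to the reconstruction loss automatically.
sparse_ae.compile(optimizer="adam", loss="mse")
```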

Adding noise: DAE

Instead of forcing the hidden layer to learn to extract features from the input data \(x\), we can train the AE to reconstruct the input from a corrupted version of it, \(\tilde{x}\).

This allows the AE to discover more robust features w.r.t. the ones that could be learned from the original uncorrupted data.

This kind of constraint gave rise to the Denoising AutoEncoder (DAE) family.

DAEs have the same architecture as the AEs presented above, with just two differences:

  1. Input corruption step
  2. Loss function

Input corruption

For every element \(x\) in the training set, a corrupted version must be generated.

\[\tilde{x} = \text{corrupt}(x)\]

\(\text{corrupt}\) can be any function that corrupts the input data, and it depends on the data type. For example, in the computer vision field, Gaussian noise or salt-and-pepper noise can be added.
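For instance, here is a sketch of two common corruption functions for inputs in \([0,1]^{d}\) (the noise levels are arbitrary assumptions):

```python
import tensorflow as tf

def corrupt_gaussian(x, stddev=0.1):
    """Additive Gaussian noise, clipped to keep the values in [0, 1]."""
    noisy = x + tf.random.normal(tf.shape(x), stddev=stddev)
    return tf.clip_by_value(noisy, 0.0, 1.0)

def corrupt_masking(x, drop_prob=0.3):
    """Masking noise: randomly set a fraction of the input values to 0."""
    keep = tf.cast(tf.random.uniform(tf.shape(x)) >= drop_prob, x.dtype)
    return x * keep
```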

Moreover, AEs have been used to introduce Dropout. Dropout is a simple technique to prevent neural networks from overfitting, and it’s highly related to input corruption: in fact, it consists of randomly dropping out neurons (setting their output to \(0\)) with a specified probability.

Dropout is, therefore, an input corruption method and can be applied to improve the quality of the features learned by the hidden layer.
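In Keras terms, this corruption can be sketched by placing a Dropout layer right after the input (the rate and the layer sizes are illustrative; note that Keras Dropout also rescales the kept values by \(1/(1-\text{rate})\) during training):

```python
import tensorflow as tf

d, hidden_units = 784, 64
dae = tf.keras.Sequential([
    # input corruption: randomly zeroes 30% of the inputs (training only)
    tf.keras.layers.Dropout(0.3, input_shape=(d,)),
    tf.keras.layers.Dense(hidden_units, activation="sigmoid"),
    tf.keras.layers.Dense(d, activation="sigmoid"),
])
```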

Loss function

Any loss function can be used; the only thing to pay attention to is the relation between the values involved in the reconstruction error.

In fact, we’re interested in minimizing the reconstruction error between the original input and the output decoded from its corrupted version.

Therefore the minimization process should be:

\[\min \mathcal{L}(x, \tilde{x}') = \min || x - \text{decode}(\text{encode}(\text{corrupt}(x))) ||_p\]
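A sketch of this training process with tf.data, reusing the corrupt_gaussian function and the dae model defined in the sketches above (x_train is an assumed dataset): the pipeline pairs a freshly corrupted copy of each batch with the clean batch as the target, resampling the noise at every epoch.

```python
dataset = (
    tf.data.Dataset.from_tensor_slices(x_train)
    .shuffle(10_000)
    .batch(128)
    # (corrupted input, clean target) pairs, resampled at every epoch
    .map(lambda x: (corrupt_gaussian(x), x))
)

dae.compile(optimizer="adam", loss="mse")
dae.fit(dataset, epochs=10)
```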

Common usage

As previously mentioned, autoencoders are commonly used to reduce the inputs’ dimensionality rather than to decode the encoded values. The extracted compressed representation can be used for:

  • Statistical analysis of the data distribution. For this purpose, it is possible to visualize a 2D representation using t-SNE
  • Classification: classifiers work better with lower-dimensional data
  • One-Class Classification (OCC): if the AE has been trained on a single class only, it’s possible to find a threshold for the reconstruction error such that elements with a reconstruction error greater than this threshold expose differences from the learned model. It can somehow be seen as an outlier detection procedure.
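The one-class classification idea in the last point can be sketched as follows (the 95th-percentile threshold is an arbitrary choice, and autoencoder / x_normal are assumed to come from the earlier sketches):

```python
import numpy as np

# reconstruction errors on the single class the AE was trained on
reconstructions = autoencoder.predict(x_normal)
errors = np.mean(np.square(x_normal - reconstructions), axis=1)

# pick a threshold on the reconstruction error, e.g. its 95th percentile
threshold = np.percentile(errors, 95)

def is_outlier(x):
    """Flag the samples whose reconstruction error exceeds the threshold."""
    err = np.mean(np.square(x - autoencoder.predict(x)), axis=1)
    return err > threshold
```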

Moreover, if the decoder has not been thrown away, it can be used to perform data denoising: if the trained AE is a DAE, it has the ability to remove (some kinds of) noise from the input; therefore, a DAE can be used for data preprocessing on a noisy data source.

Next…

Autoencoders have been successfully applied to different tasks, and different architectures have been defined.

In the next posts, I’ll introduce stacked autoencoders and convolutional autoencoders, and I’ll mix them together to build a stacked convolutional autoencoder in TensorFlow.

Don't want to miss the next article? Do you want to be kept updated?
Subscribe to the newsletter!
