Suppose you are a digital artist with a massive collection of sketches, drawings, and digital paintings. Over time, your hard drive is filling up, and finding specific art pieces among thousands of similar-looking files is becoming a nightmare. You have several goals:
- Compress your art files: You want to reduce the size of your digital art without losing the essential details, saving valuable storage space.
- Organize and categorize: You need a way to represent each art piece in a much smaller, more manageable format that still captures its unique style and content. This would make it easier to group similar artworks together, perhaps by theme or artistic technique.
- Clean up noisy scans: Some of your older sketches were scanned from physical paper and contain smudges, dust, or other imperfections (noise). You'd love a way to automatically clean these up.
This is precisely where autoencoders can help. Think of an autoencoder as a specialized digital assistant that can learn to 'summarize' your art. It takes a large, complex image (like a high-resolution painting) and learns to convert it into a much smaller, condensed representation. Crucially, it also learns how to reconstruct the original image from this condensed summary. The magic lies in its ability to do this without you explicitly telling it what to summarize – it learns the most important visual features on its own.
For you, an autoencoder could be a game-changer. It could process all your digital art, learn their underlying visual patterns, and create compact 'digital fingerprints' for each. These fingerprints would be small enough to store efficiently and rich enough to allow for accurate reconstruction of the original art, or at least a cleaner version of it. This process of learning efficient data representations is known as dimensionality reduction or feature learning, and it's one of the core applications of autoencoders.
What is an Autoencoder?
At its heart, an autoencoder is a type of artificial neural network used for learning efficient data codings (representations) in an unsupervised manner. The core idea is to train a neural network to copy its input to its output. This might sound trivial, but the key is that the network is forced to learn a compressed representation of the input data. This compression is achieved by restricting the number of nodes in the middle layer, often called the bottleneck or latent space.
An autoencoder consists of two main parts:
- Encoder: This part of the network takes the input data and transforms it into a lower-dimensional representation. Think of it as the component that learns to summarize or encode the essential features of the input. The output of the encoder is often referred to as the latent space or bottleneck features.
- Decoder: This part takes the compressed latent representation and reconstructs the original input data from it. Its job is to reverse the encoding process, trying to recreate the input as accurately as possible.
The entire autoencoder is trained to minimize the difference between its input and its output. This difference is measured by a reconstruction loss function (e.g., Mean Squared Error for continuous data or Binary Cross-Entropy for binary data). By minimizing this loss, the autoencoder learns to capture the most significant patterns and features in the data, effectively performing dimensionality reduction.
Here’s a simple diagram to visualize the autoencoder architecture:
In the diagram, you can see how the input data passes through the encoder, gets compressed into a latent space, and then is expanded back into a reconstructed output by the decoder. The goal is for the reconstructed output to be as close as possible to the original input.
Building Our First Autoencoder
To truly understand how autoencoders work, there’s no better way than to build one yourself. For our first autoencoder, we need a simple, easy-to-understand dataset. The FashionMNIST dataset is a perfect choice for beginners. It consists of 70,000 grayscale images of fashion products (like shirts, trousers, sneakers, etc.), with 60,000 images for training and 10,000 for testing. Each image is a small 28x28 pixel square. Its simplicity, combined with patterns slightly more complex than the standard MNIST digits, makes it ideal for demonstrating core concepts.
Let’s start by setting up our PyTorch environment and loading the FashionMNIST dataset. If you don’t have PyTorch installed, you can follow the instructions on the official PyTorch website. We’ll also need `torchvision` to easily access the FashionMNIST dataset and `matplotlib` for visualization.
First, let’s import the necessary libraries:
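A typical set of imports for this walkthrough might look like the following (the exact list depends on which parts of the tutorial you run):

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
import numpy as np
```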
Next, we need to define some transformations for our images. For grayscale images like FashionMNIST, pixel values typically range from 0 to 255. We’ll transform them into PyTorch tensors, which will automatically scale them to a range between 0 and 1.
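A minimal transform pipeline for this dataset only needs the tensor conversion; anything beyond `ToTensor()` is optional here:

```python
transform = transforms.Compose([
    transforms.ToTensor(),  # PIL image -> FloatTensor, scaled to [0.0, 1.0]
])
```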
A quick note on `transforms.ToTensor()`: This transformation automatically converts a PIL Image or NumPy `ndarray` to a `FloatTensor` and scales the image pixel intensity values from the range [0, 255] to [0.0, 1.0]. This is a standard practice for preparing image data for neural networks.
Now, let’s load the FashionMNIST training and testing datasets and create data loaders. Data loaders are essential in PyTorch for efficiently feeding data to our model in batches during training.
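One way to set this up (the `root="./data"` download path is an arbitrary choice):

```python
batch_size = 64

train_dataset = datasets.FashionMNIST(root="./data", train=True, download=True, transform=transform)
test_dataset = datasets.FashionMNIST(root="./data", train=False, download=True, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
```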
We’ve set a `batch_size` of 64, meaning our model will process 64 images at a time. `shuffle=True` for the training loader ensures that the data is randomly ordered in each epoch, which helps the model generalize better. For the test loader, `shuffle=False` is typical, as we usually want to evaluate performance consistently.
Let’s visualize a few images from our dataset to make sure everything is loaded correctly. Since `ToTensor()` scales pixel values to [0, 1], we can display them directly.
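A quick sanity-check plot along these lines should work (showing eight images is an arbitrary choice):

```python
images, labels = next(iter(train_loader))

fig, axes = plt.subplots(1, 8, figsize=(12, 2))
for i, ax in enumerate(axes):
    ax.imshow(images[i].squeeze(), cmap="gray")  # drop the channel dimension for plotting
    ax.set_title(train_dataset.classes[labels[i]], fontsize=8)
    ax.axis("off")
plt.show()
```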
The Encoder
Now that our FashionMNIST data is ready, let’s dive into the first half of our autoencoder: the Encoder. The encoder’s job is to take the input data (a 28x28 pixel image) and compress it into a smaller, more meaningful representation. This compressed representation is often called the latent space or bottleneck because it acts like a narrow passage through which all the information must flow. By forcing the network to squeeze the information through this bottleneck, it learns to keep only the most important features of the input.
Think back to our digital artist scenario. The encoder is like the intelligent system that reads a complex digital painting and distills its core essence into a concise summary. This summary is much smaller than the original painting but still contains enough information to understand what the artwork is about.
While an autoencoder can be built entirely from fully connected layers, we will use Convolutional Neural Networks (CNNs) for our encoder. CNNs are particularly well-suited for image data because they can automatically learn spatial hierarchies of features. Instead of flattening the image into a 1D vector, convolutional layers process the image directly, preserving its 2D structure. This allows them to detect patterns like edges, textures, and shapes more effectively.
Here’s a conceptual diagram of a convolutional encoder:
In this diagram, you can see the input image being transformed through several convolutional layers, each reducing the spatial dimensions (width and height) and increasing the number of feature maps, until it becomes the compact latent representation.
Let’s define our encoder using PyTorch’s `nn.Module`. We’ll use `nn.Conv2d` layers, which are designed for 2D image data.
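Here is a sketch of an encoder that matches the breakdown below (1 → 16 → 32 → 64 channels, 3x3 kernels, stride 2, padding only in the first two layers); the exact hyperparameters are one reasonable choice rather than the only one:

```python
class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # (1, 28, 28) -> (16, 14, 14)
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # (16, 14, 14) -> (32, 7, 7)
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2),             # (32, 7, 7) -> (64, 3, 3)
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)
```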
Let’s break down this convolutional encoder code step by step:
- Progressive Downsampling: Each convolutional layer reduces the image size while increasing channel depth (from 1 → 16 → 32 → 64 channels)
- Kernel Choices: All layers use 3x3 kernels, a common choice for capturing local patterns
- Stride=2: This aggressive stride quickly reduces spatial dimensions
- Padding: Careful padding in first two layers (padding=1) maintains proper dimensions during reduction
- ReLU Activation: Provides non-linearity after each convolution
The final output is a compressed 64-channel, 3×3 latent representation of the input image. This balances dimensionality reduction with feature preservation.
The Decoder
Now that our encoder has successfully compressed the input image into a compact latent representation (a `(batch_size, 64, 3, 3)` tensor), the next step is to reconstruct the original input. This is the job of the Decoder. The decoder takes this small, summarized latent code and attempts to expand it back into something that closely resembles the original 28x28 pixel image. It’s essentially the inverse operation of the encoder.
Continuing with our digital artist analogy, if the encoder was the intelligent system summarizing a painting, the decoder is the system that takes that summary and tries to recreate the original artwork. It won’t be a perfect copy, but it should be close enough to convey the original visual information.
Just like the encoder used convolutional layers to reduce dimensionality, our decoder will use transposed convolutional layers (often called deconvolutional layers or fractionally-strided convolutions) to increase dimensionality. These layers are designed to reverse the operations of standard convolutional layers, effectively upsampling the feature maps and expanding them back to the original image size.
Here’s a conceptual diagram of a convolutional decoder:
In this diagram, the compressed latent representation is fed into the decoder, which then transforms it through several transposed convolutional layers, each increasing its size until it outputs a reconstructed image that has the same dimensions as the original input.
Let’s define our decoder using PyTorch’s `nn.Module`:
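A decoder sketch that mirrors the encoder above, consistent with the breakdown that follows (again, one reasonable set of hyperparameters):

```python
class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=3, stride=2),                               # (64, 3, 3) -> (32, 7, 7)
            nn.ReLU(),
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2, padding=1, output_padding=1),  # (32, 7, 7) -> (16, 14, 14)
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2, padding=1, output_padding=1),   # (16, 14, 14) -> (1, 28, 28)
            nn.Sigmoid(),  # keep pixel values in [0, 1]
        )

    def forward(self, x):
        return self.net(x)
```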
Let’s break down the `Decoder` code:
- Mirror Architecture: Channel counts decrease (64 → 32 → 16 → 1) as spatial dimensions increase (3x3 → 7x7 → 14x14 → 28x28), reversing the encoder's structure.
- Transposed Convolutions: `nn.ConvTranspose2d` upsamples by "undoing" strided convolutions from the encoder. Critical parameters:
  - `stride=2`: Doubles spatial resolution
  - `output_padding=1`: Corrects dimension mismatches when upsampling odd-sized inputs (e.g., 7x7 → 14x14)
- Output Layer: A final `Sigmoid` ensures pixel values match the input range (0 to 1), standard for grayscale images.
- Dimension Flow: The decoder precisely reconstructs the original 28x28 image from the 64-channel latent space.
Combining Encoder and Decoder: The Autoencoder Model
Now that we have both the `Encoder` and `Decoder` components, we can combine them to form our complete `Autoencoder` model. This model will take an input image, encode it into a latent representation, and then decode that representation back into a reconstructed image.
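A minimal way to wire the two parts together:

```python
class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = Encoder()
        self.decoder = Decoder()

    def forward(self, x):
        encoded = self.encoder(x)        # compress the input into the latent representation
        decoded = self.decoder(encoded)  # reconstruct the image from the latent code
        return decoded
```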
In this `Autoencoder` class:
- `self.encoder = Encoder()`: We instantiate our `Encoder`.
- `self.decoder = Decoder()`: We instantiate our `Decoder`.
- `def forward(self, x):`: The forward pass of the autoencoder simply calls the encoder on the input `x` to get `encoded` (the latent representation), and then calls the decoder on `encoded` to get `decoded` (the output image). The `decoded` image is what we will compare to the original input to calculate our loss.
We now have a complete convolutional autoencoder model ready for training! The next step is to train this model and see how well it can reconstruct images.
Training Our Autoencoder and Visualizing Results
With our `Autoencoder` model defined, the next crucial step is to train it. Training a neural network involves iteratively adjusting its internal parameters (weights and biases) so that it performs its task better. For an autoencoder, the task is to reconstruct its input as accurately as possible. This means we need to minimize the reconstruction loss – the difference between the original input and the autoencoder’s output.
The Training Process: Loss Function and Optimizer
To train our autoencoder, we need two key components:
- Loss Function: This function quantifies how well our autoencoder is performing. A smaller loss value means better reconstruction. For image reconstruction tasks where pixel values are continuous (like our FashionMNIST images, which are scaled to [0, 1]), Mean Squared Error (MSE) is a common choice. MSE calculates the average of the squared differences between the predicted and actual pixel values.
- Optimizer: The optimizer is the algorithm that adjusts the model’s weights based on the gradients of the loss function. It tells the model how to change its parameters to reduce the loss. A popular and effective optimizer is Adam, known for its adaptive learning rate capabilities, which often leads to faster convergence.
Let’s set up our training loop. We’ll train the model for a certain number of epochs. An epoch means one complete pass through the entire training dataset.
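A training loop along the lines described below might look like this (a sketch, assuming the model and data loaders defined above):

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = Autoencoder().to(device)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

num_epochs = 10
for epoch in range(num_epochs):
    running_loss = 0.0
    for images, _ in train_loader:        # labels are not needed for reconstruction
        images = images.to(device)

        outputs = model(images)            # forward pass: encode, then decode
        loss = criterion(outputs, images)  # compare the reconstruction to the input itself

        optimizer.zero_grad()              # clear old gradients
        loss.backward()                    # compute new gradients
        optimizer.step()                   # update the parameters

        running_loss += loss.item()

    print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {running_loss / len(train_loader):.4f}")
```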
Let’s break down the training code:
- Model, Loss, and Optimizer Initialization: We instantiate our `Autoencoder` model, move it to the selected `device`, define `nn.MSELoss()` as the `criterion`, and choose `optim.Adam` as our `optimizer` with a learning rate of `0.001`.
- `num_epochs = 10`: We decide to train our model for 10 epochs. You might need more or fewer epochs depending on the complexity of the task and the dataset.
- Training Loop:
  - Forward Pass: Image → Encoder → Decoder → Reconstruction
  - Backward Pass:
    - `zero_grad()`: Clears old gradients
    - `backward()`: Computes new gradients
    - `step()`: Updates model parameters
  - Loss Tracking: Running average per epoch monitors progress
Visualizing Reconstruction Results
After training, let’s visualize how well our autoencoder can reconstruct images. We’ll take a few images from the test set, pass them through our trained autoencoder, and then display the original and reconstructed versions side-by-side.
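A side-by-side comparison could be plotted like this (eight test images is an arbitrary choice):

```python
model.eval()
with torch.no_grad():
    images, _ = next(iter(test_loader))
    images = images.to(device)
    reconstructed = model(images)

n = 8
fig, axes = plt.subplots(2, n, figsize=(12, 3))
for i in range(n):
    axes[0, i].imshow(images[i].cpu().squeeze(), cmap="gray")
    axes[0, i].set_title("Original", fontsize=8)
    axes[0, i].axis("off")
    axes[1, i].imshow(reconstructed[i].cpu().squeeze(), cmap="gray")
    axes[1, i].set_title("Reconstructed", fontsize=8)
    axes[1, i].axis("off")
plt.show()
```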
As you can see, the reconstructed images, while not perfectly identical to the originals, capture the essential features of the fashion items quite well. This demonstrates that our autoencoder has successfully learned a compressed representation and how to reconstruct from it!
The Problem with Traditional Autoencoders
Suppose you, as a digital artist, not only want to compress and clean your art files but also generate new, similar art pieces based on the patterns you have learned. For instance, if you have a collection of sketches of different clothing items, you might want to generate new, plausible sketches of clothing that weren't part of your original collection.
Here’s where traditional autoencoders fall short:
- Lack of Generative Capability: While the decoder can reconstruct data from a latent code, there’s no guarantee that a randomly sampled point from the latent space will produce a meaningful or realistic output. The latent space learned by a standard autoencoder is often discontinuous or sparse. This means that if you pick a random point in this latent space, the decoder might produce gibberish or something that doesn't resemble any of the training data. Imagine our latent space as a map. A traditional autoencoder might place all the meaningful data points in clusters, but the areas between these clusters are empty or contain no learned information. If you randomly sample a point from these empty regions, the decoder has no idea what to do with it.
- No Smooth Transitions: Because the latent space can be discontinuous, moving smoothly from one point to another in the latent space doesn't necessarily result in a smooth, meaningful transition in the reconstructed output. For example, if you try to interpolate between the latent codes of a 'T-shirt' and a 'Trouser' from the FashionMNIST dataset, a standard autoencoder might produce a series of unrecognizable images in between, rather than a gradual morphing from one clothing item to another.
To illustrate this, let's visualize the latent space learned by our autoencoder using t-SNE (t-Distributed Stochastic Neighbor Embedding). t-SNE is a dimensionality reduction technique that is particularly well-suited for the visualization of high-dimensional datasets. It helps us to see how our autoencoder has organized the different types of fashion items in its compressed latent space.
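One way to produce this plot, assuming scikit-learn is available: we flatten each (64, 3, 3) latent code into a 576-dimensional vector and subsample the test set so that t-SNE runs in a reasonable time.

```python
from sklearn.manifold import TSNE

model.eval()
latent_list, label_list = [], []
with torch.no_grad():
    for images, labels in test_loader:
        encoded = model.encoder(images.to(device))
        latent_list.append(encoded.view(encoded.size(0), -1).cpu())  # flatten (64, 3, 3) -> 576
        label_list.append(labels)

latents = torch.cat(latent_list).numpy()[:2000]       # subsample for speed
latent_labels = torch.cat(label_list).numpy()[:2000]

latents_2d = TSNE(n_components=2, random_state=42).fit_transform(latents)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(latents_2d[:, 0], latents_2d[:, 1], c=latent_labels, cmap="tab10", s=3)
plt.colorbar(scatter, label="class")
plt.title("t-SNE of the autoencoder latent space")
plt.show()
```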
Here you can see a scatter plot where each point represents an image from the test set, positioned according to its learned latent representation. Different colors represent different classes of fashion items. You'll likely observe that the autoencoder has grouped similar items together, forming distinct clusters. However, you might also notice that the space between these clusters is largely empty. Now, what happens if you were to pick a random point from these empty regions and pass it through the decoder? Let's try this:
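As a sketch, we can draw a random vector that lies inside the overall range of the learned latent codes but, almost certainly, not inside any cluster:

```python
# Hypothetical example: one random 576-dimensional vector within the range of the real latent codes
rng = np.random.default_rng(0)
random_latent = rng.uniform(latents.min(), latents.max(), size=(1, latents.shape[1]))
```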
Now let's check if the random value is in the empty space or not.
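t-SNE has no `transform()` method for new points, so one workaround is to re-fit it with the random vector appended and highlight where it lands:

```python
combined = np.vstack([latents, random_latent])
combined_2d = TSNE(n_components=2, random_state=42).fit_transform(combined)

plt.figure(figsize=(8, 6))
plt.scatter(combined_2d[:-1, 0], combined_2d[:-1, 1], c=latent_labels, cmap="tab10", s=3)
plt.scatter(combined_2d[-1, 0], combined_2d[-1, 1], c="red", marker="*", s=300, label="random point")
plt.legend()
plt.title("Random latent point relative to the learned clusters")
plt.show()
```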
As we can see, our random value is in the empty space, so let's try to decode it.
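Decoding it only requires reshaping the vector back into the (64, 3, 3) shape the decoder expects:

```python
with torch.no_grad():
    z = torch.tensor(random_latent, dtype=torch.float32, device=device).view(1, 64, 3, 3)
    generated = model.decoder(z)

plt.imshow(generated.cpu().squeeze(), cmap="gray")
plt.title("Decoded random latent point")
plt.axis("off")
plt.show()
```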
This is the main drawback of a traditional autoencoder: if we decode any point from the empty regions of the latent space, the output will likely be meaningless or distorted, because the autoencoder has never learned how to map those regions to realistic images.
Variational Autoencoders (VAEs)
This is where Variational Autoencoders (VAEs) come into play. VAEs are a powerful extension of autoencoders that address these limitations by introducing a probabilistic approach to the latent space. Instead of mapping an input directly to a single point in the latent space, a VAE maps it to a distribution over the latent space. This subtle change makes the latent space continuous and structured, enabling true generative capabilities.
Here's how VAEs solve the problems:
- Structured and Continuous Latent Space: In a VAE, the encoder doesn't output a single latent vector z, but rather two vectors: μ (the mean) and σ (the standard deviation) of a Gaussian distribution. The actual latent vector z is then sampled from this distribution. This forces the encoder to learn a latent space where similar inputs are mapped to similar distributions, and the distributions are encouraged to overlap. This continuity ensures that any point sampled from the latent space will likely correspond to a meaningful data point. The VAE introduces an additional term to its loss function, called the Kullback-Leibler (KL) Divergence loss; its closed form is shown after this list. This loss term encourages the learned latent distributions to be close to a standard normal distribution (mean 0, variance 1). By doing so, it regularizes the latent space, preventing the model from scattering latent representations too widely and ensuring that the latent space is well-behaved and continuous. This is crucial for generating new data, as we can simply sample from a standard normal distribution and pass it through the decoder to get a new, plausible output.
- Generative Capability: Because the latent space is continuous and well-structured, we can now sample any point from a simple distribution (like a standard normal distribution) in the latent space, pass it through the decoder, and expect to get a meaningful and realistic output. This is the key to VAEs' generative power. We can generate entirely new images that were never seen by the model during training, but which share the characteristics of the training data.
- Smooth Interpolation: The continuity of the latent space also means that interpolating between two latent vectors will result in a smooth and meaningful transition in the reconstructed output. For example, if you interpolate between the latent codes of a 'T-shirt' and a 'Trouser' in a VAE, you would likely see a gradual morphing from one clothing item to the other, passing through plausible intermediate forms.
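For reference, when the encoder's distribution is a diagonal Gaussian and the prior is a standard normal, the KL term has a simple closed form (summing over the d latent dimensions); this is exactly the expression we will compute in the loss function later:

$$
D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \sigma^2)\,\|\,\mathcal{N}(0, 1)\big) = -\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)
$$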
Implementing a Variational Autoencoder
Let’s implement a variational autoencoder (VAE) to see how it overcomes the limitations we just discussed. We’ll stick with our FashionMNIST dataset and use PyTorch, mirroring the structure of our earlier autoencoder example. By the end, you’ll see how a VAE not only reconstructs images but also generates new ones.
Understanding the VAE Architecture
Think of a VAE as an upgraded version of our autoencoder. Like before, it has an encoder and a decoder, but with a twist:
- Encoder: Instead of producing a single point in the latent space, it outputs two things: a mean μ and a standard deviation σ. These define a probability distribution (usually a Gaussian) for each input image.
- Sampling: We then sample a latent vector z from this distribution using a clever trick called reparameterization (more on that in a moment).
- Decoder: Just like before, the decoder takes this z and reconstructs the image, or even generates a new one if we sample z ourselves!
The big difference is that VAEs ensure the latent space is smooth and continuous, so sampling new points actually makes sense. To achieve this, we tweak the training process with a special loss function, but let’s build it first and then explain how it all comes together.
Building the VAE
Since our autoencoder used convolutional layers, we’ll build a convolutional VAE for consistency. The encoder will output both μ and log σ² (the log-variance, which is more stable to compute than σ directly), and we’ll sample z before passing it to the decoder. Here’s the code:
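Below is a sketch of a convolutional VAE consistent with the description above: the same convolutional stack as our encoder, two linear heads for μ and log σ², a reparameterization step, and a decoder that mirrors the one we already built (the layer sizes are carried over from the earlier model):

```python
class VAE(nn.Module):
    def __init__(self, latent_dim=16):
        super().__init__()
        # Encoder: same conv stack as before, ending in a flattened (64 * 3 * 3) vector
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2),
            nn.ReLU(),
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(64 * 3 * 3, latent_dim)      # mean of q(z|x)
        self.fc_logvar = nn.Linear(64 * 3 * 3, latent_dim)  # log-variance of q(z|x)

        # Decoder: project z back up to the conv feature map, then upsample
        self.fc_decode = nn.Linear(latent_dim, 64 * 3 * 3)
        self.decoder_net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=3, stride=2),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2, padding=1, output_padding=1),
            nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)  # sigma = exp(0.5 * log(sigma^2))
        eps = torch.randn_like(std)    # noise from a standard normal
        return mu + eps * std          # z = mu + sigma * eps

    def decode(self, z):
        h = self.fc_decode(z).view(-1, 64, 3, 3)
        return self.decoder_net(h)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar
```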
Let’s break this down:
- Encoder: Similar to our autoencoder, but after flattening, we split into two linear layers: one for μ and one for log σ². We set `latent_dim=16` to keep the latent space manageable.
- Reparameterization Trick: This is the magic that lets us train with sampling. Instead of sampling z directly (which isn’t differentiable), we compute z = μ + σ · ε, where ε is random noise from a standard normal distribution. This keeps the gradient flow intact.
- Decoder: Almost identical to our autoencoder’s decoder, turning the sampled z back into an image.
Defining the Loss Function
Training a VAE requires balancing two goals: reconstructing the input well and keeping the latent space structured. Our loss has two parts:
- Reconstruction Loss: Same as before, Mean Squared Error (MSE) between the input and output images.
- KL Divergence Loss: This encourages the latent distributions to resemble a standard normal distribution to ensure continuity.
Here’s the loss function:
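A sketch of that combined loss, using the closed-form KL term shown earlier:

```python
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar):
    # Reconstruction term: pixel-wise MSE, summed over the batch
    recon_loss = F.mse_loss(recon_x, x, reduction="sum")
    # KL term: closed form of KL(N(mu, sigma^2) || N(0, 1))
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl_loss
```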
- Reconstruction Loss: Measures pixel-wise differences, summed across the batch.
- KL Divergence: A mathematical term that regularizes the latent space, keeping μ near 0 and σ near 1.
Training the VAE
Let’s train it, reusing our FashionMNIST data loaders from earlier:
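A training loop that mirrors the earlier one (a sketch, reusing the same `device` and data loaders):

```python
vae = VAE(latent_dim=16).to(device)
vae_optimizer = optim.Adam(vae.parameters(), lr=0.001)

num_epochs = 10
for epoch in range(num_epochs):
    vae.train()
    running_loss = 0.0
    for images, _ in train_loader:
        images = images.to(device)

        recon, mu, logvar = vae(images)
        loss = vae_loss(recon, images, mu, logvar)

        vae_optimizer.zero_grad()
        loss.backward()
        vae_optimizer.step()

        running_loss += loss.item()

    # The loss is summed per batch, so we normalize by the dataset size for readability
    print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {running_loss / len(train_loader.dataset):.4f}")
```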
This is similar to our autoencoder training, but now we’re optimizing both reconstruction and KL divergence.
Visualizing the Results
Let’s see what our VAE can do! We’ll check three things: reconstruction, generation, and interpolation.
Reconstructing Test Images
First, let’s reconstruct some test images, just like we did with the autoencoder:
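A reconstruction check analogous to the earlier one:

```python
vae.eval()
with torch.no_grad():
    images, _ = next(iter(test_loader))
    images = images.to(device)
    recon, _, _ = vae(images)

fig, axes = plt.subplots(2, 8, figsize=(12, 3))
for i in range(8):
    axes[0, i].imshow(images[i].cpu().squeeze(), cmap="gray")
    axes[0, i].set_title("Original", fontsize=8)
    axes[0, i].axis("off")
    axes[1, i].imshow(recon[i].cpu().squeeze(), cmap="gray")
    axes[1, i].set_title("VAE", fontsize=8)
    axes[1, i].axis("off")
plt.show()
```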
The reconstructions should look pretty good, similar to our autoencoder’s but perhaps slightly softer, owing to the probabilistic nature of VAEs.
Generating New Images
Now, the fun part: generating new fashion items! We’ll sample random points from a standard normal distribution and decode them:
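Sampling z from a standard normal distribution and decoding it might look like this (showing 16 samples is an arbitrary choice):

```python
vae.eval()
with torch.no_grad():
    z = torch.randn(16, 16).to(device)  # 16 samples, each with latent_dim = 16
    samples = vae.decode(z)

fig, axes = plt.subplots(2, 8, figsize=(12, 3))
for i, ax in enumerate(axes.flat):
    ax.imshow(samples[i].cpu().squeeze(), cmap="gray")
    ax.axis("off")
plt.show()
```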
These are brand-new images, not seen in the training set! They might not be perfect, but they should resemble FashionMNIST items, which is proof that our latent space is meaningful.
Interpolating Between Images
Let’s see the smooth transitions VAEs promise. We’ll take two test images, encode them, and interpolate between their latent vectors:
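A sketch of latent-space interpolation, using the means of the two encoded images as endpoints:

```python
vae.eval()
with torch.no_grad():
    images, _ = next(iter(test_loader))
    x1, x2 = images[0:1].to(device), images[1:2].to(device)
    mu1, _ = vae.encode(x1)
    mu2, _ = vae.encode(x2)

    steps = 10
    fig, axes = plt.subplots(1, steps, figsize=(15, 2))
    for i, alpha in enumerate(torch.linspace(0, 1, steps)):
        z = (1 - alpha) * mu1 + alpha * mu2  # linear interpolation in latent space
        img = vae.decode(z)
        axes[i].imshow(img.cpu().squeeze(), cmap="gray")
        axes[i].axis("off")
    plt.show()
```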
You should see a gradual shift from one item to another, like an ankle boot morphing into a shirt.
Finally, let's again sample a random point in latent space and decode it with both our standard autoencoder and our variational autoencoder, so we can compare their outputs directly.
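A sketch of that comparison (note that the two models use differently shaped latents: the autoencoder expects a (64, 3, 3) code, while the VAE expects a 16-dimensional vector):

```python
with torch.no_grad():
    ae_z = torch.randn(1, 64, 3, 3).to(device)  # random code for the standard autoencoder
    vae_z = torch.randn(1, 16).to(device)       # random code for the VAE (latent_dim = 16)

    ae_img = model.decoder(ae_z)
    vae_img = vae.decode(vae_z)

fig, axes = plt.subplots(1, 2, figsize=(6, 3))
axes[0].imshow(ae_img.cpu().squeeze(), cmap="gray")
axes[0].set_title("Autoencoder")
axes[0].axis("off")
axes[1].imshow(vae_img.cpu().squeeze(), cmap="gray")
axes[1].set_title("VAE")
axes[1].axis("off")
plt.show()
```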
Wrapping Up
Our VAE has shown us something incredible: by making the latent space probabilistic and regularized, we’ve turned our autoencoder into a generative tool. Unlike the standard autoencoder, which struggled with random sampling and transitions, the VAE can:
- Generate new data: Sample from the latent space to create fresh fashion designs.
- Interpolate smoothly: Blend between existing items in a meaningful way.
VAEs are a stepping stone in deep learning and a foundation of generative modeling. They’ve paved the way for even more advanced techniques, like Generative Adversarial Networks (GANs), which we might explore in a future post!
Want to dig deeper? Check out the original VAE paper by Kingma and Welling (2013), or experiment with tweaking the `latent_dim` or adding more layers to see how it affects the results.
Full Source Code: Link