Through the first six chapters of this book, we have explored how to use unsupervised learning to perform dimensionality reduction and clustering, and we’ve built applications to detect anomalies and segment groups based on similarity in user behavior/profile.
However, unsupervised learning is capable of a lot more. One area in which unsupervised learning excels is feature extraction, a method to generate a new feature representation from an original set of features; the new feature representation is called a learned representation and is used to improve performance on supervised learning problems.
In other words, feature extraction is an unsupervised means to a supervised end.
Autoencoders are one such form of feature extraction; autoencoders use a feedforward, non-recurrent neural network to perform representation learning. Representation learning is a core part of an entire branch of machine learning involving neural networks.
In an autoencoder, each layer of the neural network learns a representation of the original features, and subsequent layers build on the representations learned by the preceding layers. Layer by layer, the autoencoder learns increasingly complicated representations from simpler ones, building what is known as a hierarchy of concepts that become more and more abstract.
The output layer is the final newly learned representation of the original features. This learned representation can then be used as an input into a supervised learning model with the objective of improving the generalization error.
But, before we get too far ahead of ourselves, let’s begin by introducing neural networks and the Python frameworks to work with them such as TensorFlow and Keras.
At their very essence, neural networks perform representation learning, where each layer of the neural network learns a representation from the previous layer. By building more nuanced and detailed representations layer by layer, neural networks can accomplish pretty amazing tasks such as computer vision, speech recognition, and machine translation.
Neural networks come in two forms - shallow and deep. Shallow networks have few layers, and deep networks have many layers. Deep learning gets its name from the deep (many-layered) neural networks it deploys.
Shallow neural networks are not particularly powerful since the degree of representation learning is limited by the small number of layers. Deep learning, on the other hand, is incredibly powerful and is currently one of the hottest areas in machine learning.
To be clear, shallow and deep learning using neural networks are just a part of the entire machine learning ecosystem. The major difference between machine learning using neural networks and classical machine learning is that much of the feature representation is performed automatically in the neural network case and is hand-designed in classical machine learning.
Neural networks have an input layer, one or many hidden layers, and an output layer. The number of the hidden layers defines just how deep the neural network is. You can view these hidden layers as intermediate computations; these hidden layers together allow the entire neural network to perform complex function approximation.
Each layer has a certain number of nodes (also known as neurons or units) that comprise the layer. The nodes of each layer are then connected to the nodes of the next layer. During the training process, the neural network determines the optimal weights to assign to each node.
In addition to adding more layers, we can add more nodes to a neural network to increase the capacity of the neural network to model complex relationships.
The values at these nodes are fed into an activation function, which determines the value that the current layer passes on to the next layer of the neural network. Common activation functions include the linear, sigmoid, hyperbolic tangent, and rectified linear unit (ReLU) functions.
The final activation function is usually the softmax function, which outputs the probability that the input observation falls into each of the possible classes. This is pretty typical for classification-type problems.
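To make these activation functions concrete, here is a minimal sketch in Python using NumPy - with a hypothetical vector of pre-activation values - of the ReLU and softmax functions:

import numpy as np

def relu(z):
    # ReLU passes positive values through and zeroes out negatives
    return np.maximum(0, z)

def softmax(z):
    # Subtract the max for numerical stability, then normalize the
    # exponentials so the outputs sum to one (class probabilities)
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

z = np.array([2.0, -1.0, 0.5])  # hypothetical pre-activation values
print(relu(z))     # [2.  0.  0.5]
print(softmax(z))  # three class probabilities that sum to 1.0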
Neural networks may also have bias nodes; these bias nodes are always constant values and, unlike the normal nodes, are not connected to the previous layer. Rather, they allow the output of an activation function to be shifted lower or higher.
With the hidden layers - including the nodes, bias nodes, and activation functions - the neural network is trying to learn the right function approximation to map the input layer to the output layer.
In the case of supervised learning problems, this is pretty straightforward. The input layer represents the features that are fed into the neural network, and the output layer represents the label assigned to each observation.
During the training process, the neural network determines which weights across the neural network help minimize the error between its predicted label for each observation and the true label.
In unsupervised learning problems, the neural network learns representations of the input layer via the various hidden layers but is not guided by labels.
Neural networks are incredibly powerful and are capable of modeling complex non-linear relationships to a degree that classical machine learning algorithms struggle with.
In general, this is a great characteristic of neural networks, but it comes with a potential risk. Because neural networks can model such complex non-linear relationships, they are also much more prone to overfitting, which we should be aware of and address when designing machine learning applications using neural networks.1
Although there are multiple types of neural networks, such as recurrent neural networks, in which data can flow in any direction (used for speech recognition and machine translation), and convolutional neural networks (used for computer vision), we will focus on the more straightforward feedforward neural network, in which data moves in just one direction - forward.
There is also a lot of hyperparameter optimization to perform to get neural networks to perform well - including the choice of the cost function, the algorithm to minimize the loss, the type of initialization for the starting weights, the number of iterations to use in training the neural network (i.e., number of epochs), the number of observations to feed in before each weight update (i.e., batch size), and the step size by which to move the weights (i.e., learning rate) during the training process.
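To make terms like epochs, batch size, and learning rate concrete, here is a minimal sketch - plain Python with NumPy, on hypothetical data and a single-weight linear model rather than a real neural network - of where each one enters the training loop:

import numpy as np

np.random.seed(42)
X = np.random.rand(100, 1)                 # hypothetical inputs
y = 3 * X + 0.1 * np.random.randn(100, 1)  # hypothetical labels

w = np.zeros((1, 1))  # starting weight (initialization)
learning_rate = 0.1   # step size for each weight update
batch_size = 10       # observations fed in before each weight update
num_epochs = 50       # full passes through the training data

for epoch in range(num_epochs):
    for i in range(0, len(X), batch_size):
        X_batch = X[i:i + batch_size]
        y_batch = y[i:i + batch_size]
        error = X_batch.dot(w) - y_batch                    # predicted minus true
        gradient = 2.0 * X_batch.T.dot(error) / batch_size  # gradient of the MSE cost
        w -= learning_rate * gradient                       # move the weight downhill

print(w)  # approaches 3, the coefficient used to generate the data

The same knobs - cost function, optimizer, initialization, epochs, batch size, and learning rate - appear when training real neural networks, as we will see shortly.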
Before we introduce autoencoders, let’s explore TensorFlow, which is the primary library we will use to build neural networks. TensorFlow is an open-source software library for high performance numerical computation and was initially developed by the Google Brain team for internal Google use. In November 2015, it was released as open-source software.2
TensorFlow is available across many operating systems including Linux, macOS, Windows, Android, and iOS and can run on multiple CPUs and GPUs, making the software very scalable for fast performance and deployable to most users across desktop, mobile, web, and cloud.
The beauty of TensorFlow is that users can define a neural network - or, more generally, a graph of computations - in Python, and TensorFlow is able to take the neural network and run it using C++ code, which is much faster than Python.
TensorFlow is also able to parallelize the computations, breaking down the entire series of operations into separate chunks and running them in parallel across multiple CPUs and GPUs.
Performance like this is a very important consideration for large-scale machine learning applications like those that Google runs for its core operations such as search.
While there are other open-source libraries capable of similar feats, TensorFlow has become the most popular, partly due to Google’s brand.
Before we move ahead, let’s build a TensorFlow graph and run a computation.
We will import TensorFlow, define a few variables using the TensorFlow API (which resembles the Scikit-Learn API we’ve used in previous chapters), and then compute the values for those variables.
import tensorflow as tf

b = tf.constant(50)
x = b * 10
y = x + b

with tf.Session() as sess:
    result = y.eval()

print(result)
It is important to realize that there are two phases here. First, we construct the computation graph, defining b, x, and y.
Then, we execute the graph by opening a tf.Session() and evaluating y within it. Until we do this, no computations are executed by the CPU and/or GPU. Rather, only the instructions for the computations are stored.
Once you execute this block of code, you will see the result of “550” as expected.
Later on, we will build actual neural networks using TensorFlow.
Keras is an open-source software library and provides a high-level API that runs on top of TensorFlow. It provides a much more user-friendly interface for TensorFlow, allowing data scientists and researchers to experiment faster and more easily than if they had to work directly with the TensorFlow commands. Keras was also primarily authored by a Google engineer, Francois Chollet.
When we start building models using TensorFlow, we will work hands-on with Keras and explore its advantages.
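As a preview, here is a minimal sketch - with hypothetical layer sizes and a hypothetical three-class problem - of how compactly Keras expresses a feedforward network, including several of the hyperparameters mentioned earlier:

from keras.models import Sequential
from keras.layers import Dense

# A feedforward network: 20 input features, two hidden layers
# with ReLU activations, and a softmax output over 3 classes
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))
model.add(Dense(32, activation='relu'))
model.add(Dense(3, activation='softmax'))

# The cost function and the optimizer (which controls the learning
# rate) are set here; epochs and batch size are set when fitting
model.compile(optimizer='adam', loss='categorical_crossentropy')
# model.fit(X_train, y_train, epochs=10, batch_size=32)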
Now that we’ve introduced neural networks and the popular libraries to work with them in Python - TensorFlow and Keras - let’s build an autoencoder, one of the simplest unsupervised learning neural networks.
An autoencoder is composed of two parts, an encoder and a decoder. The encoder converts the input set of features into a different representation - via representation learning - and the decoder converts this newly learned representation back into the original format.
The core concept of an autoencoder is similar to the concept of dimensionality reduction we studied in Chapter 3. Similar to dimensionality reduction, an autoencoder does not learn to reproduce the original observations and features exactly as they are - that mapping is known as the identity function. If it learned the exact identity function, the autoencoder would not be useful.
Rather, autoencoders are restricted to a degree so that they must approximate the original observations as closely as possible - but not exactly - using a newly learned representation; in other words, the autoencoder learns an approximation of the identity function.
Since the autoencoder is constrained, it is forced to learn the most salient properties of the original data, capturing the underlying structure of the data; this is similar to what happens in dimensionality reduction.
The constraint is a very important attribute of autoencoders - the constraint forces the autoencoder to intelligently choose which important information to capture and which irrelevant or less important information to discard.
Autoencoders have been around for decades, and, as you may suspect already, they have been used widely for dimensionality reduction and automatic feature engineering/learning. Nowadays, they are often used to build generative models such as generative adversarial networks.
In the autoencoder, we care most about the encoder because this component is the one that learns a new representation of the original data. This new representation is the new set of features derived from the original set of features and observations.
We will refer to the encoder function of the autoencoder as h = f(x), which takes in the original observations x and uses the newly learned representation captured in function f to output h.
The decoder function that reconstructs the original observations using the encoder function is r = g(h).
As you can see, the decoder function feeds in the encoder’s output h and reconstructs the observations, known as r, using its reconstruction function g.
If done correctly, g(f(x)) will not be exactly equal to x everywhere but will be close enough.
How do we restrict the encoder function to approximate x so that it is forced to learn only the most salient properties of x without copying it exactly?
We can constrain the encoder function’s output, h, to have fewer dimensions than x. This is known as an undercomplete autoencoder since the encoder’s dimensions are fewer than the original input dimensions. This is again similar to what happens in dimensionality reduction, where we take in the original input dimensions and reduce them to a much smaller set.
Constrained in this manner, the autoencoder attempts to minimize a loss function we define such that the reconstruction error - after the decoder reconstructs the observations approximately using the encoder’s output - is as small as possible.
It is important to realize that the hidden layers are where the dimensions are constrained. In other words, the output of the encoder has fewer dimensions than the original input. But, the output of the decoder is the reconstructed original data and, therefore, has the same number of dimensions as the original input.
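A minimal sketch of an undercomplete autoencoder in Keras - assuming a hypothetical 30 original features constrained down to 10 - might look like this:

from keras.models import Model
from keras.layers import Input, Dense

original_dim = 30   # hypothetical number of original features
encoding_dim = 10   # constrained to fewer dimensions than the input

input_layer = Input(shape=(original_dim,))
# Encoder h = f(x): compresses the input into fewer dimensions
encoded = Dense(encoding_dim, activation='relu')(input_layer)
# Decoder r = g(h): reconstructs back to the original dimensions
decoded = Dense(original_dim, activation='linear')(encoded)

autoencoder = Model(input_layer, decoded)
# Minimize the reconstruction error between r and x; note that the
# model is trained to reproduce its own input
autoencoder.compile(optimizer='adam', loss='mean_squared_error')
# autoencoder.fit(X, X, epochs=10, batch_size=32)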
When the decoder is linear and the loss function is the mean squared error, an undercomplete autoencoder learns the same sort of new representation as principal component analysis, a form of dimensionality reduction we introduced in Chapter 3.
However, if the encoder and decoder functions are nonlinear, autoencoders can learn much more complex nonlinear representations. This is what we care about most.
But, be warned - if the autoencoder is given too much capacity and latitude to model complex, nonlinear representations, it will simply memorize/copy the original observations instead of extracting the most salient information from them. Therefore, we must restrict the autoencoder meaningfully enough so that this does not happen.
If the encoder learns a representation in a greater number of dimensions than the original input dimensions, the autoencoder is referred to as overcomplete. Such autoencoders simply copy the original observations and are not forced to efficiently and compactly capture information about the original distribution in the way that undercomplete autoencoders are.
That being said, if we employ some form of regularization, which penalizes the neural network for learning unnecessarily complex functions, overcomplete autoencoders can be used successfully for dimensionality reduction and automatic feature engineering.
Compared to undercomplete autoencoders, regularized overcomplete autoencoders are harder to design successfully but are potentially more powerful because they can learn more complex - but not overly complex - representations that better approximate the original observations without copying them precisely.
In a nutshell, autoencoders that perform well are ones that learn a new representation that approximates the original observations closely enough - but not exactly - and, to do this, the autoencoders essentially learn a new probability distribution.
If you recall from Chapter 3, we had both dense (the normal) and sparse versions of dimensionality reduction algorithms. Autoencoders work similarly.
So far, we’ve discussed just the normal autoencoders, which output a dense final matrix such that the most salient information captured about the original data is concentrated in a handful of features.
However, we may instead want to output a sparse final matrix such that the information captured is more well-distributed across the features that the autoencoder learns.
To do this, we need to include not just a reconstruction error as part of the autoencoder but also a sparsity penalty so that the autoencoder must take the sparsity of the final matrix into consideration.
Sparse autoencoders are generally overcomplete - the hidden layers have more units than the number of input features with the caveat that only a small fraction of the hidden units are allowed to be active at the same time.
When defined in this way, a sparse autoencoder will output a final matrix that has many more zeros embedded throughout and the information captured will be better distributed across the features learned.
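One common way to impose the sparsity penalty - sketched here with a hypothetical L1 activity regularizer in Keras - is to penalize the hidden layer’s activations so that only a few units are strongly active for any given observation:

from keras import regularizers
from keras.models import Model
from keras.layers import Input, Dense

original_dim = 30   # hypothetical number of original features
hidden_dim = 60     # overcomplete: more hidden units than inputs

input_layer = Input(shape=(original_dim,))
# The L1 activity regularizer acts as the sparsity penalty,
# pushing most hidden-unit activations toward zero
encoded = Dense(hidden_dim, activation='relu',
                activity_regularizer=regularizers.l1(1e-5))(input_layer)
decoded = Dense(original_dim, activation='linear')(encoded)

sparse_autoencoder = Model(input_layer, decoded)
# The loss now combines reconstruction error with the sparsity penalty
sparse_autoencoder.compile(optimizer='adam', loss='mean_squared_error')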
For certain machine learning applications, sparse autoencoders have better performance and also learn somewhat different representations than the normal (dense) autoencoders would.
Later, we will work with real examples to see the difference between these two types of autoencoders.
As you know by now, autoencoders are capable of learning new (and improved) representations from the original input data, capturing the most salient elements while disregarding the noise in the original data.
In some cases, we may want the autoencoder we design to more aggressively ignore the noise in the data, especially if we suspect the original data is corrupted to some degree.
For example, imagine if we record a conversation between two people at a noisy coffee shop in the middle of the day. We want to isolate the conversation (the signal) from the background chatter (the noise).
Or, imagine a dataset of images that are grainy or distorted due to low resolution or some blurring effect. We want to isolate the core image (the signal) from the distortion (the noise).
For these problems, we can design a denoising autoencoder that receives the corrupted data as input and is trained to output the original, uncorrupted data as best as possible.
While this is not easy to do, it is clearly a very powerful application of autoencoders to solving real-world problems.
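As a sketch of the training setup - assuming hypothetical training data, Gaussian corruption, and an autoencoder like the ones sketched above - the key idea is to feed in the corrupted data but score the reconstruction against the clean data:

import numpy as np

X = np.random.rand(1000, 30)  # hypothetical clean training data

# Corrupt the input with Gaussian noise
noise_factor = 0.2
X_noisy = X + noise_factor * np.random.normal(size=X.shape)

# Train on the corrupted input but reconstruct the clean original,
# so the autoencoder learns to strip away the noise
# autoencoder.fit(X_noisy, X, epochs=10, batch_size=32)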
So far, we have discussed the use of autoencoders to learn new representations of the original input data (via the encoder) to minimize the reconstruction error between the newly reconstructed data (via the decoder) and the original input data.
In these examples, the encoder is of a fixed size, n, where n is typically smaller than the number of original dimensions - in other words, we train an undercomplete autoencoder.
Or n may be larger than the number of original dimensions - an overcomplete autoencoder - but constrained using a regularization penalty, a sparsity penalty, etc.
But, in all these autoencoders, the encoder outputs a single vector of a fixed size n.
An alternative autoencoder known as the variational autoencoder (VAE) has an encoder that outputs two vectors instead of one: a vector of means, mu, and a vector of standard deviations, sigma.
These two vectors form random variables such that the ith element of mu and sigma corresponds to the mean and standard deviation of the ith random variable.
By forming this stochastic output via its encoder, the variational autoencoder is able to sample across a continuous space based on what it has learned from the input data. The variational autoencoder is not confined to just the examples it has trained on but can generalize and output new examples even if it may have never seen precisely similar ones before.
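The sampling step can be sketched as follows - with NumPy and hypothetical mu and sigma vectors for a three-dimensional learned space - where the encoder’s two output vectors define the distribution from which each new example is drawn:

import numpy as np

# Hypothetical encoder outputs for a three-dimensional space
mu = np.array([0.0, 1.5, -0.7])    # vector of means
sigma = np.array([1.0, 0.3, 0.5])  # vector of standard deviations

# Sample the ith element from a normal distribution with mean
# mu[i] and standard deviation sigma[i]
epsilon = np.random.normal(size=mu.shape)
z = mu + sigma * epsilon

# Each fresh draw of epsilon yields a different z, which the decoder
# can turn into a new example the autoencoder has never seen
print(z)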
This is incredibly powerful because now the variational autoencoders can generate new synthetic data that appears to belong in the distribution the variational autoencoder has learned from the original input data.
Advances like this have led to an entirely new and trending field in unsupervised learning known as generative modeling, which includes generative adversarial networks.
With these models, it is possible to generate synthetic images, speech, music, art, etc., opening up a world of possibilities for AI-generated data.
In this chapter, we introduced neural networks and the popular open-source libraries to work with them, TensorFlow and Keras.
We also explored autoencoders and their ability to learn new representations from original input data. Variations include sparse autoencoders, denoising autoencoders, and variational autoencoders, among others.
In Chapter 8, we will build hands-on applications using the techniques we have discussed in this chapter.
Before we move on, let’s revisit why automatic feature extraction is so important. Without the ability to automatically extract features, data scientists and machine learning engineers would have to hand-engineer features that might be important in solving real-world problems. This is very time-consuming and would dramatically limit progress in the field of artificial intelligence.
In fact, until Geoffrey Hinton and other researchers developed methods to automatically learn new features using neural networks - launching the deep learning revolution starting in 2006 - problems involving computer vision, speech recognition, machine translation, etc. remained largely intractable.
Once autoencoders and other variations of neural networks were used to automatically extract features from input data, a lot of these problems became solvable, leading to some major breakthroughs in machine learning over the past decade.
You will see the power of automatic feature extraction in the hands-on application of autoencoders in the next chapter.
1 This process is known as regularization.
2 For more on TensorFlow.