Regression typically works best when the ratio of number of data points to number of variables is high. However, in some scenarios, such as clinical trials, the number of data points is limited (given the difficulty in collecting samples from many individuals), and the amount of information collected is high (think of how much information labs give us based on small samples of collected blood).
In such cases, two problems typically arise: there is a high chance that a majority of the variables are correlated with each other, and the time taken to run a regression can be extensive because the number of weights that need to be estimated is large.
Techniques like principal component analysis (PCA) come to the rescue in such cases. PCA is an unsupervised learning technique that helps in grouping multiple variables into fewer variables without losing much information from the original set of variables.
In this chapter, we will look at how PCA works and get to know the benefits of performing PCA. We will also implement it in Python and R.
Intuition of PCA
| Dep Var | Var 1 | Var 2 |
|---|---|---|
| 0 | 1 | 10 |
| 0 | 2 | 20 |
| 0 | 3 | 30 |
| 0 | 4 | 40 |
| 0 | 5 | 50 |
| 1 | 6 | 60 |
| 1 | 7 | 70 |
| 1 | 8 | 80 |
| 1 | 9 | 90 |
| 1 | 10 | 100 |
We’ll assume both Var 1 and Var 2 are the independent variables used to predict the dependent variable (Dep Var). We can see that Var 2 is perfectly correlated with Var 1: Var 2 = 10 × Var 1.
A plot of their relation can be seen in Figure 12-1.

Figure 12-1. Plotting the relation
In the figure, we can clearly see that there is a strong relation between the variables. This means the number of independent variables can be reduced.
The equation can be expressed like this:
Var2 = 10 × Var1
In other words, instead of using two different independent variables, we could use just one variable, Var 1, and it would carry the same information for solving the problem.
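As a quick check, here is a minimal Python sketch using the toy table above; it only confirms that Var 2 is an exact multiple of Var 1 and therefore adds no new information:

```python
import numpy as np

# The toy dataset from the table above
var1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
var2 = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

# Var 2 is exactly 10 * Var 1, so the two variables are perfectly correlated
print(np.allclose(var2, 10 * var1))     # True
print(np.corrcoef(var1, var2)[0, 1])    # ~1.0
```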
Moreover, if we view the two variables from a slightly different angle (that is, if we rotate the dataset), as indicated by the arrow in Figure 12-2, we see a lot of variation in the horizontal direction and very little in the vertical direction.

Figure 12-2. Viewpoint/angle from which the data points should be looked at
Let’s complicate our dataset a bit. Consider a case where the relation between the two variables is like the one shown in Figure 12-3.

Figure 12-3. Plotting two variables
Again, the two variables are highly correlated with each other, though not as perfectly correlated as the previous case.
In such a scenario, the first principal component is the line/variable that explains the maximum variance in the dataset and is a linear combination of multiple independent variables. Similarly, the second principal component is the line that is completely uncorrelated (has a correlation of close to 0) with the first principal component and that explains the rest of the variance in the dataset, while also being a linear combination of multiple independent variables.
Typically the second principal component is a line that is perpendicular to the first principal component (because the next highest variation happens in a direction that is perpendicular to the principal component line).
In general, the nth principal component of a dataset is perpendicular to the (n – 1)th principal component of the same dataset.
Working Details of PCA

Given that a principal component is a linear combination of variables, we’ll express it as follows:
PC1 = w1 × x1 + w2 × x2
The second principal component is perpendicular to the first principal component, as follows:
PC2 = –w2 × x1 + w1 × x2
The weights w1 and w2 are randomly initialized and then iteratively updated to obtain the optimal values.
Objective: Maximize PC1 variance.
Constraint: The overall variance in the principal components should be equal to the overall variance in the original dataset (the data points themselves did not change; only the angle from which we view them changed).



Note that PC variance = PC1 variance + PC2 variance, and original variance = x1 variance + x2 variance.

Once the dataset is initialized, we will proceed with identifying the optimal values of w1 and w2 that satisfy our objective and constraint.

After solving, we see that PC1 variance is maximized, and there is hardly any difference between the original dataset variance and the principal component dataset variance. (We have allowed a small difference of less than 0.01 only so that Excel is able to solve it, because there may be some rounding-off errors.)
Note that PC1 and PC2 are now almost completely uncorrelated with each other, and PC1 explains the highest variance across all variables. Moreover, x2 has a higher weightage than x1 in determining PC1 (as is evident from the derived weight values).
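For readers who prefer code to Excel Solver, the following is a minimal sketch of the same constrained search in Python using scipy.optimize. The x1/x2 values below are made-up placeholders, not the book's worksheet data:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical two-column dataset; centre the columns before computing variances
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
x2 = np.array([12.0, 19.0, 33.0, 38.0, 51.0, 58.0, 72.0, 79.0, 95.0, 98.0])
x1, x2 = x1 - x1.mean(), x2 - x2.mean()

def neg_pc1_var(w):
    # PC1 = w1*x1 + w2*x2; maximizing its variance = minimizing the negative
    pc1 = w[0] * x1 + w[1] * x2
    return -pc1.var()

def variance_gap(w):
    # Constraint: total PC variance must equal total original variance
    pc1 = w[0] * x1 + w[1] * x2
    pc2 = -w[1] * x1 + w[0] * x2    # PC2 = -w2*x1 + w1*x2, perpendicular to PC1
    return (pc1.var() + pc2.var()) - (x1.var() + x2.var())

res = minimize(neg_pc1_var, x0=[0.5, 0.5],
               constraints={'type': 'eq', 'fun': variance_gap})
w1, w2 = res.x
print("w1 =", round(w1, 3), "w2 =", round(w2, 3))
```

Because PC1 and PC2 together preserve the total variance, the constraint effectively forces w1² + w2² = 1, so the search recovers the unit-length direction of maximum variance.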


Scaling Data in PCA
One of the major pre-processing steps in PCA is scaling the variables. Consider the following scenario: we are performing PCA on two variables, one with values ranging from 0–1 and the other with values ranging from 0–100.
Given that, in PCA, we are trying to capture as much variation in the dataset as possible, the first principal component will give a very high weightage to the variable with the higher variance (here, the 0–100 variable) compared to the variable with low variance.
Hence, when we work out w1 and w2 for the first principal component, we end up with a w1 that is close to 0 and a w2 that is close to 1 (where w2 is the weight in PC1 corresponding to the higher-ranged variable). To avoid this, it is advisable to scale each variable so that both of them have a similar range and their variances are comparable.
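A small illustration of this effect with scikit-learn follows. The two columns below are synthetic and only meant to mimic the 0–1 and 0–100 ranges described above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
small = rng.uniform(0, 1, 1000)               # variable with a 0-1 range
large = 100 * small + rng.normal(0, 2, 1000)  # correlated variable with a roughly 0-100 range
X = np.column_stack([small, large])

# Without scaling, PC1 is dominated by the large-range variable
print(PCA(n_components=2).fit(X).components_[0])
# weight on the large-range variable is close to +/-1

# After scaling, both variables contribute comparably to PC1
X_scaled = StandardScaler().fit_transform(X)
print(PCA(n_components=2).fit(X_scaled).components_[0])
# weights of comparable magnitude (about 0.707 each, up to sign)
```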
Extending PCA to Multiple Variables
So far, we have seen how to build PCA when there are two independent variables. In this section, we will consider how to hand-build PCA when there are more than two independent variables.


From this matrix, we can consider PC1 = 0.49 × x1 + 0.89 × x2 + 0.92 × x3. PC2 and PC3 would be worked out similarly. If there were four independent variables, we would have a 4 × 4 weight matrix.
Objective: Maximize PC1 variance.
Constraints: Overall PC variance should be equal to overall original dataset variance. PC1 variance should be greater than PC2 variance, PC1 variance should be greater than PC3 variance, and PC2 variance should be greater than PC3 variance.

Solving the preceding would result in the optimal weight combination that satisfies our criteria. Note that the output from Excel could be slightly different from the output you would see in Python or R; the Python or R output is likely to have a higher PC1 variance than Excel's, due to the underlying algorithm used in solving. Also note that, even though ideally we would want the difference between the original and PC variances to be 0, for practical reasons of executing the optimization with Excel Solver we have allowed the difference to be at most 3.
Just as in the two-variable scenario, it is a good idea to scale the inputs before performing PCA. Also note that PC1 explains the highest variation after solving for the weights, and hence PC2 and PC3 can be eliminated, because they explain very little of the original dataset variance.
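As a cross-check, the same multi-variable exercise can be done in closed form: the optimal weights that the solver searches for are the eigenvectors of the covariance matrix of the scaled data. The sketch below uses a small synthetic three-column dataset, since the book's worksheet data is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic three-variable dataset with built-in correlation between columns
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + rng.normal(scale=0.3, size=200)
x3 = 0.5 * x1 - 0.4 * x2 + rng.normal(scale=0.3, size=200)
X = np.column_stack([x1, x2, x3])
X = (X - X.mean(axis=0)) / X.std(axis=0)      # scale the inputs before PCA

# Eigen-decomposition of the covariance matrix gives the optimal weight matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]             # sort components by variance explained
weights = eigvecs[:, order]                   # columns: weights for PC1, PC2, PC3
pcs = X @ weights                             # the transformed dataset

print("variance explained by each PC:", eigvals[order] / eigvals.sum())
```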
Choosing the Number of Principal Components to Consider
There is no single prescribed method for choosing the number of principal components. In practice, a rule of thumb is to choose the minimum number of principal components that cumulatively explain about 80% of the total variance in the dataset.
Implementing PCA in R

The standard deviation values here are the standard deviations of the principal component variables. The rotation values correspond to the weight values we initialized earlier.

From this, we notice that apart from the standard deviation of PC variables and the weight matrix, pca also provides the transformed dataset.
We can access the transformed dataset by specifying pca$x.
Implementing PCA in Python
We fit PCA on top of the data, with as many components as there are independent variables.
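A minimal sketch of what this step might look like with scikit-learn, assuming a feature matrix X with one column per independent variable (the variable names are placeholders, not the book's exact code):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data: any (n_samples, n_features) matrix works here
X = np.random.rand(100, 3)

X_scaled = StandardScaler().fit_transform(X)    # scale before PCA
pca = PCA(n_components=X_scaled.shape[1])       # one component per variable
pcs = pca.fit_transform(X_scaled)               # the transformed dataset

print(pca.explained_variance_ratio_)            # share of variance per component
print(pca.components_)                          # weight (rotation) matrix, one row per PC
```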


explained_variance_ratio_ provides the proportion of the total variance explained by each principal component. This is closely related to the standard deviation output in R: R gives us the standard deviation of each principal component, whereas scikit-learn in Python transforms this slightly and gives us the fraction of the original variance explained by each principal component.
Applying PCA to MNIST
MNIST is a handwritten digit recognition task. Each 28 × 28 image is unrolled so that every pixel value is represented in a column, and based on that, one is expected to predict which of the digits 0 through 9 the image represents. The unrolled pixel columns fall into three groups: columns with zero variance, columns with very little variance, and columns with high variance.
In a way, PCA helps us eliminate the low- and zero-variance columns as much as possible while still achieving decent accuracy with a limited number of columns.
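A sketch of how the cumulative explained variance might be computed for MNIST with scikit-learn follows; fetch_openml downloads the dataset, and the exact component count can vary slightly with preprocessing:

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA

# Load MNIST: 70,000 images unrolled into 784 pixel columns
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)

pca = PCA(n_components=784).fit(X / 255.0)       # scale pixel values to 0-1
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components explaining at least 80% of the total variance
print(np.argmax(cumulative >= 0.80) + 1)
```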

From this we can see that the first 43 principal components explain ~80% of the total variance in the original dataset. Instead of running a model on all 784 columns, we could run the model on the first 43 principal components without losing much information and hence without losing out much on accuracy.
Summary
PCA is a way of reducing the number of independent variables in a dataset and is particularly applicable when the ratio of data points to independent variables is low.
It is a good idea to scale independent variables before applying PCA.
PCA forms linear combinations of the variables such that the resulting variable (the first principal component) expresses the maximum variance among all such combinations.