Machine learning can be broadly classified into supervised and unsupervised learning. By definition, the term supervised means that the “machine” (the system) learns with the help of something—typically labeled training data.
Training data (or a dataset) is the basis on which the system learns to infer. An example of this process is to show the system a set of images of cats and dogs with the corresponding labels of the images (the labels say whether the image is of a cat or a dog) and let the system decipher the features of cats and dogs.
Unsupervised learning, on the other hand, is the process of grouping data into similar categories. An example of this is to input into the system a set of images of dogs and cats without mentioning which image belongs to which category and let the system group the two types of images into different buckets based on their similarity.
In this chapter, we will also look at the following:
- The difference between regression and classification
- The need for training, validation, and testing data
- The different measures of accuracy
Regression and Classification
Let’s assume that we are forecasting the number of units of Coke that will be sold in summer in a certain region. The value can range between, say, 1 million and 1.2 million units per week. Typically, regression is the way to forecast such continuous variables.
Classification or prediction, on the other hand, predicts events that have only a few distinct outcomes—for example, whether a day will be sunny or rainy.
Linear regression is a typical example of a technique to forecast continuous variables, whereas logistic regression is a typical technique to predict discrete variables. There are a host of other techniques, including decision trees, random forests, GBM, neural networks, and more, that can help predict both continuous and discrete outcomes.
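As a quick, illustrative sketch of this distinction (the data here is made up, and scikit-learn is just one of many possible libraries), linear regression fits a continuous outcome while logistic regression predicts a discrete one:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Continuous outcome: weekly units sold (illustrative numbers, in millions)
temperature = np.array([[20], [25], [30], [35], [40]])
units_sold = np.array([1.00, 1.05, 1.10, 1.15, 1.20])
reg = LinearRegression().fit(temperature, units_sold)
print(reg.predict([[32]]))        # forecast of a continuous value

# Discrete outcome: rainy (1) or sunny (0)
humidity = np.array([[10], [30], [50], [70], [90]])
is_rainy = np.array([0, 0, 0, 1, 1])
clf = LogisticRegression().fit(humidity, is_rainy)
print(clf.predict([[60]]))        # predicted class label
print(clf.predict_proba([[60]]))  # probability of each class
```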
Training and Testing Data
Figure 1-1. An overfitted dataset
In Figure 1-1, you can see that the straight line does not fit all the data points perfectly, whereas the curve fits them perfectly—hence the curve has minimal error on the data points on which it is trained.
However, the straight line has a better chance of generalizing to a new dataset than the curve does. So, in practice, regression/classification is a trade-off between the generalizability and the complexity of the model.
The lower the generalizability of the model, the higher the error rate will be on “unseen” data points.
Figure 1-2. Error rate in unseen data points
The unseen data points are the points that are not used in training the model, but are used in testing the accuracy of the model, and so are called testing data or test data.
The Need for a Validation Dataset
The major problem with having only a fixed training and testing dataset is that the test dataset might be very similar to the training dataset, whereas a new (future) dataset might not be. If a future dataset is not similar to the training dataset, the model’s accuracy on that future dataset may be very low.
An intuition for this problem can be seen in data science competitions and hackathons like Kaggle (www.kaggle.com). The public leaderboard is not always the same as the private leaderboard. Typically, the competition organizer does not tell participants which rows of the test dataset belong to the public leaderboard and which belong to the private leaderboard. Essentially, a randomly selected subset of the test dataset goes to the public leaderboard and the rest goes to the private leaderboard.
One can think of the private leaderboard as a test dataset for which the accuracy is not known to the user, whereas with the public leaderboard the user is told the accuracy of the model.
Potentially, people overfit on the basis of the public leaderboard, and the private leaderboard might be a slightly different dataset that is not highly representative of the public leaderboard’s dataset.
Figure 1-3. The problem illustrated
In this case, you can see that a user dropped from rank 17 on the public leaderboard to rank 47 on the private leaderboard. Cross-validation is a technique that helps avoid this problem. Let’s go through how it works in detail.
If we have only a training and a testing dataset, then—given that the testing dataset must remain unseen by the model—we are not in a position to choose the combination of hyper-parameters (a hyper-parameter can be thought of as a knob we turn to improve the model’s accuracy) that maximizes accuracy on unseen data, unless we have a third dataset. Validation is that third dataset, used to see how accurate the model is as the hyper-parameters are changed. Typically, out of 100% of the data points in a dataset, 60% are used for training, 20% for validation, and the remaining 20% for testing.
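A minimal sketch of such a 60/20/20 split, using scikit-learn's train_test_split twice (the random data and variable names are purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 100 rows, 5 input columns, 1 target column
X = np.random.rand(100, 5)
y = np.random.rand(100)

# First carve out 20% for testing, then 20% of the total (25% of the remainder) for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60, 20, 20
```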
Another way to think about the need for a validation dataset goes like this: assume that you are building a model to predict whether a customer is likely to churn in the next two months. Most of the dataset would be used to train the model, and the rest could be used to test it. But most of the techniques we deal with in subsequent chapters involve hyper-parameters, and two constraints apply:
1. We cannot test a model’s accuracy on the dataset on which it is trained.
2. We cannot use the result of test dataset accuracy to finalize the ideal hyper-parameters, because, practically, the test dataset is unseen by the model.

Hence the need for a third dataset—the validation dataset.
Measures of Accuracy
In a typical linear regression (where continuous values are predicted), there are a couple of ways of measuring the error of a model. Typically, error is measured on the testing dataset, because measuring it on the training dataset (the dataset the model is built on) is misleading—the model has already seen those data points, so training accuracy tells us little about accuracy on a future dataset. That’s why error is always measured on data that was not used to build the model.
Absolute Error
| | Actual value | Predicted value | Error | Absolute error |
|---|---|---|---|---|
| Data point 1 | 100 | 120 | 20 | 20 |
| Data point 2 | 100 | 80 | –20 | 20 |
| Overall | 200 | 200 | 0 | 40 |
In this scenario, we might incorrectly see that the overall error is 0 (because one error is +20 and the other is –20). If we assume that the overall error of the model is 0, we are missing the fact that the model is not working well on individual data points.
To avoid the issue of positive and negative errors cancelling each other out and thus showing minimal error, we consider the absolute error of the model, which in this case is 40, and the absolute error rate, which is 40 / 200 = 20%.
Root Mean Square Error
| | Actual value | Predicted value | Error | Squared error |
|---|---|---|---|---|
| Data point 1 | 100 | 120 | 20 | 400 |
| Data point 2 | 100 | 80 | –20 | 400 |
| Overall | 200 | 200 | 0 | 800 |
Now the overall squared error is 800, and the root mean squared error (RMSE) is the square root of (800 / 2), which is 20.
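Both error measures can be reproduced in a few lines of code; this sketch simply re-computes the numbers from the two tables above:

```python
import numpy as np

actual = np.array([100, 100])
predicted = np.array([120, 80])
error = predicted - actual            # [20, -20]; sums to 0 and hides the problem

absolute_error = np.abs(error).sum()                 # 40
absolute_error_rate = absolute_error / actual.sum()  # 0.2, i.e., 20%

rmse = np.sqrt((error ** 2).mean())                  # sqrt(800 / 2) = 20

print(absolute_error, absolute_error_rate, rmse)
```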
Confusion Matrix
Absolute error and RMSE are applicable while predicting continuous variables. However, predicting an event with discrete outcomes is a different process. Discrete event prediction happens in terms of probability—the result of the model is a probability that a certain event happens. In such cases, even though absolute error and RMSE can theoretically be used, there are other relevant metrics.
| | Predicted fraud | Predicted non-fraud |
|---|---|---|
| Actual fraud | True positive (TP) | False negative (FN) |
| Actual non-fraud | False positive (FP) | True negative (TN) |
- Sensitivity, or true positive rate, or recall = true positives / total positives = TP / (TP + FN)
- Specificity, or true negative rate = true negatives / total negatives = TN / (FP + TN)
- Precision, or positive predictive value = TP / (TP + FP)
- Recall = TP / (TP + FN)
- Accuracy = (TP + TN) / (TP + FN + FP + TN)
- F1 score = 2TP / (2TP + FP + FN)
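A small sketch that computes these metrics directly from the four cells of the confusion matrix (the cell counts below are made-up values for illustration):

```python
# Illustrative counts for the four cells of the confusion matrix
TP, FN, FP, TN = 80, 20, 30, 870

sensitivity = TP / (TP + FN)          # also called recall or true positive rate
specificity = TN / (FP + TN)          # true negative rate
precision   = TP / (TP + FP)          # positive predictive value
accuracy    = (TP + TN) / (TP + FN + FP + TN)
f1_score    = 2 * TP / (2 * TP + FP + FN)

print(sensitivity, specificity, precision, accuracy, f1_score)
```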
AUC Value and ROC Curve
Let’s say an operations team manually reviews every transaction to check whether it is fraudulent.
- The cost associated with such a process is the manpower required to review all the transactions.
- The benefit associated with that cost is the number of fraudulent transactions that are preempted because of the manual review.
- The overall profit associated with this setup is the money saved by preventing fraud minus the cost of the manual review.
In such a scenario, a model can come in handy as follows: we could build a model that gives each transaction a score for the probability of being fraudulent. All the transactions that have very little chance of being fraudulent would then not need to be reviewed manually. The benefit of the model would be to reduce the number of transactions that need review, thereby reducing the manpower needed and the cost associated with the reviews. However, because the low-probability transactions are not reviewed, some fraud could still slip through uncaptured.
In that scenario, a model is helpful if it improves the overall profit by reducing the number of transactions to be reviewed (which, hopefully, are the transactions that are least likely to be fraudulent).
To see how a model’s scores translate into the ROC curve and the AUC value, we do the following:
1. Score each transaction to calculate the probability of fraud. (The scoring is based on a predictive model—more details on this in Chapter 3.)
2. Order the transactions in descending order of probability.
Ideally, there should be very few non-fraud data points at the top of the ordered dataset and very few fraud data points at the bottom. The AUC value penalizes a model for having such anomalies in the ordering.
The x-axis of the receiver operating characteristic (ROC) curve is the cumulative number of points (transactions) considered.
The y-axis is the cumulative number of fraudulent transactions captured.
Once we order the dataset, intuitively the high-probability transactions should be the fraudulent ones and the low-probability transactions should not be. The cumulative number of frauds captured rises quickly over the initial transactions and then saturates, because considering more transactions beyond a certain point adds few additional frauds.
Figure 1-4. Cumulative frauds captured when using a model
In this scenario, we have a total of 10,000 fraudulent transactions out of a total of 1,000,000 transactions. That’s an average fraud rate of 1%—that is, one out of every 100 transactions is fraudulent.
Figure 1-5. Cumulative frauds captured when transactions are randomly sampled
In Figure 1-5, you can see that the line divides the total dataset into two roughly equal parts—the area under the line is equal to 0.5 times the total area. For convenience, if we assume that the total area of the plot is 1 unit, then the total area under the line generated by the random-guess model would be 0.5.
Figure 1-6. Comparison of cumulative frauds
Note that the area under the curve (AUC) of the curve generated by the predictive model is greater than 0.5 in this instance.
Thus, the higher the AUC, the better the predictive power of the model.
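As a hedged sketch of the idea (the labels and model scores below are simulated, and roc_auc_score from scikit-learn is just one convenient way to compute AUC), we can order transactions by their predicted probability, track the cumulative frauds captured, and compare the model's AUC against the 0.5 of a random guess:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

np.random.seed(0)
n = 100_000
is_fraud = (np.random.rand(n) < 0.01).astype(int)   # ~1% fraud rate, as in the example

# A hypothetical model score: frauds tend to receive higher probabilities than non-frauds
score = 0.3 * is_fraud + 0.7 * np.random.rand(n)

# Order transactions by descending score and track cumulative frauds captured
order = np.argsort(-score)
cumulative_frauds = np.cumsum(is_fraud[order])
print("Frauds captured in the top 10% of transactions:", cumulative_frauds[n // 10 - 1])

# AUC of the model's scores vs. 0.5 for random guessing
print("AUC:", roc_auc_score(is_fraud, score))
```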
Unsupervised Learning
So far we have looked at supervised learning, where there is a dependent variable (the variable we are trying to predict) and an independent variable (the variable(s) we use to predict the dependent variable value).
However, in some scenarios, we would only have the independent variables—for example, in cases where we have to group customers based on certain characteristics. Unsupervised learning techniques come in handy in those cases.
Two major unsupervised learning techniques are as follows:
- Clustering-based approach
- Principal components analysis (PCA)
Clustering is an approach where rows are grouped, and PCA is an approach where columns are grouped. We can think of clustering as being useful in assigning a given customer into one or the other group (because each customer typically represents a row in the dataset), whereas PCA can be useful in grouping columns (alternatively, reducing the dimensionality/variables of data).
Though clustering helps in segmenting customers, it can also be a powerful pre-processing step in our model-building process (you’ll read more about that in Chapter 11). PCA can help speed up the model-building process by reducing the number of dimensions, thereby reducing the number of parameters to estimate.
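A minimal sketch of the contrast, on a made-up customer table (the choice of three clusters and two principal components is arbitrary): k-means assigns each row to a segment, while PCA compresses the columns into fewer derived dimensions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Illustrative customer data: 200 customers (rows) described by 10 attributes (columns)
customers = np.random.rand(200, 10)

# Clustering groups rows: each customer gets a segment label
segments = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(customers)
print(segments[:10])

# PCA groups/compresses columns: 10 attributes reduced to 2 derived dimensions
reduced = PCA(n_components=2).fit_transform(customers)
print(reduced.shape)  # (200, 2)
```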
For each of the techniques discussed in this book, we will do the following:
1. We first hand-code it in Excel.
2. We implement it in R.
3. We implement it in Python.
The basics of Excel, R and Python are outlined in the appendix.
Typical Approach Towards Building a Model
In the previous section, we saw a scenario of the cost-benefit analysis of an operations team implementing the predictive models in a real-world scenario. In this section, we’ll look at some of the points you should consider while building the predictive models.
Where Is the Data Fetched From?
Typically, data is available in database tables, CSV files, or text files. In a database, different tables may capture different information. For example, in order to understand fraudulent transactions, we would likely join a transactions table with a customer demographics table to derive insights from the data.
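As an illustration, such a join might look like the following in pandas (the table and column names are assumptions; in practice this could just as well be a SQL JOIN inside the database):

```python
import pandas as pd

# Illustrative tables; in practice these would be read from a database or CSV files
transactions = pd.DataFrame({
    "customer_id": [1, 2, 1],
    "amount": [120.0, 55.0, 980.0],
})
demographics = pd.DataFrame({
    "customer_id": [1, 2],
    "age": [34, 51],
    "region": ["north", "south"],
})

# Join the transactions table with the customer demographics table
merged = transactions.merge(demographics, on="customer_id", how="left")
print(merged)
```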
Which Data Needs to Be Fetched?
The output of a prediction exercise is only as good as the inputs that go into the model. The key to getting the input right is a better understanding of the drivers/characteristics of what we are trying to predict—in our case, the characteristics of a fraudulent transaction.
Here is where a data scientist typically dons the hat of a management consultant. They research the factors that might be driving the event they are trying to predict. They could do that by reaching out to the people who are working in the front line—for example, the fraud risk investigators who are manually reviewing the transactions—to understand the key factors that they look at while investigating a transaction.
Pre-processing the Data
- Missing values in data: Missing values exist when a variable (data point) is not recorded or when joins across different tables result in a nonexistent value. Missing values can be imputed in a few ways. The simplest is to replace the missing value with the average/median of the column. Another way is to impute with some intelligence based on the rest of the variables available for a transaction—for example, by identifying the k-nearest neighbors (more on this in Chapter 13).
- Outliers in data: Outliers in the input variables result in inefficient optimization in regression-based techniques (Chapter 2 talks more about the effect of outliers). Typically, outliers are handled by capping variables at a certain percentile value (95%, for example).
- Transformation of variables: The available variable transformations are as follows (a short sketch of these pre-processing steps follows the list):
  - Scaling a variable: Scaling a variable in techniques based on gradient descent generally results in faster optimization.
  - Log/squared transformation: A log or squared transformation comes in handy when the input variable has a non-linear relation with the dependent variable.
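A minimal sketch of the pre-processing steps listed above, using pandas/NumPy on a hypothetical amount column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"amount": [10.0, 12.0, np.nan, 9.0, 500.0]})  # hypothetical column

# Missing values: impute with the median of the column
df["amount"] = df["amount"].fillna(df["amount"].median())

# Outliers: cap the variable at the 95th percentile
cap = df["amount"].quantile(0.95)
df["amount"] = df["amount"].clip(upper=cap)

# Scaling: bring the variable into a 0-1 range (helps gradient-descent-based techniques)
df["amount_scaled"] = (df["amount"] - df["amount"].min()) / (df["amount"].max() - df["amount"].min())

# Log transformation: useful when the input has a non-linear relation with the target
df["amount_log"] = np.log1p(df["amount"])
print(df)
```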
Feature Interaction
Consider a scenario where the chances of a person’s survival on the Titanic are high when the person is both male and young. A typical regression-based technique would not take such a feature interaction into account, whereas a tree-based technique would. Feature interaction is the process of creating new variables based on a combination of variables. Note that, more often than not, feature interactions are discovered by understanding the business (the event that we are trying to predict) better.
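A sketch of creating such an interaction feature in pandas, using hypothetical Titanic-style Sex and Age columns:

```python
import pandas as pd

# Hypothetical Titanic-style data
df = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Age": [8, 30, 45, 12],
})

# Interaction feature: a new variable built from a combination of two variables
df["male_and_young"] = ((df["Sex"] == "male") & (df["Age"] < 15)).astype(int)
print(df)
```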
Feature Generation
Feature generation is the process of deriving additional features from the dataset. For example, a feature for predicting a fraudulent transaction could be the time since the last transaction for a given transaction. Such features are not available straight away but can only be derived by understanding the problem we are trying to solve.
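A sketch of deriving such a feature in pandas, assuming hypothetical customer_id and transaction_time columns:

```python
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 1, 2],
    "transaction_time": pd.to_datetime([
        "2017-01-01 10:00", "2017-01-01 10:05",
        "2017-01-01 11:00", "2017-01-02 09:00", "2017-01-03 11:00",
    ]),
})

# Time since the previous transaction of the same customer (NaT for a customer's first transaction)
transactions = transactions.sort_values(["customer_id", "transaction_time"])
transactions["time_since_last"] = (
    transactions.groupby("customer_id")["transaction_time"].diff()
)
print(transactions)
```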
Building the Models
Once the data is in place and the pre-processing steps are done, building a predictive model is the next step. Multiple machine learning techniques can be helpful in building a predictive model. Details on the major machine learning techniques are explored in the rest of the chapters.
Productionalizing the Models
Once the final model is in place, how it is productionalized varies depending on the use case. For example, a data scientist can do an offline analysis of a customer’s historical purchases and come up with a list of products to be sent as recommendations over email, customized for that customer. In another scenario, online recommendation systems work on a real-time basis, and the data scientist might have to hand the model over to a data engineer, who then implements it in production to generate recommendations in real time.
Build, Deploy, Test, and Iterate
In general, building a model is not a one-time exercise. You need to show the value of moving from the prior process to the new one. In such a scenario, you typically follow an A/B testing (test/control) approach, where the model is deployed for only a small share of all possible transactions/customers. The two groups are then compared to see whether deploying the model has indeed improved the metric the business is interested in achieving. Once the model shows promise, it is expanded to cover more transactions/customers. Once there is consensus that the model is promising, it is accepted as the final solution; otherwise, the data scientist iterates again using what was learned from the previous A/B testing experiment.
Summary
In this chapter, we looked into the basic terminology of machine learning. We also discussed the various error measures you can use in evaluating a model. And we talked about the various steps involved in leveraging machine learning algorithms to solve a business problem.