Manohar Swamynathan

Mastering Machine Learning with Python in Six Steps

A Practical Implementation Guide to Predictive Data Analytics Using Python

Manohar Swamynathan

Bangalore, Karnataka, India

Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub via the book’s product page, located at www.apress.com/978-1-4842-2865-4. For more detailed information, please visit http://www.apress.com/source-code.

ISBN 978-1-4842-2865-4

e-ISBN 978-1-4842-2866-1

DOI 10.1007/978-1-4842-2866-1

Library of Congress Control Number: 2017943522

© Manohar Swamynathan 2017

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-sbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

Introduction

This book is your practical guide for going from novice to master in machine learning with Python in six steps. The six-step path is based on the “six degrees of separation” theory, which states that everyone and everything is a maximum of six steps away. Note that the theory deals with the quality of connections, rather than their existence. Great effort has therefore gone into designing six notable yet simple steps that move gradually from fundamentals to advanced topics, helping a beginner progress from little or no knowledge of machine learning in Python all the way to becoming a master practitioner. This book is also helpful for current machine learning practitioners who want to learn advanced topics such as hyperparameter tuning, various ensemble techniques, Natural Language Processing (NLP), deep learning, and the basics of reinforcement learning. See Figure 1.

Figure 1. Mastering Python Machine Learning: In Six Steps

Each topic has two parts: the first covers the theoretical concepts and the second covers practical implementation with different Python packages. The traditional math-first approach to machine learning, that is, learning all the mathematics and then understanding how to implement it to solve problems, takes a great deal of time and effort, which has proven inefficient for working professionals looking to switch careers. Hence, the focus in this book is on simplification: the theory and math behind the algorithms are covered only to the extent required to get you started.

I recommend that you work with the book instead of just reading it. Real learning happens only through active participation. Hence, all the code presented in the book is available in the form of IPython notebooks so that you can try out the examples yourself and extend them to suit your needs or interests later.

Who This Book Is For

This book will serve as a great resource for learning machine learning concepts and implementation techniques for the following:

  • Python developers or data engineers looking to expand their knowledge or career into the machine learning area.

  • Current non-Python (R, SAS, SPSS, Matlab, or any other language) machine learning practitioners looking to expand their implementation skills in Python.

  • Novice machine learning practitioners looking to learn advanced topics such as hyperparameter tuning, various ensemble techniques, Natural Language Processing (NLP), deep learning, and the basics of reinforcement learning.

What You Will Learn

Chapter 1, Step 1 - Getting Started in Python. This chapter will help you set up the environment and introduce you to the key concepts of the Python programming language that are relevant to machine learning. If you are already well versed in Python basics, I recommend you glance through the chapter quickly and move on to the next chapter.

Chapter 2, Step 2 - Introduction to Machine Learning. Here you will learn about the history, evolution, and different frameworks in practice for building machine learning systems. I think this understanding is very important, as it will give you a broader perspective and set the stage for the rest of your journey. You’ll understand the different types of machine learning (supervised, unsupervised, and reinforcement learning). You will also learn the various concepts involved in the core data analysis packages (NumPy, Pandas, Matplotlib), with example code.

Chapter 3, Step 3 - Fundamentals of Machine Learning. This chapter will expose you to the fundamental concepts involved in feature engineering, supervised learning (linear regression, nonlinear regression, logistic regression, time-series forecasting, and classification algorithms), and unsupervised learning (clustering techniques and dimension reduction techniques) with the help of the scikit-learn and statsmodels packages.

Chapter 4, Step 4 - Model Diagnosis and Tuning. In this chapter you’ll learn advanced topics in model diagnosis, covering the common problems that arise and the tuning techniques used to overcome them to build efficient models. The topics include choosing the correct probability cutoff, handling an imbalanced dataset, and addressing variance and bias issues. You’ll also learn tuning techniques such as ensemble models and hyperparameter tuning using grid search and random search.

Chapter 5, Step 5 - Text Mining and Recommender Systems. Statistics suggest that 70% of the data available in the business world is in the form of text, so text mining has vast scope across various domains. You will learn everything from the building blocks and basic concepts to advanced NLP techniques. You’ll also learn about recommender systems, which are most commonly used to create personalization for customers.

Chapter 6, Step 6 - Deep and Reinforcement Learning. There has been great advancement in the area of Artificial Neural Networks (ANN) through deep learning techniques, and it has been the buzzword in recent times. You’ll learn various aspects of deep learning such as multilayer perceptrons, Convolution Neural Network (CNN) for image classification, Recurrent Neural Network (RNN) for text classification, and transfer learning. You’ll also work through a Q-learning example to understand the concept of reinforcement learning.

Chapter 7, Conclusion. This chapter summarizes your six-step learning path and offers quick tips to remember when starting on real-world machine learning problems.

Acknowledgments

I’m grateful to my mom, dad, and loving brother; I thank my wife Usha and son Jivin for providing me the space for writing this book.

I would like to express my gratitude to my mentors, colleagues, and friends from current/previous organizations for their inputs, inspiration, and support. Thanks to Jojo for the encouragement to write this book and his technical review inputs. Big thanks to the Apress team for their constant support and help.

Finally, I would like to thank you, the reader, for showing an interest in this book; I sincerely hope it helps you in your machine learning quest.

Note that the views expressed in this book are the author’s own.

Contents

  1. Chapter 1: Step 1 – Getting Started in Python
    1. The Best Things in Life Are Free
    2. The Rising Star
    3. Python 2.7.x or Python 3.4.x?
      1. Windows Installation
      2. OSX Installation
      3. Linux Installation
      4. Python from Official Website
      5. Running Python
    4. Key Concepts
      1. Python Identifiers
      2. Keywords
      3. My First Python Program
      4. Code Blocks (Indentation & Suites)
      5. Basic Object Types
      6. When to Use List vs. Tuples vs. Set vs. Dictionary
      7. Comments in Python
      8. Multiline Statement
      9. Basic Operators
      10. Control Structure
      11. Lists
      12. Tuple
      13. Sets
      14. Dictionary
      15. User-Defined Functions
      16. Module
      17. File Input/Output
      18. Exception Handling
    5. Endnotes
  2. Chapter 2: Step 2 – Introduction to Machine Learning
    1. History and Evolution
    2. Artificial Intelligence Evolution
    3. Different Forms
      1. Statistics
      2. Data Mining
      3. Data Analytics
      4. Data Science
      5. Statistics vs. Data Mining vs. Data Analytics vs. Data Science
    4. Machine Learning Categories
      1. Supervised Learning
      2. Unsupervised Learning
      3. Reinforcement Learning
    5. Frameworks for Building Machine Learning Systems
      1. Knowledge Discovery in Databases (KDD)
      2. Cross-Industry Standard Process for Data Mining
      3. SEMMA (Sample, Explore, Modify, Model, Assess)
      4. KDD vs. CRISP-DM vs. SEMMA
    6. Machine Learning Python Packages
    7. Data Analysis Packages
      1. NumPy
      2. Pandas
      3. Matplotlib
    8. Machine Learning Core Libraries
    9. Endnotes
  3. Chapter 3: Step 3 – Fundamentals of Machine Learning
    1. Machine Learning Perspective of Data
    2. Scales of Measurement
      1. Nominal Scale of Measurement
      2. Ordinal Scale of Measurement
      3. Interval Scale of Measurement
      4. Ratio Scale of Measurement
    3. Feature Engineering
      1. Dealing with Missing Data
      2. Handling Categorical Data
      3. Normalizing Data
      4. Feature Construction or Generation
    4. Exploratory Data Analysis (EDA)
      1. Univariate Analysis
      2. Multivariate Analysis
    5. Supervised Learning – Regression
      1. Correlation and Causation
      2. Fitting a Slope
      3. How Good Is Your Model?
      4. Polynomial Regression
      5. Multivariate Regression
      6. Multicollinearity and Variance Inflation Factor (VIF)
      7. Interpreting the OLS Regression Results
      8. Regression Diagnosis
      9. Regularization
      10. Nonlinear Regression
    6. Supervised Learning – Classification
      1. Logistic Regression
      2. Evaluating a Classification Model Performance
      3. ROC Curve
      4. Fitting Line
      5. Stochastic Gradient Descent
      6. Regularization
      7. Multiclass Logistic Regression
      8. Generalized Linear Models
      9. Supervised Learning – Process Flow
      10. Decision Trees
      11. Support Vector Machine (SVM)
      12. k Nearest Neighbors (kNN)
      13. Time-Series Forecasting
    7. Unsupervised Learning Process Flow
      1. Clustering
      2. K-means
      3. Finding Value of k
      4. Hierarchical Clustering
      5. Principal Component Analysis (PCA)
    8. Endnotes
  4. Chapter 4: Step 4 – Model Diagnosis and Tuning
    1. Optimal Probability Cutoff Point
      1. Which Error Is Costly?
    2. Rare Event or Imbalanced Dataset
      1. Known Disadvantages
      2. Which Resampling Technique Is the Best?
    3. Bias and Variance
      1. Bias
      2. Variance
    4. K-Fold Cross-Validation
    5. Stratified K-Fold Cross-Validation
    6. Ensemble Methods
    7. Bagging
      1. Feature Importance
      2. RandomForest
      3. Extremely Randomized Trees (ExtraTree)
      4. How Does the Decision Boundary Look?
      5. Bagging – Essential Tuning Parameters
    8. Boosting
      1. Example Illustration for AdaBoost
      2. Gradient Boosting
      3. Boosting – Essential Tuning Parameters
      4. Xgboost (eXtreme Gradient Boosting)
    9. Ensemble Voting – Machine Learning’s Biggest Heroes United
      1. Hard Voting vs. Soft Voting
    10. Stacking
    11. Hyperparameter Tuning
      1. GridSearch
      2. RandomSearch
    12. Endnotes
  5. Chapter 5: Step 5 – Text Mining and Recommender Systems
    1. Text Mining Process Overview
    2. Data Assemble (Text)
      1. Social Media
      2. Step 1 – Get Access Key (One-Time Activity)
      3. Step 2 – Fetching Tweets
    3. Data Preprocessing (Text)
      1. Convert to Lower Case and Tokenize
      2. Removing Noise
      3. Part of Speech (PoS) Tagging
      4. Stemming
      5. Lemmatization
      6. N-grams
      7. Bag of Words (BoW)
      8. Term Frequency-Inverse Document Frequency (TF-IDF)
    4. Data Exploration (Text)
      1. Frequency Chart
      2. Word Cloud
      3. Lexical Dispersion Plot
      4. Co-occurrence Matrix
    5. Model Building
    6. Text Similarity
    7. Text Clustering
      1. Latent Semantic Analysis (LSA)
    8. Topic Modeling
      1. Latent Dirichlet Allocation (LDA)
      2. Non-negative Matrix Factorization
    9. Text Classification
    10. Sentiment Analysis
    11. Deep Natural Language Processing (DNLP)
    12. Recommender Systems
      1. Content-Based Filtering
      2. Collaborative Filtering (CF)
    13. Endnotes
  6. Chapter 6: Step 6 – Deep and Reinforcement Learning
    1. Artificial Neural Network (ANN)
    2. What Goes Behind, When Computers Look at an Image?
    3. Why Not a Simple Classification Model for Images?
    4. Perceptron – Single Artificial Neuron
    5. Multilayer Perceptrons (Feedforward Neural Network)
      1. Load MNIST Data
      2. Key Parameters for scikit-learn MLP
    6. Restricted Boltzmann Machines (RBM)
    7. MLP Using Keras
    8. Autoencoders
      1. Dimension Reduction Using Autoencoder
      2. De-noise Image Using Autoencoder
    9. Convolution Neural Network (CNN)
      1. CNN on CIFAR10 Dataset
      2. CNN on MNIST Dataset
    10. Recurrent Neural Network (RNN)
      1. Long Short-Term Memory (LSTM)
    11. Transfer Learning
    12. Reinforcement Learning
    13. Endnotes
  7. Chapter 7: Conclusion
    1. Summary
    2. Tips
      1. Start with Questions/Hypothesis Then Move to Data!
      2. Don’t Reinvent the Wheels from Scratch
      3. Start with Simple Models
      4. Focus on Feature Engineering
      5. Beware of Common ML Imposters
    3. Happy Machine Learning
  8. Index

About the Author and About the Technical Reviewer

About the Author

Manohar Swamynathan is a data science practitioner and an avid programmer, with over 13 years of experience in various data science-related areas that include data warehousing, Business Intelligence (BI), analytical tool development, ad hoc analysis, predictive modeling, data science product development, consulting, formulating strategy, and executing analytics programs.

He’s had a career covering life cycles of data across different domains such as U.S. mortgage banking, retail, insurance, and industrial IoT. He has a bachelor’s degree with specialization in physics, mathematics, and computers, and a master’s degree in project management. He’s currently living in Bengaluru, the Silicon Valley of India, working as a Staff Data Scientist with General Electric Digital, contributing to the next big digital industrial revolution.

You can visit him at http://www.mswamynathan.com to learn more about his various other activities.

About the Technical Reviewer

Jojo Moolayil is a Data Scientist and the author of the book Smarter Decisions – The Intersection of Internet of Things and Decision Science. With over 4 years of industrial experience in Data Science, Decision Science, and IoT, he has worked with industry leaders on high-impact and critical projects across multiple verticals. He is currently associated with General Electric, the pioneer and leader in data science for Industrial IoT, and lives in Bengaluru, the Silicon Valley of India.

He was born and raised in Pune, India, and graduated from the University of Pune with a major in Information Technology Engineering. He started his career with Mu Sigma Inc., the world's largest pure-play analytics provider, and worked with the leaders of many Fortune 50 clients. One of the early enthusiasts to venture into IoT analytics, he brought problem-solving frameworks and his learnings from data and decision science to IoT analytics.

To cement his foundations in data science for industrial IoT and scale the impact of his problem-solving experiments, he joined a fast-growing IoT analytics startup called Flutura, based in Bangalore and headquartered in the valley. After a short stint with Flutura, Jojo moved on to work with the leaders of Industrial IoT, General Electric, in Bangalore, where he focused on solving decision science problems for Industrial IoT use cases. As a part of his role at GE, Jojo also focuses on developing data science and decision science products and platforms for Industrial IoT.

Apart from authoring books on Decision Science and IoT, Jojo has also been a Technical Reviewer for various books on Machine Learning, Deep Learning, and Business Analytics with Apress. He is an active Data Science tutor and maintains a blog at http://www.jojomoolayil.com/web/blog/.

Profile

http://www.jojomoolayil.com/

https://www.linkedin.com/in/jojo62000

I would like to thank my family, friends and mentors.

—Jojo Moolayil