Chapter 14. Automation and Scaling

You’ve scraped large amounts of data from APIs and websites, you’ve cleaned and organized your data, and you’ve run statistical analysis and produced visual reports. Now it’s time to let Python take the wheel and automate your data wrangling. In this chapter, we’ll cover how to automate your data analysis, collection, and publication. We will learn how to create proper logging and alerting so you can fully automate your scripts and get notifications of success, failure, and any issues your work encounters along the way.

We will also take a look at scaling your automation using Python libraries designed to help you execute many tasks and monitor their success and failure. Finally, we’ll review some libraries and helper tools for fully scaling your data processing in the cloud.

Python gives us plenty of options for automation and scaling. There are some simple, straightforward tasks that lend themselves to Python automation on almost any machine without much setup, and there are some larger, more complex ways to automate. We’ll cover examples of both, as well as how to scale your data automation as a data wrangler.

Why Automate?

Automation gives you a way to easily run your scripts without needing to do so on your local machine—or even be awake! The ability to automate means you can spend time working on other more thought-intensive projects. If you have a well-written script to perform data cleanup for you, you can focus on working with the data to produce better reporting.

Here are some great examples of tasks where automation can help:

  • Every Tuesday a new set of analytics comes out; you compile a report and send it to the interested parties.

  • Another department or coworker needs to be able to run your reporting tool or cleanup tool without your guidance and support.

  • Once a week, you have to download, clean, and send data.

  • Every time a user requests a new report, the reporting script should run and alert the user once the report is generated.

  • Once a week, you need to clean erroneous data from the database and back it up to another location.

Each of these problems has myriad solutions, but one thing is certain: they are good tasks to automate. They are clear in their outcomes and steps. They have a limited but specific audience. They have a certain time or event that sets them into motion. And they are all things you can script and run when the particular circumstances apply.

Automation is easiest when the task is clear and well defined and the outcomes are easy to determine. However, even if the outcome is not always easy to test or predict, automation can help complete a part of a task and leave the rest for your (or someone else’s) closer inspection and analysis. You can think of automation here similarly to the ways you automate other things in your life. You might have a favorite saved pizza order or an auto-reply on your email. If a task has a fairly clear outcome and occurs regularly, then it is a good task to automate.

But when should you not automate? Here are some criteria to indicate if a task isn’t a good candidate for automation:

  • The task occurs so rarely and is so complex, it’s better to do it yourself (e.g., filing your taxes).

  • A successful outcome for the task is difficult to determine (e.g., group discussion, social research, or investigation).

  • The task requires human interaction to determine the proper way to complete it (e.g., navigating traffic, translating poetry).

  • It is imperative the task succeeds.

Some of these examples—particularly things that require human input—are ripe for some level of automation. Some we can partially automate by allowing machines to find recommendations, which we can then determine are right or wrong (machine learning with human feedback). Others, like when a task is rare and complex or is business critical, might end up becoming automated, or partially automated, as they become familiar. But you can see the overall logic to guide when automation fits best and when it’s not a good idea.

If you’re not sure automation is right for you, you can always try automating something small you do on a regular interval and see how it works. Chances are you’ll find more applicable solutions over time, and the experience of automating one thing will make it easier to automate more things in the future.

Steps to Automate

Because automation begins with a clear and simple focus, your steps to automate should also be clear and simple. It is particularly helpful to begin automation by documenting the following (in a list, on a whiteboard, in drawings, in a storyboard):

  • When must this task begin?

  • Does this task have a time limit or maximum length? If so, when must it end?

  • What are the necessary inputs for this task?

  • What constitutes success, or partial success, for this task?

  • If this task fails, what should happen?

  • What does the task produce or provide? To whom? In what way?

  • What, if anything, should happen after this task concludes?

If you can answer five or more of these questions, you are in a good place. If you can’t, it might be worth doing some more research and clarification before you begin. If you are asked to automate something you have never done before, or haven’t done often, try documenting it as you perform the task and then determine if you can answer the questions listed here.

Tip

If your project is too large or vague, try breaking it up into smaller tasks and automating a few of those tasks. Perhaps your task involves a report that requires downloading two datasets, running cleanup and analysis, and then sending the results to different groups depending on the outcome. You can break this task into subtasks, automating each step. If any of these subtasks fail, stop the chain and alert the person(s) responsible for maintaining the script so it can be investigated and restarted after the bug or issue is resolved.
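
As a rough illustration, a chain like the one described in this tip might be wired together as follows. The function names here are hypothetical placeholders for your own subtasks:

def download_datasets():
    pass  # placeholder subtask: fetch the datasets


def clean_and_analyze():
    pass  # placeholder subtask: run cleanup and analysis


def send_results():
    pass  # placeholder subtask: email or publish the results


def alert_maintainer(error):
    # placeholder alert: swap in email, chat, or logging as needed
    print('Subtask failed: %s' % error)


def run_chain():
    for subtask in (download_datasets, clean_and_analyze, send_results):
        try:
            subtask()
        except Exception as error:
            alert_maintainer(error)
            break  # stop the chain so later steps don't run on bad data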

So, our basic steps for automation are as follows (note that these will vary depending on the types of tasks you are completing):

  1. Define your problem set and break it into smaller chunks of work.

  2. Describe exactly what each subtask needs as input, what it needs to do, and what it needs in order to be marked complete.

  3. Identify where you can get those inputs and when the tasks need to run.

  4. Begin coding your task and testing with real or example data.

  5. Clean up your task and your script, and add documentation.

  6. Add logging, with a focus on debugging errors and recording successful completion (a short sketch follows this list).

  7. Submit your code to a repository and test it manually. Make changes as needed.

  8. Get your script ready for automation by replacing manual tasks with automated ones.

  9. Watch your logs and alerts as the task starts running automatically. Correct any errors or bugs. Update your testing and documentation.

  10. Develop a long-term plan for how often the logs are checked for errors.
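
For step 6, a minimal logging setup using Python's built-in logging module might look like the following sketch; the log filename, messages, and the clean_data stub are all placeholders:

import logging

logging.basicConfig(
    filename='my_task.log',                      # placeholder log file path
    level=logging.DEBUG,
    format='%(asctime)s %(levelname)s %(message)s')


def clean_data():
    pass  # placeholder for the real cleanup work


logging.info('starting the cleanup subtask')
try:
    clean_data()
except Exception:
    logging.exception('cleanup subtask failed')  # records the full traceback
else:
    logging.info('cleanup subtask completed successfully')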

The first step toward automation is always to better define your tasks and subtasks, breaking them into chunks small enough that they can easily be completed and their success or failure determined.

The next few steps align well with our process throughout this book. You should identify how you can begin to solve the problem with Python. Search for libraries or tools to help fix the problem or complete the request, and begin coding. Once your script is working, you’ll want to test it with a few different possible datasets or inputs. After successful testing, you’ll want to simplify and document it. You will likely set it up in a repository (on Bitbucket or GitHub) so you can document changes and additions over time.

Note

Once you have a completed script, first run it by hand (rather than the automated way). When the new data arrives or the time comes to run it, do so manually and keep watch over its output. There might be unforeseen errors or extra logging and debugging you’ll need to add.

Depending on what type of automation fits your needs, you might set up a simple cron task where the script is executed at certain intervals. (You’ll learn all about cron later in this chapter.) You might need to slightly modify the script so it has the ability to run autonomously by using argument variables, databases, or particular files on the system. You might add it to a task queue to manage when it runs. Whichever fits, your job is not yet over.

Warning

When your script is first automated, it’s essential you take time to review it every time it runs. Look through your logs and monitor what is happening. You will likely find small bugs, which you can then fix. Again, refresh any necessary logging and documentation.

After about five successes or properly logged failures, you can likely scale back your manual review. However, it’s still a great idea to grep your logs monthly or quarterly and see what’s happening. If you are using a log aggregator, you can actually automate this step and have the task send you error and warning reports. How meta is that?

Automation is no small process, but an early investment in time and attention will pay dividends. A well-running set of automation tasks takes time to complete, but the result is often far better than haphazard scripts requiring constant attention, care, and monitoring. Pay close attention now and take time to automate your script the right way. Only then can you really move on to whatever is next at hand, rather than constantly having one part of your work tied to monitoring and administering support for a few unruly tasks.

What Could Go Wrong?

There are quite a few things that can go wrong with your automation. Some of them are easy to correct and account for, while others are more nebulous and might never have a true fix. One of the important lessons in automation is figuring out which types of errors and issues are worth taking the time and energy to fix and which ones are better to simply plan for and work around.

Let’s take, for example, the types of errors we talked about in Chapter 12: our network errors for web scraping. If you are running into significant network errors, you have only a few good options. You can change who hosts your tasks and see if the performance improves (which may be costly and time consuming, depending on your setup). You can call your network provider and ask for support. You can run the tasks at a different time and see if there is a different outcome. You can expect the problems to happen and build your script around these expectations (i.e., run more than you need and expect some percentage to fail).

There are many possible errors you will encounter when running your tasks by automation:

  • Database connection errors leading to lost or bad data

  • Script bugs and errors where the script does not properly complete

  • Timeout errors or too many request errors from websites or APIs

  • Edge cases, where the data or parts of the reporting don’t conform and break the script

  • Server load issues or other hardware problems

  • Poor timing, race conditions (if scripts depend on previous completion of other tasks, race conditions can invalidate the data)

Warning

There are naturally far more potential issues than you can anticipate. The larger the team you work with, the greater the chance that poor documentation, poor understanding, and poor team communication will hurt your automation. You will not be able to prevent every error, but you can minimize them with the best communication and documentation you can provide. Still, you will also need to accept that your automation will sometimes fail.

To prepare for eventual failure, you will want to be alerted when issues arise. You should determine what percentage of error is acceptable. Not every service performs well 100% of the time (hence the existence of status pages); however, we can strive for perfection and determine how many hours and how much effort our automation is worth.

Depending on your automation and its weaknesses, there are some ways to combat those issues. Here are some ways to build a more resilient automation system:

  • Retry failed tasks at a specific interval (a small retry sketch follows this list).

  • Ensure your code has numerous try...except blocks allowing it to work through failures.

  • Build special exception blocks around code handling connections to other machines, databases, or APIs.

  • Regularly maintain and monitor machines you use for your automation.

  • Test your tasks and automation on a regular basis using test data and ensure they run properly.

  • Make yourself aware of dependencies, race conditions, and API rules in your script’s domain and write code according to this knowledge.

  • Utilize libraries like requests and multiprocessing to make difficult problems easier and attempt to take some of the mystery out of problems that plague many scripts.
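
As one example of the first two items, a small retry helper built around try...except might look like this sketch; the retry count, wait time, and fragile_task stub are all illustrative:

import time


def retry(task, retries=3, wait_seconds=60):
    # call task(); on failure, wait and try again a limited number of times
    for attempt in range(retries):
        try:
            return task()
        except Exception:
            if attempt == retries - 1:
                raise                    # out of retries; let the failure surface
            time.sleep(wait_seconds)     # wait before the next attempt


def fragile_task():
    pass  # placeholder for a network call or database query that sometimes fails


retry(fragile_task)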

We’ll be reviewing some of these techniques and ideas as we walk through how to best go about monitoring and automating your scripts. For now, let’s move on to tools we can use for automation to make our lives as data wranglers easier and simpler and determine a few tips on where and how you should implement these tools.

Where to Automate

Depending on the needs of your script, deciding where it runs will be an important first step. No matter where it first runs, you can move it elsewhere, but this will likely require some rewriting. At the beginning, you will probably need it to run locally. To run a script or task locally is to run it on your own computer.

To run something remotely means to run it on another machine—likely a server somewhere. Once your script succeeds and is well tested, you will want to move it to run remotely. If you manage or have servers, or work for an organization with servers, it can be relatively easy to port your scripts to those servers. This allows you to work on your own machine (laptop or desktop) and not worry about when you turn it off and on. Running your scripts remotely also means you are not dependent on your ISP.

If you don’t have access to a server, but you have an old desktop or laptop you don’t use anymore, you can essentially turn it into your server. If it’s running an old operating system, you can upgrade it so you can properly run Python on it, or you can wipe it and install Linux.

Tip

Using a home computer as your remote device means it should always be turned on and plugged into your home Internet. If you’d like to also install an OS you haven’t used before, like Linux, this is an easy way to learn a new operating system and can help transition you to managing your own servers. If you’re just getting started with Linux, we recommend you choose one of the popular distributions, such as Ubuntu or LinuxMint.

If you’d like to manage your own server but you’re just getting started, don’t panic! Even if you’ve never managed or helped manage a server, increased competition among cloud service providers has made it a lot easier. Cloud providers allow you to spin up new machines and run your own server without needing a lot of technical knowledge. One such provider, DigitalOcean, has several nice writeups on how to get started, including introductions to creating your first server and getting your server set up.

Whether you host your scripts locally or remotely, there are a variety of tools to keep your computer or your server well monitored and updated. You’ll want to ensure your scripts and tasks are fairly easy to manage and update, and that they run to completion on a regular basis. Finally, you’ll want to be able to configure them and document them easily. We’ll cover all of those topics in the following sections, starting off with Python tools you can use to help make your scripts more automation-friendly.

Special Tools for Automation

Python gives us many special tools for automation. We’ll take a look at some of the ways we can manage our automation using Python, as well as using other machines and servers to do our bidding. We’ll also discuss how we can use some built-in Python tools to manage inputs for our scripts and automate things that seem to require human input.

Using Local Files, argv, and Config Files

Depending on how your script works, you may need arguments or input that cannot always or shouldn’t always be in a database or an API. When you have a simple input or output, you can use local files and arguments to pass the data.

Local files

When using local files for input and output, you’ll want to ensure the script can run on the same machine every day, or can be easily moved with the input and output files. As your script grows, it’s possible you will move and change it along with the files you use.

We’ve used local files before, but let’s review how to do so from a more functional code standpoint. This code gives you the ability to open and write files using standard data types, and is very reusable and expandable depending on your script’s needs:

from csv import reader, writer


def read_local_file(file_name):
    if '.csv' in file_name: 1
        rdr = reader(open(file_name, 'rb'))
        return rdr
    return open(file_name, 'rb') 2


def write_local_file(file_name, data):
    with open(file_name, 'wb') as open_file: 3
        if type(data) is list: 4
            wr = writer(open_file)
            for line in data:
                wr.writerow(line)
        else:
            open_file.write(data) 5
1

This line tests whether the file might be a good candidate to open with the csv module. If it ends in .csv, then it’s likely we might want to open it using our CSV reader.

2

If we haven’t returned with our CSV reader, this code returns the open file. If we wanted to build a series of different ways to open and parse files based on the file extension, we could do that as well (e.g., using the json module for JSON files, or pdfminer for PDFs).

3

This code uses with...as to return the output of the open function, assigning it to the open_file variable. When the indented block ends, Python will close the file automatically.

4

If we are dealing with a list, this line uses the CSV writer to write each list item as a row of data. If we have dictionaries, we might want to use the DictWriter class.

5

We want a good backup plan in case the data is not a list, so we write the raw data directly to the file. Alternatively, we could write different code for each data type we expect.

Let’s look at an example where we need the most recent file in a directory, which is often useful if you need to parse log files going back in time or look at the results of a recent web spider run:

import os

def get_latest(folder):
    files = [os.path.join(folder, f) for f in os.listdir(folder)] 1
    files.sort(key=lambda x: os.path.getmtime(x), reverse=True) 2
    return files[0] 3
1

Uses Python’s built-in os module to list each file (listdir method), then uses the path module’s join method to make a long string representing a full file path. This is an easy way to get a list of all of the files in a folder just by passing a string (the folder’s path).

2

Sorts files by last-modified date. Because files is a list, we can call the sort method and give it a key on which to sort. This code passes the full file paths to getmtime, which is the os module’s “get modified time” method. The reverse argument makes sure the more recent files are on the top of the list.

3

Returns only the most recent file.

This code returns the most recent file, but if we wanted to return the whole list of files starting with the most recent, we could simply modify the code to return the whole list or a slice instead of only the first index.
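
Here's how that modification might look, with a new (illustrative) count argument controlling how many of the most recent files to return:

import os


def get_latest(folder, count=1):
    files = [os.path.join(folder, f) for f in os.listdir(folder)]
    files.sort(key=lambda x: os.path.getmtime(x), reverse=True)
    return files[:count]  # a slice: the `count` most recently modified files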

Tip

There are many powerful tools to look up, modify, and alter files on your local (or your server’s local) machine using the os library. A simple search on Stack Overflow returns educated answers as to how to find the only file modified in the last seven days or the only .csv file modified in the last month, and so on. Using local files, particularly when the data you need is already there (or easily put there with a wget), is a great way to simplify your automation.
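
For instance, a sketch of the "CSV files modified in the last seven days" type of lookup mentioned in this tip might look like the following; the folder path is whatever you choose to pass in:

import os
import time


def csvs_from_last_week(folder):
    week_ago = time.time() - 7 * 24 * 60 * 60   # seven days, in seconds
    return [os.path.join(folder, f) for f in os.listdir(folder)
            if f.endswith('.csv')
            and os.path.getmtime(os.path.join(folder, f)) > week_ago]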

Config files

Setting up local config files for your sensitive information is a must. As asserted in the Twelve-Factor App, storing your configuration (such as passwords, logins, email addresses, and other sensitive information) outside of your code base is part of being a good developer. If you connect to a database, send an email, use an API, or store payment information, that sensitive data should be stored in a configuration file.

Usually, we store config files in a separate folder within the repository (e.g., config/). All the code in the repository has access to these files, but by using .gitignore files, we can keep the configuration out of version control. If other developers or servers need those files, you should copy them over manually.

Tip

We recommend having a section of the repository’s README.md cover where and how to get hold of special configuration files so new users and collaborators know who to ask for the proper files.

Using a folder rather than one file allows you to have different configurations depending on what machine or environment the script runs in. You might want to have one configuration file for the test environment with test API keys, and a production file. You might have more than one database depending on what machine the script uses. You can store these specific pieces of information using a .cfg file, like the following example:

# Example configuration file
[address] 1
name = foo 2
email = myemail@bar.com
postalcode = 10177
street = Schlangestr. 4
city = Berlin
telephone = 015745738292950383

[auth_login]
user = test@mysite.com
pass = goodpassword

[db]
name = my_awesome_db
user = script_user
password = 7CH+89053FJKwjker)
host = my.host.io

[email]
user = script.email@gmail.com
password = 788Fksjelwi&
1

Each section is denoted by square brackets with an easy-to-read string inside of them.

2

Each line contains a key = value pair. The ConfigParser interprets these as strings. Values can contain any characters, including special characters, but keys should follow PEP-8 easy-to-read syntax and structure.

Having sections, keys, and values for our configuration lets us use the names of the sections and keys to access configuration values. This adds clarity to our Python scripts, without being insecure. Once you have a config file like the previous example set up, it’s quite easy to parse with Python and use in your script and automation. Here’s an example:

import ConfigParser
from some_api import get_client 1


def get_config(env):
    config = ConfigParser.ConfigParser() 2
    if env == 'PROD':
        config.read(['config/production.cfg']) 3
    elif env == 'TEST':
        config.read(['config/test.cfg'])
    else:
        config.read(['config/development.cfg']) 4
    return config


def api_login():
    config = get_config('PROD') 5
    my_client = get_client(config.get('api_login', 'user'),
                           config.get('api_login', 'auth_key')) 6
    return my_client
1

Here’s an example of an API client hook we could import.

2

This code instantiates a config object by calling the ConfigParser class. This is now an empty configuration object.

3

This line calls the configuration parser object’s read method and passes a list of configuration files. Here, we store them in a directory in the root of the project in a folder called config.

4

If the environment variable passed does not match production or testing, we will always return the development configuration. It’s a good idea to have catches like this in your configuration code, in case of a failure to define environment variables.

5

We’ll assume our example needs the production API, so this line asks for the PROD configuration. You can also save those types of decisions in the bash environment and read them using the os module’s environ mapping (see the short sketch after these annotations).

6

This line calls the section name and key name to access the values stored in the configuration. This will return the values as strings, so if you need integers or other types, you should convert them.
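
If you store that choice in a bash environment variable (here assumed to be named ENV), reading it might look like this minimal sketch, reusing the get_config function defined above:

import os

env = os.environ.get('ENV', 'DEV')  # fall back to development if ENV isn't set
config = get_config(env)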

The built-in ConfigParser library gives us easy access to our sections, keys, and values stored in our config file. If you’d like to store different pieces of information in different files and parse a list of them for each particular script, your code might look like this:

config = ConfigParser.ConfigParser()
config.read(['config/email.cfg', 'config/database.cfg', 'config/staging.cfg'])

It’s up to you to organize your code and configuration depending on your needs. The syntax to access the configuration values simply uses the section name in your config (i.e., [section_name]) and the name of the key. So, a config file like this one:

[email]
user = test@mydomain.org
pass = my_super_password

can be accessed like this:

email_addy = config.get('email', 'user')
email_pass = config.get('email', 'pass')
Tip

Config files are a simple tool for keeping all of your sensitive information in one place. If you’d rather use .yml or other extension files, Python has readers for those file types as well. Make sure you use something to keep your authentication and sensitive information stored separately from your code.

Command-line arguments

Python gives us the ability to pass command-line arguments to use for automation. These arguments pass information regarding how the script should function. For example, if we need the script to know we want it to run with the development configuration, we could run it like so:

python my_script.py DEV

We are using the same syntax to run a file from the command line, calling python, then the script name, and then adding DEV to the end of the line. How can we parse the extra argument using Python? Let’s write code that does just that:

from import_config import get_config
import sys


def main(env):
    config = get_config(env)
    print config


if __name__ == '__main__':
    if len(sys.argv) > 1: 1
        env = sys.argv[1] 2
    else:
        env = 'TEST'
    main(env) 3
1

The built-in sys module helps with system tasks, including parsing command-line arguments. If the command-line argument list returned has a length greater than 1, there are extra arguments. The first argument always holds the name of the script (so if it has a length of 1, that’s the only argument).

2

To get the value of an argument, index into the sys module’s argv list with the position of that argument. This line sets env equal to that value. Remember, the 0-index of argv will always be the Python script name, so you start parsing with the argument at the 1-index.

3

This line uses the parsed arguments to modify your code according to the command-line arguments.

Note

If we wanted to parse more than one extra variable, we could test the length to ensure we have enough, and then continue parsing. You can string together as many arguments as you’d like, but we recommend keeping it to under four. If you need more than four arguments, consider writing some of the logic into your script (e.g., on Tuesdays we only run testing, so if it’s a Tuesday, use the test section of code, etc.).
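
For example, a sketch handling two extra arguments with fallback defaults might look like this; the default values here are purely illustrative:

import sys

if __name__ == '__main__':
    if len(sys.argv) > 2:                     # script name plus two extra arguments
        env, task = sys.argv[1], sys.argv[2]
    else:
        env, task = 'TEST', 'ANALYSIS'        # illustrative defaults
    print('Running the %s task against the %s environment' % (task, env))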

Argument variables are great if you need to reuse the same code to perform different tasks or run in different environments. Maybe you have a script to run either collection or analysis, and you’d like to switch which environments you use. You might run it like so:

python my_script.py DEV ANALYSIS
python my_script.py PROD COLLECTION

Or you might have a script that needs to interact with a newly updated file folder and grab the latest logs—for example, to grab logs from more than one place:

python my_script.py DEV /var/log/apache2/
python my_script.py PROD /var/log/nginx/

With command-line arguments, simple changes in argument variables can make your automation portable and robust. Not every script will need these types of extra variables, but it’s a nice solution built into the standard Python library and provides some flexibility should you need it.

Aside from these fairly simple and straightforward ways to parse your data and to give your script extra pieces of information, you can use more sophisticated and distributed approaches like cloud data and databasing. We’ll look at these next.

Using the Cloud for Data Processing

The cloud is a term used to refer to a shared pool of resources, such as servers. There are many companies that offer cloud services—Amazon Web Services, more commonly referred to as AWS, is one of the best known.

Note

The term cloud is often overused. If you are running your code on a cloud-based server, it is better to say “I am running it on a server” rather than “I am running it in the cloud.”

When is a good time to use the cloud? The cloud is a good way to process data if the data is too large to process on your own computer or the procedure takes too long. Most tasks you want to automate you’ll want to place in the cloud so you don’t have to worry about whether the script is running or not when you turn your computer on or off.

If you choose to use AWS, the first time you log in you will see many different service offerings. There are only a few services you will need as a data wrangler (see Table 14-1).

Table 14-1. AWS cloud services

Service | Purpose in data wrangling
Simple Storage Service (S3) | A simple file storage service, used for dumping data files (JSON, XML, etc.).
Elastic Compute Cloud (EC2) | An on-demand server. This is where you run your scripts.
Elastic MapReduce (EMR) | Provides distributed data processing through a managed Hadoop framework.

Those are the basic AWS services with which to familiarize yourself. There are also several competitors, including IBM’s Bluemix and Watson Developer Cloud (giving you access to several large data platforms, including Watson’s logic and natural language processing abilities). You can also use DigitalOcean or Rackspace, which provide cheaper cloud resources.

No matter what you use, you’ll need to deploy your code to your cloud server. To do so, we recommend using Git.

Using Git to deploy Python

If you’d like to have your automation run somewhere other than your local machine, you’ll need to deploy your Python script. We will review a few simple ways to do so, and then some slightly more complex ways.

Note

Version control allows teams to work in parallel on the same repository of code without causing problems for one another. Git allows you to create different branches, thus allowing you or others on the team to work on a particular set of ideas or new integrations independently and then merge them back into the main or master branch of the code base easily and without losing any of the core functionality. It also ensures everyone has the most up-to-date code (including servers and remote machines).

The easiest and most intuitive way to deploy Python is to put your repository under version control using Git and use Git deploy hooks to “ship” code to your remote hosts. First, you’ll need to install Git.

If you’re new to Git, we recommend taking the Code School tutorial on GitHub or walking through the Git tutorials on Atlassian. It’s fairly easy to get started, and you’ll get the hang of the most used commands quickly. If you’re working on the repository by yourself, you won’t have to worry too much about pulling remote changes, but it’s always good to set a clear routine.

Once your Git installation is complete, run these commands in your project’s code folder:

git init . 1
git add my_script.py 2
git commit -a 3
1

Initializes the current working directory as the root of your Git repository.

2

Adds my_script.py to the repository. Use a filename or folder from your repository—just not your config files!

3

Commits those changes along with any other running changes (-a) to your repository.

When prompted, you will need to write a commit message giving a brief explanation of the changes you’ve made, which should be explicit and clear. You might later need to find which commits implemented certain changes in your code. If you always write clear messages, this will help you search for and find those commits. It will also help others on your team or coworkers understand your code and commits.

Tip

Get used to fetching remote changes with git fetch or using the git pull --rebase command to update your local repository with new commits. Then, work on your code, commit your work, and push your commits to your active branch. When it’s time to merge your branch with the master, you can send a pull request, have others review the merge, and then merge it directly into the master branch. Don’t forget to delete stale or old branches when they are no longer useful.

It’s also essential you set up a .gitignore file, where you list all of the file patterns you want Git to ignore when you push/pull changes, as discussed in the sidebar “Git and .gitignore”. You can have one for each folder or just one in the base folder of the repository. Most Python .gitignore files look something like this:

*.pyc
*.csv
*.log
config/*

This file will prevent compiled Python files, CSV files, log files, and config files from being stored in the repository. You’ll probably want to add more patterns, depending on what other types of files you have in your repository folders.

You can host your repository on a number of sites. GitHub offers free public repositories but no private repositories. If you need your code to be private, Bitbucket has free private repositories. If you’ve already started using Git locally, it’s easy to push your existing Git repository to GitHub or Bitbucket.

Once you have your repository set up, setting up your remote endpoints (server or servers) with Git is simple. Here is one example if you are deploying to a folder you have ssh access to:

git remote add deploy ssh://user@342.165.22.33/home/user/my_script

Before you can push your code to the server, you’ll need to set up the folder on the receiving end with a few commands. You will want to run these commands in the server folder in which you plan to deploy:

git init .
git config core.worktree `pwd`
git config receive.denycurrentbranch ignore

Here you have initialized an empty repository to send code to from your local machine and defined some simple configurations so Git knows it will be a remote endpoint. You’ll also want to set up a post-receive hook. Do so by creating an executable (via permissions) file called post-receive in the .git/hooks folder in the folder you just initialized. This file will execute when the deploy endpoint receives any Git push. It should contain any tasks you need to run every time you push, such as syncing databases, clearing the cache, or restarting any processes. At a minimum, it will need to update the endpoint.

A simple .git/hooks/post-receive file looks like this:

#!/bin/sh
git checkout -f
git reset --hard

This will reset any local changes (on the remote machine) and update the code.

Tip

You should make all of your changes on your local machine, test them, and then push them to the deploy endpoint. It’s a good habit to start from the beginning. That way, all of your code is under version control and you can ensure there are no intermittent bugs or errors introduced by modifying code directly on the server.

Once your endpoint is set up, you can simply run the following command from your local repository to update the code on the server with all the latest commits:

git push deploy master

Doing so is a great way to manage your repository and server or remote machine; it’s really easy to use and set up and makes migration, if necessary, straightforward.

If you’re new to deployment and version control, we recommend starting with Git and getting comfortable with it before moving on to more complex deployment options, like using Fabric. Later in this chapter, we’ll cover some larger-scale automation for deploying and managing code across multiple servers.

Using Parallel Processing

Parallel processing is a wonderful tool for script automation, giving you the ability to run many concurrent processes from one script. If your script needs to have more than one process, Python’s built-in multiprocessing library will become your go-to for automation. If you have a series of tasks you need to run in parallel or tasks you could speed up by running in parallel, multiprocessing is the right tool.

So how can one utilize multiprocessing? Here’s a quick example:

from multiprocessing import Process, Manager 1
import requests

ALL_URLS = ['google.com', 'bing.com', 'yahoo.com',
            'twitter.com', 'facebook.com', 'github.com',
            'python.org', 'myreallyneatsiteyoushouldread.com']


def is_up_or_not(url, is_up, lock): 2
    resp = requests.get('http://www.isup.me/%s' % url) 3
    if 'is up.' in resp.content: 4
        is_up.append(url)
    else:
        with lock: 5
            print 'HOLY CRAP %s is down!!!!!' % url


def get_procs(is_up, lock): 6
    procs = []
    for url in ALL_URLS:
        procs.append(Process(target=is_up_or_not,
                             args=(url, is_up, lock))) 7
    return procs


def main():
    manager = Manager() 8
    is_up = manager.list() 9
    lock = manager.Lock() 10
    procs = get_procs(is_up, lock) 11
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print is_up

if __name__ == '__main__':
    main()
1

Imports the Process and Manager classes from the built-in multiprocessing library to help manage our processes.

2

Defines our main worker function, is_up_or_not, which requires three arguments: a URL, a shared list, and a shared lock. The list and lock are shared among all of our processes, allowing each of the processes the ability to modify or use them.

3

Uses requests to ask isup.me whether a given URL is currently online and available.

4

Tests to see if we can parse the text “is up.” on the page. If that text exists, we know the URL is up.

5

Calls the lock’s acquire method through a with block. This acquires the lock, continues executing the indented code, and then releases the lock at the end of the code block. Locks are blocking and should be used only if you require blocking in your code (for example, if you need to ensure only one process runs a special set of logic, like checking if a shared value has changed or has reached a termination point).

6

Passes the shared lock and list to use when generating the processes.

7

Creates a Process object by passing it keyword arguments: the target (i.e., what function should I run?) and the args (i.e., with what variables?). This line appends all of our processes to a list so we have them in one place.

8

Initializes our Manager object, which helps manage shared items and logging across processes.

9

Creates a shared list object to keep track of what sites are up. Each of the processes will have the ability to alter this list.

10

Creates a shared lock object to stop and announce if we encounter a site that is not up. If these were all sites we managed, we might have an important bit of business logic here for emergencies and therefore a reason to “stop everything.”

11

Gets the list of processes from get_procs, starts each of them, and then joins them so the parent process waits until the last child has finished. Starting all of the processes before joining any of them lets them run in parallel rather than one at a time.

When using multiprocessing, you usually have a manager process and child processes. You can pass arguments to your child processes, and you can use shared memory and shared variables. This gives you the power to determine how to utilize and architect your multiprocessing. Depending on the needs of your script, you might want to have the manager run a bunch of the logic of the script and use child processes to run one particular section of high-latency or long-running code.
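
One common way to structure that manager/child split is the Pool class from the same multiprocessing library, where the parent hands each slow item to a pool of child processes. Here is a minimal sketch; the URLs are placeholders:

from multiprocessing import Pool

import requests


def check_status(url):
    # the slow, high-latency work happens in a child process
    return url, requests.get('http://%s' % url).status_code


if __name__ == '__main__':
    urls = ['google.com', 'python.org', 'github.com']
    pool = Pool(processes=3)
    results = pool.map(check_status, urls)  # the parent farms out one URL per child
    pool.close()
    pool.join()
    print(results)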

Note

A shared lock object provides the ability to have multiple processes running simultaneously while protecting certain areas of the internal logic. A nice way to use them is simply by placing your lock logic in a with statement.

If you’re unsure whether your script is a good candidate for multiprocessing, you can always test out a section of the script or a subtask first, and determine whether you were able to achieve your parallel programming goals or whether it unnecessarily complicates the logic. There are some tasks better completed using large-scale automation and queueing, which we’ll discuss later in this chapter.

Using Distributed Processing

In addition to parallel processing or multiprocessing, there is also distributed processing, which involves distributing your process over many machines (unlike parallel processing, which occurs on one machine). Parallel processing is faster when your computer can handle it, but sometimes you need more power.

Note

Distributed processing touches on more than one type of computing problem. There are tools and libraries working to manage processes distributed across many computers, and others working on managing storage across many computers. Terms related to these problems include distributed computing, MapReduce, Hadoop, HDFS, Spark, Pig, and Hive.

In early 2008, the William J. Clinton Presidential Library and the National Archives released Hillary Clinton’s schedule as First Lady from 1993 through 2001. The archive consisted of more than 17,000 pages of PDF images and needed to be optical character recognized, or OCR-ed, in order to be turned into a useful dataset. Because this was during the Democratic presidential primaries, news organizations wanted to publish the data. To accomplish this, The Washington Post used distributed processing services to turn the 17,000 images into text. By distributing the work to more than 100 computers, they were able to complete the process in less than 24 hours.

Distributed processing with a framework like Hadoop involves two major steps. The first step is to map the data or input. This process acts like a filter of sorts. A mapper is used to say “separate all the words in a text file,” or “separate all of the users who have tweeted a certain hashtag in the past hour.” The next step is to reduce the mapped data into something usable. This is similar to the aggregate functions we used in Chapter 9. If we were looking at all of the Twitter handles from the Spritzer feed, we might want a count of tweets per handle or an aggregate of handles depending on geography or topic (i.e., all tweets originating from this time zone used these words the most). The reducer portion helps us take this large data and “reduce” it into a readable and actionable report.
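
To make the idea concrete, here is a tiny, single-machine sketch of the map and reduce steps in plain Python; a framework like Hadoop runs these same two phases, only distributed across many machines:

from collections import Counter


def mapper(line):
    # map step: emit a (word, 1) pair for every word in the line
    return [(word.lower(), 1) for word in line.split()]


def reducer(pairs):
    # reduce step: sum the counts for each word
    counts = Counter()
    for word, count in pairs:
        counts[word] += count
    return counts


lines = ['data wrangling is fun', 'automation makes data wrangling easier']
mapped = [pair for line in lines for pair in mapper(line)]
print(reducer(mapped))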

As you can probably see, not all datasets will need a map-reduce, and the theories behind MapReduce are already available in many of the Python data libraries. However, if you have a truly large dataset, using a MapReduce tool like Hadoop can save you hours of computing time. For a really great walkthrough, we recommend Michael Noll’s tutorial on writing a Hadoop MapReduce program in Python, which uses some word counting to explore Python and Hadoop. There is also great documentation for mrjob, which is written and maintained by developers at Yelp. If you’d like to read more on the topic, check out Kevin Schmidt and Christopher Phillips’s Programming Elastic MapReduce (O’Reilly).

If your dataset is large but is stored disparately or is real-time (or near real-time), you may want to take a look at Spark, another Apache project that has gained popularity for its speed, machine learning uses, and ability to handle streams. If your task handles streaming real-time data (from a service, an API, or even logs), then Spark is likely a more feasible choice than Hadoop and can handle the same MapReduce computing structure. Spark is also great if you need to use machine learning or any analysis requiring you to generate data and “feed” it into your data clusters. PySpark, the Python API for Spark, is maintained by the same developers, giving you the ability to write Python for your Spark processing.
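
To give a rough feel for PySpark, here is a minimal word-count sketch; it assumes Spark and the pyspark package are installed and that a local logs.txt file exists:

from pyspark import SparkContext

sc = SparkContext('local[*]', 'word_count_sketch')
lines = sc.textFile('logs.txt')                        # assumed input file
counts = (lines.flatMap(lambda line: line.split())     # map: split lines into words
               .map(lambda word: (word, 1))            # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))       # reduce: sum counts per word
print(counts.take(10))                                 # peek at ten results
sc.stop()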

To get started using Spark, we recommend Benjamin Bengfort’s detailed blog post covering how to get it installed, integrated with Jupyter notebooks, and setting up your first project. You can also check out John Ramey’s post on PySpark integration with Jupyter notebooks, and further explore the data collection and analysis possibilities in your notebook.

Simple Automation

Simple automation in Python is easy. If your code doesn’t need to run on many machines, if you have one server, or if your tasks aren’t event-driven (or can be run at the same time daily), simple automation will work. One major tenet of development is to choose the most clear and simple path. Automation is no different! If you can easily use a cron job to automate your tasks, by no means should you waste time overengineering it or making it any more complicated.

As we review simple automation, we’ll cover the built-in cron (a Unix-based system task manager) and various web interfaces to give your team easy access to the scripts you’ve written. These represent simple automation solutions which don’t require your direct involvement.

CronJobs

Cron is a Unix-based job scheduler for running scripts using your server’s logging and management utilities. Cron expects you to determine how often and at what times your task should run.

Warning

If you can’t easily define a timeline for your scripts, cron might not be a good fit. Alternatively, you could run a regular cron task to test whether the necessary conditions for your task to run exist and then use a database or local file to signal it’s time to run. With one more cron task, you would check that file or database and perform the task.
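
A sketch of that signal-file approach might look like the following, where the signal path is an arbitrary choice and run_report stands in for the real task logic:

import os
import sys

SIGNAL_FILE = '/tmp/run_report_now'  # arbitrary path another task or person creates


def run_report():
    pass  # stand-in for the real task logic


if __name__ == '__main__':
    if not os.path.exists(SIGNAL_FILE):
        sys.exit(0)            # no signal this time around; exit quietly
    os.remove(SIGNAL_FILE)     # clear the signal so the task runs only once
    run_report()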

If you’ve never used a cron file before, they are fairly straightforward. Most can be edited by simply typing:

crontab -e
Note

Depending on your operating system, if you’ve never written a cron file before, you may be prompted to choose an editor. Feel free to stick with the default or change it if you have another preference.

You will see a bunch of documentation and comments in the file explaining how a cron file works. Every line of your cron file that doesn’t begin with a # symbol is a line to define a cron task. Each of these cron tasks is expected to have the following list of arguments:

minute hour day_of_month month day_of_week command

If a script should run every hour of the day, but only on weekdays, you’d want to write something like this:

0 * * * 1-5 python run_this.py

This tells cron to run your script at the top of the hour, every hour, from Monday through Friday. There are quite a lot of good tutorials that walk through exactly what options are available to you, but here are a few tips:

  • Always set up your MAILTO=your@email.com variable before any lines of code. This way, if one of your scripts fails, cron will email you the exception so you’ll know it didn’t work. You will need to set up your laptop, computer, or server to send mail. Depending on your operating system and ISP, you may need to do some configuration. There’s a good GitHub gist to get Mac users started, and a handy post by HolaRails for Ubuntu users.

  • If you have services running that should be restarted if the computer reboots, use the @reboot feature.

  • If you have several path environments or other commands that must run to execute your script properly, you should write a cron.sh file in your repository. Put all necessary commands in the file and run that file directly, rather than a long list of commands connected with && signs.

  • Don’t be afraid to search for answers. If you’re new to cron and are having an issue, it’s quite possible someone has posted a solution that is a simple Google search away.

To test out how to use cron, we’ll create a simple Python example. Start by creating a new Python file called hello_time.py, and place this code in it:

from datetime import datetime

print 'Hello, it is now %s.' % datetime.now().strftime('%d-%m-%Y %H:%M:%S')

Next, make a simple cron.sh file in the same folder and write the following bash commands in it:

export ENV=PROD
cd /home/your_home/folder_name
python hello_time.py

We don’t strictly need to set the environment variable, since the script doesn’t use it, and you’ll need to update the cd line so it changes into the folder where hello_time.py lives. Still, this is a good example of how to use bash commands to set variables, source virtual environments, copy and move files or change into new folders, and then call your Python file. You’ve been using bash since the beginning of the book, so no need to fear even if you are still a beginner.

Finally, let’s set up our cron task by editing our file using crontab -e. Add these lines below the documentation in your editor:

MAILTO=youremail@yourdomain.com
*/5 * * * * bash /home/your_home/folder_name/cron.sh > /var/log/my_cron.log 2>&1

You should replace the made-up email in this example with your real one and write the proper path to the cron.sh file you just created. Remember, your hello_time.py script should be in the same folder. In this example, we have also set up a log file (/var/log/my_cron.log) for cron to use. The > redirects the script’s output into that log file, and the 2>&1 at the end of the line sends any errors there as well. Once you have exited your editor and properly saved your cron file, you should see a message confirming your new cron task is installed. Wait a few minutes and then check the log file. You should see the message from the script in that file. If not, you can check your cron error messages by searching in your system log (usually /var/log/syslog) or in your cron log (usually /var/log/cron). To remove this cron task, simply edit your crontab again and delete the line, or place a # at the beginning of the line to comment it out.

Note

Cron can be a very simple way to automate your script and alerting. It’s a powerful tool designed by Bell Labs during the initial development of Unix in the mid-1970s, and is still widely used. If it’s easy to predict when your automation should run, or it is only a few bash commands away from running, cron is a useful way to automate your code.

If you needed to pass command-line arguments for your cron tasks, the lines in the file might then look like this:

*/20 10-22 * * * python my_arg_code.py arg1 arg2 arg3
0,30 10-22 * * * python my_arg_code.py arg4 arg5 arg6

Cron is fairly flexible but also very simple. If it fits your needs, great! If not, keep reading to learn some other simple ways to automate your data wrangling.

Web Interfaces

If you need your script, scraper, or reporting task to run on demand, one easy solution is to simply build a web interface where people can log in and push a button to fire it up. Python has many different web frameworks to choose from, so it’s up to you which one to use and how much time you’d like to spend working on the web interface.

One easy way to get started is to use Flask-Admin, which is an administrative site built on top of the Flask web framework. Flask is a microframework, meaning it doesn’t require a lot of code to get started. After getting your site up and running by following the instructions in the quickstart guide, you simply set up a view in your Flask application to execute the task.
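
A bare-bones Flask view that kicks off a task might look like the following sketch; the route name and messages are placeholders, and a real version should hand the work off to a background process or queue so the request returns quickly:

from flask import Flask

app = Flask(__name__)


def run_report():
    pass  # stand-in for the real reporting task


@app.route('/run-report', methods=['POST'])
def trigger_report():
    run_report()
    return 'Report started; you will be notified when it finishes.'


if __name__ == '__main__':
    app.run(debug=True)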

Warning

Make sure your task can alert the user or you when it’s finished in another way (email, messaging, etc.), as it’s unlikely to complete in time to give a proper web response. Also be sure to notify the user when the task starts, so they don’t end up requesting the task to run many times in a row.

Another popular and often used microframework in Python is Bottle. Bottle can be used similarly to Flask, with a view to execute the task if the user clicks a button (or does some other simple action).

A larger Python web framework often used by Python developers is Django. Originally developed to allow newsrooms to easily publish content, it comes with a built-in authentication and database system and uses a settings file to configure most of these features.

No matter what framework you use or how you build your views, you’ll want to host your framework somewhere so others can request tasks. You can host your own site fairly easily using DigitalOcean or Amazon Web Services (see Appendix G). You can also use service providers who support Python environments, like Heroku. If you’re interested in that option, Kenneth Reitz has written a great introduction to deploying your Python apps using Heroku.

Warning

Regardless of what framework or microframework you use, you’ll want to think about authentication and security. You can set that up server-side with whatever web server you are using, or explore options the framework gives you (including plug-ins or other support features).

Jupyter Notebooks

We covered how to set up your Jupyter notebooks in Chapter 10, and they are another great way to share code, particularly with folks who may not need to know Python, but who need to view the charts or other outputs of your script. If you teach them how to use simple commands, like running all the cells in the notebook and shutting it down after they’ve downloaded the new reports, you’ll find it can save you hours of time.

Tip

Adding in Markdown cells to explain how to use your shared notebooks is a great way to ensure everyone is clear on how to use the code and can move forward easily without your help.

If your script is well organized with functions and doesn’t need to be modified, simply put the repository in a place where the Jupyter notebooks can import and use the code (it’s also a good idea to set your server or notebook’s PYTHONPATH so the modules you are using are always available). This way, you can import those main functions into a notebook and have the script run and generate the report when someone clicks the notebook’s “Run All” option.
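
If the reporting code lives in an importable module (here called my_report, a hypothetical name), the notebook may only need a single cell like this:

from my_report import generate_report

generate_report()  # produces the latest report when someone runs all cells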

Large-Scale Automation

If your system is larger than one machine or server can handle or if your reporting is tied into a distributed application or some other event-driven system, it’s likely you’ll need something more robust than just web interfaces, notebooks, and cron. If you need a true task management system and you’d like to use Python, you’re in luck. In this section, we’ll cover a robust task management tool called Celery that handles larger stacks of tasks, automates workers (you’ll learn about workers in the next section) and provides monitoring solutions.

We will also cover operations automation, which can be helpful if you manage a series of servers or environments with different needs. Ansible is a great automation tool to help with tasks as rote as migrating databases all the way up to large-scale integrated deployments.

There are some alternatives to Celery, such as Spotify’s Luigi, which is useful if you are using Hadoop and you have large-scale task management needs (particularly long-running tasks, which can be a pain point). As far as good alternatives for operations automation, it is a quite crowded space. If you only need to manage a few servers, for Python-only deployment one good option is Fabric.

For larger-scale management of servers, a good alternative is SaltStack, or using Vagrant with any number of deployment and management tools like Chef or Puppet. We’ve chosen to highlight some of the tools we’ve used in this section, but they are not the only tools for larger-scale automation using Python. Given the field’s popularity and necessity, we recommend following discussions of larger-scale automation on your favorite technology and discussion sites, such as Hacker News.

Celery: Queue-Based Automation

Celery is a Python library used to create a distributed queue system. With Celery, your tasks are managed using a scheduler or via events and messaging. Celery is a complete solution if you’re looking for something scalable that can handle long-running, event-driven tasks. Celery integrates well with a few different queue backends. It uses settings files, user interfaces, and API calls to manage the tasks. And it’s fairly easy to get started, so no need to fear if it’s your first task management system.

No matter how you set up your Celery project, it will likely contain the following task manager system components:

Message broker (likely RabbitMQ)

This acts as a queue for tasks waiting to be processed.

Task manager/queue manager (Celery)

This service keeps track of the logic controlling how many workers to use, what tasks take priority, when to retry, and so on.

Workers

Workers are Python processes controlled by Celery which execute your Python code. They know what tasks you have set them up to do and they attempt to run that Python code to completion.

Monitoring tool (e.g., Flower)

This allows you to take a look at the workers and your queue and is great for answering questions like “What failed last night?”
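
To give a feel for what this looks like in code, here is a minimal Celery task sketch; the module name, broker URL, and retry settings are illustrative and assume a RabbitMQ broker running locally:

# tasks.py
from celery import Celery

app = Celery('tasks', broker='amqp://localhost//')  # assumed local RabbitMQ broker


@app.task(bind=True, max_retries=3)
def clean_data(self, file_name):
    try:
        pass  # stand-in for the real cleanup logic
    except IOError as exc:
        raise self.retry(exc=exc, countdown=60)  # retry in a minute on failure

You would then start a worker with something like celery -A tasks worker, and queue work from any Python process by calling clean_data.delay('some_file.csv').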

Celery has a useful getting started guide, but we find the biggest problem is not learning how to use Celery, but instead learning what types of tasks are good for queues and what tasks aren’t. Table 14-2 reviews a few questions and philosophical ideas around queue-based automation.

Table 14-2. To queue or not to queue?
Queue-based task management requirements | Requirements for automation without queues
Tasks do not have a specific deadline. | Tasks can and do have deadlines.
We don’t need to know how many tasks we have. | We can easily quantify what tasks need to be done.
We only know the prioritization of tasks in a general sense. | We know exactly which tasks take priority.
Tasks need not always happen in order, or are not usually order-based. | Tasks must happen in order.
Tasks can sometimes take a long time, and other times a short time. | We need to know how long tasks take.
Tasks are called (or queued) based on an event or another task’s completion. | Tasks are based on the clock or something predictable.
It’s OK if tasks fail; we can retry. | We must be aware of every task failure.
We have a lot of tasks and a strong potential for task growth. | We have only a few tasks a day.

These requirements are generalized, but they indicate some of the philosophical differences between when a task queue is a good idea and when something might be better run on a schedule with alerting, monitoring, and logging.

Note

It’s fine to have different parts of your tasks in different systems, and you’ll see that often at larger companies where they have different “buckets” of tasks. It’s also OK to test out both queue-based and non-queue-based task management and determine what fits best for you and your projects.

There are other task and queue management systems for Python, including Python RQ and PyRes. Both of them are newer and therefore might not have the same Google-fu in terms of problem solving, but if you’d like to play around with Celery first and then branch out to other alternatives, you have options.

Ansible: Operations Automation

If you are at the scale where you need Celery to manage your tasks, it’s quite likely you also need some help managing your other services and operations. If your projects need to be maintained on a distributed system, you should start organizing them so you can easily distribute via automation.

Ansible is an excellent system to automate the operations side of your projects. Ansible gives you access to a series of tools you can use to quickly spin up, deploy, and manage code. You can use Ansible to migrate projects and back up data from your remote machines. You can also use it to update servers with security fixes or new packages as needed.

Ansible has a quickstart video to get you acquainted with the basics, and its documentation highlights a number of useful features worth exploring.

We also recommend checking out Justin Ellingwood’s introduction to Ansible playbooks and the Servers for Hackers extended introduction to Ansible.
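Once you have written an inventory file and a playbook, running a deployment is a single command (the file names here are placeholders for your own setup):

ansible-playbook -i hosts deploy.yml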

Note

Ansible is probably too advanced and overcomplicated if you only have one or two servers or you only deploy one or two projects, but it is a great resource if your project grows and you need something to help keep your setup organized. If you have an interest in operations and system administration, it’s a great tool to learn and master.

If you’d rather leave your operations to a nice image you’ve created and can just restart every time, plenty of cloud providers let you do just that! There’s no pressing need to become an operations automation expert for your data wrangling needs.

Monitoring Your Automation

It’s essential you spend time monitoring your automation. If you have no idea whether a task completed or if your tasks succeeded or failed, you might as well not be running them. For this reason, monitoring your scripts and the machines running them is an important part of the process.

For example, if you have a hidden bug where data is not actually being loaded and every day or week you are running reporting on old data, that would be awful news. With automation, failure is not always obvious, as your script may continue running with old data or other errors and inconsistencies. Monitoring is your view into whether your script is succeeding or failing, even if all signs indicate it is still operating normally.

Note

Monitoring can have a small or large footprint, depending on the scale and needs of your tasks. If you are going to have a large-scale automation running across many servers, you’ll probably need to use a larger distributed monitoring system or something that boasts monitoring as a service. If, however, you are running your tasks on a home server, you probably only need to use the built-in Python logging tool.

You’ll likely want some alerting and notifications for your script as well. It’s easy in Python to upload, download, email, or even SMS the result. In this section, we’ll cover various logging options and review ways to set up notifications. After thorough testing and with a strong understanding of all the potential errors from daily monitoring, you can fully automate the task and manage the errors via alerts.

Python Logging

The most basic monitoring your script will need is logging. Lucky for you, Python has a very robust and feature-rich logging environment as part of the standard library. The clients or libraries you interact with usually have loggers integrated with the Python logging ecosystem.

Using the simple basic configuration provided by Python’s built-in logging module, we can instantiate our logger and get started. You can then use the many different configuration options to meet your script’s specific logging needs. Python’s logging lets you set particular logging levels and log record attributes, and adjust the formatting. The logger object also has methods and attributes that can be useful depending on your needs.

Here’s how we set up and use logging in our code:

import logging
from datetime import datetime


def start_logger():
    logging.basicConfig(filename='/var/log/my_script/daily_report_%s.log' %
                        datetime.strftime(datetime.now(), '%m%d%Y_%H%M%S'), 1
                        level=logging.DEBUG, 2
                        format='%(asctime)s %(message)s', 3
                        datefmt='%m-%d %H:%M:%S') 4


def main():
    start_logger()
    logging.debug("SCRIPT: I'm starting to do things!") 5

    try:
        20 / 0
    except Exception:
        logging.exception('SCRIPT: We had a problem!') 6
        logging.error('SCRIPT: Issue with division in the main() function') 7

    logging.debug('SCRIPT: About to wrap things up!')

if __name__ == '__main__':
    main()
1

Initializes our logging using the logging module’s basicConfig method, which requires a log file name. This code logs to a my_script folder inside /var/log. The filename is daily_report_<DATEINFO>.log, where <DATEINFO> is the time the script began, including the month, date, year, hour, minute, and second. Putting the start time in the filename tells us when the script ran and keeps each run’s log separate, which is good logging practice.

2

Sets our logging level. Most often, you will want the level set to DEBUG, the most verbose level, so you can leave debugging messages in the code and track them in the logs; this will also capture debug-level logging from your helper libraries. Some people prefer less verbose logs and set the level to INFO, WARNING, or ERROR instead, which filters out the lower-level messages.

3

Sets the format of Python logging using the log record attributes. Here we record the message sent to logging and the time it was logged.

4

Sets a human-readable date format so our logs can easily be parsed or searched using our preferred date format. Here we have month, day, hour, minute, and second logged.

5

Calls the module’s debug method to start logging. This method expects a string. We are prefacing our script log entries with the word SCRIPT:. Adding searchable notes like this to your logs will help you later determine which processes and libraries wrote to your log.

6

Uses the logging module’s exception method, which writes a string you send along with a traceback from the Python exception, and can therefore only be used in an exception block. This is tremendously useful for debugging errors and seeing how many exceptions you have in your script.

7

Logs a longer error message using the error level. The logging module has the ability to log a variety of levels, including debug, error, info, and warning. Be consistent with how you log, and use info or debug for your normal messages and error to log messages specific to errors and exceptions in your script. That way, you always know where to look for problems and how to properly parse your logs for review.

As we’ve done in the example here, we find it useful to begin log messages with a note to yourself about what module or area of the code is writing the message. This can help determine where the error occurred. It also makes your logs easy to search and parse, as you can clearly see what errors or issues your script encounters. The best way to approach logging is to determine where to put messages to yourself as you are first writing your script, and keep the important messages in the script to determine whether something has broken and at what point.

Tip

Every exception should be logged, even if the exception is expected. This will help you keep track of how often those exceptions occur and whether your code should treat them as normal. The logging module provides exception and error methods for your usage, so you can log the exception and Python traceback and also add some extra information with error to elaborate on what might have occurred and where in the code it occurred.

You should also log your interactions with databases, APIs, and external systems. This will help you determine when your script has issues interacting with these systems and ensure they are stable, reliable, or able to be worked around. Many of the libraries you interact with also have their own ability to log to your log configuration. For example, the requests module will log connection problems and requests directly into your script log.
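If one of those libraries turns out to be too chatty, or you want to make the source of each message more obvious, you can adjust individual loggers without changing your overall configuration. Here is a small sketch (the logger names are only illustrations):

import logging

# Quiet a noisy third-party logger while keeping our own DEBUG messages.
logging.getLogger('requests').setLevel(logging.WARNING)

# A named logger makes it clear which module wrote each line in the log file.
logger = logging.getLogger('daily_report')
logger.debug('SCRIPT: finished downloading the source data')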

Even if you don’t set up any other monitoring or alerting for your script, you should use logging. It’s simple, and it provides good documentation for your future self and others. Logs are not the only solution, but they are a good standard and serve as a foundation for the monitoring of your automation.

In addition to logging, you can set up easy-to-analyze alerting for your scripts. In the following section, we’ll cover ways your script can message you about its success or failure.

Adding Automated Messaging

One easy way to send reports, keep track of your scripts, and notify yourself of errors is to use email or other messages sent directly from your scripts. There are many Python libraries to help with this task. It’s good to begin by determining exactly what type of messaging you need for your scripts and projects.

Ask yourself if any of the following apply to your script:

  • It produces a report which needs to be sent to a particular list of recipients.

  • It has a clear success/failure message.

  • It is pertinent to other coworkers or collaborators.

  • It provides results not easily viewed on a website or through a quick dashboard.

If any of these sound like your project, it’s likely a good candidate for some sort of automated messaging.

Email

Emailing with Python is straightforward. We recommend setting up a separate script-only email address through your favorite email provider (we used Gmail). If your provider doesn’t document Python integration directly, it’s likely a quick search will turn up the proper configuration or a useful example.

Let’s take a look at a script we’ve used to send mail with attachments to a list of recipients. We modified this code from a gist written by @dbieber, which was modified from Rodrigo Coutinho’s “Sending emails via Gmail with Python” post:

#!/usr/bin/python
# Adapted from
# http://kutuma.blogspot.com/2007/08/sending-emails-via-gmail-with-python.html
# Modified again from: https://gist.github.com/dbieber/5146518
# config file(s) should contain section 'email' and parameters
# 'user' and 'password'

import smtplib 1
from email.MIMEMultipart import MIMEMultipart 2
from email.MIMEBase import MIMEBase
from email.MIMEText import MIMEText
from email import Encoders
import os
import ConfigParser


def get_config(env): 3
    config = ConfigParser.ConfigParser()
    if env == "DEV":
        config.read(['config/development.cfg']) 4
    elif env == "PROD":
        config.read(['config/production.cfg'])
    return config


def mail(to, subject, text, attach=None, config=None): 5
    if not config:
        config = get_config("DEV") 6
    msg = MIMEMultipart()
    msg['From'] = config.get('email', 'user') 7
    msg['To'] = ", ".join(to) 8
    msg['Subject'] = subject
    msg.attach(MIMEText(text))
    if attach: 9
        part = MIMEBase('application', 'octet-stream')
        part.set_payload(open(attach, 'rb').read()) 10
        Encoders.encode_base64(part)
        part.add_header('Content-Disposition',
                        'attachment; filename="%s"' % os.path.basename(attach))
        msg.attach(part)
    mailServer = smtplib.SMTP("smtp.gmail.com", 587) 11
    mailServer.ehlo()
    mailServer.starttls()
    mailServer.ehlo()
    mailServer.login(config.get('email', 'user'),
                     config.get('email', 'password'))
    mailServer.sendmail(config.get('email', 'user'), to, msg.as_string())
    mailServer.close()


def example():
    mail(['listof@mydomain.com', 'emails@mydomain.com'],
         "Automate your life: sending emails",
         "Why'd the elephant sit on the marshmallow?",
         attach="my_file.txt") 12
1

Python’s built-in smtplib library gives you a wrapper for SMTP, the standard protocol used to send email.

2

Python’s email library helps create email messages and attachments and keeps them in the proper format.

3

The get_config function loads the configuration from one of a series of local configuration files. We pass in an environment string, expected to be "PROD" or "DEV", to signal whether the script is running locally ("DEV") or on our remote production environment ("PROD"). If you only have one environment, you could simply return the only configuration file in your project.

4

This line uses Python’s ConfigParser to read the .cfg file into the config object, which the function then returns.

5

Our mail function takes a list of email addresses as the to variable, the subject and text of the email, an optional attachment, and an optional config argument. The attachment is expected to be the name of a local file. The config should be a Python ConfigParser object.

6

This line sets the default configuration in case it wasn’t passed. To be safe, we are using the "DEV" configuration.

7

This code uses the ConfigParser object to pull the email address out of the config file. This keeps the address secure and separate from our repository code.

8

This code joins the list of email addresses into a single string, with each address separated by a comma and a space, because the message’s To header expects a string rather than a Python list.

9

If there is an attachment, this line begins the special handling for MIME multipart standards needed to send attachments.

10

This code opens and reads the full file using the filename string passed.

11

If you’re not using Gmail, set these to match your provider’s host and port for SMTP. Those should be easy to identify if there is good documentation. If there isn’t, a simple search for “SMTP settings <your provider name>” should give you the details.

12

This is some example code to give an idea of what this mail function is expecting. You can see the data types expected (string, list, filename), and the order.

The simple Python built-in libraries smtplib and email help us quickly create and send email messages using their classes and methods. Abstracting some of the other parts of the script (such as saving your email address and password in your config) is an essential part of keeping your script and your repository secure and reusable. A few default settings ensure the script can always send email.
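As a reminder of what that configuration looks like, here is a minimal sketch of the config/development.cfg file this script expects (the values are placeholders; keep the real file out of your repository):

[email]
user = my.script.account@gmail.com
password = an-app-specific-password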

SMS and voice

If you’d like to integrate telephone messages into your alerting, you can use Python to send text messages or make phone calls. Twilio is a very cost-efficient way to do so, with support for messages with media and automated phone calls.

Note

Before you get started with the API, you’ll need to sign up to get your authorization codes and keys and install the Twilio Python client. There’s a long list of code examples in the Python client’s documentation, so if you need to do something with voice or text, it’s likely there is a good example available.

Take a look at how easy it is to send a quick text:

from twilio.rest import TwilioRestClient 1
import ConfigParser


def send_text(sender, recipient, text_message, config=None): 2
    if not config:
        config = ConfigParser.ConfigParser()
        config.read('config/development.cfg')

    client = TwilioRestClient(config.get('twilio', 'account_sid'),
                              config.get('twilio', 'auth_token')) 3
    sms = client.sms.messages.create(body=text_message,
                                      to=recipient,
                                      from_=sender) 4

def example():
    send_text("+11008675309", "+11088675309", "JENNY!!!!") 5
1

We’ll use the Twilio Python client to interact directly with the Twilio API via Python.

2

This line defines a function we can use to send a text. We’ll need the sender’s and recipient’s phone numbers (prefaced with country codes) and the simple text message we want to send, and we have the ability to also pass a configuration object. We’ll use the configuration to authorize with the Twilio API.

3

This code sets up a client object, which will authorize using our Twilio account. When you sign up for Twilio, you’ll receive an account_sid and an auth_token. Put them in the configuration file your script uses, in a section named twilio.

4

To send a text, this code navigates to the SMS module in our client and calls the message resource’s create method. As documented by Twilio, we can then send a simple text message with only a few parameters.

5

Twilio works internationally and expects to see international-based dialing numbers. If you’re unsure of the international dialing codes to use, Wikipedia has a good listing.

Tip

If you are interested in having your script “talk” via Python, Python text-to-speech can “read” your text over the phone.

Chat integration

If you’d like to integrate chat into your alerting, or if your team or collaborators commonly use chat, there are many Python chat toolkits you can use for this purpose. Depending on your chat client and needs, there’s likely a Python or API-based solution, and you can use your knowledge of REST clients to go about connecting and messaging the right people.

If you use HipChat, their API is fairly easy to integrate with your Python application or script. There are several Python libraries to make simple messaging to a chatroom or a person straightforward.

To get started using the HipChat API, you’ll first need to log in and get an API token. You can then use HypChat, a Python library, to send a quick message to a chatroom.

First, install HypChat using pip:

pip install hypchat

Now, send a message using Python!

from hypchat import HypChat
from utils import get_config


def get_client(config):
    client = HypChat(config.get('hipchat', 'token')) 1
    return client


def message_room(client, room_name, message):
    try:
        room = client.get_room(room_name) 2
        room.message(message) 3
    except Exception as e:
        print e 4


def main():
    config = get_config('DEV') 5
    client = get_client(config)
    message_room(client, 'My Favorite Room', "I'M A ROBOT!")
1

We use the HypChat library to talk to our chat client. The library initializes a new client using our HipChat token, which we will keep stored in our config files.

2

This code uses the get_room method, which locates a room matching the string name.

3

This line sends a message to a room or a user with the message method, and passes it a simple string of what to say.

4

Always use try...except blocks with API-based libraries in case of connection errors or API changes. This code prints the error, but you’d likely want it logged to fully automate your script.

5

The get_config function used here is imported from a different script. We follow modular code design by introducing these helper functions and putting them in individual modules for reuse.

If you want to log to chat, you can explore those options with HipLogging. Depending on your needs and how your team works, you can set up your chat logging how you’d like; but it’s nice to know you can always leave a note for someone where they might see it!

If you’d rather use Google Chat, there are some great examples of how to do so using SleekXMPP. You can also use SleekXMPP to send Facebook chat messages.

For Slack messaging, check out the Slack team’s Python client.

For other chat clients, we recommend doing a Google search for “Python <your client name>.” Chances are someone has attempted to connect their Python code with that client, or there’s an API you can use. You know how to use an API from your work in Chapter 13.

With so many options for alerting and messaging about your script’s (and automation’s) success or failure, it’s hard to know which one to use. The important thing is to choose a method you or your team regularly use and will actually see. Prioritizing ease of use and integration with daily life is essential—automation is here to help you save time, not to make you spend more time checking services.

Uploading and Other Reporting

If you need to upload your reports or figures to a separate service or file share as part of your automation, there are terrific tools for those tasks. If it’s an online form or a site you need to interact with, we recommend using your Selenium scraping skills from Chapter 12. If it’s an FTP server, there is a standard FTP library for Python. If you need to send your reporting to an API or via a web protocol, you can use the requests library or the API skills you learned in Chapter 13. If you need to send XML, you can build it using LXML (see Chapter 11).
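For instance, uploading a finished report to an FTP server takes only a few lines using the built-in ftplib module. Here is a rough sketch, assuming a hypothetical host and credentials (which should live in your config file, not your code):

from ftplib import FTP


def upload_report(filename):
    ftp = FTP('ftp.example.com')              # placeholder host
    ftp.login('report_user', 'report_pass')   # placeholder credentials
    with open(filename, 'rb') as report_file:
        # STOR uploads the file in binary mode under the same name.
        ftp.storbinary('STOR %s' % filename, report_file)
    ftp.quit()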

No matter what service you are looking to speak to, it’s likely you have had some exposure to communicating with that service. We hope you feel confident practicing those skills and striking out on your own.

Logging and Monitoring as a Service

If your needs are larger than one script can handle, or you want to incorporate your automation into a larger organizational framework, you might want to investigate logging and monitoring as a service. There are many companies working to make the lives of data analysts and developers easier by creating tools and systems to track logging. These tools often have simple Python libraries to send your logging or monitoring to their platform.

Note

With logging as a service, you can spend more time working on your research and scripts, and less time managing your monitoring and logging. This can offload some of the “Is our script working or not, and if so how well?” issues to the non-developers on your team, as many of the services have nice dashboards and built-in alerting.

Depending on the size and layout of your automation, you may need systems monitoring as well as script and error monitoring. In this section, we’ll look at a few services that do both, as well as some more specialized services. Even if you don’t have a large enough scale to justify them now, it’s always good to know what is possible.

Logging and exceptions

Python-based logging services offer the ability to log to one central service while having your script(s) run on a variety of machines, either local or remote.

One such service with great Python support is Sentry. For a relatively small amount of money per month, you can have access to a dashboard of errors, get alerts sent based on exception thresholds, and monitor the error and exception types you have on a daily, weekly, and monthly basis. The Python client for Sentry is easy to install, configure, and use. If you are using tools like Django, Celery, or even simple Python logging, Sentry has integration points so you don’t need to significantly alter your code to get started. On top of that, the code base is constantly updated and the staff is helpful in case you have questions.
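As a rough sketch of how little code it takes, at the time of writing Sentry’s Python client (raven) needs only the DSN string from your project settings to start capturing exceptions (the DSN below is a placeholder):

from raven import Client

client = Client('https://public:secret@app.getsentry.com/12345')  # placeholder DSN

try:
    20 / 0
except ZeroDivisionError:
    # Sends the exception and full traceback to your Sentry dashboard.
    client.captureException()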

Other options include Airbrake, which originally started as a Ruby-based exception tracker and now supports Python, and Rollbar. It’s a popular market, so there will likely be new ones launched before this book goes to print.

There are also services to pull in and parse your logs, such as Loggly and Logstash. These allow you to monitor your logs on an aggregate level as well as parse, search, and find issues in your logs. They are really only useful if you have enough logs and enough time to review them, but are great for distributed systems with a lot of logging.

Logging and monitoring

If you have distributed machines or you are integrating your script into your company or university’s Python-based server environment, you may want robust monitoring of not just Python, but the entire system. There are many services that offer monitoring for system load, database traffic, and web applications, as well as automated tasks.

One of the most popular services used for this is New Relic, which can watch your servers and system processes as well as web applications. Using MongoDB and AWS? Or MySQL and Apache? New Relic plug-ins allow you to easily integrate logging for your services into the same dashboards you are using for monitoring server and application health. In addition, they offer a Python agent so you can easily log your Python application (or script) into the same ecosystem. With all of your monitoring in one place, it’s easier to spot issues and set up proper alerting so the right people on your team immediately know about any problems.

Another service for systems and application monitoring is Datadog. Datadog allows you to integrate many services into one dashboard. This saves time and effort and allows you to easily spot errors in your projects, apps, and scripts. The Datadog Python client enables logging of different events you’d like to monitor, but requires a bit of customization.
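As a rough sketch of that customization, the datadog Python library can record a custom metric in a few lines once it’s initialized with your account keys, assuming the Datadog agent is running (the keys and metric name below are placeholders):

from datadog import initialize, statsd

initialize(api_key='YOUR_API_KEY', app_key='YOUR_APP_KEY')

# Count each successful report run so it shows up on your dashboards.
statsd.increment('daily_report.completed')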

No matter what monitoring you use, or whether you decide to build your own or use a service, it’s essential to have regular alerting, insight into the services you use, and an understanding of the integrity of your code and automated systems.

Tip

When you depend on your automation to complete other parts of your work and projects, you should make sure your monitoring system is both easy to use and intuitive so you can focus on the bigger parts of your projects without risking missing errors or other issues.

No System Is Foolproof

As we’ve discussed in this chapter, relying entirely on any system is foolhardy and should be avoided. No matter how bulletproof your script or system appears to be, there’s an undeniable chance it will fail at some point. If your script depends on other systems, they could fail at any point. If your script involves data from an API, service, or website, there’s a chance the API or site will change or go down for maintenance, or any number of other events could occur causing your automation to fail.

If a task is absolutely mission critical, it should not be automated. You can likely automate parts of it or even most of it, but it will always need supervision and a person to ensure it hasn’t failed. If it’s important but not the most essential piece, the monitoring and alerting for that piece should reflect its level of importance.

Note

As you dive deeper into your own data wrangling and automation, you will spend less time building and babysitting individual tasks and scripts, and more time on troubleshooting, critical thinking, and applying your analytical know-how and area knowledge to your work. Automation can help you do this, but it’s always good to have a healthy caution regarding what important tasks you automate, and how.

As the programs you’ve automated mature and progress, you will not only improve the automation you have and make it more resilient, but also increase your knowledge of your code base, Python, and your data and reporting.

Summary

You’ve learned how to automate much of your data wrangling using small- and large-scale solutions. You can monitor and keep track of your scripts and the tasks and subtasks with logging, monitoring, and cloud-based solutions—meaning you can spend less time keeping track of things and more time actually reporting. You have defined ways automation can succeed and fail and worked to help create a clear set of guidelines around automation (with an understanding that all systems can and will fail eventually). You know how to give other teammates and colleagues access so they can run tasks themselves, and you’ve learned a bit about how to deploy and set up Python automation.

Table 14-3 summarizes the new concepts and libraries introduced in this chapter.

Table 14-3. New Python and programming concepts and libraries
Concept/Library | Purpose
Running scripts remotely | Having your code run on a server or other machine so you don’t have to worry about your own computer use interfering.
Command-line arguments | Using argv to parse command-line arguments when running your Python script.
Environment variables | Using environment variables to help with script logic (such as what server your code is running on and what config to use).
Cron usage | Coding a shell script to execute as a cron task on your server or remote machine. A basic form of automation.
Configuration files | Using configuration files to define sensitive or special data for your Python script.
Git deployment | Using Git to easily deploy your code to one or more remote machine(s).
Parallel processing | Python’s multiprocessing library gives you easy access to run many processes at the same time while still having shared data and locking mechanisms.
MapReduce | With distributed data, you can map data according to a particular feature or by running it through a series of tasks, and then reduce that data to analyze it in aggregate.
Hadoop and Spark | Two tools used in cloud computing to perform MapReduce operations. Hadoop is better for an already defined and stored dataset, and Spark is preferred if you have streaming, extra-large, or dynamically generated data.
Celery (task queue use and management) | Gives you the ability to create a task queue and manage it using Python, allowing you to automate tasks that don’t have a clear start and end date.
logging module | Built-in logging for your application or script so you can easily track errors, debug messages, and exceptions.
smtplib and email modules | Built-in email alerting from your Python script.
Twilio | A service with a Python API client for use with telephone and text messaging services.
HypChat | A Python API library for use with the HipChat chat client.
Logging as a service | Using a service like Sentry or Logstash to manage your logging, error rates, and exceptions.
Monitoring as a service | Using a service like New Relic or Datadog to monitor your logs as well as service uptimes, database issues, and performance (e.g., to identify hardware problems).

Along with the wealth of knowledge you’ve taken from previous chapters in this book, you should now be well prepared to spend your time building quality tools and allowing these tools to do the grunt work for you. You can throw out those old spreadsheet formulas and use Python to import data, run analysis, and deliver reports directly to your inbox. You can truly let Python manage the rote tasks, like a robotic assistant, and move on to the more critical and challenging parts of your reporting.