An application programming interface (API) sounds like a fancy concept, but it is not. An API is a standardized way of sharing data on the Web. Many websites share data through API endpoints. There are too many available APIs to list in this book, but here are some you might find useful or interesting:
All of these are examples of APIs that return data. You make a request to the API, and the API returns data. APIs can also serve as a way to interact with other applications. For example, we could use the Twitter API to get data from Twitter and build another application that interacts with Twitter (e.g., an application that posts Tweets using the API). The Google API list is another example—most APIs allow you to interact with the company’s services. With the LinkedIn API, you can retrieve data, but also post updates to LinkedIn without going through the web interface. Because an API can do many different things, it should be considered a service. For our purposes, the service provides data.
In this chapter, you will request API data and save it to your computer. APIs usually return JSON, XML, or CSV files, which means after the data is saved locally to your computer, you just need to apply the skills you learned in the early chapters of this book to parse it. The API we will be working with in this chapter is the Twitter API.
We chose the Twitter API as an example for a number of reasons. First, Twitter is a well-known platform. Second, it has a lot of data (tweets) that folks are interested in analyzing. Finally, the Twitter API allows us to explore many API concepts, which we will discuss along the way.
Twitter data has been used both as an informal information collection tool, like in the One Million Tweet Map, and as a more formal research tool, such as for predicting flu trends and detecting real-time events like earthquakes.
An API can be as simple as a data response to a request, but it’s rare to find APIs with only that functionality. Most APIs have other useful features. These features may include multiple API request methods (REST or streaming), data timestamps, rate limits, data tiers, and API access objects (keys and tokens). Let’s take a look at these in the context of the Twitter API.
The Twitter API is available in two forms: REST and streaming. Most APIs are RESTful, but some real-time services offer streaming APIs. REST stands for Representational State Transfer and is designed to create stability in API architecture. Data from REST APIs can be accessed using the requests library (see Chapter 11), which lets you make the GET and POST requests REST APIs use to return matching data. In the case of Twitter, the REST API allows you to query tweets, post tweets, and do most things Twitter allows via its website.
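To make this concrete, here is a minimal sketch of querying a REST endpoint with requests. The URL and query parameter are placeholders for illustration only; a real API such as Twitter's will document its own endpoints and also require authentication.

```python
import requests

# Hypothetical endpoint used only for illustration; a real API will document
# its own URL, parameters, and authentication requirements.
url = 'https://api.example.com/v1/search'

response = requests.get(url, params={'q': 'child labor'})

# Most REST APIs return JSON, which requests can decode for us.
if response.status_code == 200:
    print(response.json())
else:
    print('Request failed with status code:', response.status_code)
```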
With a REST API, you can often (but not always) preview your query in a browser by using the API request as a URL. If you load the URL in your browser and it looks like a text blob, you can install a format previewer for your browser. For example, Chrome has plug-ins to preview JSON files in an easy-to-read way.
A streaming API runs as a real-time service and listens for data relating to your query. When you encounter a streaming API, you will likely want to use a library built to help manage data intake. To learn more about how Twitter’s streaming API works, see the overview on the Twitter website.
APIs often have rate limits, which restrict the amount of data a user can request over a period of time. API providers put rate limits in place for several reasons. For infrastructure and customer service purposes, a provider needs to limit the number of requests so its servers and architecture can manage the amount of data transferred; if everyone were allowed to have 100% of the data 100% of the time, the API servers could crash. In addition to rate limiting, you may also encounter an API with limited access to data, particularly if the data relates to business interests.
If you encounter an API requiring payment for extra access, you’ll need to determine if you can pay and how much the data is worth for your research. If you encounter an API with rate limiting, you’ll want to determine if a subset of the data is sufficient. If the API has rate limits, it may take you quite a long time to collect a representative sample, so be sure to estimate the level of effort you’re willing and able to expend.
APIs will often have a rate limit for all users, as it’s easier to manage. Twitter’s API was once limited in such a way; however, with the launch of the Streaming API, the usage changed. Twitter’s Streaming API provides a constant stream of data, while the REST API limits the number of requests you can make per 15-minute period. To help developers understand the rate limits, Twitter has published a chart.
For our exercise, we will use the endpoint called GET search/tweets. This query returns tweets containing a given search term. If you refer to the documentation, you will find the API returns JSON responses and is rate limited to 180 or 450 requests per 15 minutes, depending on whether you are querying the API as a user or as an application.
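A quick back-of-envelope calculation helps here: 180 requests per 15-minute window works out to one request every 5 seconds. Below is a minimal sketch of pacing requests accordingly; the throttled_requests helper and the make_request argument are our own hypothetical names, not part of the Twitter API.

```python
import time

REQUESTS_PER_WINDOW = 180   # user-auth limit for GET search/tweets
WINDOW_SECONDS = 15 * 60    # length of the rate-limit window

# 180 requests per 900 seconds is one request every 5 seconds.
DELAY = WINDOW_SECONDS / float(REQUESTS_PER_WINDOW)


def throttled_requests(queries, make_request):
    """Call make_request for each query, pausing so we stay under the limit."""
    results = []
    for query in queries:
        results.append(make_request(query))
        time.sleep(DELAY)
    return results
```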
When you save data files from API responses, you can save many files or you can write the data to one file. You can also save the tweet data to a database, as we covered in Chapter 6. No matter what way you choose to save your data, ensure you do so regularly so you don’t lose what you’ve already requested.
In Chapter 3, we processed one JSON file. If we maximize our API usage for every 15 minutes, we can collect 180 JSON files. If you hit the rate limit and need to optimize your requests to Twitter or other APIs, read the section on “Tips to Avoid Being Rate Limited” in Twitter’s “API Rate Limits” article.
So far, we have been talking about Twitter data freely available via its API. But maybe you want to know, how do I get all the data? In the case of Twitter, there are three access tiers you may have heard of before: the Firehose, the Gardenhose, and the Spritzer. The Spritzer is the free API. Table 13-1 describes the differences between these tiers.
| Feed type | Coverage | Availability | Cost |
|---|---|---|---|
| Firehose | All tweets | | $$$ |
| Gardenhose | 10% of all tweets | New access is no longer available | N/A |
| Spritzer | Up to 1% of all tweets | Available through the public API | Free |
You might look at these options and think, “I need the firehose, because I have to have it all!” But there are some things you should know before attempting to purchase access:
- The firehose is a lot of data. When handling massive data, you need to scale your data wrangling. It will require numerous engineers and servers to even begin to query the dataset the firehose provides.
- The firehose costs money—a few hundred thousand dollars a year. This doesn’t include the cost of the infrastructure you need to consume it (i.e., server space and database costs). Consuming the firehose is not something individuals do on their own—usually, a larger company or institution supports the costs.
- Most of what you really need, you can get from the Spritzer.
We will be using the Spritzer feed, which is the free public API from Twitter, where we can access Tweets within the bounds of the rate limits. To access this API, we will use API keys and tokens.
API keys and tokens are ways of identifying applications and users. Twitter API keys and tokens can be confusing. There are four components you need to be aware of:
- API key (the consumer key): identifies the application
- API secret (the consumer secret): acts as a password for the application
- Access token: identifies the user
- Access token secret: acts as a password for the user
The combination of these elements gives you access to the Twitter API. Not all APIs have two layers of identifiers and secrets, however. Twitter is a good “best case” (i.e., more secure) example. In some cases, APIs will have no key or only one key.
Continuing our child labor research, we will collect chatter around child labor on Twitter. Creating a Twitter API key is easy, but it requires a few steps:
1. If you don’t have a Twitter account, sign up.
2. Sign in to apps.twitter.com.
3. Click the “Create New App” button.
4. Give your application a name and description. For our example, let’s set the name to “Child labor chatter” and the description to “Pulling down chatter around child labor from Twitter.”
5. Give your application a website—this is the website hosting the app. The instructions say, “If you don’t have a URL yet, just put a placeholder here but remember to change it later.” We don’t have one, so we are also going to put the Twitter URL in the box. Make sure you include https, like this: https://twitter.com.
6. Agree to the developer agreement, and click “Create Twitter Application.”
After you create the application, you will be taken to the application management page. If you lose this page, you can find it by going back to the application landing page.
At this point, you need to create a token:
1. Click on the “Keys and Access Tokens” tab. (This is where you can reset your key as well as create an access token.)
2. Scroll to the bottom and click on “Create my access token.” Once you do this, the page will refresh with an update at the top. If you scroll to the bottom once again, you will see the access token.
Now you should have a consumer (API) key and a token. These are what ours look like:
Consumer key: 5Hqg6JTZ0cC89hUThySd5yZcL
Consumer secret: Ncp1oi5tUPbZF19Vdp8Jp8pNHBBfPdXGFtXqoKd6Cqn87xRj0c
Access token: 3272304896-ZTGUZZ6QsYKtZqXAVMLaJzR8qjrPW22iiu9ko4w
Access token secret: nsNY13aPGWdm2QcgOl0qwqs5bwLBZ1iUVS2OE34QsuR4C
Never share your keys or tokens with anyone! If you share your key with a friend, they can electronically represent you. If they abuse the system, you might lose access and be liable for their behavior.
Why did we publish ours? Well, for one, we generated new ones. In the process of generating new keys and tokens, the one included in this book was disabled—which is what you should do if you accidentally expose your key or token. If you need to create a new key, go to the “Keys and Access Tokens” tab and click “Regenerate.” This will generate a new API key and token.
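One common way to keep keys and tokens out of shared code is to read them from environment variables instead of hardcoding them. Here is a minimal sketch; the variable names (e.g., TWITTER_API_KEY) are our own choice, not anything Twitter requires.

```python
import os

# Set these in your shell before running the script, for example:
#   export TWITTER_API_KEY="your-key-here"
API_KEY = os.environ['TWITTER_API_KEY']
API_SECRET = os.environ['TWITTER_API_SECRET']
TOKEN_KEY = os.environ['TWITTER_TOKEN_KEY']
TOKEN_SECRET = os.environ['TWITTER_TOKEN_SECRET']
```

This way a script can be published or shared without exposing your credentials.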
Now that we have a key, let’s access the API!
With a set of keys, we can now start to access data from Twitter’s API. In this section, we will put together a simple script to pull data from the API by passing a search query. The script in this section is based on a snippet of Python code provided by Twitter as an example. This code uses Python OAuth2, which is a protocol for identifying and connecting securely when using APIs.
The current best practice for authentication is to use OAuth2. Some APIs might still use OAuth1, which will function differently and is a deprecated protocol. If you need to use OAuth1, you can use Requests-OAuthlib in conjunction with requests. When authenticating via an API, make sure to identify which protocol to use. If you use the wrong one, you will receive errors when trying to connect.
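If you do run into an OAuth1 API, Requests-OAuthlib can sign the requests for you. Here is a minimal sketch under the assumption that the endpoint uses OAuth1; the URL and credential values are placeholders.

```python
import requests
from requests_oauthlib import OAuth1

# Placeholder credentials; an OAuth1 API will issue its own set.
auth = OAuth1('client_key', 'client_secret',
              'resource_owner_key', 'resource_owner_secret')

# requests-oauthlib adds the signed OAuth1 headers to the request for us.
response = requests.get('https://api.example.com/v1/resource', auth=auth)
print(response.status_code)
```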
To start, we need to install Python OAuth2:
```
pip install oauth2
```
Open a new file and start by importing oauth2 and assigning your key variables:
```python
import oauth2

API_KEY = '5Hqg6JTZ0cC89hUThySd5yZcL'
API_SECRET = 'Ncp1oi5tUPbZF19Vdp8Jp8pNHBBfPdXGFtXqoKd6Cqn87xRj0c'
TOKEN_KEY = '3272304896-ZTGUZZ6QsYKtZqXAVMLaJzR8qjrPW22iiu9ko4w'
TOKEN_SECRET = 'nsNY13aPGWdm2QcgOl0qwqs5bwLBZ1iUVS2OE34QsuR4C'
```
Then add the function to create the OAuth connection:
```python
def oauth_req(url, key, secret, http_method="GET", post_body="", http_headers=None):
    consumer = oauth2.Consumer(key=API_KEY, secret=API_SECRET)
    token = oauth2.Token(key=key, secret=secret)
    client = oauth2.Client(consumer, token)
    resp, content = client.request(url, method=http_method, body=post_body,
                                   headers=http_headers)
    return content
```

- Establishes the consumer of the oauth2 object. The consumer is the owner of the keys. This line provides the consumer with the keys so it can properly identify via the API.
- Assigns the token to the oauth2 object.
- Creates the client, which consists of the consumer and token.
- Using the url, which is a function argument, executes the request using the OAuth2 client.
- Returns the content received from the connection.
Now we have a function that allows us to connect to the Twitter API. However, we need to define our URL and call the function. The Search API documentation tells us more about what requests we want to use. Using the web interface, we can see that if we search for #childlabor, we end up with the following URL: https://twitter.com/search?q=%23childlabor. The documentation instructs us to reformat the URL so we end up with the following: https://api.twitter.com/1.1/search/tweets.json?q=%23childlabor.
Then, we can add that URL as a variable and call the function using our previously defined variables:
```python
url = 'https://api.twitter.com/1.1/search/tweets.json?q=%23childlabor'
data = oauth_req(url, TOKEN_KEY, TOKEN_SECRET)
print(data)
```
When you run the script, you should see the data printed as a long JSON object. You may remember a JSON object looks like a Python dictionary, but if you were to rerun the script with print(type(data)), you would find out that the content is a string. At this point we could do one of two things: we could convert the data into a dictionary and start parsing it, or we could save the string to a file to parse later. To continue parsing the data in the script, add import json at the top of the
script. Then, at the bottom, load the string using json and output it:
```python
data = json.loads(data)
print(type(data))
```
The data variable will now return a Python dictionary. If you want to write the data to a file and parse it later, add the following code instead:
```python
with open('tweet_data.json', 'wb') as data_file:
    data_file.write(data)
```
Your final script should look like the following:
```python
import oauth2

API_KEY = '5Hqg6JTZ0cC89hUThySd5yZcL'
API_SECRET = 'Ncp1oi5tUPbZF19Vdp8Jp8pNHBBfPdXGFtXqoKd6Cqn87xRj0c'
TOKEN_KEY = '3272304896-ZTGUZZ6QsYKtZqXAVMLaJzR8qjrPW22iiu9ko4w'
TOKEN_SECRET = 'nsNY13aPGWdm2QcgOl0qwqs5bwLBZ1iUVS2OE34QsuR4C'


def oauth_req(url, key, secret, http_method="GET", post_body="", http_headers=None):
    consumer = oauth2.Consumer(key=API_KEY, secret=API_SECRET)
    token = oauth2.Token(key=key, secret=secret)
    client = oauth2.Client(consumer, token)
    resp, content = client.request(url, method=http_method, body=post_body,
                                   headers=http_headers)
    return content


url = 'https://api.twitter.com/1.1/search/tweets.json?q=%23childlabor'
data = oauth_req(url, TOKEN_KEY, TOKEN_SECRET)

with open("data/hashchildlabor.json", "w") as data_file:
    data_file.write(data)
```
From here you can refer back to the section “JSON Data” in Chapter 3 to parse the data.
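As a starting point, in the version 1.1 Search API the matching tweets are returned under a 'statuses' key in the JSON response. A first pass at parsing the saved file might look like the following sketch (adjust the keys if the response you saved is structured differently):

```python
import json

with open('data/hashchildlabor.json', 'r') as data_file:
    data = json.load(data_file)

# In the v1.1 Search API response, matching tweets sit under 'statuses';
# each tweet carries fields such as 'text' and a nested 'user' object.
for tweet in data.get('statuses', []):
    print(tweet['user']['screen_name'], ':', tweet['text'])
```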
Pulling a single data file from Twitter is not terribly useful, because it only returns about 15 tweets. We are looking to execute multiple queries in a row, so we can collect as many tweets as possible related to our topic. We are going to use another library to do some of the heavy lifting for us—Tweepy. Tweepy can help us manage a series of requests as well as OAuth using Twitter. Start by installing tweepy:
```
pip install tweepy
```
At the top of your script, import tweepy and set your keys again:
```python
import tweepy

API_KEY = '5Hqg6JTZ0cC89hUThySd5yZcL'
API_SECRET = 'Ncp1oi5tUPbZF19Vdp8Jp8pNHBBfPdXGFtXqoKd6Cqn87xRj0c'
TOKEN_KEY = '3272304896-ZTGUZZ6QsYKtZqXAVMLaJzR8qjrPW22iiu9ko4w'
TOKEN_SECRET = 'nsNY13aPGWdm2QcgOl0qwqs5bwLBZ1iUVS2OE34QsuR4C'
```
Then pass your API key and API secret to tweepy’s OAuthHandler object, which will manage the same OAuth protocol covered in the last example. Then set your access token:
```python
auth = tweepy.OAuthHandler(API_KEY, API_SECRET)
auth.set_access_token(TOKEN_KEY, TOKEN_SECRET)
```
Next, pass the authorization object you just created to tweepy.API:
```python
api = tweepy.API(auth)
```
The tweepy.API object can take a variety of arguments to give you customized control over how tweepy behaves when requesting data. You can directly add retries and delays between requests using parameters like retry_count=3, retry_delay=5. Another useful option is wait_on_rate_limit, which will wait until the rate limit has been lifted to make the next request. Details on all of these niceties and more are
included in the tweepy documentation.
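For instance, the options mentioned above can be combined when constructing the API object. This is only a sketch based on the parameters named here; check the tweepy documentation for the exact arguments your installed version supports.

```python
# Retry failed requests a few times, pause between retries, and wait out
# the rate limit instead of raising an error when it is hit.
api = tweepy.API(auth,
                 retry_count=3,
                 retry_delay=5,
                 wait_on_rate_limit=True)
```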
We want to create a connection to the Twitter API using tweepy.Cursor. We can then pass the cursor the API method to use, which is api.search, and the parameters associated with that method:
```python
query = '#childlabor'
cursor = tweepy.Cursor(api.search, q=query, lang="en")
```

- Creates the query variable
- Establishes the cursor with the query, and limits it to just the English language
While the term Cursor might not feel intuitive, it’s a common programming term in reference to database connections. Although an API is not a database, the class name Cursor was probably adopted from this usage. You can read more about cursors on Wikipedia.
According to tweepy’s documentation, cursor can return an iterator on a per-item or per-page level. You can also define limits to determine how many pages or items the cursor grabs. If you look at print(dir(cursor)), you’ll see there are three methods: ['items', 'iterator', 'pages']. A page returns a bunch of items, which are individual tweets from your query. For our needs, we are going to use pages.
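If you only want a fixed number of tweets rather than every page, the items method accepts a limit. A small sketch, assuming the cursor defined above:

```python
# Grab at most 50 individual tweets instead of paging through everything.
for tweet in cursor.items(50):
    print(tweet.text)
```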
Let’s iterate through the pages and save the data. Before we do that, we need to do two things:
1. Add import json to the top of the script.
2. Create a directory called data in the same directory as the script. To do this, run mkdir data on the command line.
Once you’ve done those two things, run the following code to iterate through and save the tweets:
```python
for page in cursor.pages():
    tweets = []
    for item in page:
        tweets.append(item._json)
    with open('data/hashchildlabor.json', 'wb') as outfile:
        json.dump(tweets, outfile)
```

- For each page returned in cursor.pages()…
- Creates an empty list to store tweets.
- For each item (or tweet) in a page…
- Extracts the JSON tweet data and saves it to the tweets list.
- Opens a file called hashchildlabor.json and saves the tweets.
You will notice not many tweets are being saved to the file. There are only 15 tweets per page, so we’ll need to figure out a way to get more data. Options include:
- Open a file and never close it, or open a file and append the information at the end. This will create one massive file.
- Save each page in its own file (you can use timestamps to ensure you have different filenames for each file).
- Create a new table in your database to save the tweets.
Creating one file is dangerous, because at any moment the process could fail and corrupt the data. Unless you have a small data pull (e.g., 1000 tweets) or are doing development testing, you should use one of the other options.
There are a couple of ways to save the data in a new file every time. The most common are building the filename from the current date and timestamp, or appending an incrementing number to the end of the filename, as sketched below.
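Here is a minimal sketch of both naming schemes; the helper names and the filename prefix are our own.

```python
from datetime import datetime


def timestamped_filename(prefix='data/hashchildlabor'):
    """Build a unique filename from the current date and time."""
    stamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    return '{}_{}.json'.format(prefix, stamp)


def numbered_filename(counter, prefix='data/hashchildlabor'):
    """Build a filename by appending an incrementing page number."""
    return '{}_{}.json'.format(prefix, counter)
```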
We’ll go ahead and add our tweets to our simple database. To do so, we’ll use this function:
```python
def store_tweet(item):
    db = dataset.connect('sqlite:///data_wrangling.db')
    table = db['tweets']
    item_json = item._json.copy()
    for k, v in item_json.items():
        if isinstance(v, dict):
            item_json[k] = str(v)
    table.insert(item_json)
```

- Creates or accesses a new table called tweets.
- Tests if there are any dictionaries in our tweet item values. Since SQLite doesn’t support saving Python dictionaries, we need to convert dictionaries into strings.
- Inserts the cleaned JSON item.
We will also need to add dataset to our imports, call this function where we were previously storing the pages, and make sure we iterate over every tweet. Your final script should look like the following:
```python
import json
import tweepy
import dataset

API_KEY = '5Hqg6JTZ0cC89hUThySd5yZcL'
API_SECRET = 'Ncp1oi5tUPbZF19Vdp8Jp8pNHBBfPdXGFtXqoKd6Cqn87xRj0c'
TOKEN_KEY = '3272304896-ZTGUZZ6QsYKtZqXAVMLaJzR8qjrPW22iiu9ko4w'
TOKEN_SECRET = 'nsNY13aPGWdm2QcgOl0qwqs5bwLBZ1iUVS2OE34QsuR4C'


def store_tweet(item):
    db = dataset.connect('sqlite:///data_wrangling.db')
    table = db['tweets']
    item_json = item._json.copy()
    for k, v in item_json.items():
        if isinstance(v, dict):
            item_json[k] = str(v)
    table.insert(item_json)


auth = tweepy.OAuthHandler(API_KEY, API_SECRET)
auth.set_access_token(TOKEN_KEY, TOKEN_SECRET)
api = tweepy.API(auth)

query = '#childlabor'
cursor = tweepy.Cursor(api.search, q=query, lang="en")

for page in cursor.pages():
    for item in page:
        store_tweet(item)
```
Early in this chapter, we mentioned there are two types of Twitter APIs available: REST and Streaming.
How does the Streaming API differ from the REST API? Here’s a brief rundown:
- The data is live, while the REST API returns only data that has already been tweeted.
- Streaming APIs are less common, but will become more available in the future as more live data is generated and exposed.
- Because live, up-to-date data interests so many people, you can find lots of resources and help online.
Let’s create a script to collect data from the Streaming API. This script builds on all the concepts we’ve covered in this chapter. We’ll first add the basics—imports and keys:
```python
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler, Stream

API_KEY = '5Hqg6JTZ0cC89hUThySd5yZcL'
API_SECRET = 'Ncp1oi5tUPbZF19Vdp8Jp8pNHBBfPdXGFtXqoKd6Cqn87xRj0c'
TOKEN_KEY = '3272304896-ZTGUZZ6QsYKtZqXAVMLaJzR8qjrPW22iiu9ko4w'
TOKEN_SECRET = 'nsNY13aPGWdm2QcgOl0qwqs5bwLBZ1iUVS2OE34QsuR4C'
```

- Imports StreamListener, which creates a streaming session and listens for messages
- Imports OAuthHandler, which we used before, and Stream, which actually handles the Twitter stream
In this script, we are doing our import statements slightly differently than we did in the last script. Both of these are valid approaches and a matter of preference. Here’s a quick comparison of the two approaches:
```python
import tweepy
...
auth = tweepy.OAuthHandler(API_KEY, API_SECRET)
```
```python
from tweepy import OAuthHandler
...
auth = OAuthHandler(API_KEY, API_SECRET)
```
Usually the first approach is used when the library is not used much in the script; it is also good when you have a longer piece of code and want to be explicit. However, when the library is used a lot, typing the full library name each time gets tiresome, and if the library is the cornerstone of the script, it should be fairly obvious to readers which modules or classes are imported from it.
Now we are going to subclass (a concept you learned about in Chapter 12) the StreamListener class we imported because we want to override the on_data method. To do this, we redefine it in our new class, which we call Listener. When there is data, we want to see it in our terminal, so we are going to add a print statement:
```python
class Listener(StreamListener):

    def on_data(self, data):
        print(data)
        return True
```

- Subclasses StreamListener.
- Defines the on_data method.
- Outputs tweets.
- Returns True. StreamListener has an on_data method, which also returns True. As we’re subclassing and redefining it, we must repeat the return value in the subclassed method.
Next, add your authentication handlers:
```python
auth = OAuthHandler(API_KEY, API_SECRET)
auth.set_access_token(TOKEN_KEY, TOKEN_SECRET)
```
Finally, pass the Listener and auth to the Stream and start filtering with a search term. In this case, we are going to look at child labor because it has more traffic than #childlabor:
```python
stream = Stream(auth, Listener())
stream.filter(track=['child labor'])
```

- Sets up the stream by passing auth and Listener as arguments
- Filters the stream and returns only items with the terms child and labor
Your final script should look like this:
```python
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler, Stream

API_KEY = '5Hqg6JTZ0cC89hUThySd5yZcL'
API_SECRET = 'Ncp1oi5tUPbZF19Vdp8Jp8pNHBBfPdXGFtXqoKd6Cqn87xRj0c'
TOKEN_KEY = '3272304896-ZTGUZZ6QsYKtZqXAVMLaJzR8qjrPW22iiu9ko4w'
TOKEN_SECRET = 'nsNY13aPGWdm2QcgOl0qwqs5bwLBZ1iUVS2OE34QsuR4C'


class Listener(StreamListener):

    def on_data(self, data):
        print(data)
        return True


auth = OAuthHandler(API_KEY, API_SECRET)
auth.set_access_token(TOKEN_KEY, TOKEN_SECRET)

stream = Stream(auth, Listener())
stream.filter(track=['child labor'])
```
From here, you would add a way to save tweets to your database, file, or other storage using your on_data method as we did earlier in the chapter.
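For example, here is a minimal sketch of an on_data method that appends each incoming tweet (which arrives as a JSON string) to a file. The filename is our own choice, and you could just as easily call a function like store_tweet from earlier in the chapter instead.

```python
class Listener(StreamListener):

    def on_data(self, data):
        # Each call receives one tweet as a raw JSON string; append it to a file
        # so the stream can run for a long time without losing anything.
        with open('data/stream_childlabor.json', 'a') as outfile:
            outfile.write(data)
        return True
```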
Being able to interact with application programming interfaces is an important part of data wrangling. In this chapter, we covered some of the API basics (see Table 13-2 for a summary) and processed data from the Twitter API.
| Concept | Usage |
|---|---|
| REST APIs (vs. streaming) | Return data and expose static endpoints |
| Streaming APIs (vs. REST) | Return live data to query |
| OAuth and OAuth2 | Authenticate given a series of keys and tokens |
| Tiered data volumes | Various layers of rate limits/availability of data; some cost $ |
| Keys and tokens | Unique IDs and secrets to identify the user and application |
We reused many Python concepts we already knew and learned a few new ones in this chapter. The first was the use of tweepy, a library that handles interactions with the Twitter API. You also learned about authentication and the OAuth protocols.
Building on this work with APIs, Chapter 14 covers techniques that let you run your API scripts automatically, even while you are away.