We want to meet our secret informant at a good restaurant that's a reasonable distance from our base. In order to locate a good restaurant, we need to gather some additional information. In this case, good means a passing grade from the health inspectors.
Before we can even have a meeting, we'll need to use basic espionage skills to locate the health code survey results for local restaurants.
We'll create a Python application that combines several of these techniques to sort through the results. We'll perform the following steps:

1. Locate the health inspection data, either as clean JSON or as HTML that must be scraped.
2. Parse the results into tidy Python objects using the SimpleNamespace class.
3. Use the haversine() function to compute the distance from our base.
4. Filter the restaurants by distance and by their inspection results.

In many cities, the health code data is available online. A careful search will reveal a useful dataset. In other cities, the health inspection data isn't readily available online. We might have to dig quite deep to track down even a few restaurants near our base of operations.
Some cities use Yelp to publicize restaurant health code inspection data. We can read about the Yelp API for searching for restaurants at the following link:
http://www.yelp.com/developers/documentation
We might also find some useful data on InfoChimps at http://www.infochimps.com/tags/restaurant.
One complexity we often encounter is the use of HTML-based APIs for this kind of information. This is not intentional obfuscation, but the use of HTML complicates analysis of the data. Parsing HTML to extract meaningful information isn't easy; we'll need an extra library to handle this.
We'll look at two approaches: good, clean data and more complex HTML data parsing. In both cases, we need to create a Python object that acts as a container for a collection of attributes. First, we'll divert to look at the SimpleNamespace class. Then, we'll use this to collect information.
We have a wide variety of ways to define our own Python objects. We can use a core built-in type such as dict to define an object that has a collection of attribute values. When looking at information for a restaurant, we could use something like this:
some_place = { 'name': 'Secret Base', 'address': '333 Waterside Drive' }

Since this is a mutable object, we can add attribute values and change the values of the existing attributes. The syntax is a bit clunky, though. Here's what an update to this object looks like:
some_place['lat']= 36.844305
some_place['lng']= -76.29112
The extra [] brackets and '' characters seem needless. We'd like to have a notation that's a little cleaner than this very general key-value syntax used for dictionaries.
One common solution is to use a proper class definition. The syntax looks like this:
class Restaurant:
    def __init__(self, name, address):
        self.name= name
        self.address= address

We've defined a class with an initialization method, __init__(). The name of the initialization method is special, and only this name can be used. When the object is built, the initialization method is evaluated to assign initial values to the attributes of the object.
This allows us to create an object more succinctly:
some_place= Restaurant( name='Secret Base', address='333 Waterside Drive' )
We've used explicit keyword arguments. The use of name= and address= isn't required. However, as class definitions become more complex, it's often more flexible and clearer to use keyword argument values.
We can update the object nicely too, as follows:
some_place.lat= 36.844305
some_place.lng= -76.29112
This works out best when we have a lot of unique processing that is bound to each object. In this case, we don't actually have any processing to associate with the attributes; we just want to collect those attributes in a tidy capsule. The formal class definition is too much overhead for such a simple problem.
Python also gives us a very flexible structure called a namespace. This is a mutable object that we can access using simple attribute names, as shown in the following code:
from types import SimpleNamespace

some_place= SimpleNamespace( name='Secret Base', address='333 Waterside Drive' )
The syntax to create a namespace must use keyword arguments (name='The Name'). Once we've created this object, we can update it using a pleasant attribute access, as shown in the following snippet:
some_place.lat= 36.844305
some_place.lng= -76.29112
The SimpleNamespace class gives us a way to build an object that contains a number of individual attribute values.
We can also create a namespace from a dictionary using Python's ** notation. Here's an example:
>>> SimpleNamespace( **{'name': 'Secret Base', 'address': '333 Waterside Drive'} )
namespace(address='333 Waterside Drive', name='Secret Base')

The ** notation tells Python that a dictionary object contains keyword arguments for the function. The dictionary keys are the parameter names. This allows us to build a dictionary object and then use it as the arguments to a function.
Recall that JSON tends to encode complex data structures as a dictionary. Using this ** technique, we can transform a JSON dictionary into SimpleNamespace, and replace the clunky object['key'] notation with a cleaner object.key notation.
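For example, here's a small sketch of that transformation (the JSON text is invented for illustration):

import json
from types import SimpleNamespace

document = '{"name": "Secret Base", "address": "333 Waterside Drive"}'
place = SimpleNamespace( **json.loads(document) )
print( place.name, place.address )  # Secret Base 333 Waterside Drive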
In some cases, the data we want is tied up in HTML websites. The City of Norfolk, for example, relies on the State of Virginia's VDH health portal to store its restaurant health code inspection data.
In order to make sense of the intelligence encoded in the HTML notation on the WWW, we need to be able to parse the HTML markup that surrounds the data. Our job is greatly simplified by the use of special higher-powered weaponry; in this case, BeautifulSoup.
Start with https://pypi.python.org/pypi/beautifulsoup4/4.3.2 or http://www.crummy.com/software/BeautifulSoup/.
If we have Easy Install (or PIP), we can use these tools to install BeautifulSoup. Back in Chapter 1, Our Espionage Toolkit, we should have installed one (or both) of these tools to install more tools.
We can use Easy Install to install BeautifulSoup like this:
sudo easy_install-3.3 beautifulsoup4
Mac OS X and GNU/Linux users will need to use the sudo command. Windows users won't use the sudo command.
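If we prefer pip, the equivalent command (assuming a Python 3 pip is on the PATH) looks like this:

sudo pip3 install beautifulsoup4

Again, Windows users would omit the sudo command.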
Once we have BeautifulSoup, we can use it to parse the HTML code looking for specific facts buried in an otherwise cryptic jumble of HTML tags.
Before we can go on, you'll need to read the quickstart documentation and bring yourself up to speed on BeautifulSoup. Once you've done that, we'll move to extracting data from HTML web pages.
Start with http://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start.
An alternative tool is scrapy. For information see http://scrapy.org. Also, read Instant Scrapy Web Mining and Scraping, Travis Briggs, Packt Publishing, for details on using this tool. Unfortunately, as of this writing, scrapy is focused on Python 2, not Python 3.
In the case of VDH health data for the City of Norfolk, the HTML scraping is reasonably simple. We can leverage the strengths of BeautifulSoup to dig into the HTML page very nicely.
Once we've created a BeautifulSoup object from the HTML page, we will have an elegant technique to navigate down through the hierarchy of the HTML tags. Each HTML tag name (html, body, and so on) is also a BeautifulSoup query that locates the first instance of that tag.
An expression such as soup.html.body.table can locate the first <table> in the HTML <body> tag. In the case of the VDH restaurant data, that's precisely the data we want.
Once we've found the table, we need to extract the rows. The HTML tag for each row is <tr> and we can use the BeautifulSoup table.find_all("tr") expression to locate all rows within a given <table> tag. Each tag's text is an attribute, .text. If the tag has attributes, we can treat the tag as if it's a dictionary to extract the attribute values.
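As a small illustration of these ideas (the HTML fragment here is made up, not taken from the VDH site):

from bs4 import BeautifulSoup

html = """<html><body><table>
<tr><td><a href="detail?RestrictToCategory=ABC123">Secret Cafe</a></td></tr>
</table></body></html>"""

soup = BeautifulSoup( html )
table = soup.html.body.table          # first <table> in the <body>
for row in table.find_all("tr"):      # every <tr> in that table
    for td in row.find_all("td"):
        print( td.text )              # the text inside the tag
        if td.a:
            print( td.a["href"] )     # dictionary-style attribute access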
We'll break down the processing of the VDH restaurant data into two parts: the web services query that builds Soup from HTML and the HTML parsing to gather restaurant information.
Here's the first part, which is getting the raw BeautifulSoup object:
scheme_host= "http://healthspace.com"
def get_food_list_by_name():
path= "/Clients/VDH/Norfolk/Norolk_Website.nsf/Food-List-ByName"
form = {
"OpenView": "",
"RestrictToCategory": "FAA4E68B1BBBB48F008D02BF09DD656F",
"count": "400",
"start": "1",
}
query= urllib.parse.urlencode( form )
with urllib.request.urlopen(scheme_host + path + "?" + query) as data:
soup= BeautifulSoup( data.read() )
return soupThis repeats the web services queries we've seen before. We've separated three things here: the scheme_host string, the path string, and query. The reason for this is that our overall script will be using the scheme_host with other paths. And we'll be plugging in lots of different query data.
For this basic food_list_by_name query, we've built a form that will get 400 restaurant inspections. The RestrictToCategory field in the form has a magical key that we must provide to get the Norfolk restaurants. We found this via a basic web espionage technique: we poked around on the website and checked the URLs used when we clicked on each of the links. We also used the Developer mode of Safari to explore the page source.
In the long run, we want all of the inspections. To get started, we've limited ourselves to 400 so that we don't spend too long waiting to run a test of our script.
The response object was used by BeautifulSoup to create an internal representation of the web page. We assigned this to the soup variable and returned it as the result of the function.
In addition to returning the soup object, it can also be instructive to print it. It's quite a big pile of HTML. We'll need to parse this to get the interesting details away from the markup.
Once we have a page of HTML information parsed into a BeautifulSoup object, we can examine the details of that page. Here's a function that will locate the table of restaurant inspection details buried inside the page.
We'll use a generator function to yield each individual row of the table, as shown in the following code:
def food_table_iter( soup ):
    """Columns are 'Name', '', 'Facility Location', 'Last Inspection',
    Plus an unnamed column with a RestrictToCategory key
    """
    table= soup.html.body.table
    for row in table.find_all("tr"):
        columns = [ td.text.strip() for td in row.find_all("td") ]
        for td in row.find_all("td"):
            if td.a:
                url= urllib.parse.urlparse( td.a["href"] )
                form= urllib.parse.parse_qs( url.query )
                columns.append( form['RestrictToCategory'][0] )
        yield columns

Notice that this function begins with a triple-quoted string. This is a docstring and it provides documentation about the function. Good Python style insists on a docstring in every function. The Python help system will display the docstrings for functions, modules, and classes. We've omitted them to save space. Here, we included it because the results of this particular iterator can be quite confusing.
This function requires a parsed Soup object. The function uses simple tag navigation to locate the first <table> tag in the HTML <body> tag. It then uses the table's find_all() method to locate all of the rows within that table.
For each row, there are two pieces of processing. First, a list comprehension is used to find all the <td> tags within that row. Each <td> tag's text is stripped of excess white space and the collection forms a list of cell values. In some cases, this kind of processing is sufficient.
In this case, however, we also need to decode an HTML <a> tag, which has a reference to the details for a given restaurant. We use a second find_all("td") expression to examine each column again. Within each column, we check for the presence of an <a> tag using a simple if td.a: condition. If there is an <a> tag, we can get the value of the href attribute on that tag. When looking at the source HTML, this is the value inside the quotes of <a href="">.
This value of an HTML href attribute is a URL. We don't actually need the whole URL. We only need the query string within the URL. We've used the urllib.parse.urlparse() function to extract the various bits and pieces of the URL. The value of the url.query attribute is just the query string, after the ?.
It turns out, we don't even want the entire query string; we only want the value for the key RestrictToCategory. We can parse the query string with urllib.parse.parse_qs() to get a form-like dictionary, which we assigned to the variable form. This function is the inverse of urllib.parse.urlencode(). The dictionary built by the parse_qs() function associates each key with a list of values. We only want the first value, so we use form['RestrictToCategory'][0] to get the key required for a restaurant.
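Here's a quick sketch of this decoding, using a made-up href value of the same general shape:

import urllib.parse

href = "Food-FacilityHistory?OpenView&RestrictToCategory=ABC123"
url = urllib.parse.urlparse( href )
form = urllib.parse.parse_qs( url.query )
print( form['RestrictToCategory'][0] )  # ABC123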
Since this food_table_iter() function is a generator, it must be used with a for statement or another generator function. We can use this function with a for statement as follows:
for row in food_table_iter(get_food_list_by_name()):
    print(row)

This prints each row of data from the HTML table. It starts like this:

['Name', '', 'Facility Location', 'Last Inspection']
["Todd's Refresher", '', '150 W. Main St #100', '6-May-2014', '43F6BE8576FFC376852574CF005E3FC0']
["'Chick-fil-A", '', '1205 N Military Highway', '13-Jun-2014', '5BDECD68B879FA8C8525784E005B9926']
This goes on for 400 locations.
The results are unsatisfying because each row is a flat list of attributes. The name is in row[0] and the address in row[2]. This kind of reference to columns by position can be obscure. It would be much nicer to have named attributes. If we convert the results to a SimpleNamespace object, we can then use the row.name and row.address syntax.
We really want to work with an object that has easy-to-remember attribute names and not a sequence of anonymous column names. Here's a generator function that will build a SimpleNamespace object from a sequence of values produced by a function such as the food_table_iter() function:
def food_row_iter( table_iter ):
    heading= next(table_iter)
    assert ['Name', '', 'Facility Location', 'Last Inspection'] == heading
    for row in table_iter:
        yield SimpleNamespace(
            name= row[0], address= row[2], last_inspection= row[3],
            category= row[4]
        )

This function's argument must be an iterator like food_table_iter(get_food_list_by_name()). The function uses next(table_iter) to grab the first row, since that's only going to be a bunch of column titles. We'll assert that the column titles really are the standard column titles in the VDH data. If the assertion ever fails, it's a hint that the VDH web data has changed.
For every row after the first row, we build a SimpleNamespace object by taking the specific columns from each row and assigning them nice names.
We can use this function as follows:
soup= get_food_list_by_name()
raw_columns= food_table_iter(soup)
for business in food_row_iter( raw_columns ):
    print( business.name, business.address )

The processing can now use nice attribute names, for example, business.name, to refer to the data we extracted from the HTML page. This makes the rest of the programming meaningful and clear.
What's also important is that we've combined two generator functions. The food_table_iter() function will yield small lists built from HTML table rows. The food_row_iter() function expects a sequence of lists that can be iterated, and will build SimpleNamespace objects from that sequence of lists. This defines a kind of composite processing pipeline built from smaller steps. Each row of the HTML table that starts in food_table_iter() is touched by food_row_iter() and winds up being processed by the print() function.
The Norfolk data we've gotten so far is only a list of restaurants. We still neither have inspection scores, nor do we have useful geocodes. We need to add these details to each business that we found in the initial list. This means making two more RESTful web services requests for each individual business.
The geocoding is relatively easy. We can use a simple request and update the SimpleNamespace object that we're using to model each business. The function looks like this:
import json

def geocode_detail( business ):
    scheme_netloc_path = "https://maps.googleapis.com/maps/api/geocode/json"
    form = {
        "address": business.address + ", Norfolk, VA",
        "sensor": "false",
        #"key": An API Key, if you signed up for one,
    }
    query = urllib.parse.urlencode( form, safe="," )
    with urllib.request.urlopen( scheme_netloc_path+"?"+query ) as geocode:
        response= json.loads( geocode.read().decode("UTF-8") )
    lat_lon = response['results'][0]['geometry']['location']
    business.latitude= lat_lon['lat']
    business.longitude= lat_lon['lng']
    return business

We're using the Google geocoding API that we used earlier. We've made a few modifications. First, the data in the form variable has the business.address attribute from the SimpleNamespace object. We've had to add the city and state information, since that's not provided in the VDH address.
As with previous examples, we took only the first location of the response list with response['results'][0]['geometry']['location'], which is a small dictionary object with two keys: lat and lng. We've updated the namespace that represents our business by setting two more attributes, business.latitude and business.longitude, from the values in this small dictionary.
The namespace object is mutable, so this function will update the object referred to by the variable business. We also returned the object. The return statement is not necessary, but sometimes it's handy because it allows us to create a fluent API for a sequence of functions.
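For example, because geocode_detail() returns the object it has updated, we can build and enrich a namespace in a single expression; a small sketch using the functions defined above:

base = geocode_detail( SimpleNamespace( address='333 Waterside Drive' ) )
print( base.latitude, base.longitude )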
The bad news is that getting health scoring details requires yet more HTML parsing. The good news is that the details are placed in an easy-to-locate HTML <table> tag. We'll break this process into two functions: a web service request to get the BeautifulSoup object and more HTML parsing to explore that Soup.
Here's the URL request. This requires the category key that we parsed from the <a> tag in the food_table_iter() function shown previously:
def get_food_facility_history( category_key ):
    url_detail= "/Clients/VDH/Norfolk/Norolk_Website.nsf/Food-FacilityHistory"
    form = {
        "OpenView": "",
        "RestrictToCategory": category_key
    }
    query= urllib.parse.urlencode( form )
    with urllib.request.urlopen(scheme_host + url_detail + "?" + query) as data:
        soup= BeautifulSoup( data.read() )
    return soup

This request, like other HTML requests, builds a query string, opens the URL response object, and parses it to create a BeautifulSoup object. We're only interested in the soup instance. We return this value for use with HTML processing.
Also, note that part of the path, Norolk_Website.nsf, has a spelling error. Secret agents in the field are responsible for finding information in spite of these kinds of problems.
We'll use this in a function that updates the SimpleNamespace object that we're using to model each business. The data extraction function looks like this:
def inspection_detail( business ):
    soup= get_food_facility_history( business.category )
    business.name2= soup.body.h2.text.strip()
    table= soup.body.table
    for row in table.find_all("tr"):
        column = list( row.find_all( "td" ) )
        name= column[0].text.strip()
        value= column[1].text.strip()
        setattr( business, vdh_detail_translate[name], value )
    return business

This function gets the BeautifulSoup object for a specific business. Given that Soup, it navigates to the first <h2> tag within the <body> tag. This should repeat the business name. We've updated the business object with this second copy of the name.
This function also navigates to the first <table> tag within the <body> tag via the soup.body.table expression. The HTML table has two columns: the left column contains a label and the right column contains the value.
To parse this kind of table, we stepped through each row using table.find_all("tr"). For each row, we built a list from row.find_all( "td" ). The first item in this list is the <td> tag that contains a name. The second item is the <td> tag that contains a value.
We can use a dictionary, vdh_detail_translate, to translate the names in the left column to a better looking Python attribute name, as shown in the following code:
vdh_detail_translate = {
    'Phone Number:': 'phone_number',
    'Facility Type:': 'facility_type',
    '# of Priority Foundation Items on Last Inspection:':
        'priority_foundation_items',
    '# of Priority Items on Last Inspection:': 'priority_items',
    '# of Core Items on Last Inspection:': 'core_items',
    '# of Critical Violations on Last Inspection:': 'critical_items',
    '# of Non-Critical Violations on Last Inspection:': 'non_critical_items',
}

Using a dictionary like this allows us to use the expression vdh_detail_translate[name] to locate a pleasant attribute name (such as core_items) instead of the long string that's displayed in the original HTML.
We need to look closely at the use of the setattr() function that's used to update the business namespace:
setattr( business, vdh_detail_translate[name], value )
In other functions, we've used a simple assignment statement such as business.attribute= value to set an attribute of the namespace object. Implicitly, that simple assignment statement means setattr( business, 'attribute', value ). We can think of setattr(object, attribute_string, value) as the general mechanism that underlies Python's simple variable.attribute= value assignment statement.
In this function, we can't use a simple assignment statement, because the attribute name is a string that's looked up via a translation. We can use the setattr() function to update the business object using the attribute name string computed from vdh_detail_translate[name].
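As a minimal illustration (the attribute name and value here are invented), these two statements have the same effect:

from types import SimpleNamespace

business = SimpleNamespace()
business.phone_number = '757-555-1212'                # name fixed in the source code
setattr( business, 'phone_number', '757-555-1212' )   # name computed at runtime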
We can now look at the real question: finding high-quality restaurants. We can build a composite function that combines our previous functions. This can become a generator function that yields all of the details in a sequence of namespace objects, as shown in the following code:
def choice_iter():
    base= SimpleNamespace( address= '333 Waterside Drive' )
    geocode_detail( base )
    print( base )  # latitude= 36.844305, longitude= -76.29111999999999
    soup= get_food_list_by_name()
    for row in food_row_iter( food_table_iter( soup ) ):
        geocode_detail( row )
        inspection_detail( row )
        row.distance= haversine(
            (row.latitude, row.longitude),
            (base.latitude, base.longitude) )
        yield row

This will build a small object, base, to describe our base. The object will start with just the address attribute. After we apply the geocode_detail() function, it will also have a latitude and longitude.
The print() function will produce a line that looks like this:
namespace(address='333 Waterside Drive', latitude=36.844305, longitude=-76.29111999999999)
The get_food_list_by_name() function will get a batch of restaurants. We use food_table_iter() to get the HTML table, and food_row_iter() to build individual SimpleNamespace objects from the HTML table. We then do some updates on each of those SimpleNamespace objects to provide restaurant inspection results and geocode information. We update the object yet again to add the distance from our base to the restaurant.
Finally, we yield the richly detailed namespace object that represents everything we need to know about a business.
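The haversine() function is assumed to be available from our earlier work on distance calculations. As a reminder, a minimal sketch (assuming (latitude, longitude) pairs in degrees and a result in statute miles) might look like this:

from math import radians, sin, cos, sqrt, asin

MI= 3959  # approximate mean radius of the earth in statute miles

def haversine( point_1, point_2, R=MI ):
    """Great-circle distance between two (latitude, longitude) pairs."""
    lat_1, lon_1 = point_1
    lat_2, lon_2 = point_2
    d_lat = radians(lat_2 - lat_1)
    d_lon = radians(lon_2 - lon_1)
    a = sin(d_lat/2)**2 + cos(radians(lat_1))*cos(radians(lat_2))*sin(d_lon/2)**2
    return R * 2 * asin(sqrt(a))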
Given this sequence of objects, we can apply some filters to exclude places over .75 miles away or with more than one problem reported:
for business in choice_iter():
    if business.distance > .75: continue
    if int(business.priority_foundation_items) > 1: continue
    if int(business.priority_items) > 1: continue
    if int(business.core_items) > 1: continue
    print( business )

This script will apply four different filters to each response. If the business, for example, is too far away, the continue statement will end the processing of this item: the for statement will advance to the next. If the business has too many items, the continue statements will reject this business and advance to the next item. Only a business that passes all four tests will be printed. Note that the item counts scraped from the HTML are strings, so we convert them with int() before comparing.
Note that we've inefficiently processed each business through the geocode_detail() and inspection_detail() functions. A more efficient algorithm would apply the distance filter early in the processing. If we immediately reject places that are too far away, we will only need to get detailed restaurant health data for places that are close enough.
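As a sketch of that improvement (the function name nearby_choice_iter() and the limit parameter are ours, not part of the original script), we might compute the distance immediately after geocoding and skip the inspection_detail() request for anything too far away:

def nearby_choice_iter( limit=0.75 ):
    # Same pipeline, but the distance filter runs before the second web request.
    base= SimpleNamespace( address= '333 Waterside Drive' )
    geocode_detail( base )
    soup= get_food_list_by_name()
    for row in food_row_iter( food_table_iter( soup ) ):
        geocode_detail( row )
        row.distance= haversine(
            (row.latitude, row.longitude),
            (base.latitude, base.longitude) )
        if row.distance > limit:
            continue  # too far away: don't bother fetching inspection details
        inspection_detail( row )
        yield row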
The important thing about this sequence of examples is that we integrated data from two different web services and folded them in our own value-added intelligence processing.
A good example of a clean data portal is the City of Chicago. We can get the restaurant inspection data with a simple URL:
https://data.cityofchicago.org/api/views/4ijn-s7e5/rows.json?accessType=DOWNLOAD
This will download all the restaurant inspection information in a tidy, easy-to-parse JSON document. The only problem is the size. It has over 83,000 inspections and takes a very long time to download. If we apply a filter (for instance, only inspections done this year), we can cut the document down to a manageable size. More details on the various kinds of filters supported can be found at http://dev.socrata.com/docs/queries.html.
There's a lot of sophistication available. We'll define a simple filter based on the inspection date to limit ourselves to a subset of the available restaurant inspections.
A function to get the data looks like this:
def get_chicago_json():
    scheme_netloc_path= "https://data.cityofchicago.org/api/views/4ijn-s7e5/rows.json"
    form = {
        "accessType": "DOWNLOAD",
        "$where": "inspection_date>2014-01-01",
    }
    query= urllib.parse.urlencode(form)
    with urllib.request.urlopen( scheme_netloc_path+"?"+query ) as data:
        with open("chicago_data.json","w") as output:
            output.write( data.read().decode("UTF-8") )

The scheme_netloc_path variable includes two interesting details in the path: 4ijn-s7e5 is the internal identity of the dataset we're looking for, and rows.json specifies the format we want the data in. Note that the response body arrives as bytes, so we decode it to text before writing it to the file.
The form we built includes a $where clause that will cut down on the volume of data to just the recent inspection reports. The Socrata API pages show us that we have a great deal of flexibility here.
As with other web services requests, we created a query and made the request using the urllib.request.urlopen() function. We opened an output file named chicago_data.json and wrote the document to that file for further processing. This saves us from having to retrieve the data repeatedly since it doesn't change too quickly.
We've done the processing via nested with statements to be assured that the files are closed and the network resources are properly released.
The JSON document contains lots of individual dict objects. While a dict is a handy general-purpose structure, the syntax is a bit clunky. Having to write object['some_key'] is awkward; it's nicer to work with SimpleNamespace objects and their pleasant attribute names, such as object.some_key.
Here's a function that will iterate through the massive JSON document with all of the inspection details:
def food_row_iter():
    with open( "chicago_data.json", encoding="UTF-8" ) as data_file:
        inspections = json.load( data_file )
    headings = [ item['fieldName']
        for item in inspections["meta"]["view"]["columns"] ]
    print( headings )
    for row in inspections["data"]:
        data= SimpleNamespace(
            **dict( zip( headings, row ) )
        )
        yield data

We've built a SimpleNamespace object from each individual row that was in the source data. The JSON document's data, in inspections["data"], is a list of lists. It's rather hard to interpret because we need to know the position of each relevant field.
We created a list of headings based on the field names we found in inspections["meta"]["view"]["columns"]. The field names seem to all be valid Python variable names and will make good Python attribute names in a SimpleNamespace object.
Given this list of headings, we can then use the zip() function to interleave headings and data from each row that we find. This sequence of two-tuples can be used to create a dictionary by employing dict( zip( headings, row ) ). The dictionary can then be used to build the SimpleNamespace object.
The ** syntax specifies that the items in the dictionary will become individual keyword parameters for SimpleNamespace. This will elegantly transform a dictionary such as {'zip': '60608', 'results': 'Fail', 'city': 'CHICAGO', ... } to a SimpleNamespace object as if we had written SimpleNamespace( zip='60608', results='Fail', city='CHICAGO', ... ).
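As a tiny worked example (the headings and row values here are invented for illustration, not taken from the real dataset):

from types import SimpleNamespace

headings = ['dba_name', 'results', 'zip']
row = ['CAFE EXAMPLE', 'Pass', '60608']

pairs = list( zip( headings, row ) )
# [('dba_name', 'CAFE EXAMPLE'), ('results', 'Pass'), ('zip', '60608')]

record = SimpleNamespace( **dict( pairs ) )
print( record.dba_name, record.results )  # CAFE EXAMPLE Pass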
Once we have a sequence of SimpleNamespace objects, we can do some minor updates to make them easier to work with. Here's a function that makes a few tweaks to each object:
def parse_details( business ):
    business.latitude= float(business.latitude)
    business.longitude= float(business.longitude)
    if business.violations is None:
        business.details = []
    else:
        business.details = [ v.strip() for v in business.violations.split("|") ]
    return business

We've converted the longitude and latitude values from strings to float numbers. We need to do this in order to properly use the haversine() function to compute distance from our secret base. We've also split the business.violations value into a list of individual detailed violations. It's not clear what we'd do with this, but it might be helpful in understanding the business.results value.
We can combine the processing into a function that's very similar to the choice_iter() function shown previously in the Combining the pieces and parts section. The idea is to create code that looks similar but starts with different source data.
This will iterate through the restaurant choices, depending on having SimpleNamespace objects that have been updated:
def choice_iter():
    base= SimpleNamespace( address="3420 W GRACE ST",
        city= "CHICAGO", state="IL", zip="60618",
        latitude=41.9503, longitude=-87.7138 )
    for row in food_row_iter():
        try:
            parse_details( row )
            row.distance= haversine(
                (row.latitude, row.longitude),
                (base.latitude, base.longitude) )
            yield row
        except TypeError:
            pass
            # print( "problems with", row.dba_name, row.address )

This function defines our secret base at 3420 W Grace St. We've already worked out the latitude and longitude, and don't need to make a geocoding request for the location.
For each row produced by food_row_iter(), we've used parse_details() to update the row. We needed to use a try: block because some of the rows have invalid (or missing) latitude and longitude information. When we try to compute float(None), we get a TypeError exception. We just skipped those locations. We can geocode them separately, but this is Chicago: there's another restaurant down the block that's probably better.
The result of this function is a sequence of objects that include the distance from our base and health code inspection details. We might, for example, apply some filters to exclude places over .25 miles away or those that got a status of Fail:
for business in choice_iter():
    if business.distance > .25: continue
    if business.results == "Fail": continue
    print( business.dba_name,
        business.address, business.results,
        len(business.details) )

The important thing about this sequence of examples is that we leveraged data from a web source, adding value to the raw data by doing our own intelligence processing. We also combined several individual steps into a more sophisticated composite function.
Now that we've located places where we can meet, we have two more things to do. First, we need to create a proper grid code for our chosen locations. The NAC codes are pretty terse. We simply need to agree with our informant about what code we're going to use.
Second, we need to use our steganography script from Chapter 3, Encoding Secret Messages with Steganography, to conceal the message in an image. Again, we'll need to be sure that our informant can locate the encoded message in the image.
We'll leave the design of these final processing steps as a mission for you to tackle on your own.
Data is described by additional data that we often call metadata. A basic datum might be 6371. Without some metadata, we have no idea what this means. Minimally, metadata has to include the unit of measurement (kilometers in this case) as well as the thing being measured (mean radius of the earth).
In the case of less objective data, there may be no units, but rather a domain of possible values. For restaurants, it may be an A-B-C score or a pass-fail outcome. It's important to track down the metadata in order to interpret the actual data.
An additional consideration is the schema problem. A set of data should consist of multiple instances of some essential entity. In our case, the entity is the recent health inspection results for a given restaurant. If each instance has a consistent collection of attributes, we can call that set of attributes the schema for the set of data.
In some cases, the data isn't consistent. Perhaps there are multiple schemata or perhaps the schema is quite complex with options and alternatives. If there's good metadata, it should explain the schema.
The City of Chicago data has a very tidy and complete metadata description for the restaurant health inspection information. We can read it at https://data.cityofchicago.org/api/assets/BAD5301B-681A-4202-9D25-51B2CAE672FF?download=true. It explains the risk category assigned to the facility and the ultimate result (pass, pass with conditions, fail). Note the long ugly URL; opaque paths like this are often a bad idea.
The Virginia Department of Health data isn't quite so tidy or complete. We can eventually work out what the data appears to mean. To be completely sure, we'd need to contact the curator of the data to find out precisely what each attribute means. This would involve an e-mail exchange with the department of health at the state level. A field agent might find this extra effort necessary in the case of ambiguous data names.