As described in the previous section, the DISTAL application will make use of four separate sets of freely available geospatial data:

- The World Borders Dataset
- The GSHHG high-resolution shoreline database
- The GNIS database of US place names
- The GEOnet Names Server's database of non-US place names
For more information on these sources of data, refer to Chapter 4, Sources of Geospatial Data.
Let's work through the process of downloading and importing each of these datasets in turn.
You downloaded a copy of the World Borders Dataset earlier. Create a subdirectory named data within your DISTAL directory, and place a copy of the TM_WORLD_BORDERS-0.3 directory into the data directory. Then create a new Python program named import_world_borders.py within your DISTAL directory, and enter the following into it:
import os.path
import psycopg2
import osgeo.ogr

connection = psycopg2.connect(database="distal",
                              user="distal_user",
                              password="...")
cursor = connection.cursor()

# Start with an empty table, so re-running the import is safe.
cursor.execute("DELETE FROM countries")

srcFile = os.path.join("data", "TM_WORLD_BORDERS-0.3",
                       "TM_WORLD_BORDERS-0.3.shp")
shapefile = osgeo.ogr.Open(srcFile)
layer = shapefile.GetLayer(0)

num_done = 0
for i in range(layer.GetFeatureCount()):
    # Copy each country's name and outline into the countries table.
    feature = layer.GetFeature(i)
    name = feature.GetField("NAME")
    wkt = feature.GetGeometryRef().ExportToWkt()
    cursor.execute("INSERT INTO countries (name,outline) " +
                   "VALUES (%s, ST_GeometryFromText(%s, 4326))",
                   (name, wkt))
    num_done = num_done + 1

connection.commit()
print("Imported {} countries".format(num_done))

This program should be pretty easy to understand: we're simply using OGR to read the contents of the shapefile, and psycopg2 to copy the data into the countries table. When you run this program, the contents of the World Borders Dataset will be imported into your database.
Download a copy of the GSHHG shoreline database (http://www.ngdc.noaa.gov/mgg/shorelines/gshhs.html) if you haven't already done so. Make sure you download the full database in Shapefile format. Place the resulting GSHHS_shp directory into your data subdirectory.
The GSHHG shoreline database consists of separate shapefiles defining land/water boundaries at five different resolutions. For the DISTAL application, we want to import the first four levels of this data (coastline, lake, island-in-lake, and pond-in-island-in-lake) at full resolution. The full-resolution shapefiles are stored in a subdirectory named f, which stands for full resolution.
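Before running the import, you may want to confirm that the four shapefiles are where the script expects them. Here's a minimal sanity check, assuming the directory layout described above:

import os.path

# The full-resolution shapefiles are named GSHHS_f_L1.shp through
# GSHHS_f_L4.shp, one per level.
for level in [1, 2, 3, 4]:
    path = os.path.join("data", "GSHHS_shp", "f",
                        "GSHHS_f_L{}.shp".format(level))
    print(path, "found" if os.path.exists(path) else "MISSING")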
To import this data into the DISTAL database, create a new program named import_gshhg.py within your DISTAL directory, and enter the following Python code into this file:
import os.path
import psycopg2
import osgeo.ogr

connection = psycopg2.connect(database="distal",
                              user="distal_user",
                              password="...")
cursor = connection.cursor()

# Start with an empty table, so re-running the import is safe.
cursor.execute("DELETE FROM shorelines")

for level in [1, 2, 3, 4]:
    num_done = 0
    srcFile = os.path.join("data", "GSHHS_shp", "f",
                           "GSHHS_f_L{}.shp".format(level))
    shapefile = osgeo.ogr.Open(srcFile)
    layer = shapefile.GetLayer(0)
    for i in range(layer.GetFeatureCount()):
        feature = layer.GetFeature(i)
        geometry = feature.GetGeometryRef()
        # Skip the few invalid geometries in the GSHHG data.
        if geometry.IsValid():
            cursor.execute("INSERT INTO shorelines " +
                           "(level,outline) VALUES " +
                           "(%s, ST_GeometryFromText(%s, 4326))",
                           (level, geometry.ExportToWkt()))
            num_done = num_done + 1
    connection.commit()
    print("Imported {} shorelines at level {}".format(num_done, level))

Notice that we check that each geometry is valid before inserting it into the database; this is because a few of the geometries in the GSHHG database are invalid. It is theoretically possible to fix these invalid records, but doing so is quite complicated, so we're simply skipping them.
When you run this program, it will take a few minutes to complete as there are over 180,000 high-resolution polygons to be imported.
You can download the list of US place names from http://geonames.usgs.gov/domestic. Click on the Download Domestic Names hyperlink, and choose the Download all national features in one .zip file option. This will download a file named NationalFile_YYYYMMDD.zip, where YYYYMMDD is the datestamp identifying when the file was last updated. Once again, decompress the resulting .zip archive and move the NationalFile_YYYYMMDD.txt file into your data subdirectory.
The file you have just downloaded is a pipe-delimited text file, meaning that each column is separated by a | character, like this:
FEATURE_ID|FEATURE_NAME|FEATURE_CLASS|...|DATE_EDITED
399|Agua Sal Creek|Stream|AZ|...|02/08/1980
400|Agua Sal Wash|Valley|AZ|...|02/08/1980
The first line contains the names of the various fields. While there are a lot of fields in the file, there are four fields that we are particularly interested in:
- The FEATURE_NAME field contains the name of the location, in UTF-8 character encoding.
- The FEATURE_CLASS field tells us what type of feature we are dealing with, in this case, a Stream or a Valley. There are a lot of features we don't need for the DISTAL application, for example, the names of bays, beaches, bridges, and oilfields. In fact, there is only one feature class we are interested in: Populated Place.
- The PRIM_LONG_DEC and PRIM_LAT_DEC fields contain the longitude and latitude of the location, in decimal degrees. According to the documentation, these coordinates use the NAD83 datum rather than the WGS84 datum used by the other data we are importing.

We are going to import the data from this file into our places database table. Because the latitude and longitude values use the NAD83 datum, we will use pyproj to convert the coordinates into WGS84 so that it is compatible with the other data we are using.
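To make the layout concrete, here's how the abbreviated sample line shown earlier breaks apart when split on the | character (in the real file, the PRIM_LAT_DEC and PRIM_LONG_DEC values sit further along the line, at positions 9 and 10):

line = "399|Agua Sal Creek|Stream|AZ|...|02/08/1980"
parts = line.rstrip().split("|")

print(parts[1])  # FEATURE_NAME  -> Agua Sal Creek
print(parts[2])  # FEATURE_CLASS -> Stream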
Strictly speaking, this step isn't necessary. We will be using pyproj to transform coordinates from NAD83 to WGS84. However, the data we are importing is all within the United States, and these two datums are effectively identical for points within the United States, so pyproj won't actually change the coordinates at all. But we will do this anyway, following the recommended practice of knowing the spatial reference for our data and transforming when necessary, even if that transformation is a no-op at times.
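You can verify this for yourself with a couple of lines of pyproj. This little sketch uses the same two projections as the import script; the sample point is an arbitrary location in Arizona:

import pyproj

srcProj = pyproj.Proj(proj='longlat', ellps='GRS80', datum='NAD83')
dstProj = pyproj.Proj(proj='longlat', ellps='WGS84', datum='WGS84')

# An arbitrary point in Arizona, in NAD83 decimal degrees.
long, lat = -111.0, 35.0
print(pyproj.transform(srcProj, dstProj, long, lat))
# Prints (-111.0, 35.0): the coordinates come back unchanged.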
Let's write the program to import this data into the database. Create a new file named import_gnis.py in your main DISTAL directory, and enter the following Python code into this file:
import psycopg2
import os.path
import pyproj

connection = psycopg2.connect(database="distal",
                              user="distal_user",
                              password="...")
cursor = connection.cursor()

# Start with an empty table, so re-running the import is safe.
cursor.execute("DELETE FROM places")

srcProj = pyproj.Proj(proj='longlat', ellps='GRS80',
                      datum='NAD83')
dstProj = pyproj.Proj(proj='longlat', ellps='WGS84',
                      datum='WGS84')

f = open(os.path.join("data",
                      "NationalFile_YYYYMMDD.txt"),
         "r", encoding="utf-8")
heading = f.readline()  # Ignore field names.

num_inserted = 0
for line in f:
    parts = line.rstrip().split("|")
    featureName = parts[1]
    featureClass = parts[2]
    lat = float(parts[9])
    long = float(parts[10])
    if featureClass == "Populated Place":
        # Convert from NAD83 to WGS84 before storing the point.
        long, lat = pyproj.transform(srcProj, dstProj,
                                     long, lat)
        cursor.execute("INSERT INTO places " +
                       "(name, position) VALUES (%s, " +
                       "ST_SetSRID(" +
                       "ST_MakePoint(%s,%s), 4326))",
                       (featureName, long, lat))
        num_inserted += 1
        if num_inserted % 1000 == 0:
            connection.commit()

connection.commit()
f.close()
print("Inserted {} placenames".format(num_inserted))

Note that we commit the changes to the database after every 1,000 records; this speeds up the importing process. We're also using the ST_SetSRID() function to specify the SRID for each Point geometry before we insert it. We need to do this because we've told PostGIS to only accept geometries which have an SRID value of 4326.
When you run this program, it will take a few minutes to complete. Once it's finished, you will have imported over 200,000 US place names into the places table.
If you haven't already done so, download the list of non-US place names from http://geonames.nga.mil/gns/html/namefiles.html by clicking on the Entire country files dataset hyperlink. Note that this is a large download—over 450 megabytes in compressed form—so it may take a while to download.
Once it has finished downloading, you will end up with a file named geonames_YYYYMMDD.zip, where once again YYYYMMDD is the datestamp identifying when the file was last updated. Decompress this file, which will result in a directory containing two text files: Countries.txt and Countries_disclaimer.txt. We want the file named Countries.txt, so copy this file into your data subdirectory.
The Countries.txt file is probably too big to open up in a text editor. If you could, however, you'd see that this is a tab-delimited text file that uses UTF-8 character encoding. The first few lines of this file look something like this:
RC  UFI       ...  FULL_NAME_ND_RG  NOTE       MODIFY_DATE
1   -1307834  ...  Pavia                       1993-12-21
1   -1307889  ...  Santa Anna       gjgscript  1993-12-21
As with the US place name data, there are many more features here than we need for the DISTAL application. Since we are only interested in the official names of towns and cities, we need to filter this data in the following way:
- The FC (feature classification) field tells us which type of feature we are dealing with. We want features with an FC value of "P" (populated place).
- The NT (name type) field tells us the status of this feature's name. We want names with an NT value of "N" (approved name).
- The DSG (feature designation code) field tells us the type of feature, in more detail than the FC field. For the DISTAL application, we are interested in features with a DSG value of "PPL" (populated place), "PPLA" (administrative capital), or "PPLC" (capital city).

If you want to see a full list of all the feature designation codes, go to http://geonames.nga.mil/namesgaz and click on the Designation Codes link in the sidebar on the left, under Lookup Tables. This brings up a page where you can search for a particular feature designation code; if you leave all the fields blank and click on the Search Designation Codes button, a window will appear showing you all the various feature designation codes and what they mean.
There are also several different versions of each place name; we want the full name in normal reading order, which is in the field named FULL_NAME_RO.
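Because the file has so many columns, it's easy to miscount the hard-coded field positions used in the script below (3, 4, 9, 10, 17, and 22). As a quick cross-check, you can derive the positions from the heading line itself; here's a small sketch that prints where each column of interest lives:

import os.path

f = open(os.path.join("data", "Countries.txt"), "r",
         encoding="utf-8")

# Map each column name in the heading line to its position, so that
# fields can be looked up by name instead of by hard-coded index.
heading = f.readline().rstrip().split("\t")
col = {name: i for i, name in enumerate(heading)}

for name in ["LAT", "LONG", "FC", "DSG", "NT", "FULL_NAME_RO"]:
    print(name, "->", col[name])

f.close()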
This tells us everything we need to know to import the non-US place names into our places database table. Create a new Python script named import_geonames.py inside your DISTAL directory, and enter the following into this file:
import psycopg2
import os.path

connection = psycopg2.connect(database="distal",
                              user="distal_user",
                              password="...")
cursor = connection.cursor()

f = open(os.path.join("data", "Countries.txt"), "r",
         encoding="utf-8")
heading = f.readline()  # Ignore field names.

num_inserted = 0
for line in f:
    parts = line.rstrip().split("\t")
    lat = float(parts[3])
    long = float(parts[4])
    featureClass = parts[9]
    featureDesignation = parts[10]
    nameType = parts[17]
    featureName = parts[22]
    # Only keep approved names of populated places.
    if (featureClass == "P" and nameType == "N" and
            featureDesignation in ["PPL", "PPLA", "PPLC"]):
        cursor.execute("INSERT INTO places " +
                       "(name, position) VALUES (%s, " +
                       "ST_SetSRID(" +
                       "ST_MakePoint(%s,%s), 4326))",
                       (featureName, long, lat))
        num_inserted = num_inserted + 1
        if num_inserted % 1000 == 0:
            connection.commit()

f.close()
connection.commit()
print("Inserted {} placenames".format(num_inserted))

Notice that we don't delete the existing contents of the places table; this is because we've already imported the US place names into that table. If you ever need to re-import the data, you'll need to run both the import_gnis.py script and the import_geonames.py script, in that order; a small helper for this is sketched at the end of this section.
Go ahead and run this program. It will take a while to complete—maybe 10 to 15 minutes, depending on the speed of your computer—as you're adding over three million place names to the database.
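If you do find yourself needing to rebuild the places table from scratch, a tiny driver script saves you from having to remember the ordering. This is just a sketch, assuming both import scripts live in your DISTAL directory:

import subprocess
import sys

# import_gnis.py empties the places table and loads the US names;
# import_geonames.py then appends the non-US names. The order matters
# because only import_gnis.py clears the table first.
for script in ["import_gnis.py", "import_geonames.py"]:
    subprocess.check_call([sys.executable, script])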