As described in the previous section, the DISTAL application will make use of four separate sets of freely available geospatial data:

- The World Borders Dataset
- The GSHHG high-resolution shoreline database
- The GNIS database of US place names
- The GEOnet Names Server's database of non-US place names
For more information on these sources of data, refer to Chapter 4, Sources of Geospatial Data.
Let's work through the process of downloading and importing each of these datasets in turn.
You downloaded a copy of the World Borders Dataset earlier. Create a subdirectory named data within your DISTAL directory, and place a copy of the TM_WORLD_BORDERS-0.3 directory into the data directory. Then create a new Python program named import_world_borders.py within your DISTAL directory, and enter the following into it:
import os.path
import psycopg2
import osgeo.ogr

connection = psycopg2.connect(database="distal",
                              user="distal_user",
                              password="...")
cursor = connection.cursor()

# Start with an empty table, so re-running the import is safe.
cursor.execute("DELETE FROM countries")

srcFile = os.path.join("data", "TM_WORLD_BORDERS-0.3",
                       "TM_WORLD_BORDERS-0.3.shp")
shapefile = osgeo.ogr.Open(srcFile)
layer = shapefile.GetLayer(0)

num_done = 0
for i in range(layer.GetFeatureCount()):
    # Copy each country's name and outline into the countries table.
    feature = layer.GetFeature(i)
    name = feature.GetField("NAME")
    wkt = feature.GetGeometryRef().ExportToWkt()
    cursor.execute("INSERT INTO countries (name,outline) " +
                   "VALUES (%s, ST_GeometryFromText(%s, 4326))",
                   (name, wkt))
    num_done = num_done + 1

connection.commit()
print("Imported {} countries".format(num_done))

This program should be pretty easy to understand: we're simply using OGR to read the contents of the shapefile, and psycopg2 to copy the data into the countries table. When you run this program, the contents of the World Borders Dataset will be imported into your database.
Download a copy of the GSHHG shoreline database (http://www.ngdc.noaa.gov/mgg/shorelines/gshhs.html) if you haven't already done so. Make sure you download the full database in Shapefile format. Place the resulting GSHHS_shp directory into your data subdirectory.
The GSHHG shoreline database consists of separate shapefiles defining land/water boundaries at five different resolutions. For the DISTAL application, we want to import the first four levels of this data (coastline, lake, island-in-lake, and pond-in-island-in-lake) at full resolution. The full-resolution shapefiles are stored in a subdirectory named f, which stands for full resolution.
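Before running the import, you may want to confirm that the four shapefiles are where the script expects them. Here's a minimal sanity check, assuming the directory layout described above:

import os.path

# The full-resolution shapefiles are named GSHHS_f_L1.shp through
# GSHHS_f_L4.shp, one per level.
for level in [1, 2, 3, 4]:
    path = os.path.join("data", "GSHHS_shp", "f",
                        "GSHHS_f_L{}.shp".format(level))
    print(path, "found" if os.path.exists(path) else "MISSING")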
To import this data into the DISTAL database, create a new program named import_gshhg.py within your DISTAL directory, and enter the following Python code into this file:
import os.path
import psycopg2
import osgeo.ogr

connection = psycopg2.connect(database="distal",
                              user="distal_user",
                              password="...")
cursor = connection.cursor()

# Start with an empty table, so re-running the import is safe.
cursor.execute("DELETE FROM shorelines")

for level in [1, 2, 3, 4]:
    num_done = 0
    srcFile = os.path.join("data", "GSHHS_shp", "f",
                           "GSHHS_f_L{}.shp".format(level))
    shapefile = osgeo.ogr.Open(srcFile)
    layer = shapefile.GetLayer(0)
    for i in range(layer.GetFeatureCount()):
        feature = layer.GetFeature(i)
        geometry = feature.GetGeometryRef()
        # Skip the few invalid geometries in the GSHHG data.
        if geometry.IsValid():
            cursor.execute("INSERT INTO shorelines " +
                           "(level,outline) VALUES " +
                           "(%s, ST_GeometryFromText(%s, 4326))",
                           (level, geometry.ExportToWkt()))
            num_done = num_done + 1
    connection.commit()
    print("Imported {} shorelines at level {}".format(num_done, level))

Notice that we check that each geometry is valid before inserting it into the database; this is because a few of the geometries in the GSHHG database are invalid. It is theoretically possible to fix these invalid records, but doing so is quite complicated, so we're simply skipping them.
When you run this program, it will take a few minutes to complete as there are over 180,000 high-resolution polygons to be imported.
You can download the list of US place names from http://geonames.usgs.gov/domestic. Click on the Download Domestic Names hyperlink, and choose the Download all national features in one .zip file option. This will download a file named NationalFile_YYYYMMDD.zip, where YYYYMMDD is the datestamp identifying when the file was last updated. Once again, decompress the resulting .zip archive and move the NationalFile_YYYYMMDD.txt file into your data subdirectory.
The file you have just downloaded is a pipe-delimited text file, meaning that each column is separated by a | character, like this:
FEATURE_ID|FEATURE_NAME|FEATURE_CLASS|...|DATE_EDITED
399|Agua Sal Creek|Stream|AZ|...|02/08/1980
400|Agua Sal Wash|Valley|AZ|...|02/08/1980
The first line contains the names of the various fields. While there are a lot of fields in the file, there are four fields that we are particularly interested in:
- The FEATURE_NAME field contains the name of the location, in UTF-8 character encoding.
- The FEATURE_CLASS field tells us what type of feature we are dealing with, in this case, a Stream or a Valley. There are a lot of features we don't need for the DISTAL application, for example, the names of bays, beaches, bridges, and oilfields. In fact, there is only one feature class we are interested in: Populated Place.
- The PRIM_LONG_DEC and PRIM_LAT_DEC fields contain the longitude and latitude of the location, in decimal degrees. According to the documentation, these coordinates use the NAD83 datum rather than the WGS84 datum used by the other data we are importing.

We are going to import the data from this file into our places database table. Because the latitude and longitude values use the NAD83 datum, we will use pyproj to convert the coordinates into WGS84 so that it is compatible with the other data we are using.
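To make the layout concrete, here's how the abbreviated sample line shown earlier breaks apart when split on the | character (in the real file, the PRIM_LAT_DEC and PRIM_LONG_DEC values sit further along the line, at positions 9 and 10):

line = "399|Agua Sal Creek|Stream|AZ|...|02/08/1980"
parts = line.rstrip().split("|")

print(parts[1])  # FEATURE_NAME  -> Agua Sal Creek
print(parts[2])  # FEATURE_CLASS -> Stream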
Strictly speaking, this step isn't necessary. We will be using pyproj to transform coordinates from NAD83 to WGS84. However, the data we are importing is all within the United States, and these two datums are effectively identical for points within the United States, so pyproj won't actually change the coordinates at all. But we will do this anyway, following the recommended practice of knowing the spatial reference for our data and transforming when necessary, even if that transformation is a no-op at times.
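You can verify this for yourself with a couple of lines of pyproj. This little sketch uses the same two projections as the import script; the sample point is an arbitrary location in Arizona:

import pyproj

srcProj = pyproj.Proj(proj='longlat', ellps='GRS80', datum='NAD83')
dstProj = pyproj.Proj(proj='longlat', ellps='WGS84', datum='WGS84')

# An arbitrary point in Arizona, in NAD83 decimal degrees.
long, lat = -111.0, 35.0
print(pyproj.transform(srcProj, dstProj, long, lat))
# Prints (-111.0, 35.0): the coordinates come back unchanged.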
Let's write the program to import this data into the database. Create a new file named import_gnis.py in your main DISTAL directory, and enter the following Python code into this file:
import psycopg2
import os.path
import pyproj

connection = psycopg2.connect(database="distal",
                              user="distal_user",
                              password="...")
cursor = connection.cursor()

# Start with an empty table, so re-running the import is safe.
cursor.execute("DELETE FROM places")

srcProj = pyproj.Proj(proj='longlat', ellps='GRS80',
                      datum='NAD83')
dstProj = pyproj.Proj(proj='longlat', ellps='WGS84',
                      datum='WGS84')

f = open(os.path.join("data",
                      "NationalFile_YYYYMMDD.txt"),
         "r", encoding="utf-8")
heading = f.readline()  # Ignore field names.

num_inserted = 0
for line in f:
    parts = line.rstrip().split("|")
    featureName = parts[1]
    featureClass = parts[2]
    lat = float(parts[9])
    long = float(parts[10])
    if featureClass == "Populated Place":
        # Convert from NAD83 to WGS84 before storing the point.
        long, lat = pyproj.transform(srcProj, dstProj,
                                     long, lat)
        cursor.execute("INSERT INTO places " +
                       "(name, position) VALUES (%s, " +
                       "ST_SetSRID(" +
                       "ST_MakePoint(%s,%s), 4326))",
                       (featureName, long, lat))
        num_inserted += 1
        if num_inserted % 1000 == 0:
            connection.commit()

connection.commit()
f.close()
print("Inserted {} placenames".format(num_inserted))

Note that we commit the changes to the database after every 1,000 records; this speeds up the importing process. We're also using the ST_SetSRID() function to specify the SRID for each Point geometry before we insert it. We need to do this because we've told PostGIS to only accept geometries which have an SRID value of 4326.
When you run this program, it will take a few minutes to complete. Once it's finished, you will have imported over 200,000 US place names into the places table.
If you haven't already done so, download the list of non-US place names from http://geonames.nga.mil/gns/html/namefiles.html by clicking on the Entire country files dataset hyperlink. Note that this is a large download—over 450 megabytes in compressed form—so it may take a while to download.
Once it has finished downloading, you will end up with a file named geonames_YYYYMMDD.zip, where once again YYYYMMDD is the datestamp identifying when the file was last updated. Decompress this file, which will result in a directory containing two text files: Countries.txt and Countries_disclaimer.txt. We want the file named Countries.txt, so copy this file into your data subdirectory.
The Countries.txt file is probably too big to open up in a text editor. If you could, however, you'd see that this is a tab-delimited text file that uses UTF-8 character encoding. The first few lines of this file look something like this:
RC  UFI       ...  FULL_NAME_ND_RG  NOTE       MODIFY_DATE
1   -1307834  ...  Pavia                       1993-12-21
1   -1307889  ...  Santa Anna       gjgscript  1993-12-21
As with the US place name data, there are many more features here than we need for the DISTAL application. Since we are only interested in the official names of towns and cities, we need to filter this data in the following way:
- The FC (feature classification) field tells us which type of feature we are dealing with. We want features with an FC value of "P" (populated place).
- The NT (name type) field tells us the status of this feature's name. We want names with an NT value of "N" (approved name).
- The DSG (feature designation code) field tells us the type of feature, in more detail than the FC field. For the DISTAL application, we are interested in features with a DSG value of "PPL" (populated place), "PPLA" (administrative capital), or "PPLC" (capital city).

If you want to see a full list of all the feature designation codes, go to http://geonames.nga.mil/namesgaz and click on the Designation Codes link in the sidebar on the left, under Lookup Tables. This brings up a page where you can search for a particular feature designation code; if you leave all the fields blank and click on the Search Designation Codes button, a window will appear showing you all the various feature designation codes and what they mean.
There are also several different versions of each place name; we want the full name in normal reading order, which is in the field named FULL_NAME_RO.
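Because the file has so many columns, it's easy to miscount the hard-coded field positions used in the script below (3, 4, 9, 10, 17, and 22). As a quick cross-check, you can derive the positions from the heading line itself; here's a small sketch that prints where each column of interest lives:

import os.path

f = open(os.path.join("data", "Countries.txt"), "r",
         encoding="utf-8")

# Map each column name in the heading line to its position, so that
# fields can be looked up by name instead of by hard-coded index.
heading = f.readline().rstrip().split("\t")
col = {name: i for i, name in enumerate(heading)}

for name in ["LAT", "LONG", "FC", "DSG", "NT", "FULL_NAME_RO"]:
    print(name, "->", col[name])

f.close()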
This tells us everything we need to know to import the non-US place names into our places database table. Create a new Python script named import_geonames.py inside your DISTAL directory, and enter the following into this file:
import psycopg2
import os.path

connection = psycopg2.connect(database="distal",
                              user="distal_user",
                              password="...")
cursor = connection.cursor()

f = open(os.path.join("data", "Countries.txt"), "r",
         encoding="utf-8")
heading = f.readline()  # Ignore field names.

num_inserted = 0
for line in f:
    parts = line.rstrip().split("\t")
    lat = float(parts[3])
    long = float(parts[4])
    featureClass = parts[9]
    featureDesignation = parts[10]
    nameType = parts[17]
    featureName = parts[22]
    # Only keep approved names of populated places.
    if (featureClass == "P" and nameType == "N" and
            featureDesignation in ["PPL", "PPLA", "PPLC"]):
        cursor.execute("INSERT INTO places " +
                       "(name, position) VALUES (%s, " +
                       "ST_SetSRID(" +
                       "ST_MakePoint(%s,%s), 4326))",
                       (featureName, long, lat))
        num_inserted = num_inserted + 1
        if num_inserted % 1000 == 0:
            connection.commit()

f.close()
connection.commit()
print("Inserted {} placenames".format(num_inserted))

Notice that we don't delete the existing contents of the places table; this is because we've already imported the US place names into that table. If you ever need to re-import the data, you'll need to run both the import_gnis.py script and the import_geonames.py script, in that order; a small helper for this is sketched at the end of this section.
Go ahead and run this program. It will take a while to complete—maybe 10 to 15 minutes, depending on the speed of your computer—as you're adding over three million place names to the database.
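If you do find yourself needing to rebuild the places table from scratch, a tiny driver script saves you from having to remember the ordering. This is just a sketch, assuming both import scripts live in your DISTAL directory:

import subprocess
import sys

# import_gnis.py empties the places table and loads the US names;
# import_geonames.py then appends the non-US names. The order matters
# because only import_gnis.py clears the table first.
for script in ["import_gnis.py", "import_geonames.py"]:
    subprocess.check_call([sys.executable, script])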