Web Mapping Illustrated

Summarizing Information Using Other Tools

While ogrinfo and other ogr utilities are powerful tools, basic text-processing tools such as sort, uniq, wc, and sed can give them an extra bit of flexibility. The tools here are readily available for Unix-type operating systems (like Linux) by default. They are also available for other operating systems but you may need to download a package (e.g., from http://gnu.org) to get them for your system.

Each command can receive text streams. In this case, the text stream will be the lines of information coming from ogrinfo and listed on the screen. These commands take in those lines and allow you to, for example, show only certain portions of them, to throw away certain lines, reformat them, do a search/replace function or count items. Many types of functions can be done using the ogrinfo -sql parameter, but the ultimate formatting of the results isn’t always what is desired. These examples show some common patterns for extracting specific information and generating more custom stats.

Setting Up Processing Tools for Non-GNU Platforms

These text-processing tools are sometimes packaged together, but are usually separate projects in and of themselves. Most of them were developed as part of the GNU project of the Free Software Foundation and are registered with the GNU free software directory at http://www.gnu.org/directory/. GNU software targets free operating systems, which can cause some problems if you depend on an operating system such as Microsoft Windows. Some operating systems don’t normally include these tools, but they can often be acquired from Internet sources or even purchased.

A very comprehensive set of these tools for Windows is available at http://unxutils.sourceforge.net/. You can download a ZIP file that contains all the programs. If you unzip the file and store the files in a common Windows folder, such as C:\Windows\System32 or C:\winnt\System32, they will be available to run from the command prompt.

If the tool or command you want isn’t included, the next place to look is the GNU directory (http://gnu.org). This is where to start if you are looking for a particular program. A home page for the program and more information about it are available. Look for the download page for the program first to see if there is a binary version of the tool available for your operating system. If not, you may need to download the source code and compile the utility yourself.

Another resource to search is the Freshmeat web site at http://freshmeat.net. This site helps users find programs or projects and also provides daily news reports of what is being updated. Many projects reported in Freshmeat are hosted on the Sourceforge web site at http://sourceforge.net.

One source that is commonly used on Windows is the Cygwin environment, which can be found at http://www.cygwin.com. The web site describes Cygwin as “a Linux-like environment for Windows.” Cygwin can be downloaded and installed on most modern Windows platforms and provides many of the text-processing tools mentioned previously. Furthermore, it also provides access to source-code compilers such as GCC.

Mac OS X includes many of the same kinds of text-processing tools. They may not be exactly the same as the GNU programs mentioned here, but similar alternatives are available in the Darwin core underlying OS X. Those that aren’t available natively in OS X can be compiled from the GNU source code or acquired through a package manager such as Fink.

Using ogrinfo to List Data in a Shapefile

The standard output of ogrinfo is a set of lines displaying information about each feature. As seen earlier, this output is quite verbose, showing some summary information first, then a section for each feature. In the case of the airport data, each airport has its own section of seven lines. Example 6-10 shows two of these sections, covering 2 of the 12 features (the rest were removed to reduce unnecessary length).

Example 6-10. Basic output listing about the airports shapefile

> ogrinfo data airports
INFO: Open of 'data'
using driver 'ESRI Shapefile' successful.

Layer name: airports
Geometry: Point
Feature Count: 12
Extent: (434634.000000, 5228719.000000) - (496393.000000, 5291930.000000)
Layer SRS WKT:
(unknown)
NAME: String (64.0)
LAT: Real (12.4)
LON: Real (12.4)
ELEVATION: Real (12.4)
QUADNAME: String (32.0)
OGRFeature(airports):0
  NAME (String) = Bigfork Municipal Airport
  LAT (Real) =      47.7789
  LON (Real) =     -93.6500
  ELEVATION (Real) =    1343.0000
  QUADNAME (String) = Effie
  POINT (451306 5291930)

OGRFeature(airports):1
  NAME (String) = Bolduc Seaplane Base
  LAT (Real) =      47.5975
  LON (Real) =     -93.4106
  ELEVATION (Real) =    1325.0000
  QUADNAME (String) = Balsam Lake
  POINT (469137 5271647)

But what if you don’t really care about a lot of the information that is displayed? You can use the ogrinfo options -sql and -where, but they still show you summary information and don’t necessarily format it the way you want. Various other operating system programs can help you reformat the output of ogrinfo. Examples of these commands follow, starting with the grep command.

Using grep to Show Only the Names of the Airports

The grep command can be used to show only certain lines of the text being printed to your screen or, for example, to find a certain line in a text file. In this case, we are piping the text stream that ogrinfo prints into the grep command and analyzing it. The result is that any line containing two spaces followed by the word NAME is printed; the rest of the lines won’t show. Note that the pipe symbol | is the vertical bar, usually typed by holding Shift and pressing the \ key on your keyboard. This tells the command-line interpreter to send all the results of the ogrinfo command to the grep command for further processing. You then give grep the text pattern to look for at the end of the command, telling it which lines you want to see in your results, as shown in Example 6-11.

Example 6-11. Chaining together multiple commands to filter results from ogrinfo

> ogrinfo data airports | grep '  NAME'
  NAME (String) = Bigfork Municipal Airport
  NAME (String) = Bolduc Seaplane Base
  NAME (String) = Bowstring Municipal Airport
  NAME (String) = Burns Lake Seaplane Base
  NAME (String) = Christenson Point Seaplane Base
  NAME (String) = Deer River Municipal Airport
  NAME (String) = Gospel Ranch Airport
  NAME (String) = Grand Rapids-Itasca County/Gordon Newstrom Field
  NAME (String) = Richter Ranch Airport
  NAME (String) = Shaughnessy Seaplane Base
  NAME (String) = Sixberrys Landing Seaplane Base
  NAME (String) = Snells Seaplane Base

If you want some other piece of information to show instead, simply change '  NAME' (including the two preceding spaces) to 'abc', where abc is the text or number you are interested in. For example, grep 'LAT' shows only the lines containing LAT. Notice that using 'NAME' without the preceding spaces lists the QUADNAME attributes as well, because QUADNAME also contains the text NAME.
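
One way to avoid picking up the QUADNAME lines is to anchor the pattern to the start of the line with a caret. The sketch below is only an illustration: sample_ogrinfo is a made-up stand-in that prints a few lines copied from Example 6-10, so the commands can be tried without the airports shapefile. With real data, you would pipe ogrinfo data airports into grep instead.

```shell
# Hypothetical stand-in for `ogrinfo data airports` (lines taken from Example 6-10).
sample_ogrinfo() {
cat <<'EOF'
NAME: String (64.0)
QUADNAME: String (32.0)
OGRFeature(airports):0
  NAME (String) = Bigfork Municipal Airport
  QUADNAME (String) = Effie
EOF
}

# Unanchored pattern: matches every line that contains NAME anywhere.
loose=$(sample_ogrinfo | grep 'NAME')

# Anchored pattern: only lines that begin with two spaces and then NAME.
strict=$(sample_ogrinfo | grep '^  NAME')

printf 'unanchored:\n%s\n\nanchored:\n%s\n' "$loose" "$strict"
```

The unanchored pattern returns four of the five sample lines, including both field definitions from the summary; the anchored pattern returns only the airport name.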

Using wc to Count the Number of Airport Names

Now that you have a list of attribute values in your airports file, you can start to use other commands. The wc command can perform a variety of analysis functions against a list of text. The name wc stands for word count. It can count the number of characters, words, or lines in a list of text (or a file) and report them back to you. Output from grep or ogrinfo can be redirected to wc to be further analyzed.

In this case we use wc to count the number of lines (using the -l line count option). Combined with the grep command, as shown in the following example, this shows the number of airports that grep would have printed to your screen.

    > ogrinfo data airports | grep '  NAME' | wc -l
    12
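
If all you need is the count, grep can produce it on its own with the -c option, which prints the number of matching lines instead of the lines themselves. This is a small sketch rather than a replacement for wc, whose output may be padded with leading spaces on some systems; sample_names is a hypothetical stand-in for the ogrinfo output.

```shell
# Hypothetical stand-in for `ogrinfo data airports`, trimmed to four lines.
sample_names() {
cat <<'EOF'
  NAME (String) = Bigfork Municipal Airport
  QUADNAME (String) = Effie
  NAME (String) = Bolduc Seaplane Base
  QUADNAME (String) = Balsam Lake
EOF
}

# -c makes grep print the number of matching lines rather than the lines.
count=$(sample_names | grep -c '^  NAME')
echo "$count"
```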

Using sed to Find Specific Patterns in Airport Names

Another very powerful tool is the text stream-editing tool called sed. sed allows a user to filter a list of text (in this case the listing from ogrinfo) and perform text substitutions (search and replace), find or delete certain text, etc. If you are already familiar with regular expression syntax, you will find yourself right at home using sed, because it uses regex syntax to define its filters.

In this example, you take the full output of the ogrinfo command again and search for entries that contain the words Seaplane Base. The -n option stops sed from printing every line by default, and the trailing p prints only the lines that match. What makes this different from the grep example is the dollar sign ($) at the end of the phrase. This symbol represents the end of the line. Example 6-12, therefore, prints only airport names that have Seaplane Base at the end of the name; it doesn’t print any airport without Seaplane Base in its name and also excludes airports that have the phrase anywhere but the last part of the name. For instance, a hypothetical airport named Joes Seaplane Base and Cafe wouldn’t be returned.

Example 6-12. Using sed to do basic filtering of results

> ogrinfo data airports | sed -n '/Seaplane Base$/p'
  NAME (String) = Bolduc Seaplane Base
  NAME (String) = Burns Lake Seaplane Base
  NAME (String) = Christenson Point Seaplane Base
  NAME (String) = Shaughnessy Seaplane Base
  NAME (String) = Sixberrys Landing Seaplane Base
  NAME (String) = Snells Seaplane Base
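
To see what the trailing $ changes, the sketch below runs the same sed filter with and without it. The three input lines are illustrative only: two names come from the airports data, and Joes Seaplane Base and Cafe is the made-up name mentioned above.

```shell
# Illustrative input; the second name is hypothetical and not in the shapefile.
sample_names() {
cat <<'EOF'
  NAME (String) = Bolduc Seaplane Base
  NAME (String) = Joes Seaplane Base and Cafe
  NAME (String) = Gospel Ranch Airport
EOF
}

# Without $: the phrase may appear anywhere in the line.
anywhere=$(sample_names | sed -n '/Seaplane Base/p')

# With $: the phrase must be the last thing on the line.
at_end=$(sample_names | sed -n '/Seaplane Base$/p')

printf 'anywhere:\n%s\n\nat end of line:\n%s\n' "$anywhere" "$at_end"
```

Only the Bolduc line survives the anchored filter, while the unanchored one also keeps the made-up Joes entry.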

Tip

Further text-processing can be done with the awk command. For example, to remove the NAME (String) text, pipe the results through awk:

    | awk -F= '{print $2}'
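
The awk filter in the tip splits each line at the equals sign, so the value it prints still begins with the space that followed the =. A possible refinement, sketched here on a single sample line, is to treat the equals sign together with the spaces around it as the separator. This assumes the value itself never contains the sequence space, equals sign, space.

```shell
# One sample line in the format ogrinfo prints for the NAME attribute.
line='  NAME (String) = Bolduc Seaplane Base'

# Separator is "=": the second field keeps its leading space.
with_space=$(printf '%s\n' "$line" | awk -F= '{print $2}')

# Separator is " = ": the second field is the bare value.
trimmed=$(printf '%s\n' "$line" | awk -F' = ' '{print $2}')

# Brackets make the leading space visible.
printf '[%s]\n[%s]\n' "$with_space" "$trimmed"
```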

Using sed to Reformat Printed Results

The display of the previous example may be fine for a quick data review. When the results are destined for a report or need to be cut and pasted elsewhere, it is often best to reformat them. Example 6-13 uses grep to filter out all the lines that aren’t airport names, as in Example 6-11. It then uses two sed filters: the first removes the attribute name information, and the second removes any airports whose names start with B. As you can see, the example runs the ogrinfo results through three filters and produces an easy-to-read list of all the airports meeting your criteria.

Example 6-13. Using sed to remove results with a certain starting letter

> ogrinfo data airports | grep '  NAME' | sed 's/  NAME (String) = //' | sed '/^B/d'
Christenson Point Seaplane Base
Deer River Municipal Airport
Gospel Ranch Airport
Grand Rapids-Itasca County/Gordon Newstrom Field
Richter Ranch Airport
Shaughnessy Seaplane Base
Sixberrys Landing Seaplane Base
Snells Seaplane Base

The usage of the last sed filter looks somewhat obscure, because it uses the caret ^ symbol. This denotes the start of a line, so, in this case, it looks for any line that starts with B. It doesn’t concern itself with the rest of the line at all. The final /d means “delete lines that meet the ^B criteria.”

Example 6-14 uses a similar approach but doesn’t require the text to be at the beginning of the line. Any airport with the word Municipal in the name is deleted from the final list.

Example 6-14. Using sed to remove results containing a keyword

> ogrinfo data airports | grep '  NAME' | sed 's/  NAME (String) = //' | sed '/Municipal/d'
Bolduc Seaplane Base
Burns Lake Seaplane Base
Christenson Point Seaplane Base
Gospel Ranch Airport
Grand Rapids-Itasca County/Gordon Newstrom Field
Richter Ranch Airport
Shaughnessy Seaplane Base
Sixberrys Landing Seaplane Base
Snells Seaplane Base
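
The grep filter and the two sed filters can also be folded into a single sed command. This is a sketch of one way to do it, not the only one: -n turns off automatic printing, each -e adds one editing command, and the p flag prints a line only when the substitution succeeded. sample_ogrinfo is a hypothetical stand-in for ogrinfo data airports.

```shell
# Hypothetical stand-in for `ogrinfo data airports`, trimmed to a few lines.
sample_ogrinfo() {
cat <<'EOF'
NAME: String (64.0)
  NAME (String) = Bigfork Municipal Airport
  LAT (Real) =      47.7789
  NAME (String) = Bolduc Seaplane Base
  LAT (Real) =      47.5975
EOF
}

# 1st command: drop any line containing Municipal.
# 2nd command: strip the attribute prefix and print only lines where it was found.
names=$(sample_ogrinfo | sed -n -e '/Municipal/d' -e 's/^  NAME (String) = //p')
echo "$names"
```

Because only substituted lines are printed, the field definition and the LAT lines drop out without a separate grep step.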

Using sed to Remove Lines and Trim the Front End of Lines

sed has many different options and can be very sophisticated, especially when combining sed filters. Example 6-15 shows how you can string numerous commands together and do a few filters all at once.

Example 6-15. Multiple sed commands to provide groups of lines meeting certain criteria

> ogrinfo data airports | sed -n '/^  NAME/,/^  ELEVATION/p' | sed '/LAT/d' | sed '/LON/d' | sed 's/..................//'
Bigfork Municipal Airport
 =    1343.0000
Bolduc Seaplane Base
 =    1325.0000
Bowstring Municipal Airport
 =    1372.0000
Burns Lake Seaplane Base
 =    1357.0000
Christenson Point Seaplane Base
 =    1372.0000
Deer River Municipal Airport
 =    1311.0000
Gospel Ranch Airport
 =    1394.0000
Grand Rapids-Itasca County/Gordon Newstrom Field
 =    1355.0000
Richter Ranch Airport
 =    1340.0000
Shaughnessy Seaplane Base
 =    1300.0000
Sixberrys Landing Seaplane Base
 =    1372.0000
Snells Seaplane Base
 =    1351.0000

This example applies four sed filters to the list. The first is perhaps the most complex. It has two patterns separated by a comma:

               
    '/^  NAME/,/^  ELEVATION/p'

You can see the use of the caret again, which here anchors each pattern to the beginning of the line. In this case it looks for the lines starting with NAME (including the two spaces that ogrinfo adds by default), but then there is also ELEVATION specified. The comma tells sed to include a range of lines: those that fall between the line starting with NAME and the next line starting with ELEVATION, inclusive. NAME is called the start; ELEVATION is called the end. This way you can see a few lines together rather than selecting one line at a time. This is helpful because it shows the lines in the context of surrounding information, which matters for text streams like ogrinfo output, where the attributes of each feature are spread across multiple lines.

               
    sed '/LAT/d' | sed '/LON/d'

The second and third filters are simple delete filters that remove any LAT and LON lines. Notice that these lines originally fell between NAME and ELEVATION in the list, so the filter is simply removing more and more lines building on the previous filter.

               
    sed 's/..................//'

The fourth filter isn’t a joke, nor did I fall asleep on the keyboard. It is a substitute or search/replace filter, which is signified by the preceding s/. Each period matches any single character, so the 18 periods match the first 18 characters of each line, and the empty replacement between the final two slashes deletes them.

The end result of these four filters is a much more readable list of all the airports in the shapefile and their respective elevations.
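
Counting out periods ties the filter to the exact length of the attribute prefix. The sketch below shows an alternative, under the assumption that no attribute name contains an equals sign: it deletes everything up to the first = and the spaces after it, then uses paste to join each name with its elevation on one line. sample_pairs is a hypothetical stand-in for the output of the first three filters in Example 6-15.

```shell
# Hypothetical stand-in for the NAME/ELEVATION pairs left after the first three filters.
sample_pairs() {
cat <<'EOF'
  NAME (String) = Bigfork Municipal Airport
  ELEVATION (Real) =    1343.0000
  NAME (String) = Bolduc Seaplane Base
  ELEVATION (Real) =    1325.0000
EOF
}

# sed: remove everything through the first "=" plus any spaces after it.
# paste: read two lines at a time from standard input and join them with a colon.
report=$(sample_pairs | sed 's/^[^=]*= *//' | paste -d: - -)
echo "$report"
```

Each airport now occupies a single line, such as Bigfork Municipal Airport:1343.0000, which is easier to sort or load into a spreadsheet.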

Using sort to Create a List of Ordered Elevations

Another very handy command-line tool is sort. sort does just what the name promises: it puts text or numbers in a certain order. It sorts in ascending order by default, from smallest to largest or from lowest letter (closest to “a”) to highest letter (closest to “z”). Numbers are compared as text unless you add the -n option.

In Example 6-16 all the lines are filtered out except those including ELEVATION. The unwanted characters are then stripped from the beginning of each line. The output is then passed through sort, which reorders it in ascending order.

Example 6-16. Using sort to reorder results

> ogrinfo data airports | grep 'ELEVATION' | sed -n 's/  ELEVATION (Real) =   //p' | sort
 1300.0000
 1311.0000
 1325.0000
 1340.0000
 1343.0000
 1351.0000
 1355.0000
 1357.0000
 1372.0000
 1372.0000
 1372.0000
 1394.0000
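
sort has an -n option that compares lines by numeric value instead of as text. The difference only shows up when the numbers have different lengths, so the sketch below adds a made-up elevation of 980 to two values from the airports data. LC_ALL=C is set here as an assumption, so that the text comparison goes by plain character codes regardless of your locale settings.

```shell
# Two elevations from the airports data plus one hypothetical value (980.0000).
values() { printf '%s\n' 1343.0000 980.0000 1300.0000; }

as_text=$(values | LC_ALL=C sort)         # text order: "1..." sorts before "9..."
as_numbers=$(values | LC_ALL=C sort -n)   # numeric order, ascending
descending=$(values | LC_ALL=C sort -rn)  # -r reverses the order

printf 'text:\n%s\n\nnumeric:\n%s\n\ndescending:\n%s\n' "$as_text" "$as_numbers" "$descending"
```

In text order, 980.0000 lands last because its first character is 9; with -n it moves to the top of the list.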

Using uniq to Summarize Results of Duplicate Lines

The output from sort includes some duplicate or repeated values. Obviously some airports rest at the same elevation: 1,372 feet. If this output is going to be used in a report, it may not make sense to include repeated values, especially when it is just a list of numbers.

The uniq command can help make the results more presentable. In Example 6-17, the results of grep, sed, and sort were passed to the uniq command. uniq processes the list and removes duplicate lines from the list. You’ll notice only one occurrence of 1372 now.

Example 6-17. Using uniq to remove duplicates from the results

> ogrinfo data airports | grep 'ELEVATION' | sed -n 's/  ELEVATION (Real) =   //p' | sort | uniq
 1300.0000
 1311.0000
 1325.0000
 1340.0000
 1343.0000
 1351.0000
 1355.0000
 1357.0000
 1372.0000
 1394.0000

uniq has some other options. As seen in Example 6-18, -c tells uniq to also print the number of times each line occurs. Notice that only elevation 1372 occurred more than once.

Example 6-18. Counting the number of unique occurrences in results using uniq

> ogrinfo data airports | grep 'ELEVATION' | sed -n 's/  ELEVATION (Real) =   //p' | sort | uniq -c
      1  1300.0000
      1  1311.0000
      1  1325.0000
      1  1340.0000
      1  1343.0000
      1  1351.0000
      1  1355.0000
      1  1357.0000
      3  1372.0000
      1  1394.0000
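
A common follow-up is to rank the values by how often they occur. One way to sketch this is to pass the counted list through sort a second time, using -rn so the largest counts come first. The values function below is a hypothetical stand-in for the stripped elevation list, shortened to five entries.

```shell
# Hypothetical stand-in for the elevation list (a subset of the airports data).
values() { printf '%s\n' 1372.0000 1300.0000 1372.0000 1343.0000 1372.0000; }

# First sort groups duplicates, uniq -c counts them, second sort ranks by count.
ranked=$(values | sort | uniq -c | sort -rn)

# The first line of the ranked list is the most frequent elevation.
top=$(printf '%s\n' "$ranked" | head -n 1)
echo "$top"
```

The exact padding in front of the count depends on your version of uniq, but the first line pairs the count 3 with the elevation 1372.0000.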

The -d option for uniq shows only duplicate records. You can combine multiple options to help give you exactly what you are looking for. As shown in Example 6-19, if you are only interested in airports with the same elevation, and you want to know how many there are, you would only have to add d to the options for uniq.

Example 6-19. Using uniq to only show results that are duplicated

> ogrinfo data airports | grep 'ELEVATION' | sed -n 's/  ELEVATION (Real) =   //p' | sort | uniq -cd
      3  1372.0000

To add line numbers to your output, pass your text stream through the nl command. Example 6-20 shows what this looks like.

Example 6-20. Adding line numbers to text stream output using nl

> ogrinfo data airports | grep 'ELEVATION' | sed -n 's/  ELEVATION (Real) =   //p' | sort | uniq -c | nl
     1        1  1300.0000
     2        1  1311.0000
     3        1  1325.0000
     4        1  1340.0000
     5        1  1343.0000
     6        1  1351.0000
     7        1  1355.0000
     8        1  1357.0000
     9        3  1372.0000
    10        1  1394.0000

Keep in mind that uniq compares each line only with the lines adjacent to it, so sorting beforehand makes sure that all duplicates are side by side. If they aren’t, there is no guarantee that uniq will produce the expected results. Other text-processing commands may suit you better if you are unable to use sort. For example, tsort, referenced in the next section, may do what you want.
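
The effect of skipping the sort is easy to reproduce. In this sketch, three sample elevations are passed to uniq with and without sorting first; sort -u, which sorts and removes duplicates in one step, is shown as a possible shortcut. The values function is a hypothetical stand-in for the elevation list.

```shell
# Hypothetical stand-in: the duplicate 1372.0000 values are not next to each other.
values() { printf '%s\n' 1372.0000 1300.0000 1372.0000; }

unsorted=$(values | uniq)        # duplicates are not adjacent, so both survive
sorted=$(values | sort | uniq)   # sorting first puts the duplicates side by side
merged=$(values | sort -u)       # sort -u sorts and drops duplicates in one step

printf 'uniq only:\n%s\n\nsort then uniq:\n%s\n\nsort -u:\n%s\n' "$unsorted" "$sorted" "$merged"
```

Without the sort, all three lines come back unchanged; with it, only two distinct elevations remain.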

Other Powerful Text-Processing Tools

Most Unix implementations, including Linux, have many more processing commands available. The list below shows a summary of the text-processing commands you may find useful. If you are wondering how to use some of them, you can usually add --help after the command name to get a list of options. You may also be able to read the manual for the command by typing man <command name>.

sort: Sorts lines of text
paste: Merges lines of files
sed: Performs basic text transformations on an input stream
tsort: Performs topological sort
join: Joins lines of two files on a common field
awk: Is a pattern scanning and processing language
uniq: Removes duplicate lines from a sorted file
head/tail: Outputs the first/last part of files
wc: Prints the number of newlines, words, and bytes in files
expand/unexpand: Converts to/from tabs and spaces
grep: Prints lines matching a pattern
column: Columnates lists
cut: Removes sections from each line of files
look: Displays lines beginning with a given string
colrm: Removes columns from a file
nl: Numbers lines in text stream
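
As a brief taste of a few commands from this list, the sketch below uses head and tail to pick the lowest and highest values from an already sorted list, and cut to drop a fixed-width prefix. The elevations function is a hypothetical stand-in for the sorted output of Example 6-16, shortened to four values.

```shell
# Hypothetical stand-in for the sorted elevation list of Example 6-16 (shortened).
elevations() { printf '%s\n' 1300.0000 1311.0000 1372.0000 1394.0000; }

lowest=$(elevations | head -n 1)     # first line of the stream
highest=$(elevations | tail -n 1)    # last line of the stream

# cut -c19- keeps characters 19 onward, dropping the 18-character prefix.
name=$(printf '%s\n' '  NAME (String) = Snells Seaplane Base' | cut -c19-)

echo "lowest: $lowest, highest: $highest, name: $name"
```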

I highly recommend the Linux Documentation Project site for a comprehensive list and examples of these text-processing commands. It is helpful on other platforms as well, because these tools often exist for them too. The site address is http://www.tldp.org/; search for the “Text Processing” HOWTO document. Other HOWTO documents and tutorials go into more depth for specific text-processing programs. Check out the O’Reilly book Unix Power Tools for more Unix help.
