Summary Tools

Since our goal is to reverse engineer binary program files, we need more sophisticated tools to extract detailed information once a file has been classified. The tools discussed in this section are, by necessity, far more aware of the formats of the files that they process. In most cases, each tool understands one specific file format and parses input files to extract very specific information.

nm

When source files are compiled to object files, compilers must embed information regarding the location of any global (external) symbols so that the linker will be able to resolve references to those symbols when it combines object files to create an executable. Unless instructed to strip symbols from the final executable, the linker generally carries symbols from the object files over into the resulting executable. According to the man page, the purpose of the nm utility is to “list symbols from object files.”

When nm is used to examine an intermediate object file (a .o file rather than an executable), the default output yields the names of any functions and global variables declared in the file. Sample output of the nm utility is shown below:

idabook# gcc -c ch2_example.c
idabook# nm ch2_example.o
         U __stderrp
         U exit
         U fprintf
00000038 T get_max
00000000 t hidden
00000088 T main
00000000 D my_initialized_global
00000004 C my_unitialized_global
         U printf
         U rand
         U scanf
         U srand
         U time
00000010 T usage
idabook#

Here we see that nm lists each symbol along with some information about it. The letter codes indicate the type of symbol being listed. This example contains the following letter codes:

`U`	An undefined symbol, usually an external symbol reference.
`T`	A symbol defined in the text section, usually a function name.
`t`	A local symbol defined in the text section. In a C program, this usually equates to a static function.
`D`	An initialized data value.
`C`	A common symbol, typically an uninitialized data value.

Note

Uppercase letter codes are used for global symbols, whereas lowercase letter codes are used for local symbols. A full explanation of the letter codes can be found in the man page for nm.
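
The letter-code conventions described above lend themselves to simple scripting. The following is a minimal sketch, assuming the default nm output format shown in the preceding listing; the names Symbol, parse_nm_line, and CODE_MEANINGS are hypothetical and exist only for this illustration, and the table covers only the codes discussed here rather than the full set documented in the man page.

```python
import re
from typing import NamedTuple, Optional

# Matches "00000038 T get_max" as well as "         U printf" (no address).
NM_LINE = re.compile(r"^\s*(?:([0-9a-fA-F]+)\s+)?([A-Za-z])\s+(\S+)\s*$")

# Only the letter codes discussed in this section; nm defines many more.
CODE_MEANINGS = {
    "U": "undefined (external reference)",
    "T": "text section (code)",
    "D": "initialized data",
    "C": "common (uninitialized data)",
}


class Symbol(NamedTuple):
    address: Optional[int]  # None when nm prints no address
    code: str
    name: str

    @property
    def is_global(self) -> bool:
        # Uppercase codes mark global symbols, lowercase codes local ones.
        return self.code.isupper()

    @property
    def meaning(self) -> str:
        return CODE_MEANINGS.get(self.code.upper(), "see the nm man page")


def parse_nm_line(line: str) -> Optional[Symbol]:
    """Return a Symbol for one line of nm output, or None for other text."""
    match = NM_LINE.match(line)
    if match is None:
        return None
    address, code, name = match.groups()
    return Symbol(int(address, 16) if address else None, code, name)


if __name__ == "__main__":
    for text in ("00000038 T get_max", "00000000 t hidden", "         U printf"):
        sym = parse_nm_line(text)
        scope = "global" if sym.is_global else "local"
        print(f"{sym.name}: {scope}, {sym.meaning}")
```

Lines that do not look like symbol entries, such as shell prompts, are returned as None so that a captured terminal session can be fed through the parser unchanged.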

Somewhat more information is displayed when nm is used to display symbols from an executable file. During the link process, symbols are resolved to virtual addresses (when possible), which results in more information being available when nm is run. Truncated example output from nm used on an executable is shown here:

idabook# gcc -o ch2_example ch2_example.c
idabook# nm ch2_example
         <. . .>
         U exit
         U fprintf
080485c0 t frame_dummy
08048644 T get_max
0804860c t hidden
08048694 T main
0804997c D my_initialized_global
08049a9c B my_unitialized_global
08049a80 b object.2
08049978 d p.0
         U printf
         U rand
         U scanf
         U srand
         U time
0804861c T usage
idabook#

At this point, some of the symbols (main, for example) have been assigned virtual addresses, new ones (frame_dummy) have been introduced as a result of the linking process, some (my_unitialized_global) have had their symbol type changed (here from C to B, which marks uninitialized data in the BSS section), and others remain undefined as they continue to reference external symbols. In this case, the binary we are examining is dynamically linked, and the undefined symbols are defined in the shared C library. More information regarding nm can be found in its associated man page.
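
The differences between the two listings can also be computed rather than spotted by eye. The sketch below assumes that both nm outputs have been captured as text; the function names symbol_types and compare are hypothetical helpers invented for this example, not part of any tool.

```python
import re

# One nm entry: optional hex address, one-letter type code, symbol name.
_NM_LINE = re.compile(r"^\s*(?:([0-9a-fA-F]+)\s+)?([A-Za-z])\s+(\S+)\s*$")


def symbol_types(nm_output):
    """Map each symbol name in captured nm output to its letter code."""
    types = {}
    for line in nm_output.splitlines():
        match = _NM_LINE.match(line)
        if match:
            types[match.group(3)] = match.group(2)
    return types


def compare(object_output, executable_output):
    """Summarize how symbols changed between an object file and an executable."""
    old = symbol_types(object_output)
    new = symbol_types(executable_output)
    added = sorted(set(new) - set(old))
    retyped = {name: (old[name], new[name])
               for name in old if name in new and old[name] != new[name]}
    still_undefined = sorted(name for name in new
                             if new[name] == "U" and old.get(name) == "U")
    return added, retyped, still_undefined
```

Run against excerpts of the two listings in this section, such a comparison would report frame_dummy as newly introduced and my_unitialized_global as having moved from type C to type B.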

ldd

When an executable is created, the location of any library functions referenced by that executable must be resolved. The linker has two methods for resolving calls to library functions: static linking and dynamic linking. Command-line arguments provided to the linker determine which of the two methods is used. An executable may be statically linked, dynamically linked, or both.^[10]

When static linking is requested, the linker combines an application’s object files with a copy of the required library to create an executable file. At runtime, there is no need to locate the library code because it is already contained within the executable. Advantages of static linking are that (1) it results in slightly faster function calls and (2) distribution of binaries is easier because no assumptions need be made regarding the availability of library code on users’ systems. Disadvantages of static linking include (1) larger resulting executables and (2) greater difficulty upgrading programs when library components change. Programs are more difficult to update because they must be relinked every time a library is changed. From a reverse engineering perspective, static linking complicates matters somewhat. If we are faced with the task of analyzing a statically linked binary, there is no easy way to answer the questions “Which libraries are linked into this binary?” and “Which of these functions is a library function?” Chapter 12 will discuss the challenges encountered while reverse engineering statically linked code.

Dynamic linking differs from static linking in that the linker has no need to make a copy of any required libraries. Instead, the linker simply inserts references to any required libraries (often .so or .dll files) into the final executable, usually resulting in much smaller executable files. Upgrading library code is much easier when dynamic linking is utilized. Since a single copy of a library is maintained and that copy is referenced by many binaries, replacing the single outdated library with a new version instantly updates every binary that makes use of that library. One of the disadvantages of using dynamic linking is that it requires a more complicated loading process. All of the necessary libraries must be located and loaded into memory, as opposed to loading one statically linked file that happens to contain all of the library code. Another disadvantage of dynamic linking is that vendors must distribute not only their own executable file but also all library files upon which that executable depends. Attempting to execute a program on a system that does not contain all the required library files will result in an error.

The following output demonstrates the creation of dynamically and statically linked versions of a program, the size of the resulting binaries, and the manner in which file identifies those binaries:

idabook# gcc -o ch2_example_dynamic ch2_example.c
idabook# gcc -o ch2_example_static ch2_example.c --static
idabook# ls -l ch2_example_*
-rwxr-xr-x  1 root  wheel    6017 Sep 26 11:24 ch2_example_dynamic
-rwxr-xr-x  1 root  wheel  167987 Sep 26 11:23 ch2_example_static
idabook# file ch2_example_*
ch2_example_dynamic: ELF 32-bit LSB executable, Intel 80386, version 1
        (FreeBSD), dynamically linked (uses shared libs), not stripped
ch2_example_static:  ELF 32-bit LSB executable, Intel 80386, version 1
        (FreeBSD), statically linked, not stripped
idabook#
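
For ELF files, one indicator of dynamic linking is visible in the program headers: a dynamically linked executable normally carries a PT_INTERP entry naming the runtime loader. The following is a rough sketch of that heuristic, not necessarily the test that file itself applies; the function names are hypothetical, only the ELF header fields needed to walk the program header table are read, and special cases such as statically linked position-independent executables are ignored.

```python
import struct

PT_DYNAMIC = 2  # program header type: dynamic linking information
PT_INTERP = 3   # program header type: path of the runtime loader


def program_header_types(data):
    """Return the p_type value of every program header in an ELF image."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64bit = data[4] == 2                # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"  # EI_DATA: 1 = LSB, 2 = MSB
    if is_64bit:
        (e_phoff,) = struct.unpack_from(endian + "Q", data, 0x20)
        e_phentsize, e_phnum = struct.unpack_from(endian + "HH", data, 0x36)
    else:
        (e_phoff,) = struct.unpack_from(endian + "I", data, 0x1C)
        e_phentsize, e_phnum = struct.unpack_from(endian + "HH", data, 0x2A)
    # p_type is the first 32-bit field of a program header in both classes.
    return [struct.unpack_from(endian + "I", data, e_phoff + i * e_phentsize)[0]
            for i in range(e_phnum)]


def requests_runtime_loader(data):
    """Heuristic: True when the image names an interpreter (runtime loader)."""
    return PT_INTERP in program_header_types(data)
```

A typical use would read the file with open(path, "rb").read() and pass the resulting bytes to requests_runtime_loader.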

In order for dynamic linking to function properly, dynamically linked binaries must indicate which libraries they depend on along with the specific resources that are required from each of those libraries. As a result, unlike statically linked binaries, it is quite simple to determine the libraries on which a dynamically linked binary depends. The ldd (list dynamic dependencies) utility is a simple tool used to list the dynamic libraries required by any executable. In the following example, ldd is used to determine the libraries on which the Apache web server depends:

idabook# ldd /usr/local/sbin/httpd
/usr/local/sbin/httpd:
        libm.so.4 => /lib/libm.so.4 (0x280c5000)
        libaprutil-1.so.2 => /usr/local/lib/libaprutil-1.so.2 (0x280db000)
        libexpat.so.6 => /usr/local/lib/libexpat.so.6 (0x280ef000)
        libiconv.so.3 => /usr/local/lib/libiconv.so.3 (0x2810d000)
        libapr-1.so.2 => /usr/local/lib/libapr-1.so.2 (0x281fa000)
        libcrypt.so.3 => /lib/libcrypt.so.3 (0x2821a000)
        libpthread.so.2 => /lib/libpthread.so.2 (0x28232000)
        libc.so.6 => /lib/libc.so.6 (0x28257000)
idabook#
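
Output of this kind is easy to post-process. The following sketch assumes the "name => path (address)" layout shown in the preceding listing; parse_ldd is a hypothetical helper, and lines in any other layout (which some platforms emit) are simply skipped.

```python
import re

# One dependency line: "libm.so.4 => /lib/libm.so.4 (0x280c5000)"
_LDD_LINE = re.compile(r"^\s*(\S+) => (\S+) \((0x[0-9a-fA-F]+)\)\s*$")


def parse_ldd(output):
    """Return (library name, resolved path, load address) tuples."""
    dependencies = []
    for line in output.splitlines():
        match = _LDD_LINE.match(line)
        if match:
            name, path, address = match.groups()
            dependencies.append((name, path, int(address, 16)))
    return dependencies


if __name__ == "__main__":
    sample = (
        "/usr/local/sbin/httpd:\n"
        "        libm.so.4 => /lib/libm.so.4 (0x280c5000)\n"
        "        libc.so.6 => /lib/libc.so.6 (0x28257000)\n"
    )
    for name, path, address in parse_ldd(sample):
        print(f"{name} -> {path} @ {address:#x}")
```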

The ldd utility is available on Linux and BSD systems. On OS X systems, similar functionality is available using the otool utility with the -L option: otool -L filename. On Windows systems, the dumpbin utility, part of the Visual Studio tool suite, can be used to list dependent libraries: dumpbin /dependents filename.

objdump

Whereas ldd is fairly specialized, objdump is extremely versatile. The purpose of objdump is to “display information from object files.”^[11] This is a fairly broad goal, and in order to accomplish it, objdump responds to a large number (30+) of command-line options tailored to extract various pieces of information from object files. objdump can be used to display the following data (and much more) related to object files:

Section headers: Summary information for each of the sections in the program file.
Private headers: Program memory layout information and other information required by the runtime loader, including a list of required libraries such as that produced by ldd.
Debugging information: Extracts any debugging information embedded in the program file.
Symbol information: Dumps symbol table information in a manner similar to the nm utility.
Disassembly listing: objdump performs a linear sweep disassembly of sections of the file marked as code. When disassembling x86 code, objdump can generate either AT&T or Intel syntax, and the disassembly can be captured as a text file. Such a text file is called a disassembly dead listing, and while these files can certainly be used for reverse engineering, they are difficult to navigate effectively and even more difficult to modify in a consistent and error-free manner.

objdump is available as part of the GNU binutils^[12] tool suite and can be found on Linux, FreeBSD, and Windows (via Cygwin). objdump relies on the Binary File Descriptor library (libbfd), a component of binutils, to access object files and thus is capable of parsing file formats supported by libbfd (ELF and PE among others). For ELF-specific parsing, a utility named readelf is also available. readelf offers most of the same capabilities as objdump, and the primary difference between the two is that readelf does not rely upon libbfd.
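
Each kind of data listed above corresponds to an objdump command-line option. As an illustration, the sketch below maps the categories onto the GNU objdump long options and builds the matching command line; the dictionary keys and the helper names objdump_command and run_objdump are hypothetical, and running the command assumes that objdump is installed and on the search path.

```python
import subprocess

# GNU objdump long options for the kinds of data described in this section.
OBJDUMP_OPTIONS = {
    "section headers": "--section-headers",
    "private headers": "--private-headers",
    "debugging information": "--debugging",
    "symbol information": "--syms",
    "disassembly": "--disassemble",
}


def objdump_command(path, what, intel_syntax=False):
    """Build the argument list for one objdump invocation."""
    args = ["objdump", OBJDUMP_OPTIONS[what]]
    if what == "disassembly" and intel_syntax:
        args += ["-M", "intel"]  # x86 only; the default is AT&T syntax
    return args + [path]


def run_objdump(path, what, intel_syntax=False):
    """Run objdump and return its text output (requires objdump installed)."""
    command = objdump_command(path, what, intel_syntax)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

Writing the string returned by run_objdump(path, "disassembly") to a file would yield the kind of dead listing described above.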

otool

otool is most easily described as an objdump-like utility for OS X, and it is useful for parsing information about OS X Mach-O binaries. The following listing demonstrates how otool displays the dynamic library dependencies for a Mach-O binary, thus performing a function similar to ldd.

idabook# file osx_example
osx_example: Mach-O executable ppc
idabook# otool -L osx_example
osx_example:
        /usr/lib/libstdc++.6.dylib (compatibility version 7.0.0, current version 7.4.0)
        /usr/lib/libgcc_s.1.dylib (compatibility version 1.0.0, current version 1.0.0)
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 88.1.5)

otool can be used to display information related to a file’s headers and symbol tables and to perform disassembly of the file’s code section. For more information regarding the capabilities of otool, please refer to the associated man page.

dumpbin

dumpbin is a command-line utility included with Microsoft’s Visual Studio suite of tools. Like otool and objdump, dumpbin is capable of displaying a wide range of information related to Windows PE files. The following listing shows how dumpbin displays the dynamic dependencies of the Windows calculator program in a manner similar to ldd.

$ dumpbin /dependents calc.exe
Microsoft (R) COFF/PE Dumper Version 8.00.50727.762
Copyright (C) Microsoft Corporation.  All rights reserved.

Dump of file calc.exe

File Type: EXECUTABLE IMAGE

  Image has the following dependencies:

    SHELL32.dll
    msvcrt.dll
    ADVAPI32.dll
    KERNEL32.dll
    GDI32.dll
    USER32.dll

Additional dumpbin options offer the ability to extract information from various sections of a PE binary, including symbols, imported function names, exported function names, and disassembled code. Additional information related to the use of dumpbin is available via the Microsoft Developer Network (MSDN).^[13]

c++filt

Languages that allow function overloading must have a mechanism for distinguishing among the many overloaded versions of a function since each version has the same name. The following C++ example shows the prototypes for several overloaded versions of a function named demo:

void demo(void);
void demo(int x);
void demo(double x);
void demo(int x, double y);
void demo(double x, int y);
void demo(char* str);

As a general rule, it is not possible to have two functions with the same name in an object file. In order to allow overloading, compilers derive unique names for overloaded functions by incorporating information describing the type sequence of the function arguments. The process of deriving unique names for functions with identical names is called name mangling.^[14] If we use nm to dump the symbols from the compiled version of the preceding C++ code, we might see something like the following (filtered to focus on versions of demo):

idabook# g++ -o cpp_test cpp_test.cpp
idabook# nm cpp_test | grep demo
0804843c T _Z4demoPc
08048400 T _Z4demod
08048428 T _Z4demodi
080483fa T _Z4demoi
08048414 T _Z4demoid
080483f4 T _Z4demov

The C++ standard does not specify a name-mangling scheme, leaving compiler designers to develop their own. In order to decipher the mangled variants of demo shown here, we need a tool that understands our compiler’s (g++ in this case) name-mangling scheme. This is precisely the purpose of the c++filt utility. c++filt treats each input word as if it were a mangled name and then attempts to determine the compiler that was used to generate that name. If the name appears to be a valid mangled name, it outputs the demangled version of the name. When c++filt does not recognize a word as a mangled name, it simply outputs the word with no changes.

If we pass the results of nm from the preceding example through c++filt, it is possible to recover the demangled function names, as seen here:

idabook# nm cpp_test | grep demo | c++filt
0804843c T demo(char*)
08048400 T demo(double)
08048428 T demo(double, int)
080483fa T demo(int)
08048414 T demo(int, double)
080483f4 T demo()

It is important to note that mangled names contain additional information about functions that nm does not normally provide. This information can be extremely helpful in reverse engineering situations, and in more complex cases, this extra information may include data regarding class names or function-calling conventions.
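
To make the encoding of argument types concrete, here is a toy decoder for the simplest names that g++ produces: the prefix _Z, a length-prefixed function name, and one letter per built-in parameter type, with P marking a pointer. This is a deliberately tiny sketch and not a replacement for c++filt; demangle_simple and its type table are hypothetical, and anything outside this subset (classes, namespaces, templates, const qualifiers) is returned unchanged.

```python
import re

# A few built-in type codes used by the g++ mangling scheme.
BUILTIN_TYPES = {
    "v": "void", "b": "bool", "c": "char", "i": "int",
    "l": "long", "f": "float", "d": "double",
}


def demangle_simple(mangled):
    """Decode names of the form _Z<len><name><types>; pass others through."""
    match = re.match(r"_Z([0-9]+)", mangled)
    if match is None:
        return mangled
    length = int(match.group(1))
    start = match.end()
    name = mangled[start:start + length]
    if len(name) != length:
        return mangled
    params = []
    pointers = 0  # number of pending "P" prefixes for the next type
    for code in mangled[start + length:]:
        if code == "P":
            pointers += 1
        elif code in BUILTIN_TYPES:
            params.append(BUILTIN_TYPES[code] + "*" * pointers)
            pointers = 0
        else:
            return mangled  # outside the subset handled by this sketch
    if pointers or not params:
        return mangled
    if params == ["void"]:
        params = []  # a lone "v" encodes an empty parameter list
    return f"{name}({', '.join(params)})"


if __name__ == "__main__":
    for symbol in ("_Z4demoPc", "_Z4demodi", "_Z4demov", "main"):
        print(symbol, "->", demangle_simple(symbol))
```

Even this small subset shows how the parameter types of each demo overload are recoverable from the symbol name alone.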

^[10]For more information on linking, consult John R. Levine, Linkers and Loaders (San Francisco: Morgan Kaufmann, 2000).

^[11]See http://www.sourceware.org/binutils/docs/binutils/objdump.html#objdump/.

^[12]See http://www.gnu.org/software/binutils/.

^[13]See http://msdn.microsoft.com/en-us/library/c1h23y6c(VS.71).aspx.

^[14]For an overview of name mangling, refer to http://en.wikipedia.org/wiki/Name_mangling.
