Chapter 22. Vulnerability Analysis

Before we get too far into this chapter, we need to make one thing clear: IDA is not a vulnerability discovery tool. There, we said it; what a relief! IDA seems to have attained mystical qualities in some people’s minds. All too often people seem to have the impression that merely opening a binary with IDA will reveal all the secrets of the universe, that the behavior of a piece of malware will be fully explained to them in comments automatically generated by IDA, that vulnerabilities will be highlighted in red, and that IDA will automatically generate exploit code if you right-click while standing on one foot in some obscure Easter egg–activation sequence.

While IDA is certainly a very capable tool, without a clever user sitting at the keyboard (and perhaps a handy collection of scripts and plug-ins), it is really only a disassembler/debugger. As a static-analysis tool, it can only facilitate your attempts to locate software vulnerabilities. Ultimately, whether IDA makes your search for vulnerabilities easier depends on your skills and how you apply them. Based on our experience, IDA is not the optimal tool for locating new vulnerabilities,^[186] but when used in conjunction with a debugger, it is one of the best tools available for assisting in exploit development once a vulnerability has been discovered.

Over the past several years, IDA has taken on a new role in discovering existing vulnerabilities. Initially, it may seem unusual to search for known vulnerabilities until we stop to consider exactly what is known about these vulnerabilities and exactly who knows it. In the closed-source, binary-only software world, vendors frequently release software patches without disclosing exactly what has been patched and why. By performing differential analysis between new patched versions of a piece of software and old unpatched versions of the same software, it is possible to isolate the areas that have changed within a binary. Under the assumption that these changes were made for a reason, such differential-analysis techniques help to shine a spotlight on what were formerly vulnerable code sequences. With the search thus narrowed, anyone with the requisite skills can develop an exploit for use against unpatched systems. In fact, given Microsoft’s well-known Patch Tuesday cycle of publishing updates, large numbers of security researchers prepare to sit down and do just that once every month.

Considering that entire books exist on the topic,^[187] there is no way that we can do justice to vulnerability analysis in a single chapter in a book dedicated to IDA. What we will do is assume that the reader is familiar with some of the basic concepts of software vulnerabilities, such as buffer overflows, and discuss some of the ways that IDA may be used to hunt down, analyze, and ultimately develop exploits for those vulnerabilities.

Discovering New Vulnerabilities with IDA

Vulnerability researchers take many different approaches to discovering new vulnerabilities in software. When source code is available, it may be possible to utilize any of a growing number of automated source code–auditing tools to highlight potential problem areas within a program. In many cases, such automated tools will only point out the low-hanging fruit, while discovery of deeper vulnerabilities may require extensive manual auditing.

Tools for performing automated auditing of binaries offer many of the same reporting capabilities offered by automated source-auditing tools. A clear advantage of automated binary analysis is that no access to the application source code is required. Therefore, it is possible to perform automated analysis of closed-source, binary-only programs. Veracode^[188] is an example of a company that offers a subscription-based service in which users may submit binary files for analysis by Veracode’s proprietary binary-analysis tools. While there is no guarantee that such tools can find any or all vulnerabilities within a binary, these technologies bring binary analysis within reach of the average person seeking some measure of confidence that the software she uses is free from vulnerabilities.

Whether auditing at the source or binary level, basic static-analysis techniques include auditing for the use of problematic functions such as strcpy and sprintf, auditing the use of buffers returned by dynamic memory-allocation routines such as malloc and VirtualAlloc, and auditing the handling of user-supplied input received via functions such as recv, read, fgets, and many other similar functions. Locating such calls within a database is not difficult. For example, to track down all calls to strcpy, we could perform the following steps:

1. Find the strcpy function.
2. Display all cross-references to the strcpy function by positioning the cursor on the strcpy label and then choosing View ▸ Open Subviews ▸ Cross References.
3. Visit each cross-reference and analyze the parameters provided to strcpy to determine whether a buffer overflow may be possible.

Step 3 may require a substantial amount of code and data-flow analysis to understand all potential inputs to the function call. Hopefully, the complexity of such a task is clear. Step 1, although it seems straightforward, may require a little effort on your part. Locating strcpy may be as easy as using the Jump ▸ Jump to Address command (G) and entering strcpy as the address to jump to. In Windows PE binaries or statically linked ELF binaries, this is usually all that is needed. However, with other binaries, extra steps may be required. In a dynamically linked ELF binary, using the Jump command may not take you directly to the desired function. Instead, it is likely to take you to an entry in the extern section (which is involved in the dynamic-linking process). An IDA representation of the strcpy entry in an extern section is shown here:

extern:804DECC          extrn strcpy:near     ; CODE XREF: _strcpy↑j
extern:804DECC                                ; DATA XREF: .got:off_804D5E4↑o

To confuse matters, this location does not appear to be named strcpy at all (it is, but the name is indented), and the only code cross-reference to the location is a jump cross-reference from a function that appears to be named _strcpy, while a data cross-reference is also made to this location from the .got section. The referencing function is actually named .strcpy, which is not at all obvious from the display. In this case, IDA has replaced the dot character with an underscore because IDA does not consider dots to be valid identifier characters by default. Double-clicking the code cross-reference takes us to the program’s procedure linkage table (.plt) entry for strcpy, as shown here:

.plt:08049E90 _strcpy    proc near               ; CODE XREF: decode+5F↓p
.plt:08049E90                                    ; extract_int_argument+24↓p ...
.plt:08049E90            jmp     ds:off_804D5E4
.plt:08049E90 _strcpy    endp

If instead we follow the data cross-reference, we end up at the corresponding .got entry for strcpy shown here:

.got:0804D5E4 off_804D5E4     dd offset strcpy        ; DATA XREF: _strcpy↑r

In the .got entry, we encounter another data cross-reference to the .strcpy function in the .plt section. In practice, following the data cross-references is the most reliable means of navigating from the extern section to the .plt section. In dynamically linked ELF binaries, functions are called indirectly through the procedure linkage table. Now that we have reached the .plt, we can bring up the cross-references to _strcpy (actually .strcpy) and begin to audit each call (of which there are at least two in this example).
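The walk from the extern section through the .got to the .plt can be sketched outside IDA as a small, self-contained Python model. The addresses and segment names are taken from the listings above; the two dictionaries and the helper names (first_data_xref_to, callable_address) are hypothetical stand-ins for an IDA database and are not part of any IDA API.

```python
# Toy model of the listings above: each address maps to its segment name and
# to the addresses that hold a data reference to it. These dictionaries are
# illustrative stand-ins for an IDA database, not an IDA API.
SEGMENT = {0x804DECC: "extern", 0x804D5E4: ".got", 0x8049E90: ".plt"}
DATA_XREFS_TO = {0x804DECC: [0x804D5E4], 0x804D5E4: [0x8049E90]}


def first_data_xref_to(addr):
    """Return the first address holding a data reference to addr, or None."""
    refs = DATA_XREFS_TO.get(addr, [])
    return refs[0] if refs else None


def callable_address(addr):
    """Follow extern -> .got -> .plt and return the .plt address, or None."""
    if SEGMENT.get(addr) != "extern":
        # Outside extern, only names that already live in .text are accepted.
        return addr if SEGMENT.get(addr) == ".text" else None
    for expected in (".got", ".plt"):
        addr = first_data_xref_to(addr)
        if addr is None or SEGMENT.get(addr) != expected:
            return None
    return addr


print(hex(callable_address(0x804DECC)))  # 0x8049e90
```

The model returns None as soon as a hop lands in an unexpected segment, so a binary with a different layout fails visibly rather than yielding a misleading address.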

This process can become tedious when we have a list of several common functions whose calls we wish to locate and audit. At this point it may be useful to develop a script that can automatically locate and comment all interesting function calls for us. With comments in place, we can perform simple searches to move from one audit location to another. The foundation for such a script is a function that can reliably locate another function so that we can locate all cross-references to that function. With the understanding of ELF binaries gained in the preceding discussion, the IDC function in Example 22-1 takes a function name as an input argument and returns an address suitable for cross-reference iteration.

Example 22-1. Finding a function’s callable address

static getFuncAddr(fname) {
   auto func = LocByName(fname);
   if (func != BADADDR) {
      auto seg = SegName(func);
      //what segment did we find it in?
      if (seg == "extern") { //Likely an ELF if we are in "extern"
         //First (and only) data xref should be from got
         func = DfirstB(func);
         if (func != BADADDR) {
            seg = SegName(func);
            if (seg != ".got") return BADADDR;
            //Now, first (and only) data xref should be from plt
            func = DfirstB(func);
            if (func != BADADDR) {
               seg = SegName(func);
               if (seg != ".plt") return BADADDR;
            }
         }
      }
      else if (seg != ".text") {
         //otherwise, if the name was not in the .text section, then we
         // don't have an algorithm for finding it automatically
         func = BADADDR;
      }
   }
   return func;
}

Using the supplied return address, it is now possible to track down all of the references to any function whose use we want to audit. The IDC function in Example 22-2 leverages the getFuncAddr function from the preceding example to obtain a function address and add comments at all calls to the function.

Example 22-2. Flagging calls to a designated function

static flagCalls(fname) {
   auto func, xref;
   //get the callable address of the named function
   func = getFuncAddr(fname);
   if (func != BADADDR) {
      //Iterate through calls to the named function, and add a comment
      //at each call
      for (xref = RfirstB(func); xref != BADADDR; xref = RnextB(func, xref)) {
         if (XrefType() == fl_CN || XrefType() == fl_CF) {
            MakeComm(xref, "*** AUDIT HERE ***");
         }
      }
      //Iterate through data references to the named function, and add a
      //comment at each reference
      for (xref = DfirstB(func); xref != BADADDR; xref = DnextB(func, xref)) {
         if (XrefType() == dr_O) {
            MakeComm(xref, "*** AUDIT HERE ***");
         }
      }
   }
}

Once the desired function’s address has been located, two loops are used to iterate over cross-references to the function. In the first loop, a comment is inserted at each location that calls the function of interest. In the second loop, additional comments are inserted at each location that takes the address of the function (use of an offset cross-reference type). The second loop is required in order to track down calls of the following style:

.text:000194EA                 mov     esi, ds:strcpy
.text:000194F0                 push    offset loc_40A006
.text:000194F5                 add     edi, 160h
.text:000194FB                 push    edi
.text:000194FC                 call    esi

In this example, the compiler has cached the address of the strcpy function in the ESI register in order to make use of a faster means of calling strcpy later in the program. The call instruction shown here is faster to execute because it is both smaller (2 bytes) and requires no additional operations to resolve the target of the call, since the address is already held in the ESI register. A compiler may choose to generate this type of code when one function makes several calls to another function.

Given the indirect nature of the call in this example, the flagCalls function in our example may see only the data cross-reference to strcpy while failing to see the call to strcpy because the call instruction does not reference strcpy directly. In practice, however, IDA possesses the capability to perform some limited data-flow analysis in cases such as these and is likely to generate the disassembly shown here:

.text:000194EA                 mov     esi, ds:strcpy
.text:000194F0                 push    offset loc_40A006
.text:000194F5                 add     edi, 160h
.text:000194FB                 push    edi
.text:000194FC                 call    esi ; strcpy

Note that the call instruction has been annotated with a comment indicating which function IDA believes is being called. In addition to inserting the comment, IDA adds a code cross-reference from the point of the call to the function being called. This benefits the flagCalls function, because in this case the call instruction will be found and annotated via a code cross-reference.
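To make the idea of limited data-flow analysis concrete, the following Python sketch resolves the target of call esi by scanning backward for the nearest write to the register. This is not IDA's actual algorithm, which is not described here; it is a minimal illustration under stated assumptions. The instruction tuples mirror the listing above, while the WRITES_FIRST_OPERAND set and the resolve_indirect_call helper are hypothetical names, and the linear scan ignores control flow entirely.

```python
# Toy instruction stream modeled on the listing above:
# (address, mnemonic, operands).
LISTING = [
    (0x194EA, "mov",  ("esi", "ds:strcpy")),
    (0x194F0, "push", ("offset loc_40A006",)),
    (0x194F5, "add",  ("edi", "160h")),
    (0x194FB, "push", ("edi",)),
    (0x194FC, "call", ("esi",)),
]

# Mnemonics whose first operand is overwritten (a deliberately tiny subset).
WRITES_FIRST_OPERAND = {"mov", "add", "sub", "lea", "xor", "pop"}


def resolve_indirect_call(listing, call_index):
    """Return the symbol loaded into the call's register, or None."""
    _, mnem, ops = listing[call_index]
    if mnem != "call":
        return None
    reg = ops[0]
    # Walk backward to the nearest instruction that writes the register.
    for _, mnem, ops in reversed(listing[:call_index]):
        if mnem in WRITES_FIRST_OPERAND and ops[0] == reg:
            # Only a plain load from a named location can be followed.
            if mnem == "mov" and ops[1].startswith("ds:"):
                return ops[1][len("ds:"):]
            return None
    return None


print(resolve_indirect_call(LISTING, 4))  # strcpy
```

Any intervening write that is not a simple load, such as an add to ESI, makes the sketch give up and return None, which reflects why this kind of analysis is described as limited.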

To finish up our example script, we need a main function that invokes flagCalls for all of the functions that we are interested in auditing. A simple example to annotate calls to some of the functions mentioned earlier in this section is shown here:

static main() {
   flagCalls("strcpy");
   flagCalls("strcat");
   flagCalls("sprintf");
   flagCalls("gets");
}

After running this script, we can move from one interesting call to the next by searching for the inserted comment text, *** AUDIT HERE ***. Of course this still leaves a lot of work to be done from an analysis perspective, since the mere fact that a program calls strcpy does not make that program exploitable. This is where data-flow analysis comes into play. In order to understand whether a particular call to strcpy is exploitable or not, you must determine what parameters are being passed in to strcpy and evaluate whether those parameters can be manipulated to your advantage or not.

Data-flow analysis is a far more complex task than simply finding calls to problem functions. In order to track the flow of data in a static-analysis environment, a thorough understanding of the instruction set being used is required. Your static-analysis tools need to understand where registers may have been assigned values and how those values may have changed and propagated to other registers. Further, your tools need a means for determining the sizes of source and destination buffers being referenced within the program, which in turn requires the ability to understand the layout of stack frames and global variables as well as the ability to deduce the size of dynamically allocated memory blocks. And, of course, all of this is being attempted without actually running the program.

An interesting example of what can be accomplished with creative scripting comes in the form of the BugScam^[189] scripts created by Halvar Flake. BugScam utilizes techniques similar to the preceding examples to locate calls to problematic functions and takes the additional step of performing rudimentary data-flow analysis at each function call. The result of BugScam’s analysis is an HTML report of potential problems in a binary. A sample report table generated as a result of a sprintf analysis is shown here:

Address	Severity	Description
8048c03	5	The maximum expansion of the data appears to be larger than the target buffer; this might be the cause of a buffer overrun! Maximum Expansion: 1053. Target Size: 1036.

In this case, BugScam was able to determine the size of the input and output buffers, which, when combined with the format specifiers contained in the format string, were used to determine the maximum size of the generated output.
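The kind of calculation described here can be illustrated with a short Python sketch that bounds the output of a format string. It is not BugScam's implementation; it is a simplified model that assumes only %s, %d, and %% conversions, a 32-bit int, and known sizes for the buffers behind each %s argument. The function name max_expansion and the example format string and sizes are hypothetical.

```python
def max_expansion(fmt, string_arg_sizes):
    """Upper bound on the bytes sprintf writes, including the trailing NUL.

    string_arg_sizes lists, in order, the size of the buffer behind each %s
    argument; a buffer of n bytes holds at most n - 1 characters.
    """
    sizes = iter(string_arg_sizes)
    total = 1  # terminating NUL
    i = 0
    while i < len(fmt):
        if fmt[i] != "%":
            total += 1  # literal character copied as-is
            i += 1
            continue
        spec = fmt[i + 1:i + 2]
        if spec == "%":
            total += 1              # "%%" emits a single percent sign
        elif spec == "s":
            total += next(sizes) - 1
        elif spec == "d":
            total += 11             # widest 32-bit value: "-2147483648"
        else:
            raise ValueError("unsupported conversion: %" + spec)
        i += 2
    return total


# Hypothetical call: sprintf(target, "user=%s id=%d", name, id) where name
# points to a 1,024-byte buffer and target is 1,036 bytes long.
needed = max_expansion("user=%s id=%d", [1024])
print(needed, needed > 1036)  # 1044 True
```

When the bound exceeds the size of the target buffer, as in this hypothetical case, the call becomes a candidate for closer manual review rather than a confirmed overflow.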

Developing scripts of this nature requires an in-depth understanding of various exploit classes in order to develop an algorithm that can be applied generically across a large body of binaries. Lacking such knowledge, we can still develop scripts (or plug-ins) that answer simple questions for us faster than we can find the answers manually.

As a final example, consider the task of locating all functions that contain stack-allocated buffers, since these are the functions that might be susceptible to stack-based buffer-overflow attacks. Rather than manually scrolling through a database, we can develop a script to analyze the stack frame of each function, looking for variables that occupy large amounts of space. The Python function in Example 22-3 iterates through the defined members of a given function’s stack frame in search of variables whose size is larger than a specified minimum size.

Example 22-3. Scanning for stack-allocated buffers

def findStackBuffers(func_addr, minsize):
   prev_idx = -1
   frame = GetFrame(func_addr)
   if frame == -1: return   #bad function
   idx = 0
   prev = None
   while idx < GetStrucSize(frame):
      member = GetMemberName(frame, idx)
      if member is not None:
         if prev_idx != -1:
            #compute distance from previous field to current field
            delta = idx - prev_idx
            if delta >= minsize:
               Message("%s: possible buffer %s: %d bytes\n" % \
                       (GetFunctionName(func_addr), prev, delta))
         prev_idx = idx
         prev = member
         idx = idx + GetMemberSize(frame, idx)
      else:
         idx = idx + 1

This function locates all the variables in a stack frame using repeated calls to GetMemberName for all valid offsets within the stack frame. The size of a variable is computed as the difference between the starting offsets of two successive variables. If the size meets or exceeds a threshold size (minsize), then the variable is reported as a possible stack buffer. The index into the structure is moved along by either 1 byte when no member is defined at the current offset or by the size of any member found at the current offset. The GetMemberSize function may seem like a more suitable choice for computing the size of each stack variable; however, this is true only if the variable has been sized properly by either IDA or the user. Consider the following stack frame:

.text:08048B38 sub_8048B38     proc near
.text:08048B38
.text:08048B38 var_818         = byte ptr -818h
.text:08048B38 var_418         = byte ptr -418h
.text:08048B38 var_C           = dword ptr -0Ch
.text:08048B38 arg_0           = dword ptr  8

Using the displayed byte offsets, we can compute that there are 1,024 bytes from the start of var_818 to the start of var_418 (818h - 418h = 400h) and 1,036 bytes between the start of var_418 and the start of var_C (418h - 0Ch). However, the stack frame might be expanded to show the following layout:

-00000818 var_818         db ?
-00000817                 db ? ; undefined
-00000816                 db ? ; undefined
...
-0000041A                 db ? ; undefined
-00000419                 db ? ; undefined
-00000418 var_418         db 1036 dup(?)
-0000000C var_C           dd ?

Here, var_418 has been collapsed into an array, while var_818 appears to be only a single byte (with 1,023 undefined bytes filling the space between var_818 and var_418). For this stack layout, GetMemberSize will report 1 byte for var_818 and 1,036 bytes for var_418, which is an undesirable result. A call to findStackBuffers(0x08048B38, 16) produces the following output, regardless of whether var_818 is defined as a single byte or an array of 1,024 bytes:

sub_8048B38: possible buffer var_818: 1024 bytes
sub_8048B38: possible buffer var_418: 1036 bytes
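The claim that the result does not depend on how var_818 is sized can be checked outside IDA with a toy model of the frame. The sketch below reimplements the offset-difference logic of Example 22-3 over a plain dictionary instead of IDA's frame API. The names find_stack_buffers, make_frame, and FRAME_SIZE are hypothetical, offsets are counted from the lowest address in the frame, and the layout assumes IDA's usual 4-byte saved-register (" s") and return-address (" r") members between var_C and arg_0.

```python
def find_stack_buffers(members, frame_size, minsize):
    """Report (name, size) for variables whose span is at least minsize.

    members maps a frame offset to (name, declared_size). The reported size
    is the distance to the next named member, not the declared size.
    """
    found = []
    prev_idx, prev = -1, None
    idx = 0
    while idx < frame_size:
        member = members.get(idx)
        if member is not None:
            name, size = member
            if prev_idx != -1 and idx - prev_idx >= minsize:
                found.append((prev, idx - prev_idx))
            prev_idx, prev = idx, name
            idx += size   # skip over the member's declared extent
        else:
            idx += 1      # undefined byte: advance one offset at a time
    return found


def make_frame(var_818_size):
    """Model of the frame above; only the declared size of var_818 varies."""
    return {
        0x000: ("var_818", var_818_size),
        0x400: ("var_418", 1036),
        0x80C: ("var_C", 4),
        0x818: (" s", 4),     # assumed saved-register slot
        0x81C: (" r", 4),     # assumed return-address slot
        0x820: ("arg_0", 4),
    }


FRAME_SIZE = 0x824

for declared in (1, 1024):
    print(find_stack_buffers(make_frame(declared), FRAME_SIZE, 16))
```

Both iterations print the same two entries, var_818 at 1,024 bytes and var_418 at 1,036 bytes, because the span is measured between member start offsets rather than taken from the declared size.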

Creating a main function that iterates through all functions in a database (see Chapter 15) and calls findStackBuffers for each function yields a script that quickly points out the use of stack buffers within a program. Of course, determining whether any of those buffers can be overflowed requires additional (usually manual) study of each function. The tedious nature of static analysis is precisely the reason that fuzz testing is so popular.

^[186]In general, far more vulnerabilities are discovered through fuzz testing than through static analysis.

^[187]For example, see Jon Erickson’s Hacking: The Art of Exploitation, 2nd Edition (http://nostarch.com/hacking2.htm).

^[188]See http://www.veracode.com/.

^[189]See http://www.sourceforge.net/projects/bugscam/.
