IDC Scripting Examples

At this point it is probably useful to see some examples of scripts that perform specific tasks. For the remainder of the chapter we present some fairly common situations in which a script can be used to answer a question about a database.

Enumerating Functions

Many scripts operate on individual functions. Examples include generating the call tree rooted at a specific function, generating the control flow graph of a function, or analyzing the stack frames of every function in a database. Example 15-1 iterates through every function in a database and prints basic information about each function, including the start and end addresses of the function, the size of the function’s arguments, and the size of the function’s local variables. All output is sent to the output window.

Example 15-1. Function enumeration script

#include <idc.idc>
static main() {
   auto addr, end, args, locals, frame, firstArg, name, ret;
   addr = 0;
   for (addr = NextFunction(addr); addr != BADADDR; addr = NextFunction(addr)) {
      name = Name(addr);
      end = GetFunctionAttr(addr, FUNCATTR_END);
      locals = GetFunctionAttr(addr, FUNCATTR_FRSIZE);
      frame = GetFrame(addr);     // retrieve a handle to the function's stack frame
      ret = GetMemberOffset(frame, " r");  // " r" is the name of the return address
      if (ret == -1) continue;
      firstArg = ret + 4;
      args = GetStrucSize(frame) - firstArg;
      Message("Function: %s, starts at %x, ends at %x\n", name, addr, end);
      Message("   Local variable area is %d bytes\n", locals);
      Message("   Arguments occupy %d bytes (%d args)\n", args, args / 4);
   }
}

This script uses some of IDC’s structure-manipulation functions to obtain a handle to each function’s stack frame (GetFrame), determine the size of the stack frame (GetStrucSize), and determine the offset of the saved return address within the frame (GetMemberOffset). On 32-bit x86, which this script assumes, the first argument to the function lies 4 bytes beyond the saved return address. The size of the function’s argument area is computed as the space between the first argument and the end of the stack frame. Since IDA can’t generate stack frames for imported functions, the script tests whether a function’s stack frame contains a saved return address as a simple means of identifying imported functions and skipping them.
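
The frame arithmetic is easy to check outside IDA. The Python sketch below is a standalone model rather than IDC: the frame layout, the member offsets, and the helper names (member_offset, frame_size, argument_bytes) are hypothetical, and it assumes 4-byte return addresses and arguments, as Example 15-1 does.

```python
# Hypothetical 32-bit x86 stack frame, modeled as (offset, name, size) members.
# " r" is the saved return address and " s" the saved registers, following
# IDA's naming; every offset and size below is made up for illustration.
FRAME = [
    (0x00, "var_10", 4),
    (0x04, "var_C", 12),
    (0x10, " s", 4),
    (0x14, " r", 4),
    (0x18, "arg_0", 4),
    (0x1C, "arg_4", 4),
]

def member_offset(frame, name):
    """Return the offset of the named member, or -1 if it is absent."""
    for offset, member, _size in frame:
        if member == name:
            return offset
    return -1

def frame_size(frame):
    """Return the total frame size: the end of the highest member."""
    return max(offset + size for offset, _name, size in frame)

def argument_bytes(frame, pointer_size=4):
    """Bytes of arguments above the saved return address, or None."""
    ret = member_offset(frame, " r")
    if ret == -1:
        return None              # no saved return address: skip this frame
    first_arg = ret + pointer_size
    return frame_size(frame) - first_arg

args = argument_bytes(FRAME)
print("Arguments occupy %d bytes (%d args)" % (args, args // 4))
```

For this made-up frame the sketch reports 8 bytes of arguments, that is, two 4-byte arguments.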

Enumerating Instructions

Within a given function, you may want to enumerate every instruction. Example 15-2 counts the number of instructions contained in the function identified by the current cursor position:

Example 15-2. Instruction enumeration script

#include <idc.idc>
static main() {
   auto func, end, count, inst;
   func = GetFunctionAttr(ScreenEA(), FUNCATTR_START);
   if (func != -1) {
      end = GetFunctionAttr(func, FUNCATTR_END);
      count = 0;
      inst = func;
      while (inst < end && inst != BADADDR) {  // stop if FindCode runs out of code
         count++;
         inst = FindCode(inst, SEARCH_DOWN | SEARCH_NEXT);
      }
      Warning("%s contains %d instructions\n", Name(func), count);
   }
   else {
      Warning("No function found at location %x", ScreenEA());
   }
}

The function begins by using GetFunctionAttr to determine the start address of the function containing the cursor address (ScreenEA()). If the beginning of a function is found, the next step is to determine the end address for the function, once again using the GetFunctionAttr function. Once the function has been bounded, a loop is executed to step through successive instructions in the function by using the search functionality of the FindCode function. In this example, the Warning function is used to display results, since only a single line of output will be generated by the function and output displayed in a Warning dialog is much more obvious than output generated in the message window. Note that this example assumes that all of the instructions within the given function are contiguous. An alternative approach might replace the use of FindCode with logic to iterate over all of the code cross-references for each instruction within the function. Properly written, this second approach would handle noncontiguous, also known as “chunked,” functions.
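
To make the flow-driven idea concrete, here is a minimal sketch in Python, not IDC. It assumes the code cross-references have already been collected into a dictionary (FLOWS, a hypothetical stand-in for what IDC's cross-reference functions would report, with call targets left out), and every address in it is invented.

```python
# Toy model of code cross-references: each instruction address maps to the
# addresses control can reach from it. Call targets are deliberately omitted
# so the walk stays inside one function. All addresses are hypothetical.
FLOWS = {
    0x1000: [0x1002],
    0x1002: [0x1004, 0x2000],   # conditional jump into a separate chunk
    0x1004: [0x1006],
    0x1006: [],                 # return instruction: no successors
    0x2000: [0x2003],
    0x2003: [0x1004],           # jump back into the main chunk
}

def count_instructions(flows, start):
    """Count the instructions reachable from start by following flow edges."""
    seen = set()
    worklist = [start]
    while worklist:
        addr = worklist.pop()
        if addr in seen or addr not in flows:
            continue            # already visited, or not a known instruction
        seen.add(addr)
        worklist.extend(flows[addr])
    return len(seen)

print(count_instructions(FLOWS, 0x1000))
```

An address-driven loop over 0x1000 through 0x1006 would see only four of these six instructions, because the chunk at 0x2000 lies outside that range.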

Enumerating Cross-References

Iterating through cross-references can be confusing because of the number of functions available for accessing cross-reference data and the fact that code cross-references are bidirectional. In order to get the data you want, you need to make sure you are accessing the proper type of cross-reference for your situation. In our first cross-reference example, shown in Example 15-3, we derive the list of all function calls made within a function by iterating through each instruction in the function to determine if the instruction calls another function. One method of doing this might be to parse the results of GetMnem to look for call instructions. This approach has two drawbacks. First, it would not be very portable, because the instruction used to call a function varies among CPU types. Second, additional parsing would be required to determine exactly which function was being called. Cross-references avoid both difficulties because they are CPU-independent and directly inform us about the target of the cross-reference.

Example 15-3. Enumerating function calls

#include <idc.idc>
static main() {
  auto func, end, target, inst, name, flags, xref;
  flags = SEARCH_DOWN | SEARCH_NEXT;
  func = GetFunctionAttr(ScreenEA(), FUNCATTR_START);
  if (func != -1) {
    name = Name(func);
    end = GetFunctionAttr(func, FUNCATTR_END);
    for (inst = func; inst < end && inst != BADADDR; inst = FindCode(inst, flags)) {
      for (target = Rfirst(inst); target != BADADDR; target = Rnext(inst, target)) {
        xref = XrefType();
        if (xref == fl_CN || xref == fl_CF) {
          Message("%s calls %s from 0x%x\n", name, Name(target), inst);
        }
      }
    }
  }
  else {
    Warning("No function found at location %x", ScreenEA());
  }
}

In this example, we must iterate through each instruction in the function. For each instruction, we must then iterate through each cross-reference from the instruction. We are interested only in cross-references that call other functions, so we must test the return value of XrefType looking for fl_CN or fl_CF-type cross-references. Here again, this particular solution handles only functions whose instructions happen to be contiguous. Given that the script is already iterating over the cross-references from each instruction, it would not take many changes to produce a flow-driven analysis instead of the address-driven analysis seen here.

Another use for cross-references is to determine every location that references a particular location. For example, if we wanted to create a low-budget security analyzer, we might be interested in highlighting all calls to functions such as strcpy and sprintf.

DANGEROUS FUNCTIONS

The C functions strcpy and sprintf are generally acknowledged as dangerous to use because they allow for unbounded copying into destination buffers. While each may be safely used by programmers who conduct proper checks on the size of source and destination buffers, such checks are all too often forgotten by programmers unaware of the dangers of these functions. The strcpy function, for example, is declared as follows:

char *strcpy(char *dest, const char *source);

The strcpy function’s defined behavior is to copy all characters up to and including the first null termination character encountered in the source buffer to the given destination buffer (dest). The fundamental problem is that there is no way to determine, at runtime, the size of any array. In this instance, strcpy has no means to determine whether the capacity of the destination buffer is sufficient to hold all of the data to be copied from source. Such unchecked copy operations are a major cause of buffer overflow vulnerabilities.

In Example 15-4, we work in reverse, iterating across all of the cross-references to (as opposed to from, as in the preceding example) a particular symbol:

Example 15-4. Enumerating a function’s callers

#include <idc.idc>
static list_callers(bad_func) {
   auto func, addr, xref, source;
   func = LocByName(bad_func);
   if (func == BADADDR) {
      Warning("Sorry, %s not found in database", bad_func);
   }
   else {
      for (addr = RfirstB(func); addr != BADADDR; addr = RnextB(func, addr)) {
         xref = XrefType();
         if (xref == fl_CN || xref == fl_CF) {
            source = GetFunctionName(addr);
            Message("%s is called from 0x%x in %s\n", bad_func, addr, source);
         }
      }
   }
}
static main() {
   list_callers("_strcpy");
   list_callers("_sprintf");
}

In this example, the LocByName function is used to find the address of a bad function given its name. If the function’s address is found, a loop is executed in order to process all cross-references to the bad function. For each cross-reference, if the cross-reference type is determined to be a call-type cross-reference, the calling function’s name is determined and displayed to the user.

It is important to note that some modifications may be required to perform a proper lookup of the name of an imported function. In ELF executables in particular, which combine a procedure linkage table (PLT) with a global offset table (GOT) to handle the details of linking to shared libraries, the names that IDA assigns to imported functions may be less than clear. For example, a PLT entry may appear to be named _memcpy, when in fact it is named .memcpy and IDA has replaced the dot with an underscore because IDA considers dots invalid characters within names. Further complicating matters is the fact that IDA may actually create a symbol named memcpy that resides in a section that IDA names extern. When attempting to enumerate cross-references to memcpy, we are interested in the PLT version of the symbol because this is the version that is called from other functions in the program and thus the version to which all cross-references would refer.

Enumerating Exported Functions

In Chapter 13 we discussed the use of idsutils to generate .ids files that describe the contents of shared libraries. Recall that the first step in generating a .ids file involves generating a .idt file, which is a text file containing descriptions of each exported function contained in the library. IDC contains functions for iterating through the functions that are exported by a shared library. The script shown in Example 15-5 can be run to generate a .idt file after opening a shared library with IDA:

Example 15-5. A script to generate .idt files

#include <idc.idc>
static main() {
   auto entryPoints, i, ord, addr, name, purged, file, fd;
   file = AskFile(1, "*.idt", "Select IDT save file");
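   fd = fopen(file, "w");
   if (fd == 0) {   // dialog was canceled or the file could not be opened
      Warning("Unable to open output file");
      return;
   }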
   entryPoints = GetEntryPointQty();
   fprintf(fd, "ALIGNMENT 4\n");
   fprintf(fd, "0 Name=%s\n", GetInputFile());
   for (i = 0; i < entryPoints; i++) {
      ord = GetEntryOrdinal(i);
      if (ord == 0) continue;
      addr = GetEntryPoint(ord);
      if (ord == addr) {
         continue; //entry point has no ordinal
      }
      name = Name(addr);
      fprintf(fd, "%d Name=%s", ord, name);
      purged = GetFunctionAttr(addr, FUNCATTR_ARGSIZE);
      if (purged > 0) {
         fprintf(fd, " Pascal=%d", purged);
      }
      fprintf(fd, "\n");
   }
   fclose(fd);   // flush and close the .idt file
}

The output of the script is saved to a file chosen by the user. New functions introduced in this script include GetEntryPointQty, which returns the number of symbols exported by the library; GetEntryOrdinal, which returns an ordinal number (an index into the library’s export table); GetEntryPoint, which returns the address associated with an exported function that has been identified by ordinal number; and GetInputFile, which returns the name of the file that was loaded into IDA.
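
The line format that Example 15-5 writes can be previewed without IDA. The following Python sketch only mirrors the script's fprintf calls; the function name idt_lines, the library name, and the export table are all hypothetical, and the third tuple field stands in for the value the script reads with FUNCATTR_ARGSIZE.

```python
def idt_lines(library, exports):
    """Build .idt text lines from (ordinal, name, purged_bytes) tuples."""
    lines = ["ALIGNMENT 4", "0 Name=%s" % library]
    for ordinal, name, purged in exports:
        if ordinal == 0:
            continue                       # mirrors the script: skip ordinal 0
        line = "%d Name=%s" % (ordinal, name)
        if purged > 0:
            line += " Pascal=%d" % purged  # emitted only for a nonzero size
        lines.append(line)
    return lines

# Hypothetical export table for an imaginary library.
EXPORTS = [
    (1, "InitWidget", 8),
    (2, "FreeWidget", 0),
    (0, "ignored", 0),
]

print("\n".join(idt_lines("widget.dll", EXPORTS)))
```

With this sample table the sketch prints four lines: the ALIGNMENT header, the library name entry, and one entry for each of the two exports with a nonzero ordinal, of which only InitWidget carries a Pascal field.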

Finding and Labeling Function Arguments

Versions of GCC later than 3.4 use mov statements rather than push statements in x86 binaries to place function arguments into the stack before calling a function. Occasionally this causes some analysis problems for IDA (newer versions of IDA handle this situation better), because the analysis engine relies on finding push statements to pinpoint locations at which arguments are pushed for a function call. The following listing shows an IDA disassembly when parameters are pushed onto the stack:

.text:08048894                 push    0               ; protocol
.text:08048896                 push    1               ; type
.text:08048898                 push    2               ; domain
.text:0804889A                 call    _socket

Note the comments that IDA has placed in the right margin. Such commenting is possible only when IDA recognizes that parameters are being pushed and when IDA knows the signature of the function being called. When mov statements are used to place parameters onto the stack, the resulting disassembly is somewhat less informative, as shown here:

.text:080487AD                 mov     [esp+8], 0
.text:080487B5                 mov     [esp+4], 1
.text:080487BD                 mov     [esp], 2
.text:080487C4                 call    _socket

In this case, IDA has failed to recognize that the three mov statements preceding the call are being used to set up the parameters for the function call. As a result, we get less assistance from IDA in the form of automatic comments in the disassembly.

Here we have a situation where a script might be able to restore some of the information that we are accustomed to seeing in our disassemblies. Example 15-6 is a first effort at automatically recognizing instructions that are setting up parameters for function calls:

Example 15-6. Automating parameter recognition

#include <idc.idc>
static main() {
  auto addr, op, end, idx;
  auto func_flags, type, val, search;
  search = SEARCH_DOWN | SEARCH_NEXT;
  addr = GetFunctionAttr(ScreenEA(), FUNCATTR_START);
  func_flags = GetFunctionFlags(addr);
  if (func_flags & FUNC_FRAME) {  //Is this an ebp-based frame?
    end = GetFunctionAttr(addr, FUNCATTR_END);
    for (; addr < end && addr != BADADDR; addr = FindCode(addr, search)) {
      type = GetOpType(addr, 0);
      if (type == 3) {  //Is this a register indirect operand?
        if (GetOperandValue(addr, 0) == 4) {   //Is the register esp?
          MakeComm(addr, "arg_0");  //[esp] equates to arg_0
        }
      }
      else if (type == 4) {  //Is this a register + displacement operand?
        idx = strstr(GetOpnd(addr, 0), "[esp"); //Is the register esp?
        if (idx != -1) {
          val = GetOperandValue(addr, 0);   //get the displacement
          MakeComm(addr, form("arg_%d", val));  //add a comment
        }
      }
    }
  }
}

The script works only on EBP-based frames and relies on the fact that when parameters are moved into the stack prior to a function call, GCC generates memory references relative to esp. The script iterates through all instructions in a function; for each instruction that writes to a memory location using esp as a base register, the script determines the depth within the stack and adds a comment indicating which parameter is being moved. The GetFunctionFlags function offers access to various flags associated with a function, such as whether the function uses an EBP-based stack frame. Running the script in Example 15-6 yields the annotated disassembly shown here:

.text:080487AD                 mov     [esp+8], 0   ; arg_8
.text:080487B5                 mov     [esp+4], 1   ; arg_4
.text:080487BD                 mov     [esp], 2     ; arg_0
.text:080487C4                 call    _socket
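
The mapping from operand to comment can also be tried on operand text alone. The Python sketch below is a rough model, not part of the script: it parses the operand string with a regular expression instead of calling GetOpType and GetOperandValue, the names ESP_OPERAND and arg_comment are hypothetical, and it assumes displacements are printed either in decimal or in hexadecimal with a trailing h.

```python
import re

# Matches "[esp]", "[esp+N]" with decimal N, or "[esp+NNh]" with hex digits.
ESP_OPERAND = re.compile(
    r"^\[esp(?:\+(?:([0-9][0-9A-Fa-f]*)h|([0-9]+)))?\]$"
)

def arg_comment(operand):
    """Return the "arg_N" comment for an esp-relative operand, else None."""
    match = ESP_OPERAND.match(operand)
    if match is None:
        return None                      # not an esp-based memory operand
    hex_digits, dec_digits = match.group(1), match.group(2)
    if hex_digits is not None:
        disp = int(hex_digits, 16)
    elif dec_digits is not None:
        disp = int(dec_digits, 10)
    else:
        disp = 0                         # plain [esp] is the first slot
    return "arg_%d" % disp               # decimal, like the script's form() call

for operand in ("[esp+8]", "[esp+4]", "[esp]", "[ebp+var_4]"):
    print(operand, "->", arg_comment(operand))
```

Like the script, the sketch labels the slot with the displacement in decimal, so an operand such as [esp+0Ch] would be tagged arg_12.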

The comments aren’t particularly informative. However, we can now tell at a glance that the three mov statements are used to place parameters onto the stack, which is a step in the right direction. By extending the script a bit further and exploring some more of IDC’s capabilities, we can come up with a script that provides almost as much information as IDA does when it properly recognizes parameters. The output of the final product is shown here:

.text:080487AD                 mov     [esp+8], 0   ; int protocol
.text:080487B5                 mov     [esp+4], 1   ; int type
.text:080487BD                 mov     [esp], 2     ; int domain
.text:080487C4                 call    _socket

The extended version of the script in Example 15-6, which is capable of incorporating data from function signatures into comments, is available on this book’s website.^[103]

Emulating Assembly Language Behavior

There are a number of reasons why you might need to write a script that emulates the behavior of a program you are analyzing. For example, the program you are studying may be self-modifying, as many malware programs are, or the program may contain some encoded data that gets decoded when it is needed at runtime. Without running the program and pulling the modified data out of the running process’s memory, how can you understand the behavior of the program? The answer may lie with an IDC script. If the decoding process is not terribly complex, you may be able to quickly write an IDC script that performs the same actions that are performed by the program when it runs. Using a script to decode data in this way eliminates the need to run a program when you don’t know what the program does or you don’t have access to a platform on which you can run the program. An example of the latter case might occur if you were examining a MIPS binary with your Windows version of IDA. Without any MIPS hardware, you would not be able to execute the MIPS binary and observe any data decoding it might perform. You could, however, write an IDC script to mimic the behavior of the binary and make the required changes within the IDA database, all with no need for a MIPS execution environment.

The following x86 code was extracted from a DEFCON^[104] Capture the Flag binary.^[105]

.text:08049EDE                 mov     [ebp+var_4], 0
.text:08049EE5
.text:08049EE5 loc_8049EE5:
.text:08049EE5                 cmp     [ebp+var_4], 3C1h
.text:08049EEC                 ja      short locret_8049F0D
.text:08049EEE                 mov     edx, [ebp+var_4]
.text:08049EF1                 add     edx, 804B880h
.text:08049EF7                 mov     eax, [ebp+var_4]
.text:08049EFA                 add     eax, 804B880h
.text:08049EFF                 mov     al, [eax]
.text:08049F01                 xor     eax, 4Bh
.text:08049F04                 mov     [edx], al
.text:08049F06                 lea     eax, [ebp+var_4]
.text:08049F09                 inc     dword ptr [eax]
.text:08049F0B                 jmp     short loc_8049EE5

This code decodes a private key that has been embedded within the program binary. Using the IDC script shown in Example 15-7, we can extract the private key without running the program:

Example 15-7. Emulating assembly language with IDC

auto var_4, edx, eax, al;
var_4 = 0;
while (var_4 <= 0x3C1) {
   edx = var_4;
   edx = edx + 0x804B880;
   eax = var_4;
   eax = eax + 0x804B880;
   al = Byte(eax);
   al = al ^ 0x4B;
   PatchByte(edx, al);
   var_4++;
}

Example 15-7 is a fairly literal translation of the preceding assembly language sequence, generated according to the following rather mechanical rules:

For each stack variable and register used in the assembly code, declare an IDC variable.
For each assembly language statement, write an IDC statement that mimics its behavior.
Reading and writing stack variables is emulated by reading and writing the corresponding variable declared in your IDC script.
Reading from a nonstack location is accomplished using the Byte, Word, or Dword function, depending on the amount of data being read (1, 2, or 4 bytes).
Writing to a nonstack location is accomplished using the PatchByte, PatchWord, or PatchDword function, depending on the amount of data being written.
In general, if the code appears to contain a loop for which the termination condition is not immediately obvious, it is easiest to begin with an infinite loop such as while (1) {} and then insert a break statement when you encounter statements that cause the loop to terminate.
When the assembly code calls functions, things get complicated. In order to properly simulate the behavior of the assembly code, you must find a way to mimic the behavior of the function that has been called, including providing a return value that makes sense within the context of the code being simulated. This fact alone may preclude the use of IDC as a tool for emulating the behavior of an assembly language sequence.

The important thing to understand when developing scripts such as the previous one is that it is not absolutely necessary to fully understand how the code you are emulating behaves on a global scale. It is often sufficient to understand only one or two instructions at a time and generate correct IDC translations for those instructions. If each instruction has been correctly translated into IDC, then the script as a whole should properly mimic the complete functionality of the original assembly code. We can delay further study of the assembly language algorithm until after the IDC script has been completed, at which point we can use the IDC script to enhance our understanding of the underlying assembly. Once we spend some time considering how our example algorithm works, we might shorten the preceding IDC script to the following:

auto var_4, addr;
for (var_4 = 0; var_4 <= 0x3C1; var_4++) {
   addr = 0x804B880 + var_4;
   PatchByte(addr, Byte(addr) ^ 0x4B);
}

As an alternative, if we did not wish to modify the database in any way, we could replace the call to PatchByte with a call to Message if we were dealing with ASCII data, or write the data to a file if we were dealing with binary data.
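
The same decoding can be done entirely outside the database once the encoded bytes have been saved somewhere. The following Python sketch illustrates that nondestructive route under stated assumptions: the sample data is invented (it is not the key from the binary), the function name decode is hypothetical, and only the XOR key 0x4B and the loop bound 0x3C1 come from the listing above.

```python
KEY = 0x4B
LENGTH = 0x3C1 + 1   # the loop runs for var_4 = 0 through 0x3C1 inclusive

def decode(encoded, key=KEY):
    """Return a decoded copy of encoded; the input is left untouched."""
    return bytes(b ^ key for b in encoded)

# Stand-in for the bytes at 0x804B880: a made-up message encoded with KEY.
plain = b"-----BEGIN EXAMPLE-----"
encoded = bytes(b ^ KEY for b in plain)

# Decode at most LENGTH bytes, as the original loop does.
decoded = decode(encoded[:LENGTH])
print(decoded.decode("ascii"))
```

Because XOR with a fixed key is its own inverse, running decode twice returns the original bytes, which is a quick sanity check on the translation.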

^[103]See http://www.idabook.com/ch15_examples.

^[104]See http://www.defcon.org/.

^[105]Courtesy of Kenshoto, the organizers of CTF at DEFCON 15. Capture the Flag is an annual hacking competition held at DEFCON.
