Chapter 9. Cross-References and Graphing

Some of the more common questions asked while reverse engineering a binary are along the lines of “Where is this function called from?” and “What functions access this data?” These and other similar questions seek to catalog the references to and from various resources in a program. Two examples serve to show the usefulness of such questions.

Consider the case in which you have located a function containing a stack-allocated buffer that can be overflowed, possibly leading to exploitation of the program. Since the function may be buried deep within a complex application, your next step might be to determine exactly how the function can be reached. The function is useless to you unless you can get it to execute. This leads to the question “What functions call this vulnerable function?” as well as additional questions regarding the nature of the data that those functions may pass to the vulnerable function. This line of reasoning must continue as you work your way back up potential call chains to find one that you can influence to properly exploit the overflow that you have discovered.

In another case, consider a binary that contains a large number of ASCII strings, at least one of which you find suspicious, such as “Executing Denial of Service attack!” Does the presence of this string indicate that the binary actually performs a Denial of Service attack? No, it simply indicates that the binary happens to contain that particular ASCII sequence. You might infer that the message is displayed somehow just prior to launching an attack; however, you need to find the related code in order to verify your suspicions. Here the answer to the question “Where is this string referenced?” would help you to quickly track down the program location(s) that make use of the string. From those locations, you may then be able to find any actual Denial of Service attack code.

IDA helps to answer these types of questions through its extensive cross-referencing features. IDA provides a number of mechanisms for displaying and accessing cross-reference data, including graph-generation capabilities that provide a highly visual representation of the relationships between code and data. In this chapter we discuss the types of cross-reference information that IDA makes available, the tools for accessing cross-reference data, and how to interpret that data.

Cross-References

We begin our discussion by noting that cross-references within IDA are often referred to simply as xrefs. Within this text, we will use xref only where it is used to refer to the content of an IDA menu item or dialog. In all other cases we will stick to the term cross-reference.

There are two basic categories of cross-references in IDA: code cross-references and data cross-references. Within each category, we will detail several different types of cross-references. Associated with each cross-reference is the notion of a direction. All cross-references are made from one address to another address. The from and to addresses may be either code or data addresses. If you are familiar with graph theory, you may choose to think of addresses as nodes in a directed graph and cross-references as the edges in that graph. Figure 9-1 provides a quick refresher on graph terminology. In this simple graph, three nodes are connected by two directed edges.

Figure 9-1. Basic graph components

Note that nodes may also be referred to as vertices. Directed edges are drawn using arrows to indicate the allowed direction of travel across the edge. In Figure 9-1, it is possible to travel from the upper node to either of the lower nodes, but it is not possible to travel from either of the lower nodes to the upper node.
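The graph described above can be sketched in a few lines of Python. This is an illustration only: the node labels and the edges dictionary are hypothetical names chosen for this sketch and are not part of IDA.

```python
# Directed graph: each key is a "from" node, each value is the set of "to" nodes.
# The labels are hypothetical names for the three nodes of Figure 9-1.
edges = {
    "upper": {"lower_left", "lower_right"},
    "lower_left": set(),
    "lower_right": set(),
}

def can_travel(src, dst):
    """Return True if a directed edge leads from src to dst."""
    return dst in edges.get(src, set())

print(can_travel("upper", "lower_left"))   # True
print(can_travel("lower_left", "upper"))   # False
```

A cross-reference behaves like one of these edges: it records a from address and a to address, and it can be followed in only one direction.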

Code cross-references are a very important concept, as they facilitate IDA’s generation of control flow graphs and function call graphs, each of which we discuss later in the chapter.

Before we dive into the details of cross-references, it is useful to understand how IDA displays cross-reference information in a disassembly listing. Figure 9-2 shows the header line for a disassembled function (sub_401000) containing a cross-reference as a regular comment (right side of the figure).

Figure 9-2. A basic cross-reference

The text CODE XREF indicates that this is a code cross-reference rather than a data cross-reference (DATA XREF). An address follows, _main+2A in this case, indicating the address from which the cross-reference originates. Note that this is a more descriptive form of address than .text:0040154A, for example. While both forms represent the same program location, the format used in the cross-reference offers the additional information that the cross-reference is being made from within the function named _main, specifically 0x2A (42) bytes into the _main function. An up or down arrow will always follow the address, indicating the relative direction to the referencing location. In Figure 9-2, the down arrow indicates that _main+2A lies at a higher address than sub_401000, and thus you would need to scroll down to reach it. Similarly, an up arrow indicates that a referencing location lies at a lower memory address, requiring that you scroll up to reach it. Finally, every cross-reference comment contains a single-character suffix to identify the type of cross-reference that is being made. Each suffix is described later as we detail all of IDA’s cross-reference types.
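To make the parts of a cross-reference comment concrete, the following sketch splits a reference of the form just described into its name, offset, direction, and type suffix. The function name parse_xref and the regular expression are assumptions made for this illustration, and the p suffix on the sample string is assumed as well; IDA does not expose such a function.

```python
import re

# One reference such as "_main+2A↓p": a name, an optional hexadecimal offset,
# a direction arrow, and a one-character type suffix.
XREF_RE = re.compile(
    r"^(?P<name>[^+↑↓]+)(?:\+(?P<off>[0-9A-Fa-f]+))?(?P<arrow>[↑↓])(?P<type>[a-z])$"
)

def parse_xref(text):
    """Split a cross-reference string into its components (hypothetical helper)."""
    m = XREF_RE.match(text.strip())
    if m is None:
        raise ValueError("not a cross-reference: %r" % text)
    return {
        "name": m.group("name"),
        "offset": int(m.group("off") or "0", 16),
        # A down arrow means the referencing location lies at a higher address.
        "source_is_higher": m.group("arrow") == "↓",
        "type": m.group("type"),
    }

ref = parse_xref("_main+2A↓p")
print(ref["name"], ref["offset"], ref["source_is_higher"])  # _main 42 True
```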

Code Cross-References

A code cross-reference is used to indicate that an instruction transfers or may transfer control to another instruction. The manner in which instructions transfer control is referred to as a flow within IDA. IDA distinguishes among three basic flow types: ordinary, jump, and call. Jump and call flows are further divided according to whether the target address is a near or far address. Far addresses are encountered only in binaries that make use of segmented addresses. In the discussion that follows, we make use of the disassembled version of the following program:

int read_it;            //integer variable read in main
int write_it;           //integer variable written 3 times in main
int ref_it;             //integer variable whose address is taken in main

void callflow() {}      //function called twice from main

int main() {
   int *p = &ref_it;    //results in an "offset" style data reference
   *p = read_it;        //results in a "read" style data reference
   write_it = *p;       //results in a "write" style data reference
   callflow();          //results in a "call" style code reference
   if (read_it == 3) {  //results in a "jump" style code reference
      write_it = 2;     //results in a "write" style data reference
   }
   else {               //results in a "jump" style code reference
      write_it = 1;     //results in a "write" style data reference
   }
   callflow();          //results in a "call" style code reference
}

The program contains operations that will exercise all of IDA’s cross-referencing features, as noted in the comment text.

An ordinary flow is the simplest flow type, and it represents sequential flow from one instruction to another. This is the default execution flow for all nonbranching instructions such as ADD. There are no special display indicators for ordinary flows other than the order in which instructions are listed in the disassembly. If instruction A has an ordinary flow to instruction B, then instruction B will immediately follow instruction A in the disassembly listing. In the following listing, every instruction other than the jmp at 00401048 and the retn at 0040105E has an associated ordinary flow to its immediate successor:

Example 9-1. Cross-reference sources and targets

.text:00401010 _main           proc near
.text:00401010
.text:00401010 p               = dword ptr -4
.text:00401010
.text:00401010                 push    ebp
.text:00401011                 mov     ebp, esp
.text:00401013                 push    ecx
.text:00401014                 mov     [ebp+p], offset ref_it
.text:0040101B                 mov     eax, [ebp+p]
.text:0040101E                 mov     ecx, read_it
.text:00401024                 mov     [eax], ecx
.text:00401026                 mov     edx, [ebp+p]
.text:00401029                 mov     eax, [edx]
.text:0040102B                 mov     write_it, eax
.text:00401030                 call    callflow
.text:00401035                 cmp     read_it, 3
.text:0040103C                 jnz     short loc_40104A
.text:0040103E                 mov     write_it, 2
.text:00401048                 jmp     short loc_401054
.text:0040104A ; -------------------------------------------------------------
.text:0040104A
.text:0040104A loc_40104A:                         ; CODE XREF: _main+2C↑j
.text:0040104A                 mov     write_it, 1
.text:00401054
.text:00401054 loc_401054:                         ; CODE XREF: _main+38↑j
.text:00401054                 call    callflow
.text:00401059                 xor     eax, eax
.text:0040105B                 mov     esp, ebp
.text:0040105D                 pop     ebp
.text:0040105E                 retn
.text:0040105E _main           endp

Instructions used to invoke functions, such as the x86 call instructions at 00401030 and 00401054, are assigned a call flow, indicating transfer of control to the target function. In most cases, an ordinary flow is also assigned to call instructions, as most functions return to the location that follows the call. If IDA believes that a function does not return (as determined during the analysis phase), then calls to that function will not have an ordinary flow assigned. Call flows are noted by the display of cross-references at the target function (the destination address of the flow). The resulting disassembly of the callflow function is shown here:

.text:00401000 callflow        proc near               ; CODE XREF: _main+20↓p
.text:00401000                                         ; _main:loc_401054↓p
.text:00401000                 push    ebp
.text:00401001                 mov     ebp, esp
.text:00401003                 pop     ebp
.text:00401004                 retn
.text:00401004 callflow        endp

In this example, two cross-references are displayed at the address of callflow to indicate that the function is called twice. The address displayed in the cross-references is displayed as an offset into the calling function unless the calling address has an associated name, in which case the name is used. Both forms of addresses are used in the cross-references shown here. Cross-references resulting from function calls are distinguished through use of the p suffix (think P for Procedure).
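The naming rule just described can be illustrated with a short sketch. The helper format_source and its parameters are hypothetical; the addresses are the ones shown in Example 9-1, where _main begins at 00401010 and the second call site carries the name loc_401054.

```python
def format_source(addr, func_name, func_start, names):
    """Describe addr by its name if it has one, otherwise as function+offset."""
    if addr in names:
        return "%s:%s" % (func_name, names[addr])
    return "%s+%X" % (func_name, addr - func_start)

MAIN_START = 0x401010                 # start of _main in Example 9-1
names = {0x401054: "loc_401054"}      # named locations inside _main

print(format_source(0x401030, "_main", MAIN_START, names))  # _main+20
print(format_source(0x401054, "_main", MAIN_START, names))  # _main:loc_401054
```

Both results match the source addresses in the two cross-references displayed at callflow.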

A jump flow is assigned to each unconditional and conditional branch instruction. Conditional branches are also assigned ordinary flows to account for control flow when the branch is not taken. Unconditional branches have no associated ordinary flow because the branch is always taken in such cases. The dashed line break at 0040104A is a display device used to indicate that an ordinary flow does not exist between two adjacent instructions. Jump flows are associated with jump-style cross-references displayed at the target of the jump, as shown at loc_40104A and loc_401054. As with call-style cross-references, jump cross-references display the address of the referring location (the source of the jump). Jump cross-references are distinguished by the use of a j suffix (think J for Jump).
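The rules for the three flow types can be summarized in a small sketch. This is a simplified model under stated assumptions: the function flows is hypothetical, it looks only at the x86 mnemonic, and it treats any mnemonic beginning with j (other than jmp) as a conditional branch. IDA's actual analysis is considerably more involved.

```python
def flows(mnemonic, callee_returns=True):
    """Return the set of flow types leaving an instruction (simplified model)."""
    m = mnemonic.lower()
    if m in ("ret", "retn"):
        return set()                      # no successor within the function
    if m == "jmp":
        return {"jump"}                   # always taken, so no ordinary flow
    if m == "call":
        # The ordinary flow is dropped when the callee is believed not to return.
        return {"call", "ordinary"} if callee_returns else {"call"}
    if m.startswith("j"):
        return {"jump", "ordinary"}       # conditional branch may fall through
    return {"ordinary"}                   # nonbranching instruction

for name in ("mov", "jnz", "jmp", "call", "retn"):
    print(name, sorted(flows(name)))
```

Applied to Example 9-1, only jmp and retn come back without an ordinary flow, which is why the dashed line break follows the jmp.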

Data Cross-References

Data cross-references are used to track the manner in which data is accessed within a binary. Data cross-references can be associated with any byte in an IDA database that is associated with a virtual address (in other words, data cross-references are never associated with stack variables). The three most commonly encountered types of data cross-references are used to indicate when a location is being read, when a location is being written, and when the address of a location is being taken. The global variables associated with the previous example program are shown here, as they provide several examples of data cross-references.

.data:0040B720 read_it       dd ?                    ; DATA XREF: _main+E↑r
.data:0040B720                                       ; _main+25↑r
.data:0040B724 write_it      dd ?                    ; DATA XREF: _main+1B↑w
.data:0040B724                                       ; _main+2E↑w ...
.data:0040B728 ref_it        db    ? ;               ; DATA XREF: _main+4↑o
.data:0040B729               db    ? ;
.data:0040B72A               db    ? ;
.data:0040B72B               db    ? ;

A read cross-reference is used to indicate that the contents of a memory location are being accessed. Read cross-references can originate only from an instruction address but may refer to any program location. The global variable read_it is read at 0040101E and 00401035 in Example 9-1. The associated cross-reference comments shown in this listing indicate exactly which locations in main are referencing read_it and are recognizable as read cross-references based on the use of the r suffix. The first read performed on read_it is a 32-bit read into the ECX register, which leads IDA to format read_it as a dword (dd). In general IDA takes as many cues as it possibly can in order to determine the size and/or type of variables based on how they are accessed and how they are used as parameters to functions.

The global variable write_it is written at 0040102B, 0040103E, and 0040104A in Example 9-1. Associated write cross-references are generated and displayed as comments for the write_it variable, indicating the program locations that modify the contents of the variable. Write cross-references utilize the w suffix. Here again, IDA has determined the size of the variable based on the fact that the 32-bit EAX register is copied into write_it. Note that the list of cross-references displayed at write_it terminates with an ellipsis (...), indicating that the number of cross-references to write_it exceeds the current display limit for cross-references. This limit can be modified through the Number of displayed xrefs setting on the Cross-references tab in the Options ▸ General dialog. As with read cross-references, write cross-references can originate only from a program instruction but may reference any program location. Generally speaking, a write cross-reference that targets a program instruction byte is indicative of self-modifying code, which is usually considered bad form and is frequently encountered in the de-obfuscation routines used in malware.
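The effect of the display limit can be reproduced with a short sketch. The function render_xrefs is a hypothetical helper written for this illustration; the addresses are the three writes to write_it in Example 9-1, and the limit of 2 mirrors the display shown above.

```python
def render_xrefs(sources, target, func_name, func_start, suffix, limit=2):
    """Format up to `limit` references to target; mark truncation with '...'."""
    shown = []
    for src in sources[:limit]:
        arrow = "↑" if src < target else "↓"   # direction toward the source
        shown.append("%s+%X%s%s" % (func_name, src - func_start, arrow, suffix))
    if shown and len(sources) > limit:
        shown[-1] += " ..."
    return shown

writes = [0x40102B, 0x40103E, 0x40104A]        # writes to write_it in _main
for line in render_xrefs(writes, 0x40B724, "_main", 0x401010, "w"):
    print(line)
# _main+1B↑w
# _main+2E↑w ...
```

The third write, at _main+3A, is the one hidden behind the ellipsis.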

The third type of data cross-reference, an offset cross-reference, indicates that the address of a location is being used (rather than the content of the location). The address of global variable ref_it is taken at 00401014 in Example 9-1, resulting in the offset cross-reference comment at ref_it in the previous listing (suffix o). Offset cross-references are commonly the result of pointer operations either in code or in data. Array access operations, for example, are typically implemented by adding an offset to the starting address of the array. As a result, the first address in most global arrays can often be recognized by the presence of an offset cross-reference. For this reason, most string data (strings being arrays of characters in C/C++) is the target of offset cross-references.

Unlike read and write cross-references, which can originate only from instruction locations, offset cross-references can originate from either instruction locations or data locations. An example of an offset that can originate from a program’s data section is any table of pointers (such as a vtable) that results in the generation of an offset cross-reference from each location within the table to the location being pointed to by those locations. You can see this if you examine the vtable for class SubClass from Chapter 8, whose disassembly is shown here:

.rdata:00408148 off_408148 dd offset SubClass::vfunc1(void) ; DATA XREF: SubClass::SubClass(void)+12↑o
.rdata:0040814C            dd offset BaseClass::vfunc2(void)
.rdata:00408150            dd offset SubClass::vfunc3(void)
.rdata:00408154            dd offset BaseClass::vfunc4(void)
.rdata:00408158            dd offset SubClass::vfunc5(void)

Here you see that the address of the vtable is used in the function SubClass::SubClass(void), which is the class constructor. The header lines for function SubClass::vfunc3(void), shown here, show the offset cross-reference that links the function to a vtable.

.text:00401080 public: virtual void __thiscall SubClass::vfunc3(void) proc near
.text:00401080                                      ; DATA XREF: .rdata:00408150↓o

This example demonstrates one of the characteristics of C++ virtual functions that becomes quite obvious when combined with offset cross-references, namely that C++ virtual functions are never called directly and should never be the target of a call cross-reference. Instead, all C++ virtual functions should be referred to by at least one vtable entry and should always be the target of at least one offset cross-reference. Remember that overriding a virtual function is not mandatory. Therefore, a virtual function can appear in more than one vtable, as discussed in Chapter 8. Backtracking offset cross-references is one technique for easily locating C++ vtables in a program’s data section.
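One way to picture the backtracking technique is to collect the data locations that are the source of offset cross-references and group those that sit exactly one pointer apart. The sketch below does this for the five vtable slots shown above. The function group_pointer_tables is a hypothetical helper, and the 4-byte pointer size is an assumption that fits this 32-bit example.

```python
def group_pointer_tables(slot_addrs, ptr_size=4):
    """Group addresses spaced exactly ptr_size apart into candidate tables."""
    tables = []
    for addr in sorted(slot_addrs):
        if tables and addr == tables[-1][-1] + ptr_size:
            tables[-1].append(addr)       # extends the current table
        else:
            tables.append([addr])         # starts a new candidate table
    return tables

# Source address of each offset cross-reference -> function it points to.
offset_xrefs = {
    0x408148: "SubClass::vfunc1",
    0x40814C: "BaseClass::vfunc2",
    0x408150: "SubClass::vfunc3",
    0x408154: "BaseClass::vfunc4",
    0x408158: "SubClass::vfunc5",
}

for table in group_pointer_tables(offset_xrefs):
    print("candidate table at %08X with %d entries" % (table[0], len(table)))
# candidate table at 00408148 with 5 entries
```

A run of adjacent function pointers is only a candidate; other pointer tables, such as jump tables, would produce the same pattern.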

Cross-Reference Lists

With an understanding of what cross-references are, we can now discuss the manner in which you may access all of this data within IDA. As mentioned previously, the number of cross-reference comments that can be displayed at a given location is limited by a configuration setting that defaults to 2. As long as the number of cross-references to a location does not exceed this limit, then working with those cross-references is fairly straightforward. Mousing over the cross-reference text displays the disassembly of the source region in a tool tip–style display, while double-clicking the cross-reference address jumps the disassembly window to the source of the cross-reference.

There are two methods for viewing the complete list of cross-references to a location. The first method is to open a cross-references subview associated with a specific address. By positioning the cursor on an address that is the target of one or more cross-references and selecting View ▸ Open Subviews ▸ Cross-References, you can open the complete list of cross-references to a given location, as shown in Figure 9-3, which shows the complete list of cross-references to variable write_it.

Figure 9-3. Cross-reference display window

The columns of the window indicate the direction (Up or Down) to the source of the cross-reference, the type of cross-reference (using the type suffixes discussed previously), the source address of the cross-reference, and the corresponding disassembled text at the source address, including any comments that may exist at the source address. As with other windows that display lists of addresses, double-clicking any entry repositions the disassembly display to the corresponding source address. Once opened, the cross-reference display window remains open and accessible via a title tab displayed along with every other open subview’s title tab above the disassembly area.

The second way to access a list of cross-references is to highlight a name that you are interested in learning about and choose Jump ▸ Jump to xref (hotkey ctrl-X) to open a dialog that lists every location that references the selected symbol. The resulting dialog, shown in Figure 9-4, is nearly identical in appearance to the cross-reference subview shown in Figure 9-3. In this case, the dialog was activated using the ctrl-X hotkey with the first instance of write_it (.text:0040102B) selected.

Figure 9-4. Jump to cross-reference dialog

The primary difference in the two displays is behavioral. Being a modal dialog,^[52] the display in Figure 9-4 has buttons to interact with and terminate the dialog. The primary purpose of this dialog is to select a referencing location and jump to it. Double-clicking one of the listed locations dismisses the dialog and repositions the disassembly window at the selected location. The second difference between the dialog and the cross-reference subview is that the former can be opened using a hotkey or context-sensitive menu from any instance of a symbol, while the latter can be opened only when you position the cursor on an address that is the target of a cross-reference and choose View ▸ Open Subviews ▸ Cross-References. Another way of thinking about it is that the dialog can be opened at the source of any cross-reference, while the subview can be opened only at the destination of the cross-reference.

An example of the usefulness of cross-reference lists might be to rapidly locate every location from which a particular function is called. Many people consider the use of the C strcpy^[53] function to be dangerous. Using cross-references, locating every call to strcpy is as simple as finding any one call to strcpy, using the ctrl-X hotkey to bring up the cross-reference dialog, and working your way through every call cross-reference. If you don’t want to take the time to find strcpy used somewhere in the binary, you can even get away with adding a comment with the text strcpy in it and activating the cross-reference dialog using the comment.^[54]

Function Calls

A specialized cross-reference listing dealing exclusively with function calls is available by choosing View ▸ Open Subviews ▸ Function Calls. Figure 9-5 shows the resulting dialog, which lists all locations that call the current function (as defined by the cursor location at the time the view is opened) in the upper half of the window and all calls made by the current function in the lower half of the window.

Figure 9-5. Function calls window

Here again, each listed cross-reference can be used to quickly reposition the disassembly listing to the corresponding cross-reference location. Restricting ourselves to considering function call cross-references allows us to think about more abstract relationships than simple mappings from one address to another and instead consider how functions relate to one another. In the next section, we show how IDA takes advantage of this by providing several types of graphs, all designed to assist you in interpreting a binary.
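The two halves of the Function Calls window can be modeled as two filters over the same set of call cross-references. In this sketch the record layout and the function function_calls are hypothetical; the two call sites are the ones in Example 9-1.

```python
# Each record: (address of the call instruction, calling function, called function)
call_xrefs = [
    (0x401030, "_main", "callflow"),
    (0x401054, "_main", "callflow"),
]

def function_calls(func):
    """Return (call sites that call func, calls made from within func)."""
    incoming = [site for site, caller, callee in call_xrefs if callee == func]
    outgoing = [(site, callee) for site, caller, callee in call_xrefs if caller == func]
    return incoming, outgoing

incoming, outgoing = function_calls("callflow")
print([hex(site) for site in incoming])  # ['0x401030', '0x401054']
print(outgoing)                          # []
```

Viewed from _main instead, the incoming list is empty and the outgoing list holds both calls to callflow.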

^[52]A modal dialog must be closed before you can continue normal interaction with the underlying application. Modeless dialogs can remain open while you continue normal interaction with the application.

^[53]The C strcpy function copies a source array of characters, up to and including the associated null termination character, to a destination array, with no checks whatsoever that the destination array is large enough to hold all of the characters from the source.

^[54]When a symbol name appears in a comment, IDA treats that symbol just as if it was an operand in a disassembled instruction. Double-clicking the symbol repositions the disassembly window, and the right-click context-sensitive menu becomes available.