ELF symbols

A symbol is a symbolic reference to some type of data or code, such as a global variable or function. For instance, the printf() function has a symbol entry that points to it in the dynamic symbol table, .dynsym. Most shared libraries and dynamically linked executables contain two symbol tables. In the readelf -S output shown previously, you can see two sections: .dynsym and .symtab.

The .dynsym section contains the global symbols involved in dynamic linking, most notably references to symbols from an external source, such as libc functions like printf. The .symtab section contains all of the symbols in .dynsym as well as the executable's own symbols, such as global variables and the local functions that you have defined in your code. In short, .symtab contains all of the symbols, whereas .dynsym contains just the dynamic/global symbols.

So the question is: Why have two symbol tables if .symtab already contains everything that's in .dynsym? If you check out the readelf -S output of an executable, you will see that some sections are marked A (ALLOC) or WA (WRITE/ALLOC) or AX (ALLOC/EXEC). If you look at .dynsym, you will see that it is marked ALLOC, whereas .symtab has no flags.

ALLOC means that the section will be allocated at runtime and loaded into memory, and .symtab is not loaded into memory because it is not necessary for runtime. The .dynsym contains symbols that can only be resolved at runtime, and therefore they are the only symbols needed at runtime by the dynamic linker. So, while the .dynsym symbol table is necessary for the execution of dynamically linked executables, the .symtab symbol table exists only for debugging and linking purposes and is often stripped (removed) from production binaries to save space.

Let's take a look at what an ELF symbol entry looks like for 64-bit ELF files:

typedef struct {
    uint32_t      st_name;
    unsigned char st_info;
    unsigned char st_other;
    uint16_t      st_shndx;
    Elf64_Addr    st_value;
    uint64_t      st_size;
} Elf64_Sym;

Symbol entries are contained within the .symtab and .dynsym sections, which is why the sh_entsize (section header entry size) for those sections is equal to sizeof(ElfN_Sym).

st_name

The st_name contains an offset into the symbol table's string table (located in either .dynstr or .strtab), where the name of the symbol is located, such as printf.

st_value

The st_value holds the value of the symbol (either an address or offset of its location).

st_size

The st_size contains the size of the symbol, such as the size of a global function pointer, which would be 4 bytes on a 32-bit system.

st_other

This member defines the symbol visibility.

st_shndx

Every symbol table entry is defined in relation to some section. This member holds the relevant section header table index.

st_info

The st_info specifies the symbol type and binding attributes. For a complete list of these types and attributes, consult the ELF(5) man page. The symbol types start with STT, whereas the symbol bindings start with STB. A few common ones are explained in the next sections.

Symbol types

We've got the following symbol types:

STT_NOTYPE: The symbol's type is undefined
STT_FUNC: The symbol is associated with a function or other executable code
STT_OBJECT: The symbol is associated with a data object

Symbol bindings

We've got the following symbol bindings:

STB_LOCAL: Local symbols are not visible outside the object file containing their definition, such as a function declared static.
STB_GLOBAL: Global symbols are visible to all object files being combined. One file's definition of a global symbol will satisfy another file's undefined reference to the same symbol.
STB_WEAK: Similar to global binding, but with less precedence, meaning that the binding is weak and may be overridden by another symbol (with the same name) that is not marked as STB_WEAK.

There are macros for packing and unpacking the binding and type fields:

ELF32_ST_BIND(info) or ELF64_ST_BIND(info) extract a binding from an st_info value
ELF32_ST_TYPE(info) or ELF64_ST_TYPE(info) extract a type from an st_info value
ELF32_ST_INFO(bind, type) or ELF64_ST_INFO(bind, type) convert a binding and a type into an st_info value

Let's look at the symbol table for the following source code:

static inline void foochu()
{ /* Do nothing */ }

void func1()
{ /* Do nothing */ }

void _start()
{
        func1();
        foochu();
}

The following is the command to see the symbol table entries for functions foochu and func1:

ryan@alchemy:~$ readelf -s test | egrep 'foochu|func1'
     7: 080480d8     5 FUNC    LOCAL  DEFAULT    2 foochu
     8: 080480dd     5 FUNC    GLOBAL DEFAULT    2 func1

We can see that the foochu function has a value of 0x80480d8 and is a function (STT_FUNC) with a local symbol binding (STB_LOCAL). If you recall, we talked a little bit about LOCAL bindings, which mean that the symbol cannot be seen outside the object file it is defined in. That is why foochu is local: we declared it with the static keyword in our source code.

Symbols make life easier for everyone; they are a part of ELF objects for the purpose of linking, relocation, readable disassembly, and debugging. This brings me to the topic of a useful tool that I coded in 2013, named ftrace. Similar to, and in the same spirit as, ltrace and strace, ftrace traces all of the function calls made within the binary and can also show other branch instructions such as jumps. I originally designed ftrace to help in reversing binaries for which I didn't have the source code while at work. ftrace is a dynamic analysis tool. Let's take a look at some of its capabilities. We compile a binary with the following source code:

#include <stdio.h>

int func1(int a, int b, int c)
{
  printf("%d %d %d\n", a, b, c);
}

int main(void)
{
  func1(1, 2, 3);
}

Now, assuming that we don't have the preceding source code and we want to know the inner workings of the binary that it compiles into, we can run ftrace on it. First let's look at the synopsis:

ftrace [-p <pid>] [-Sstve] <prog>

The usage is as follows:

[-p]: This traces by PID
[-t]: This is for the type detection of function args
[-s]: This prints string values
[-v]: This gives a verbose output
[-e]: This gives miscellaneous ELF information (symbols, dependencies)
[-S]: This shows function calls with stripped symbols
[-C]: This completes the control flow analysis

Let's give it a try:

ryan@alchemy:~$ ftrace -s test
[+] Function tracing begins here:
PLT_call@0x400420:__libc_start_main()
LOCAL_call@0x4003e0:_init()
(RETURN VALUE) LOCAL_call@0x4003e0: _init() = 0
LOCAL_call@0x40052c:func1(0x1,0x2,0x3)  // notice values passed
PLT_call@0x400410:printf("%d %d %d\n")  // notice we see string value
1 2 3
(RETURN VALUE) PLT_call@0x400410: printf("%d %d %d\n") = 6
(RETURN VALUE) LOCAL_call@0x40052c: func1(0x1,0x2,0x3) = 6
LOCAL_call@0x400470:deregister_tm_clones()
(RETURN VALUE) LOCAL_call@0x400470: deregister_tm_clones() = 7

A clever individual might now be asking: What happens if a binary's symbol table has been stripped? That's right; you can strip a binary of its symbol table; however, a dynamically linked executable will always retain .dynsym but will discard .symtab if it is stripped, so only the imported library symbols will show up.

If a binary is compiled statically (gcc -static) or without libc linking (gcc -nostdlib), and it is then stripped with the strip command, the binary will have no symbol table at all, since the dynamic symbol table is no longer needed. ftrace behaves differently with the -S flag, which tells it to show every function call even if there is no symbol attached to it. When using the -S flag, ftrace will display function names as SUB_<address_of_function>, similar to how IDA Pro shows functions that have no symbol table reference.

Let's look at the following very simple source code:

int foo(void) {
}

void _start()
{
  foo();
  __asm__("leave");
}

The preceding source code simply calls the foo() function and exits. The reason we are using _start() instead of main() is because we compile it with the following:

gcc -nostdlib test2.c -o test2

The gcc flag -nostdlib directs the linker to omit the standard libc linking conventions and to link only the code that we have and nothing more. The default entry point is a symbol called _start. Running ftrace on the resulting binary gives the following:

ryan@alchemy:~$ ftrace ./test2
[+] Function tracing begins here:
LOCAL_call@0x400144:foo()
(RETURN VALUE) LOCAL_call@0x400144: foo() = 0

Now let's strip the symbol table and run ftrace on it again:

ryan@alchemy:~$ strip test2
ryan@alchemy:~$ ftrace -S test2
[+] Function tracing begins here:
LOCAL_call@0x400144:sub_400144()
(RETURN VALUE) LOCAL_call@0x400144: sub_400144() = 0

Notice that the foo() function has been replaced by sub_400144(), which shows that the function being called is at address 0x400144. If we look at the test2 binary before we stripped the symbols, we can see that 0x400144 is indeed where foo() is located:

ryan@alchemy:~$ objdump -d test2
test2:     file format elf64-x86-64
Disassembly of section .text:
0000000000400144 <foo>:
  400144:   55                      push   %rbp
  400145:   48 89 e5                mov    %rsp,%rbp
  400148:   5d                      pop    %rbp
  400149:   c3                      retq   

000000000040014a <_start>:
  40014a:   55                      push   %rbp
  40014b:   48 89 e5                mov    %rsp,%rbp
  40014e:   e8 f1 ff ff ff          callq  400144 <foo>
  400153:   c9                      leaveq
  400154:   5d                      pop    %rbp
  400155:   c3                      retq

In fact, to give you a really good idea of how helpful symbols can be to reverse engineers (when we have them), let's take a look at the test2 binary, this time without symbols, to demonstrate how it becomes slightly harder to read. This is primarily because branch instructions no longer have a symbol name attached to them, so analyzing the control flow becomes more tedious and requires more annotation, which some disassemblers like IDA Pro allow us to do as we go:

$ objdump -d test2
test2:     file format elf64-x86-64
Disassembly of section .text:
0000000000400144 <.text>:
  400144:   55                      push   %rbp  
  400145:   48 89 e5                mov    %rsp,%rbp
  400148:   5d                      pop    %rbp
  400149:   c3                      retq   
  40014a:   55                      push   %rbp 
  40014b:   48 89 e5                mov    %rsp,%rbp
  40014e:   e8 f1 ff ff ff          callq  0x400144
  400153:   c9                      leaveq
  400154:   5d                      pop    %rbp
  400155:   c3                      retq

The only hint about where a new function starts is the procedure prologue at the beginning of every function, unless gcc -fomit-frame-pointer has been used, in which case function boundaries become harder to identify.

This book assumes that the reader already has some knowledge of assembly language, since teaching x86 asm is not the goal of this book, but notice the procedure prologue (push %rbp followed by mov %rsp,%rbp) in the preceding listing, which helps denote the start of each function. The procedure prologue sets up the stack frame for each newly called function by backing up the base pointer on the stack and setting its value to the stack pointer's before the stack pointer is adjusted to make room for local variables. This way, variables can be referenced as fixed offsets from the address stored in the base pointer register ebp/rbp.

Now that we've gotten a grasp on symbols, the next step is to understand relocations. We will see in the next section how symbols, relocations, and sections are all closely tied together and live at the same level of abstraction within the ELF format.
