As mentioned earlier in this chapter (see Opcode Obfuscation), some of the most sophisticated obfuscators reimplement the program they receive as input using a custom byte code and an associated virtual machine. When confronting a binary obfuscated in this manner, the only native code that you might see is the virtual machine itself. Even assuming you recognize that you are looking at a software virtual machine, developing a complete understanding of all of this native code generally fails to reveal the true purpose of the obfuscated program, because the behavior of the program remains buried in the embedded byte code that the virtual machine must interpret. To fully understand the program, you must first locate all of the embedded byte code and then reverse engineer the instruction set of the virtual machine so that you can properly interpret the meaning of that byte code.
By way of comparison, imagine that you knew nothing whatsoever about Java, and someone handed you a Java virtual machine and a .class file containing compiled byte code and asked you what they did. Lacking any documentation, you could make little sense of the byte code file, and you would need to fully reverse the virtual machine to learn both the structure of a .class file and how to interpret its contents. With an understanding of the byte code machine language, you could then proceed to understanding the .class file.
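To make the two steps concrete, here is a minimal Python sketch of a toy stack-based virtual machine and a matching disassembler. The opcode values, mnemonics, and the `PROGRAM` bytes are all invented for illustration; they are not the encodings used by HyperUnpackMe2, VMProtect, or the Java virtual machine. The point is only that the same opcode table, once recovered, serves both to interpret and to disassemble the embedded byte code.

```python
import struct

# Hypothetical instruction set, invented for illustration:
# opcode -> (mnemonic, number of operand bytes)
OPCODES = {
    0x01: ("push", 4),  # push a 32-bit little-endian immediate
    0x02: ("add", 0),   # pop two values, push their 32-bit sum
    0x03: ("xor", 0),   # pop two values, push their XOR
    0xFF: ("halt", 0),  # stop; the top of the stack is the result
}

def decode(code, pc):
    """Decode one instruction at offset pc.

    Returns (mnemonic, operand, length). Raises KeyError for an
    opcode that is not in the (hypothetical) table.
    """
    mnemonic, operand_size = OPCODES[code[pc]]
    operand = None
    if operand_size == 4:
        (operand,) = struct.unpack_from("<I", code, pc + 1)
    return mnemonic, operand, 1 + operand_size

def disassemble(code):
    """Linearly decode the whole buffer into 'offset  text' lines."""
    lines = []
    pc = 0
    while pc < len(code):
        mnemonic, operand, length = decode(code, pc)
        text = mnemonic if operand is None else "%s 0x%X" % (mnemonic, operand)
        lines.append("%04X  %s" % (pc, text))
        pc += length
    return lines

def run(code):
    """Interpret the byte code the way the virtual machine would."""
    stack = []
    pc = 0
    while True:
        mnemonic, operand, length = decode(code, pc)
        pc += length
        if mnemonic == "push":
            stack.append(operand)
        elif mnemonic == "add":
            b, a = stack.pop(), stack.pop()
            stack.append((a + b) & 0xFFFFFFFF)
        elif mnemonic == "xor":
            b, a = stack.pop(), stack.pop()
            stack.append(a ^ b)
        elif mnemonic == "halt":
            return stack[-1] if stack else None

# Embedded byte code: push 0x10; push 0x20; add; halt
PROGRAM = bytes([0x01, 0x10, 0, 0, 0, 0x01, 0x20, 0, 0, 0, 0x02, 0xFF])
```

Without the `OPCODES` table, `PROGRAM` is just twelve opaque bytes; with it, `disassemble(PROGRAM)` yields a readable listing and `run(PROGRAM)` reproduces the program's behavior. Recovering the equivalent of that table from the native virtual machine code is the reverse engineering effort described above.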
VMProtect is an example of a commercial product that utilizes very sophisticated virtual machine-based obfuscation techniques. As more of an academic exercise, TheHyper’s HyperUnpackMe2 challenge binary is a fairly straightforward example of the use of virtual machines in obfuscation, the primary challenge being to locate the virtual machine’s embedded byte code program and determine the meaning of each byte code. In his article on OpenRCE describing HyperUnpackMe2,[185] Rolf Rolles explains his approach, which was to fully comprehend the virtual machine in order to build a processor module capable of disassembling its byte code. The processor module then allowed him to disassemble the byte code embedded within the challenge binary. A minor limitation of this approach is that it allows you to view either the x86 code within HyperUnpackMe2 (using IDA’s x86 module) or the virtual machine code (using Rolles’s processor module), but not both at the same time. This obligates you to create two different databases, each using a different processor module. An alternative approach takes advantage of the ability to customize existing processor modules (see Customizing Existing Processors) through the use of plug-ins, effectively allowing you to extend an instruction set to include all of the instructions of an embedded virtual machine. Applying this approach to HyperUnpackMe2 allows us to view x86 code and virtual machine code together in a single database, as shown in the following listing:
TheHyper:01013B2F                 h_pop.l  R9
TheHyper:01013B32                 h_pop.l  R7
TheHyper:01013B35                 h_pop.l  R5
TheHyper:01013B38                 h_mov.l  SP, R2
TheHyper:01013B3C                 h_sub.l  SP, 0Ch
TheHyper:01013B44                 h_pop.l  R2
TheHyper:01013B47                 h_pop.l  R1
TheHyper:01013B4A                 h_retn   0Ch
TheHyper:01013B4A sub_1013919     endp
TheHyper:01013B4A
TheHyper:01013B4A ; ----------------------------------------------------------
TheHyper:01013B4D                 dd 24242424h
TheHyper:01013B51                 dd 0A9A4285Dh      ; TAG VALUE
TheHyper:01013B55
TheHyper:01013B55 ; ============ S U B R O U T I N E =========================
TheHyper:01013B55
TheHyper:01013B55 ; Attributes: bp-based frame
TheHyper:01013B55
TheHyper:01013B55 sub_1013B55     proc near          ; DATA XREF: TheHyper:0103AF7A↓o
TheHyper:01013B55
TheHyper:01013B55 var_8           = dword ptr -8
TheHyper:01013B55 var_4           = dword ptr -4
TheHyper:01013B55 arg_0           = dword ptr  8
TheHyper:01013B55 arg_4           = dword ptr  0Ch
TheHyper:01013B55
TheHyper:01013B55                 push    ebp
TheHyper:01013B56                 mov     ebp, esp
TheHyper:01013B58                 sub     esp, 8
TheHyper:01013B5B                 mov     eax, [ebp+arg_0]
TheHyper:01013B5E                 mov     [esp+8+var_8], eax
TheHyper:01013B61                 mov     [esp+8+var_4], 0
TheHyper:01013B69                 push    4
TheHyper:01013B6B                 push    1000h
Here, the code beginning at 01013B2F is disassembled as HyperUnpackMe2 byte code, while the code that follows at 01013B55 is displayed as x86 code.
The ability to simultaneously display native code and byte code has been anticipated by Hex-Rays, which introduced custom datatypes and formats in IDA 5.7. Custom data formats are useful when IDA’s built-in formatting options fail to meet your needs. New formatting capabilities are registered by specifying (using a script or plug-in) a menu name for your format and a function to perform the formatting. Once you select a custom format for a data item, IDA will invoke your formatting function each time it needs to display that data item. Custom datatypes are useful when IDA’s built-in datatypes are not expressive enough to represent the data that you encounter in a particular binary. Custom datatypes, like custom formats, are registered using a script or a plug-in. The Hex-Rays example registers a custom datatype to designate virtual machine byte code and displays each byte code as an instruction by using a custom data format. A drawback to this approach is that it requires you to locate every virtual machine instruction and explicitly change its datatype. With a custom processor extension, by contrast, designating a single value as a virtual machine instruction automatically leads to the discovery of every instruction reachable from it, because IDA drives the disassembly process and the processor extension reports new reachable instructions via its custom_emu implementation.
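The difference between marking every instruction by hand and letting control flow drive discovery can be sketched outside of IDA with a small worklist algorithm. The following Python sketch uses a hypothetical byte code: the opcode values, instruction lengths, and one-byte absolute branch targets are assumptions made for illustration and do not reflect HyperUnpackMe2's real encodings or IDA's actual plug-in interface. Each decoded instruction reports its successors, which is roughly the role the text attributes to a custom_emu implementation.

```python
# Hypothetical encodings, invented for illustration:
# opcode -> (mnemonic, length in bytes, control-flow kind)
ISA = {
    0x10: ("h_nop", 1, "flow"),     # continues to the next instruction
    0x20: ("h_jmp", 2, "jump"),     # operand byte is an absolute target
    0x21: ("h_jz", 2, "branch"),    # target OR fall through
    0x30: ("h_retn", 1, "stop"),    # no successors
}

def discover(code, entry):
    """Return {offset: mnemonic} for every instruction reachable from entry."""
    seen = {}
    worklist = [entry]
    while worklist:
        pc = worklist.pop()
        if pc in seen or not 0 <= pc < len(code):
            continue                      # already decoded or out of range
        info = ISA.get(code[pc])
        if info is None:
            continue                      # not a known opcode: leave as data
        mnemonic, length, kind = info
        if pc + length > len(code):
            continue                      # truncated instruction
        seen[pc] = mnemonic
        if kind in ("jump", "branch"):
            worklist.append(code[pc + 1])  # queue the branch target
        if kind in ("flow", "branch"):
            worklist.append(pc + length)   # queue the fall-through successor
    return dict(sorted(seen.items()))

# offset 0: h_jz 5 | 2: h_nop | 3: h_retn | 4: data byte | 5: h_jmp 3 | 7: h_nop
CODE = bytes([0x21, 0x05, 0x10, 0x30, 0xAA, 0x20, 0x03, 0x10])
REACHABLE = discover(CODE, 0)
```

Starting from the single entry offset 0, the worklist reaches the instructions at offsets 0, 2, 3, and 5, while the data byte at offset 4 and the unreferenced byte at offset 7 are never decoded. With the custom datatype approach, each of those four locations would have to be found and marked individually.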
[185] See “Defeating HyperUnpackMe2 With an IDA Processor Module” at http://www.openrce.org/articles/full_view/28.