Igor Zhirkov, Low-Level Programming, 10.1007/978-1-4842-2403-8_2

2. Assembly Language

Igor Zhirkov¹

(1)Saint Petersburg, Russia

In this chapter we will start practicing assembly language by gradually writing more complex programs for Linux. We will observe some architecture details that impact the writing of all kinds of programs (e.g., endianness).

We have chosen a *nix system in this book because programming in assembly is much easier there than on Windows.

2.1 Setting Up the Environment

It is impossible to learn programming without trying to program. So we are going to start programming in assembly right now.

We are using the following setup in order to complete assembler and C assignments:

Debian GNU/Linux 8.0 as an operating system.
NASM 2.11.05 as an assembly language compiler.
GCC 4.9.2 as a C language compiler. This exact version is used to produce assembly from C programs. The Clang compiler can be used as well.
GNU Make 4.0 as a build system.
GDB 7.7.1 as a debugger.
The text editor you like (preferably with syntax highlighting). We advocate ViM usage.

If you want to set up your own system, install any Linux distribution you like and make sure you install the programs just listed. To our knowledge, Windows Subsystem for Linux is also well suited to do all the assignments. You can install it and then install necessary packages using apt-get. Refer to the official guide located at: https://msdn.microsoft.com/en-us/commandline/wsl/install_guide .

On the Apress web site for this book, http://www.apress.com/us/book/9781484224021 , you can find the following:

Two preconfigured virtual machines with the whole toolchain installed. One of them has a desktop environment; the other one is just the minimal system that can be accessed through SSH (Secure Shell). The installation instructions and other usage information are located in the README.txt file in the downloaded archive.
A link to the GitHub page with all the book’s listings, answers to the questions, and solutions.

2.1.1 Working with Code Examples

Throughout this chapter, you will see numerous code examples. Compile them and if you have difficulty grasping their logic, try to execute them step by step using gdb. It is a great help in studying code. See Appendix A for a quick tutorial on gdb.

Appendix D provides more information about the system used for performance tests.

2.2 Writing “Hello, world”

2.2.1 Basic Input and Output

Unix ideology postulates that “everything is a file.” A file, in a broad sense, is anything that looks like a stream of bytes. Through files one can abstract such things as

data access on a hard drive/SSD;
data exchange between programs; and
interaction with external devices.

We will follow the tradition of writing a simple “Hello, world!” program for a start. It displays a welcome message on screen and terminates. However, such a program must show characters on screen, which cannot be done directly if a program is not running on bare metal, without an operating system babysitting its activity. An operating system’s purpose is, among other things, to abstract and manage resources, and display is surely one of them. It provides a set of routines to handle communication with external devices, other programs, file systems, and so on. A program usually cannot bypass the operating system and interact directly with the resources it controls. It is limited to system calls, which are routines provided by an operating system to user applications.

Unix identifies a file with its descriptor as soon as it is opened by a program. A descriptor is nothing more than an integer value (like 42 or 999). A file is opened explicitly by invoking the open system call; however, three important files are opened as soon as a program starts and thus should not be managed manually. These are stdin, stdout, and stderr. Their descriptors are 0, 1, and 2, respectively. stdin is used to handle input, stdout to handle output, and stderr is used to output information about the program execution process but not its results (e.g., errors and diagnostics).

By default, keyboard input is linked to stdin and terminal output is linked to stdout. It means that “Hello, world!” should write into stdout.

Thus we need to invoke the write system call. It writes a given number of bytes from memory, starting at a given address, to a file with a given descriptor (in our case, 1). The bytes will encode string characters using a predefined table (the ASCII table). Each entry is a character; its index in the table is its code. ASCII itself defines codes from 0 to 127, while a byte can hold any value from 0 to 255.

See Listing 2-1 for our first complete example of an assembly program.

Listing 2-1. hello.asm

global _start

section .data
message: db 'hello, world!', 10

section .text
_start:
    mov     rax, 1           ; system call number should be stored in rax
    mov     rdi, 1           ; argument #1 in rdi: where to write (descriptor)?
    mov     rsi, message     ; argument #2 in rsi: where does the string start?
    mov     rdx, 14          ; argument #3 in rdx: how many bytes to write?
    syscall                  ; this instruction invokes a system call

This program invokes the write system call with the correct arguments: the four mov instructions load the registers, and syscall performs the call. It is really the only thing it does. The next sections will explain this sample program in greater detail.

2.2.2 Program Structure

As we remember from the von Neumann machine description, there is only one memory, for both code and data; those are indistinguishable. However, a programmer wants to separate them. An assembly program is usually divided into sections. Each section has its use: for example, .text holds instructions, .data is for global variables (data available in every moment of the program execution). One can switch back and forth between sections; in the resulting program all data, corresponding to each section, will be gathered in one place.

To get rid of numeric address values, programmers use labels. They are just readable names for addresses. They can precede any command and are usually separated from it by a colon. There are two labels in this program: message, which marks the start of the string data, and _start.

A notion of variable is typical for higher-level languages. In assembly language, in fact, notions of variables and procedures are quite subtle. It is more convenient to speak about labels (or addresses).

An assembly program can be divided into multiple files. One of them should contain the _start label. It is the entry point; it marks the first instruction to be executed.

This label should be declared global (see line 1). The meaning of it will be evident later.

Comments start with a semicolon and last until the end of the line.

Assembly language consists of commands, which are directly mapped into machine code. However, not all language constructs are commands. Others control the translation process and are usually called directives. ¹

In the “Hello, world!” example there are three directives: global, section, and db.

Note

Assembly language is, in general, case insensitive, but label names are not!

mov, mOV, Mov are all the same thing, but global _start and global _START are not! Section names are case sensitive too: section .DATA and section .data differ!

The db directive is used to create byte data. Usually data is defined using one of these directives, which differ by data format:

db—bytes;
dw—so-called words, equal to 2 bytes each;
dd—double words, equal to 4 bytes; and
dq—quad words, equal to 8 bytes.

Let’s see an example, in Listing 2-2.

Listing 2-2. data_decl.asm

section .data
   example1: db 5, 16, 8, 4, 2, 1
   example2: times 999 db 42
   example3: dw 999

times n cmd is a directive that repeats cmd n times in program code, as if you had copy-pasted it n times. It also works with central processing unit (CPU) instructions.

Note that you can create data inside any section, including .text. As we told you earlier, for a CPU data and instructions are all alike and the CPU will try to interpret data as encoded instructions when asked to.
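As a small illustration of the data directives and of times, here is a sketch of a complete program. The file name data_sizes.asm and the labels bytes, words, dwords, quads, and data_end are made up for this example. It asks the assembler to compute how many bytes the definitions occupy and returns that number as the exit code (the exit system call is explained later in this chapter).

```nasm
; data_sizes.asm (hypothetical file name)
global _start

section .data
bytes:  db 1, 2, 3               ; three values, 1 byte each
words:  dw 0x1234                ; one word, 2 bytes
dwords: dd 0x12345678            ; one double word, 4 bytes
quads:  dq 0x1122334455667788    ; one quad word, 8 bytes
data_end:                        ; marks the address right after the data

section .text
_start:
    times 3 nop                  ; expands into three consecutive nop instructions
    mov rdi, data_end - bytes    ; 3 + 2 + 4 + 8 = 17, computed by the assembler
    mov rax, 60                  ; 'exit' syscall number
    syscall
```

Assembled and linked with the same nasm and ld commands used for hello.asm, this program should terminate with exit code 17.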

These directives allow you to define several data objects one by one, as in Listing 2-3, where a sequence of characters is followed by a single byte equal to 10.

Listing 2-3. hello.asm

message: db 'hello, world!', 10

Letters, digits, and other characters are encoded in ASCII. Programmers have agreed upon a table, where each character is assigned a unique number—its ASCII-code. We start at address corresponding to the label message. We store the ASCII codes for all letters of string "hello, world!", then we add a byte equal to 10. Why 10? By convention, to start a new line we output a special character with code 10.

Terminological chaos

It is quite common to refer to the integer format most native to the computer as machine word. As we are programming a 64-bit computer, where addresses are 64-bit, general purpose registers are 64-bit, it is pretty convenient to take the machine word size as 64 bits or 8 bytes.

In assembly programming for Intel architecture the term word was indeed used to describe a 16-bit data entry, because on the older machines it was exactly the machine word. Unfortunately, for legacy reasons, it is still used as in old times. That’s why 32-bit data is called double words and 64-bit data is referred to as quad words.

2.2.3 Basic Instructions

The mov instruction is used to write a value into either a register or memory. The value can be taken from another register or from memory, or it can be an immediate one. However,

mov cannot copy data from memory to memory;
the source and the destination operands must be of the same size.

The syscall instruction is used to perform system calls in *nix systems. The input/output operations depend on hardware (which can be also used by multiple programs at the same time), so programmers are not allowed to control them directly, bypassing the operating system.

Each system call has a unique number. To perform it

The rax register has to hold system call’s number;
The following registers should hold its arguments: rdi, rsi, rdx, r10, r8, and r9.
System call cannot accept more than six arguments.
Execute syscall instruction.

It does not matter in which order the registers are initialized.

Note that the syscall instruction changes rcx and r11! We will explain the cause later. When we wrote the “Hello, world!” program we used a simple write syscall. It accepts

File descriptor;
The buffer address. We start taking consecutive bytes for writing from here;
The number of bytes to write.

To compile our first program, save the code in hello.asm ² and then launch these commands in the shell:

> nasm -felf64 hello.asm -o hello.o
> ld -o hello hello.o
> chmod u+x hello

The details of compilation process along with compilation stages will be discussed in Chapter 5. Let’s launch “Hello, world!”

> ./hello
hello, world!
Segmentation fault

We have clearly output what we wanted. However, the program seems to have caused an error. What did we do wrong? After executing a system call, the program continues its work. We did not write any instructions after syscall, but the memory cells that follow still hold some values.

Note

If you did not put anything at some memory address, assume that it holds some kind of garbage, not necessarily zeroes or any kind of valid instructions.

A processor has no idea whether these values were intended to encode instructions or not. So, following its very nature, it tries to interpret them, because rip register points at them. It is highly unlikely these values encode correct instructions, so an interrupt with code 6 will occur (invalid instruction).³

So what do we do? We have to use the exit system call, which terminates the program in a correct way, as shown in Listing 2-4.

Listing 2-4. hello_proper_exit.asm

section .data
message: db 'hello, world!', 10

section .text
global _start

_start:
    mov     rax, 1           ; 'write' syscall number
    mov     rdi, 1           ; stdout descriptor
    mov     rsi, message     ; string address
    mov     rdx, 14          ; string length in bytes
    syscall

    mov     rax, 60          ; 'exit' syscall number
    xor     rdi, rdi
    syscall

Question 11

What does instruction xor rdi, rdi do?

Question 12

What is the program return code?

Question 13

What is the first argument of the exit system call?

2.3 Example: Output Register Contents

Time to try something a bit harder. Let’s output rax value in hexadecimal format, as shown in Listing 2-5.

Listing 2-5. Print rax Value: print_rax.asm

section .data
codes:
    db      '0123456789ABCDEF'

section .text
global _start
_start:
    ; number 1122... in hexadecimal format
    mov rax, 0x1122334455667788

    mov rdi, 1
    mov rdx, 1
    mov rcx, 64
   ; Each 4 bits should be output as one hexadecimal digit
   ; Use shift and bitwise AND to isolate them
   ; the result is the offset in 'codes' array
.loop:
    push rax
    sub rcx, 4
   ; cl is a register, smallest part of rcx
   ; rax -- eax -- ax -- ah + al
   ; rcx -- ecx -- cx -- ch + cl
    sar rax, cl
    and rax, 0xf

    lea rsi, [codes + rax]
    mov rax, 1

   ; syscall leaves rcx and r11 changed
    push rcx
    syscall
    pop rcx

    pop rax
   ; test can be used for the fastest 'is it a zero?' check
   ; see docs for 'test' command
    test rcx, rcx
    jnz .loop

    mov rax, 60           ; invoke 'exit' system call
    xor rdi, rdi
    syscall

By shifting rax value and logical ANDing it with mask 0xF we transform the whole number into one of its hexadecimal digits. Each digit is a number from 0 to 15. Use it as an index and add it to the address of the label codes to get the representing character.

For example, given rax = 0x4A we will use indices 0x4 = 4₁₀ and 0xA = 10₁₀. ⁴ The first one will give us the character '4', whose code is 0x34. The second one will yield the character 'A', whose code is 0x41.

Question 14

Check that the ASCII codes mentioned in the last example are correct.

We can use a hardware stack to save and restore register values, as we do around the syscall instruction.

Question 15

What is the difference between sar and shr? Check Intel docs.

Question 16

How do you write numbers in different number systems in a way understandable to NASM? Check NASM documentation.

Note

When a program starts, the value of most registers is not well defined (it can be absolutely random). It is a great source of rookie mistakes, as one tends to assume that they are zeroed.

2.3.1 Local Labels

Notice the unusual label name .loop: it starts with a dot. This label is local. We can reuse the label names without causing name conflicts as long as they are local.

The last used dotless global label is a base one for all subsequent local labels (until the next global label occurs). The full name for .loop label is _start.loop. We can use this name to address it from anywhere in the program, even after other global labels occur.
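The following sketch shows two local labels with the same name living under different global labels. The label second_stage is invented for this example; the program only counts two registers down to zero and exits.

```nasm
global _start

section .text
_start:
    mov rcx, 3
.loop:                  ; full name: _start.loop
    dec rcx             ; dec sets the zero flag when rcx reaches 0
    jnz .loop           ; refers to _start.loop

second_stage:           ; a new global label; execution simply falls through
    mov rdx, 5
.loop:                  ; full name: second_stage.loop, no name conflict
    dec rdx
    jnz .loop           ; refers to second_stage.loop

    mov rax, 60         ; 'exit' syscall number
    xor rdi, rdi
    syscall
```

Inside second_stage, the first loop would have to be addressed by its full name, _start.loop, because a bare .loop there always means second_stage.loop.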

2.3.2 Relative Addressing

Listing 2-6 demonstrates how to address memory in a more complex way than just by immediate address.

Listing 2-6. Relative Addressing: print_rax.asm

lea rsi, [codes + rax]

Square brackets denote indirect addressing; the address is written inside them.

mov rsi, rax—copies rax into rsi
mov rsi, [rax]—copies memory contents (8 sequential bytes) starting at address, stored in rax, into rsi. How do we know that we have to copy exactly 8 bytes? As we know, mov operands are of the same size, and the size of rsi is 8 bytes. Knowing these facts, the assembler is able to deduce that exactly 8 bytes should be taken from memory.

The instructions lea and mov have a subtle difference between their meanings. lea means “load effective address.”

It allows you to calculate an address of a memory cell and store it somewhere. This is not always trivial, because there are tricky address modes (as we will see later): for example, the address can be a sum of several operands.

Listing 2-7 provides a quick demonstration of what lea and mov are doing.

Listing 2-7. lea_vs_mov.asm

; rsi <- address of label 'codes', a number
mov rsi, codes

; rsi <- memory contents starting at 'codes' address
; 8 consecutive bytes are taken because rsi is 8 bytes long
mov rsi, [codes]

; rsi <- address of 'codes'
; in this case it is equivalent of mov rsi, codes
; in general the address can contain several components
lea rsi, [codes]

; rsi <- memory contents starting at (codes+rax)
mov rsi, [codes + rax]

; rsi <- codes + rax
; equivalent of combination:
; -- mov rsi, codes
; -- add rsi, rax
; Can't do it with a single mov!
lea rsi, [codes + rax]

2.3.3 Order of Execution

All commands are executed consecutively except when special jump instructions occur. There is an unconditional jump instruction jmp addr. It can be viewed as a substitute for mov rip, addr.⁵

Conditional jumps rely on contents of rflags register. For example, jz address jumps to address only if zero flag is set.

Usually one uses either a test or a cmp instruction to set up necessary flags coupled with conditional jump instruction.

cmp subtracts the second operand from the first; it does not store the result anywhere, but it sets the appropriate flags based on it (e.g., if operands are equal, it will set zero flag). test does the same thing but uses logical AND instead of subtraction.

The example shown in Listing 2-8 writes 1 into rbx if rax < 42 (compared as signed numbers), and 0 otherwise.

Listing 2-8. jumps_example.asm

    cmp rax, 42
    jl yes
    mov rbx, 0
    jmp ex
yes:
    mov rbx, 1
ex:

A common (and fast) way to check whether a register holds zero is the test reg, reg instruction.
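A minimal sketch of this zero check, written as a complete program so it can be tried out. The choice of exit codes (1 for "rax was zero", 0 otherwise) is only an assumption made for this example.

```nasm
global _start

section .text
_start:
    mov rax, 0          ; the value under test
    xor rdi, rdi        ; rdi = 0: assume "rax is not zero"
    test rax, rax       ; rax AND rax; the zero flag is set only if rax = 0
    jnz .done           ; rax is not zero: keep rdi = 0
    mov rdi, 1          ; rax is zero: report 1
.done:
    mov rax, 60         ; 'exit' syscall number
    syscall
```

As written, the program should exit with code 1. Changing the first instruction to load any non-zero value should make it exit with code 0.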

At least two commands exist for each arithmetic flag F: jF and jnF. For example, sign flag: js and jns. Other useful commands include

ja (jump if above)/jb (jump if below) for a jump after a comparison of unsigned numbers with cmp.
jg (jump if greater)/jl (jump if less) for signed.
jae (jump if above or equal), jle (jump if less or equal) and similar.

Some common jump instructions are shown in Listing 2-9.

Listing 2-9. Jump Instructions: jumps.asm

mov rax, -1
mov rdx, 2

cmp rax, rdx
jg location
ja location           ; different logic!

cmp rax, rdx
je  location          ; if rax equals rdx
jne location          ; if rax is not equal to rdx
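To make the signed versus unsigned difference observable, here is a sketch that compares the same two values both ways and encodes the outcome in the exit code. The encoding (add 1 if "greater", add 2 if "above") is an assumption made for this example, and jbe (jump if below or equal) is the unsigned counterpart of jle.

```nasm
global _start

section .text
_start:
    mov rax, -1         ; all bits set: -1 as signed, the largest value as unsigned
    mov rdx, 2
    xor rdi, rdi        ; rdi accumulates the result

    cmp rax, rdx
    jle .not_greater    ; signed view: -1 <= 2, so this jump is taken
    add rdi, 1          ; skipped
.not_greater:
    cmp rax, rdx        ; compare again: add would have changed the flags
    jbe .not_above      ; unsigned view: rax is far above 2, jump not taken
    add rdi, 2          ; executed
.not_above:
    mov rax, 60         ; 'exit' syscall number
    syscall
```

The program should exit with code 2: the values are the same in both comparisons, only the interpretation chosen by the jump instruction differs.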

Question 17

What is the difference between je and jz?

2.4 Function Calls

Routines (functions) allow one to isolate a piece of program logic and use it as a black box. It is a necessary mechanism to provide abstraction. Abstraction allows you to build more complex systems by encapsulating complex algorithms under opaque interfaces.

Instruction call <address> is used to perform calls. It does exactly the following:

push rip
jmp <address>

The address now stored in the stack (former rip contents) is called the return address.

Any function can accept an unlimited number of arguments. The first six arguments are passed in rdi, rsi, rdx, rcx, r8, and r9, respectively. The rest are passed on the stack in reverse order.

What we consider an end to a routine is unclear. The most straightforward thing to say is that the ret instruction denotes the function end. Its semantics are fully equivalent to pop rip.

Apparently, the fragile mechanism of call and ret only works when the state of the stack is carefully managed. One should not invoke ret unless the stack is exactly in the same state as when the function started. Otherwise, the processor will take whatever is on top of the stack as a return address and use it as the new rip content, which will certainly lead to executing garbage.

Now let’s talk about how functions use registers. Obviously, executing a function can change registers. There are two types of registers.

Callee-saved registers must be restored by the procedure being called. So, if it needs to change them, it has to change them back.
These registers are callee-saved: rbx, rbp, rsp, r12-r15, a total of seven registers.
Caller-saved registers should be saved before invoking a function and restored after. One does not have to save and restore them if their value will not be of importance after.

All other registers are caller-saved.

These two categories are a convention. That is, a programmer must follow this agreement by

Saving and restoring callee-saved registers.
Being always aware that caller-saved registers can be changed during function execution.

A source of bugs

A common mistake is not saving caller-saved registers before call and using them after returning from function. Remember:

If you change rbx, rbp, rsp, or r12-r15, change them back!
If you need any other register to survive function call, save it yourself before calling!
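The first rule can be sketched as follows. The function count_down is hypothetical and does nothing useful; it merely uses rbx as a scratch counter, so it saves rbx on entry and restores it before ret.

```nasm
global _start

section .text
; count_down: hypothetical function, rdi = number of iterations
count_down:
    push rbx            ; rbx is callee-saved: save it before changing it
    mov rbx, rdi
.loop:
    test rbx, rbx
    jz .end
    dec rbx
    jmp .loop
.end:
    pop rbx             ; change it back; the stack is as it was on entry
    ret

_start:
    mov rbx, 7          ; a value the caller expects to survive the call
    mov rdi, 4
    call count_down
    mov rdi, rbx        ; still 7, because count_down restored rbx
    mov rax, 60         ; 'exit' syscall number
    syscall
```

The program should exit with code 7. Without the push and pop pair, rbx would be 0 after the call and the caller would silently get a wrong value.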

Some functions can return a value. This value is usually the very essence of why the function is written and executed. For example, we can write a function that accepts a number as its argument and returns it squared.

Implementation-wise, we are returning values by storing them in rax before the function ends its execution. If you need to return two values, you are allowed to use rdx for the second one.

So, the pattern of calling a function is as follows:

Save all caller-saved registers you want to survive function call (you can use push for that).
Store arguments in the relevant registers (rdi, rsi, etc.).
Invoke function using call.
After function returns, rax will hold the return value.
Restore caller-saved registers stored before the function call.
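The five steps above can be sketched with the squaring function mentioned earlier. The name square and the concrete numbers are assumptions made for this example; rcx plays the role of a caller-saved register whose value is still needed after the call.

```nasm
global _start

section .text
; square: hypothetical function, rdi = number, returns rdi * rdi in rax
square:
    mov rax, rdi
    imul rax, rdi
    ret

_start:
    mov rcx, 3          ; caller-saved register holding a value we need later
    push rcx            ; 1. save it
    mov rdi, 5          ; 2. first argument goes into rdi
    call square         ; 3. invoke the function
                        ; 4. rax now holds 25
    pop rcx             ; 5. restore the saved register
    add rax, rcx        ; 25 + 3 = 28
    mov rdi, rax
    mov rax, 60         ; 'exit' syscall number
    syscall
```

The program should exit with code 28. This particular square happens not to touch rcx, but the caller must not rely on that: the convention only promises that callee-saved registers stay intact.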

Why do we need conventions?

A function is used to abstract a piece of logic, forgetting completely about its internal implementation and changing it when necessary. Such changes should be completely transparent to the outside program. The convention described previously allows you to call any function from any given place and be sure about its effects (may change any caller-saved register; will keep callee-saved registers intact).

Some system calls also return values—be careful and read the docs!
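For instance, write returns the number of bytes actually written in rax. The sketch below is a variation of the “Hello, world!” program that passes this value on as the exit code; it assumes the write succeeds in full.

```nasm
global _start

section .data
message: db 'hello, world!', 10

section .text
_start:
    mov rax, 1          ; 'write' syscall number
    mov rdi, 1          ; stdout descriptor
    mov rsi, message    ; string address
    mov rdx, 14         ; string length in bytes
    syscall             ; on success, rax = number of bytes written

    mov rdi, rax        ; use the returned count as the exit code
    mov rax, 60         ; 'exit' syscall number
    syscall
```

If all 14 bytes are written, the exit code should be 14. This also shows why rax has to be reloaded before every system call: the previous call has overwritten it.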

You should never use rbp and rsp to hold arbitrary values. They are implicitly used during the execution. As you already know, rsp is used as a stack pointer.

On system call arguments

The arguments for system calls are stored in a different set of registers than those for functions. The fourth argument is stored in r10, while a function accepts the fourth argument in rcx!

The reason is that syscall instruction implicitly uses rcx. System calls cannot accept more than six arguments.

If you do not follow the described convention, you will be unable to change your functions without introducing bugs in places where they are called.

Now it is time to write two more functions: print_newline will print the newline character; print_hex will accept a number and print it in hexadecimal format (see Listing 2-10).

Listing 2-10. print_call.asm

section .data

newline_char: db 10
codes: db '0123456789abcdef'

section .text
global _start

print_newline:
    mov rax, 1            ; 'write' syscall identifier
    mov rdi, 1            ; stdout file descriptor
    mov rsi, newline_char ; where do we take data from
    mov rdx, 1            ; the amount of bytes to write
    syscall
    ret

print_hex:
    mov rax, rdi

    mov rdi, 1
    mov rdx, 1
    mov rcx, 64           ; how far are we shifting rax?
iterate:
    push rax              ; Save the initial rax value
    sub rcx, 4
    sar rax, cl           ; shift to 60, 56, 52, ... 4, 0
                          ; the cl register is the smallest part of rcx
    and rax, 0xf          ; clear all bits but the lowest four
    lea rsi, [codes + rax]; take a hexadecimal digit character code

    mov rax, 1            ; 'write' syscall identifier

    push rcx              ; syscall will break rcx
    syscall               ; rax = 1 -- the write identifier,
                          ; rdi = 1 for stdout,
                          ; rsi = the address of a character (set by lea above)

    pop rcx

    pop rax               ; restore the value saved by 'push rax' above
    test rcx, rcx         ; rcx = 0 when all digits are shown
    jnz iterate

    ret
_start:
    mov rdi, 0x1122334455667788
    call print_hex
    call print_newline

    mov rax, 60
    xor rdi, rdi
    syscall

2.5 Working with Data

2.5.1 Endianness

Let’s try to output a value stored in memory using the function we just wrote. We are going to do it in two different ways: first we will enumerate all its bytes separately and then we will type it as usual (see Listing 2-11).

Listing 2-11. endianness.asm

section .data
demo1: dq 0x1122334455667788
demo2: db 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88

section .text

_start:
    mov rdi, [demo1]
    call print_hex
    call print_newline

    mov rdi, [demo2]
    call print_hex
    call print_newline

    mov rax, 60
    xor rdi, rdi
    syscall

When we launch it, to our surprise, we get completely different results for demo1 and demo2.

> ./main
1122334455667788
8877665544332211

As we see, multi-byte numbers are stored in reverse order!

The bits in each byte are stored in a straightforward way, but the bytes are stored from the least significant to the most significant.

This applies only to memory operations: in registers, the bytes are stored in a natural way. Different processors have different conventions on how the bytes are stored.

Big endian multibyte numbers are stored in memory starting with the most significant bytes.
Little endian multibyte numbers are stored in memory starting with the least significant bytes.

As the example shows, Intel 64 is following the little endian convention. In general, choosing one convention over the other is a matter of choice, made by hardware engineers.

These conventions do not concern arrays and strings. However, if each character is encoded using 2 bytes rather than just 1, those bytes will be stored in reverse order.

The advantage of little endian is that we can discard the most significant bytes, effectively converting the number from a wider format to a narrower one (for example, from 8 bytes to 4).

For example, take demo3: dq 0x1234. To convert this number into a dd, we just have to read a dword starting at the same address demo3. See Table 2-1 for a complete memory layout.

Table 2-1. Little Endian and Big Endian for quad word number 0x1234

ADDRESS	VALUE – LE	VALUE – BE
demo3	0x34	0x00
demo3 + 1	0x12	0x00
demo3 + 2	0x00	0x00
demo3 + 3	0x00	0x00
demo3 + 4	0x00	0x00
demo3 + 5	0x00	0x00
demo3 + 6	0x00	0x12
demo3 + 7	0x00	0x34
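The little endian column of the table can be checked with a short sketch that reads the same address with three different widths. Using the exit code to report the lowest byte is an assumption made for this example.

```nasm
global _start

section .data
demo3: dq 0x1234

section .text
_start:
    mov eax, [demo3]    ; 4 bytes from demo3: eax = 0x00001234
    mov cx, [demo3]     ; 2 bytes from the same address: cx = 0x1234
    xor rdi, rdi
    mov dil, [demo3]    ; 1 byte: dil = 0x34, the least significant byte
    mov rax, 60         ; 'exit' syscall number
    syscall
```

The program should exit with code 52 (0x34). On a big endian machine the byte at demo3 would be 0x00 instead, and a narrower read would need a different starting address.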

Big endian is a native format often used inside network packets (e.g., TCP/IP). It is also an internal number format for Java Virtual Machine.

Middle endian is a not very well-known notion. Assume we want to create a set of routines to perform arithmetic with 128-bit numbers. Then the bytes can be stored as follows: first the 8 least significant bytes in reverse order, and then the 8 most significant bytes, also in reverse order:

7 6 5 4 3 2 1 0, 15 14 13 12 11 10 9 8

2.5.2 Strings

As we already know, the characters are encoded using the ASCII table. A code is assigned to each character. A string is obviously a sequence of character codes. However, this says nothing about how to determine its length. There are two common approaches:

Strings start with their explicit length.
```
db 28, 'Selling England by the Pound'
```
A special character denotes the string ending. Traditionally, the zero code is used. Such strings are called null-terminated.
```
db 'Selling England by the Pound', 0
```
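As an illustration of the first approach, the sketch below prints a length-prefixed string: it reads the length from the first byte and passes the address of the following byte to write. The label pstring and the one-byte prefix (which limits the length to 255) are assumptions made for this example.

```nasm
global _start

section .data
pstring: db 6, 'hello', 10   ; prefix 6 = five letters plus the newline

section .text
_start:
    xor rdx, rdx
    mov dl, [pstring]        ; rdx = length stored in the first byte
    mov rsi, pstring + 1     ; the characters start right after the prefix
    mov rax, 1               ; 'write' syscall number
    mov rdi, 1               ; stdout descriptor
    syscall

    mov rax, 60              ; 'exit' syscall number
    xor rdi, rdi
    syscall
```

No scan for a terminator is needed here, which is the main attraction of this format; a null-terminated string has to be traversed first to find its length.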

2.5.3 Constant Precomputation

It is not uncommon to see such code:

lab: db 0
...
   mov rax, lab + 1 + 2*3

NASM supports arithmetic expressions with parentheses and bit operations. Such expressions can only include constants known to the compiler. This way it can precompute all such expressions and insert the computation results (as constant numbers) in executable code. So, such expressions are NOT calculated at runtime.

A runtime analogue would need to use such instructions as add or mul.
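The sketch below puts both variants side by side: the first mov contains an expression folded by the assembler, and the following instructions compute the same value at runtime. The comparison and the exit codes (0 if both results match, 1 otherwise) are additions made for this example.

```nasm
global _start

section .data
lab: db 0

section .text
_start:
    mov rax, lab + 1 + 2*3   ; the assembler emits a single constant: lab + 7

    mov rcx, 2               ; runtime analogue of the same expression
    imul rcx, rcx, 3         ; rcx = 2 * 3
    add rcx, 1               ; rcx = 1 + 2*3
    mov rdx, lab
    add rcx, rdx             ; rcx = lab + 1 + 2*3

    xor rdi, rdi             ; exit code 0 if both results are equal
    cmp rax, rcx
    je .same
    mov rdi, 1
.same:
    mov rax, 60              ; 'exit' syscall number
    syscall
```

The program should exit with code 0. The precomputed version costs one instruction, while the runtime version needs five.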

2.5.4 Pointers and Different Addressing Types

Pointers are addresses of memory cells. They can be stored in memory or in registers.

The pointer size is 8 bytes. Data usually occupies several memory cells (i.e., several consecutive addresses). The pointers hold no information about the pointed data length. When trying to write somewhere a value whose size is not specified and cannot be deduced (for example, mov [myvariable], 4), we can get compilation errors. In such cases we have to provide size explicitly as shown below:

section .data
test: dq -1

section .text

mov byte[test], 1  ; writes 1 byte
mov word[test], 1  ; writes 2 bytes
mov dword[test], 1 ; writes 4 bytes
mov qword[test], 1 ; writes 8 bytes

Question 18

What is test equal to after each of the commands listed previously?

Let’s see how one can encode operands in instructions.

Immediately:
An instruction is itself contained in memory. The operands in some form are its parts; those parts have addresses of their own. Many instructions can contain the operand values themselves.
This is the way to move a number 10 into rax.
```
mov rax, 10
```
Through a register:
This instruction transfers rbx value into rax.
```
mov rax, rbx
```
By direct memory addressing:
This instruction transfers 8 bytes starting at the tenth address into rax:
```
mov rax, [10]
```
We can also take the address from register:
```
mov r9, 10
mov  rax, [r9]
```
We can use precomputations:
```
buffer: dq 8841, 99, 00
...
mov rax, [buffer+8]
```
The address inside this instruction was precomputed, because both base and offset are constants known to the assembler. Now it is just a number.
Base-indexed with scale and displacement
Most addressing modes are generalized by this mode. The address here is calculated based on the following components:
Address = base + index ∗ scale + displacement
- Base is either immediate or a register;
- Scale can only be immediate equal to 1, 2, 4, or 8;
- Index is immediate or a register; and
- Displacement is always immediate.

Listing 2-12 shows examples of different addressing types.

Listing 2-12. addressing.asm

mov rax, [rbx + 4* rcx + 9]
mov rax, [4*r9]
mov rdx, [rax + rbx]
lea rax, [rbx + rbx * 4]     ; rax = rbx * 5
add r8, [9 + rbx*8 + 7]
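To see the formula at work on real data, here is a sketch that picks an element out of an array of quad words. The label array and its contents are assumptions made for this example; the scale 8 matches the size of one dq element.

```nasm
global _start

section .data
array: dq 10, 20, 30, 40

section .text
_start:
    mov rsi, array               ; base: address of the first element
    mov rcx, 2                   ; index
    mov rdi, [rsi + rcx*8 + 8]   ; address = array + 2*8 + 8 = array + 24
    mov rax, 60                  ; 'exit' syscall number
    syscall
```

The byte offset 24 corresponds to the element with index 3, so the program should exit with code 40.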

A big picture

You can think of byte, word, etc. as type specifiers. For instance, in 64-bit mode you can push either 16- or 64-bit numbers onto the stack, and the instruction push 1 by itself does not say how many bits wide the operand is. In the same way that mov word[test], 1 signifies that [test] is a word, push word 1 encodes the information about the number format.

2.6 Example: Calculating String Length

Let’s start by writing a function to calculate the length of a null-terminated string.

As we do not have a routine to print something to standard output, the only way to output value is to return it as an exit code through exit system call. To see the exit code of the last process use the $? variable.

> true
> echo $?
0
> false
> echo $?
1

Let’s write an assembly program that mimics the false shell command, as shown in Listing 2-13.

Listing 2-13. false.asm

global _start

section .text
_start:
    mov rdi, 1
    mov rax, 60
    syscall

Now we have everything needed to calculate string length. Listing 2-14 shows the code.

Listing 2-14. String Length: strlen.asm

global _start

section .data

test_string: db "abcdef", 0

section .text

strlen:                   ; by our convention, first and the only argument
                          ; is taken from rdi
    xor rax, rax          ; rax will hold string length. If it is not
                          ; zeroed first, its value will be totally random

.loop:                    ; main loop starts here
    cmp byte [rdi+rax], 0 ; Check if the current symbol is null-terminator.
                          ; We absolutely need that 'byte' modifier since
                          ; the left and the right part of cmp should be
                          ; of the same size. Right operand is immediate
                          ; and holds no information about its size,
                          ; hence we don't know how many bytes should be
                          ; taken from memory and compared to zero
    je .end               ; Jump if we found null-terminator
    inc rax               ; Otherwise go to next symbol and increase
                          ; counter
    jmp .loop

.end:
    ret                   ; When we hit 'ret', rax should hold return value

_start:

    mov rdi, test_string
    call strlen
    mov rdi, rax

    mov rax, 60
    syscall

The important part (and the only part we will keep) is the strlen function. Notice that:

strlen changes registers (rax, in this case), so after performing call strlen the caller cannot rely on their previous values.
strlen does not change rbx or any other callee-saved registers.
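The second point can be sketched with a variant of the function that does use a callee-saved register. The name strlen_rbx and the value 42 are hypothetical and chosen only for this illustration; the idea is that a function that needs rbx saves it on entry and restores it before ret, so the caller never notices.

```nasm
global _start

section .data
test_string: db "abcdef", 0

section .text

; Hypothetical variant of strlen that keeps its counter in rbx.
; rbx is callee-saved, so it has to be restored before returning.
strlen_rbx:
    push rbx                  ; save the caller's rbx on the stack
    xor rbx, rbx              ; rbx = 0, current length
.loop:
    cmp byte [rdi+rbx], 0     ; null-terminator reached?
    je .end
    inc rbx
    jmp .loop
.end:
    mov rax, rbx              ; the return value goes to rax
    pop rbx                   ; restore the caller's rbx
    ret

_start:
    mov rbx, 42               ; a value the caller expects to survive the call
    mov rdi, test_string
    call strlen_rbx           ; rax = 6
    sub rbx, rax              ; 42 - 6 = 36, only if rbx was preserved
    mov rdi, rbx
    mov rax, 60               ; exit system call
    syscall
```

With the push and pop in place the exit code should be 36. Without them rbx would hold 6 after the call and the program would exit with 0.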

Question 19

Can you spot a bug or two in Listing 2-15? When will they occur?

Listing 2-15. Alternative Version of strlen: strlen_bug1.asm

global _start

section .data
test_string: db "abcdef", 0

section .text

strlen:
.loop:
    cmp byte [rdi+r13], 0
    je .end
    inc r13
    jmp .loop
.end:
    mov rax, r13
    ret

_start:
    mov rdi, test_string
    call strlen
    mov rdi, rax

    mov rax, 60
    syscall

2.7 Assignment: Input/Output Library

Before we start doing anything cool looking, we are going to ensure we won’t have to code the same basic routines over and over again. For now, we do not have anything; even getting keyboard input is a pain. So, let’s build a small library of basic input and output functions.

First you have to read Intel docs [15] for the following instructions (remember, they are all described in detail in the second volume):

xor
jmp, ja, and similar ones
cmp
mov
inc, dec
add, imul, mul, sub, idiv, div
neg
call, ret
push, pop

These commands are core to us and you should know them well. As you might have noticed, Intel 64 supports thousands of commands. Of course, there is no need for us to study them all. Using system calls together with the instructions listed earlier will get us pretty much anywhere.

You also have to read the docs for the read system call. Its code is 0; otherwise it is similar to write. Refer to Appendix C in case of difficulties.

Edit lib.inc and provide definitions for the functions instead of stub xor rax, rax instructions. Refer to Table 2-2 for the required functions’ semantics. We do recommend implementing them in the given order because sometimes you will be able to reuse your code by calling functions you have already written.

Table 2-2. Input/Output Library Functions

Function	Definition
exit	Accepts an exit code and terminates the current process.
string_length	Accepts a pointer to a string and returns its length.
print_string	Accepts a pointer to a null-terminated string and prints it to stdout.
print_char	Accepts a character code directly as its first argument and prints it to stdout.
print_newline	Prints a character with code 0xA.
print_uint	Outputs an unsigned 8-byte integer in decimal format. We suggest you create a buffer on the stack⁶ and store the division results there. Each time, you divide the last value by 10 and store the corresponding digit inside the buffer. Do not forget that you should transform each digit into its ASCII code (e.g., 0x04 becomes 0x34).
print_int	Outputs a signed 8-byte integer in decimal format.
read_char	Reads one character from stdin and returns it. If the end of the input stream occurs, returns 0.
read_word	Accepts a buffer address and size as arguments. Reads the next word from stdin (skipping whitespaces⁷) into the buffer. Stops and returns 0 if the word is too big for the buffer specified; otherwise returns the buffer address. This function should null-terminate the accepted string.
parse_uint	Accepts a null-terminated string and tries to parse an unsigned number from its start. Returns the number parsed in rax and its character count in rdx.
parse_int	Accepts a null-terminated string and tries to parse a signed number from its start. Returns the number parsed in rax and its character count in rdx (including the sign, if any). No spaces between the sign and the digits are allowed.
string_equals	Accepts two pointers to strings and compares them. Returns 1 if they are equal, otherwise 0.
string_copy	Accepts a pointer to a string, a pointer to a buffer, and the buffer’s length. Copies the string to the destination. The destination address is returned if the string fits the buffer; otherwise zero is returned.
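The conversion suggested for print_uint can be sketched as follows. This is only one possible shape, under stated assumptions: the name print_uint_sketch, the 24-byte buffer size, and the direct use of the write system call are choices made for this illustration (in the library itself you would probably null-terminate the buffer and reuse print_string).

```nasm
global _start

section .text

; Hypothetical helper: prints the unsigned number passed in rdi in decimal.
print_uint_sketch:
    sub rsp, 24               ; buffer on the stack; 20 digits are the maximum
    lea rsi, [rsp + 24]       ; rsi points just past the end of the buffer
    mov rax, rdi              ; rax = the part of the value not yet converted
    mov rcx, 10               ; divisor
.next_digit:
    xor rdx, rdx              ; div divides rdx:rax, so clear the upper half
    div rcx                   ; rax = quotient, rdx = remainder (one digit)
    add dl, '0'               ; digit to ASCII code: 0x04 becomes 0x34
    dec rsi
    mov [rsi], dl             ; digits are stored from the end backward
    cmp rax, 0
    jne .next_digit           ; repeat until the quotient is zero

    lea rdx, [rsp + 24]
    sub rdx, rsi              ; rdx = number of digits stored
    mov rax, 1                ; write system call
    mov rdi, 1                ; stdout descriptor
    syscall                   ; rsi already points to the first digit
    add rsp, 24               ; free the buffer
    ret

_start:
    mov rdi, 1234
    call print_uint_sketch    ; writes the characters 1234 to stdout
    xor rdi, rdi
    mov rax, 60               ; exit system call
    syscall
```

Because the least significant digit is produced first, the buffer is filled from its end; this way the digits end up in reading order without a separate reversal step.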

Use test.py to perform automated tests of correctness. Just run it and it will do the rest.

Remember that a string of n characters needs n + 1 bytes to be stored in memory because of the null-terminator.
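A small sketch can make the n + 1 rule visible by letting NASM compute the size of a string definition. The names message and message_size are hypothetical and used only for this illustration.

```nasm
global _start

section .data
message: db "abcdef", 0          ; 6 characters followed by the null-terminator
message_size equ $ - message     ; bytes occupied by message: 6 + 1 = 7

section .text
_start:
    mov rdi, message_size        ; pass the size as the exit code
    mov rax, 60                  ; exit system call
    syscall
```

Running the program and then echo $? should print 7, one more than the number of visible characters.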

Read Appendix A to see how you can execute the program step by step observing the changes in register values and memory state.

2.7.1 Self-Evaluation

Before testing or when facing an unexpected result, check the following quick list:

Labels denoting functions should be global; others should be local.
You do not assume that registers hold zero “by default.”
You save and restore callee-saved registers if you are using them.
You save caller-saved registers you need before call and restore them after.
You do not use buffers in .data. Instead, you allocate them on the stack, which allows you to adapt the code to multithreading if needed.
Your functions accept arguments in rdi, rsi, rdx, rcx, r8, and r9.
You do not print numbers digit after digit. Instead you transform them into strings of characters and use print_string.
parse_int and parse_uint set rdx correctly. This will be really important in the next assignment.
All parsing functions and read_word work when the input is terminated via Ctrl-D.
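The checklist item about caller-saved registers can be illustrated with a minimal sketch. It assumes a string_length written like strlen in Listing 2-14; the labels first and second are hypothetical. The result of the first call lives in rax, which the second call overwrites, so the caller saves it on the stack before calling again.

```nasm
global _start

section .data
first:  db "abc", 0
second: db "de", 0

section .text

; Same logic as strlen in Listing 2-14: argument in rdi, result in rax.
string_length:
    xor rax, rax
.loop:
    cmp byte [rdi+rax], 0
    je .end
    inc rax
    jmp .loop
.end:
    ret

_start:
    mov rdi, first
    call string_length        ; rax = 3
    push rax                  ; rax is caller-saved: the next call overwrites it
    mov rdi, second
    call string_length        ; rax = 2
    pop rcx                   ; rcx = 3, the saved first length
    add rax, rcx              ; 2 + 3 = 5
    mov rdi, rax
    mov rax, 60               ; exit system call
    syscall
```

The exit code should be 5, the sum of both lengths. Dropping the push and pop pair would lose the first length.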

Done right, the code will not take more than 250 lines.

Question 20

Try to rewrite print_newline without calling print_char or copying its code. Hint: read about tail call optimization.

Question 21

Try to rewrite print_int without calling print_uint or copying its code. Hint: read about tail call optimization.

Question 22

Try to rewrite print_int without calling print_uint, copying its code, or using jmp. You will only need one instruction and a careful code placement.

Hint: read about co-routines.

2.8 Summary

In this chapter we started to do real things and apply our basic knowledge of assembly language. We hope that you have overcome any possible fear of assembly. Despite being verbose to an extreme, it is not a hard language to use. We have learned to write branches and loops and to perform basic arithmetic and system calls; we have also seen different addressing modes and little- and big-endian byte order. The following assembly assignments will use the little library we have built to facilitate interaction with the user.

Question 23

What is the connection between rax, eax, ax, ah, and al?

Question 24

How do we gain access to the parts of r9?

Question 25

How can you work with a hardware stack? Describe the instructions you can use.

Question 26

Which ones of these instructions are incorrect and why?

mov [rax], 0
cmp [rdx], bl
mov bh, bl
mov al, al
add bpl, 9
add [9], spl
mov r8d, r9d
mov r3b, al
mov r9w, r2d
mov rcx, [rax + rbx + rdx]
mov r9, [r9 + 8*rax]
mov [r8+r7+10], 6
mov [r8+r7+10], r6

Question 27

Enumerate the callee-saved registers.

Question 28

Enumerate the caller-saved registers.

Question 29

What is the meaning of the rip register?

Question 30

What is the SF flag?

Question 31

What is the ZF flag?

Question 32

Describe the effects of the following instructions:

sar
shr
xor
jmp
ja, jb, and similar ones.
cmp
mov
inc, dec
add
imul, mul
sub
idiv, div
call, ret
push, pop

Question 33

What is a label and does it have a size?

Question 34

How do you check whether an integer number is contained in a certain range (x, y)?

Question 35

What is the difference between ja/jb and jg/jl?

Question 36

What is the difference between je and jz?

Question 37

How do you test whether rax is zero without the cmp command?

Question 38

What is the program return code?

Question 39

How do we multiply rax by 9 using exactly one instruction?

Question 40

By using exactly two instructions (the first is neg), take an absolute value of an integer stored in rax.

Question 41

What is the difference between little and big endian?

Question 42

What is the most complex type of addressing?

Question 43

Where does the program execution start?

Question 44

rax = 0x1122334455667788. We have performed push rax. What will be the contents of the byte at address [rsp+3]?

Footnotes

1 The NASM manual also uses the name “pseudo instruction” for a specific subset of directives.

2 Remember: all source code, including listings, can be found on www.apress.com/us/book/9781484224021 and is also stored in the home directory of the preconfigured virtual machine!

3 Even if not, soon the sequential execution will lead the processor to the end of allocated virtual addresses, see section 4.2. In the end, the operating system will terminate the program because it is unlikely that the latter will recover from it.

4 The subscript denotes the number system’s base.

5 This action is impossible to encode using the mov command. Check Intel docs to verify that it is not implemented.

6 In fact, by decreasing rsp you allocate memory on the stack.

7 We consider spaces, tabulation, and line breaks as whitespace characters. Their codes are 0x20, 0x9, and 0xA, respectively.
