Igor Zhirkov, Low-Level Programming, 10.1007/978-1-4842-2403-8_3

3. Legacy

Igor Zhirkov¹

(1)Saint Petersburg, Russia

This chapter will introduce you to the legacy processor modes, which are no longer used, and to their mostly legacy features, which are still relevant today. You will see how processors evolved and learn how protection rings (privileged and user mode) are implemented. You will also understand the purpose of the Global Descriptor Table. While this information helps you understand the architecture better, it is not crucial for assembly programming in user space.

As processors evolved, each new mode increased the machine word’s length and added new features. A processor can function in one of the following modes:

Real mode (the most ancient, 16-bit one);
Protected (commonly referred to as the 32-bit one);
Virtual (to emulate real mode inside protected);
System management mode (for sleep mode, power management, etc.);
Long mode, with which we are already a bit familiar.

We are going to take a closer look at real and protected mode.

3.1 Real mode

Real mode is the most ancient. It lacks virtual memory; the physical memory is addressed directly and general purpose registers are 16 bits wide.

So, neither rax nor eax exist yet, but ax, al, and ah do.

Such registers can hold values from 0 to 65535, so the number of bytes we can address using one of them is 65536. Such a memory region is called a segment. Do not confuse it with protected mode segments or ELF (Executable and Linkable Format) file sections!

These are the registers usable in real mode:

ip, flags;
ax, bx, cx, dx, sp, bp, si, di;
Segment registers: cs, ds, ss, es, (later also gs and fs).

As it was not straightforward to address more than 64 kilobytes of memory, engineers came up with a solution to use special segment registers in the following way:

Each physical address consists of 20 bits (so, 5 hexadecimal digits).
Each logical address consists of two components. One is taken from a segment register and encodes the segment start. The other is an offset inside this segment. The hardware calculates the physical address from these components the following way:
physical address = segment base * 16 + offset
You can often see addresses written in the form segment:offset, for example: 4a40:0002, ds:0001, 7bd3:ah.
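The address calculation above can be sketched in a few lines of Python. This is only an illustration: the function name physical_address is hypothetical, and the 20-bit mask is an assumption that models an address bus with exactly 20 lines, where sums past the top of memory wrap around.

```python
def physical_address(segment: int, offset: int) -> int:
    """Translate a real-mode segment:offset pair into a 20-bit physical address."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must each fit in 16 bits")
    # Multiplying by 16 equals shifting left by one hexadecimal digit.
    # Assumption: the result is truncated to 20 bits (wrap-around).
    return ((segment << 4) + offset) & 0xFFFFF


print(hex(physical_address(0x4A40, 0x0002)))  # 0x4a402

# Many different segment:offset pairs name the same physical byte.
print(physical_address(0x4A40, 0x0002) == physical_address(0x4A00, 0x0402))  # True
```

The second call shows why real-mode segments overlap: segment starts are only 16 bytes apart, while each segment spans 64 kilobytes.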

As we already stated, programmers want to separate code from data (and stack), so they intend to use different segments for each of them. Segment registers are specialized for that: cs stores the code segment start address, ds corresponds to the data segment, and ss to the stack segment. Other segment registers are used to store additional data segments.

Note that strictly speaking, the segment registers do not hold segments’ starting addresses but rather their parts (the four most significant hexadecimal digits). By adding another zero digit to multiply it by 16₁₀ we get the real segment starting address.

Each instruction referencing memory implicitly assumes usage of one of segment registers. Documentation clarifies the default segment registers for each instruction. However, common sense can help as well. For instance, mov is used to manipulate data, so the address is relative to the data segment.

mov al, [0004]   ; === mov al, ds:0004

It is possible to redefine the segment explicitly:

mov al, cs:[0004]

When the program is loaded, the loader sets the ip, cs, ss, and sp registers so that cs:ip corresponds to the entry point, and ss:sp points to the top of the stack.

The central processing unit (CPU) always starts in real mode, and then the main loader usually executes the code to explicitly switch it to protected mode and then to the long mode.

Real mode has numerous drawbacks:

It makes multitasking very hard. The same address space is shared between all programs, so they should be loaded at different addresses. Their relative placement should usually be decided during compilation.
Programs can overwrite each other’s code or even that of the operating system, as they all live in the same address space.
Any program can execute any instruction, including those used to set up the processor’s state. Some instructions should only be used by the operating system (like those used to set up virtual memory, perform power management, etc.) as their incorrect usage can crash the whole system.

The protected mode was intended to solve these problems.

3.2 Protected Mode

Intel 80386 was the first processor implementing protected 32-bit mode.

It provides wider versions of registers (eax, ebx, ..., esi, edi) as well as new protection mechanisms: protection rings, virtual memory, and an improved segmentation.

These mechanisms isolated programs from one another, so an abnormal termination of one of them did not harm the others. Furthermore, programs were not able to corrupt other processes’ memory.

The way of obtaining a segment starting address has changed compared to real mode. Now the start is calculated based on an entry in a special table, not by direct multiplication of segment register contents.

Linear address = segment base (taken from system table) + offset

Each of the segment registers cs, ds, ss, es, gs, and fs stores a so-called segment selector, containing an index in a special segment descriptor table and a little additional information. There are two types of segment descriptor tables: possibly numerous LDTs (Local Descriptor Tables) and only one GDT (Global Descriptor Table).

LDTs were intended for a hardware task-switching mechanism; however, operating system manufacturers did not adopt it. Today programs are isolated by virtual memory, and LDTs are not used.

GDTR is a register to store GDT address and size.

Segment selectors are structured as shown in Figure 3-1.

Figure 3-1. Segment selector (contents of any segment register)

Index denotes descriptor position in either GDT or LDT. The T bit selects either LDT or GDT. As LDTs are no longer used, it will be zero in all cases.
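As a rough illustration, a selector can be taken apart with a few bit operations. The sketch below assumes the usual layout: the two lowest bits hold the RPL, the next bit is the table indicator (the T bit), and the remaining 13 bits hold the index. The names Selector and decode_selector are hypothetical.

```python
from typing import NamedTuple


class Selector(NamedTuple):
    index: int   # descriptor position inside the table
    table: str   # "GDT" or "LDT", chosen by the T bit
    rpl: int     # requested privilege level, 0..3


def decode_selector(value: int) -> Selector:
    """Split a 16-bit segment selector into its three fields."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("a selector is a 16-bit value")
    rpl = value & 0b11                          # bits 0..1
    table = "LDT" if value & 0b100 else "GDT"   # bit 2
    index = value >> 3                          # bits 3..15
    return Selector(index, table, rpl)


print(decode_selector(0x23))  # Selector(index=4, table='GDT', rpl=3)
```

Since the index occupies 13 bits, a table can describe at most 2^13 = 8192 segments.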

The table entries in GDT/LDT also store information about which privilege level is assigned to the described segment. When a segment is accessed through a segment selector, the Requested Privilege Level (RPL) value (stored in selector = segment register) is checked against the Descriptor Privilege Level (stored in the descriptor table). If the RPL is not privileged enough to access a highly privileged segment, an error will occur. This way we could create numerous segments with various permissions and use RPL values in segment selectors to define which of them are accessible to us right now (given our privilege level).

Privilege levels are the same thing as protection rings!

It is safe to say that the current privilege level (i.e., the current ring) is stored in the lowest two bits of cs or ss (these numbers should be equal). This is what affects the ability to execute certain critical instructions (e.g., changing GDT itself).

It’s easy to deduce that for ds, changing these bits allows us to override the current privilege level to be less privileged specifically for data access to a selected segment.

For example, we are currently in ring0 and ds = 0x02. Even though the lowest two bits of cs and ss are 0 (as we are inside ring0), we can’t access data in a segment with privilege level higher than 2 (like 1 or 0).

In other words, the RPL field stores how privileged we are when requesting access to a segment. Segments in turn are assigned to one of four protection rings. When requesting access with a certain privilege level, that level should be at least as privileged as the one attributed to the segment itself (numerically, it should not be greater).
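The check for loading a data segment register can be modeled as a one-line comparison. This is a simplified sketch, not the full set of hardware rules (stack and code segments are checked differently); the function name data_access_allowed is hypothetical.

```python
def data_access_allowed(cpl: int, rpl: int, dpl: int) -> bool:
    """Model the privilege check for a data segment.

    cpl: current privilege level (lowest two bits of cs)
    rpl: requested privilege level (lowest two bits of the selector)
    dpl: descriptor privilege level of the target segment
    Smaller numbers mean more privilege (ring 0 is the most privileged).
    """
    for level in (cpl, rpl, dpl):
        if not 0 <= level <= 3:
            raise ValueError("privilege levels range from 0 to 3")
    # The less privileged of CPL and RPL decides the outcome.
    effective = max(cpl, rpl)
    return effective <= dpl


# The example from the text: we run in ring 0, but the selector carries RPL 2.
print(data_access_allowed(cpl=0, rpl=2, dpl=1))  # False
print(data_access_allowed(cpl=0, rpl=2, dpl=2))  # True
```

Raising the RPL can only weaken a request; it never grants more privilege than the current ring already has.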

Note

You can’t change cs directly.

Figure 3-2 shows the GDT descriptor format¹.

Figure 3-2. Segment descriptor (inside GDT or LDT)

G—Granularity, e.g., size is in 0 = bytes, 1 = pages of size 4096 bytes each.

D—Default operand size (0 = 16 bit, 1 = 32 bit).

L—Is it a 64-bit mode segment?

V—Available for use by system software.

P—Present in memory right now.

S—Is it data/code (1) or is it just some system information holder (0).

X—Data (0) or code (1).

RW—For data segment, is writing allowed? (reading is always allowed); for code segment, is reading allowed? (writing is always prohibited).

DC—Growth direction: to lower or to higher addresses? (for data segment); can it be executed from higher privilege levels? (if code segment)

A—Was it accessed?

DPL—Descriptor Privilege Level (to which ring is it attached?)
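To make the field list more tangible, here is a minimal Python sketch that unpacks an 8-byte descriptor. It assumes the standard byte order (two limit bytes, three base bytes, an access byte, a byte shared by the flags and the top limit bits, and a final base byte); the function names are hypothetical. The sample bytes are the code descriptor from Listing 3-1.

```python
def decode_descriptor(raw: bytes) -> dict:
    """Unpack an 8-byte code/data segment descriptor into named fields."""
    if len(raw) != 8:
        raise ValueError("a segment descriptor is exactly 8 bytes long")
    access = raw[5]       # P, DPL, S, X, DC, RW, A
    flags = raw[6] >> 4   # G, D, L, V
    limit = raw[0] | (raw[1] << 8) | ((raw[6] & 0x0F) << 16)
    base = raw[2] | (raw[3] << 8) | (raw[4] << 16) | (raw[7] << 24)
    return {
        "base": base, "limit": limit,
        "G": (flags >> 3) & 1, "D": (flags >> 2) & 1,
        "L": (flags >> 1) & 1, "V": flags & 1,
        "P": (access >> 7) & 1, "DPL": (access >> 5) & 3,
        "S": (access >> 4) & 1, "X": (access >> 3) & 1,
        "DC": (access >> 2) & 1, "RW": (access >> 1) & 1,
        "A": access & 1,
    }


def segment_size(fields: dict) -> int:
    """Segment size in bytes, taking the granularity bit into account."""
    unit = 4096 if fields["G"] else 1
    return (fields["limit"] + 1) * unit


code = decode_descriptor(bytes([0xFF, 0xFF, 0x00, 0x00, 0x00, 0x9A, 0xCF, 0x00]))
print(code["X"], code["RW"], code["DPL"])  # 1 1 0
print(segment_size(code) == 2**32)         # True
```

The decoded descriptor is a readable, executable, ring 0 segment that starts at address 0 and spans the whole 4 GB address space.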

The processor always (even today) starts in real mode. To enter protected mode one has to create GDT and set up gdtr; set a special bit in cr0 and make a so-called far jump. Far jump means that the segment (or segment selector) is explicitly given (and thus can be different from default), as follows:

jmp 0x08:addr

Listing 3-1 shows a small snippet of how we can turn on protected mode (assuming start32 is a label on 32-bit code start).

Listing 3-1. Enabling Protected Mode loader_start32.asm

lgdt cs:[_gdtr]

mov eax, cr0                 ; !! Privileged instruction
or al, 1                     ; this is the bit responsible for protected mode
mov cr0, eax                 ; !! Privileged instruction

    jmp (0x1 << 3):start32   ; assign first seg selector to cs

align 16
_gdtr:                       ; stores GDT limit (size in bytes minus one) + GDT address
dw 23                        ; 3 descriptors * 8 bytes each - 1
dq _gdt

align 16

_gdt:
; Null descriptor (should be present in any GDT)
dd 0x00, 0x00
; x32 code descriptor:
db 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x9A, 0xCF,     0x00 ; differ by exec bit
; x32 data descriptor:
db 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x92, 0xCF,     0x00 ; execution off (0x92)
;  size  size  base  base  base  util  util|size  base

Align directives control alignment, the essence of which we explain later in this book.

Question 45

Decipher this segment selector: 0x08.

You might think that every memory transaction needs another one now to read GDT contents. This is not true: for each segment register there is a so-called shadow register, which cannot be directly referenced. It serves as a cache for GDT contents. It means that once a segment selector is changed, the corresponding shadow register is loaded with the corresponding descriptor from GDT. Now this register will serve as a source of all information needed about this segment.

The D flag needs a little explanation, because it depends on segment type.

It is a code segment: default address and operand sizes. One means 32-bit addresses and 32-bit or 8-bit operands; zero corresponds to 16-bit addresses and 16-bit or 8-bit operands. We are talking about encoding of machine instructions here. This behavior can be altered by preceding an instruction by a prefix 0x66 (to alter operand size) or 0x67 (to alter address size).
Stack segment (it is a data segment AND we are talking about one selected by ss).² It is again default operand size for call, ret, push/pop, etc. If the flag is set, operands are 32-bit wide and instructions affect esp; otherwise operands are 16-bit wide and sp is affected.
For data segments, growing toward low addresses, it denotes their limits (0 for 64 KB, 1 for 4 GB). This bit should always be set in long mode.

As you see, the segmentation is quite a cumbersome beast. There are reasons it was not largely adopted by operating systems and programmers alike (and is now pretty much abandoned).

No segmentation is easier for programmers;
No commonly used programming language includes segmentation in its memory model. It is always flat memory. So it is a compiler’s job to set up segments (which is hard to implement).
Segments make memory fragmentation a disaster.
A descriptor table can hold up to 8192 segment descriptors. How can we use this small amount efficiently?

After the introduction of long mode, segmentation was purged from the processor, but not completely. It is still used for protection rings and thus a programmer should understand it.

3.3 Minimal Segmentation in Long Mode

Even in long mode each time an instruction is selected, the processor is using segmentation. It provides us with a flat linear virtual address, which is then turned into a physical one by virtual memory routines (see section 4.2).

LDT is a part of a hardware context-switching mechanism that no one really adopted; for this reason it was disabled in long mode completely.

Memory addressing through the main segment registers (cs, ds, es, and ss) does not consider the GDT values of base and limit anymore. The segment base is always fixed at 0x0 no matter the descriptor contents; the segment sizes are not limited at all. The other descriptor fields, however, are not ignored.

This means that in long mode at least three descriptors should be present in GDT: the null descriptor (should always be present in any GDT), code, and data segments. If you want to use protection rings to implement privileged and user modes, you also need code and data descriptors for user-level code.

Why do we need separate descriptors for code and data?

No combination of descriptor flags allows a programmer to set up read/write permissions and execution permission simultaneously.

Even with the very small experience in assembly language we already have, it is not hard to decipher this loader fragment, showing an exemplary GDT. It is taken from Pure64, an open source operating system loader. As it is executed before the operating system, it does not contain user-level code or data descriptors (see Listing 3-2).

Listing 3-2. A Sample GDT gdt64.asm

align 16  ; This ensures that the next command or data element is
; stored starting at an address divisible by 16 (even if we need
; to skip some bytes to achieve that).

; The following will be copied to GDTR via the LGDT instruction:

GDTR64:                 ; Global Descriptors Table Register
    dw gdt64_end - gdt64 - 1 ; limit of GDT (size minus one)
    dq 0x0000000000001000    ; linear address of GDT

; This structure is copied to 0x0000000000001000
gdt64:
SYS64_NULL_SEL equ $-gdt64      ; Null Segment
    dq 0x0000000000000000
; Code segment, read/exec, nonconforming
SYS64_CODE_SEL equ $-gdt64
    dq 0x0020980000000000       ; 0x00209A0000000000
; Data segment, read/write, expand down
SYS64_DATA_SEL equ $-gdt64
    dq 0x0000900000000000       ; 0x0020920000000000
gdt64_end:

; Dollar sign denotes the current memory address, so
; $-gdt64 means an offset from `gdt64` label in bytes
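The `$-gdt64` trick in Listing 3-2 works because each descriptor occupies 8 bytes, so the byte offset of a descriptor already has its three lowest bits clear and can be used directly as a selector (GDT, RPL 0). The following Python sketch reproduces that arithmetic; the helper name selector_offsets is hypothetical.

```python
DESCRIPTOR_SIZE = 8  # bytes per GDT entry


def selector_offsets(names):
    """Map each label to the byte offset of its descriptor inside the GDT."""
    return {name: i * DESCRIPTOR_SIZE for i, name in enumerate(names)}


labels = ["SYS64_NULL_SEL", "SYS64_CODE_SEL", "SYS64_DATA_SEL"]
selectors = selector_offsets(labels)
gdt_limit = len(labels) * DESCRIPTOR_SIZE - 1  # the value stored by `dw` in GDTR64

for name, value in selectors.items():
    # index = value >> 3; the low three bits (T bit and RPL) are zero
    print(f"{name} = {value:#04x} (index {value >> 3})")
print("limit =", gdt_limit)
```

With three descriptors the selectors come out as 0x00, 0x08, and 0x10, and the limit is 23.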

3.4 Accessing Parts of Registers

3.4.1 An Unexpected Behavior

We usually think about eax, rax, ax, etc. as parts of the same physical register. The observable behavior mostly supports this hypothesis unless we are writing into a 32-bit part of a 64-bit register. Let us take a look at the example shown in Listing 3-3.

Listing 3-3. The Land of Registry Wonders risc_cisc.asm

mov rax, 0x1122334455667788      ; rax = 0x1122334455667788
mov eax, 0x42                    ; !rax = 0x00 00 00 00 00 00 00 42
                                 ; why not rax = 0x1122334400000042 ??

mov rax, 0x1122334455667788      ; rax = 0x1122334455667788
mov ax, 0x9999                   ; rax = 0x1122334455669999
                                 ; this works as expected

mov rax, 0x1122334455667788      ; rax = 0x1122334455667788
xor eax, eax                     ; rax = 0x0000000000000000
                                 ; why not rax = 0x1122334400000000?

As you see, writing to 8-bit or 16-bit parts leaves the rest of the bits intact. Writing to 32-bit parts, however, fills the upper half of the wide register with zeros!
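The observable rules for writes to the low parts of a 64-bit register can be captured in a small model. It is a sketch of the visible behavior only, not of how the hardware is built; the function write_part is hypothetical, and the 8-bit case models the low byte (al), not ah.

```python
MASK64 = (1 << 64) - 1


def write_part(reg: int, value: int, width: int) -> int:
    """Return the new 64-bit register value after writing its low `width` bits."""
    if width == 64:
        return value & MASK64
    if width == 32:
        # A 32-bit write clears the upper half (zero extension).
        return value & 0xFFFFFFFF
    if width in (8, 16):
        # Narrow writes keep all the remaining bits intact.
        mask = (1 << width) - 1
        return (reg & ~mask & MASK64) | (value & mask)
    raise ValueError("width must be 8, 16, 32, or 64")


rax = 0x1122334455667788
print(hex(write_part(rax, 0x42, 32)))        # 0x42
print(hex(write_part(rax, 0x9999, 16)))      # 0x1122334455669999
print(hex(write_part(rax, 0xFFFFFFFF, 32)))  # 0xffffffff
```

The last call shows that the upper half is cleared even when the highest bit of the 32-bit value is set, so no sign extension takes place.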

The reason is that how programmers are used to perceiving a processor is much different from how things are really done inside. In reality, registers rax, eax, and all others do not exist as fixed physical entities.

To explain this inconsistency, we have to first elaborate on two types of instruction sets: CISC and RISC.

3.4.2 CISC and RISC

One possible classification of processors divides them based on their instruction set. When designing an instruction set, there are two extremes.

Make loads of specialized, high-level instructions. It corresponds to CISC (Complex Instruction Set Computer) architectures.
Use only a few primitive instructions, making a RISC (Reduced Instruction Set Computer) architecture.

CISC instructions are usually slower but also do more; sometimes it is possible to implement complex instructions in a better way than by combining primitive RISC instructions (we will see an example of that later in this book when studying SSE (Streaming SIMD Extensions) in Chapter 16). However, most programs are written in high-level languages and thus depend on compilers. It is very hard to write a compiler that makes good use of a rich instruction set.

RISC eases the job of compilers and is also friendlier to optimizations on a lower, microcode level, such as pipelines.

Question 46

Read about microcode in general and processor pipelines.

The Intel 64 instruction set is indeed a CISC one. It has thousands of instructions—just look at the second volume of [15]! However, these instructions are decoded and translated into a stream of simpler microcode instructions. Here various optimizations take effect; the microcode instructions are reordered and some of them can even be executed simultaneously. This is not a native feature of processors but rather an adaptation aimed at better performance together with backward compatibility with older software.

It is quite unfortunate that there is not much information available on the microcode-level details of modern processors. By reading technical reviews such as [17] and optimization manuals such as the one provided by Intel, you can develop a certain intuition about it.

3.4.3 Explanation

Now back to the example shown in Listing 3-3. Let’s think about instruction decoding. The part of a CPU called the instruction decoder is constantly translating commands from an older CISC system to a more convenient RISC one. Pipelines allow for a simultaneous execution of up to six smaller instructions. To achieve that, however, the notion of registers should be virtualized. During microcode execution, the decoder chooses an available register from a large bank of physical registers. As soon as the bigger instruction ends, the effects become visible to the programmer: the values of some physical registers may be copied to those currently assigned to be, let’s say, rax.

The data interdependencies between instructions stall the pipeline, decreasing performance. The worst cases occur when the same register is read and modified by several consecutive instructions (think about rflags!).

If modifying eax means keeping upper bits of rax intact, it introduces an additional dependency between current instruction and whatever instruction modified rax or its parts before. By discarding upper 32 bits on each write to eax we eliminate this dependency, because we do not care anymore about previous rax value or its parts.

This new behavior was introduced with the latest growth of general purpose registers to 64 bits and, for the sake of compatibility, does not affect operations with their smaller parts. Otherwise, most older binaries would have stopped working because assigning to, for example, bl, would have modified the entire ebx, which was not true back when 64-bit registers had not yet been introduced.

3.5 Summary

This chapter was a brief historical note on processor evolution over the last 30 years. We have also elaborated on the intended use of segments back in the 32-bit era, as well as which leftovers of segmentation we are stuck with for legacy reasons. In the next chapter we are going to take a closer look at the virtual memory mechanism and its interaction with protection rings.

Footnotes

1 In this book we are approximating things a bit because certain data structures can have a different format based on page size, etc. The documentation will give you the most precise answers (read volume 3, chapter 3 of [15]).

2 In this case, documentation names this flag B.
