
Modern Assembly Language Programming with the ARM Processor

First Edition

Larry D. Pyeatt


Table of Contents

Cover image

Title page

Copyright

List of Tables

List of Figures

List of Listings

Preface

Choice of Processor Family

General Approach

Companion Website

Acknowledgments

Part I: Assembly as a Language

Chapter 1: Introduction

Abstract

1.1 Reasons to Learn Assembly

1.2 The ARM Processor

1.3 Computer Data

1.4 Memory Layout of an Executing Program

1.5 Chapter Summary

Exercises

Chapter 2: GNU Assembly Syntax

Abstract

2.1 Structure of an Assembly Program

2.2 What the Assembler Does

2.3 GNU Assembly Directives

2.4 Chapter Summary

Exercises

Chapter 3: Load/Store and Branch Instructions

Abstract

3.1 CPU Components and Data Paths

3.2 ARM User Registers

3.3 Instruction Components

3.4 Load/Store Instructions

3.5 Branch Instructions

3.6 Pseudo-Instructions

3.7 Chapter Summary

Exercises

Chapter 4: Data Processing and Other Instructions

Abstract

4.1 Data Processing Instructions

4.2 Special Instructions

4.3 Pseudo-Instructions

4.4 Alphabetized List of ARM Instructions

4.5 Chapter Summary

Exercises

Chapter 5: Structured Programming

Abstract

5.1 Sequencing

5.2 Selection

5.3 Iteration

5.4 Subroutines

5.5 Aggregate Data Types

5.6 Chapter Summary

Exercises

Chapter 6: Abstract Data Types

Abstract

6.1 ADTs in Assembly Language

6.2 Word Frequency Counts

6.3 Ethics Case Study: Therac-25

6.4 Chapter Summary

Exercises

Part II: Performance Mathematics

Chapter 7: Integer Mathematics

Abstract

7.1 Subtraction by Addition

7.2 Binary Multiplication

7.3 Binary Division

7.4 Big Integer ADT

7.5 Chapter Summary

Exercises

Chapter 8: Non-Integral Mathematics

Abstract

8.1 Base Conversion of Fractional Numbers

8.2 Fractions and Bases

8.3 Fixed-Point Numbers

8.4 Fixed-Point Operations

8.5 Floating Point Numbers

8.6 Floating Point Operations

8.7 Computing Sine and Cosine

8.8 Ethics Case Study: Patriot Missile Failure

8.9 Chapter Summary

Exercises

Chapter 9: The ARM Vector Floating Point Coprocessor

Abstract

9.1 Vector Floating Point Overview

9.2 Floating Point Status and Control Register

9.3 Register Usage Rules

9.4 Load/Store Instructions

9.5 Data Processing Instructions

9.6 Data Movement Instructions

9.7 Data Conversion Instructions

9.8 Floating Point Sine Function

9.9 Alphabetized List of VFP Instructions

9.10 Chapter Summary

Exercises

Chapter 10: The ARM NEON Extensions

Abstract

10.1 NEON Intrinsics

10.2 Instruction Syntax

10.3 Load and Store Instructions

10.4 Data Movement Instructions

10.5 Data Conversion

10.6 Comparison Operations

10.7 Bitwise Logical Operations

10.8 Shift Instructions

10.9 Arithmetic Instructions

10.10 Multiplication and Division

10.11 Pseudo-Instructions

10.12 Performance Mathematics: A Final Look at Sine

10.13 Alphabetized List of NEON Instructions

10.14 Chapter Summary

Part III: Accessing Devices

Chapter 11: Devices

Abstract

11.1 Accessing Devices Directly Under Linux

11.2 General Purpose Digital Input/Output

11.3 Chapter Summary

Exercises

Chapter 12: Pulse Modulation

Abstract

12.1 Pulse Density Modulation

12.2 Pulse Width Modulation

12.3 Raspberry Pi PWM Device

12.4 pcDuino PWM Device

12.5 Chapter Summary

Exercises

Chapter 13: Common System Devices

Abstract

13.1 Clock Management Device

13.2 Serial Communications

13.3 Chapter Summary

Exercises

Chapter 14: Running Without an Operating System

Abstract

14.1 ARM CPU Modes

14.2 Exception Processing

14.3 The Boot Process

14.4 Writing a Bare-Metal Program

14.5 Using an Interrupt

14.6 ARM Processor Profiles

14.7 Chapter Summary

Exercises

Index

Copyright

List of Tables

Table 1.1 Values represented by two bits

Table 1.2 The first 21 integers (starting with 0) in various bases

Table 1.3 The ASCII control characters

Table 1.4 The ASCII printable characters

Table 1.5 Binary equivalents for each character in “Hello World”

Table 1.6 Binary, hexadecimal, and decimal equivalents for each character in “Hello World”

Table 1.7 Interpreting a hexadecimal string as ASCII

Table 1.8 Variations of the ISO 8859 standard

Table 1.9 UTF-8 encoding of the ISO/IEC 10646 code points

Table 3.1 Flag bits in the CPSR register

Table 3.2 ARM condition modifiers

Table 3.3 Legal and illegal values for #<immediate|symbol>

Table 3.4 ARM addressing modes

Table 3.5 ARM shift and rotate operations

Table 4.1 Shift and rotate operations in Operand2

Table 4.2 Formats for Operand2

Table 8.1 Format for IEEE 754 half-precision

Table 8.2 Result formats for each term

Table 8.3 Shifts required for each term

Table 8.4 Performance of sine function with various implementations

Table 9.1 Condition code meanings for ARM and VFP

Table 9.2 Performance of sine function with various implementations

Table 10.1 Parameter combinations for loading and storing a single structure

Table 10.2 Parameter combinations for loading multiple structures

Table 10.3 Parameter combinations for loading copies of a structure

Table 10.4 Performance of sine function with various implementations

Table 11.1 Raspberry Pi GPIO register map

Table 11.2 GPIO pin function select bits

Table 11.3 GPPUD control codes

Table 11.4 Raspberry Pi expansion header useful alternate functions

Table 11.5 Number of pins available on each of the AllWinner A10/A20 PIO ports

Table 11.6 Registers in the AllWinner GPIO device

Table 11.7 AllWinner A10/A20 GPIO pin function select bits

Table 11.8 Pull-up and pull-down resistor control codes

Table 11.9 pcDuino GPIO pins and function select code assignments

Table 12.1 Raspberry Pi PWM register map

Table 12.2 Raspberry Pi PWM control register bits

Table 12.3 Prescaler bits in the pcDuino PWM device

Table 12.4 pcDuino PWM register map

Table 12.5 pcDuino PWM control register bits

Table 13.1 Clock sources available for the clocks provided by the clock manager

Table 13.2 Some registers in the clock manager device

Table 13.3 Bit fields in the clock manager control registers

Table 13.4 Bit fields in the clock manager divisor registers

Table 13.5 Clock signals in the AllWinner A10/A20 SOC

Table 13.6 Raspberry Pi UART0 register map

Table 13.7 Raspberry Pi UART data register

Table 13.8 Raspberry Pi UART receive status register/error clear register

Table 13.9 Raspberry Pi UART flags register bits

Table 13.10 Raspberry Pi UART integer baud rate divisor

Table 13.11 Raspberry Pi UART fractional baud rate divisor

Table 13.12 Raspberry Pi UART line control register bits

Table 13.13 Raspberry Pi UART control register bits

Table 13.14 pcDuino UART addresses

Table 13.15 pcDuino UART register offsets

Table 13.16 pcDuino UART receive buffer register

Table 13.17 pcDuino UART transmit holding register

Table 13.18 pcDuino UART divisor latch low register

Table 13.19 pcDuino UART divisor latch high register

Table 13.20 pcDuino UART FIFO control register

Table 13.21 pcDuino UART line control register

Table 13.22 pcDuino UART line status register

Table 13.23 pcDuino UART status register

Table 13.24 pcDuino UART transmit FIFO level register

Table 13.25 pcDuino UART receive FIFO level register

Table 13.26 pcDuino UART transmit halt register

Table 14.1 The ARM user and system registers

Table 14.2 Mode bits in the PSR

Table 14.3 ARM vector table

List of Figures

Figure 1.1 Simplified representation of a computer system

Figure 1.2 Stages of a typical compilation sequence

Figure 1.3 Tables used for converting between binary, octal, and hex

Figure 1.4 Four different representations for binary integers

Figure 1.5 Complement tables for bases ten and two

Figure 1.6 A section of memory

Figure 1.7 Typical memory layout for a program with a 32-bit address space

Figure 2.1 Equivalent static variable declarations in assembly and C

Figure 3.1 The ARM processor architecture

Figure 3.2 The ARM user program registers

Figure 3.3 The ARM process status register

Figure 5.1 ARM user program registers

Figure 6.1 Binary tree of word frequencies

Figure 6.2 Binary tree of word frequencies with index added

Figure 6.3 Binary tree of word frequencies with sorted index

Figure 7.1 In signed 8-bit math, 11011001₂ is −39₁₀

Figure 7.2 In unsigned 8-bit math, 11011001₂ is 217₁₀

Figure 7.3 Multiplication of large numbers

Figure 7.4 Longhand division in decimal and binary

Figure 7.5 Flowchart for binary division

Figure 8.1 Examples of fixed-point signed arithmetic

Figure 9.1 ARM integer and vector floating point user program registers

Figure 9.2 Bits in the FPSCR

Figure 10.1 ARM integer and NEON user program registers

Figure 10.2 Pixel data interleaved in three doubleword registers

Figure 10.3 Pixel data de-interleaved in three doubleword registers

Figure 10.4 Example of vext.8 d12,d4,d9,#5

Figure 10.5 Examples of the vrev instruction. (A) vrev16.8 d3,d4; (B) vrev32.16 d8,d9; (C) vrev32.8 d5,d7

Figure 10.6 Examples of the vtrn instruction. (A) vtrn.8 d14,d15; (B) vtrn.32 d31,d15

Figure 10.7 Transpose of a 3 × 3 matrix

Figure 10.8 Transpose of a 4 × 4 matrix of 32-bit numbers

Figure 10.9 Example of vzip.8 d9,d4

Figure 10.10 Effects of vsli.32 d4,d9,#6

Figure 11.1 Typical hardware address mapping for memory and devices

Figure 11.2 GPIO pins being used for input and output. (A) GPIO pin being used as input to read the state of a push-button switch. (B) GPIO pin being used as output to drive an LED

Figure 11.3 The Raspberry Pi expansion header location

Figure 11.4 The Raspberry Pi expansion header pin assignments

Figure 11.5 Bit-to-pin assignments for PIO control registers

Figure 11.6 The pcDuino header locations

Figure 11.7 The pcDuino header pin assignments

Figure 12.1 Pulse density modulation

Figure 12.2 Pulse width modulation

Figure 13.1 Typical system with a clock management device

Figure 13.2 Transmitter and receiver timings for two UARTs. (A) Waveform of a UART transmitting a byte. (B) Timing of a UART receiving a byte

Figure 14.1 The ARM process status register

Figure 14.2 Basic exception processing

Figure 14.3 Exception processing with multiple user processes

List of Listings

Listing 2.1 “Hello World” program in ARM assembly

Listing 2.2 “Hello World” program in C

Listing 2.3 “Hello World” assembly listing

Listing 2.4 A listing with mis-aligned data

Listing 2.5 A listing with properly aligned data

Listing 2.6 Defining a symbol for the number of elements in an array

Listing 5.1 Selection in C

Listing 5.2 Selection in ARM assembly using conditional execution

Listing 5.3 Selection in ARM assembly using branch instructions

Listing 5.4 Complex selection in C

Listing 5.5 Complex selection in ARM assembly

Listing 5.6 Unconditional loop in ARM assembly

Listing 5.7 Pre-test loop in ARM assembly

Listing 5.8 Post-test loop in ARM assembly

Listing 5.9 for loop in C

Listing 5.10 for loop rewritten as a pre-test loop in C

Listing 5.11 Pre-test loop in ARM assembly

Listing 5.12 for loop rewritten as a post-test loop in C

Listing 5.13 Post-test loop in ARM assembly

Listing 5.14 Calling scanf and printf in C

Listing 5.15 Calling scanf and printf in ARM assembly

Listing 5.16 Simple function call in C

Listing 5.17 Simple function call in ARM assembly

Listing 5.18 A larger function call in C

Listing 5.19 A larger function call in ARM assembly

Listing 5.20 A function call using the stack in C

Listing 5.21 A function call using the stack in ARM assembly

Listing 5.22 A function call using stm to push arguments onto the stack

Listing 5.23 A small function in C

Listing 5.24 A small function in ARM assembly

Listing 5.25 A small C function with a register variable

Listing 5.26 Automatic variables in ARM assembly

Listing 5.27 A C program that uses recursion to reverse a string

Listing 5.28 ARM assembly implementation of the reverse function

Listing 5.29 Better implementation of the reverse function

Listing 5.30 Even better implementation of the reverse function

Listing 5.31 String reversing in C using pointers

Listing 5.32 String reversing in assembly using pointers

Listing 5.33 Initializing an array of integers in C

Listing 5.34 Initializing an array of integers in assembly

Listing 5.35 Initializing a structured data type in C

Listing 5.36 Initializing a structured data type in ARM assembly

Listing 5.37 Initializing an array of structured data in C

Listing 5.38 Initializing an array of structured data in assembly

Listing 5.39 Improved initialization in assembly

Listing 5.40 Very efficient initialization in assembly

Listing 6.1 Definition of an abstract data type in a C header file

Listing 6.2 Definition of the image structure may be hidden in a separate header file

Listing 6.3 Definition of an ADT in assembly

Listing 6.4 C program to compute word frequencies

Listing 6.5 C header for the wordlist ADT

Listing 6.6 C implementation of the wordlist ADT

Listing 6.7 Makefile for the wordfreq program

Listing 6.8 ARM assembly implementation of wl_print_numerical()

Listing 6.9 Revised makefile for the wordfreq program

Listing 6.10 C implementation of the wordlist ADT using a tree

Listing 6.11 ARM assembly implementation of wl_print_numerical() with a tree

Listing 7.1 ARM assembly code for adding two 64-bit numbers

Listing 7.2 ARM assembly code for multiplication with a 64-bit result

Listing 7.3 ARM assembly code for multiplication with a 32-bit result

Listing 7.4 ARM assembly implementation of signed and unsigned 32-bit and 64-bit division functions

Listing 7.5 ARM assembly code for division by constant 193

Listing 7.6 ARM assembly code for division of a variable by a constant without using a multiply instruction

Listing 7.7 Header file for a big integer abstract data type

Listing 7.8 C source code file for a big integer abstract data type

Listing 7.9 Program using the bigint ADT to calculate the factorial function

Listing 7.10 ARM assembly implementation of the bigint_adc function

Listing 8.1 Examples of fixed-point multiplication in ARM assembly

Listing 8.2 Dividing x by 23

Listing 8.3 Dividing x by 23 using only shift and add

Listing 8.4 Dividing x by −50

Listing 8.5 Inefficient representation of a binimal

Listing 8.6 Efficient representation of a binimal

Listing 8.7 ARM assembly implementation of sin x and cos x using fixed-point calculations

Listing 8.8 Example showing how the sin x and cos x functions can be used to print a table

Listing 9.1 Simple scalar implementation of the sin x function using IEEE single precision

Listing 9.2 Simple scalar implementation of the sin x function using IEEE double precision

Listing 9.3 Vector implementation of the sin x function using IEEE single precision

Listing 9.4 Vector implementation of the sin x function using IEEE double precision

Listing 10.1 NEON implementation of the sin x function using single precision

Listing 10.2 NEON implementation of the sin x function using double precision

Listing 11.1 Function to map devices into the user program memory space on a Raspberry Pi

Listing 11.2 Function to map devices into the user program memory space on a pcDuino

Listing 11.3 ARM assembly code to set GPIO pin 26 to alternate function 1

Listing 11.4 ARM assembly code to configure PA10 for output

Listing 11.5 ARM assembly code to set PA10 to output a high state

Listing 11.6 ARM assembly code to read the state of PI14 and set or clear the Z flag

Listing 13.1 Assembly functions for using the Raspberry Pi UART

Listing 14.1 Definitions for ARM CPU modes

Listing 14.2 Function to set up the ARM exception table

Listing 14.3 Stubs for the exception handlers

Listing 14.4 Skeleton for an exception handler

Listing 14.5 ARM startup code

Listing 14.6 A simple main program

Listing 14.7 A sample GNU linker script

Listing 14.8 A sample make file

Listing 14.9 Running make to build the image

Listing 14.10 An improved main program

Listing 14.11 ARM startup code with timer interrupt

Listing 14.12 Functions to manage the pcDuino interrupt controller

Listing 14.13 Functions to manage the Raspberry Pi interrupt controller

Listing 14.14 Functions to manage the pcDuino timer0 device

Listing 14.15 Functions to manage the Raspberry Pi timer0 device

Listing 14.16 IRQ handler to clear the timer interrupt

Listing 14.17 A sample make file

Listing 14.18 Running make to build the image

Preface

This book is intended to be used in a first course in assembly language programming for Computer Science (CS) and Computer Engineering (CE) students. It is assumed that students using this book have already taken courses in programming and data structures, and are competent programmers in at least one high-level language. Many of the code examples in the book are written in C, with an assembly implementation following. The assembly examples can stand on their own, but students who are familiar with C, C++, or Java should find the C examples helpful.

Computer Science and Computer Engineering are very large fields, and it is impossible to cover everything that a student may eventually need to know. Course hours are limited, so educators must deliver degree programs that balance the number of concepts and skills students learn against the depth at which they learn them. With these competing goals, it is difficult to reach consensus on exactly which courses should be included in a CS or CE curriculum.

Traditionally, assembly language courses have consisted of mechanistic learning of a set of instructions, registers, and syntax. Partially because of this approach, assembly language courses have, over the years, been marginalized in, or removed altogether from, many CS and CE curricula. The author feels that this is unfortunate, because a solid understanding of assembly language leads to a better understanding of higher-level languages, compilers, interpreters, architecture, operating systems, and other important CS and CE concepts.

One of the goals of this book is to make a course in assembly language more valuable by introducing methods (and a bit of theory) that are not covered in any other CS or CE courses, while using assembly language to implement the methods. In this way, the course in assembly language goes far beyond the traditional assembly language course, and can once again play an important role in the overall CS and CE curricula.

Choice of Processor Family

Because of their ubiquity, x86-based systems have been the platform of choice for most assembly language courses over the last two decades. The author believes that this is unfortunate, because in every respect other than ubiquity, the x86 architecture is the worst possible choice for learning and teaching assembly language. The newer chips in the family have hundreds of instructions, and irregular rules govern how those instructions can be used. In an attempt to make it possible for students to succeed, typical courses use antiquated assemblers and interface with the antiquated IBM PC BIOS, using only a small subset of the modern x86 instruction set. The resulting programming environment has little or no relevance to modern computing.

Partially because of this tendency to use x86 platforms, and the resulting unnecessary burden placed on students and instructors, as well as the reliance on antiquated and irrelevant development environments, assembly language is often viewed by students as very difficult and lacking in value. The author hopes that this textbook helps students to realize the value of knowing assembly language. The relatively simple ARM processor family was chosen in hopes that the students also learn that although assembly language programming may be more difficult than high-level languages, it can be mastered.

The recent development of very low-cost ARM-based Linux computers has caused a surge of interest in the ARM architecture as an alternative to the x86 architecture, which has become increasingly complex over the years. This book should provide a solution for a growing need.

Many students have difficulty with the concept that a register can hold variable x at one point in the program, and hold variable y at some other point. They also often have difficulty with the concept that, before it can be involved in any computation, data has to be moved from memory into the CPU. Using a load-store architecture helps the students to more readily grasp these concepts.

Another common difficulty for students is relating the concepts of an address and a pointer variable. You can almost see the light bulbs over their heads when they have the “eureka!” moment and realize that pointers are just variables that hold an address. The author believes that load-store architectures make that realization easier, and hopes that the approach taken in this book will help students reach it.

Many students also struggle with the concept of recursion, regardless of what language is used. In assembly, the mechanisms involved are exposed and directly manipulated by the programmer. Examples of recursion are scattered throughout this textbook. Again, the clean architecture of the ARM makes it much easier for the students to understand what is going on.

Some students have difficulty understanding the flow of a program, and tend to put many unnecessary branches into their code. Many assembly language courses spend so much time and space on learning the instruction set that they never have time to teach good programming practices. This textbook puts strong emphasis on using structured programming concepts. The relative simplicity of the ARM architecture makes this possible.

One of the major reasons to learn and use assembly language is that it allows the programmer to create very efficient mathematical routines. The concepts introduced in this book will enable students to perform efficient non-integral math on any processor. These techniques are rarely taught because of the time that it takes to cover the x86 instruction set. With the ARM processor, less time is spent on the instruction set, and more time can be spent teaching how to optimize the code.

The combination of the ARM processor and the Linux operating system provides the least costly hardware platform and development environment available. A cluster of 10 Raspberry Pis, or similar hosts, with power supplies and networking, can be assembled for 500 US dollars or less. This cluster can support up to 50 students logging in through ssh. If their client platform supports the X Window System, they can run GUI-enabled applications. Alternatively, most low-cost ARM systems can directly drive a display and take input from a keyboard and mouse. With the addition of an NFS server (which could itself be a low-cost ARM system and a hard drive), an entire Linux ARM-based laboratory of 20 workstations could be built for 250 US dollars per seat or less. Admittedly, it would not be a high-performance laboratory, but it could be used to teach C, assembly, and other languages. The author would argue that inexperienced programmers should learn to program on low-performance machines, because it reinforces a life-long tendency towards efficiency.

General Approach

The approach of this book is to present concepts in different ways throughout the book, slowly building from simple examples towards complex programming on bare-metal embedded systems. Students who don’t understand a concept when it is explained in a certain way may easily grasp the concept when it is presented later from a different viewpoint.

The main objective of this book is to provide an improved course in assembly language by replacing the x86 platform with one that is less costly, more ubiquitous, well-designed, powerful, and easier to learn. Since students are able to master the basics of assembly language quickly, it is possible to teach a wider range of topics, such as fixed and floating point mathematics, ethical considerations, performance tuning, and interrupt processing. The author hopes that courses using this book will better prepare students for the junior and senior level courses in operating systems, computer architecture, and compilers.

Companion Website

Please visit the companion web site to access additional resources. Instructors may download the author’s lecture slides and solution manual for the exercises. Students and instructors may also access the laboratory manual and additional code examples. The author welcomes suggestions for additional lecture slides, laboratory assignments, or other materials.

http://booksite.elsevier.com/9780128036983

Acknowledgments

I would like to thank Randy Warner for reading the manuscript, catching errors, and making helpful suggestions. I would also like to thank the following students for suggesting exercises with answers and catching numerous errors in the drafts: Zach Buechler, Preston Cook, Joshua Daybrest, Matthew DeYoung, Josh Dodd, Matt Dyke, Hafiza Farzami, Jeremy Goens, Lawrence Hoffman, Colby Johnson, Benjamin Kaiser, Lauren Keene, Jayson Kjenstad, Murray LaHood-Burns, Derek Lane, Yanlin Li, Luke Meyer, Matthew Mielke, Forrest Miller, Christopher Navarro, Girik Ranchhod, Josh Schweigert, Christian Sieh, Weston Silbaugh, Jacob St. Amand, Njaal Tengesdal, Dylan Thoeny, Michael Vortherms, Dicheng Wu, and Kekoa (Peter) Yamaguchi. Finally, I am also very grateful for my assistants, Scott Logan, Ian Carlson, and Derek Stotz, who gave very valuable feedback during the writing of this book.

Part I

Assembly as a Language

Chapter 1

Introduction

Abstract

This chapter first gives a very high-level description of the major components and function of a computer system. It then motivates the reader by giving reasons why learning assembly language is important for Computer Scientists and Computer Engineers, and explains why the ARM processor is a good choice for a first assembly language. Next, it explains binary data representations, including various integer formats, ASCII, and Unicode. Finally, it describes the memory sections of a typical program during execution. By the end of the chapter, the groundwork has been laid for learning to program in assembly language.

Keywords

Instruction; Instruction stream; Central processing unit; Memory; Input/output device; High-level language; Assembly language; ARM processor; Binary; Hexadecimal; Decimal; Radix or base system; Base conversion; Sign magnitude; Unsigned; Complement; Excess-n; ASCII; Unicode; UTF-8; Stack; Heap; Data section; Text section

An executable computer program is, ultimately, just a series of numbers that have very little or no meaning to a human being. We have developed a variety of human-friendly languages in which to express computer programs, but in order for the program to execute, it must eventually be reduced to a stream of numbers. Assembly language is one step above writing the stream of numbers. The stream of numbers is called the instruction stream. Each number in the instruction stream instructs the computer to perform one (usually small) operation. Although each instruction does very little, the ability of the programmer to specify any sequence of instructions and the ability of the computer to perform billions of these small operations every second makes modern computers very powerful and flexible tools. In assembly language, one line of code usually gets translated into one machine instruction. In high-level languages, a single line of code may generate many machine instructions.

A simplified model of a computer system, as shown in Fig. 1.1, consists of memory, input/output devices, and a central processing unit (CPU), connected together by a system bus. The bus can be thought of as a roadway that allows data to travel between the components of the computer system. The CPU is the part of the system where most of the computation occurs, and the CPU controls the other devices in the system.

Figure 1.1 Simplified representation of a computer system.

Memory can be thought of as a series of mailboxes. Each mailbox can hold a single postcard with a number written on it, and each mailbox has a unique numeric identifier. The identifier, x, is called the memory address, and the number stored in the mailbox is called the contents of address x. Some of the mailboxes contain data, and others contain instructions which control what actions are performed by the CPU.

The CPU also contains a much smaller set of mailboxes, which we call registers. Data can be copied from cards stored in memory to cards stored in the CPU, or vice-versa. Once data has been copied into one of the CPU registers, it can be used in computation. For example, in order to add two numbers in memory, they must first be copied into registers on the CPU. The CPU can then add the numbers together and store the result in one of the CPU registers. The result of the addition can then be copied back into one of the mailboxes in the memory.

Modern computers execute instructions sequentially. In other words, the next instruction to be executed is at the memory address immediately following the current instruction. One of the registers in the CPU, the program counter (PC), keeps track of the location from which the next instruction is to be fetched. The CPU follows a very simple sequence of actions. It fetches an instruction from memory, increments the PC, executes the instruction, and then repeats the process with the next instruction. However, some instructions may change the PC, so that the next instruction is fetched from a non-sequential address.

1.1 Reasons to Learn Assembly

There are many high-level programming languages, such as Java, Python, C, and C++ that have been designed to allow programmers to work at a high level of abstraction, so that they do not need to understand exactly what instructions are needed by a particular CPU. For compiled languages, such as C and C++, a compiler handles the task of translating the program, written in a high-level language, into assembly language for the particular CPU on the system. An assembler then converts the program from assembly language into the binary codes that the CPU reads as instructions.

High-level languages can greatly enhance programmer productivity. However, there are some situations where writing assembly code directly is desirable or necessary. For example, assembly language may be the best choice when writing

 the first steps in booting the computer,

 code to handle interrupts,

 low-level locking code for multi-threaded programs,

 code for machines where no compiler exists,

 code which needs to be optimized beyond the limits of the compiler,

 on computers with very limited memory, and

 code that requires low-level access to architectural and/or processor features.

Aside from sheer necessity, there are several other reasons why it is still important for computer scientists to learn assembly language.

One example where knowledge of assembly is indispensable is when designing and implementing compilers for high-level languages. As shown in Fig. 1.2, a typical compiler for a high-level language must generate assembly language as its output. Most compilers are designed to have multiple stages. In the input stage, the source language is read and converted into a graph representation. The graph may be optimized before being passed to the output, or code generation, stage where it is converted to assembly language. The assembly is then fed into the system’s assembler to generate an object file. The object file is linked with other object files (which are often combined into libraries) to create an executable program.

Figure 1.2 Stages of a typical compilation sequence.

The code generation stage of a compiler must traverse the graph and emit assembly code. The quality of the assembly code that is generated can have a profound influence on the performance of the executable program. Therefore, the programmer responsible for the code generation portion of the compiler must be well versed in assembly programming for the target CPU.

Some people believe that a good optimizing compiler will generate better assembly code than a human programmer. This belief is not justified. Highly optimizing compilers employ many clever algorithms, but like all programs, they are not perfect. Outside of the cases that they were designed for, they do not optimize well. Many newer CPUs have instructions which operate on multiple items of data at once. However, compilers rarely make use of these powerful single instruction, multiple data (SIMD) instructions. Instead, it is common for programmers to write functions in assembly language to take advantage of SIMD instructions. The assembly functions are assembled into object file(s), then linked with the object file(s) generated by the high-level language compiler.

Many modern processors also have some support for processing vectors (arrays). Compilers are usually not very good at making effective use of the vector instructions. In order to achieve excellent vector performance for audio or video codecs and other time-critical code, it is often necessary to resort to small pieces of assembly code in the performance-critical inner loops. A good example of this type of code is when performing vector and matrix multiplies. Such operations are commonly needed in processing images and in graphical applications. The ARM vector instructions are explained in Chapter 9.

Another reason for assembly is when writing certain parts of an operating system. Although modern operating systems are mostly written in high-level languages, there are some portions of the code that can only be done in assembly. Typical uses of assembly language are when writing device drivers, saving the state of a running program so that another program can use the CPU, restoring the saved state of a running program so that it can resume executing, and managing memory and memory protection hardware. There are many other tasks central to a modern operating system which can only be accomplished in assembly language. Careful design of the operating system can minimize the amount of assembly required, but cannot eliminate it completely.

Another good reason to learn assembly is for debugging. Simply understanding what is going on “behind the scenes” of compiled languages such as C and C++ can be very valuable when trying to debug programs. If there is a problem in a call to a third-party library, sometimes the only way a developer can isolate and diagnose the problem is to run the program under a debugger and step through it one machine instruction at a time. This does not require a deep knowledge of assembly language coding, but at least a passing familiarity with assembly is helpful in such cases. Analysis of assembly code is an important skill for C and C++ programmers, who may occasionally have to diagnose a fault by looking at the contents of CPU registers and single-stepping through machine instructions.

Assembly language is an important part of the path to understanding how the machine works. Even though only a small percentage of computer scientists will be lucky enough to work on the code generator of a compiler, they all can benefit from the deeper level of understanding gained by learning assembly language. Many programmers do not really understand pointers until they have written assembly language.

Without first learning assembly language, it is impossible to learn advanced concepts such as microcode, pipelining, instruction scheduling, out-of-order execution, threading, branch prediction, and speculative execution. There are many other concepts, especially when dealing with operating systems and computer architecture, which require some understanding of assembly language. The best programmers understand why some language constructs perform better than others, how to reduce cache misses, and how to prevent buffer overruns that destroy security.

Every program is meant to run on a real machine. Even though there are many languages, compilers, virtual machines, and operating systems to enable the programmer to use the machine more conveniently, the strengths and weaknesses of that machine still determine what is easy and what is hard. Learning assembly is a fundamental part of understanding enough about the machine to make informed choices about how to write efficient programs, even when writing in a high-level language.

As an analogy, most people do not need to know a lot about how an internal combustion engine works in order to operate an automobile. A race car driver needs a much better understanding of exactly what happens when he or she steps on the accelerator pedal in order to be able to judge precisely when (and how hard) to do so. Also, who would trust their car to a mechanic who could not tell the difference between a spark plug and a brake caliper? Worse still, should we trust an engineer to build a car without that knowledge? Even in this day of computerized cars, someone needs to know the gritty details, and they are paid well for that knowledge. Knowledge of assembly language is one of the things that defines the computer scientist and engineer.

When learning assembly language, the specific instruction set is not critically important, because what is really being learned is the fine detail of how a typical stored-program machine uses different storage locations and logic operations to convert a string of bits into a meaningful calculation. However, when it comes to learning assembly languages, some processors make it more difficult than it needs to be. Because some processors have an instruction set that is extremely irregular, non-orthogonal, large, and poorly designed, they are not a good choice for learning assembly. The author feels that teaching students their first assembly language on one of those processors should be considered a crime, or at least a form of mental abuse. Luckily, there are processors that are readily available, low-cost, and relatively easy to learn assembly with. This book uses one of them as the model for assembly language.

1.2 The ARM Processor

In the late 1970s, the microcomputer industry was a fierce battleground, with several companies competing to sell computers to small business and home users. One of those companies, based in the United Kingdom, was Acorn Computers Ltd. Acorn’s flagship product, the BBC Micro, was based on the same processor that Apple Computer had chosen for their Apple II line of computers: the 8-bit 6502 made by MOS Technology. As the 1980s approached, microcomputer manufacturers were looking for more powerful 16-bit and 32-bit processors. The engineers at Acorn considered the processor chips that were available at the time, and concluded that there was nothing available that would meet their needs for the next generation of Acorn computers.

The only reasonably-priced processors that were available were the Motorola 68000 (a 32-bit processor used in the Apple Macintosh and most high-end Unix workstations) and the Intel 80286 (a 16-bit processor used in less powerful personal computers such as the IBM PC). During the previous decade, a great deal of research had been conducted on developing high-performance computer architectures. One of the outcomes of that research was the development of a new paradigm for processor design, known as Reduced Instruction Set Computing (RISC). One advantage of RISC processors was that they could deliver higher performance with a much smaller number of transistors than the older Complex Instruction Set Computing (CISC) processors such as the 68000 and 80286. The engineers at Acorn decided to design and produce their own processor. They used the BBC Micro to design and simulate their new processor, and in 1987, they introduced the Acorn Archimedes. The Archimedes was arguably the most powerful home computer in the world at that time, with graphics and audio capabilities that IBM PC and Apple Macintosh users could only dream about. Thus began the long and successful dynasty of the Acorn RISC Machine (ARM) processor.

Acorn never made a big impact on the global computer market. Although Acorn eventually went out of business, the processor that they created has lived on. It was re-named the Advanced RISC Machine, and is now known simply as ARM. Stewardship of the ARM processor belongs to ARM Holdings, which manages the design of new ARM architectures and licenses the manufacturing rights to other companies. ARM Holdings does not manufacture any processor chips, yet more ARM processors are produced annually than all other processor designs combined. Most ARM processors are used as components for embedded systems and portable devices. If you have a smart phone or similar device, then there is a very good chance that it has an ARM processor in it. Because of its enormous market presence, clean architecture, and small, orthogonal instruction set, the ARM is a very good choice for learning assembly language.

Although it dominates the portable device market, the ARM processor has almost no presence in the desktop or server market. However, that may change. In 2012, ARM Holdings announced the ARM64 architecture, which is the first major redesign of the ARM architecture in 30 years. The ARM64 is intended to compete for the desktop and server market with other high-end processors such as the Sun SPARC and Intel Xeon. Regardless of whether or not the ARM64 achieves much market penetration, the original ARM 32-bit processor architecture is so ubiquitous that it clearly will be around for a long time.

1.3 Computer Data

The basic unit of data in a digital computer is the binary digit, or bit. A bit can have a value of zero or one. In order to store numbers larger than 1, bits are combined into larger units. For instance, using two bits, it is possible to represent any number between zero and three. This is shown in Table 1.1. When stored in the computer, all data is simply a string of binary digits. There is more than one way that such a fixed-length string of binary digits can be interpreted.

Table 1.1

Values represented by two bits

Bit 1 Bit 0 Value
0 0 0
0 1 1
1 0 2
1 1 3

Computers have been designed using many different bit group sizes, including 4, 8, 10, 12, and 14 bits. Today most computers recognize a basic grouping of 8 bits, which we call a byte. Some computers can work in units of 4 bits, which is commonly referred to as a nibble (sometimes spelled “nybble”). A nibble is a convenient size because it can exactly represent one hexadecimal digit. Additionally, most modern computers can also work with groupings of 16, 32, and 64 bits. The CPU is designed with a default word size. For most modern CPUs, the default word size is 32 bits. Many processors support 64-bit words, which is increasingly becoming the default size.

1.3.1 Representing Natural Numbers

A numeral system is a writing system for expressing numbers. The most common system is the Hindu-Arabic number system, which is now used throughout the world. Almost from the first day of formal education, children begin learning how to add, subtract, and perform other operations using the Hindu-Arabic system. After years of practice, performing basic mathematical operations using strings of digits between 0 and 9 seems natural. However, there are other ways to count and perform arithmetic, such as Roman numerals, unary systems, and Chinese numerals. With a little practice, it is possible to become as proficient at performing mathematics with other number systems as with the Hindu-Arabic system.

The Hindu-Arabic system is a base ten or radix ten system, because it uses the ten digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. For our purposes, the words radix and base are equivalent, and refer to the number of individual digits available in the numbering system. The Hindu-Arabic system is also a positional system, or a place-value notation, because the value of each digit in a number depends on its position in the number. The radix ten Hindu-Arabic system is only one of an infinite family of closely related positional systems. The members of this family differ only in the radix used (and therefore, the number of characters used). For bases greater than base ten, characters are borrowed from the alphabet and used to represent digits. For example, the first column in Table 1.2 shows the character “A” being used as a single digit representation for the number 10.

Table 1.2

The first 21 integers (starting with 0) in various bases

Base
16 10 9 8 7 6 5 4 3 2
0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 10
3 3 3 3 3 3 3 3 10 11
4 4 4 4 4 4 4 10 11 100
5 5 5 5 5 5 10 11 12 101
6 6 6 6 6 10 11 12 20 110
7 7 7 7 10 11 12 13 21 111
8 8 8 10 11 12 13 20 22 1000
9 9 10 11 12 13 14 21 100 1001
A 10 11 12 13 14 20 22 101 1010
B 11 12 13 14 15 21 23 102 1011
C 12 13 14 15 20 22 30 110 1100
D 13 14 15 16 21 23 31 111 1101
E 14 15 16 20 22 24 32 112 1110
F 15 16 17 21 23 30 33 120 1111
10 16 17 20 22 24 31 100 121 10000
11 17 18 21 23 25 32 101 122 10001
12 18 20 22 24 30 33 102 200 10010
13 19 21 23 25 31 34 103 201 10011
14 20 22 24 26 32 40 110 202 10100


In base ten, we think of numbers as strings of the 10 digits, “0”–“9”. Each digit counts 10 times the amount of the digit to its right. If we restrict ourselves to integers, then the digit furthest to the right is always the ones digit. It is also referred to as the least significant digit. The digit immediately to the left of the ones digit is the tens digit. To the left of that is the hundreds digit, and so on. The leftmost digit is referred to as the most significant digit. The following equation shows how a number can be decomposed into its constituent digits:

57839₁₀ = 5×10⁴ + 7×10³ + 8×10² + 3×10¹ + 9×10⁰.

Note that the subscript of “10” on 57839₁₀ indicates that the number is given in base ten.

Imagine that we only had 7 digits: 0, 1, 2, 3, 4, 5, and 6. We need 10 digits for base ten, so with only 7 digits we are limited to base seven. In base seven, each digit in the string represents a power of seven rather than a power of ten. We can represent any integer in base seven, but it may take more digits than in base ten. Other than using a different base for the power of each digit, the math works exactly the same as for base ten. For example, suppose we have the following number in base seven: 330425₇. We can convert this number to base ten as follows:

330425₇ = 3×7⁵ + 3×7⁴ + 0×7³ + 4×7² + 2×7¹ + 5×7⁰ = 50421₁₀ + 7203₁₀ + 0₁₀ + 196₁₀ + 14₁₀ + 5₁₀ = 57839₁₀

Base two, or binary is the “native” number system for modern digital systems. The reason for this is mainly because it is relatively easy to build circuits with two stable states: on and off (or 1 and 0). Building circuits with more than two stable states is much more difficult and expensive, and any computation that can be performed in a higher base can also be performed in binary. The least significant (rightmost) digit in binary is referred to as the least significant bit, or LSB, while the leftmost binary digit is referred to as the most significant bit, or MSB.

1.3.2 Base Conversion

The most common bases used by programmers are base two (binary), base eight (octal), base ten (decimal) and base sixteen (hexadecimal). Octal and hexadecimal are common because, as we shall see later, they can be translated quickly and easily to and from base two, and are often easier for humans to work with than base two. Note that for base sixteen, we need 16 characters. We use the digits 0 through 9 plus the letters A through F. Table 1.2 shows the equivalents for all numbers between 0 and 20 in base two through base ten, and base sixteen.

Before learning assembly language it is essential to know how to convert from any base to any other base. Since we are already comfortable working in base ten, we will use that as an intermediary when converting between two arbitrary bases. For instance, if we want to convert a number in base three to base five, we will do it by first converting the base three number to base ten, then from base ten to base five. By using this two-stage process, we will only need to learn to convert between base ten and any arbitrary base b.

Base b to decimal

Converting from an arbitrary base b to base ten simply involves multiplying each base b digit d by bⁿ, where n is the significance of digit d, and summing all of the results. For example, converting the base five number 3421₅ to base ten is performed as follows:

3421₅ = 3×5³ + 4×5² + 2×5¹ + 1×5⁰ = 375₁₀ + 100₁₀ + 10₁₀ + 1₁₀ = 486₁₀

This conversion procedure works for converting any integer from any arbitrary base b to its equivalent representation in base ten. Example 1.1 gives another specific example of how to convert from base b to base ten.

Example 1.1

Converting From an Arbitrary Base to Base Ten

Converting 7362₈ to base ten is accomplished by expanding and summing the terms:

7362₈ = 7×8³ + 3×8² + 6×8¹ + 2×8⁰ = 7×512 + 3×64 + 6×8 + 2×1 = 3584 + 192 + 48 + 2 = 3826₁₀
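The expand-and-sum procedure can be sketched in Python (a hypothetical helper, not taken from the book):

```python
def to_decimal(digits, base):
    """Convert a digit string in the given base to a decimal integer.

    Each base-b digit d is multiplied by b**n, where n is the digit's
    significance, and the products are summed.
    """
    value = 0
    n = len(digits) - 1
    for d in digits:
        value += int(d, 16) * base ** n   # int(d, 16) maps '0'-'9' and 'A'-'F'
        n -= 1
    return value
```

For example, to_decimal("3421", 5) returns 486, matching the base five conversion worked above.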

Decimal to base b

Converting from base ten to an arbitrary base b involves repeated division by the base, b. After each division, the remainder is used as the next more significant digit in the base b number, and the quotient is used as the dividend for the next iteration. The process is repeated until the quotient is zero. For example, converting 56₁₀ to base four is accomplished as follows:

56 ÷ 4 = 14, remainder 0
14 ÷ 4 = 3, remainder 2
3 ÷ 4 = 0, remainder 3

Reading the remainders from the last division to the first yields 320₄. This result can be double-checked by converting it back to base ten as follows:

320₄ = 3×4² + 2×4¹ + 0×4⁰ = 48 + 8 + 0 = 56₁₀.

Since we arrived at the same number we started with, we have verified that 56₁₀ = 320₄. This conversion procedure works for converting any integer from base ten to any arbitrary base b. Example 1.2 gives another example of converting from base ten to another base b.

Example 1.2

Converting from Base Ten to an Arbitrary Base

Converting 8341₁₀ to base seven is accomplished as follows:

8341 ÷ 7 = 1191, remainder 4
1191 ÷ 7 = 170, remainder 1
170 ÷ 7 = 24, remainder 2
24 ÷ 7 = 3, remainder 3
3 ÷ 7 = 0, remainder 3

Reading the remainders from the last division to the first yields 33214₇.
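The repeated-division procedure can be sketched in Python (a hypothetical helper, not the book’s code):

```python
def from_decimal(value, base):
    """Convert a non-negative decimal integer to a digit string in base b
    by repeated division, collecting the remainders as the digits."""
    if value == 0:
        return "0"
    digits = ""
    while value > 0:
        value, r = divmod(value, base)
        digits = "0123456789ABCDEF"[r] + digits   # remainder is next digit
    return digits
```

For example, from_decimal(56, 4) returns "320", matching the worked conversion above.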

Conversion between arbitrary bases

Although it is possible to perform the division and multiplication steps in any base, most people are much better at working in base ten. For that reason, the easiest way to convert from any base a to any other base b is to use a two-step process. The first step is to convert from base a to decimal. The second step is to convert from decimal to base b. Example 1.3 shows how to convert from any base to any other base.

Example 1.3

Converting from an Arbitrary Base to Another Arbitrary Base

Converting 1011011₃ to base 11 is accomplished in two steps. The number is first converted to base ten as follows:

1011011₃ = 1×3⁶ + 0×3⁵ + 1×3⁴ + 1×3³ + 0×3² + 1×3¹ + 1×3⁰ = 729 + 81 + 27 + 3 + 1 = 841₁₀

Then the result is converted to base 11:

841 ÷ 11 = 76, remainder 5
76 ÷ 11 = 6, remainder 10 (digit A)
6 ÷ 11 = 0, remainder 6

Reading the remainders from the last division to the first yields 6A5₁₁.
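The two-step conversion can be sketched by composing the two directions; here Python’s built-in int handles the base-a-to-decimal step (the function name is my own):

```python
def convert(digits, base_from, base_to):
    """Convert a digit string between two arbitrary bases via decimal."""
    value = int(digits, base_from)            # step 1: base a -> base ten
    out = ""
    while value:                              # step 2: base ten -> base b
        value, r = divmod(value, base_to)
        out = "0123456789ABCDEF"[r] + out
    return out or "0"
```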

Bases that are powers-of-two

In addition to the methods above, there is a simple method for quickly converting between base two, base eight, and base sixteen. These shortcuts rely on the fact that 2, 8, and 16 are all powers of two. Because of this, it takes exactly four binary digits (bits) to represent exactly one hexadecimal digit. Likewise, it takes exactly three bits to represent an octal digit. Conversely, each hexadecimal digit can be converted to exactly four binary digits, and each octal digit can be converted to exactly three binary digits. This relationship makes it possible to do very fast conversions using the tables shown in Fig. 1.3.

Figure 1.3 Tables used for converting between binary, octal, and hex.

When converting from hexadecimal to binary, all that is necessary is to replace each hex digit with the corresponding binary digits from the table. For example, to convert 5AC4₁₆ to binary, we just replace “5” with “0101,” replace “A” with “1010,” replace “C” with “1100,” and replace “4” with “0100.” So, just by referring to the table, we can immediately see that 5AC4₁₆ = 0101101011000100₂. This method works exactly the same for converting from octal to binary, except that it uses the table on the right side of Fig. 1.3.

Converting from binary to hexadecimal is also very easy using the table. Given a binary number, n, take the four least significant digits of n and find them in the table on the left side of Fig. 1.3. The hexadecimal digit on the matching line of the table is the least significant hex digit. Repeat the process with the next set of four bits and continue until there are no bits remaining in the binary number. For example, to convert 0011100101010111₂ to hexadecimal, just divide the number into groups of four bits, starting on the right, to get: 0011|1001|0101|0111₂. Now replace each group of four bits by looking up the corresponding hex digit in the table on the left side of Fig. 1.3, to convert the binary number to 3957₁₆. In the case where the binary number does not have enough bits, simply pad with zeros in the high-order bits. For example, dividing the number 1001100010011₂ into groups of four yields 1|0011|0001|0011₂ and padding with zeros in the high-order bits results in 0001|0011|0001|0011₂. Looking up the four groups in the table reveals that 0001|0011|0001|0011₂ = 1313₁₆.
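The table lookups in Fig. 1.3 translate directly into code. A sketch with a generated nibble table (the names are mine, not from the book):

```python
HEX = "0123456789ABCDEF"
# one table row per hex digit: "0000" -> '0', ..., "1111" -> 'F'
NIBBLES = {format(i, "04b"): HEX[i] for i in range(16)}

def bin_to_hex(bits):
    # pad the high-order bits with zeros up to a multiple of four
    bits = bits.zfill(-(-len(bits) // 4) * 4)
    return "".join(NIBBLES[bits[i:i + 4]] for i in range(0, len(bits), 4))

def hex_to_bin(hexdigits):
    # replace each hex digit with its four-bit group
    return "".join(format(HEX.index(h), "04b") for h in hexdigits)
```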

1.3.3 Representing Integers

The computer stores groups of bits, but the bits by themselves have no meaning. The programmer gives them meaning by deciding what the bits represent, and how they are interpreted. Interpreting a group of bits as unsigned integer data is relatively simple. Each bit is weighted by a power of two, and the value of the group of bits is the sum of the non-zero bits multiplied by their respective weights. However, programmers often need to represent negative as well as non-negative numbers, and there are many possibilities for storing and interpreting integers whose value can be both positive and negative. Programmers and hardware designers have developed several standard schemes for encoding such numbers. The three main methods for storing and interpreting signed integer data are two’s complement, sign-magnitude, and excess-N. Fig. 1.4 shows how the same binary pattern of bits can be interpreted as a number in four different ways.

Figure 1.4 Four different representations for binary integers.

Sign-magnitude representation

The sign-magnitude representation simply reserves the most significant bit to represent the sign of the number, and the remaining bits are used to store the magnitude of the number. This method has the advantage that it is easy for humans to interpret, with a little practice. However, addition and subtraction are slightly complicated. The addition/subtraction logic must compare the sign bits, complement one of the inputs if they are different, implement an end-around carry, and complement the result if there was no carry from the most significant bit. Complements are explained later in this section. Because of the complexity, most CPUs do not directly support addition and subtraction of integers in sign-magnitude form. However, this method is commonly used for the mantissa in floating-point numbers, as will be explained in Chapter 8. Another drawback to sign-magnitude is that it has two representations for zero, which can cause problems if the programmer is not careful.

Excess-(2ⁿ⁻¹ − 1) representation

Another method for representing both positive and negative numbers is by using an excess-N representation. With this representation, the number that is stored is N greater than the actual value. This representation is relatively easy for humans to interpret. Addition and subtraction are easily performed using the complement method, which is explained later in this section. This representation is the same as unsigned math, with the addition of a bias, which is usually 2ⁿ⁻¹ − 1. So, zero is represented as zero plus the bias. With n = 12 bits, the bias is 2¹²⁻¹ − 1 = 2047₁₀, or 011111111111₂. This method is commonly used to store the exponent in floating-point numbers, as will be explained in Chapter 8.
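A small sketch of excess-(2ⁿ⁻¹ − 1) encoding and decoding (the function names are my own, not the book’s):

```python
def excess_encode(value, n):
    """Store value as an n-bit excess-(2**(n-1) - 1) bit pattern."""
    bias = 2 ** (n - 1) - 1
    return format(value + bias, "0%db" % n)   # stored number = value + bias

def excess_decode(bits):
    """Recover the value by subtracting the bias back out."""
    bias = 2 ** (len(bits) - 1) - 1
    return int(bits, 2) - bias
```

With n = 12, zero encodes to 011111111111, as in the example above.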

Complement representation

A very efficient method for dealing with signed numbers involves representing negative numbers as the radix complements of their positive counterparts. The complement is the amount that must be added to something to make it “whole.” For instance, in geometry, two angles are complementary if they add to 90°. In radix mathematics, the complement of a digit x in base b is simply b − x. For example, in base ten, the complement of 4 is 10 − 4 = 6.

In complement representation, the most significant digit of a number is reserved to indicate whether or not the number is negative. If the first digit is less than b/2 (where b is the radix), then the number is positive. If the first digit is greater than or equal to b/2, then the number is negative. The first digit is not part of the magnitude of the number, but only indicates the sign of the number. For example, numbers in ten’s complement notation are positive if the first digit is less than 5, and negative if the first digit is greater than 4. This works especially well in binary, since the number is considered positive if the first bit is zero and negative if the first bit is one. The magnitude of a negative number can be obtained by taking the radix complement. Because of the nice properties of the complement representation, it is the most common method for representing signed numbers in digital computers.

Finding the complement: The radix complement of an n-digit number y in radix (base) b is defined as

C(y_b) = bⁿ − y_b.  (1.1)

For example, the ten’s complement of the four-digit number 8734₁₀ is 10⁴ − 8734 = 1266. In this example, we directly applied the definition of the radix complement from Eq. (1.1). That is easy in base ten, but not so easy in an arbitrary base, because it involves performing a subtraction. However, there is a very simple method for calculating the complement which does not require subtraction. This method finds the diminished radix complement, which is (bⁿ − 1) − y, by substituting each digit with its complement from a complement table. The radix complement is then found by adding one to the diminished radix complement. Fig. 1.5 shows the complement tables for bases ten and two. Examples 1.4 and 1.5 show how the complement is obtained in bases ten and two respectively. Examples 1.6 and 1.7 show additional conversions between binary and decimal.

Figure 1.5 Complement tables for bases ten and two.
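The digit-substitution method can be sketched in Python; the complement tables of Fig. 1.5 generalize to any base b as digit → (b − 1) − digit (the helper name is mine, not from the book):

```python
def radix_complement(digits, base):
    """Find the b's complement of an n-digit base-b string by substituting
    each digit d with (b-1)-d, then adding one to the result."""
    alphabet = "0123456789ABCDEF"
    diminished = "".join(alphabet[(base - 1) - alphabet.index(d)]
                         for d in digits)
    value = int(diminished, base) + 1
    out = ""
    while value:
        value, r = divmod(value, base)
        out = alphabet[r] + out
    # keep exactly n digits (the complement of zero wraps around to zero)
    return out.zfill(len(digits))[-len(digits):]
```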

Example 1.4

The Complement in Base Ten

The nine’s complement of the base ten number 593 is found by finding the digit ‘5’ in the complement table, and replacing it with its complement, which is the digit ‘4.’ The digit ‘9’ is replaced with ‘0,’ and ‘3’ is replaced with ‘6.’ Therefore the nine’s complement of 593₁₀ is 406₁₀. Likewise, the nine’s complement of 1000₁₀ is 8999₁₀ and the nine’s complement of 0999₁₀ is 9000₁₀.

The ten’s complement of 726₁₀ is 273₁₀ + 1 = 274₁₀.

Example 1.5

The One’s and Two’s Complement

The one’s complement of a binary number is found in the same way as the nine’s complement of a decimal number, but using the one’s complement table instead of the nine’s complement table. The one’s complement of 01001101₂ is 10110010₂ and the one’s complement of 0000000010110110₂ is 1111111101001001₂. Note that the one’s complement of a base two number is equivalent to the bitwise logical “not” (Boolean complement) operator. This operator is very easy to implement in digital hardware.

The two’s complement is the one’s complement plus one. The two’s complement of 1010100₂ is 0101011₂ + 1 = 0101100₂.
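In binary these two operations reduce to flipping every bit and adding one; a quick sketch (the helper names are mine):

```python
def ones_complement(bits):
    # bitwise NOT: flip every bit in the string
    return "".join("1" if b == "0" else "0" for b in bits)

def twos_complement(bits):
    # one's complement plus one, truncated to the original width
    n = len(bits)
    value = (int(ones_complement(bits), 2) + 1) % (2 ** n)
    return format(value, "0%db" % n)
```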

Example 1.6

Conversion from Binary to Decimal

Suppose we want to convert a signed binary number to decimal.

1. If the most significant bit is ‘1’, then

a. Find the two’s complement

b. Convert the result to base 10

c. Add a negative sign

2. else

a. Convert the number to base 10

Number One’s Complement Two’s Complement Base 10 Negative
11010010 00101101 00101110 46 − 46
1111111100010110 0000000011101001 0000000011101010 234 − 234
01110100 Not negative 116
1000001101010110 0111110010101001 0111110010101010 31914 −31914
0101001111011011 Not negative 21467

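The decision procedure of Example 1.6 can be sketched in Python (a hypothetical helper, not the book’s code); subtracting 2ⁿ when the most significant bit is ‘1’ is equivalent to the complement-and-negate steps:

```python
def signed_binary_to_decimal(bits):
    """Interpret a two's-complement bit string as a signed decimal value."""
    value = int(bits, 2)
    if bits[0] == "1":              # MSB set: the number is negative
        value -= 2 ** len(bits)
    return value
```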

Example 1.7

Conversion from Decimal to Binary

Suppose we want to convert a negative number from decimal to binary.

1. Remove the negative sign

2. Convert the number to binary

3. Take the two’s complement

Base 10 Positive Binary One’s Complement Two’s Complement
-46 00101110 11010001 11010010
-234 0000000011101010 1111111100010101 1111111100010110
-116 01110100 10001011 10001100
-31914 0111110010101010 1000001101010101 1000001101010110
-21467 0101001111011011 1010110000100100 1010110000100101

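Example 1.7’s procedure can likewise be sketched; reducing modulo 2ⁿ performs the remove-sign, convert, and complement steps in one operation (the function name is mine):

```python
def decimal_to_signed_binary(value, n):
    """Encode a decimal integer as an n-bit two's-complement string."""
    return format(value % (2 ** n), "0%db" % n)
```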

Subtraction using complements

One very useful feature of complement notation is that it can be used to perform subtraction by using addition. Given two numbers in base b, x_b and y_b, the difference can be computed as:

z_b = x_b − y_b  (1.2)

= x_b + (bⁿ − y_b) − bⁿ  (1.3)

= x_b + C(y_b) − bⁿ,  (1.4)

where C(y_b) is the radix complement of y_b. Assume that x_b and y_b are both positive, where y_b ≤ x_b, and both numbers have the same number of digits n (y_b may have leading zeros). In this case, the result of x_b + C(y_b) will always be greater than or equal to bⁿ, but less than 2×bⁿ. This means that the result of x_b + C(y_b) will always have a ‘1’ in the n + 1 digit position. Dropping that initial ‘1’ is equivalent to subtracting bⁿ, making the result x − y + bⁿ − bⁿ, or just x − y, which is the desired result. This can be reduced to a simple procedure. When x and y are both positive and y ≤ x, the following four steps are performed:

1. pad the subtrahend (y) with leading zeros, as necessary, so that both numbers have the same number of digits (n),

2. find the b’s complement of the subtrahend,

3. add the complement to the minuend,

4. discard the leading ‘1’.

The complement notation provides a very easy way to represent both positive and negative integers using a fixed number of digits, and to perform subtraction by using addition. Since modern computers typically use a fixed number of bits, complement notation provides a very convenient and efficient way to store signed integers and perform mathematical operations on them. Hardware is simplified because there is no need to build a specialized subtractor circuit. Instead, a very simple complement circuit is built and the adder is reused to perform subtraction as well as addition.
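The four-step procedure can be sketched in Python for any base b (the function name is my own; digits beyond 9 use the letters A–F):

```python
def subtract_by_addition(x, y, base):
    """Compute x - y for non-negative base-b digit strings with y <= x,
    using the radix-complement method instead of a subtractor."""
    alphabet = "0123456789ABCDEF"
    n = max(len(x), len(y))
    x, y = x.zfill(n), y.zfill(n)          # step 1: pad to n digits
    comp = base ** n - int(y, base)        # step 2: b's complement of subtrahend
    total = int(x, base) + comp            # step 3: add complement to minuend
    total -= base ** n                     # step 4: discard the leading '1'
    out = ""
    while total:
        total, r = divmod(total, base)
        out = alphabet[r] + out
    return out.zfill(n)
```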

1.3.4 Representing Characters

In the previous section, we discussed how the computer stores information as groups of bits, and how we can interpret those bits as numbers in base two. Given that the computer can only store information using groups of bits, how can we store textual information? The answer is that we create a table, which assigns a numerical value to each character in our language.

Early in the development of computers, several computer manufacturers developed such tables, or character coding schemes. These schemes were incompatible and computers from different manufacturers could not easily exchange textual data without the use of translation software to convert the character codes from one coding scheme to another.

Eventually, a standard coding scheme, known as the American Standard Code for Information Interchange (ASCII) was developed. Work on the ASCII standard began on October 6, 1960, with the first meeting of the American Standards Association’s (ASA) X3.2 subcommittee. The first edition of the standard was published in 1963. The standard was updated in 1967 and again in 1986, and has been stable since then. Within a few years of its development, ASCII was accepted by all major computer manufacturers, although some continue to support their own coding schemes as well.

ASCII was designed for American English, and does not support some of the characters that are used by non-English languages. For this reason, ASCII has been extended to create more comprehensive coding schemes. Most modern multilingual coding schemes are based on ASCII, though they support a wider range of characters.

At the time that it was developed, transmission of digital data over long distances was very slow, and usually involved converting each bit into an audio signal which was transmitted over a telephone line using an acoustic modem. In order to maximize performance, the standards committee chose to define ASCII as a 7-bit code. Because of this decision, all textual data could be sent using seven bits rather than eight, resulting in approximately 10% better overall performance when transmitting data over a telephone modem. A possibly unforeseen benefit was that this also provided a way for the code to be extended in the future. Since there are 128 possible values for a 7-bit number, the ASCII standard provides 128 characters. However, 33 of the ASCII characters are non-printing control characters. These characters, shown in Table 1.3, are mainly used to send information about how the text is to be displayed and/or printed. The remaining 95 printable characters are shown in Table 1.4.

Table 1.3

The ASCII control characters

Binary Oct Dec Hex Abbr Glyph Name
000 0000 000 0 00 NUL ˆ@ Null character
000 0001 001 1 01 SOH ˆA Start of header
000 0010 002 2 02 STX ˆB Start of text
000 0011 003 3 03 ETX ˆC End of text
000 0100 004 4 04 EOT ˆD End of transmission
000 0101 005 5 05 ENQ ˆE Enquiry
000 0110 006 6 06 ACK ˆF Acknowledgment
000 0111 007 7 07 BEL ˆG Bell
000 1000 010 8 08 BS ˆH Backspace
000 1001 011 9 09 HT ˆI Horizontal tab
000 1010 012 10 0A LF ˆJ Line feed
000 1011 013 11 0B VT ˆK Vertical tab
000 1100 014 12 0C FF ˆL Form feed
000 1101 015 13 0D CR ˆM Carriage return
000 1110 016 14 0E SO ˆN Shift out
000 1111 017 15 0F SI ˆO Shift in
001 0000 020 16 10 DLE ˆP Data link escape
001 0001 021 17 11 DC1 ˆQ Device control 1 (oft. XON)
001 0010 022 18 12 DC2 ˆR Device control 2
001 0011 023 19 13 DC3 ˆS Device control 3 (oft. XOFF)
001 0100 024 20 14 DC4 ˆT Device control 4
001 0101 025 21 15 NAK ˆU Negative acknowledgment
001 0110 026 22 16 SYN ˆV Synchronous idle
001 0111 027 23 17 ETB ˆW End of transmission block
001 1000 030 24 18 CAN ˆX Cancel
001 1001 031 25 19 EM ˆY End of medium
001 1010 032 26 1A SUB ˆZ Substitute
001 1011 033 27 1B ESC ˆ[ Escape
001 1100 034 28 1C FS ˆ\ File separator
001 1101 035 29 1D GS ˆ] Group separator
001 1110 036 30 1E RS ˆˆ Record separator
001 1111 037 31 1F US ˆ_ Unit separator
111 1111 177 127 7F DEL ˆ? Delete


Table 1.4

The ASCII printable characters

Binary Oct Dec Hex Glyph
010 0000 040 32 20 (space)
010 0001 041 33 21 !
010 0010 042 34 22 "
010 0011 043 35 23 #
010 0100 044 36 24 $
010 0101 045 37 25 %
010 0110 046 38 26 &
010 0111 047 39 27 '
010 1000 050 40 28 (
010 1001 051 41 29 )
010 1010 052 42 2A *
010 1011 053 43 2B +
010 1100 054 44 2C ,
010 1101 055 45 2D -
010 1110 056 46 2E .
010 1111 057 47 2F /
011 0000 060 48 30 0
011 0001 061 49 31 1
011 0010 062 50 32 2
011 0011 063 51 33 3
011 0100 064 52 34 4
011 0101 065 53 35 5
011 0110 066 54 36 6
011 0111 067 55 37 7
011 1000 070 56 38 8
011 1001 071 57 39 9
011 1010 072 58 3A :
011 1011 073 59 3B ;
011 1100 074 60 3C <
011 1101 075 61 3D =
011 1110 076 62 3E >
011 1111 077 63 3F ?
100 0000 100 64 40 @
100 0001 101 65 41 A
100 0010 102 66 42 B
100 0011 103 67 43 C
100 0100 104 68 44 D
100 0101 105 69 45 E
100 0110 106 70 46 F
100 0111 107 71 47 G
100 1000 110 72 48 H
100 1001 111 73 49 I
100 1010 112 74 4A J
100 1011 113 75 4B K
100 1100 114 76 4C L
100 1101 115 77 4D M
100 1110 116 78 4E N
100 1111 117 79 4F O
101 0000 120 80 50 P
101 0001 121 81 51 Q
101 0010 122 82 52 R
101 0011 123 83 53 S
101 0100 124 84 54 T
101 0101 125 85 55 U
101 0110 126 86 56 V
101 0111 127 87 57 W
101 1000 130 88 58 X
101 1001 131 89 59 Y
101 1010 132 90 5A Z
101 1011 133 91 5B [
101 1100 134 92 5C \
101 1101 135 93 5D ]
101 1110 136 94 5E ˆ
101 1111 137 95 5F _
110 0000 140 96 60 `
110 0001 141 97 61 a
110 0010 142 98 62 b
110 0011 143 99 63 c
110 0100 144 100 64 d
110 0101 145 101 65 e
110 0110 146 102 66 f
110 0111 147 103 67 g
110 1000 150 104 68 h
110 1001 151 105 69 i
110 1010 152 106 6A j
110 1011 153 107 6B k
110 1100 154 108 6C l
110 1101 155 109 6D m
110 1110 156 110 6E n
110 1111 157 111 6F o
111 0000 160 112 70 p
111 0001 161 113 71 q
111 0010 162 114 72 r
111 0011 163 115 73 s
111 0100 164 116 74 t
111 0101 165 117 75 u
111 0110 166 118 76 v
111 0111 167 119 77 w
111 1000 170 120 78 x
111 1001 171 121 79 y
111 1010 172 122 7A z
111 1011 173 123 7B {
111 1100 174 124 7C |
111 1101 175 125 7D }
111 1110 176 126 7E ˜


Non-printing characters

The non-printing characters are used to provide hints or commands to the device that is receiving, displaying, or printing the data. The FF character, when sent to a printer, will cause the printer to eject the current page and begin a new one. The LF character causes the printer or terminal to end the current line and begin a new one. The CR character causes the terminal or printer to move to the beginning of the current line. Many text editing programs allow the user to enter these non-printing characters by using the control key on the keyboard. For instance, to enter the BEL character, the user would hold the control key down and press the G key. This character, when sent to a character display terminal, will cause it to emit a beep. Many of the other control characters can be used to control specific features of the printer, display, or other device that the data is being sent to.

Converting character strings to ASCII codes

Suppose we wish to convert a string of characters, such as “Hello World” to an ASCII representation. We can use an 8-bit byte to store each character. Also, it is common practice to include an additional byte at the end of the string. This additional byte holds the ASCII NUL character, which indicates the end of the string. Such an arrangement is referred to as a null-terminated string.

To convert the string “Hello World” into a null-terminated string, we can build a table with each character on the left and its equivalent binary, octal, hexadecimal, or decimal value (as defined in the ASCII table) on the right. Table 1.5 shows the characters in “Hello World” and their equivalent binary representations, found by looking in Table 1.4. Since most modern computers use 8-bit bytes (or multiples thereof) as the basic storage unit, an extra zero bit is shown in the most significant bit position.

Table 1.5

Binary equivalents for each character in “Hello World”

Character Binary
H 01001000
e 01100101
l 01101100
l 01101100
o 01101111
(space) 00100000
W 01010111
o 01101111
r 01110010
l 01101100
d 01100100
NUL 00000000

Reading the Binary column from top to bottom results in the following sequence of bytes: 01001000 01100101 01101100 01101100 01101111 00100000 01010111 01101111 01110010 01101100 01100100 00000000. To convert the same string to a hexadecimal representation, we can use the shortcut method that was introduced previously to convert each 4-bit nibble into its hexadecimal equivalent, or read the hexadecimal value from the ASCII table. Table 1.6 shows the result of extending Table 1.5 to include hexadecimal and decimal equivalents for each character. The string can now be converted to hexadecimal or decimal simply by reading the correct column in the table. So “Hello World” expressed as a null-terminated string in hexadecimal is “48 65 6C 6C 6F 20 57 6F 72 6C 64 00” and in decimal it is “72 101 108 108 111 32 87 111 114 108 100 0”.

Table 1.6

Binary, hexadecimal, and decimal equivalents for each character in “Hello World”

Character Binary Hexadecimal Decimal
H 01001000 48 72
e 01100101 65 101
l 01101100 6C 108
l 01101100 6C 108
o 01101111 6F 111
(space) 00100000 20 32
W 01010111 57 87
o 01101111 6F 111
r 01110010 72 114
l 01101100 6C 108
d 01100100 64 100
NUL 00000000 00 0


Interpreting data as ASCII strings

It is sometimes necessary to convert a string of bytes in hexadecimal into ASCII characters. This is accomplished simply by building a table with the hexadecimal value of each byte in the left column, then looking in the ASCII table for each value and entering the equivalent character representation in the right column. Table 1.7 shows how the ASCII table is used to interpret the hexadecimal string “466162756C6F75732100” as an ASCII string.

Table 1.7

Interpreting a hexadecimal string as ASCII

Hexadecimal ASCII
46 F
61 a
62 b
75 u
6C l
6F o
75 u
73 s
21 !
00 NUL

ISO extensions to ASCII

ASCII was developed to encode all of the most commonly used characters in North American English text. The encoding uses only 128 of the 256 codes that are available in an 8-bit byte. ASCII does not include symbols frequently used in other countries, such as the British pound symbol (£) or accented characters (ü). However, the International Organization for Standardization (ISO) has created several extensions to ASCII to enable the representation of characters from a wider variety of languages.

The ISO has defined a set of related standards known collectively as ISO 8859. ISO 8859 is an 8-bit extension to ASCII which includes the 128 ASCII characters along with an additional 128 characters, such as the British Pound symbol and the American cent symbol. Several variations of the ISO 8859 standard exist for different language families. Table 1.8 provides a brief description of the various ISO standards.

Table 1.8

Variations of the ISO 8859 standard

Name Alias Languages
ISO8859-1 Latin-1 Western European languages
ISO8859-2 Latin-2 Non-Cyrillic Central and Eastern European languages
ISO8859-3 Latin-3 Southern European languages and Esperanto
ISO8859-4 Latin-4 Northern European and Baltic languages
ISO8859-5 Latin/Cyrillic Slavic languages that use a Cyrillic alphabet
ISO8859-6 Latin/Arabic Common Arabic language characters
ISO8859-7 Latin/Greek Modern Greek language
ISO8859-8 Latin/Hebrew Modern Hebrew language
ISO8859-9 Latin-5 Turkish
ISO8859-10 Latin-6 Nordic languages
ISO8859-11 Latin/Thai Thai language
ISO8859-12 Latin/Devanagari Never completed. Abandoned in 1997
ISO8859-13 Latin-7 Some Baltic languages not covered by Latin-4 or Latin-6
ISO8859-14 Latin-8 Celtic languages
ISO8859-15 Latin-9 Update to Latin-1 that replaces some characters; most notably, it includes the euro symbol (€), which did not exist when Latin-1 was created
ISO8859-16 Latin-10 Covers several languages not covered by Latin-9 and includes the euro symbol (€)

Unicode and UTF-8

Although the ISO extensions helped to standardize text encodings for several languages that were not covered by ASCII, there were still some issues. The first issue is that the display and input devices must be configured for the correct encoding, and displaying or printing documents with multiple encodings requires some mechanism for changing the encoding on-the-fly. Another issue has to do with the lexicographical ordering of characters. Although two languages may share a character, that character may appear in a different place in the alphabets of the two languages. This leads to issues when programmers need to sort strings into lexicographical order. The ISO extensions help to unify character encodings across multiple languages, but do not solve all of the issues involved in defining a universal character set.

In the late 1980s, there was growing interest in developing a universal character encoding for all languages. People from several computer companies worked together and, by 1990, had developed a draft standard for Unicode. In 1991, the Unicode Consortium was formed and charged with guiding and controlling the development of Unicode. The Unicode Consortium has worked closely with the ISO to define, extend, and maintain the international standard for a Universal Character Set (UCS). This standard is known as the ISO/IEC 10646 standard. The ISO/IEC 10646 standard defines the mapping of code points (numbers) to glyphs (characters), but does not specify character collation or other language-dependent properties. UCS code points are commonly written in the form U+XXXX, where XXXX is the numerical code point in hexadecimal. For example, the code point for the ASCII DEL character would be written as U+007F. Unicode extends the ISO/IEC standard and specifies language-specific features.

Originally, Unicode was designed as a 16-bit encoding. It was not fully backward-compatible with ASCII, and could encode only 65,536 code points. Eventually, the Unicode character set grew to encompass 1,112,064 code points, which requires 21 bits per character for a straightforward binary encoding. By early 1992, it was clear that some clever and efficient method for encoding character data was needed.

UTF-8 (UCS Transformation Format-8-bit) was proposed and accepted as a standard in 1993. UTF-8 is a variable-width encoding that can represent every character in the Unicode character set using between one and four bytes. It was designed to be backward compatible with ASCII and to avoid the major issues of previous encodings. Code points in the Unicode character set with lower numerical values tend to occur more frequently than code points with higher numerical values. UTF-8 encodes frequently occurring code points with fewer bytes than those which occur less frequently. For example, the first 128 characters of the UTF-8 encoding are exactly the same as the ASCII characters, requiring only 7 bits to encode each ASCII character. Thus any valid ASCII text is also valid UTF-8 text. UTF-8 is now the most common character encoding for the World Wide Web, and is the recommended encoding for email messages.

In November 2003, UTF-8 was restricted by RFC 3629 to end at code point 10FFFF₁₆. This allows UTF-8 to encode 1,114,112 code points, which is slightly more than the 1,112,064 code points defined in the ISO/IEC 10646 standard. Table 1.9 shows how ISO/IEC 10646 code points are mapped to a variable-length encoding in UTF-8. Note that the encoding allows each byte in a stream of bytes to be placed in one of the following three distinct categories:

Table 1.9

UTF-8 encoding of the ISO/IEC 10646 code points

UCS Bits First Code Point Last Code Point Bytes Byte 1 Byte 2 Byte 3 Byte 4
7 U+0000 U+007F 1 0xxxxxxx
11 U+0080 U+07FF 2 110xxxxx 10xxxxxx
16 U+0800 U+FFFF 3 1110xxxx 10xxxxxx 10xxxxxx
21 U+10000 U+10FFFF 4 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx


1. If the most significant bit of a byte is zero, then it is a single-byte character, and is completely ASCII-compatible.

2. If the two most significant bits in a byte are set to one, then the byte is the beginning of a multi-byte character.

3. If the most significant bit is set to one, and the second most significant bit is set to zero, then the byte is part of a multi-byte character, but is not the first byte in that sequence.

The UTF-8 encoding of the UCS characters has several important features:

Backwards compatible with ASCII: This allows the vast number of existing ASCII documents to be interpreted as UTF-8 documents without any conversion.

Self-synchronization: Because of the way code points are assigned, it is possible to find the beginning of each character by looking only at the top two bits of each byte. This can have important performance implications when performing searches in text.

Encoding of code sequence length: The number of bytes in the sequence is indicated by the pattern of bits in the first byte of the sequence. Thus, the beginning of the next character can be found quickly. This feature can also have important performance implications when performing searches in text.

Efficient code structure: UTF-8 efficiently encodes the UCS code points. The high-order bits of the code point go in the lead byte. Lower-order bits are placed in continuation bytes. The number of bytes in the encoding is the minimum required to hold all the significant bits of the code point.

Easily extended to include new languages: This feature will be greatly appreciated when we contact intelligent species from other star systems.

With UTF-8 encoding, the first 128 characters of the UCS are each encoded in a single byte. The next 1,920 characters require two bytes to encode. The two-byte encoding covers almost all Latin alphabets, and also Arabic, Armenian, Cyrillic, Coptic, Greek, Hebrew, Syriac and Tāna alphabets. It also includes combining diacritical marks, which are used in combination with another character, such as á, ñ, and ö. Most of the Chinese, Japanese, and Korean (CJK) characters are encoded using three bytes. Four bytes are needed for the less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).

Consider the UTF-8 encoding for the British Pound symbol (£), which is UCS code point U+00A3. Since the code point is greater than 7F₁₆, but less than 800₁₆, it will require two bytes to encode. The encoding will be 110xxxxx 10xxxxxx, where the x characters are replaced with the 11 least-significant bits of the code point, which are 00010100011. Thus, the character £ is encoded in UTF-8 as 11000010 10100011 in binary, or C2 A3 in hexadecimal.

The UCS code point for the Euro symbol (€) is U+20AC. Since the code point is between 800₁₆ and FFFF₁₆, it will require three bytes to encode in UTF-8. The three-byte encoding is 1110xxxx 10xxxxxx 10xxxxxx, where the x characters are replaced with the 16 least-significant bits of the code point. In this case, the code point in binary is 0010000010101100. Therefore, the UTF-8 encoding for € is 11100010 10000010 10101100 in binary, or E2 82 AC in hexadecimal.

In summary, there are three components to modern language support. The ISO/IEC 10646 defines a mapping from code points (numbers) to glyphs (characters). UTF-8 defines an efficient variable-length encoding for code points (text data) in the ISO/IEC 10646 standard. Unicode adds language specific properties to the ISO/IEC 10646 character set. Together, these three elements currently provide support for textual data in almost every human written language, and they continue to be extended and refined.

1.4 Memory Layout of an Executing Program

Computer memory consists of a number of storage locations, or cells, each of which has a unique numeric address. Addresses are usually written in hexadecimal. Each storage location can contain a fixed number of binary digits. The most common size is one byte. Most computers group bytes together into words. A computer CPU that is capable of accessing a single byte of memory is said to have byte addressable memory. Some CPUs are only capable of accessing memory in word-sized groups. They are said to have word addressable memory.

Fig. 1.6A shows a section of memory containing some data. Each byte has a unique address that is used when data is transferred to or from that memory cell. Most processors can also move data in word-sized chunks. On a 32-bit system, four bytes are grouped together to form a word. There are two ways that this grouping can be done. Systems that store the most significant byte of a word in the smallest address, and the least significant byte in the largest address, are said to be big-endian. The big-endian interpretation of a region of memory is shown in Fig. 1.6B. As shown in Fig. 1.6C, little-endian systems store the least significant byte in the lowest address and the most significant byte in the highest address. Some processors, such as the ARM, can be configured as either little-endian or big-endian. The Linux operating system, by default, configures the ARM processor to run in little-endian mode.

Figure 1.6 A section of memory.

The memory layout for a typical program is shown in Fig. 1.7. The program is divided into four major memory regions, or sections. The programmer specifies the contents of the Text and Data sections. The Stack and Heap segments are defined when the program is loaded for execution. The Stack and Heap may grow and shrink as the program executes, while the Text and Data segments are set to fixed sizes by the compiler, linker, and loader. The Text section contains the executable instructions, while the Data section contains constants and statically allocated variables. The sizes of the Text and Data segments depend on how large the program is, and how much static data storage has been declared by the programmer. The heap contains variables that are allocated dynamically, and the stack is used to store parameters for function calls, return addresses, and local (automatic) variables.

Figure 1.7 Typical memory layout for a program with a 32-bit address space.

In a high-level language, storage space for a variable can be allocated in one of three ways: statically, dynamically, and automatically. Statically allocated variables are allocated from the .data section. The storage space is reserved, and usually initialized, when the program is loaded and begins execution. The address of a statically allocated variable is fixed at the time the program begins running, and cannot be changed. Automatically allocated variables, often referred to as local variables, are stored on the stack. The stack pointer is adjusted down to make space for the newly allocated variable. The address of an automatic variable is always computed as an offset from the stack pointer. Dynamic variables are allocated from the heap, using malloc, new, or a language-dependent equivalent. The address of a dynamic variable is always stored in another variable, known as a pointer, which may be an automatic or static variable, or even another dynamic variable. The four major sections of program memory correspond to executable code, statically allocated variables, dynamically allocated variables, and automatically allocated variables.

1.5 Chapter Summary

There are several reasons for Computer Scientists and Computer Engineers to learn at least one assembly language. There are programming tasks that can only be performed using assembly language, and some tasks can be written to run much more efficiently and/or quickly if written in assembly language. Programmers with assembly language experience tend to write better code even when using a high-level language, and are usually better at finding and fixing bugs.

Although it is possible to construct a computer capable of performing arithmetic in any base, it is much cheaper to build one that works in base two. It is relatively easy to build an electrical circuit with two states, using two discrete voltage levels, but much more difficult to build a stable circuit with 10 discrete voltage levels. Therefore, modern computers work in base two.

Computer data can be viewed as simple bit strings. The programmer is responsible for supplying interpretations to give meaning to those bit strings. A set of bits can be interpreted as a number, a character, or anything that the programmer chooses. There are standard methods for encoding and interpreting characters and numbers. Fig. 1.4 shows some common methods for encoding integers. The most common encodings for characters are UTF-8 and ASCII.

Computer memory can be viewed as a sequence of bytes. Each byte has a unique address. A running program has four regions of memory. One region holds the executable code. The other three regions hold different types of variables.

Exercises

1.1 What is the two’s complement of 11011101?

1.2 Perform the base conversions to fill in the blank spaces in the following table:

Base 10 Base 2 Base 16 Base 21
23
010011
ABB
2HE


1.3 What is the 8-bit ASCII binary representation for the following characters?

(a) “A”

(b) “a”

(c) “!”

1.4 What is \ minus ! given that \ and ! are ASCII characters? Give your answer in binary.

1.5 Representing characters:

(a) Convert the string “Super!” to its ASCII representation. Show your result as a sequence of hexadecimal values.

(b) Convert the hexadecimal sequence into a sequence of values in base four.

1.6 Suppose that the string “This is a nice day” is stored beginning at address 4B3269AC₁₆. What are the contents of the byte at address 4B3269B1₁₆ in hexadecimal?

1.7 Perform the following:

(a) Convert 101101₂ to base ten.

(b) Convert 1023₁₀ to base nine.

(c) Convert 1023₁₀ to base two.

(d) Convert 301₁₀ to base 16.

(e) Convert 301₁₀ to base 2.

(f) Represent 301₁₀ as a null-terminated ASCII string (write your answer in hexadecimal).

(g) Convert 3420₅ to base ten.

(h) Convert 2314₅ to base nine.

(i) Convert 116₇ to base three.

(j) Convert 1294₁₁ to base 5.

1.8 Given the following binary string:
01001001 01110011 01101110 00100111 01110100 00100000 01000001 01110011 01110011 01100101 01101101 01100010 01101100 01111001 00100000 01000110 01110101 01101110 00111111 00000000

(a) Convert it to a hexadecimal string.

(b) Convert the first four bytes to a string of base ten numbers.

(c) Convert the first (little-endian) halfword to base ten.

(d) Convert the first (big-endian) halfword to base ten.

(e) If this string of bytes were sent to an ASCII printer or terminal, what would be printed?

1.9 The number 1,234,567 is stored as a 32-bit word starting at address F0439000₁₆. Show the address and contents of each byte of the 32-bit word on a

(a) little-endian system,

(b) big-endian system.

1.10 The ISO/IEC 10646 standard defines 1,112,064 code points (glyphs). Each code point could be encoded using 24 bits, or three bytes. The UTF-8 encoding uses up to four bytes to encode a code point. Give three reasons why UTF-8 is preferred over a simple 3-byte per code point encoding.

1.11 UTF-8 is often referred to as Unicode. Why is this not correct?

1.12 Skilled assembly programmers can convert small numbers between binary, hexadecimal, and decimal in their heads. Without referring to any tables or using a calculator or pencil, fill in the blanks in the following table:

Binary Decimal Hexadecimal
5
1010
C
23
0101101
4B


1.13 What are the differences between a CPU register and a memory location?

Chapter 2

GNU Assembly Syntax

Abstract

This chapter begins with a high-level description of assembly language and the assembler. It then explains the five elements of assembly language syntax, and gives some examples. It then goes into more depth about how the assembler converts assembly language files into object files, which are then linked with other object files to create an executable file. Then it explains the most commonly used directives for the GNU assembler, and gives some examples to help relate the assembly code to equivalent C code.

Keywords

Compiler; Assembler; Linker; Labels; Comments; Directives; Instructions; Sections; Symbols

All modern computers consist of three main components: the central processing unit (CPU), memory, and devices. It can be argued that the major factor that distinguishes one computer from another is the CPU architecture. The architecture determines the set of instructions that can be performed by the CPU. The human-readable language which is closest to the CPU architecture is assembly language.

When a new processor architecture is developed, its creators also define an assembly language for the new architecture. In most cases, a precise assembly language syntax is defined and an assembler is created by the processor developers. Because of this, there is no single syntax for assembly language, although most assembly languages are similar in many ways and have certain elements in common.

The GNU assembler (GAS) is a highly portable, re-configurable assembler. GAS uses a simple, general syntax that works for a wide variety of architectures. Although the syntax used by GAS for the ARM processor differs slightly from the syntax defined by the developers of the ARM processor, it provides the same capabilities.

2.1 Structure of an Assembly Program

An assembly program consists of four basic elements: assembler directives, labels, assembly instructions, and comments. Assembler directives allow the programmer to reserve memory for the storage of variables, control which program section is being used, define macros, include other files, and perform other operations that control the conversion of assembly instructions into machine code. The assembly instructions are given as mnemonics, or short character strings that are easier for human brains to remember than sequences of binary, octal, or hexadecimal digits. Each assembly instruction may have an optional label, and most assembly instructions require the programmer to specify one or more operands.

Most assembly language programs are written in lines of 80 characters organized into four columns. The first column is for optional labels. The second column is for assembly instructions or assembler directives. The third column is for specifying operands, and the fourth column is for comments. Traditionally, the first two columns are 8 characters wide, the third column is 16 characters wide, and the last column is 48 characters wide. However, most modern assemblers (including GAS) do not require fixed column widths. Listing 2.1 shows a basic “Hello World” program written in GNU ARM Assembly to run under Linux. For comparison, Listing 2.2 shows an equivalent program written in C. The assembly language version of the program is significantly longer than the C version, and will only work on an ARM processor. The C version is at a higher level of abstraction, and can be compiled to run on any system that has a C compiler. Thus, C is referred to as a high-level language, and assembly is a low-level language.

Listing 2.1 "Hello World" program in ARM assembly.
Listing 2.2 "Hello World" program in C.

2.1.1 Labels

Most modern assemblers are called two-pass assemblers because they read the input file twice. On the first pass, the assembler keeps track of the location of each piece of data and each instruction, and assigns an address or numerical value to each label and symbol in the input file. The main goal of the first pass is to build a symbol table, which maps each label or symbol to a numerical value.

On the second pass, the assembler converts the assembly instructions and data declarations into binary, using the symbol table to supply numerical values whenever they are needed. In Listing 2.1, there are two labels: main and str. During assembly, those labels are assigned the value of the address counter at the point where they appear. Labels can be used anywhere in the program to refer to the address of data, functions, or blocks of code. In GNU assembly syntax, labels always end with a colon (:) character.

2.1.2 Comments

There are two basic comment styles: multi-line and single-line. Multi-line comments start with /* and everything is ignored until a matching sequence of */ is found. These comments are exactly the same as multi-line comments in C and C++. In ARM assembly, single line comments begin with an @ character, and all remaining characters on the line are ignored. Listing 2.1 shows both types of comment. In addition, if the name of the file ends in .S, then single line comments can begin with //. If the file name does not end with a capital .S, then the // syntax is not allowed.

2.1.3 Directives

Directives are used mainly to define symbols, allocate storage, and control the behavior of the assembler, allowing the programmer to control how the assembler does its job. The GNU assembler has many directives, but assembly programmers typically need to know only a few of them. All assembler directives begin with a period “.” which is followed by a sequence of letters, usually in lower case. Listing 2.1 uses the .data, .asciz, .text, and .globl directives. The most commonly used directives are discussed later in this chapter. There are many other directives available in the GNU Assembler which are not covered here. Complete documentation is available online as part of the GNU Binutils package.

2.1.4 Assembly Instructions

Assembly instructions are the program statements that will be executed on the CPU. Most instructions cause the CPU to perform one low-level operation. In most assembly languages, operations can be divided into a few major types. Some instructions move data from one location to another. Others perform addition, subtraction, and other computational operations. Another class of instructions is used to perform comparisons and control which part of the program is to be executed next. Chapters 3 and 4 explain most of the assembly instructions that are available on the ARM processor.

2.2 What the Assembler Does

Listing 2.3 shows how the GNU assembler will assemble the “Hello World” program from Listing 2.1. The assembler converts the string on input line 2 into the binary representation of the string. The results are shown in hexadecimal in the Code column of the listing. The first byte of the string is stored at address zero in the .data section of the program, as shown by the 0000 in the Addr column on line 2.

f02-04-9780128036983
Listing 2.3 "Hello World" assembly listing.

On line 4, the assembler switches to the .text section of the program and begins converting instructions into binary. The first instruction, on line 9, is converted into its 4-byte machine code, 00402DE9₁₆, and stored at location 0000 in the .text section of the program, as shown in the Code and Addr columns on line 6.

Next, the assembler converts the ldr instruction on line 10 into the four-byte machine instruction 0C009FE5₁₆ and stores it at address 0004. It repeats this process with each remaining instruction until the end of the program. The assembler writes the resulting data into a specially formatted file, called an object file. Note that the assembler was unable to locate the printf function. The linker will take care of that. The object file created by the assembler, hello.o, contains the data in the Code column of Listing 2.3, along with information to help the linker to link (or “patch”) the instruction on line 11 so that printf is called correctly.

After creating the object file, the next step in creating an executable program would be to invoke the linker and request that it link hello.o with the C standard library. The linker will generate the final executable file, containing the code assembled from hello.S, along with the printf function and other start-up code from the C standard library. The GNU C compiler is capable of automatically invoking the assembler for files that end in .s or .S, and can also be used to invoke the linker. For example, if Listing 2.1 is stored in a file named hello.S in the current directory, then the command

gcc -o hello hello.S

will run the GNU C compiler, telling it to create an executable program file named hello, and to use hello.S as the source file for the program. The C compiler will notice the .S extension, and invoke the assembler to create an object file which is stored in a temporary file, possibly named hello.o. Then the C compiler will invoke the linker to link hello.o with the C standard library, which provides the printf function and some start-up code which calls the main function. The linker will create an executable file named hello. When the linker has finished, the C compiler will remove the temporary object file.
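The individual steps that gcc performs can also be run by hand; the following is a sketch (on a non-ARM host, a cross-compiler prefix such as arm-linux-gnueabihf-, which varies by distribution, would be required before each command):

```
gcc -c -o hello.o hello.S    # assemble only; produce the object file
gcc -o hello hello.o         # link hello.o with the C standard library
```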

2.3 GNU Assembly Directives

Each processor architecture has its own assembly language, created by the designers of the architecture. Although there are many similarities between assembly languages, the designers may choose different names for various directives. The GNU assembler supports a relatively large set of directives, some of which have more than one name. This is because it is designed to handle assembling code for many different processors without drastically changing the assembly language designed by the processor manufacturers. We will now cover some of the most commonly used directives for the GNU assembler.

2.3.1 Selecting the Current Section

The instructions and data that make up a program are stored in different sections of the program file. There are several standard sections that the programmer can choose to put code and data in. Sections can also be further divided into numbered subsections. Each section has its own address counter, which is used to keep track of the location of bytes within that section. When a label is encountered, it is assigned the value of the current address counter for the currently active section.

Selecting a section and subsection is done by using the appropriate assembly directive. Once a section has been selected, all of the instructions and/or data will go into that section until another section is selected. The most important directives for selecting a section are:

.data subsection

Instructs the assembler to append the following instructions or data to the data subsection numbered subsection. If the subsection number is omitted, it defaults to zero. This section is normally used for global variables and constants which have labels.

.text subsection

Tells the assembler to append the following statements to the end of the text subsection numbered subsection. If the subsection number is omitted, subsection number zero is used. This section is normally used for executable instructions, but may also contain constant data.

.bss subsection

The bss (short for Block Started by Symbol) section is used for defining data storage areas that should be initialized to zero at the beginning of program execution. The .bss directive tells the assembler to append the following statements to the end of the bss subsection numbered subsection. If the subsection number is omitted, subsection number zero is used. This section is normally used for global variables which need to be initialized to zero. Regardless of what is placed into the section at compile-time, all bytes will be set to zero when the program begins executing. This section does not actually consume any space in the object or executable file. It is really just a request for the loader to reserve some space when the program is loaded into memory.

.section name

In addition to the three common sections, the programmer can create other sections using this directive. However, in order for custom sections to be linked into a program, the linker must be made aware of them. Controlling the linker is covered in Section 14.4.3.
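As a sketch (using storage directives that are described in the following section, with illustrative names and values), section selection might be interleaved like this:

```
        .data                   @ select the data section
count:  .word   10
        .text                   @ switch to the text section
main:   @ ... instructions ...
        .data                   @ back to data; the address counter
count2: .word   20              @ resumes where count left off
        .bss                    @ zero-initialized storage
buffer: .space  64
```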

2.3.2 Allocating Space for Variables and Constants

There are several directives that allow the programmer to allocate and initialize static storage space for variables and constants. The assembler supports bytes, integer types, floating point types, and strings. These directives are used to allocate a fixed amount of space in memory and optionally initialize the memory. Some of these directives allow the memory to be initialized using an expression. An expression can be a simple integer, or a C-style expression. The directives for allocating storage are as follows:

.byte expressions

.byte expects zero or more expressions, separated by commas. Each expression is assembled into the next byte. If no expressions are given, then the address counter is not advanced and no bytes are reserved.

.hword expressions
.short expressions

For the ARM processor, these two directives do exactly the same thing. They expect zero or more expressions, separated by commas, and emit a 16-bit number for each expression. If no expressions are given, then the address counter is not advanced and no bytes are reserved.

.word expressions
.long expressions

For the ARM processor, these two directives do exactly the same thing. They expect zero or more expressions, separated by commas. They will emit four bytes for each expression given. If no expressions are given, then the address counter is not advanced and no bytes are reserved.

.ascii "string"

The .ascii directive expects zero or more string literals, each enclosed in quotation marks and separated by commas. It assembles each string (with no trailing ASCII NULL character) into consecutive addresses.

.asciz "string"
.string "string"

The .asciz directive is similar to the .ascii directive, but each string is followed by an ASCII NULL character (zero). The “z” in .asciz stands for zero. .string is just another name for .asciz.

.float flonums
.single flonums

These directives assemble zero or more floating point numbers, separated by commas. On the ARM, they are stored as 4-byte IEEE standard single precision numbers. .float and .single are synonymous.

.double flonums

The .double directive expects zero or more floating point numbers, separated by commas. On the ARM, they are stored as 8-byte IEEE standard double precision numbers.
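Since Fig. 2.1 appears here only as an image, the following sketch shows declarations in the same spirit (the names and values are illustrative, not taken from the figure):

```
        .data
flag:   .byte   1                   @ one byte
count:  .hword  100                 @ 16-bit halfword
total:  .word   0x12345678          @ 32-bit word
msg:    .ascii  "Hi"                @ two bytes, no terminator
name:   .asciz  "Ada"               @ four bytes, NULL-terminated
pi:     .float  3.14159             @ 4-byte single precision
e:      .double 2.718281828         @ 8-byte double precision
primes: .word   2, 3, 5, 7, 11      @ a five-element word array
```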

Fig. 2.1A shows how these directives are used to declare variables and constants. Fig. 2.1B shows the equivalent statements for creating global variables in C or C++. Note that in both cases, the variables created will be visible anywhere within the file that they are declared, but not visible in other files which are linked into the program.

f02-01-9780128036983
Figure 2.1 Equivalent static variable declarations in assembly and C.

In C, the declaration of an array can be performed by leaving out the number of elements and specifying an initializer, as shown in the last three lines of Fig. 2.1B. In assembly, the equivalent is accomplished by providing a label, a type, and a list of values, as shown in the last three lines of Fig. 2.1A. The syntax is different, but the result is precisely the same.

Listing 2.4 shows how the assembler assigns addresses to these labels. The second column of the listing shows the address (in hexadecimal) that is assigned to each label. The variable i is assigned the first address. Since it is a word variable, the address counter is incremented by four bytes and the next address is assigned to the variable j. The address counter is incremented again, and fmt is assigned the address 0008. The fmt variable consumes seven bytes, so the ch variable gets address 000f. Finally, the array of words named ary begins at address 0012. Note that 12₁₆ = 18₁₀ is not evenly divisible by four, which means that the word variables in ary are not aligned on word boundaries.

f02-05-9780128036983
Listing 2.4 A listing with mis-aligned data.

2.3.3 Filling and Aligning

On the ARM CPU, data can be moved to and from memory one byte at a time, two bytes at a time (half-word), or four bytes at a time (word). Moving a word between the CPU and memory takes significantly more time if the address of the word is not aligned on a four-byte boundary (one where the least significant two bits are zero). Similarly, moving a half-word between the CPU and memory takes significantly more time if the address of the half-word is not aligned on a two-byte boundary (one where the least significant bit is zero). Therefore, when declaring storage, it is important that words and half-words are stored on appropriate boundaries. The following directives allow the programmer to insert as much space as necessary to align the next item on any boundary desired.

.align abs-expr, abs-expr, abs-expr

Pad the location counter (in the current subsection) to a particular storage boundary. For the ARM processor, the first expression specifies the number of low-order zero bits the location counter must have after advancement. The second expression gives the fill value to be stored in the padding bytes. It (and the comma) may be omitted. If it is omitted, then the fill value is assumed to be zero. The third expression is also optional. If it is present, it is the maximum number of bytes that should be skipped by this alignment directive.

.balign[lw] abs-expr, abs-expr, abs-expr

These directives adjust the location counter to a particular storage boundary. The first expression is the byte-multiple for the alignment request. For example, .balign 16 will insert fill bytes until the location counter is an even multiple of 16. If the location counter is already a multiple of 16, then no fill bytes will be created. The second expression gives the fill value to be stored in the fill bytes. It (and the comma) may be omitted. If it is omitted, then the fill value is assumed to be zero. The third expression is also optional. If it is present, it is the maximum number of bytes that should be skipped by this alignment directive.
The .balignw and .balignl directives are variants of the .balign directive. The .balignw directive treats the fill pattern as a 2-byte word value, and .balignl treats the fill pattern as a 4-byte long word value. For example, “.balignw 4,0x368d” will align to a multiple of four bytes. If it skips two bytes, they will be filled in with the value 0x368d (the exact placement of the bytes depends upon the endianness of the processor).

.skip size, fill
.space size, fill

Sometimes it is desirable to allocate a large area of memory and initialize it all to the same value. This can be accomplished by using these directives. These directives emit size bytes, each of value fill. Both size and fill are absolute expressions. If the comma and fill are omitted, fill is assumed to be zero. For the ARM processor, the .space and .skip directives are equivalent. This directive is very useful for declaring large arrays in the .bss section.
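A sketch combining these directives (the names and values are illustrative):

```
        .data
ch:     .byte   'A'             @ address counter is now odd
        .balign 4               @ insert zero bytes up to a multiple of 4
ary:    .word   1, 2, 3         @ word-aligned as a result
        .bss
        .balign 4
buf:    .skip   1024            @ reserve 1024 zero-filled bytes
```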

Listing 2.5 shows how the code in Listing 2.4 can be improved by adding an alignment directive at line 6. The directive causes the assembler to emit two zero bytes between the end of the ch variable and the beginning of the ary variable. These extra “padding” bytes cause the following word data to be word aligned, thereby improving performance when the word data is accessed. It is a good practice to always put an alignment directive after declaring character or half-word data.

f02-06-9780128036983
Listing 2.5 A listing with properly aligned data.

2.3.4 Setting and Manipulating Symbols

The assembler provides support for setting and manipulating symbols that can then be used in other places within the program. The labels that can be assigned to assembly statements and directives are one type of symbol. The programmer can also declare other symbols and use them throughout the program. Such symbols may not have an actual storage location in memory, but they are included in the assembler’s symbol table, and can be used anywhere that their value is required. The most common use for defined symbols is to allow numerical constants to be declared in one place and easily changed. The .equ directive allows the programmer to use a label instead of a number throughout the program. This contributes to readability, and has the benefit that the constant value can then be easily changed every place that it is used, just by changing the definition of the symbol. The most important directives related to symbols are:

.equ symbol, expression
.set symbol, expression

These directives set the value of symbol to expression. They are similar to the #define directive of the C preprocessor.

.equiv symbol, expression

The .equiv directive is like .equ and .set, except that the assembler will signal an error if the symbol is already defined.

.global symbol
.globl symbol

This directive makes the symbol visible to the linker. If symbol is defined within a file, and this directive is used to make it global, then it will be available to any file that is linked with the one containing the symbol. Without this directive, symbols are visible only within the file where they are defined.

.comm symbol, length

This directive declares symbol to be a common symbol, meaning that if it is defined in more than one file, then all instances should be merged into a single symbol. If the symbol is not defined anywhere, then the linker will allocate length bytes of uninitialized memory. If there are multiple definitions for symbol, and they have different sizes, the linker will merge them into a single instance using the largest size defined.

Listing 2.6 shows how the .equ directive can be used to create a symbol holding the number of elements in an array. The symbol arysize is defined as the value of the current address counter (denoted by the .) minus the value of the ary symbol, divided by four (each word in the array is four bytes). The listing shows all of the symbols defined in this program segment. Note that the four variables are shown to be in the data segment, and the arysize symbol is marked as an “absolute” symbol, which simply means that it is a number and not an address. The programmer can now use the symbol arysize to control looping when accessing the array data. If the size of the array is changed by adding or removing constant values, the value of arysize will change automatically, and the programmer will not have to search through the code to change the original value, 5, to some other value in every place it is used.
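Since Listing 2.6 appears here only as an image, the following sketch shows the pattern it describes (the element values are illustrative):

```
        .data
ary:    .word   2, 3, 5, 7, 11          @ five words
        .equ    arysize, (. - ary)/4    @ evaluates to 5
```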

f02-07-9780128036983
Listing 2.6 Defining a symbol for the number of elements in an array.

2.3.5 Conditional Assembly

Sometimes it is desirable to skip assembly of portions of a file. The assembler provides some directives to allow conditional assembly. One use for these directives is to optionally assemble code to aid in debugging.

.if expression

.if marks the beginning of a section of code which is only considered part of the source program being assembled if the argument (which must be an absolute expression) is non-zero. The end of the conditional section of code must be marked by the .endif directive. Optionally, code may be included for the alternative condition by using the .else directive.

.ifdef symbol

Assembles the following section of code if the specified symbol has been defined.

.ifndef symbol

Assembles the following section of code if the specified symbol has not been defined.

.else

Assembles the following section of code only if the condition for the preceding .if or .ifdef was false.

.endif

Marks the end of a block of code that is only assembled conditionally.
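A sketch of how these directives combine (the symbol names are illustrative):

```
        .equ    DEBUG, 1        @ change to 0 to remove the debug string

        .data
        .if DEBUG
dbgmsg: .asciz  "entering main\n"
        .else
dbgmsg: .asciz  ""
        .endif

        .ifdef VERBOSE          @ assembled only if VERBOSE was defined,
vmsg:   .asciz  "verbose\n"     @ e.g. with the --defsym command line option
        .endif
```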

2.3.6 Including Other Source Files

.include "file"

This directive provides a way to include supporting files at specified points in the source program. The code from the included file is assembled as if it followed the point of the .include directive. When the end of the included file is reached, assembly of the original file continues. The search paths used can be controlled with the ‘-I’ command line parameter. Quotation marks are required around file. This assembler directive is similar to including header files in C and C++ using the #include compiler directive.
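For instance, a hypothetical file of shared constant definitions could be pulled in like this:

```
        .include "constants.inc"   @ hypothetical file of .equ definitions
```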

2.3.7 Macros

The directives .macro and .endm allow the programmer to define macros that the assembler expands to generate assembly code. The GNU assembler supports simple macros. Some other assemblers have much more powerful macro capabilities.

.macro macname
.macro macname macargs …

Begin the definition of a macro called macname. If the macro definition requires arguments, their names are specified after the macro name, separated by commas or spaces. The programmer can supply a default value for any macro argument by following the name with ‘=deflt’.

The following begins the definition of a macro called reserve_str, with two arguments. The first argument has a default value, but the second does not:

f02-08-9780128036983
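The figure is not reproduced here; based on the surrounding text, the definition presumably opens as follows (the body is omitted in the original, so only a placeholder comment is shown):

```
        .macro reserve_str p1=0 p2
        @ ... body refers to the arguments as \p1 and \p2 ...
        .endm
```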

When a macro is called, the argument values can be specified either by position, or by keyword. For example, reserve_str 9,17 is equivalent to reserve_str p2=17,p1=9. After the definition is complete, the macro can be called either as

reserve_str x,y

(with \p1 evaluating to x and \p2 evaluating to y), or as

reserve_str ,y

(with \p1 evaluating as the default, in this case 0, and \p2 evaluating to y). Other examples of valid .macro statements are:

f02-09-9780128036983
f02-10-9780128036983

.endm

End the current macro definition.

.exitm

Exit early from the current macro definition. This is usually used only within a .if or .ifdef directive.

\@

This is a pseudo-variable used by the assembler to maintain a count of how many macros it has executed. That number can be accessed with ‘\@’, but only within a macro definition.

Macro example

The following definition specifies a macro SHIFT that will emit the instruction to shift a given register left by a specified number of bits. If the number of bits specified is negative, then it will emit the instruction to perform a right shift instead of a left shift.

f02-11-9780128036983

After that definition, the following code:

f02-12-9780128036983

will generate these instructions:

f02-13-9780128036983

The meaning of these instructions will be covered in Chapters 3 and 4.
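The listings above are images; a definition matching the description (a sketch, not necessarily the book's exact code) could look like this:

```
        .macro SHIFT reg, bits
        .if \bits < 0
        mov     \reg, \reg, lsr #-(\bits)   @ negative count: shift right
        .else
        mov     \reg, \reg, lsl #\bits      @ non-negative: shift left
        .endif
        .endm

        SHIFT   r0, 5       @ expands to: mov r0, r0, lsl #5
        SHIFT   r1, -3      @ expands to: mov r1, r1, lsr #3
```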

Recursive macro example

The following definition specifies a macro enum that puts a sequence of numbers into memory by using a recursive macro call to itself:

f02-14-9780128036983

With that definition, ‘enum 0,5’ is equivalent to this assembly input:

f02-15-9780128036983
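The figures are not reproduced here, but a well-known version of this macro from the GNU assembler manual (adapted to use .word) gives the flavor:

```
        .macro enum first=0, last=5
        .word   \first
        .if     \last-\first
        enum    "(\first+1)", \last     @ recursive call
        .endif
        .endm

        enum    0, 5    @ emits .word 0 through .word 5
```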

2.4 Chapter Summary

There are four elements to assembly syntax: labels, directives, instructions, and comments. Directives are used mainly to define symbols, allocate storage, and control the behavior of the assembler. The most common assembler directives were introduced in this chapter, but there are many other directives available in the GNU assembler. Complete documentation is available online as part of the GNU Binutils package.

Directives are used to declare statically allocated storage, which is equivalent to declaring global static variables in C. In assembly, labels and other symbols are visible only within the file that they are declared, unless they are explicitly made visible to other files with the .global directive. In C, variables that are declared outside of any function are visible to all files in the program, unless the static keyword is used to make them visible only within the file where they are declared. Thus, both C and assembly support file and global scope for static variables, but with the opposite defaults and different syntax.

Directives can also be used to declare macros. Macros are expanded by the assembler and may generate multiple statements. Careful use of macros can automate some simple tasks, allowing several lines of assembly code to be replaced with a single macro invocation.

Exercises

2.1 What is the difference between

(a) the .data section and .bss section?

(b) the .ascii and .asciz directives?

(c) the .word and the .long directives?

2.2 What is the purpose of the .align assembler directive? What does “.align 2” do in GNU ARM assembly?

2.3 Assembly language has four main elements. What are they?

2.4 Using the directives presented in this chapter, show three different ways to create a null-terminated string containing the phrase “segmentation fault”.

2.5 What is the total memory, in bytes, allocated for the following variables?

f02-16-9780128036983

2.6 Identify the directive(s), label(s), comment(s), and instruction(s) in the following code:

f02-17-9780128036983

2.7 Write assembly code to declare variables equivalent to the following C code:

f02-18-9780128036983

2.8 Show how to store the following text as a single string in assembly language, while making it readable and keeping each line shorter than 80 characters:

The three goals of the mission are:

1) Keep each line of code under 80 characters,

2) Write readable comments,

3) Learn a valuable skill for readability.

2.9 Insert the minimum number of .align directives necessary in the following code so that all word variables are aligned on word boundaries and all halfword variables are aligned on halfword boundaries, while minimizing the amount of wasted space.

f02-19-9780128036983

2.10 Re-order the directives in the previous problem so that no .align directives are necessary to ensure proper alignment. How many bytes of storage were wasted by the original ordering of directives, compared to the new one?

2.11 What are the most important directives for selecting a section?

2.12 Why are .ascii and .asciz directives usually followed by an .align directive, but .word directives are not?

2.13 Using the “Hello World” program shown in Listing 2.1 as a template, write a program that will print your name.

2.14 Listing 2.3 shows that the assembler will assign the location 00000000₁₆ to the main symbol and also to the str symbol. Why does this not cause problems?

Chapter 3

Load/Store and Branch Instructions

Abstract

This chapter explains how a particular assembly language is related to the architectural design of a particular CPU family. It then gives an overview of the ARM architecture. Next, it describes the ARM register set and data paths, including the Process Status Register, and the flags which are used to control conditional execution. Then it introduces the concept of instructions and operands, and explains immediate data used as an operand. Next it describes the load and store instructions and all of the addressing modes available on the ARM processor. Then it explains the branch and conditional branch instructions. The chapter ends with some examples showing how the branch and link instruction can be used to call functions from the C standard library.

Keywords

Architecture; Instruction set architecture; Data path; Register; Memory; Load; Store; Branch; Address; Addressing mode; Conditional execution; Function or subroutine call

The part of the computer architecture related to programming is referred to as the instruction set architecture (ISA). The ISA includes the set of registers that the user program can access, and the set of instructions that the processor supports, as well as data paths and processing elements within the processor. The first step in learning a new assembly language is to become familiar with the ISA. For most modern computer systems, data must be loaded in a register before it can be used for any data processing instruction, but there are a limited number of registers. Memory provides a place to store data that is not currently needed. Program instructions are also stored in memory and fetched into the CPU as they are needed. This chapter introduces the ISA for the ARM processor.

3.1 CPU Components and Data Paths

The CPU is composed of data storage and computational components connected together by a set of buses. The most important components of the CPU are the registers, where data is stored, and the arithmetic and logic unit (ALU), where arithmetic and logical operations are performed on the data. Some CPUs also have dedicated hardware units for multiplication and/or division. Fig. 3.1 shows the major components of the ARM CPU and the buses that connect the components together. These buses provide pathways for the data to move between the computational and storage components. The organization of the components and buses in a CPU govern what types of operations can be performed.

f03-01-9780128036983
Figure 3.1 The ARM processor architecture.

The set of instructions and addressing modes available on the ARM processor is closely related to the architecture shown in Fig. 3.1. The architecture provides for certain operations to be performed efficiently, and this has a direct relationship to the types of instructions that are supported.

Note that on the ARM, two source registers can be selected for an instruction, using the A and B buses. The data on the B bus is routed through a shifter, and then to the ALU. This allows the second operand of most instructions to be shifted an arbitrary amount before it reaches the ALU. The data on the A bus goes directly to the ALU. Additionally, the A and B buses can provide operands for the multiplier, and the multiplier can provide data for the A and B buses.

Data coming in from memory or an input/output device is fed directly onto the ALU bus. From there, it can be stored in one of the general-purpose registers. Data being written to memory or an input/output device is taken directly from the B bus, which means that store operations can move data from a register, but cannot modify the data on the way to memory or input/output devices.

The address register is a temporary register that is used by the CPU whenever it needs to read or write to memory or I/O devices. It is used every time an instruction is fetched from memory, and is used for all load and store operations. The address register can be loaded from the program counter, for fetching the next instruction. Also the address register can be loaded from the ALU, which allows the processor to support addressing modes where a register is used as a base pointer and an offset is calculated on-the-fly. After its contents are used to access memory or I/O devices, the base address can be incremented and the incremented value can be stored back into a register. This allows the processor to increment the program counter after each instruction, and to implement certain addressing modes where a pointer is automatically incremented after each memory access.

3.2 ARM User Registers

As shown in Fig. 3.2, the ARM processor provides 13 general-purpose registers, named r0 through r12. These registers can each store 32 bits of data. In addition to the 13 general-purpose registers, the ARM has three other special-purpose registers.

f03-02-9780128036983
Figure 3.2 The ARM user program registers.

The program counter, r15, always contains the address of the next instruction that will be executed. The processor increments this register by four, automatically, after each instruction is fetched from memory. By moving an address into this register, the programmer can cause the processor to fetch the next instruction from the new address. This gives the programmer the ability to jump to any address and begin executing code there.

The link register, r14, is used to hold the return address for subroutines. Certain instructions cause the program counter to be copied to the link register, then the program counter is loaded with a new address. These branch-and-link instructions are briefly covered in Section 3.5 and in more detail in Section 5.4.

The program stack was introduced in Section 1.4. The stack pointer, r13, is used to hold the address where the stack ends. This is commonly referred to as the top of the stack, although on most systems the stack grows downwards and the stack pointer really refers to the bottom of the stack. The address where the stack ends may change when registers are pushed onto the stack, or when temporary local variables (automatic variables) are allocated or deleted. The use of the stack for storing automatic variables is described in Chapter 5. The use of r13 as the stack pointer is a programming convention. Some instructions (e.g., branches) implicitly modify the program counter and link registers, but there are no special instructions involving the stack pointer. As far as the hardware is concerned, r13 is exactly the same as registers r0–r12, but all ARM programmers use it for the stack pointer.

Although register r13 is normally used as the stack pointer, it can be used as a general-purpose register if the stack is not used. However, the high-level language compilers always use it as the stack pointer, so using it as a general-purpose register will result in code that cannot interoperate with code generated from high-level languages. The link register, r14, can also be used as a general-purpose register, but its contents are modified by hardware when a subroutine is called. Using r13 and r14 as general-purpose registers is dangerous and strongly discouraged.

There are also two other registers which may have special purposes. As with the stack pointer, these are programming conventions. There are no special instructions involving these registers. The frame pointer (r11) is used by high-level language compilers to track the current stack frame. This is sometimes useful when running your program under a debugger, and can sometimes help the compiler to generate more efficient code for returning from a subroutine. The GNU C compiler can be instructed to use r11 as a general-purpose register by using the -fomit-frame-pointer command line option. The inter-procedure scratch register r12 is used by the C library when calling functions in dynamically linked libraries. The contents may change, seemingly at random, when certain functions (such as printf) are called.

The final register in the ARM user programming model is the Current Program Status Register (CPSR). This register contains bits that indicate the status of the current program, including information about the results of previous operations. Fig. 3.3 shows the bits in the CPSR. The four most significant bits, N, Z, C, and V, are the condition flags. Most instructions can modify these flags, and later instructions can use the flags to modify their operation. Their meanings are as follows:

Figure 3.3 The ARM program status register.

Negative: This bit is set to one if the signed result of an operation is negative, and set to zero if the result is positive or zero.

Zero: This bit is set to one if the result of an operation is zero, and set to zero if the result is non-zero.

Carry: This bit is set to one if an add operation results in a carry out of the most significant bit, or if a subtract operation results in a borrow. For shift operations, this flag is set to the last bit shifted out by the shifter.

oVerflow: For addition and subtraction, this flag is set if a signed overflow occurred.

The remaining bits are used by the operating system or for bare-metal programs, and are described in Section 14.1.

3.3 Instruction Components

The ARM processor supports a relatively small set of instructions grouped into four basic instruction types, or categories. Most instructions have optional modifiers which can be used to change their behavior. For example, many instructions can have modifiers which set or check condition codes in the CPSR. The combination of basic instructions with optional modifiers results in an extremely rich assembly language. The following sections give a brief overview of the features which are common to the instructions in each category. The individual instructions are explained later in this chapter and in the following chapter.

3.3.1 Setting and Using Condition Flags

As mentioned previously, the CPSR contains four flag bits (bits 28–31), which can be used to control whether or not certain instructions are executed. Most of the data processing instructions have an optional modifier to control whether or not the flag bits are affected when the instruction is executed. For example, the basic instruction for addition is add. When the add instruction is executed, the result is stored in a register, but the flag bits in the CPSR are not affected.

However, the programmer can add the s modifier to the add instruction to create the adds instruction. When it is executed, this instruction will affect the CPSR flag bits. The flag bits can be used by subsequent instructions to control execution and branching. The meaning of the flags depends on the type of instruction that last set the flags. Table 3.1 shows the names and meanings of the four bits depending on the type of instruction that set or cleared them. Most instructions support the s modifier to control setting the flags.

Table 3.1

Flag bits in the CPSR register

Name         | Logical Instruction                               | Arithmetic Instruction
N (Negative) | No meaning                                        | Bit 31 of the result is set. Indicates a negative number in signed operations
Z (Zero)     | Result is all zeroes                              | Result of operation was zero
C (Carry)    | After shift operation, ‘1’ was left in carry flag | Result was greater than 32 bits
V (oVerflow) | No meaning                                        | The signed two’s complement result requires more than 32 bits. Indicates a possible corruption of the result


Most ARM instructions can have a condition modifier attached. If present, the modifier controls, at run-time, whether or not the instruction is actually executed. These condition modifiers are added to basic instructions to create conditional instructions. Table 3.2 shows the condition modifiers that can be attached to base instructions. For example, to create an instruction that adds only if the CPSR Z flag is set, the programmer would add the eq condition modifier to the basic add instruction to create the addeq instruction.

Table 3.2

ARM condition modifiers

<cond> | English Meaning
al     | always (this is the default <cond>)
eq     | Z set (=)
ne     | Z clear (≠)
ge     | N set and V set, or N clear and V clear (≥)
lt     | N set and V clear, or N clear and V set (<)
gt     | Z clear, and either N set and V set, or N clear and V clear (>)
le     | Z set, or N set and V clear, or N clear and V set (≤)
hi     | C set and Z clear (unsigned >)
ls     | C clear or Z set (unsigned ≤)
hs     | C set (unsigned ≥)
cs     | Alternate name for hs
lo     | C clear (unsigned <)
cc     | Alternate name for lo
mi     | N set (result < 0)
pl     | N clear (result ≥ 0)
vs     | V set (overflow)
vc     | V clear (no overflow)

Setting and using condition flags are orthogonal operations. This means that they can be used in combination. Using the previous example, the programmer could add the s modifier to create the addeqs instruction, which executes only if the Z bit is set, and updates the CPSR flags only if it executes.
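To illustrate, the four combinations of the s and eq modifiers on add look like this (the registers chosen are arbitrary):

```asm
    add    r0, r1, r2    @ r0 = r1 + r2; flags unchanged
    adds   r0, r1, r2    @ r0 = r1 + r2; N, Z, C, and V updated
    addeq  r0, r1, r2    @ executes only if Z is set; flags unchanged
    addeqs r0, r1, r2    @ executes only if Z is set; updates flags if executed
```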

3.3.2 Immediate Values

An immediate value in assembly language is a constant value that is specified by the programmer. Some assembly languages encode the immediate value as part of the instruction. Other assembly languages create a table of immediate values in a literal pool and insert appropriate instructions to access them. ARM assembly language provides both methods.

Immediate values can be specified in decimal, octal, hexadecimal, or binary. Octal values must begin with a zero, and hexadecimal values must begin with “0x”. Likewise, immediate values that start with “0b” are interpreted as binary numbers. Any value that does not begin with zero, 0x, or 0b will be interpreted as a decimal value.
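For example, the following instructions all load the same value, twelve, into r0 (the register choice is arbitrary):

```asm
    mov r0, #12        @ decimal
    mov r0, #014       @ octal (leading zero)
    mov r0, #0xC       @ hexadecimal (leading 0x)
    mov r0, #0b1100    @ binary (leading 0b)
```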

There are two ways that immediate values can be specified in GNU ARM assembly. The =<immediate|symbol> syntax can be used to specify any immediate 32-bit number, or to specify the 32-bit value of any symbol in the program. Symbols include program labels (such as main) and symbols that are defined using .equ and similar assembler directives. However, this syntax can only be used with load instructions, and not with data processing instructions. This restriction is necessary because of the way the ARM machine instructions are encoded. For data processing instructions, there are a limited number of bits that can be devoted to storing immediate data as part of the instruction.

The #<immediate|symbol> syntax is used to specify immediate data values for data processing instructions. The #<immediate|symbol> syntax has some restrictions. Basically, the assembler must be able to construct the specified value using only eight bits of data, a shift or rotate, and/or a complement. For immediate values that cannot be constructed by shifting or rotating and complementing an 8-bit value, the programmer must use an ldr instruction with the =<immediate|symbol> syntax to specify the value. That method is covered in Section 3.4. Some examples of immediate values are shown in Table 3.3.

Table 3.3

Legal and illegal values for #<immediate|symbol>

#32         | Ok because it can be stored as an 8-bit value
#1021       | Illegal because the number cannot be created from an 8-bit value using shift or rotate and complement
#1024       | Ok because it is 1 shifted left 10 bits
#0b1011     | Ok because it fits in 8 bits
#-1         | Ok because it is the one’s complement of 0
#0xFFFFFFFE | Ok because it is the one’s complement of 1
#0xEFFFFFFF | Ok because it is the one’s complement of 1 shifted left 28 bits
#strsize    | Ok if the value of strsize can be created from an 8-bit value using shift or rotate and complement

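The encoding rule can be sketched in Python (this checker is illustrative and not part of the book; the function names are our own). The assembler encodes a data-processing immediate as an 8-bit value rotated right by an even number of bits, and for mov it can also substitute the complemented (mvn) form:

```python
def is_arm_immediate(value):
    """True if value can be encoded as an 8-bit constant rotated
    right by an even number of bits (the ARM immediate rule)."""
    value &= 0xFFFFFFFF
    for rot in range(0, 32, 2):
        # Rotating the value LEFT by rot undoes a rotate-right-by-rot
        # encoding; if the result fits in 8 bits, the value is encodable.
        rotated = ((value << rot) | (value >> (32 - rot))) & 0xFFFFFFFF
        if rotated < 256:
            return True
    return False

def fits_mov_or_mvn(value):
    """True if value can be given as #<immediate>, either directly or by
    the assembler substituting the one's complement (mvn) form."""
    value &= 0xFFFFFFFF
    return is_arm_immediate(value) or is_arm_immediate(~value & 0xFFFFFFFF)
```

Running the checker against the entries of Table 3.3 reproduces the legal/illegal judgments shown there.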

3.4 Load/Store Instructions

The ARM processor has a strict separation between instructions that perform computation and those that move data between the CPU and memory. Because of this separation between load/store operations and computational operations, it is a classic example of a load-store architecture. The programmer can transfer bytes (8 bits), half-words (16 bits), and words (32 bits) from memory into a register, or from a register into memory. The programmer can also perform computational operations (such as addition) using two source operands and one register as the destination for the result. All computational instructions assume that the registers already contain the data. Load instructions are used to move data into the registers, while store instructions are used to move data from the registers to memory.

3.4.1 Addressing Modes

Most of the load/store instructions use an <address>, which is one of the six options shown in Table 3.4. The <shift_op> can be any of the shift operations from Table 3.5, and <shift> must be a number between 0 and 31. Although there are really only six addressing modes, there are eleven variations of the assembly language syntax. Four of the variations are simply shorthand notations. One of the variations allows an immediate data value or the address of a label to be loaded into a register, and may cause the assembler to generate more than one instruction. The following section describes each addressing mode in detail.

Table 3.4

ARM addressing modes

Syntax                         | Name
[Rn, #±<offset_12>]            | Immediate offset
[Rn, ±Rm, <shift_op> #<shift>] | Scaled register offset
[Rn, #±<offset_12>]!           | Immediate pre-indexed
[Rn, ±Rm, <shift_op> #<shift>]! | Scaled register pre-indexed
[Rn], #±<offset_12>            | Immediate post-indexed
[Rn], ±Rm, <shift_op> #<shift> | Scaled register post-indexed

Table 3.5

ARM shift and rotate operations

<shift_op> | Meaning
lsl        | Logical Shift Left by specified amount
lsr        | Logical Shift Right by specified amount
asr        | Arithmetic Shift Right by specified amount

Immediate offset: [Rn, #±<offset_12>]
The immediate offset (which may be positive or negative) is added to the contents of Rn. The result is used as the address of the item to be loaded or stored. For example, the following line of code:

 ldr r0, [r1, #12]

calculates a memory address by adding 12 to the contents of register r1. It then loads four bytes of data, starting at the calculated memory address, into register r0. Similarly, the line:

 str r9, [r6, #-8]

subtracts 8 from the contents of r6 and uses that as the address where it stores the contents of r9 in memory.

Register immediate: [Rn]
When using immediate offset mode with an offset of zero, the comma and offset can be omitted. That is, [Rn] is just shorthand notation for [Rn, #0]. This shorthand is referred to as register immediate mode. For example, the following line of code:

 ldr r3, [r2]

uses the contents of register r2 as a memory address and loads four bytes of data, starting at that address, into register r3. Likewise,

 str r8, [r0]

copies the contents of r8 to the four bytes of memory starting at the address that is in r0.

Scaled register offset: [Rn, ±Rm, <shift_op> #<shift>]
Rm is shifted as specified, then added to or subtracted from Rn. The result is used as the address of the item to be loaded or stored. For example,

 ldr r3, [r2, r1, lsl #2]

shifts the contents of r1 left two bits, adds the result to the contents of r2 and uses the sum as an address in memory from which it loads four bytes into r3. Recall that shifting a binary number left by two bits is equivalent to multiplying that number by four. This addressing mode is typically used to access an array, where r2 contains the address of the beginning of the array, and r1 is an integer index. The integer shift amount depends on the size of the objects in the array. To store an item from register r0 into an array of half-words, the following instruction could be used:

 strh r0, [r4, r5, lsl #1]

where r4 holds the address of the first byte of the array, and r5 holds the integer index for the desired array item.

Register offset: [Rn, ±Rm]
When using scaled register offset mode with a shift amount of zero, the comma and shift specification can be omitted. That is, [Rn, ±Rm] is just shorthand notation for [Rn, ±Rm, lsl #0]. This shorthand is referred to as register offset mode.

Immediate pre-indexed: [Rn, #±<offset_12>]!
The address is computed in the same way as immediate offset mode, but after the load or store, the address that was used is stored in Rn. This mode can be used to step through elements in an array, updating a pointer to the next array element before each element is accessed.

Scaled register pre-indexed: [Rn, ±Rm, <shift_op> #<shift>]!
The address is computed in the same way as scaled register offset mode, but after the load or store, the address that was used is stored in Rn. This mode can be used to step through elements in an array, updating a pointer to the current array element before each access.

Register pre-indexed: [Rn, ±Rm]!
When using scaled register pre-indexed mode with a shift amount of zero, the comma and shift specification can be omitted. That is, [Rn, ±Rm]! is shorthand notation for [Rn, ±Rm, lsl #0]!. This shorthand is referred to as register pre-indexed mode.

Immediate post-indexed: [Rn], #±<offset_12>
Register Rn is used as the address of the value to be loaded or stored. After the value is loaded or stored, the value in Rn is updated by adding the immediate offset, which may be negative or positive. This mode can be used to step through elements in an array, updating a pointer to point at the next array element after each one is accessed.

Scaled register post-indexed: [Rn], ±Rm, <shift_op> #<shift>
Register Rn is used as the address of the value to be loaded or stored. After the value is loaded or stored, the value in Rn is updated by adding or subtracting the contents of Rm shifted as specified. This mode can be used to step through elements in an array, updating a pointer to point at the next array element after each one is accessed.

Register post-indexed: [Rn], ±Rm
When using scaled register post-indexed mode with a shift amount of zero, the comma and shift specification can be omitted. That is, [Rn], ±Rm is shorthand notation for [Rn], ±Rm, lsl #0. This shorthand is referred to as register post-indexed mode.
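As a sketch of how post-indexed addressing is typically used, the following loop sums an array of words. It assumes r0 already holds the array address and r1 the element count, and uses the subs and b instructions described elsewhere in this book:

```asm
    mov   r2, #0          @ clear the running total
sumloop:
    ldr   r3, [r0], #4    @ load a word, then advance r0 to the next word
    add   r2, r2, r3      @ add the element to the total
    subs  r1, r1, #1      @ decrement the count, setting the flags
    bne   sumloop         @ repeat until the count reaches zero
```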

Load Immediate: =<immediate|symbol>
This is really a pseudo-instruction. The assembler will generate a mov instruction if possible. Otherwise it will store the value of immediate or the address of symbol in a “literal table” and generate a load instruction, using one of the previous addressing modes, to load the value into a register. This addressing mode can only be used with the ldr instruction.

The load and store instructions allow the programmer to move data from memory to registers or from registers to memory. The load/store instructions can be grouped into the following types:

 single register,

 multiple register, and

 atomic.

The following sections describe the seven load and store instructions that are available, and all of their variations.

3.4.2 Load/Store Single Register

These instructions transfer a single word, half-word, or byte from a register to memory or from memory to a register:

ldr Load Register, and

str Store Register.

Syntax

 <op>{<cond>}{<size>} Rd, <address>

 <op> is either ldr or str.

 The optional <cond> can be any of the codes from Table 3.2 on page 59 specifying conditional execution.

 The optional <size> is one of:

b unsigned byte

h unsigned half-word

sb signed byte

sh signed half-word

 The <address> is any valid address specifier described in Section 3.4.1.

Operations

Name | Effect            | Description
ldr  | Rd ← Mem[address] | Load register from memory at address
str  | Mem[address] ← Rd | Store register in memory at address

Examples

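A few representative forms, as a sketch (the registers, offsets, and conditions are chosen arbitrarily):

```asm
    ldr    r0, [r1]         @ load a word from the address in r1
    ldrb   r2, [r1, #4]     @ load an unsigned byte
    ldrsh  r3, [r1, r4]     @ load a signed half-word, sign-extended
    strh   r5, [r6], #2     @ store a half-word, then post-increment r6
    streq  r7, [r8, #-8]    @ store a word only if the Z flag is set
```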

3.4.3 Load/Store Multiple Registers

ARM has two instructions for loading and storing multiple registers:

ldm Load Multiple Registers, and

stm Store Multiple Registers.

These instructions are used to store registers on the program stack, and for copying blocks of data. The ldm and stm instructions each have four variants, and each variant has two equivalent names. So, although there are only two basic instructions, there are sixteen mnemonics. These are the most complex instructions in the ARM assembly language.

Syntax

 <op><variant> Rd{!}, <register_list>{^}

 <op> is either ldm or stm.

 <variant> is chosen from the following tables:

Block Copy Method
Variant | Description
ia      | Increment After
ib      | Increment Before
da      | Decrement After
db      | Decrement Before

Stack Type
Variant | Description
ea      | Empty Ascending
fa      | Full Ascending
ed      | Empty Descending
fd      | Full Descending

 The optional ! specifies that the address register Rd should be updated after the transfer.

 An optional trailing ^ can only be used by operating system code. It causes the transfer to affect user registers instead of operating system registers.

There are two equivalent mnemonics for each load/store multiple instruction. For example, ldmia is exactly the same instruction as ldmfd, and stmdb is exactly the same instruction as stmfd. There are two different names so that the programmer can indicate what the instruction is being used for.

The mnemonics in the Block Copy Method table are used when the programmer is using the instructions to move blocks of data. For instance, the programmer may want to copy eight words from one address in memory to another address. One very efficient way to do that is to:

1. load the address of the first byte of the source into a register,

2. load the address of the first byte of the destination into another register,

3. use ldmia (load multiple increment after) to load eight registers from the source address, then

4. use stmia (store multiple increment after) to store the registers to the destination address.

Assuming source and dest are labeled blocks of data declared elsewhere, the following listing shows the exact instructions needed to move eight words from source to dest:

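The four steps can be sketched as follows, assuming source and dest are word-aligned labels declared elsewhere, and using the =label form described in Section 3.6 to load their addresses:

```asm
    ldr   r0, =source     @ 1. address of the source block
    ldr   r1, =dest       @ 2. address of the destination block
    ldmia r0, {r2-r9}     @ 3. load eight words from the source
    stmia r1, {r2-r9}     @ 4. store the eight words at the destination
```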

The mnemonics in the Stack Type table are used when the programmer is performing stack operations. The most common variants are stmfd and ldmfd, which are used for pushing registers onto the program stack and later popping them back off, respectively. In Linux, the C compiler always uses the stmfd and ldmfd versions for accessing the stack. The following code shows how the programmer could save the contents of registers r0-r9 on the stack, use them to perform a block copy, then restore their contents:

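One way the save, copy, and restore could look, assuming r0 and r1 already hold the source and destination addresses:

```asm
    stmfd sp!, {r0-r9}    @ push r0-r9 onto the full descending stack
    ldmia r0, {r2-r9}     @ load eight words from the source address
    stmia r1, {r2-r9}     @ store them at the destination address
    ldmfd sp!, {r0-r9}    @ pop the saved values back into r0-r9
```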

Note that in the previous example, after the stmfd sp!, { r0-r9 } instruction, sp will contain the address of the last word on the stack, because the optional ! was used to indicate that the register should be updated.

Operations

Name: ldmia and ldmfd
Effect:
    addr ← Rd
    for all i ∈ register_list do
        i ← Mem[addr]
        addr ← addr + 4
    end for
    if ! is present then
        Rd ← addr
    end if
Description: Load multiple registers from memory, starting at the address in Rd and incrementing the address by four bytes after each load.

Name: stmia and stmea
Effect:
    addr ← Rd
    for all i ∈ register_list do
        Mem[addr] ← i
        addr ← addr + 4
    end for
    if ! is present then
        Rd ← addr
    end if
Description: Store multiple registers in memory, starting at the address in Rd and incrementing the address by four bytes after each store.

Name: ldmib and ldmed
Effect:
    addr ← Rd
    for all i ∈ register_list do
        addr ← addr + 4
        i ← Mem[addr]
    end for
    if ! is present then
        Rd ← addr
    end if
Description: Load multiple registers from memory, starting at the address in Rd and incrementing the address by four bytes before each load.

Name: stmib and stmfa
Effect:
    addr ← Rd
    for all i ∈ register_list do
        addr ← addr + 4
        Mem[addr] ← i
    end for
    if ! is present then
        Rd ← addr
    end if
Description: Store multiple registers in memory, starting at the address in Rd and incrementing the address by four bytes before each store.

Name: ldmda and ldmfa
Effect:
    addr ← Rd
    for all i ∈ register_list do
        i ← Mem[addr]
        addr ← addr − 4
    end for
    if ! is present then
        Rd ← addr
    end if
Description: Load multiple registers from memory, starting at the address in Rd and decrementing the address by four bytes after each load.

Name: stmda and stmed
Effect:
    addr ← Rd
    for all i ∈ register_list do
        Mem[addr] ← i
        addr ← addr − 4
    end for
    if ! is present then
        Rd ← addr
    end if
Description: Store multiple registers in memory, starting at the address in Rd and decrementing the address by four bytes after each store.

Name: ldmdb and ldmea
Effect:
    addr ← Rd
    for all i ∈ register_list do
        addr ← addr − 4
        i ← Mem[addr]
    end for
    if ! is present then
        Rd ← addr
    end if
Description: Load multiple registers from memory, starting at the address in Rd and decrementing the address by four bytes before each load.

Name: stmdb and stmfd
Effect:
    addr ← Rd
    for all i ∈ register_list do
        addr ← addr − 4
        Mem[addr] ← i
    end for
    if ! is present then
        Rd ← addr
    end if
Description: Store multiple registers in memory, starting at the address in Rd and decrementing the address by four bytes before each store.

Examples

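As a sketch of the most common stack usage, a subroutine can save callee-preserved registers and the return address on entry, and restore them (returning in the same instruction) on exit; the label is hypothetical:

```asm
myfunc:                        @ hypothetical subroutine label
    stmfd sp!, {r4-r6, lr}     @ push working registers and return address
    @ ... the body may freely use r4-r6 ...
    ldmfd sp!, {r4-r6, pc}     @ restore registers and return
```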

3.4.4 Swap

Multiprogramming and threading require the ability to set and test values atomically. This instruction is used by the operating system or threading libraries to guarantee mutual exclusion:

swp Atomic Load and Store

Note: swp and swpb are deprecated in favor of ldrex and strex, which work on multiprocessor systems as well as uni-processor systems.

Syntax

 swp{<cond>} Rd, Rm, [Rn]

 swp{<cond>}b Rd, Rm, [Rn]

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

Operations

Name | Effect                      | Description
swp  | Rd ← Mem[Rn]; Mem[Rn] ← Rm  | Atomically load Rd and store Rm (word)
swpb | Rd ← Mem[Rn]; Mem[Rn] ← Rm  | Atomically load Rd and store Rm (byte)

Example

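As an illustrative sketch, swp can implement a simple test-and-set lock; assume r2 holds the address of a lock word where one means held:

```asm
    mov r1, #1          @ the value meaning "locked"
    swp r0, r1, [r2]    @ atomically: r0 <- old lock value, lock <- 1
    @ if r0 is now 0, this code acquired the lock
```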

3.4.5 Exclusive Load/Store

These instructions are used by the operating system or threading libraries to guarantee mutual exclusion, even on multiprocessor systems:

ldrex Load Register Exclusive, and

strex Store Register Exclusive.

Exclusive load (ldrex) reads data from memory, tagging the memory address at the same time. Exclusive store (strex) stores data to memory, but only if the tag is still valid. A strex to the same address as the previous ldrex will invalidate the tag. A str to the same address may invalidate the tag (implementation defined). The strex instruction writes a status value to a result register indicating whether or not the store succeeded (zero for success, non-zero for failure). This allows the programmer to implement semaphores on uni-processor and multiprocessor systems.

Syntax

 ldrex{<cond>} Rd, [Rn]

 strex{<cond>} Rd, Rm, [Rn]

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

Operations

Name  | Effect                                          | Description
ldrex | Rd ← Mem[Rn]; tag[Rn] ← true                    | Load register and tag the memory address
strex | if tag[Rn] then Mem[Rn] ← Rm; Rd ← 0 else Rd ← 1 | Store Rm in memory if the tag is valid; Rd receives the status

Example

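A minimal spin-lock acquire sketch (the label and register choices are ours, and cmp is described in the next chapter); assume r1 holds the address of a lock word where zero means free:

```asm
trylock:
    ldrex r0, [r1]        @ load the lock value and tag the address
    cmp   r0, #0          @ is the lock free?
    bne   trylock         @ no: spin until it is
    mov   r0, #1
    strex r2, r0, [r1]    @ try to store 1; r2 receives 0 on success
    cmp   r2, #0
    bne   trylock         @ store failed (tag invalidated): retry
```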

3.5 Branch Instructions

Branch instructions allow the programmer to change the address of the next instruction to be executed. They are used to implement loops, if-then structures, subroutines, and other flow control structures. There are two basic branch instructions:

 Branch, and

 Branch and Link (subroutine call).

3.5.1 Branch

This instruction is used to perform conditional and unconditional branches in program execution:

b Branch.

It is used for creating loops and if-then-else constructs.

Syntax

 b{<cond>} <target_label>

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

 The target_label can be any label in the current file, or any label that is defined as .global or .globl in any file that is linked in.

Operations

Name | Effect              | Description
b    | pc ← target_address | Load pc with new address (branch)

Examples

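Representative uses, as a sketch (the labels are hypothetical, and subs is described in the next chapter):

```asm
    mov  r0, #10      @ initialize a loop counter
loop:
    subs r0, r0, #1   @ decrement the counter, setting the flags
    bne  loop         @ conditional branch: repeat while non-zero
    b    finished     @ unconditional branch to the label finished
```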

3.5.2 Branch and Link

The following instruction is used to call subroutines:

bl Branch and Link.

The branch and link instruction is identical to the branch instruction, except that it copies the current program counter to the link register before performing the branch. This allows the programmer to copy the link register back into the program counter at some later point. This is how subroutines are called, and how subroutines return and resume executing at the next instruction after the one that called them.

Syntax


 bl{<cond>} <target_label>

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

 The target_label can be any label in the current file, or any label that is defined as .global or .globl in any file that is linked in.

Operations

Name | Effect                        | Description
bl   | lr ← pc; pc ← target_address  | Save pc in lr, then load pc with new address

Examples


Example 3.1 shows how the bl instruction can be used to call a function from the C standard library to read a single character from standard input. By convention, when a function is called, it will leave its return value in r0. Example 3.2 shows how the bl instruction can be used to call another function from the C standard library to print a message to standard output. By convention, when a function is called, it will expect to find its first argument in r0. There are other rules, which all ARM programmers must follow, regarding which registers are used when passing arguments to functions and procedures. Those rules will be explained fully in Section 5.4.

Example 3.1

Using the bl Instruction to Read a Character

Suppose we want to read a single character from standard input. This can be accomplished in C by calling the getchar () function from the C standard library as follows:

 c = getchar();

The above C code assumes that the variable c has been declared to hold the result of the function. In ARM assembly language, functions always return their results in r0. The assembly programmer may then move the result to any register or memory location they choose. In the following example, it is assumed that r9 was chosen to hold the value of the variable c:

 bl   getchar       @ call getchar(); the result is returned in r0
 mov  r9, r0        @ copy the returned character into r9

Example 3.2

Using the bl Instruction to Print a Message

To print a string to standard output in C, we can use the printf () function from the C standard library as follows:

u03-11-9780128036983

The C compiler will automatically create a constant array of characters and initialize it to hold the message. Then it will load the address of the first character in the array into register r0 before calling printf (). The printf () function will expect to see an address in r0, which it will assume is the address of the format string to be printed. The function call can be made as follows in ARM assembly:

u03-12-9780128036983

3.6 Pseudo-Instructions

The assembler provides a small number of pseudo-instructions. From the perspective of the programmer, these instructions are indistinguishable from standard instructions. However, when the assembler encounters a pseudo-instruction, it may substitute a different instruction or generate a short sequence of machine instructions.

3.6.1 Load Immediate

This pseudo-instruction loads a register with any 32-bit value:

ldr Load Immediate

When this pseudo-instruction is encountered, the assembler first determines whether or not it can substitute a mov Rd,#<immediate> or mvn Rd,#<immediate> instruction. If that is not possible, then it reserves four bytes in a “literal pool” and stores the immediate value there. Then, the pseudo-instruction is translated into an ldr instruction using Immediate Offset addressing mode with the pc as the base register.

Syntax

 ldr{<cond>} Rd, =<immediate>

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

 The <immediate> parameter is any valid 32-bit quantity.

Operations

Name | Effect     | Description
ldr  | Rd ← value | Load register with immediate value

Example

Example 3.3 shows how the assembler generates code from the load immediate pseudo-instruction. Line 2 of the example listing just declares two 32-bit words; they are not used anywhere in the program, but they cause the next variable to be given a non-zero address for demonstration purposes. Line 3 declares a string of characters in the data section. The string is located at offset 0x00000008 from the beginning of the data section. The linker is responsible for calculating the actual address when it assigns a location for the data section. Line 6 shows how a register can be loaded with an immediate value using the mov instruction. The next line shows the equivalent using the ldr pseudo-instruction. Note that the assembler generates the same machine instruction (FD5FE0E3) for both lines.

Example 3.3

Assembly of the Load Immediate Pseudo-Instruction

u03-13a-9780128036983u03-13b-9780128036983

Line 8 shows the ldr pseudo-instruction being used to load a value that cannot be loaded using the mov instruction. The assembler generated a load half-word instruction using the program counter as the base register, and an offset to the location where the value is stored. The value is actually stored in a literal pool at the end of the text segment. The listing has three lines labeled 11. The first line 11 is an instruction. The remaining lines are the literal pool.

On line 9, the programmer used the ldr pseudo-instruction to request that the address of str be loaded into r4. The assembler created a storage location to hold the address of str, and generated a load word instruction using the program counter as the base register and an offset to the location where the address is stored. The address of str is actually stored in the text segment, on the third line 11.

3.6.2 Load Address

These pseudo instructions are used to load the address associated with a label:

adr Load Address

adrl Load Address Long

They are more efficient than the ldr rx,=label instruction, because they are translated into one or two add or subtract operations, and do not require a load from memory. However, the address must be in the same section as the adr or adrl pseudo-instruction, so they cannot be used to load addresses of labels in the .data section.

Syntax

 <op>{<cond>}{s} Rd, label

 <op> is either adr or adrl.

 The adr pseudo-instruction will be translated into one or two pc-relative add or sub instructions.

 The adrl pseudo-instruction will always be translated into two instructions. The second instruction may be a nop instruction.

 The label must be defined in the same file and section where these pseudo-instructions are used.

Operations

Name | Effect                | Description
adr  | Rd ← address of label | Load Address
adrl | Rd ← address of label | Load Address

Examples

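A short sketch (the label name is ours); note that the data must be placed in the same section as the pseudo-instructions for adr to reach it:

```asm
    adr   r0, table    @ r0 <- address of table (one pc-relative add or sub)
    adrl  r1, table    @ same address, always assembled as two instructions

table:
    .word 1, 2, 3      @ data placed in the same section as the adr
```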

3.7 Chapter Summary

The ARM Instruction Set Architecture includes 17 registers and four basic instruction types. This chapter explained the instructions used for

 moving data between memory and registers, and

 branching and calling subroutines.

The load and store operations are used to move data between memory and registers. The basic load and store operations, ldr and str, have a very powerful set of addressing modes. To facilitate moving multiple registers to or from memory, the ARM ISA provides the ldm and stm instructions, which each have several variants. The assembler provides two pseudo-instructions for loading addresses and immediate values.

The ARM processor provides only two types of branch instruction. The bl instruction is used to call subroutines (functions). The b instruction can be used to create loops and to create if-then-else constructs. The ability to append a condition to almost any instruction results in a very rich instruction set.

Exercises

3.1 Which registers hold the stack pointer, return address, and program counter?

3.2 Which is more efficient for loading a constant value, the ldr pseudo-instruction, or the mov instruction? Explain.

3.3 Which two variants of the Store Multiple instruction are used most often, and why?

3.4 The stm and ldm instructions include an optional ‘!’ after the address register. What does it do?

3.5 The following C statement declares an array of four integers, and initializes their values to 7, 3, 21, and 10, in that order.

int nums[]={7,3,21,10};

(a) Write the equivalent in GNU ARM assembly.

(b) Write the ARM assembly instructions to load all four numbers into registers r3, r5, r6, and r9, respectively, using:

i. a single ldm instruction, and

ii. four ldr instructions.

3.6 What is the difference between a memory location and a CPU register?

3.7 How many registers are provided by the ARM Instruction Set Architecture?

3.8 Use ldm and stm to write a short sequence of ARM assembly language to copy 16 words of data from a source address to a destination address. Assume that the source address is already loaded in r0 and the destination address is already loaded in r1. You may use registers r2 through r5 to hold values as needed. Your code is allowed to modify r0 and/or r1.

3.9 Assume that x is an array of integers. Convert the following C statements into ARM assembly language.

(a) x[8] = 100;

(b) x[10] = x[0];

(c) x[9] = x[3];

3.10 Assume that x is an array of integers, and i and j are integers. Convert the following C statements into ARM assembly language.

(a) x[i] = j;

(b) x[j] = x[i];

(c) x[i] = x[j*2];

3.11 What is the difference between the b instruction and the bl instruction? What is each used for?

3.12 What are the meanings of the following instructions?

(a) ldreq

(b) ldrlt

(c) bgt

(d) bne

(e) bge

Chapter 4

Data Processing and Other Instructions

Abstract

This chapter begins by explaining Operand2, which is used by most ARM data processing instructions to specify one of the source operands for the data processing operation. It explains all of the shift operations and how they can be combined with other data processing operations in a single instruction. It then explains each of the data processing instructions, giving a short example showing how they can be used. Short examples, relating the assembly instructions to C statements, are incorporated throughout the chapter. One of the examples shows how to construct a loop. After the data processing instructions are explained, the chapter covers the special instructions and pseudo-instructions.

Keywords

Operand2; Data processing; Shift; Loop; Comparison; Data movement; Three address instruction; Two address instruction

The ARM processor has approximately 25 data processing instructions. The exact number depends on the processor version. For example, older versions of the architecture did not have the six multiply instructions, and the Cortex M3 and newer processors have two division instructions. There are also a few special instructions that are used infrequently to perform operations that are not classified as load/store, branch, or data processing.

4.1 Data Processing Instructions

The data processing instructions operate only on CPU registers, so data must first be moved from memory into a register before processing can be performed. Most of these instructions use two source operands and one destination register. Each instruction performs one basic arithmetical or logical operation. The operations are grouped in the following categories:

 Arithmetic Operations,

 Logical Operations,

 Comparison Operations,

 Data Movement Operations,

 Status Register Operations,

 Multiplication Operations, and

 Division Operations.

4.1.1 Operand2

Most of the data processing instructions require the programmer to specify two source operands and one destination register for the result. Because three items must be specified for these instructions, they are known as three address instructions. The use of the word address in this case has nothing to do with memory addresses. The term three address instruction comes from earlier processor architectures that allowed arithmetic operations to be performed on data stored in memory rather than in registers. The first source operand specifies a register whose contents will be on the A bus in Fig. 3.1. The second source operand will be on the B bus and is referred to as Operand2. Operand2 can be any one of the following three things:

 a register (r0-r15),

 a register (r0-r15) and a shift operation to modify it, or

 a 32-bit immediate value that can be constructed by shifting, rotating, and/or complementing an 8-bit value.

The options for Operand2 allow a great deal of flexibility. Many operations that would require two instructions on most processors can be performed using a single ARM instruction. Table 4.1 shows the mnemonics used for specifying shift operations, which we refer to as < shift_op >.

Table 4.1

Shift and rotate operations in Operand2

Name   Operation
lsl    Logical Shift Left
lsr    Logical Shift Right
asr    Arithmetic Shift Right
ror    Rotate Right
rrx    Rotate Right with eXtend

The lsl operation shifts each bit left by a specified amount n. Zero is shifted into the n least significant bits, and the most significant n bits are lost. The lsr operation shifts each bit right by a specified amount n. Zero is shifted into the n most significant bits, and the least significant n bits are lost. The asr operation shifts each bit right by a specified amount n. The n most significant bits become copies of the sign bit (bit 31), and the least significant n bits are lost. The ror operation rotates each bit right by a specified amount n. The n most significant bits become the least significant n bits. The rrx operation rotates one place to the right, with the CPSR carry flag, C, included. The carry flag and the register together form a 33-bit quantity to be rotated. The carry flag is rotated into the most significant bit of the register, and the least significant bit of the register is rotated into the carry flag. Table 4.2 shows all of the possible forms for Operand2.

Table 4.2

Formats for Operand2

#<immediate|symbol>            A 32-bit immediate value that can be constructed from an 8-bit value
Rm                             Any of the 16 registers r0-r15
Rm, <shift_op> #<shift_imm>    The contents of a register shifted or rotated by an immediate amount between 0 and 31
Rm, <shift_op> Rs              The contents of a register shifted or rotated by an amount specified by the contents of another register
Rm, rrx                        The contents of a register rotated right by one bit through the carry flag

4.1.2 Comparison Operations

These four comparison operations update the CPSR flags, but have no other effect:

cmp Compare,

cmn Compare Negative,

tst Test Bits, and

teq Test Equivalence.

They each perform an arithmetic or logical operation, but the result of the operation is discarded. Only the CPSR condition flags are affected.

Syntax

 <op>{<cond>} Rn, Operand2

 <op> is either cmp, cmn, tst, or teq.

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

Operations

Name   Effect           Description
cmp    Rn − operand2    Compare and set CPSR flags
cmn    Rn + operand2    Compare negative and set CPSR flags
tst    Rn ∧ operand2    Test bits and set CPSR flags
teq    Rn ⊕ operand2    Test equivalence and set CPSR flags

Examples

f04-01-9780128036983

Example 4.1 shows how conditional execution and the test instruction can be used together to create an if-then-else structure. Note that in this case, the assembly code is more concise than the C code. That is not generally true.

Example 4.1

Making an If-Then-Else Construct

The following C code adds three to a if a is odd, and adds seven to a if a is even.

f04-02-9780128036983

Assuming that the value of a is currently being stored in register r4, the following ARM assembly code performs the same function:

f04-03-9780128036983

4.1.3 Arithmetic Operations

There are six basic arithmetic operations:

add Add,

adc Add with Carry,

sub Subtract,

sbc Subtract with Carry,

rsb Reverse Subtract, and

rsc Reverse Subtract with Carry.

All of them involve two 32-bit source operands and a destination register.

Syntax

 <op>{<cond>}{s} Rd, Rn, Operand2

 <op> is one of add, adc, sub, sbc, rsb, or rsc.

 The optional s specifies whether or not the instruction should affect the bits in the CPSR.

 The optional <cond> can be any of the codes from Table 3.2 on page 59 specifying conditional execution.

Operations

Name   Effect                            Description
add    Rd ← Rn + operand2                Add
adc    Rd ← Rn + operand2 + carry        Add with carry
sub    Rd ← Rn − operand2                Subtract
sbc    Rd ← Rn − operand2 + carry − 1    Subtract with carry
rsb    Rd ← operand2 − Rn                Reverse subtract
rsc    Rd ← operand2 − Rn + carry − 1    Reverse subtract with carry

Examples

f04-04-9780128036983

Example 4.2 shows a complete program for adding the contents of two statically allocated variables and printing the result. The printf() function expects to find the address of a string in r0. As it prints the string, it finds the %d formatting command, which indicates that the value of an integer variable should be printed. It expects the variable to be stored in r1. Note that the variable sum does not need to be stored in memory. It is stored in r1, where printf() expects to find it.

Example 4.2

Adding the Contents of Two Variables

The following C program will add together two numbers stored in memory and print the result.

f04-05-9780128036983

The equivalent ARM assembly program is as follows:

f04-06-9780128036983

Example 4.3 shows how the compare, branch, and add instructions can be used to create a loop. There are basically three steps for creating a loop: allocating and initializing the loop variable, testing the loop variable, and modifying the loop variable. In general, any of the registers r0-r12 can be used to hold the loop variable. Section 5.4 introduces some considerations for choosing an appropriate register. For now, it is assumed that r0 is available for use as the loop variable for this example.

Example 4.3

Making a Loop

Suppose we want to implement a loop that is equivalent to the following C code:

f04-07-9780128036983

The loop can be written with the following ARM assembly code:

f04-08-9780128036983

4.1.4 Logical Operations

There are five basic logical operations:

and Bitwise AND,

orr Bitwise OR,

eor Bitwise Exclusive OR,

orn Bitwise OR NOT, and

bic Bit Clear.

All of them involve two source operands and a destination register.

Syntax

 <op>{<cond>}{s} Rd, Rn, Operand2

 <op> is one of and, eor, orr, orn, or bic.

 The optional s specifies whether or not the instruction should affect the bits in the CPSR.

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

Operations

Name   Effect                Description
and    Rd ← Rn ∧ operand2    Bitwise AND
orr    Rd ← Rn ∨ operand2    Bitwise OR
eor    Rd ← Rn ⊕ operand2    Bitwise Exclusive OR
orn    Rd ← Rn ∨ ¬operand2   Bitwise OR NOT
bic    Rd ← Rn ∧ ¬operand2   Bit Clear

Examples

f04-09-9780128036983

4.1.5 Data Movement Operations

The data movement operations copy data from one register to another:

mov Move,

mvn Move Not, and

movt Move Top.

The movt instruction copies 16 bits of data into the upper 16 bits of the destination register, without affecting the lower 16 bits. It is available on ARMv6T2 and newer processors.

Syntax

 <op>{<cond>}{s} Rd, Operand2

 movt{<cond>} Rd, #immed16

 <op> is either mov or mvn.

 The optional s specifies whether or not the instruction should affect the bits in the CPSR.

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

Operations

Name   Effect                                  Description
mov    Rd ← operand2                           Copy operand2 to Rd
mvn    Rd ← ¬operand2                          Copy 1's complement of operand2 to Rd
movt   Rd ← (immed16 ≪ 16) ∨ (Rd ∧ 0xFFFF)    Copy immed16 into upper 16 bits of Rd

Examples

f04-10-9780128036983

4.1.6 Multiply Operations with 32-bit Results

These two instructions perform multiplication using two 32-bit registers to form a 32-bit result:

mul Multiply, and

mla Multiply and Accumulate.

The mla instruction adds a third register to the result of the multiplication.

Syntax

 mul{<cond>}{s} Rd, Rm, Rs

 mla{<cond>}{s} Rd, Rm, Rs, Rn

 The optional s specifies whether or not the instruction should affect the bits in the CPSR.

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

Operations

Name   Effect               Description
mul    Rd ← Rm × Rs         Multiply
mla    Rd ← Rm × Rs + Rn    Multiply and accumulate

Examples

f04-11-9780128036983

4.1.7 Multiply Operations with 64-bit Results

These instructions perform multiplication using two 32-bit registers to form a 64-bit result:

smull Signed Multiply Long,

umull Unsigned Multiply Long,

smlal Signed Multiply and Accumulate Long, and

umlal Unsigned Multiply and Accumulate Long.

The smlal and umlal instructions add a 64-bit quantity to the result of the multiplication.

Syntax

 <type><op>l{<cond>}{s} RdLo, RdHi, Rm, Rs

 <type> must be either s for signed or u for unsigned.

 <op> must be either mul or mla.

 The optional s specifies whether or not the instruction should affect the bits in the CPSR.

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

Operations

Name    Effect                             Description
smull   RdHi:RdLo ← Rm × Rs                Signed Multiply
umull   RdHi:RdLo ← Rm × Rs                Unsigned Multiply
smlal   RdHi:RdLo ← Rm × Rs + RdHi:RdLo    Signed Multiply and Accumulate
umlal   RdHi:RdLo ← Rm × Rs + RdHi:RdLo    Unsigned Multiply and Accumulate

Examples

f04-12-9780128036983

4.1.8 Division Operations

Some ARM processors have the following instructions to perform division:

sdiv Signed Divide, and

udiv Unsigned Divide.

The divide operations are available on Cortex M3 and newer ARM processors. The processor used on the Raspberry Pi does not have these instructions. The Raspberry Pi 2 does have them.

Syntax

 <type>div{<cond>}{s} Rd, Rm, Rn

 <type> must be either s for signed or u for unsigned.

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

 The optional s specifies whether or not the instruction should affect the bits in the CPSR.

Operations

Name   Effect          Description
sdiv   Rd ← Rm ÷ Rn    Signed Divide
udiv   Rd ← Rm ÷ Rn    Unsigned Divide

Examples

f04-13-9780128036983

4.2 Special Instructions

There are a few instructions that do not fit into any of the previous categories. They are used to request operating system services and access advanced CPU features.

4.2.1 Count Leading Zeros

This instruction counts the number of leading zeros in the operand register and stores the result in the destination register:

clz Count Leading Zeros.

Syntax

 clz{<cond>} Rd, Rm

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

Operations

Name   Effect                    Description
clz    Rd ← 31 − ⌊log2(Rm)⌋      Count leading zeros in Rm

Example

f04-14-9780128036983

4.2.2 Accessing the CPSR and SPSR

These two instructions allow the programmer to access the status bits of the CPSR and SPSR:

mrs Move Status to Register, and

msr Move Register to Status.

The SPSR is covered in Section 14.1.

Syntax

 mrs{<cond>} Rd, <CPSR|SPSR>{_<fields>}

 msr{<cond>} <CPSR|SPSR>{_<fields>}, Rd

 The optional <fields> is any combination of:

c control field

x extension field

s status field

f flags field

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

Operations

Name   Effect             Description
mrs    Rd ← CPSR|SPSR     Move from Status Register
msr    CPSR|SPSR ← Rd     Move to Status Register

Examples

f04-15-9780128036983

4.2.3 Software Interrupt

The following instruction allows a user program to perform a system call to request operating system services:

swi Software Interrupt.

In Unix and Linux, the system calls are documented in the second section of the online manual. Each system call has a unique ID number, which is defined in the /usr/include/syscall.h file.

Syntax

 swi <syscall_number>

 The <syscall_number> is encoded in the instruction. The operating system may examine it to determine which operating system service is being requested.

 In Linux, <syscall_number> is ignored. The system call number is passed in r7, and up to seven parameters are passed in r0-r6. No Linux system call requires more than seven parameters.

Operations

Name   Effect                             Description
swi    Request Operating System Service   Perform software interrupt

Example

f04-16-9780128036983

4.2.4 Thumb Mode

The ARM processor has an alternate mode in which it executes a 16-bit instruction set known as Thumb. The following instructions allow the programmer to change the processor mode and branch to Thumb code:

bx Branch and Exchange, and

blx Branch with Link and Exchange.

The Thumb instruction set is sometimes more efficient than the full ARM instruction set, and may offer advantages on small systems.

Syntax

 bx{<cond>} Rn

 blx{<cond>} Rn

Operations

Name   Effect                                       Description
bx     pc ← target_address                          Branch and exchange: bit 0 of Rn selects the instruction set (1 for Thumb, 0 for ARM)
blx    lr ← return address; pc ← target_address     Branch and link with exchange: as bx, but the return address is also saved in lr

Example

f04-17-9780128036983

4.3 Pseudo-Instructions

The assembler provides a small number of pseudo-instructions. From the perspective of the programmer, these instructions are indistinguishable from standard instructions. However, when the assembler encounters a pseudo-instruction, it may substitute a different instruction or generate a short sequence of machine instructions.

4.3.1 No Operation

This pseudo-instruction does nothing, but takes one clock cycle to execute.

nop No Operation.

This is equivalent to a mov r0,r0 instruction.

Syntax

 nop

Operations

Name   Effect      Description
nop    No effect   No Operation

Examples

f04-18-9780128036983

4.3.2 Shifts

These pseudo-instructions are assembled into mov instructions with an appropriate shift of Operand2:

lsl Logical Shift Left,

lsr Logical Shift Right,

asr Arithmetic Shift Right,

ror Rotate Right, and

rrx Rotate Right with eXtend.

Syntax

 <op>{<cond>}{s} Rd, Rn, Rs

 <op>{<cond>}{s} Rd, Rn, #shift

 rrx{<cond>}{s} Rd, Rn

 <op> must be either lsl, lsr, asr, or ror.

 Rs is a register holding the shift amount. Only the least significant byte is used.

 shift must be between 1 and 32.

 If the optional s is specified, then the N and Z flags are updated according to the result, and the C flag is updated to the last bit shifted out.

 The optional <cond> can be any of the codes from Table 3.2 on page 59 specifying conditional execution.

Operations

Name   Effect                 Description
lsl    Rd ← Rn ≪ shift        Shift Left
lsr    Rd ← Rn ≫ shift        Shift Right
asr    Rd ← Rn ≫ shift        Shift Right with sign extend
rrx    Rd:Carry ← Carry:Rn    Rotate Right with eXtend

The rrx operation rotates one place to the right, with the CPSR carry flag, C, included. The carry flag and the register together form a 33-bit quantity to be rotated. The carry flag is rotated into the most significant bit of the register, and the least significant bit of the register is rotated into the carry flag.

Examples

f04-19-9780128036983

4.4 Alphabetized List of ARM Instructions

This chapter and the previous one introduced the core set of ARM instructions. Most of these instructions were introduced with the very first ARM processors. There are approximately 50 additional instructions and pseudo-instructions that were introduced with the ARMv6 and later versions of the architecture, or that only appear in specific versions of the ARM architecture. There are also additional instructions available on systems that have the Vector Floating Point (VFP) coprocessor and/or the NEON extensions. The instructions introduced so far are:

Name    Page   Operation
adc     83     Add with Carry
add     83     Add
adr     75     Load Address
adrl    75     Load Address Long
and     85     Bitwise AND
asr     94     Arithmetic Shift Right
b       70     Branch
bic     86     Bit Clear
bl      71     Branch and Link
bx      92     Branch and Exchange
clz     90     Count Leading Zeros
cmn     81     Compare Negative
cmp     81     Compare
eor     85     Bitwise Exclusive OR
ldm     65     Load Multiple Registers
ldr     73     Load Immediate
ldr     64     Load Register
ldrex   69     Load Register Exclusive
lsl     94     Logical Shift Left
lsr     94     Logical Shift Right
mla     87     Multiply and Accumulate
mov     86     Move
movt    86     Move Top
mrs     91     Move Status to Register
msr     91     Move Register to Status
mul     87     Multiply
mvn     86     Move Not
nop     93     No Operation
orn     86     Bitwise OR NOT
orr     85     Bitwise OR
ror     94     Rotate Right
rrx     94     Rotate Right with eXtend
rsb     83     Reverse Subtract
rsc     83     Reverse Subtract with Carry
sbc     83     Subtract with Carry
sdiv    89     Signed Divide
smlal   88     Signed Multiply and Accumulate Long
smull   88     Signed Multiply Long
stm     65     Store Multiple Registers
str     64     Store Register
strex   69     Store Register Exclusive
sub     83     Subtract
swi     91     Software Interrupt
swp     68     Swap Register and Memory
teq     81     Test Equivalence
tst     81     Test Bits
udiv    89     Unsigned Divide
umlal   88     Unsigned Multiply and Accumulate Long
umull   88     Unsigned Multiply Long


4.5 Chapter Summary

The ARM Instruction Set Architecture includes 17 registers and four basic instruction types. This chapter introduced the instructions used for

 moving data from one register to another,

 performing computational operations with two source operands and one destination register,

 multiplication and division,

 performing comparisons, and

 performing special operations.

Most of the data processing instructions are three address instructions, because they involve two source operands and produce one result. For most instructions, the second source operand can be a register, a rotated or shifted register, or an immediate value. This flexibility results in a relatively powerful assembly language. In addition, almost all instructions can be executed conditionally, which, if used properly, results in very efficient and compact code.

Exercises

4.1 If r0 initially contains 1, what will it contain after the third instruction in the sequence below?

f04-20-9780128036983

4.2 What will r0 and r1 contain after each of the following instructions? Give your answers in base 10.

f04-21-9780128036983

4.3 What is the difference between lsr and asr?

4.4 Write the ARM assembly code to load the numbers stored in num1 and num2, add them together, and store the result in numsum. Use only r0 and r1.

4.5 Given the following variable definitions:

f04-22-9780128036983

where you do not know the values of x and y, write a short sequence of ARM assembly instructions to load the two numbers, compare them, and move the largest number into register r0.

4.6 Assuming that a is stored in register r0 and b is stored in register r1, show the ARM assembly code that is equivalent to the following C code.

f04-23-9780128036983

4.7 Without using the mul instruction, give the instructions to multiply r3 by the following constants, leaving the result in r0. You may also use r1 and r2 to hold temporary results, and you do not need to preserve the original contents of r3.

(a) 10

(b) 100

(c) 575

(d) 123

4.8 Assume that r0 holds the least significant 32 bits of a 64-bit integer a, and r1 holds the most significant 32 bits of a. Likewise, r2 holds the least significant 32 bits of a 64-bit integer b, and r3 holds the most significant 32 bits of b. Show the shortest instruction sequences necessary to:

(a) compare a to b, setting the CPSR flags,

(b) shift a left by one bit, storing the result in b,

(c) add b to a, and

(d) subtract b from a.

4.9 Write a loop to count the number of bits in r0 that are set to 1. Use any other registers that are necessary.

4.10 The C standard library provides the open() function, which is documented in the second section of the Linux manual pages. This function is a very small “wrapper” to allow C programmers to access the open() system call. Assembly programmers can access the system call directly. In ARM Linux, the system call number for open() is 5. The values for flag constants used with open() are defined in

/usr/include/bits/fcntl-linux.h.

Write the ARM assembly instructions and directives necessary to make a Linux system call to open a file named input.txt for reading, without using the C standard library. In other words, write the assembly equivalent to open("input.txt", O_RDONLY); using the swi instruction.

Chapter 5

Structured Programming

Abstract

This chapter first introduces the structured programming concepts and describes the principles of good software design. It then shows how the language elements covered in the previous three chapters are used to create the elements required by structured programming, giving comparative examples of these elements in C and assembly language. It covers programming elements for sequencing, selection, and iteration. Then it covers in greater detail how to access the standard C library functions from assembly language, and how to access assembly language functions from C. It then explains how automatic variables are allocated, and covers writing recursive functions in assembly language. Finally, it explains the implementation of C structs and shows how they can be accessed from assembly language, then covers arrays in the same way.

Keywords

Structured programming; Sequencing; Selection; Iteration; Loop; Subroutine; Function; Recursion; Struct; Aggregate data; Array

Before IBM released FORTRAN in 1957, almost all programming was done in assembly language. Part of the reason for this is that nobody knew how to design a good high-level language, nor did they know how to write a compiler to generate efficient code. Early attempts at high-level languages resulted in languages that were not well structured, difficult to read, and difficult to debug. The first release of FORTRAN was not a particularly elegant language by today’s standards, but it did generate efficient code.

In the 1960s, a new paradigm for designing high-level languages emerged. This new paradigm emphasized grouping program statements into blocks of code that execute from beginning to end. These basic blocks have only one entry point and one exit point. Control of which basic blocks are executed, and in what order, is accomplished with highly structured flow control statements. The structured program theorem provides the theoretical basis of structured programming. It states that there are three ways of combining basic blocks: sequencing, selection, and iteration. These three mechanisms are sufficient to express any computable function. It has been proven that all programs can be written using only basic blocks, the pre-test loop, and if-then-else structure. Although most high-level languages provide additional statements for the convenience of the programmer, they are just “syntactical sugar.” Other structured programming concepts include well-formed functions and procedures, pass-by-reference and pass-by-value, separate compilation, and information hiding.

These structured programming languages enabled programmers to become much more productive. Well-written programs that adhere to structured programming principles are much easier to write, understand, debug, and maintain. Most successful high-level languages are designed to enforce, or at least facilitate, good programming techniques. This is not generally true for assembly language. The burden of writing well-structured code lies with the programmer, not with the language.

The best assembly programmers rely heavily on structured programming concepts. Failure to do so results in code that contains unnecessary branch instructions and, in the worst cases, results in something called spaghetti code. Consider a code listing where a line has been drawn from each branch instruction to its destination. If the result looks like someone spilled a plate of spaghetti on the page, then the listing is spaghetti code. If a program is spaghetti code, then the flow of control is difficult to follow. Spaghetti code is much more likely to have bugs and is extremely difficult to debug. If the flow of control is too complex for the programmer to follow, then it cannot be adequately debugged. It is the responsibility of the assembly language programmer to write code that uses a block-structured approach.

Adherence to structured programming principles results in code that has a much higher probability of working correctly. Well-written code also has fewer branch statements, so the percentage of data processing statements relative to branch statements is higher. High data processing density results in higher throughput of data. In other words, writing code in a structured manner leads to higher efficiency.

5.1 Sequencing

Sequencing simply means executing statements (or instructions) in a linear sequence. When statement n is completed, statement n + 1 will be executed next. Uninterrupted sequences of statements form basic blocks. Basic blocks have exactly one entry point and one exit point. Flow control is used to select which basic block should be executed next.

5.2 Selection

The first control structure that we will examine is the basic selection construct. It is called selection because it selects one of the two (or possibly more) blocks of code to execute, based on some condition. In its most general form, the condition could be computed in a variety of ways, but most commonly it is the result of some comparison operation or the result of evaluating a Boolean expression.

Most languages support selection in the form of an if-then-else statement. Selection can be implemented very easily in ARM assembly language with a two-stage process:

1. perform an operation that updates the CPSR flags, and

2. use conditional execution to select a block of instructions to execute.

Because the ARM architecture supports conditional execution on almost every instruction, there are two basic ways to implement this control structure: by using conditional execution on all instructions in a block, or by using branch instructions. The conditional execution can be applied directly to instructions following the flag update, or to branch instructions that transfer execution to another location. Listing 5.1 shows a typical if-then-else statement in C.

f05-02-9780128036983
Listing 5.1 Selection in C.

5.2.1 Using Conditional Execution

Listing 5.2 shows the ARM code equivalent to Listing 5.1, using conditional execution. The then and else are written with one instruction each on lines 7 and 8. The then section is written as a conditional instruction with the lt condition attached. The else section is a single instruction with the opposite (ge) condition. Therefore only one of the two instructions will actually execute, depending on the results of the cmp instruction. If there are three or fewer instructions in each block that can be selected, then this is the preferred and most efficient method of writing the bodies of the then and else selections.

f05-03-9780128036983
Listing 5.2 Selection in ARM assembly using conditional execution.

5.2.2 Using Branch Instructions

Listing 5.3 shows the ARM code equivalent to Listing 5.1, using branch instructions. Note that this method requires a conditional branch, an unconditional branch, and two labels. If there are more than three instructions in either basic block, then this is the preferred and most efficient method of writing the bodies of the then and else selections.

f05-04-9780128036983
Listing 5.3 Selection in ARM assembly using branch instructions.

5.2.3 Complex Selection

More complex selection structures should be written with care. Listing 5.4 shows a fragment of C code which compares the variables a, b, and c, and sets the variable x to the least of the three values. In C, Boolean expressions use short-circuit evaluation. For example, consider the Boolean AND operator in the expression ((a<b)&&(a<c)). If the first sub-expression evaluates to false, then the truth value of the complete expression can be immediately determined to be false, so the second sub-expression is not evaluated. This usually results in the compiler generating very efficient assembly code. Good programmers can take advantage of short-circuiting by checking array bounds early in a Boolean expression and accessing array elements later in the expression. For example, the expression ((i<15)&&(array[i]<0)) makes sure that the index i is less than 15 before attempting to access the array. If the index is greater than 14, the array access will not take place. This prevents the program from attempting to access the 16th element of an array that has only 15 elements.

f05-05-9780128036983
Listing 5.4 Complex selection in C.

Listing 5.5 shows an ARM assembly code fragment which is equivalent to Listing 5.4. In this code fragment, r0 is used to store a temporary value for the variable x, and the value is only stored to memory once at the end of the fragment of code. The outer if-then-else statement is implemented using branch instructions. The first comparison is performed on line 8. If the comparison evaluates to false, then it immediately branches to the else block of the outer if-then-else statement. But if the first comparison evaluates to true, then it performs the second comparison. Again, if that comparison evaluates to false, then it branches to the else block of the outer if-then-else statement. If both comparisons evaluate to true, then it executes the then block of the outer if-then-else statement, and then branches to the statement following the else block.

f05-06-9780128036983
Listing 5.5 Complex selection in ARM assembly.

The if-then-else statement on line 5 of Listing 5.4 is implemented using conditional execution. The comparison is performed on line 13 of Listing 5.5. Lines 14 and 15 contain instructions that are conditionally executed. Since they have complementary conditions, it is guaranteed that one of them will move a value into r0. The comparison on line 13 determines which statement executes.

Note that both the number of comparisons and the number of branches performed have been minimized. The only way that line 13 can be reached is if one of the first two comparisons evaluates to false. If line 2 is executed, then no matter which sequence of events occurs, the program fragment will always reach line 16 and a value will be stored in x. Thus, the ARM assembly code fragment in Listing 5.5 can be considered to be a block of code with exactly one entry point and one exit point.

When writing nested selection structures, it is important to maintain a block structure, even if the bodies of the blocks consist of only a single instruction. It is often very helpful to write the algorithm in pseudo-code or a high-level language, such as C or Java, before converting it to assembly. Prolific commenting of the code is also strongly encouraged.

5.3 Iteration

Iteration involves the transfer of control from a statement in a sequence to a previous statement in the sequence. The simplest type of iteration is the unconditional loop, also known as the infinite loop. This type of loop may be used in programs or tasks that should continue running indefinitely. Listing 5.6 shows an ARM assembly fragment containing an unconditional loop. Few high-level languages provide a true unconditional loop, but the high-level programmer can achieve a similar effect by using a conditional loop and specifying a condition that always evaluates to true.
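In C, for example, the usual way to get this effect is a loop whose condition is always true. The sketch below uses while (1); the counter and break are added only so the example terminates for demonstration purposes:

```c
/* An "unconditional" loop in C: while (1) never exits on its
 * own. Here we break out manually after five iterations so
 * that the example terminates. */
int spin_and_count(void)
{
    int count = 0;
    while (1) {              /* condition is always true */
        count++;
        if (count == 5)
            break;           /* the only exit is an explicit break */
    }
    return count;
}
```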

f05-07-9780128036983
Listing 5.6 Unconditional loop in ARM assembly.

5.3.1 Pre-Test Loop

A pre-test loop is a loop in which a test is performed before the block of instructions forming the loop body is executed. If the test evaluates to true, then the loop body is executed. The last instruction in the loop body is a branch back to the beginning of the test. If the test evaluates to false, then execution branches to the first instruction following the loop body. All structured programming languages have a pre-test loop construct. For example, in C, the pre-test loop is called a while loop. In assembly, a pre-test loop is constructed very similarly to an if-then statement. The only difference is that it includes an additional branch instruction at the end of the sequence of instructions that form the body. Listing 5.7 shows a pre-test loop in ARM assembly.

f05-08-9780128036983
Listing 5.7 Pre-test loop in ARM assembly.
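The branch structure of a pre-test loop can be mirrored in C using labels and goto statements. This sketch shows that a pre-test loop is really an if-then whose body ends with an unconditional branch back to the test; the function and label names are illustrative:

```c
/* Sums the integers 0..n-1 using explicit labels and gotos
 * that mirror the branch structure of an assembly pre-test
 * loop. */
int sum_below(int n)
{
    int i = 0, sum = 0;
test:
    if (!(i < n))        /* test; if false, branch past the body */
        goto done;
    sum += i;            /* loop body */
    i++;
    goto test;           /* unconditional branch back to the test */
done:
    return sum;
}
```

Note that when n is zero the body never executes, which is the defining property of a pre-test loop.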

5.3.2 Post-Test Loop

In a post-test loop, the test is performed after the loop body is executed. If the test evaluates to true, then execution branches to the first instruction in the loop body. Otherwise, execution continues sequentially. Most structured programming languages have a post-test loop construct. For example, in C, the post-test loop is called a do-while loop. Listing 5.8 shows a post-test loop in ARM assembly. The body of a post-test loop will always be executed at least once.

f05-09-9780128036983
Listing 5.8 Post-test loop in ARM assembly.
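The same goto technique shows the post-test shape: one label at the top of the body and a single conditional branch at the bottom, so the body always runs at least once. A sketch with illustrative names:

```c
/* Post-test loop: the body executes before the test, so even
 * n == 0 produces one iteration. */
int count_iterations(int n)
{
    int i = 0, iterations = 0;
body:
    iterations++;        /* loop body */
    i++;
    if (i < n)
        goto body;       /* conditional branch back to the body */
    return iterations;
}
```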

5.3.3 For Loop

Many structured programming languages have a for loop construct, which is a type of counting loop. The for loop is not essential; it is included as a matter of syntactic convenience. In some cases, a for loop is easier to write and understand than an equivalent pre-test or post-test loop. However, with the addition of an if-then construct, any loop can be implemented as a pre-test loop. The following sections show how loops can be converted from one form to another.

Pre-test conversion

Listing 5.9 shows a simple C program with a for loop. The program prints “Hello World” 10 times, appending an integer to the end of each line.

f05-10-9780128036983
Listing 5.9 for loop in C.

In order to write an equivalent program in assembly, the programmer must first rewrite the for loop as a pre-test loop. Listing 5.10 shows the program rewritten so that it is easier to translate into assembly. Note that the initialization of the loop variable has been moved to its own line before the while statement. Also, the loop variable is modified on the last line of the loop body. This is a straightforward conversion from one type of loop to another type. Listing 5.11 shows a translation of the pre-test loop structure into ARM assembly.

f05-11-9780128036983
Listing 5.10 for loop rewritten as a pre-test loop in C.
f05-12-9780128036983
Listing 5.11 Pre-test loop in ARM assembly.
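The conversion can be checked mechanically: the for loop's initialization moves above the while, and its increment becomes the last statement of the body. This sketch runs both forms and shows they accumulate the same result (summing rather than printing, so the two results are easy to compare; function names are illustrative):

```c
/* The for form: initialization, test, and increment all
 * appear in the loop header. */
int sum_for(void)
{
    int sum = 0;
    for (int i = 0; i < 10; i++)
        sum += i;
    return sum;
}

/* The equivalent pre-test (while) form. */
int sum_while(void)
{
    int sum = 0;
    int i = 0;               /* initialization moved before the loop */
    while (i < 10) {
        sum += i;
        i++;                 /* increment moved to end of the body */
    }
    return sum;
}
```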

Post-test conversion

If the programmer can guarantee that the body of a for loop will always execute at least once, then the for loop can be converted to an equivalent post-test loop. This form of loop is more efficient, because the loop control variable is tested one fewer time than in a pre-test loop. Also, a post-test loop requires only one label and one conditional branch instruction, whereas a pre-test loop requires two labels, a conditional branch, and an unconditional branch.

Since the loop in Listing 5.9 always executes the body exactly 10 times, we know that the body will always execute at least once. Therefore, the loop can be converted to a post-test loop. Listing 5.12 shows the program rewritten as a post-test loop so that it is easier to translate into assembly. Note that, as in the previous example, the initialization of the loop variable has been moved to its own line before the do-while loop, and the loop variable is modified on the last line of the loop body. This post-test version will produce the same output as the pre-test version. This is a straightforward conversion from one type of loop to an equivalent type. Listing 5.13 shows a straightforward translation of the post-test loop structure into ARM assembly.

f05-13-9780128036983
Listing 5.12 for loop rewritten as a post-test loop in C.
f05-14-9780128036983
Listing 5.13 Post-test loop in ARM assembly.

5.4 Subroutines

A subroutine is a sequence of instructions to perform a specific task, packaged as a single unit. Depending on the particular programming language, a subroutine may be called a procedure, a function, a routine, a method, a subprogram, or some other name. Some languages, such as Pascal, make a distinction between functions and procedures. A function must return a value and must not alter its input arguments or have any other side effects (such as producing output or changing static or global variables). A procedure returns no value, but may alter the value of its arguments or have other side effects.

Other languages, such as C, make no distinction between procedures and functions. In these languages, functions may be described as pure or impure. A function is pure if:

1. the function always evaluates the same result value when given the same argument value(s), and

2. evaluation of the result does not cause any semantically observable side effect or output.

The first condition implies that the result of the function cannot depend on any hidden information or state that may change as program execution proceeds, or between different executions of the program, nor can it depend on any external input from I/O devices. The result value of a pure function does not depend on anything other than the argument values. If the function returns multiple result values, then these two conditions must apply to all returned values. Otherwise the function is impure. Another way to state this is that impure functions have side effects while pure functions have no side effects.
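A minimal C illustration of the distinction (the function names are illustrative): the first function depends only on its argument and has no side effects, while the second reads and modifies hidden state, so two calls with the same argument can return different results:

```c
/* Pure: the result depends only on the argument, and there
 * are no side effects. */
int square(int x)
{
    return x * x;
}

/* Impure: depends on and modifies hidden state, so the same
 * argument can yield different results on successive calls. */
static int call_count = 0;
int square_counted(int x)
{
    call_count++;                 /* side effect: hidden state changes */
    return x * x + call_count;    /* result depends on hidden state */
}
```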

Assembly language does not impose any distinction between procedures and functions, pure or impure. Although every assembly language will provide a way to call subroutines and return from them, it is up to the programmer to decide how to pass arguments to the subroutines and how to pass return values back to the section of code that called the subroutine. Once again, the expert assembly programmer will use structured programming concepts to write efficient, readable, debuggable, and maintainable code.

5.4.1 Advantages of Subroutines

Subroutines help programmers to design reliable programs by decomposing a large problem into a set of smaller problems. It is much easier to write and debug a set of small code pieces than it is to work on one large piece of code. Careful use of subroutines will often substantially reduce the cost of developing and maintaining a large program, while increasing its quality and reliability. The advantages of breaking a program into subroutines include:

 enabling reuse of code across multiple programs,

 reducing duplicate code within a program,

 enabling the programming task to be divided between several programmers or teams,

 decomposing a complex programming task into simpler steps that are easier to write, understand, and maintain,

 enabling the programming task to be divided into stages of development, to match various stages of a project, and

 hiding implementation details from users of the subroutine (a programming principle known as information hiding).

5.4.2 Disadvantages of Subroutines

There are two minor disadvantages in using subroutines. First, invoking a subroutine (versus using in-line code) imposes overhead. The arguments to the subroutine must be put into some known location where the subroutine can find them. If the subroutine is a function, then the return value must be put into a known location where the caller can find it. Also, a subroutine typically requires some standard entry and exit code to manage the stack and save and restore the return address.

In most languages, the cost of using subroutines is hidden from the programmer. In assembly, however, the programmer is often painfully aware of the cost, since they have to explicitly write the entry and exit code for each subroutine, and must explicitly write the instructions to pass the data into the subroutine. However, the advantages usually outweigh the costs. Assembly programs can get very large and failure to modularize the code by using subroutines will result in code that cannot be understood or debugged, much less maintained and extended.

5.4.3 Standard C Library Functions

Subroutines may be defined within a program, or a set of subroutines may be packaged together in a library. Libraries of subroutines may be used by multiple programs, and most languages provide some built-in library functions. The C language has a very large set of functions in the C standard library. All of the functions in the C standard library are available to any program that has been linked with the C standard library. Even assembly programs can make use of this library. Linking is done automatically when gcc is used to assemble the program source. All that the programmer needs to know is the name of the function and how to pass arguments to it.

5.4.4 Passing Arguments

Listing 5.14 shows a very simple C program which reads an integer from standard input using scanf and prints the integer to standard output using printf. An equivalent program written in ARM assembly is shown in Listing 5.15. These examples show how arguments can be passed to subroutines in C and equivalently in assembly language.

f05-15-9780128036983
Listing 5.14 Calling scanf and printf in C.
f05-16-9780128036983
Listing 5.15 Calling scanf and printf in ARM assembly.

All processor families have their own standard methods, or function calling conventions, which specify how arguments are passed to subroutines and how function values are returned. The function call standard allows programmers to write subroutines and libraries of subroutines that can be called by other programmers. In most cases, the function calling standards are not enforced by hardware, but assembly programmers and compiler writers conform to the standards in order to make their code accessible to other programmers. The basic subroutine calling rules for the ARM processor are simple:

 The first four arguments go in registers r0-r3.

 Any remaining arguments are pushed to the stack.

If the subroutine returns a value, then it is stored in r0 before the function returns to its caller. Calling a subroutine in ARM assembly usually requires several lines of code. The number of lines required depends on how many arguments the subroutine requires and where the data for those arguments are stored. Some variables may already be in the correct register. Others may need to be moved from one register to another. Still others may need to be pushed onto the stack. Careful programming is required to minimize the amount of work that must be done just to move the subroutine arguments into their required locations.

The ARM register set was introduced in Chapter 3. Some registers have special purposes that are dictated by the hardware design. Others have special purposes that are dictated by programming conventions. Programmers follow these conventions so that their subroutines are compatible with each other. These conventions are simply a set of rules for how registers should be used. In ARM assembly, all registers have alternate names which can be used to help remember the rules for using them. Fig. 5.1 shows an expanded view of the ARM registers, including their alternate names and conventional use.

f05-01-9780128036983
Figure 5.1 ARM user program registers.

Registers r0-r3 are also known as a1-a4, because they are used for passing arguments to subroutines. Registers r4-r11 are also known as v1-v8, because they are used for holding local variables in a subroutine. As mentioned in Section 3.2, register r11 can also be referred to as fp because it is used by the C compiler to track the stack frame, unless the code is compiled using the -fomit-frame-pointer command line option.

The intra-procedure scratch register, r12, is used by the C library when calling dynamically linked functions. If a subroutine does not call any C library functions, then it can use r12 as another register to store local variables. If a C library function is called, it may change the contents of r12. Therefore, if r12 is being used to store a local variable, it should be saved to another register or to the stack before a C library function is called.

5.4.5 Calling Subroutines

The stack pointer (sp), link register (lr), and program counter (pc), along with the argument registers, are all involved in performing subroutine calls. The calling subroutine must place arguments in the argument registers, and possibly on the stack as well. Placing the arguments in their proper locations is known as marshaling the arguments. After marshaling the arguments, the calling subroutine executes the bl instruction, which modifies the program counter and link register. The bl instruction copies the return address (the address of the instruction immediately following the bl) into the link register, then loads the program counter with the address of the first instruction in the subroutine that is being called. The CPU then fetches and executes its next instruction from the address in the program counter, which is the first instruction of the called subroutine.

Our first examples of calling a function will involve the printf function from the C standard library. The printf function can be a bit confusing at first, but it is an extremely useful and flexible function for printing formatted output. The printf function examines its first argument to determine how many other arguments have been passed to it. The first argument is a format string, which is a null-terminated ASCII string. The format string may include conversion specifiers, which start with the % character. For each conversion specifier, printf assumes that an argument has been passed in the correct register or location on the stack. The argument is retrieved, converted according to the specified format, and printed. Commonly used specifiers include %d to print the matching argument as a signed integer in base 10, %X to print the matching argument as an integer in hexadecimal, %c to print the matching argument as an ASCII character, and %s to print a zero-terminated string. The integer specifiers can include an optional width and zero-padding specification. For example, %8X will print an integer in hexadecimal, using 8 characters and padding on the left with spaces. The format string %08X will also print an integer in hexadecimal using 8 characters, but will pad on the left with zeros. Similarly, %15d can be used to print an integer in base 10 using spaces to pad the number up to 15 characters, while %015d will print an integer in base 10 using zeros to pad up to 15 characters.
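The width and padding rules can be verified with snprintf, which applies the same conversions as printf but writes the result into a buffer. A small sketch (the helper function is illustrative):

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if formatting value with the printf-style format
 * string fmt produces exactly the string expected. */
int format_matches(const char *fmt, int value, const char *expected)
{
    char buf[64];
    snprintf(buf, sizeof buf, fmt, value);
    return strcmp(buf, expected) == 0;
}
```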

Listing 5.16 shows a call to printf in C. The printf function requires one argument, and can accept more than one. In this case, there is only one argument, the format string. Listing 5.17 shows an equivalent call made in ARM assembly language. The single argument is loaded into r0 in conformance with the ARM subroutine calling convention.

f05-17-9780128036983
Listing 5.16 Simple function call in C.
f05-18-9780128036983
Listing 5.17 Simple function call in ARM assembly.

Passing arguments in registers

Listing 5.18 shows a call to printf in C having four arguments. The format string is the first argument. The format string contains three conversion specifiers, and is followed by three more arguments. Arguments are matched to conversion specifiers according to their positions. The type of each argument matches the type indicated in the conversion specifier. The first conversion specifier is applied to the second argument, the second conversion specifier is applied to the third argument, and the third conversion specifier is applied to the fourth argument. The \%d conversion specifiers indicate that the arguments are to be interpreted as integers and printed in base 10.

f05-19-9780128036983
Listing 5.18 A larger function call in C.

Listing 5.19 shows an equivalent call made in ARM assembly language. The arguments are loaded into r0-r3 in conformance with the ARM subroutine calling convention. Note that we assume that formatstr has previously been defined using a .asciz or .string assembler directive or equivalent method. As long as there are four or fewer arguments that must be passed, they can all fit in registers r0-r3 (a.k.a. a1-a4), but when there are more arguments, things become a little more complicated. Any remaining arguments must be passed on the program stack, using the stack pointer r13. Care must be taken to ensure that the arguments are pushed to the stack in the proper order. Also, after the function call, the arguments must be removed from the stack, so that the stack pointer is restored to its original value.

f05-20-9780128036983
Listing 5.19 A larger function call in ARM assembly.

Passing arguments on the stack

Listing 5.20 shows a call to printf in C having more than four arguments. The format string is the first argument. The format string contains five conversion specifiers, which implies that the format string must be followed by five additional arguments. Arguments are matched to conversion specifiers according to their positions. The type of each argument matches the type indicated in the conversion specifier. The first conversion specifier is applied to the second argument, the second conversion specifier is applied to the third argument, the third conversion specifier is applied to the fourth argument, etc. The \%d conversion specifiers indicate that the arguments are to be interpreted as integers and printed in base 10.

f05-21-9780128036983
Listing 5.20 A function call using the stack in C.

Listing 5.21 shows an equivalent call made in ARM assembly language. Since there are six arguments, the last two must be pushed to the program stack. The arguments are loaded into r0 one at a time, and then the register pre-indexed addressing mode is used to subtract four bytes from the stack pointer and store the argument at the top of the stack. Note that the sixth argument is pushed to the stack first, followed by the fifth argument. The remaining arguments are loaded in r0-r3. Note that, as before, we assume that formatstr has previously been defined using a .asciz or .string assembler directive.

f05-22-9780128036983
Listing 5.21 A function call using the stack in ARM assembly.

Listing 5.22 shows how the fifth and sixth arguments can be pushed to the stack using a single stmfd instruction. The sixth argument is loaded into r3 and the fifth argument is loaded into r0, then the stmfd instruction is used to store them on the stack and adjust the stack pointer. A little care must be taken to ensure that the arguments are stored in the correct order on the stack. Remember that the stmfd instruction will always push the lowest-numbered register to the lowest address, and the stack grows downward. Therefore, r3, the sixth argument, will be pushed onto the stack first, making it grow downward by four bytes. Next, r0 is pushed, making the stack grow downward by four more bytes. As in the previous example, the remaining four arguments are loaded into a1-a4.

f05-23-9780128036983
Listing 5.22 A function call using stm to push arguments onto the stack.

After the printf function is called, the fifth and sixth arguments must be popped from the stack. If those values are no longer needed, then there is no need to load them into registers. The quickest way to pop them from the stack is to simply adjust the stack pointer back to its original value. In this case, we pushed two arguments onto the stack, using a total of eight bytes. Therefore, all we need to do is add eight to the stack pointer, thereby restoring its original value.

5.4.6 Writing Subroutines

We have looked at the conventions that are followed for calling functions. Now we will examine these same conventions from the point of view of the function being called. Because of the calling conventions, the programmer writing a function can assume that

 the first four arguments are in r0-r3,

 any additional arguments can be accessed with ldr rd, [sp, #offset],

 the calling function will remove arguments from the stack, if necessary,

 if the function return type is not void, then they must ensure that the return value is in r0 (and possibly r1, r2, r3), and

 the return address will be in lr.

Also because of the conventions, there are certain registers that can be used freely while others must be preserved or restored so that the calling function can continue operating correctly. Registers which can be used freely are referred to as volatile, and registers which must be preserved or restored before returning are referred to as non-volatile. When writing a subroutine (function),

 registers r0-r3 and r12 are volatile,

 registers r4-r11 and r13 are non-volatile (they can be used, but their contents must be restored to their original value before the function returns),

 register r14 can be used by the function, but its contents must be saved so that the return address can be loaded into r15 when the function returns to its caller, and

 if the function calls another function, then it must save register r14 either on the stack or in a non-volatile register before making the call.

Listing 5.23 shows a small C function that simply returns the sum of its six arguments. The ARM assembly version of that function is shown in Listing 5.24. Note that on line 5, the fifth argument is loaded from the stack, and on line 7, the sixth argument is loaded in a similar way, using an offset from the stack pointer. If the calling function has followed the conventions, then the fifth and sixth arguments will be where they are expected to be in relation to the stack pointer.

f05-24-9780128036983
Listing 5.23 A small function in C.
f05-25-9780128036983
Listing 5.24 A small function in ARM assembly.

5.4.7 Automatic Variables

In block-structured high-level languages, an automatic variable is a variable that is local to a block of code and not declared with static duration. It has a lifetime that lasts only as long as its block is executing. Automatic variables can be stored in one of two ways:

1. the stack is temporarily adjusted to hold the variable, or

2. the variable is held in a register during its entire life.

When writing a subroutine in assembly, it is the responsibility of the programmer to decide what automatic variables are required and where they will be stored. In high-level languages this decision is usually made by the compiler. In some languages, including C, it is possible to request that an automatic variable be held in a register. The compiler will attempt to comply with the request, but it is not guaranteed. Listing 5.25 shows a small function which requests that one of its variables be kept in a register instead of on the stack.

f05-26-9780128036983
Listing 5.25 A small C function with a register variable.

Listing 5.26 shows how the function could be implemented in assembly. Note that the array of integers consumes 80 bytes of storage on the stack, and could not possibly fit into the registers available on the ARM processor. However, the loop control variable can easily be stored in one of the registers for the duration of the function. Also notice that on line 1 the storage for the array is allocated simply by adjusting the stack pointer, and on line 9 the storage is released by restoring the stack pointer to its original contents. It is critical that the stack pointer be restored, no matter how the function returns. Otherwise, the calling function will probably mysteriously fail. For this reason, each function should have exactly one block of instructions for returning. If the function needs to return from some location other than the end, then it should branch to the return block rather than returning directly.

f05-27-9780128036983
Listing 5.26 Automatic variables in ARM assembly.

5.4.8 Recursive Functions

A function that calls itself is said to be recursive. Certain problems are easy to implement recursively, but are more difficult to solve iteratively. A problem exhibits recursive behavior when it can be defined by two properties:

1. a simple base case (or cases), and

2. a set of rules that reduce all other cases toward the base case.

For example, we can define a person’s ancestors recursively as follows:

1. one’s parents are one’s ancestors (base case),

2. the ancestors of one’s ancestors are also one’s ancestors (recursion step).

Recursion is a very powerful concept in programming. Many functions are naturally recursive, and can be expressed very concisely in a recursive way. Numerous mathematical axioms are based upon recursive rules. For example, the formal definition of the natural numbers by the Peano axioms can be formulated as:

1. 0 is a natural number, and

2. each natural number has a successor, which is also a natural number.

Using one base case and one recursive rule, it is possible to generate the set of all natural numbers. Other recursively defined mathematical objects include functions and sets.

Listing 5.27 shows the C code for a small program which uses recursion to reverse the order of characters in a string. The base case where recursion ends is when there are fewer than two characters remaining to be swapped. The recursive rule is that the reverse of a string can be created by swapping the first and last characters and then reversing the string between them. In short, a string is reversed if:

f05-28-9780128036983
Listing 5.27 A C program that uses recursion to reverse a string.

1. the string has a length of zero or one character, or

2. the first and last characters have been swapped and the remaining characters have been reversed.

In Listing 5.27, line 3 checks for the base case. If the string has not been reversed according to the first rule, then the second rule is applied. Lines 5–7 swap the first and last characters, and line 8 recursively reverses the characters between them.
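A minimal C sketch of this recursive scheme, assuming a three-argument signature reverse(s, first, last) with inclusive indices (the exact signature used in Listing 5.27 may differ):

```c
/* Recursively reverses s between inclusive indices first and
 * last. Base case: fewer than two characters remain to swap. */
void reverse(char *s, int first, int last)
{
    if (first >= last)               /* base case: 0 or 1 chars left */
        return;
    char tmp = s[first];             /* swap first and last chars */
    s[first] = s[last];
    s[last]  = tmp;
    reverse(s, first + 1, last - 1); /* reverse what lies between */
}
```

Each recursive call shrinks the range by two, so the recursion is guaranteed to reach the base case.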

Listing 5.28 shows how the reverse function can be implemented using recursion in ARM assembly. Line 1 saves the link register to the stack and decrements the stack pointer. Next, storage is allocated for an automatic variable. Lines 3 and 4 test for the base case. If the current case is the base case, then the function simply returns (restoring the stack as it goes). Otherwise, the first and last characters are swapped in lines 5 through 10 and a recursive call is made in lines 11 through 13.

f05-29-9780128036983
Listing 5.28 ARM assembly implementation of the reverse function.

The code in Listing 5.28 can be made a bit more efficient. First, the test for the base case can be performed before anything else is done, as shown in Listing 5.29. Also, the local variable tmp can be stored in a volatile register rather than stored on the stack, because it is only needed for lines 4 through 8. It is not needed after the recursive call, so there is really no need to preserve it on the stack. This means that our function can use half as much stack space and will run much faster. This further refined version is shown in Listing 5.30. This version uses ip (r12) as the tmp variable instead of using the stack.

f05-30-9780128036983
Listing 5.29 Better implementation of the reverse function.
f05-31-9780128036983
Listing 5.30 Even better implementation of the reverse function.

The previous examples used the concept of an array of characters to access the string that is being reversed. Listing 5.31 shows how this problem can be solved in C using pointers to the first and last characters rather than array indices. This version only has two parameters in the reverse function, and uses pointer dereferencing rather than array indexing to access each character. Other than that difference, it works the same as the original version. Listing 5.32 shows how the reverse function can be implemented efficiently in ARM assembly. This implementation has the same number of instructions as the previous version, but lines 4 through 7 use a different addressing mode. On the ARM processor, the pointer method and the array index method are equally efficient. However, many processors do not have the rich set of addressing modes available on the ARM. On those processors, the pointer method may be significantly more efficient.

f05-32-9780128036983
Listing 5.31 String reversing in C using pointers.
f05-33-9780128036983
Listing 5.32 String reversing in assembly using pointers.

5.5 Aggregate Data Types

An aggregate data item can be referenced as a single entity, and yet consists of more than one piece of data. Aggregate data types are used to keep related data together, so that the programmer’s job becomes easier. Some examples of aggregate data are arrays, structures or records, and objects. In most programming languages, aggregate data types can be defined to create higher-level structures. Most high-level languages allow aggregates to be composed of basic types as well as other aggregates. Proper use of structured data helps to make programs less complicated and easier to understand and maintain.

In high-level languages, there are several benefits to using aggregates. Aggregates make the relationships between data clear, and allow the programmer to perform operations on blocks of data. Aggregates also make passing parameters to functions simpler and easier to read.

5.5.1 Arrays

The most common aggregate data type is an array. An array contains zero or more values of the same data type, such as characters, integers, floating point numbers, or fixed point numbers. An array may also contain values of another aggregate data type. Every element in an array must have the same type. Each data item in an array can be accessed by its array index.

Listing 5.33 shows how an array can be allocated and initialized in C. Listing 5.34 shows the equivalent code in ARM assembly. Note that in this case, the scaled register offset addressing mode was used to access each element in the array. This mode is often convenient when the size of each element in the array is an integer power of 2. If that is not the case, then it may be necessary to use a different addressing mode. An example of this will be given in Section 5.5.3.

Listing 5.33 Initializing an array of integers in C.
Listing 5.34 Initializing an array of integers in assembly.
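For illustration, the address arithmetic behind the scaled register offset mode can be sketched in C (this sketch assumes 4-byte ints, and `load_scaled` is a made-up name): an access like `a[i]` becomes a load from base + (i << 2), which is exactly what an instruction such as `ldr r2, [r0, r1, lsl #2]` computes.

```c
#include <assert.h>
#include <stdint.h>

/* The scaled register offset mode computes  base + (index << scale).
   For an array of 4-byte ints the scale is 2, since 4 == 1 << 2. */
int load_scaled(int *base, uintptr_t i)
{
    return *(int *)((char *)base + (i << 2));  /* same address as &base[i] */
}
```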

5.5.2 Structured Data

The second common aggregate data type is implemented as the struct in C or the record in Pascal. It is commonly referred to as a structured data type or a record. This data type can contain multiple fields. The individual fields in the structured data may also be referred to as structured data elements, or simply elements. In most high-level languages, each element of a structured data type may be one of the base types, an array type, or another structured data type. Listing 5.35 shows how a struct can be declared, allocated, and initialized in C. Listing 5.36 shows the equivalent code in ARM assembly.

Listing 5.35 Initializing a structured data type in C.
Listing 5.36 Initializing a structured data type in ARM assembly.

Care must be taken when using assembly to access data structures that were declared in higher-level languages such as C and C++. The compiler will typically pad a data structure to ensure that the data fields are aligned for efficiency. On most systems, it is more efficient for the processor to access word-sized data if the data is aligned to a word boundary. Some processors simply cannot load or store a word from an address that is not on a word boundary, and attempting to do so will result in an exception. The assembly programmer must somehow determine the relative address of each field within the higher-level language structure. One way that this can be accomplished in C is by writing a small function which prints out the offsets to each field in the C structure. The offsets can then be used to access the fields of the structure from assembly language. Another method for finding the offsets is to run the program under a debugger and examine the data structure.
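Such an offset-printing helper can be written with the standard offsetof macro from <stddef.h>. The structure below is a made-up example; any real structure’s offsets would be printed the same way:

```c
#include <stdio.h>
#include <stddef.h>
#include <assert.h>

/* A hypothetical structure; the compiler will normally pad after
   the char so that y lands on a word boundary. */
struct sample {
    int  x;
    char c;
    int  y;
};

/* Print each field's offset; the assembly code can then use these
   numbers as immediate offsets when accessing the structure. */
void print_offsets(void)
{
    printf("x at offset %zu\n", offsetof(struct sample, x));
    printf("c at offset %zu\n", offsetof(struct sample, c));
    printf("y at offset %zu\n", offsetof(struct sample, y));
    printf("total size  %zu\n", sizeof(struct sample));
}
```

On a typical ARM ABI the output would show three bytes of padding between c and y, which is exactly the information the assembly programmer needs.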

5.5.3 Arrays of Structured Data

It is often useful to create arrays of structured data. For example, a color image may be represented as a two-dimensional array of pixels, where each pixel consists of three integers which specify the amount of red, green, and blue that are present in the pixel. Typically, each of the three values is represented using an unsigned eight bit integer. Image processing software often adds a fourth value, α, specifying the transparency of each pixel.

Listing 5.37 shows how an array of pixels can be allocated and initialized in C. The listing uses the malloc() function from the C standard library to allocate storage for the pixels from the heap (see Section 1.4). Note that the code uses the sizeof operator to determine how many bytes of memory are consumed by a single pixel, then multiplies that by the width and height of the image. Listing 5.38 shows the equivalent code in ARM assembly.

Listing 5.37 Initializing an array of structured data in C.
Listing 5.38 Initializing an array of structured data in assembly.
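The allocation described above can be sketched in C (the pixel field names and the helper name `alloc_image` are assumptions, not the book's):

```c
#include <stdlib.h>
#include <assert.h>

/* One pixel as described in the text: three 8-bit color values. */
typedef struct {
    unsigned char red, green, blue;
} pixel;

/* Allocate heap storage for a width x height image, multiplying the
   size of one pixel by the number of pixels, as Listing 5.37 does
   with malloc() and sizeof. */
pixel *alloc_image(size_t width, size_t height)
{
    return malloc(sizeof(pixel) * width * height);
}
```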

Note that the code in Listing 5.38 is far from optimal. It can be greatly improved by combining the two loops into one loop. This will remove the need for the multiply on line 28 and the addition on line 29, and will simplify the code structure. An additional improvement would be to increment the single loop counter by three on each loop iteration, making it very easy to calculate the pointer for each pixel. Listing 5.39 shows the ARM assembly implementation with these optimizations.

Listing 5.39 Improved initialization in assembly.

Although the implementation shown in Listing 5.39 is more efficient than the previous version, there are several more improvements that can be made. If we consider that the goal of the code is to allocate some number of bytes and initialize them all to zero, then the code can be written more efficiently. Rather than using three separate store instructions to set 3 bytes to zero on each iteration of the loop, why not use a single store instruction to set four bytes to zero on each iteration? The only problem with this approach is that we must consider the possibility that the array may end in the middle of a word. However, this can be dealt with by using two consecutive loops. The first loop sets one word of the array to zero on each iteration, and the second loop finishes off any remaining bytes. Listing 5.40 shows the results of these additional improvements. This third implementation will run much faster than the previous implementations.

Listing 5.40 Very efficient initialization in assembly.
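The two-loop strategy can be sketched in C (`zero_bytes` is a hypothetical name; the real Listing 5.40 is in ARM assembly):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Zero n bytes using the two-loop strategy of Listing 5.40: one word
   per iteration first, then any leftover bytes.  Assumes p is
   word-aligned, which malloc guarantees. */
void zero_bytes(void *p, size_t n)
{
    uint32_t *w = p;
    while (n >= 4) {            /* first loop: a word at a time */
        *w++ = 0;
        n -= 4;
    }
    unsigned char *b = (unsigned char *)w;
    while (n > 0) {             /* second loop: the remaining bytes */
        *b++ = 0;
        n--;
    }
}
```

The first loop does a quarter of the iterations a byte-at-a-time loop would need, and the second loop runs at most three times.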

5.6 Chapter Summary

Spaghetti code is the bane of assembly programming, but it can easily be avoided. Although assembly language does not enforce structured programming, it does provide the low-level mechanisms required to write structured programs. The assembly programmer must be aware of, and assiduously practice, proper structured programming techniques. The burden of writing properly structured code blocks, with selection structures and iteration structures, lies with the programmer, and failure to apply structured programming techniques will result in code that is difficult to understand, debug, and maintain.

Subroutines provide a way to split programs into smaller parts, each of which can be written and debugged individually. This allows large projects to be divided among team members. In assembly language, defining and using subroutines is not as easy as in higher level languages. However, the benefits usually outweigh the costs. The C library provides a large number of functions. These can be accessed by an assembly program as long as it is linked with the C standard library.

Assembly provides the mechanisms to access aggregate data types. Arrays can be accessed using various addressing modes on the ARM processor. The pre-indexing and post-indexing modes allow array elements to be accessed using pointers, with the pointers being incremented after each element access. Fields in structured data records can be accessed using immediate offset addressing mode. The rich set of addressing modes available on the ARM processor allows the programmer to use aggregate data types more efficiently than on most processors.

Exercises

5.1 What does it mean for a register to be volatile? Which ARM registers are considered volatile according to the ARM function calling convention?

5.2 Fully explain the differences between static variables and automatic variables.

5.3 In ARM assembly language, write a function that is equivalent to the following C function.

[Code listing not shown.]

5.4 What are the two places where an automatic variable can be stored?

5.5 You are writing a function and you decided to use registers r4 and r5 within the function. Your function will not call any other functions; it is self-contained. Modify the following skeleton structure to ensure that r4 and r5 can be used within the function and are restored to comply with the ARM standards, but without unnecessary memory accesses.

[Code listing not shown.]

5.6 Convert the following C program to ARM assembly, using a post-test loop:

[Code listing not shown.]

5.7 Write a complete ARM function to shift a 64-bit value left by any given amount between 0 and 63 bits. The function should expect its arguments to be in registers r0, r1, and r2. The lower 32 bits of the value are passed in r0, the upper 32 bits of the value are passed in r1, and the shift amount is passed in r2.

5.8 Write a complete subroutine in ARM assembly that is equivalent to the following C subroutine.

[Code listing not shown.]

5.9 Write a complete function in ARM assembly that is equivalent to the following C function.

[Code listing not shown.]

5.10 Write an ARM assembly function to calculate the average of an array of integers, given a pointer to the array and the number of items in the array. Your assembly function must implement the following C function prototype:

int average(int *array, int number_of_items);

Assume that the processor does not support the div instruction, but there is a function available to divide two integers. You do not have to write this function, but you may need to call it. Its C prototype is:

int divide(int numerator, int denominator);

5.11 Write a complete function in ARM assembly that is equivalent to the following C function. Note that a and b must be allocated on the stack, and their addresses must be passed to scanf so that it can place their values into memory.

[Code listing not shown.]

5.12 The factorial function can be defined as:

x! = 1, if x ≤ 1
x! = x × (x − 1)!, otherwise

The following C program repeatedly reads x from the user and calculates x!. It quits when it reads end-of-file or when the user enters a negative number or something that is not an integer.
Write this program in ARM assembly.

[Code listing not shown.]

5.13 For large x, the factorial function is slow. However, a lookup table can be added to the function to improve average performance. This technique is commonly known as memoization or tabling, but is sometimes called dynamic programming. The following C implementation of the factorial function uses memoization. Modify your ARM assembly program from the previous problem to include memoization.

[Code listing not shown.]
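The memoized C listing is not reproduced in this extract; the following sketch implements the technique the text describes (the table size and the unsigned int result type are assumptions; 12! is the largest factorial that fits in a 32-bit unsigned integer):

```c
#include <assert.h>

/* Factorial with memoization: once x! has been computed, it is
   stored in a table and returned directly on later calls. */
unsigned int factorial(int x)
{
    static unsigned int table[13];   /* zero means "not yet computed" */
    if (x <= 1)
        return 1;
    if (x <= 12 && table[x] != 0)
        return table[x];             /* cache hit: no recursion needed */
    unsigned int result = (unsigned int)x * factorial(x - 1);
    if (x <= 12)
        table[x] = result;           /* remember the answer */
    return result;
}
```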
Chapter 6

Abstract Data Types

Abstract

This chapter extends the coverage of structured programming to include abstract data types (ADT). It begins with the definition of an abstract data type and a small example of an ADT that could be used to read, process, and write Netpbm images. The next section introduces an ADT written in C to perform word frequency counts, and shows how performance can be greatly improved by using better algorithms and/or by writing some functions in assembly language. It also shows how a binary tree structure created by C code can be traversed in assembly language. The chapter ends with an ethics module about the Therac-25 cancer treatment device.

Keywords

Abstract data type; Word frequency count; Binary tree; Index; Sort; Ethics

An abstract data type (ADT) is composed of data and the operations that work on that data. The ADT is one of the cornerstones of structured programming. Proper use of ADTs has many benefits. Most importantly, abstract data types help to support information hiding. A software module hides information by encapsulating the information into a module or other construct which presents an interface. The interface typically consists of the names of data types provided by the ADT and a set of subroutine definitions, or prototypes, for operating on the data types. The implementation of the ADT is hidden from the client code that uses the ADT.

A common use of information hiding is to hide the physical storage layout for data so that if it is changed, the change is restricted to a small subset of the total program. For example, if a three-dimensional point (x,y,z) is represented in a program with three floating point scalar variables, and the representation is later changed to a single array variable of size three, a module designed with information hiding in mind would protect the remainder of the program from such a change.
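A minimal C sketch of this example (all names hypothetical): the module below stores the point as an array of three floats, but because clients go through accessor functions, the earlier three-scalar representation could be restored without changing any client code.

```c
#include <assert.h>

/* The point's representation is private to this module.  It was
   changed from three scalars to an array of three floats; client
   code that uses only the accessors never notices. */
struct point3d {
    float coord[3];     /* was: float x, y, z; */
};
typedef struct point3d Point3D;

void point_set(Point3D *p, float x, float y, float z)
{
    p->coord[0] = x;
    p->coord[1] = y;
    p->coord[2] = z;
}

float point_get_x(const Point3D *p) { return p->coord[0]; }
```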

Information hiding reduces software development risk by shifting the code’s dependency on an uncertain implementation onto a well-defined interface. Clients of the interface perform operations purely through the interface, which does not change. If the implementation changes, the client code does not have to change.

Encapsulating software and data structures behind an interface allows the construction of objects that mimic the behavior and interactions of objects in the real world. For example, a simple digital alarm clock is a real-world object that most people can use and understand. They can understand what the alarm clock does, and how to use it through the provided interface (buttons and display) without needing to understand every part inside of the clock. If the internal circuitry of the clock were to be replaced with a different implementation, people could continue to use it in the same way, provided that the interface did not change.

6.1 ADTs in Assembly Language

As with all other structured programming concepts, ADTs can be implemented in assembly language. In fact, most high-level compilers convert structured programming code into assembly during compilation. All that is required is that the programmer define the data structure(s), and the set of operations that can be used on the data. Listing 6.1 gives an example of an ADT interface in C. The type Image is not fully defined in the interface. This prevents client software from accessing the internal structure of the image data type. Therefore, programmers using the ADT can modify images only by using the provided functions. Other structured programming and object-oriented programming languages such as C++, Java, Pascal, and Modula-2 provide similar protection for data structures so that client code can access the data structure only through the provided interface. Note that only the pval definition is exposed, indicating to client programs that the red, green, and blue components of a pixel must be a number between 0 and 255. In C, as with other structured programming languages, the implementation of the subroutines can also be hidden by placing them in separate compilation modules. Those modules will have access to the internal structure of the Image data type.

Listing 6.1 Definition of an Abstract Data Type in a C header file.
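Listing 6.1 itself is not reproduced here, but an interface of the kind described might look like the following sketch (the function names are illustrative; only Image and pval are taken from the text):

```c
#include <assert.h>

/* Image is an incomplete type: clients may hold Image pointers,
   but cannot touch the structure's fields. */
typedef struct image Image;

/* pval is exposed: each color component is a number from 0 to 255. */
typedef unsigned char pval;

/* Operations provided by the ADT (prototypes only; the bodies live
   in a separate compilation module that sees the full structure). */
Image *image_create(int width, int height);
void   image_setpixel(Image *img, int x, int y,
                      pval red, pval green, pval blue);
void   image_destroy(Image *img);
```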

Assembly language does not have the ability to define a data structure as such, but it does provide the mechanisms needed to specify the location of each field with respect to the beginning of a data structure, as well as the overall size of the data structure. With a little thought and effort, it is possible to implement ADTs in assembly language. Listing 6.2 shows the private definition of the Image data type, which is included only by the C files that implement it. Listing 6.3 shows how the data structures from the previous listings can be defined in assembly language. With those definitions, any of the functions declared in Listing 6.1 can be written in assembly language.

Listing 6.2 Definition of the image structure may be hidden in a separate header file.
Listing 6.3 Definition of an ADT in Assembly.

6.2 Word Frequency Counts

Counting the frequency of words in written text has several uses. In digital forensics, it can be used to provide evidence as to the author of written communications. Different people have different vocabularies, and use words with differing frequency. Word counts can also be used to classify documents by type. Scientific articles from different fields contain words specific to that field, and historical novels will differ from western novels in word frequency.

Listing 6.4 shows the main function for a simple C program which reads a text file and creates a list of all the words contained in the file, along with their frequency of occurrence. The program has been divided into two parts: the main program, and an ADT which is used to keep track of the words and their frequencies, and to print a table of word frequencies.

Listing 6.4 C program to compute word frequencies.

The interface for the ADT is shown in Listing 6.5. There are several ways that the ADT could be implemented. Note that the interface given in the header file does not show the internal fields of the word list data type. Thus, any file which includes this header is allowed to declare pointers to wordlist data types, but cannot access or modify any internal fields. The list of words could be stored in an array, a linked list, a binary tree, or some other data structure. The subroutines could be implemented in C or in some other language, including assembly. Listing 6.6 shows an implementation in C using a linked list. Note that the function for printing the word frequency list in numerical order has not been implemented. It will be written in assembly language. Since the program is split into multiple files, it is a good idea to use the make utility to build the executable program. A basic makefile is shown in Listing 6.7.

Listing 6.5 C header for the wordlist ADT.
Listing 6.6 C implementation of the wordlist ADT.
Listing 6.7 Makefile for the wordfreq program.

Suppose we wish to implement one of the functions from Listing 6.6 in ARM assembly language. We would delete the function from the C file, create a new file with the assembly version of the function, and modify the makefile so that the new file is included in the program. The header file and the main program file would not require any changes. The header file provides function prototypes that the C compiler uses to determine how parameters should be passed to the functions. As long as our new assembly function conforms to its C header definition, the program will work correctly.

6.2.1 Sorting by Word Frequency

The linked list is created in alphabetical order, but the wl_print_numerical() function is required to print it sorted in reverse order of number of occurrences. There are several ways in which this could be accomplished, with varying levels of efficiency. The possible approaches include, but are not limited to:

Re-ordering the linked list using an insertion sort: This approach creates a complete new list by removing each item, one at a time, from the original list, and inserting it into a new list sorted by the number of occurrences rather than the words themselves. The time complexity for this approach would be O(N²), but it would require no additional storage. However, if the list were later needed in alphabetical order, or any more words were to be added, then it would need to be re-sorted in the original order.

Sorting the linked list using a merge sort algorithm: Merge sort is one of the most efficient sorting algorithms known, and it can readily be applied to data in files and linked lists. The merge sort works as follows:

1. The sub-list size, i, is set to 1.

2. The list is divided into sub-lists, each containing i elements. Each sub-list is assumed to be sorted. (A sub-list of length one is sorted by definition.)

3. The sub-lists are merged together to create a list of sub-lists of size 2i, where each sub-list is sorted.

4. The sub-list size, i, is set to 2i.

5. The process is repeated from step 2 until i ≥ N, where N is the number of items to be sorted.

The time complexity for the merge sort algorithm is O(N log N), which is far more efficient than the insertion sort. This approach would also require no additional storage. However, if the list were later needed in alphabetical order, or any more words were to be added, then it would need to be re-sorted into the original alphabetical order.

Create an index, and sort the index rather than rebuilding the list. Since the number of elements in the list is known, we can allocate an array of pointers. Each pointer in the array is then initialized to point to one element in the linked list. The array forms an index, and the pointers in the array can be re-sorted in any desired order, using any common sorting method such as bubble sort (O(N²)), in-place insertion sort (O(N²)), quick sort (O(N log N)), or others. This approach requires additional storage, but has the advantage that it does not need to modify the original linked list.

There are many other possibilities for re-ordering the list. Regardless of which method is chosen, the main program and the interface (header file) need not be changed. Different implementations of the sorting function can be substituted without affecting any other code.

The wl_print_numerical() function can be implemented in assembly as shown in Listing 6.8. The function operates by re-ordering the linked list using an insertion sort as described above. Listing 6.9 shows the change that must be made to the makefile. Now, when make is run, it compiles the two C files and the assembly file into object files, then links them all together. The C implementation of wl_print_numerical() in list.c must be deleted or commented out, or the linker will emit an error indicating that it found two versions of wl_print_numerical().
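The insertion-sort re-ordering that the assembly version performs can be sketched in C as follows (the node layout and names are assumptions, since Listing 6.8 is not reproduced here):

```c
#include <stddef.h>
#include <assert.h>

/* A word-list node; field names are assumptions, not the book's. */
struct lnode {
    const char   *word;
    int           count;
    struct lnode *next;
};

/* Insertion sort as the text describes: detach each node from the
   original list and insert it into a new list kept in descending
   order of occurrence count. */
struct lnode *sort_by_count(struct lnode *head)
{
    struct lnode *sorted = NULL;
    while (head != NULL) {
        struct lnode *n = head;       /* detach the first node */
        head = head->next;
        struct lnode **p = &sorted;   /* walk to its insertion point */
        while (*p != NULL && (*p)->count > n->count)
            p = &(*p)->next;
        n->next = *p;                 /* splice it in */
        *p = n;
    }
    return sorted;
}
```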

Listing 6.8 ARM assembly implementation of wl_print_numerical().
Listing 6.9 Revised makefile for the wordfreq program.

6.2.2 Better Performance

The word frequency counter, as previously implemented, takes several minutes to count the frequency of words in the author’s manuscript for this textbook on a Raspberry Pi. Most of the time is spent building the list of words and re-sorting the list in order of word frequency. Most of the time for both of these operations is spent in searching for the word in the list before incrementing its count or inserting it in the list. There are more efficient ways to build ordered lists of data.

Since the code is well modularized using an ADT, the internal mechanism of the list can be modified without affecting the main program. A major improvement can be made by changing the data structure from a linked list to a binary tree. Fig. 6.1 shows an example binary tree storing word frequency counts. The time required to insert into a linked list is O(N), but the time required to insert into a binary tree is O(log2 N). To give some perspective, the author’s manuscript for this textbook contains about 125,000 words. Since log2(125,000) < 17, we would expect the linked list implementation to require about 125,000/17 ≈ 7353 times as long as a binary tree implementation to process the manuscript. In reality, there is some overhead to the binary tree implementation. Even with the extra overhead, we should see a significant speedup. Listing 6.10 shows the C implementation using a balanced binary tree instead of a linked list.

Figure 6.1 Binary tree of word frequencies.
Listing 6.10 C implementation of the wordlist ADT using a tree.
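The insert-or-increment operation at the heart of the tree version can be sketched in C as follows (node layout and names are assumptions; unlike Listing 6.10, this sketch does not rebalance the tree):

```c
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* One tree node per distinct word; names are illustrative. */
struct tnode {
    char *word;
    int   count;
    struct tnode *left, *right;
};

/* Insert a word into the tree, or increment its count if it is
   already present.  In a reasonably balanced tree, each comparison
   discards about half of the remaining nodes, giving the O(log2 N)
   insertion time discussed in the text. */
struct tnode *tree_insert(struct tnode *root, const char *word)
{
    if (root == NULL) {                  /* empty spot: make a new node */
        root = calloc(1, sizeof *root);
        root->word = malloc(strlen(word) + 1);
        strcpy(root->word, word);
        root->count = 1;
        return root;
    }
    int cmp = strcmp(word, root->word);
    if (cmp == 0)
        root->count++;                   /* word seen again */
    else if (cmp < 0)
        root->left = tree_insert(root->left, word);
    else
        root->right = tree_insert(root->right, word);
    return root;
}
```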

With the tree implementation, wl_print_numerical() could build a new tree, sorted on the word frequency counts. However, it may be more efficient to build a separate index, and sort the index by word frequency counts. The assembly code will allocate an array of pointers, and set each pointer to one of the nodes in the tree, as shown in Fig. 6.2. Then, it will use a quick sort to sort the pointers into descending order by word frequency count, as shown in Fig. 6.3. This implementation is shown in Listing 6.11.

Figure 6.2 Binary tree of word frequencies with index added.
Figure 6.3 Binary tree of word frequencies with sorted index.
Listing 6.11 ARM assembly implementation of wl_print_numerical() with a tree.
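The index-building and sorting steps shown in Figs. 6.2 and 6.3 can be sketched in C (names and node layout are assumptions; the sketch uses the C library’s qsort in place of a hand-written quick sort):

```c
#include <stdlib.h>
#include <assert.h>

/* A minimal tree node; only the fields the index needs are shown. */
struct node {
    int count;
    struct node *left, *right;
};

/* In-order traversal that stores a pointer to every node in the
   index array, as in Fig. 6.2. */
static void fill_index(struct node *t, struct node ***slot)
{
    if (t == NULL)
        return;
    fill_index(t->left, slot);
    *(*slot)++ = t;                 /* record a pointer to this node */
    fill_index(t->right, slot);
}

/* qsort comparator: descending by count, as in Fig. 6.3.
   (Subtraction is safe here for modest word counts.) */
static int by_count_desc(const void *a, const void *b)
{
    const struct node *x = *(const struct node *const *)a;
    const struct node *y = *(const struct node *const *)b;
    return y->count - x->count;
}

/* Build the index and sort it; the tree itself is not modified. */
void build_sorted_index(struct node *root, struct node **index, size_t n)
{
    struct node **slot = index;
    fill_index(root, &slot);
    qsort(index, n, sizeof *index, by_count_desc);
}
```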

The tree-based implementation gets most of its speed improvement from using two O(N log N) algorithms to replace O(N²) algorithms. These examples show how a small part of a program can be implemented in assembly language, and how to access C data structures from assembly language. The functions could just as easily have been written in C rather than assembly, without greatly affecting performance. Later chapters will show examples where the assembly implementation does have significantly better performance than the C implementation.

6.3 Ethics Case Study: Therac-25

The Therac-25 was a device designed for radiation treatment of cancer. It was produced by Atomic Energy of Canada Limited (AECL), which had previously produced the Therac-6 and Therac-20 units in partnership with CGR of France. It was capable of treating tumors close to the skin surface using electron beam therapy, but could also be configured for Megavolt X-ray therapy to treat deeper tumors. The X-ray therapy required the use of a tungsten radiation shield to limit the area of the body that was exposed to the potentially lethal radiation produced by the device.

The Therac-25 used a double pass accelerator, which provided more power, in a smaller space, at less cost, compared to its predecessors. The second major innovation was that computer control was a central part of the design, rather than an add-on component as in its predecessors. Most of the hardware safety interlocks that were integral to the designs of the Therac-6 and Therac-20 were seen as unnecessary, because the software would perform those functions. Computer control was intended to allow operators to set up the machine more quickly, allowing them to spend more time communicating with patients and to treat more patients per day. It was also seen as a way to reduce production costs by relying on software, rather than hardware, safety interlocks.

There were design issues with both the software and the hardware. Although this machine was built with the goal of saving lives, between 1985 and 1986, three deaths and other injuries were attributed to the hardware and software design of this machine. Death due to radiation exposure is usually slow and painful, and the problem was not identified until the damage had been done.

6.3.1 History of the Therac-25

AECL was required to obtain US Food and Drug Administration (FDA) approval before releasing the Therac-25 to the US market. They obtained approval quickly by declaring “pre-market equivalence,” effectively claiming that the new machine was not significantly different from its predecessors. This practice was common in 1984, but was overly optimistic, considering that most of the safety features had been changed from hardware to software implementations. With FDA approval, AECL made the Therac-25 commercially available and performed a Fault Tree Analysis to evaluate the safety of the device.

Fault Tree Analysis, as its name implies, requires building a tree to describe every possible fault and assigning probabilities to those faults. After building the tree, the probabilities of hazards, such as overdose, can be calculated. Unfortunately, the engineers assumed that the software (much of which was re-used from the previous Therac models) would operate correctly. This turned out not to be the case, because the hardware interlocks present in the previous models had hidden some of the software faults. The analysts did consider some possible computer faults, such as an error being caused by cosmic rays, but assigned extremely low probabilities to those faults. As a result, the assessment was very inaccurate.

When the first overdose was reported to AECL in 1985, they sent an engineer to the site to investigate. They also filed a report with the FDA and the Canadian Radiation Protection Board (CRPB). AECL also notified all users of the fact that there had been a report and recommended that operators should visually confirm hardware settings before each treatment. The AECL engineers were unable to reproduce the fault, but suspected that it was due to the design and placement of a microswitch. They redesigned the microswitch and modified all of the machines that had been deployed. They also retracted their recommendation that operators should visually confirm hardware settings before each treatment.

Later that year, a second incident occurred. In this case, there is no evidence that AECL took any action. In January of 1986, AECL received another incident report. An employee at AECL responded by denying that the Therac-25 was at fault, and stated that no other similar incidents had been reported. Another incident occurred in March of that year. AECL sent an engineer to investigate. The engineer was unable to determine the cause, and suggested that it was due to an electrical problem, which may have caused an electrical shock. An independent engineering firm was called to examine the machine and reported that it was very unlikely that the machine could have delivered an electrical shock to the patient. In April of 1986, another incident was reported. In this case, the AECL engineers, working with the medical physicist at the hospital, were able to reproduce the sequence of events that led to the overdose.

As required by law, AECL filed a report with the FDA. The FDA responded by declaring the Therac-25 defective. AECL was ordered to notify all of the sites where the Therac-25 was in use, investigate the problem, and file a corrective action plan. AECL notified all sites, and recommended removing certain keys from the keyboard on the machines. The FDA responded by requiring them to send another notification with more information about the defect and the consequent hazards. Later in 1986, AECL filed a revised corrective action plan.

Another overdose occurred in January 1987, and was attributed to a different software fault. In February, the FDA and CRPB both ordered that all Therac-25 units be shut down, pending effective and permanent modifications. AECL spent six months developing a new corrective action plan, which included a major overhaul of the software, the addition of mechanical safety interlocks, and other safety-related modifications.

6.3.2 Overview of Design Flaws

The Therac-25 was controlled by a DEC PDP-11 computer, which was the most popular minicomputer ever produced. Around 600,000 were built between 1970 and 1990 and used for a variety of purposes, including embedded systems, education, and general data processing. It was a 16-bit computer and was far less powerful than a Raspberry Pi. The Therac-25 computer was programmed in assembly language by one programmer, and the source code was not documented. Documentation for the hardware components was written in French. After the faults were discovered, a commission concluded that the primary problems with the Therac-25 were attributable to poor software design practices, and not to any one of several specific coding errors. This is probably the best-known case in which poor overall software design and insufficient testing led to loss of life.

The worst problems in the design and engineering of the machine were:

 The code was not subjected to independent review.

 The software design was not considered during the assessment of how the machine could fail or malfunction.

 The operator could ignore malfunctions and cause the machine to proceed with treatment.

 The hardware and software were designed separately and not tested as a complete system until the unit was assembled at the hospitals where it was to be used.

 The design of the earlier Therac-6 and Therac-20 machines included hardware interlocks which would ensure that the X-ray mode could not be activated unless the tungsten radiation shield was in place. The hardware interlock was replaced with a software interlock in the Therac-25.

 Errors were displayed as numeric codes, and there was no indication of the severity of the error condition.

The operator interface consisted of a keyboard and text-mode monitor, which was common in the early 1980s. The interface had a data entry area in the middle of the screen and a command line at the bottom. The operator was required to enter parameters in the data entry area, then move the cursor to the command line to initiate treatment. When the operator moved the cursor to the command line, internal variables were updated and a flag variable was set to indicate that data entry was complete. That flag was cleared when a command was entered on the command line. If the operator moved the cursor back to the data entry area without entering a command, then the flag was not cleared, and any subsequent changes to the data entry area did not affect the internal variables.

A global variable was used to indicate that the magnets were currently being adjusted. This variable was modified by two functions, and did not always contain the correct value. Adjusting the magnets required about eight seconds, and the flag was correct for only a small period at the beginning of this time period.

Due to the errors in the design and implementation of the software, the following sequence of events could result in the machine causing injury to, or even the death of, the patient:

1. The operator mistakenly specified high-power mode during data entry.

2. The operator moved the cursor to the command line area.

3. The operator noticed the mistake, and moved the cursor back to the data entry area without entering a command.

4. The operator corrected the mistake and moved the cursor back to the command line.

5. The operator entered the command line area, left it, made the correction, and returned within the eight-second window required for adjusting the magnets.

If the above sequence occurred, then the operator screen could indicate that the machine was in low power mode, although it was actually set in high-power mode. During a final check before initiating the beam, the software would find that the magnets were set for high-power mode but the operator setting was for low power mode. It displayed a numeric error code and prevented the machine from starting. The operator could clear the error code by resetting the computer (which only required one key to be pressed on the keyboard). This caused the tungsten shield to withdraw but left the machine in X-ray mode. When the operator entered the command to start the beam, the machine could be in high-power mode without having the tungsten shield in place. X-rays were applied to the unprotected patient.

It took some time for this critical flaw to appear. The failure only occurred when the operator initially made a one-keystroke mistake in entering the prescription data, moved to the command area, and then corrected the mistake within eight seconds. Initially, operators were slow to enter data, and spent a lot of time making sure that the prescription was correct before initiating treatment. As they became more familiar with the machine, they were able to enter data and correct mistakes more quickly. Eventually, operators became familiar enough with the machine that they could enter data, make a correction, and return to the command area within the critical eight-second window. Also, the operators became familiar with the machine reporting numeric error codes without any indication of the severity of the code. The operators were given a table of codes and their meanings. The code reported was “no dose” and indicated “treatment pause.” There is no reason why the operator should consider that to be a serious problem; they had become accustomed to frequent malfunctions that did not have any consequences to the patient.

Although the code was written in assembly language, that fact was not cited as an important factor. The fundamental problems were poor software design and overconfidence. The reuse of code in an application for which it was not initially designed also may have contributed to the system flaws. A proper design using established software design principles, including structured programming and abstract data types, would almost certainly have avoided these fatalities.

6.4 Chapter Summary

The abstract data type is a structured programming concept which contributes to software reliability, eases maintenance, and allows for major revisions to be performed in a safe way. Many high-level languages enforce, or at least facilitate, the use of ADTs. Assembly language does not. However, the ethical assembly language programmer will make the extra effort to write code that conforms to the standards of structured programming and use abstract data types to help ensure safety, reliability, and maintainability.

ADTs also facilitate the implementation of software modules in more than one language. The interface specifies the components of the ADT, but not the implementation. The implementation can be in any language. As long as assembly programmers and compiler authors generate code that conforms to a well-known standard, their code can be linked with code written in other languages.

Poor coding practices and poor design can lead to dire consequences, including loss of life. It is the responsibility of the programmer, regardless of the language used, to make ethical decisions in the design and implementation of software. Above all, the programmer must be aware of the possible consequences of the decisions they make.

Exercises

6.1 What are the advantages of designing software using abstract data types?

6.2 Why is the internal structure of the Pixel data type hidden from client code in Listing 6.2?

6.3 High-level languages provide mechanisms for information hiding, but assembly does not. Why should the assembly programmer not simply bypass all information hiding and access the internal data structures of any ADT directly?

6.4 The assembly code in wl_print_numerical() accesses the internal structure of the wordlistnode data type. Why is it allowed to do so? Should it be allowed to do so?

6.5 Given the following definitions for a stack ADT:

f06-15-9780128036983
f06-16-9780128036983

Write the InitStack() function in ARM assembly language.

6.6 Referring to the previous question, write the Push() function in ARM assembly language.

6.7 Referring to the previous two questions, write the Pop() function in ARM assembly language.

6.8 Referring to the previous three questions, write the Top() function in ARM assembly language.

6.9 Referring to the previous three questions, write the PrintStack() function in ARM assembly language.

6.10 Re-implement all of the previous stack functions using a linked list rather than a static array.

6.11 The “Software Engineering Code of Ethics And Professional Practice” states that a responsible software engineer should “Approve software only if they have well-founded belief that it is safe, meets specifications, passes appropriate tests…” (sub-principle 1.03) and “Ensure adequate testing, debugging, and review of software…on which they work.” (sub-principle 3.10). Unfortunately, defects did make their way into the system.
The software engineering code of ethics also states that a responsible software engineer should “Treat all forms of software maintenance with the same professionalism as new development.”

(a) Explain how the Software Engineering Code of Ethics And Professional Practice was violated by the Therac-25 developers.

(b) How should the engineers and managers at AECL have responded when problems were reported?

(c) What other ethical and non-ethical considerations may have contributed to the deaths and injuries?

Part II

Performance Mathematics

Chapter 7

Integer Mathematics

Abstract

This chapter introduces the concept of high performance mathematics. The chapter starts by explaining basic math in bases other than 10. It explains subtraction using complement mathematics. Next it gives efficient algorithms for performing signed and unsigned multiplication in binary. It explains how multiplication by a constant can often be converted into a much more efficient sequence of shift and add or subtract operations, and gives a method for multiplying two arbitrarily large numbers. Next, an efficient algorithm is given for binary division, followed by a technique for converting division by a constant into multiplication by a related constant. The next section introduces an ADT, written in C, which can be used to perform basic mathematical operations on integers of any size. The chapter concludes by showing that the ADT can be made much more efficient by replacing some of the functions with assembly language implementations.

Keywords

Addition; Subtraction; Complement; Multiplication; Division; Big integer; High performance; Abstract data type

There are some differences between the way calculations are performed in a computer versus the way most of us were taught as children. The first difference is that calculations are performed in binary instead of base ten. Another difference is that the computer is limited to a fixed number of binary digits, which raises the possibility of having a result that is too large to fit in the number of bits available. This occurrence is referred to as overflow. The third difference is that subtraction is performed using complement addition.

Addition in base b is very similar to base ten addition, except that the largest digit in each column is b − 1. For example, binary addition works exactly the same as decimal addition, except that each digit is limited to 0 or 1. The following figure shows an addition in base ten and the equivalent addition in base two.

u07-27-9780128036983

The carry from one column to the next is shown as a small number above the column that it is being carried into. Note that carries from one column to the next are done the same way in both bases. The only difference is that there are more columns in the base two addition because it takes more digits to represent a number in binary than it does in decimal.
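The column-by-column procedure with carries is the same in any base. The following C sketch illustrates this; `add_digits` is a hypothetical helper, and the little-endian digit-array layout is an assumption for illustration:

```c
#include <stdint.h>

/* Add two numbers held as little-endian digit arrays in base b.
 * Each column sum that reaches b produces a carry into the next
 * column, exactly as in longhand decimal addition.
 * Returns the number of digits in the result (at most n + 1). */
int add_digits(const uint8_t *x, const uint8_t *y, uint8_t *sum,
               int n, uint8_t b) {
    uint8_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint8_t s = x[i] + y[i] + carry;
        carry = (s >= b);           /* carry out of this column */
        sum[i] = carry ? s - b : s;
    }
    sum[n] = carry;                 /* possible final carry digit */
    return carry ? n + 1 : n;
}
```

The same function adds in base ten or base two; only the value of `b` changes, which mirrors the point made above.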

7.1 Subtraction by Addition

Finding the complement was explained in Section 1.3.3. Subtraction can be computed by adding the radix complement of the subtrahend to the minuend. Example 7.1 shows a complement subtraction with a positive result. When x < y, the result will be negative. In the complement method, this means that there will be a ‘1’ in the most significant bit, and in order to convert the result to base ten, we must take the radix complement. Example 7.2 shows complement subtraction with a negative result. Example 7.3 shows several more signed addition and subtraction operations in base ten and binary.

Example 7.1

Ten’s Complement Subtraction

Suppose we wish to calculate 384₁₀ − 56₁₀ using the complements method. After extending both numbers to the same number of digits, we have 384₁₀ − 056₁₀. From Eq. (1.1), the ten’s complement of 056₁₀ is 10³ − 056₁₀ = 944₁₀. Adding gives us 384₁₀ + 944₁₀ = 1328₁₀. After discarding the leading “1”, we have 328, which is the correct result. Both methods of subtraction are shown below:

u07-25-9780128036983

Example 7.2

Ten’s Complement Subtraction With a Negative Result

Suppose we want to calculate 284 − 481. Both numbers have three digits, so it is not necessary to pad with leading zeros. Adding the ten’s complement of 481 to 284 gives 284 + 519 = 803. This is obviously the wrong answer, since the expected answer is −197. But all is not lost, because 803 happens to be the ten’s complement of 197. The fact that the first digit of the result is greater than four indicates that we must take the ten’s complement of the result and add a negative sign.

Example 7.3

Signed Addition and Subtraction in Decimal and Binary

u07-26-9780128036983
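The complement method of Examples 7.1 and 7.2 can be sketched in C for three-digit decimal operands; `tens_comp_sub` is a hypothetical name chosen for illustration:

```c
#include <stdint.h>

/* Subtract y from x (both in 0..999) using ten's complement
 * addition instead of a borrow-based subtraction.
 * The ten's complement of a 3-digit y is 10^3 - y. */
int32_t tens_comp_sub(int32_t x, int32_t y) {
    int32_t sum = x + (1000 - y);   /* add the radix complement */
    if (sum >= 1000)
        return sum - 1000;          /* discard the leading carry */
    /* no carry: the result is negative; re-complement it */
    return -(1000 - sum);
}
```

With the numbers from the examples, 384 − 56 produces a carry that is discarded, while 284 − 481 produces no carry and the result must be re-complemented and negated.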

7.2 Binary Multiplication

Many processors have hardware multiply instructions. However hardware multipliers require a large number of transistors, and consume significant power. Processors designed for extremely low power consumption or very small size usually do not implement a multiply instruction, or only provide multiply instructions that are limited to a small number of bits. On these systems, the programmer must implement multiplication using basic data processing instructions.

7.2.1 Multiplication by a Power of Two

If the multiplier is a power of two, then multiplication can be accomplished with a shift to the left. Consider the 4-bit binary number x = x₃×2³ + x₂×2² + x₁×2¹ + x₀×2⁰, where xₙ denotes bit n of x. If x is shifted left by one bit, introducing a zero into the least significant bit, then it becomes x₃×2⁴ + x₂×2³ + x₁×2² + x₀×2¹ + 0×2⁰ = 2(x₃×2³ + x₂×2² + x₁×2¹ + x₀×2⁰). Therefore, a shift of one bit to the left is equivalent to multiplication by two. This argument can be extended to prove that a shift left by n bits is equivalent to multiplication by 2ⁿ.
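In C the shift is a single operator; the sketch below (with the illustrative name `mul_pow2`) demonstrates the equivalence for unsigned values:

```c
#include <stdint.h>

/* Multiply an unsigned value by 2^n with a single left shift.
 * Each one-bit shift doubles the value, so n shifts multiply by 2^n
 * (valid as long as the product still fits in 32 bits). */
uint32_t mul_pow2(uint32_t x, unsigned n) {
    return x << n;
}
```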

7.2.2 Multiplication of Two Variables

Most techniques for binary multiplication involve computing a set of partial products and then summing the partial products together. This process is similar to the method taught to primary schoolchildren for conducting long multiplication on base ten integers, but has been modified here for application to binary. The method typically taught in school for multiplying decimal numbers is based on calculating partial products, shifting them to the left and then adding them together. The most difficult part is to obtain the partial products, as that involves multiplying a long number by one base ten digit. The following example shows how the partial products are formed when multiplying 123 by 456.

u07-30-9780128036983

The first partial product can be written as 123 × 6 × 10⁰ = 738. The second is 123 × 5 × 10¹ = 6150, and the third is 123 × 4 × 10² = 49200. In practice, we usually leave out the trailing zeros. The procedure is the same in binary, but is simpler because each partial product involves multiplying a long number by a single base two digit. Since the multiplier is always either zero or one, the partial product is very easy to compute. The product of multiplying any binary number x by a single binary digit is always either 0 or x. Therefore, the multiplication of two binary numbers comes down to shifting the multiplicand left appropriately for each non-zero bit in the multiplier, and then adding the shifted numbers together.

Suppose we wish to multiply two four-bit numbers, 1011 and 1010:

u07-31-9780128036983

Notice in the previous example that each partial sum is either zero or x shifted by some amount. A slightly quicker way to perform the multiplication is to leave out any partial sum which is zero. Example 7.4 shows the results of multiplying 101₁₀ by 89₁₀ in decimal and binary using this shorter method. For implementation in hardware and software, it is easier to accumulate the partial products by adding each to a running sum, rather than building a circuit to add multiple binary numbers at once.

Example 7.4

Equivalent Multiplication in Decimal and Binary

u07-28-9780128036983

Binary multiplication can be implemented as a sequence of shift and add instructions. Given two registers, x and y, and an accumulator register a, the product of x and y can be computed using Algorithm 1. When applying the algorithm, it is important to remember that, in the general case, the result of multiplying an n bit number by an m bit number is (at most) an n + m bit number. For instance, 11₂ × 11₂ = 1001₂. Therefore, when applying Algorithm 1, it is necessary to know the number of bits in x and y. Since x is shifted left on each iteration of the loop, the registers used to store x and a must both be at least as large as the number of bits in x plus the number of bits in y.

u07-37-9780128036983
Algorithm 1 Algorithm for binary multiplication.

Assume we wish to multiply two numbers, x = 01101001 and y = 01011010. Applying Algorithm 1 results in the following sequence:

a                 x                 y         Next operation
0000000000000000  0000000001101001  01011010  shift only
0000000000000000  0000000011010010  00101101  add, then shift
0000000011010010  0000000110100100  00010110  shift only
0000000011010010  0000001101001000  00001011  add, then shift
0000010000011010  0000011010010000  00000101  add, then shift
0000101010101010  0000110100100000  00000010  shift only
0000101010101010  0001101001000000  00000001  add, then shift
0010010011101010  0011010010000000  00000000  shift only

105 × 90 = 9450
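Algorithm 1 and the trace above can be expressed in C; `mul_shift_add` is a hypothetical name, with the accumulator and x widened to hold the full 2n-bit product:

```c
#include <stdint.h>

/* Shift-and-add multiplication of two 8-bit values, following
 * Algorithm 1: examine the low bit of y; if it is set, add x to the
 * accumulator; then shift x left and y right.  The accumulator and
 * x are 16 bits wide because an 8-bit by 8-bit product needs up to
 * 16 bits. */
uint16_t mul_shift_add(uint8_t xin, uint8_t yin) {
    uint16_t a = 0, x = xin;
    uint8_t  y = yin;
    while (y != 0) {
        if (y & 1)        /* low bit set: this partial product is x */
            a += x;
        x <<= 1;          /* next partial product is doubled */
        y >>= 1;          /* consume one multiplier bit */
    }
    return a;
}
```

Running it on the operands of the trace, x = 105 and y = 90, reproduces the final accumulator value 9450.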

To multiply two n-bit numbers, you must be able to add two 2n-bit numbers. On the ARM processor, n is usually 32, because that is the natural word size. Adding 64-bit numbers requires two add instructions: the carry from adding the least-significant 32 bits must be folded into the sum of the most-significant 32 bits. The ARM processor provides a convenient way to perform the add with carry. Assume we have two 64-bit numbers, x and y, with x in r0, r1 and y in r2, r3, where the high-order words are in the higher-numbered registers, and we want to calculate x = x + y. Listing 7.1 shows a two-instruction sequence for the ARM processor. The first instruction adds the two least-significant words together and sets (or clears) the carry bit and other flags in the CPSR. The second instruction adds the two most-significant words along with the carry bit.

f07-06-9780128036983
Listing 7.1 ARM assembly code for adding two 64 bit numbers.
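The ADDS/ADC pairing of Listing 7.1 can be mimicked in C by detecting the carry out of the low word; `add64` is a hypothetical name, and passing the halves explicitly is an assumption made for illustration:

```c
#include <stdint.h>

/* Add two 64-bit values held as 32-bit halves, the way the ARM
 * ADDS/ADC pair does: add the low words, capture the carry, then
 * fold the carry into the sum of the high words. */
void add64(uint32_t xhi, uint32_t xlo, uint32_t yhi, uint32_t ylo,
           uint32_t *shi, uint32_t *slo) {
    uint32_t lo = xlo + ylo;
    uint32_t carry = (lo < xlo);  /* unsigned wraparound => carry out */
    *slo = lo;
    *shi = xhi + yhi + carry;
}
```

The wraparound test `lo < xlo` plays the role of the carry flag that the ADDS instruction sets in the CPSR.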

On the ARM processor, the algorithm to multiply two 32-bit unsigned integers is very efficient. Listing 7.2 shows one possible algorithm for multiplying two 32-bit numbers to obtain a 64-bit result. The code is a straightforward implementation of the algorithm, and some modifications can be made to improve efficiency. For example, if we only want a 32-bit result, we do not need to perform 64-bit addition. This significantly simplifies the code, as shown in Listing 7.3.

f07-07-9780128036983
Listing 7.2 ARM assembly code for multiplication with a 64 bit result.
f07-08-9780128036983
Listing 7.3 ARM assembly code for multiplication with a 32 bit result.

7.2.3 Multiplication of a Variable by a Constant

If x or y is a constant, then a loop is not necessary. The multiplication can be directly translated into a sequence of shift and add operations. This will result in much more efficient code than the general algorithm. If we inspect the constant multiplier, we can usually find a pattern to exploit that will save a few instructions. For example, suppose we want to multiply a variable x by 10₁₀. The multiplier 10₁₀ = 1010₂, so we only need to add x shifted left 1 bit to x shifted left 3 bits as shown below:

u07-32-9780128036983

Now suppose we want to multiply a number x by 11₁₀. The multiplier 11₁₀ = 1011₂, so we will add x to x shifted left one bit plus x shifted left 3 bits as in the following:

u07-33-9780128036983

If we wish to multiply a number x by 1000₁₀, we note that 1000₁₀ = 1111101000₂. It looks like we need one shift plus five add/shift operations, or six add/shift operations. With a little thought, the number of operations can be reduced from six to five as shown below:

u07-34-9780128036983

Applying the basic multiplication algorithm to multiply a number x by 255₁₀ would result in seven add/shift operations, but we can do it with only three operations and use only one register, as shown below:

u07-35-9780128036983
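These shift-and-add (or shift-and-subtract) patterns translate directly to C; the helper names below are illustrative only:

```c
#include <stdint.h>

/* x * 10 = x*8 + x*2, since 10 is 1010 in binary. */
uint32_t times10(uint32_t x)  { return (x << 3) + (x << 1); }

/* x * 11 = x*8 + x*2 + x, since 11 is 1011 in binary. */
uint32_t times11(uint32_t x)  { return (x << 3) + (x << 1) + x; }

/* x * 255 = x*256 - x: one shift and one subtract instead of the
 * seven add/shift steps the one-bits of 11111111 would suggest. */
uint32_t times255(uint32_t x) { return (x << 8) - x; }
```

On ARM, each of these bodies maps onto add/rsb instructions with a shifted second operand, which is what makes the technique competitive with a hardware multiply.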

Most modern systems have assembly language instructions for multiplication, but hardware multiply units require a relatively large number of transistors. For that reason, processors intended for small embedded applications often do not have a multiply instruction. Even when a hardware multiplier is available, on some processors it is often more efficient to use shift, add, and subtract operations when multiplying by a constant. The hardware multiplier units that are available on most ARM processors are very powerful. They can typically perform multiplication with a 32-bit result in as little as one clock cycle. The long multiply instructions take between three and five clock cycles, depending on the size of the operands. Using the multiply instruction on an ARM processor to multiply by a constant usually requires loading the constant into a register before performing the multiply. Therefore, if the multiplication can be performed using three or fewer shift, add, and subtract instructions, then it will be equal to or better than using the multiply instruction.

7.2.4 Signed Multiplication

Consider the two multiplication problems shown in Figs. 7.1 and 7.2. Note that the result of a multiply depends on whether the numbers are interpreted as unsigned numbers or signed numbers. For this reason, most computer CPUs have two different multiply operations for signed and unsigned numbers.

f07-01-9780128036983
Figure 7.1 In signed 8-bit math, 11011001₂ is −39₁₀.
f07-02-9780128036983
Figure 7.2 In unsigned 8-bit math, 11011001₂ is 217₁₀.

If the CPU provides only an unsigned multiply, then a signed multiply can be accomplished by using the unsigned multiply operation along with a conditional complement. The following procedure can be used to implement signed multiplication.

1. if the multiplier is negative, take the two’s complement,

2. if the multiplicand is negative, take the two’s complement,

3. perform unsigned multiply, and

4. if the multiplier or multiplicand was negative (but not both), then take two’s complement of result.

Example 7.5 demonstrates this method using one negative number.

Example 7.5

Signed Multiplication Using Unsigned Math

u07-22-9780128036983
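The four-step procedure can be sketched in C on top of an unsigned multiply; `smul16` is a hypothetical name, and the exclusion of INT16_MIN (whose magnitude does not fit in 16 bits) is a stated assumption:

```c
#include <stdint.h>

/* Signed 16-bit multiply built from an unsigned multiply:
 * complement negative operands, multiply the magnitudes, then
 * complement the product if exactly one operand was negative.
 * (INT16_MIN is excluded: its magnitude does not fit in 16 bits.) */
int32_t smul16(int16_t a, int16_t b) {
    uint16_t ua = (a < 0) ? (uint16_t)(-a) : (uint16_t)a;
    uint16_t ub = (b < 0) ? (uint16_t)(-b) : (uint16_t)b;
    uint32_t p  = (uint32_t)ua * ub;     /* unsigned multiply */
    if ((a < 0) != (b < 0))              /* signs differ */
        return -(int32_t)p;
    return (int32_t)p;
}
```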

7.2.5 Multiplying Large Numbers

Consider the method used for multiplying two-digit numbers in base ten, using only the one-digit multiplication tables. Fig. 7.3 shows how a two-digit number a = a₁×10¹ + a₀×10⁰ is multiplied by another two-digit number b = b₁×10¹ + b₀×10⁰ to produce a four-digit result using basic multiplication operations which only take one digit from a and one digit from b at each step.

f07-03-9780128036983
Figure 7.3 Multiplication of large numbers.

This technique can be used for numbers in any base and for any number of digits. Recall that one hexadecimal digit is equivalent to exactly four binary digits. If a and b are both 8-bit numbers, then they are also 2-digit hexadecimal numbers. In other words 8-bit numbers can be divided into groups of four bits, each representing one digit in base sixteen. Given a multiply operation that is capable of producing an 8-bit result from two 4-bit inputs, the technique shown above can then be used to multiply two 8-bit numbers using only 4-bit multiplication operations.

Carrying this one step further, suppose we are given two 16-bit numbers, but our computer only supports multiplying eight bits at a time and producing a 16-bit result. We can consider each 16-bit number to be a two digit number in base 256, and use the above technique to perform four eight bit multiplies with 16-bit results, then shift and add the 16-bit results to obtain the final 32-bit result. This approach can be extended to implement efficient multiplication of arbitrarily large numbers, using a fixed-sized multiplication operation.
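A C sketch of the 16-bit case, using only 8-bit-by-8-bit products, might look like the following (`mul16x16` is an illustrative name):

```c
#include <stdint.h>

/* Multiply two 16-bit values using only 8-bit-by-8-bit multiplies,
 * treating each operand as a two-digit number in base 256:
 * a*b = a1*b1*2^16 + (a1*b0 + a0*b1)*2^8 + a0*b0. */
uint32_t mul16x16(uint16_t a, uint16_t b) {
    uint8_t a0 = a & 0xFF, a1 = a >> 8;
    uint8_t b0 = b & 0xFF, b1 = b >> 8;
    uint32_t p00 = (uint16_t)(a0 * b0);   /* low  x low  */
    uint32_t p01 = (uint16_t)(a0 * b1);   /* low  x high */
    uint32_t p10 = (uint16_t)(a1 * b0);   /* high x low  */
    uint32_t p11 = (uint16_t)(a1 * b1);   /* high x high */
    return (p11 << 16) + ((p10 + p01) << 8) + p00;
}
```

Each of the four partial products fits in 16 bits, mirroring Fig. 7.3 with base-256 digits in place of base-ten digits.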

7.3 Binary Division

Binary division can be implemented as a sequence of shift and subtract operations. When performing binary division by hand, it is convenient to perform the operation in a manner very similar to the way that decimal division is performed. As shown in Fig. 7.4, the operation is identical, but takes more steps in binary.

f07-04-9780128036983
Figure 7.4 Longhand division in decimal and binary.

7.3.1 Division by a Power of Two

If the divisor is a power of two, then division can be accomplished with a shift to the right. Using the same approach as was used in Section 7.2.1, it can be shown that a shift right by n bits is equivalent to division by 2n. However, care must be taken to ensure that an arithmetic shift is used if the numerator is a signed two’s complement number, and a logical shift is used if the numerator is unsigned.
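For unsigned values the shift is exactly a logical shift right; the sketch below covers that case (`div_pow2` is an illustrative name), with the signed caveat noted in a comment:

```c
#include <stdint.h>

/* Divide an unsigned value by 2^n with a logical shift right.
 * For signed two's complement values an arithmetic shift is needed
 * instead, so that the sign bit is replicated into the high bits. */
uint32_t div_pow2(uint32_t x, unsigned n) {
    return x >> n;   /* logical shift for unsigned operands */
}
```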

7.3.2 Division by a Variable

The algorithm for dividing binary numbers is somewhat more complicated than the algorithm for multiplication. The algorithm consists of two main phases:

1. shift the divisor left until it is greater than the dividend, counting the number of shifts, then

2. repeatedly shift the divisor back to the right and subtract whenever possible.

Fig. 7.5 shows the algorithm in more detail. Because of the complexity of the algorithm, division in hardware requires a significant number of transistors. The ARM architecture did not introduce a divide instruction until ARMv7, and even then it was not implemented on all processors. Many ARM systems (including the Raspberry Pi) do not have hardware division. However, the ARM processor instruction set makes it possible to write very efficient code for division.

f07-05-9780128036983
Figure 7.5 Flowchart for binary division.

Before we introduce the ARM code, we will take some time to step through the algorithm using an example. Let us begin by dividing 94 by 7. The result is shown below:

u07-29-9780128036983

To implement the algorithm, we need three registers, one for the dividend, one for the divisor, and one for a counter. The dividend and divisor are loaded into their registers and the counter is initialized to zero as shown below:

Dividend  01011110
Divisor   00000111
Counter   00000000

Next, the divisor is shifted left and the counter incremented repeatedly until the divisor is greater than the dividend. This is shown in the following sequence:

Dividend  01011110
Divisor   00001110
Counter   00000001

Dividend  01011110
Divisor   00011100
Counter   00000010

Dividend  01011110
Divisor   00111000
Counter   00000011

Dividend  01011110
Divisor   01110000
Counter   00000100

Next, we allocate a register for the quotient and initialize it to zero. Then, according to the algorithm, we repeatedly subtract if possible, shift to the right, and decrement the counter. This sequence continues until the counter becomes negative. For our example this results in the following sequence:

u07-10-9780128036983
u07-11-9780128036983
u07-12-9780128036983
u07-13-9780128036983

u07-14-9780128036983
u07-15-9780128036983

When the algorithm terminates, the quotient register contains the result of the division, and the modulus (remainder) is in the dividend register. Thus, one algorithm is used to compute both the quotient and the modulus at the same time. There are variations on this algorithm. For example, one variation is to shift a single bit left in a register, rather than incrementing a count. This variation has the same two phases as the previous algorithm, but counts in powers of two rather than by ones. The following sequence shows what occurs after each iteration of the first loop in the algorithm.

Dividend  01011110
Divisor   00000111
Power     00000001

Dividend  01011110
Divisor   00001110
Power     00000010

Dividend  01011110
Divisor   00011100
Power     00000100

Dividend  01011110
Divisor   00111000
Power     00001000

Dividend  01011110
Divisor   01110000
Power     00010000

The divisor is greater than the dividend, so the algorithm proceeds to the second phase. In this phase, if the divisor is less than or equal to the dividend, then the power register is added to the quotient and the divisor is subtracted from the dividend. Then, the power and divisor registers are shifted to the right. The process is repeated until the power register is zero. The following sequence shows what the registers will contain at the end of each iteration of the second loop.

u07-16-9780128036983
u07-17-9780128036983
u07-18-9780128036983
u07-19-9780128036983
u07-20-9780128036983
u07-21-9780128036983

As with the previous version, when the algorithm terminates, the quotient register contains the result of the division, and the modulus (remainder) is in the dividend register. Listing 7.4 shows the ARM assembly code to implement this version of the division algorithm for 32-bit numbers, and the counting method for 64-bit numbers.

f07-13a-9780128036983f07-13b-9780128036983f07-13c-9780128036983f07-13d-9780128036983
Listing 7.4 ARM assembly implementation of signed and unsigned 32-bit and 64-bit division functions
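The bit-shifting ("power") variant traced above can also be sketched in C; `udivmod` is a hypothetical name, and a non-zero divisor is assumed:

```c
#include <stdint.h>

/* Shift-subtract division using the "power" variant: phase one
 * shifts the divisor left until it exceeds the dividend, tracking
 * the shift count as a single moving bit; phase two shifts back
 * right, subtracting and setting quotient bits whenever possible.
 * The divisor must be non-zero.  Returns the quotient; the
 * remainder (modulus) is left in *rem. */
uint32_t udivmod(uint32_t dividend, uint32_t divisor, uint32_t *rem) {
    uint32_t power = 1, quotient = 0;
    /* Phase 1: align the divisor under the dividend's top bit. */
    while (divisor <= dividend && !(divisor & 0x80000000u)) {
        divisor <<= 1;
        power   <<= 1;
    }
    /* Phase 2: subtract where possible while shifting back. */
    while (power != 0) {
        if (divisor <= dividend) {
            dividend -= divisor;
            quotient |= power;
        }
        divisor >>= 1;
        power   >>= 1;
    }
    *rem = dividend;
    return quotient;
}
```

Dividing 94 by 7 with this sketch yields quotient 13 and remainder 3, matching the worked example, and it produces both results in a single pass just as the text describes.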

7.3.3 Division by a Constant

In general, division is slow. Newer ARM processors provide a hardware divide instruction which requires between two and twelve clock cycles to produce a result, depending on the size of the operands. Older processors must perform division using software, as previously described. In either case, division is by far the slowest of the basic mathematical operations. However, division by a constant c can be converted to a multiply by the reciprocal of c. It is obviously much more efficient to use a multiply instead of a divide wherever possible. Efficient division of a variable by a constant is achieved by applying the following equality:

x ÷ c = x × (1/c).  (7.1)

The only difficulty is that we have to do it in binary, using only integers. If we modify the right-hand side by multiplying and dividing by some power of two (2ⁿ), we can rewrite Eq. (7.1) as follows:

x ÷ c = x × (2ⁿ/c) × 2⁻ⁿ.  (7.2)

Recall that, in binary, multiplying by 2ⁿ is the same as shifting left by n bits, while multiplying by 2⁻ⁿ is done by shifting right by n bits. Therefore, Eq. (7.2) is just Eq. (7.1) with two shift operations added. The two shift operations cancel each other out. Now, let

m = 2ⁿ/c.  (7.3)

We can rewrite Eq. (7.2) as:

x ÷ c = (x × m) × 2⁻ⁿ.  (7.4)

We now have a method for dividing by a constant c which involves multiplying by a different constant, m, and shifting the result. In order to achieve the best precision, we want to choose n such that m is as large as possible with the number of bits we have available.

Suppose we want efficient code to calculate x ÷ 23 using 8-bit signed integer multiplication. Our first task is to find m = 2ⁿ/c such that 01111111₂ ≥ m ≥ 01000000₂. In other words, we want to find the value of n where the most significant bit of m is zero, and the next most significant bit of m is one. If we choose n = 11, then

m = 2¹¹/23 ≈ 89.0434782609.

Rounding to the nearest integer gives m = 89. In 8 bits, m is 01011001₂ or 59₁₆. We now have values for m and n, and therefore we can apply Eq. (7.4) to divide any number x by 23. The procedure is simple: calculate y = x × m, then shift y right by 11 bits.

However, there are two more considerations. First, when the divisor is positive, the result for some values of x may be incorrect due to rounding error. It is usually sufficient to increment the reciprocal value by one in order to avoid these errors. In the previous example, the number would be changed from 59₁₆ to 5A₁₆. When implementing this technique for finding the reciprocal, the programmer should always verify that the results are correct for all input values. The second consideration is when the dividend is negative. In that case it is necessary to subtract one from the final result.
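That verification step can be automated with a short exhaustive check. The sketch below (hypothetical name `count_mismatches`) tests a candidate reciprocal m with n = 11 against true division by 23 over the non-negative 8-bit range:

```c
#include <stdint.h>

/* Count the values x in 0..127 for which multiplying by the
 * candidate reciprocal m and shifting right 11 bits disagrees with
 * true division by 23.  A correct constant yields a count of 0. */
int count_mismatches(uint32_t m) {
    int bad = 0;
    for (uint32_t x = 0; x <= 127; x++) {
        uint32_t approx = (x * m) >> 11;  /* multiply, then shift */
        if (approx != x / 23)
            bad++;
    }
    return bad;
}
```

In this sketch m = 89 (59₁₆) fails for every non-zero multiple of 23, and even the incremented m = 90 (5A₁₆) still misses one input (x = 114), which is exactly why the text insists on verifying all input values.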

For example, to calculate 101₁₀ ÷ 23₁₀ in binary, with eight bits of precision, we first perform the multiplication as follows:

u07-23-9780128036983

Then shift the result right by 11 bits: 10001100011101₂ shifted right 11₁₀ bits is 100₂ = 4₁₀. If the modulus is required, it can be calculated as 101 mod 23 = 101 − (4 × 23) = 9, which once again requires multiplication by a constant.

In the previous example the shift amount of 11 bits provided the best precision possible. But how was that number chosen? The shift amount, n, can be directly computed as

n = p + ⌊log₂ c⌋ − 1,  (7.5)

where p is the desired number of bits of precision. The value of m can then be computed as

m = ⌊2ⁿ/c⌋ + 1 when c > 0, and m = ⌊2ⁿ/c⌋ otherwise.  (7.6)

For example, to divide by the constant 33, with 16 bits of precision, we compute n as

n = 16 + ⌊log₂ 33⌋ − 1 = 16 + ⌊5.044394⌋ − 1 = 16 + 5 − 1 = 20,

and then we compute m as

m = ⌊2²⁰/33⌋ + 1 = ⌊31775.030303⌋ + 1 = 31776 = 7C20₁₆.

Therefore, multiplying a 16-bit number by 7C20₁₆ and then shifting right 20 bits is equivalent to dividing by 33.

Example 7.6 shows how to calculate m and n for division by 193. On the ARM processor, division by a constant can be performed very efficiently. Listing 7.5 shows how division by 193 can be implemented using only a few lines of code. In the listing, the numbers are 32 bits in length, so the constant m is much larger than in the example that was multiplied by hand, but otherwise the method is the same.

Example 7.6

Division by Constant 193

To divide by the constant 193, with 32 bits of precision, the multiplier is computed using Eqs. (7.5) and (7.6) with p = 32 as follows:

m = ⌊2³²⁺⁷⁻¹/193⌋ + 1 = ⌊2³⁸/193⌋ + 1 = ⌊1424237859.81⌋ + 1 = 1424237860 = 54E42524₁₆.

The shift amount, n, is 38 bits.

f07-14-9780128036983
Listing 7.5 ARM assembly code for division by constant 193.
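The same multiply-and-shift can be written in C with a 64-bit intermediate product; `udiv193` is an illustrative name:

```c
#include <stdint.h>

/* Divide a 32-bit unsigned value by 193 without a divide
 * instruction: multiply by m = floor(2^38/193) + 1 = 0x54E42524
 * in 64-bit arithmetic, then shift right 38 bits. */
uint32_t udiv193(uint32_t x) {
    const uint64_t m = 0x54E42524u;
    return (uint32_t)(((uint64_t)x * m) >> 38);
}
```

The 64-bit intermediate plays the role of the UMULL high/low result pair used in the assembly listing.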

On processors without the multiply instruction, we can use the technique of shifting and adding shown previously. If we wish to divide by 23 using 32 bits of precision, we compute the multiplier as

m = ⌊2^(32+4−1)/23⌋ + 1 = ⌊2^35/23⌋ + 1 = ⌊1493901668.17⌋ + 1 = 1493901669 = 590B2165₁₆.

That is 01011001000010110010000101100101₂. Note that there are only 13 non-zero bits, and the pattern 1011001 appears three times in the 32-bit multiplier. The multiply can be implemented as 2^24(2^6x + 2^4x + 2^3x + 2^0x) + 2^13(2^6x + 2^4x + 2^3x + 2^0x) + 2^2(2^6x + 2^4x + 2^3x + 2^0x) + 2^0x. So the following code sequence can be used on processors that do not have the multiply instruction:

Listing 7.6 ARM assembly code for division of a variable by a constant without using a multiply instruction.
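The same shift-and-add decomposition can be expressed in C to check it. In this sketch (ours, not the book's listing; the name div23 is an assumption), t holds 1011001₂ · x = 89x, and three shifted copies of t plus a final x reconstruct m · x before the 35-bit shift. As with any reciprocal constant, the result should still be verified over the intended input range.

```c
#include <stdint.h>

/* Divide x by 23 using only shifts and adds: the multiplier
 * m = 0x590B2165 factors as 2^24*89 + 2^13*89 + 2^2*89 + 1,
 * where 89 = 1011001 in binary. Illustrative sketch. */
uint32_t div23(uint32_t x)
{
    uint64_t t = ((uint64_t)x << 6) + ((uint64_t)x << 4)
               + ((uint64_t)x << 3) + (uint64_t)x;   /* t = 89 * x */
    uint64_t p = (t << 24) + (t << 13) + (t << 2)
               + (uint64_t)x;                        /* p = m * x */
    return (uint32_t)(p >> 35);                      /* shift right n = 35 */
}
```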

7.3.4 Dividing Large Numbers

Section 7.2.5 showed how large numbers can be multiplied by breaking them into smaller numbers and using a series of multiplication operations. There is no similar method for synthesizing a large division operation with an arbitrary number of digits in the dividend and divisor. However, there is a method for dividing an arbitrarily large dividend by a divisor, provided that the available division operation can handle a dividend with at least twice as many digits as the divisor.

Suppose we wish to perform division of an arbitrarily large dividend by a one digit divisor using a basic division operation that can divide a two digit dividend by a one digit divisor. The operation can be performed in multiple steps as follows:

1. Divide the most significant digit of the dividend by the divisor. The result is the most significant digit of the quotient.

2. Prepend the remainder from the previous division step to the next digit of the dividend, forming a two-digit number, and divide that by the divisor. This produces the next digit of the result.

3. Repeat from step 2 until all digits of the dividend have been processed.

4. Take the final remainder as the modulus.

The following example shows how to divide 6189 by 7 using only two digits at a time:

6 ÷ 7 = 0, remainder 6
61 ÷ 7 = 8, remainder 5
58 ÷ 7 = 8, remainder 2
29 ÷ 7 = 4, remainder 1

So 6189 ÷ 7 = 0884 with remainder 1.

This method can be applied in any base and with any number of digits. The only restriction is that the basic division operation must be capable of dividing a 2n digit number by an n digit number and producing a 2n digit quotient and an n digit remainder. For example, the div instruction available on Cortex-M3 and newer processors is capable of dividing a 32-bit dividend by a 32-bit divisor, producing a 32-bit quotient. The remainder can be calculated by multiplying the quotient by the divisor and subtracting the product from the dividend. Using this division operation it is possible to divide an arbitrarily large number by a 16-bit divisor.
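The steps above can be sketched in C. This illustration is our own (the function name and the choice of 16-bit "digits" are assumptions, not the book's ADT); it divides a large number, stored as base-2^16 digits with the most significant digit first, by a one-"digit" divisor using only 32-bit division:

```c
#include <stdint.h>
#include <stddef.h>

/* Divide a multi-digit number (base 2^16 digits, most significant
 * first) by a one-"digit" divisor. The quotient replaces the
 * dividend in place; the function returns the final remainder,
 * which is the modulus. */
uint16_t divmod_small(uint16_t *digits, size_t n, uint16_t divisor)
{
    uint32_t rem = 0;
    for (size_t i = 0; i < n; i++) {
        /* prepend the previous remainder to the next digit */
        uint32_t cur = (rem << 16) | digits[i];
        digits[i] = (uint16_t)(cur / divisor);  /* next quotient digit */
        rem = cur % divisor;                    /* carry remainder along */
    }
    return (uint16_t)rem;
}
```

Dividing the two-digit value 0001 2345₁₆ (74565₁₀) by 7 leaves the digits 0000 299C₁₆ (10652₁₀) and returns remainder 1.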

We have seen that, given a divide operation capable of dividing an n digit number by an n digit number, it is possible to divide a dividend with any number of digits by a divisor with n/2 digits. Unfortunately, there is no similar method to deal with an arbitrarily large divisor, or to divide an arbitrarily large dividend by a divisor with more than n/2 digits. In those cases the division must be performed using a general division algorithm as shown previously.

7.4 Big Integer ADT

For some programming tasks, it may be helpful to deal with arbitrarily large integers. For example, the factorial function and Ackermann’s function grow very quickly and will overflow a 32-bit integer for small input values. In this section, we will outline an abstract data type which provides basic operations for arbitrarily large integer values. Listing 7.7 shows the C header for this ADT, and Listing 7.8 shows the C implementation. Listing 7.9 shows a small program that uses the bigint ADT to create a table of x! for all x between 0 and 100.

Listing 7.7 Header file for a big integer abstract data type.
Listing 7.8 C source code file for a big integer abstract data type.
Listing 7.9 Program using the bigint ADT to calculate the factorial function.

The implementation could be made more efficient by writing some of the functions in assembly language. One opportunity for improvement is in the add function, which must calculate the carry from one chunk of bits to the next. In assembly, the programmer has direct access to the carry bit, so carry propagation should be much faster.
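To see why, here is a sketch (our own illustration, not the book's bigint_adc; the name add_chunks is assumed) of what a portable C chunk-wise add must do. Each 32-bit addition needs explicit comparisons to recover the carry that the ARM ADC instruction produces for free:

```c
#include <stdint.h>
#include <stddef.h>

/* Add two arrays of 32-bit chunks (least significant chunk first)
 * with an incoming carry; returns the outgoing carry. In C the
 * carry must be recovered by comparing each sum to an operand. */
uint32_t add_chunks(uint32_t *dst, const uint32_t *a,
                    const uint32_t *b, size_t n, uint32_t carry)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t s = a[i] + carry;
        carry = (s < carry);          /* overflow from adding carry? */
        dst[i] = s + b[i];
        carry += (dst[i] < b[i]);     /* overflow from adding b[i]? */
    }
    return carry;
}
```

In assembly, the entire loop body collapses to one add-with-carry instruction per chunk, which is the main source of the speedup.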

When attempting to speed up a C program by converting selected parts of it to assembly language, it is important to first determine where the most significant gains can be made. A profiler, such as gprof, can be used to help identify the sections of code that will matter most. It is also important to make sure that the result is not just highly optimized C code. If the code cannot benefit from some features offered by assembly, then it may not be worth the effort of re-writing in assembly. The code should be re-written from a pure assembly language viewpoint.

It is also important to avoid premature assembly programming. Make sure that the C algorithms and data structures are efficient before moving to assembly. If a better algorithm can give better performance, then assembly may not be required at all. Once the assembly is written, it is more difficult to make major changes to the data structures and algorithms. Assembly language optimization is the final step in optimization, not the first one.

Well-written C code is modularized, with many small functions. This helps readability, promotes code reuse, and may allow the compiler to achieve better optimization. However, each function call has some associated overhead. If optimal performance is the goal, then calling many small functions should be avoided. For instance, if the piece of code to be optimized is in a loop body, then it may be best to write the entire loop in assembly, rather than writing a function and calling it each time through the loop. Writing in assembly is not a guarantee of performance. Spaghetti code is slow. Load/store instructions are slow. Multiplication and division are slow. The secret to good performance is avoiding things that are slow. Good optimization requires rethinking the code to take advantage of assembly language.

The bigint_adc function was re-written in assembly, as shown in Listing 7.10. This function is used internally by several other functions in the bigint ADT to perform addition and subtraction. The profiler indicated that it is used more than any other function. If assembly language can make this function run faster, then it should have a profound effect on the program.

Listing 7.10 ARM assembly implementation of the bigint_adc function.

The bigfact main function was executed 50 times on a Raspberry Pi, using the C version of bigint_adc and then with the assembly version. The total time required using the C version was 27.65 seconds, and the program spent 54.0% of its time (14.931 seconds) in the bigint_adc function. The assembly version ran in 15.07 seconds, and the program spent 15.3% of its time (2.306 seconds) in the bigint_adc function. Therefore the assembly version of the function achieved a speedup of 6.47 over the C implementation. Overall, the program achieved a speedup of 1.83 by writing one function in assembly.

Running gprof on the improved program reveals that most of the time is now spent in the bigint_mul function (63.2%) and two functions that it calls: bigint_mul_uint (39.1%) and bigint_shift_left_chunk (21.6%). It seems clear that optimizing those two functions would further improve performance.

7.5 Chapter Summary

Complement mathematics provides a method for performing all basic operations using only the complement, add, and shift operations. Addition and subtraction are fast, but multiplication and division are relatively slow. In particular, division should be avoided whenever possible. The exception to this rule is division by a power of the radix, which can be implemented as a shift. Good assembly programmers replace division by a constant c with multiplication by the reciprocal of c. They also replace the multiply instruction with a series of shifts and add or subtract operations when it makes sense to do so. These optimizations can make a big difference in performance.

Writing sections of a program in assembly can result in better performance, but it is not guaranteed. The chance of achieving significant performance improvement is increased if the following rules are used:

1. Only optimize the parts that really matter.

2. Design data structures with assembly in mind.

3. Use efficient algorithms and data structures.

4. Write the assembly code last.

5. Ignore the C version and write good, clean, assembly.

6. Reduce function calls wherever it makes sense.

7. Avoid unnecessary memory accesses.

8. Write good code. The compiler will beat poor assembly every time, but good assembly will beat the compiler every time.

Understanding the basic mathematical operations can enable the assembly programmer to work with integers of any arbitrary size with efficiency that cannot be matched by a C compiler. However, it is best to focus the assembly programming on areas where the greatest gains can be made.

Exercises

7.1 Multiply − 90 by 105 using signed 8-bit binary multiplication to form a signed 16-bit result. Show all of your work.

7.2 Multiply 166 by 105 using unsigned 8-bit binary multiplication to form an unsigned 16-bit result. Show all of your work.

7.3 Write a section of ARM assembly code to multiply the value in r1 by 13₁₀ using only shift and add operations.

7.4 The following code will multiply the value in r0 by a constant C. What is C?

[Listing: shift/add code sequence for Exercise 7.4]

7.5 Show the optimally efficient instruction(s) necessary to multiply a number in register r0 by the constant 67₁₀.

7.6 Show how to divide 78₁₀ by 6₁₀ using binary long division.

7.7 Demonstrate the division algorithm using a sequence of tables as shown in Section 7.3.2 to divide 155₁₀ by 11₁₀.

7.8 When dividing by a constant value, why is it desirable to have m as large as possible?

7.9 Modify your program from Exercise 5.13 in Chapter 5 to produce a 64-bit result, rather than a 32-bit result.

7.10 Modify your program from Exercise 5.13 in Chapter 5 to produce a 128-bit result, rather than a 32-bit result. How would you do this in C?

7.11 Write the bigint_shift_left_chunk function from Listing 7.8 in ARM assembly, and measure the performance improvement.

7.12 Write the bigint_mul_uint function in ARM assembly, and measure the performance improvement.

7.13 Write the bigint_mul function in ARM assembly, and measure the performance improvement.

Chapter 8

Non-Integral Mathematics

Abstract

This chapter starts by demonstrating how to convert fractional numbers to radix notation in any base. It then presents a theorem that can be used to determine in which bases a given fraction will terminate rather than repeating. That theorem is then used to explain why some base ten fractional numbers cannot be represented in binary with a finite number of bits. Next fixed-point numbers are introduced. The rules for addition, subtraction, multiplication, and division are given. Division by a constant is explained in terms of fixed-point mathematics. Next, the IEEE floating point formats are explained. The chapter ends with an example showing how fixed-point mathematics can be used to write functions for sine and cosine which give better precision and higher performance than the functions provided by GCC.

Keywords

Fixed point; Radix point; Non-terminating repeating fraction; S/U notation; Q notation; Floating point; Performance

Chapter 7 introduced methods for performing computation using integers. Although many problems can be solved using only integers, it is often necessary (or at least more convenient) to perform computation using real numbers or even complex numbers. For our purposes, a non-integral number is any number that is not an integer. Many systems are only capable of performing computation using binary integers, and have no hardware support for non-integral calculations. In this chapter, we will examine methods for performing non-integral calculations using only integer operations.

8.1 Base Conversion of Fractional Numbers

Section 1.3.2 explained how to convert integers in a given base into any other base. We will now extend the methods to convert fractional values. A fractional number can be viewed as consisting of an integer part, a radix point, and a fractional part. In base 10, the radix point is also known as the decimal point. In base 2, it is called the binimal point. For base 16, it is the heximal point, and in base 8 it is an octimal point. The term radix point is used as a general term for a location that divides a number into integer and fractional parts, without specifying the base.

8.1.1 Arbitrary Base to Decimal

The procedure for converting fractions from a given base b into base ten is very similar to the procedure used for integers. The only difference is that the digit to the left of the radix point is weighted by b^0 and the exponents become increasingly negative for each digit right of the radix point. The basic procedure is the same for any base b. For example, the value 101.0101₂ can be converted to base ten by expanding it as follows:

1×2^2 + 0×2^1 + 1×2^0 + 0×2^−1 + 1×2^−2 + 0×2^−3 + 1×2^−4 = 4 + 0 + 1 + 0 + 1/4 + 0 + 1/16 = 5.3125₁₀

Likewise, the hexadecimal fraction 4F2.9A0₁₆ can be converted to base ten by expanding it as follows:

4×16^2 + 15×16^1 + 2×16^0 + 9×16^−1 + 10×16^−2 + 0×16^−3 = 1024 + 240 + 2 + 9/16 + 10/256 + 0/4096 = 1266.6015625₁₀
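The expansion procedure is mechanical enough to express directly. The following sketch (our own illustration; the name from_base is an assumption) evaluates integer and fractional digit arrays in an arbitrary base b:

```c
/* Evaluate a number given as integer digits (most significant first)
 * and fractional digits (most significant first) in base b. */
double from_base(const int *ip, int ni, const int *fp, int nf, int b)
{
    double v = 0.0;
    for (int k = 0; k < ni; k++)
        v = v * b + ip[k];        /* Horner evaluation of integer part */
    double w = 1.0;
    for (int k = 0; k < nf; k++) {
        w /= b;                   /* weight b^-(k+1) */
        v += fp[k] * w;
    }
    return v;
}
```

Applied to 101.0101₂ it gives 5.3125, and to 4F2.9A0₁₆ it gives 1266.6015625, matching the expansions above.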

8.1.2 Decimal to Arbitrary Base

When converting from base ten into another base, the integer and fractional parts are treated separately. The base conversion for the integer part is performed in exactly the same way as in Section 1.3.2, using repeated division by the base b. The fractional part is converted using repeated multiplication. For example, to convert the decimal value 5.6875₁₀ to a binary representation:

1. Convert the integer portion, 5₁₀, into its binary equivalent, 101₂.

2. Multiply the decimal fraction by two. The integer part of the result is the first binary digit to the right of the radix point.
Because x = 0.6875 × 2 = 1.375, the first binary digit to the right of the point is a 1. So far, we have 5.6875₁₀ = 101.1₂

3. Multiply the fractional part of x by 2 once again.
Because x = 0.375 × 2 = 0.75, the second binary digit to the right of the point is a 0. So far, we have 5.6875₁₀ = 101.10₂

4. Multiply the fractional part of x by 2 once again.
Because x = 0.75 × 2 = 1.50, the third binary digit to the right of the point is a 1. So now we have 5.6875₁₀ = 101.101₂

5. Multiply the fractional part of x by 2 once again.
Because x = 0.5 × 2 = 1.00, the fourth binary digit to the right of the point is a 1. So now we have 5.6875₁₀ = 101.1011₂

6. Since the fractional part is now zero, we know that all remaining digits will be zero.

The procedure for obtaining the fractional part can be accomplished easily using a tabular method, as shown below:

Operation             Integer   Fraction
0.6875 × 2 = 1.375    1         0.375
0.375 × 2 = 0.75      0         0.75
0.75 × 2 = 1.5        1         0.5
0.5 × 2 = 1.0         1         0.0

Putting it all together, 5.6875₁₀ = 101.1011₂. After converting a fraction from base 10 into another base, the result should be verified by converting back into base 10. The results from the previous example can be expanded as follows:

1×2^2 + 0×2^1 + 1×2^0 + 1×2^−1 + 0×2^−2 + 1×2^−3 + 1×2^−4 = 4 + 0 + 1 + 1/2 + 0 + 1/8 + 1/16 = 5.6875₁₀

Converting decimal fractions to base sixteen is accomplished in a very similar manner. To convert 842.234375₁₀ into base 16, we first convert the integer portion by repeatedly dividing by 16 to yield 34A. We then repeatedly multiply the fractional part, extracting the integer portion of the result each time, as shown in the table below:

Operation               Integer   Fraction
0.234375 × 16 = 3.75    3         0.75
0.75 × 16 = 12.0        12        0.0

In the second line, the integer part is 12, which must be replaced with a hexadecimal digit. The hexadecimal digit for 12₁₀ is C, so the fractional part is 3C. Therefore, 842.234375₁₀ = 34A.3C₁₆. The result is verified by converting it back into base 10 as follows:

3×16^2 + 4×16^1 + 10×16^0 + 3×16^−1 + 12×16^−2 = 768 + 64 + 10 + 3/16 + 12/256 = 842.234375₁₀
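The repeated-multiplication procedure used in both tables above can be sketched as follows (our own illustration; the name frac_to_base is an assumption, and real code must bound the digit count, since many fractions never reach zero):

```c
/* Convert a fraction 0 <= frac < 1 to digits in the given base by
 * repeated multiplication; returns the number of digits produced.
 * Stops when the fraction reaches zero or max_digits is hit. */
int frac_to_base(double frac, int base, int *digits, int max_digits)
{
    int n = 0;
    while (frac > 0.0 && n < max_digits) {
        frac *= base;
        int digit = (int)frac;    /* integer part is the next digit */
        digits[n++] = digit;
        frac -= digit;            /* keep only the fractional part */
    }
    return n;
}
```

For 0.6875 in base 2 it produces the digits 1, 0, 1, 1; for 0.234375 in base 16 it produces 3 and 12 (digit C), matching the tables.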

Bases that are powers-of-two

Converting fractional values between binary, hexadecimal, and octal can be accomplished in the same manner as with integer values. However, care must be taken to align the radix point properly. As with integers, converting from hexadecimal or octal to binary is accomplished by replacing each hex or octal digit with the corresponding binary digits from the appropriate table shown in Fig. 1.3.

For example, to convert 5AC.43B₁₆ to binary, we just replace “5” with “0101,” “A” with “1010,” “C” with “1100,” “4” with “0100,” “3” with “0011,” and “B” with “1011.” So, using the table, we can immediately see that 5AC.43B₁₆ = 010110101100.010000111011₂. This method works exactly the same way for converting from octal to binary, except that it uses the table on the right side of Fig. 1.3.

Converting fractional numbers from binary to hexadecimal or octal is also very easy when using the tables. The procedure is to split the binary string into groups of bits, working outwards from the radix point, then replace each group with its hexadecimal or octal equivalent. For example, to convert 01110010.1010111₂ to hexadecimal, just divide the number into groups of four bits, starting at the radix point and working outwards in both directions. It may be necessary to pad with zeroes to make a complete group on the left or right, or both. Our example is grouped as follows: |0000|0111|0010.1010|1110|₂. Now each group of four bits is converted to hexadecimal by looking up the corresponding hex digit in the table on the left side of Fig. 1.3. This yields 072.AE₁₆. For octal, the binary number would be grouped as follows: |001|110|010.101|011|100|₂. Now each group of three bits is converted to octal by looking up the corresponding digit in the table on the right side of Fig. 1.3. This yields 162.534₈.

8.2 Fractions and Bases

One interesting phenomenon that is often encountered is that fractions which terminate in one base may become non-terminating, repeating fractions in another base. For example, the binary representation of the decimal fraction 1/10 is a repeating fraction, as shown in Example 8.1. The resulting fractional part from the last step performed is exactly the same as in the second step. Therefore, the sequence will repeat. If we continue, we will repeat the sequence of steps 2–5 forever. Hence, the final binary representation will be:

0.1₁₀ = 0.000110011001100110…₂, where the four-bit group 0011 repeats forever.

Because of this phenomenon, it is impossible to exactly represent 1.10₁₀ (and many other fractional quantities) as a binary fraction in a finite number of bits.

Example 8.1

A Non-Terminating, Repeating Binimal

.1 × 2 = 0.2
.2 × 2 = 0.4
.4 × 2 = 0.8
.8 × 2 = 1.6
.6 × 2 = 1.2
.2 × 2 = 0.4

The fact that some base 10 fractions cannot be exactly represented in binary has led to many subtle software bugs and round-off errors, when programmers attempt to work with currency (and other quantities) as real-valued numbers. In this section, we explore the idea that the representation problem can be avoided by working in some base other than base 2. If that is the case, then we can simply build hardware (or software) to work in that base, and will be able to represent any fractional value precisely using a finite number of digits. For brevity, we will refer to a binary fractional quantity as a binimal and a decimal fractional quantity as a decimal. We would like to know whether there are more non-terminating decimals than binimals, more non-terminating binimals than decimals, or neither. Since there are an infinite number of non-terminating decimals and an infinite number of non-terminating binimals, we could be tempted to conclude that they are equal. However, that is an oversimplification. If we ask the question differently, we can discover some important information. A better way to ask the question is as follows:

Question: Is the set of terminating decimals a subset of the set of terminating binimals, or vice versa, or neither?

We start by introducing a lemma which can be used to predict whether or not a terminating fraction in one base will terminate in another base. We introduce the notation x|y (read as “x divides y”) to indicate that y can be evenly divided by x.

Lemma 8.2.1

If x, 0 < x < 1, terminates in some base B, then x = Nx/Dx, where Dx = p1^k1 p2^k2 ⋯ pn^kn and the pi are prime factors of B.

Proof

Let x = Nx/Dx, where Dx = p1^k1 p2^k2 ⋯ pn^kn and the pi are prime factors of B. Then Dx | Nx × B^kmax, where kmax = max(k1,k2,…,kn), so x = Nx/Dx terminates after kmax or fewer divisions.

Let x = Nx/Dx terminate after k divisions. Then Dx | Nx × B^k. Since Dx does not evenly divide Nx, Dx must be composed of some combination of the prime factors of B. Thus, Dx can be expressed as p1^k1 p2^k2 ⋯ pn^kn.

Theorem 8.2.1

The set of terminating binimals is a subset of the set of terminating decimals.

Proof

Let b be a terminating binimal. Then, by Lemma 8.2.1, b = Nb/Db, such that Db = 2^k for some k ≥ 0. Therefore, Db = 2^k5^m with m = 0, and again by the Lemma (two and five being the prime factors of ten), b is also a terminating decimal.

Theorem 8.2.2

The set of terminating decimals is not a subset of the set of terminating binimals.

Proof

Let d be a terminating decimal such that d = Nd/Dd, where Dd = 2^k5^m. If m > 0, then by the Lemma, d is a non-terminating binimal.

Answer: The set of terminating binimals is a subset of the set of terminating decimals, but the set of terminating decimals is not a subset of the set of terminating binimals.

Implications

Theorem 8.2.1 implies that any binary fraction can be expressed exactly as a decimal fraction, but Theorem 8.2.2 implies that there are decimal fractions which cannot be expressed exactly in binary. Every fraction (when expressed in lowest terms) which has a non-zero power of five in its denominator cannot be represented in binary with a finite number of bits. Another implication is that some fractions cannot be expressed exactly in either binary or decimal. For example, let B = 30 = 2 × 3 × 5. Then any number with denominator 2^k1 3^k2 5^k3 terminates in base 30. However if k2 ≠ 0, then the fraction will terminate in neither base two nor base ten, because three is not a prime factor of ten or two.

Another implication of the theorem is that the more prime factors we have in our base, the more fractions we can express exactly. For instance, the smallest base that has two, three, and five as prime factors is base 30. Using that base, we can exactly express fractions in radix notation that cannot be expressed in base ten or in base two with a finite number of digits. For example, in base 30, the fraction 11/15 will terminate after one division since 15 = 3^1 5^1. To see what the number will look like, let us extend the hexadecimal system of using letters to represent digits beyond 9. So we get this chart for base 30:

Decimal:  0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
Base 30:  0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T

Since 11/15 = 22/30, the fraction can be expressed precisely as 0.M₃₀. Likewise, the fraction 13/45 is the repeating decimal 0.2888…₁₀ but terminates in base 30. Since 45 = 3^2 5^1, this number will have two or fewer digits following the radix point. To compute the value, we will have to raise it to higher terms. Using 30^2 as the denominator gives us:

13/45 = 260/900

Now we can convert it to base 30 by repeated division. 260 ÷ 30 = 8 with remainder 20. Since 20 < 30, we cannot divide again. Therefore, 13/45 in base 30 is 0.8K₃₀.

Although base 30 can represent all fractions that can be expressed in bases two and ten, there are still fractions that cannot be represented in base 30. For example, 1/7 has the prime factor seven in its denominator, and therefore will only terminate in bases where seven is a prime factor of the base. The fraction 1/7 will terminate in base 7, base 14, base 21, base 42, and many others, but not in base 30. Since there are an infinite number of primes, no number system is immune from this problem. No matter what base the computer works in, there are fractions that cannot be expressed exactly with a finite number of digits. Therefore, it is incumbent upon programmers and hardware designers to be aware of round-off errors and take appropriate steps to minimize their effects.
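Lemma 8.2.1 gives a direct test: a fraction N/D in lowest terms terminates in base B exactly when every prime factor of D also divides B. A sketch of the test (our own illustration; the function names are assumptions), which repeatedly strips from the denominator every factor it shares with the base:

```c
#include <stdbool.h>

static unsigned gcd(unsigned a, unsigned b)
{
    while (b) { unsigned t = a % b; a = b; b = t; }
    return a;
}

/* Does a fraction with denominator d (in lowest terms) terminate
 * in base b? Strip every factor d shares with b; it terminates
 * exactly when nothing is left but 1. */
bool terminates(unsigned d, unsigned b)
{
    unsigned g;
    while (d > 1 && (g = gcd(d, b)) > 1)
        d /= g;
    return d == 1;
}
```

It confirms the results above: 1/10 does not terminate in base 2, 13/45 terminates in base 30, and 1/7 terminates in base 14 but not in base 30.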

For example, there is no reason why the hardware clocks in a computer should work in base ten. They can be manufactured to measure time in base two. Instead of counting seconds in tenths, hundredths or thousandths, they could be calibrated to measure in fourths, eighths, sixteenths, 1024ths, etc. This would eliminate the round-off error problem in keeping track of time.

8.3 Fixed-Point Numbers

As shown in the previous section, given a finite number of bits, a computer can only approximately represent non-integral numbers. It is often necessary to accept that limitation and perform computations involving approximate values. With due care and diligence, the results will be accurate within some acceptable error tolerance. One way to deal with real-valued numbers is to simply treat the data as fixed-point numbers. Fixed-point numbers are treated as integers, but the programmer must keep track of the radix point during each operation. We will present a systematic approach to designing fixed-point calculations.

When using fixed-point arithmetic, the programmer needs a convenient way to describe the numbers that are being used. Most languages have standard data types for integers and floating point numbers, but very few have support for fixed-point numbers. Notable exceptions include PL/1 and Ada, which provide support for fixed-point binary and fixed-point decimal numbers. We will focus on fixed-point binary, but the techniques presented can also be applied to fixed-point numbers in any base.

8.3.1 Interpreting Fixed-Point Numbers

Each fixed-point binary number has three important parameters that describe it:

1. whether the number is signed or unsigned,

2. the position of the radix point in relation to the right side of the sign bit (for signed numbers) or the position of the radix point in relation to the most significant bit (for unsigned numbers), and

3. the number of fractional bits stored.

Unsigned fixed-point numbers will be specified as U(i,f), where i is the position of the radix point in relation to the left side of the most significant bit, and f is the number of bits stored in the fractional part.

For example, U(10,6) indicates that there are six bits of precision in the fractional part of the number, and the radix point is ten bits to the right of the most significant bit stored. The layout for this number is shown graphically as:

[Figure: iiiiiiiiii.ffffff — ten integer bits followed by six fractional bits]

where i is an integer bit and f is a fractional bit. Very small numbers with no integer part may have a negative i. For example, U(−8,16) specifies an unsigned number with no integer part, eight leading zero bits which are not actually stored, and 16 bits of fractional precision. The layout for this number is shown graphically as:

[Figure: .00000000ffffffffffffffff — radix point, eight implied zero bits, then the 16 stored fractional bits]

Likewise, signed fixed-point numbers will be specified using the following notation: S(i,f), where i is the position of the radix point in relation to the right side of the sign bit, and f is the number of fractional bits stored. As with integer two’s-complement notation, the sign bit is always the leftmost bit stored. For example, S(9,6) indicates that there are six bits in the fractional part of the number, and the radix point is nine bits to the right of the sign bit. The layout for this number is shown graphically as:

[Figure: s iiiiiiiii.ffffff — sign bit, nine integer bits, six fractional bits]

where i is an integer bit and f is a fractional bit. Very small numbers with no integer part may have a negative i. For example, S(−7,16) specifies a signed number with no integer part, six leading sign bits which are not actually stored, a sign bit that is stored and 15 bits of fraction. The layout for this number is shown graphically as:

[Figure: .ssssss s fffffffffffffff — six implied copies of the sign bit, the stored sign bit, then 15 stored fractional bits]

Note that the “hidden” bits in a signed number are assumed to be copies of the sign bit, while the “hidden” bits in an unsigned number are assumed to be zero.

The following figure shows an unsigned fixed-point number with seven bits in the integer part and nine bits in the fractional part. It is a U(7,9) number. Note that the total number of bits is 7 + 9 = 16.

[Figure: 0010111.000101011 — a U(7,9) number]

The value of this number in base 10 can be computed by summing the values of each non-zero bit as follows:

2^(13−9) + 2^(11−9) + 2^(10−9) + 2^(9−9) + 2^(5−9) + 2^(3−9) + 2^(1−9) + 2^(0−9) = 2^4 + 2^2 + 2^1 + 2^0 + 2^−4 + 2^−6 + 2^−8 + 2^−9 = 16 + 4 + 2 + 1 + 1/16 + 1/64 + 1/256 + 1/512 = 23.083984375₁₀

Likewise, the following figure shows a signed fixed-point number with nine bits in the integer part and six bits in the fractional part. It is an S(9,6) number. Note that the total number of bits is 9 + 6 + 1 = 16.

[Figure: 0 010111000.101011 — the same 16 bits interpreted as an S(9,6) number]

The value of this number in base 10 can be computed by summing the values of each non-zero bit as follows:

2^(13−6) + 2^(11−6) + 2^(10−6) + 2^(9−6) + 2^(5−6) + 2^(3−6) + 2^(1−6) + 2^(0−6) = 2^7 + 2^5 + 2^4 + 2^3 + 2^−1 + 2^−3 + 2^−5 + 2^−6 = 128 + 32 + 16 + 8 + 1/2 + 1/8 + 1/32 + 1/64 = 184.671875₁₀

Note that in the above two examples, the patterns of bits are identical. The value of a number depends upon how it is interpreted. The notation that we have introduced allows us to easily specify exactly how a number is to be interpreted. For signed values, if the first bit is non-zero, then the two’s complement should be taken before the number is evaluated. For example, the following figure shows an S(8,7) number that has a negative value.

[Figure: 1 01101010.1111010 — an S(8,7) number with a negative value]

The value of this number in base 10 can be computed by taking the two’s complement, summing the values of the non-zero bits, and adding a negative sign to the result. The two’s complement of 1011010101111010 is 0100101010000101 + 1 = 0100101010000110. The value of this number is:

2^(14−7) + 2^(11−7) + 2^(9−7) + 2^(7−7) + 2^(2−7) + 2^(1−7) = 2^7 + 2^4 + 2^2 + 2^0 + 2^−5 + 2^−6 = 128 + 16 + 4 + 1 + 1/32 + 1/64 = 149.046875₁₀

For a final example we will interpret this bit pattern as an S(−5,16). In that format, the layout is:

[Figure: .ssss s fffffffffffffff — four implied copies of the sign bit, the stored sign bit, then the 15 stored fractional bits]

The value of this number in base ten can be computed by taking the two’s complement, summing the values of the non-zero bits, and adding a negative sign to the result. The two’s complement is:

[Figure: two's complement of 1011010101111010, giving 0100101010000110]

The value of this number interpreted as an S(−5,16) is:

2^−6 + 2^−9 + 2^−11 + 2^−13 + 2^−18 + 2^−19 = 0.0181941986083984375
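For formats in which every bit is stored (i ≥ 0), interpreting a pattern is just a scaled integer read. The following sketch is our own helper, not from the book; it handles 16-bit S(i,f) and U(i,f) patterns:

```c
#include <stdbool.h>
#include <stdint.h>

/* Interpret a 16-bit pattern as a fixed-point number with f
 * fractional bits: sign-extend if the format is signed, then
 * scale by 2^-f. Covers S(i,f)/U(i,f) formats with i >= 0. */
double fixed_value(uint16_t bits, int f, bool is_signed)
{
    int32_t v = is_signed ? (int32_t)(int16_t)bits : (int32_t)bits;
    return (double)v / (double)(1 << f);
}
```

It reproduces the examples above: the pattern 2E2B₁₆ reads as 23.083984375 in U(7,9) and as 184.671875 in S(9,6), while B57A₁₆ reads as −149.046875 in S(8,7).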

8.3.2 Q Notation

Fixed-point number formats can also be represented using Q notation, which was developed by Texas Instruments. Q notation is equivalent to the S/U format used in this book, except that the integer portion is not always fully specified. In general, Q formats are specified as Qm,n, where m is the number of integer bits and n is the number of fractional bits. If a fixed word size w is being used, then m may be omitted and is assumed to be w − n. For example, a Q10 number has 10 fractional bits, and the number of integer bits is not specified, but is assumed to be the number of bits required to complete a word of data. A Q2,4 number has two integer bits and four fractional bits in a 6-bit word. There are two conflicting conventions for dealing with the sign bit: in one convention, the sign bit is included as part of m, and in the other, it is not. When using Q notation, it is important to state which convention is being used. Additionally, a U may be prefixed to indicate an unsigned value. For example, UQ8.8 is equivalent to U(8,8), and Q7,9 is equivalent to S(7,9).

8.3.3 Properties of Fixed-Point Numbers

Once the decision has been made to use fixed-point calculations, the programmer must make some decisions about the specific representation of each fixed-point variable. The combination of size and radix will affect several properties of the numbers, including:

Precision: the maximum number of non-zero bits representable,

Resolution: the smallest non-zero magnitude representable,

Accuracy: the magnitude of the maximum difference between a true real value and its approximate representation,

Range: the difference between the largest and smallest number that can be represented, and

Dynamic range: the ratio of the maximum absolute value to the minimum positive absolute value representable.

Given a number specified using the notation introduced previously, we can determine its properties. For example, an S(9,6) number has the following properties:

Precision: P = 16 bits

Resolution: R = 2^(−6) = 0.015625

Accuracy: A = R/2 = 0.0078125

Range: Minimum value is 1000000000.000000 = −512. Maximum value is 0111111111.111111 = 511.984375. The range is G = 511.984375 + 512 = 1023.984375.

Dynamic range: For a signed fixed-point rational representation, S(i,f), the dynamic range is

D = (2 × 2^i) / 2^(−f) = 2^(i+f+1) = 2^P.

Therefore, the dynamic range of an S(9,6) is 2^16 = 65536.

Being aware of these properties, the programmer can select fixed-point representations that fit the task at hand. This allows the programmer to strive for very efficient code by using the smallest fixed-point representation possible, while still guaranteeing that the results of computations will be within some limits for error tolerance.

8.4 Fixed-Point Operations

Fixed-point numbers are actually stored as integers, and all of the integer mathematical operations can be used. However, some care must be taken to track the radix point at each stage of the computation. The advantages of fixed-point calculations are that the operations are very fast and can be performed on any computer, even if it does not have special hardware support for non-integral numbers.

8.4.1 Fixed-Point Addition and Subtraction

Fixed-point addition and subtraction work exactly like their integer counterparts. Fig. 8.1 gives some examples of fixed-point addition with signed numbers. Note that in each case, the numbers are aligned so that they have the same number of bits in their fractional part. This requirement is the only difference between integer and fixed-point addition. In fact, integer arithmetic is just fixed-point arithmetic with no bits in the fractional part. The arithmetic that was covered in Chapter 7 was fixed-point arithmetic using only S(i,0) and U(i,0) numbers. Now we are simply extending our knowledge to deal with numbers where f≠0. There are some rules which must be followed to ensure that the results are correct. The rules for subtraction are the same as the rules for addition. Since we are using two’s complement math, subtraction is performed using addition.

f08-01-9780128036983
Figure 8.1 Examples of fixed-point signed arithmetic.

Suppose we want to add an S(7,8) number to an S(7,4) number. The radix points are at different locations, so we cannot simply add them. Instead, we must shift one of the numbers, changing its format, until the radix points are aligned. The choice of which one to shift depends on what format we desire for the result. If we desire eight bits of fraction in our result, then we would shift the S(7,4) left by four bits, converting it into an S(7,8). With the radix points aligned, we simply use an integer addition operation to add the two numbers. The result will have its radix point in the same location as the two numbers being added.

8.4.2 Fixed Point Multiplication

Recall that the result of multiplying an n bit number by an m bit number is an n + m bit number. In the case of fixed-point numbers, the size of the fractional part of the result is the sum of the number of fractional bits of each number, and the total size of the result is the sum of the total number of bits in each number. Consider the following example where two U(5,3) numbers are multiplied together:

f08-39-9780128036983

The result is a U(10,6) number. The number of bits in the result is the sum of all of the bits of the multiplicand and the multiplier. The number of fractional bits in the result is the sum of the number of fractional bits in the multiplicand and the multiplier. There are three simple rules to predict the resulting format when multiplying any two fixed-point numbers.

Unsigned Multiplication The result of multiplying two unsigned numbers U(i1,f1) and U(i2,f2) is a U(i1 + i2,f1 + f2) number.

Mixed Multiplication The result of multiplying a signed number S(i1,f1) and an unsigned number U(i2,f2) is an S(i1 + i2,f1 + f2) number.

Signed Multiplication The result of multiplying two signed numbers S(i1,f1) and S(i2,f2) is an S(i1 + i2 + 1,f1 + f2) number.

Note that this rule works for integers as well as fixed-point numbers, since integers are really fixed-point numbers with f = 0. If the programmer desires a particular format for the result, then the multiply is followed by an appropriate shift.

Listing 8.1 gives some examples of fixed-point multiplication using the ARM multiply instructions. In each case, the result is shifted to produce the desired format. It is the responsibility of the programmer to know what type of fixed-point number is produced after each multiplication and to adjust the result by shifting if necessary.

f08-02-9780128036983
Listing 8.1 Examples of fixed-point multiplication in ARM assembly.

8.4.3 Fixed Point Division

Derivation of the rule for determining the format of the result of division is more complicated than the one for multiplication. We will first consider only unsigned division of a dividend with format U(i1,f1) by a divisor with format U(i2,f2).

Results of fixed point division

Consider the results of dividing two fixed-point numbers, using integer operations with limited precision. The value of the least significant bit of the dividend N is 2^(−f1) and the value of the least significant bit of the divisor D is 2^(−f2). In order to perform the division using integer operations, it is necessary to multiply N by 2^(f1) and multiply D by 2^(f2) so that both numbers are integers. Therefore, the division operation can be written as:

Q = (N × 2^(f1)) / (D × 2^(f2)) = (N / D) × 2^(f1−f2).

Note that no multiplication is actually performed. Instead, the programmer mentally shifts the radix point of the divisor and dividend, then computes the radix point of the result. For example, given two U(5,3) numbers, the division operation is accomplished by converting them both to integers, performing the division, then computing the location of the radix point:

Q = (N × 2^3) / (D × 2^3) = (N / D) × 2^0.

Note that the result is an integer. If the programmer wants to have some fractional bits in the result, then the dividend must be shifted to the left before the division is performed.

If the programmer wants to have fq fractional bits in the quotient, then the amount that the dividend must be shifted can easily be computed as

s = fq − f1 + f2.

For example, suppose the programmer wants to divide 01001.011 stored as a U(28,3) by 00011.110 which is also stored as a U(28,3), and wishes to have six fractional bits in the result. The programmer would first shift 01001.011 to the left by six bits, then perform the division and compute the position of the radix in the result as shown:

f08-41-9780128036983

Since the divisor may be between zero and one, the quotient may actually require more integer bits than there are in the dividend. Consider that the largest possible value of the dividend is N_max = 2^(i1) − 2^(−f1), and the smallest positive value for the divisor is D_min = 2^(−f2). Therefore, the maximum quotient is given by:

Q_max = (2^(i1) − 2^(−f1)) / 2^(−f2) = 2^(i1+f2) − 2^(f2−f1).

Taking the limit of the previous equation as f1 grows,

lim(f1→∞) Q_max = 2^(i1+f2),

provides the following bound on how many bits are required in the integer part of the quotient:

Q_max < 2^(i1+f2).

Therefore, in the worst case, the quotient will require i1 + f2 integer bits. For example, if we divide a U(3,5), a = 111.11111 = 7.96875₁₀, by a U(5,3), b = 00000.001 = 0.125₁₀, we end up with a U(6,2), q = 111111.11 = 63.75₁₀.

The same thought process can be used to determine the results for signed division as well as mixed division between signed and unsigned numbers. The results can be reduced to the following three rules:

Unsigned Division The result of dividing an unsigned fixed-point number U(i1,f1) by an unsigned number U(i2,f2) is a U(i1 + f2, f1 − f2) number.

Mixed Division The result of dividing two fixed-point numbers where one of them is signed and the other is unsigned is an S(i1 + f2, f1 − f2) number.

Signed Division The result of dividing two signed fixed-point numbers is an S(i1 + f2 + 1, f1 − f2) number.

Consider the results when a U(3,3), a = 000.001 = 0.125₁₀, is divided by a U(4,5), b = 1000.00000 = 8.0₁₀. The quotient is q = 0.000001, which requires six bits in the fractional part. However, if we simply perform the division, then according to the rules shown above, the result will be a U(8,−2). There is no such thing as a U(8,−2), so the result is meaningless.

When f2 > f1, blindly applying the rules will result in a negative fractional part. To avoid this, the dividend can be shifted left so that it has at least as many fractional bits as the divisor. This leads to the following rule: if f2 > f1, then convert the dividend to an S(i1,x), where x ≥ f2, then apply the appropriate rule. For example, dividing an S(5,2) by a U(3,12) would result in an S(17,−10). But shifting the S(5,2) 16 bits to the left will result in an S(5,18), and dividing that by a U(3,12) will result in an S(17,6).

Maintaining precision

Recall that integer division produces a result and a remainder. In order to maintain precision, it is necessary to perform the integer division operation in such a way that all of the significant bits are in the result and only insignificant bits are left in the remainder. The easiest way to accomplish this is by shifting the dividend to the left before the division is performed.

To find a rule for determining the shift necessary to maintain full precision in the quotient, consider the worst case. The minimum positive value of the dividend is N_min = 2^(−f1) and the largest positive value for the divisor is D_max = 2^(i2) − 2^(−f2). Therefore, the minimum positive quotient is given by:

Q_min = 2^(−f1) / (2^(i2) − 2^(−f2)) ≈ 2^(−f1) / 2^(i2) = 2^(−(i2+f1)).

Therefore, in the worst case, the quotient will require i2 + f1 fractional bits to maintain precision. However, fewer bits can be reserved if full precision is not required.

Recall that the least significant bit of the quotient will be 2^(−(i2+f1)). Shifting the dividend left by i2 + f2 bits will convert it into a U(i1, i2 + f1 + f2). Using the rule above, when it is divided by a U(i2,f2), the result is a U(i1 + f2, i2 + f1). This is the minimum size which is guaranteed to preserve all bits of precision. The general method for performing fixed-point division while maintaining maximum precision is as follows:

1. shift the dividend left by i2 + f2, then

2. perform integer division.

The result will be a U(i1 + f2,i2 + f1) for unsigned division, or an S(i1 + f2 + 1,i2 + f1) for signed division. The result for mixed division is left as an exercise for the student.

8.4.4 Division by a Constant

Section 7.3.3 introduced the idea of converting division by a constant into multiplication by the reciprocal of that constant. In that section it was shown that by pre-multiplying the reciprocal by a power of two (a shift operation), then dividing the final result by the same power of two (a shift operation), division by a constant could be performed using only integer operations with a more efficient multiply replacing the (usually) very slow divide.

This section presents an alternate way to achieve the same results, by treating division by an integer constant as an application of fixed-point multiplication. Again, the integer constant divisor is converted into its reciprocal, but this time the process is considered from the viewpoint of fixed-point mathematics. Both methods will achieve exactly the same results, but some people tend to grasp the fixed-point approach better than the purely integer approach.

When writing code to divide by a constant, the programmer must strive to achieve the largest number of significant bits possible, while using the shortest (and most efficient) representation possible. On modern computers, this usually means using 32-bit integers and integer multiply operations which produce 64-bit results. That would be extremely tedious to show in a textbook, so the principles will be demonstrated here using 8-bit integers and an integer multiply which produces a 16-bit result.

Division by constant 23

Suppose we want efficient code to calculate x ÷ 23 using only 8-bit signed integer multiplication. The reciprocal of 23, in binary, is

R = 1/23 = 0.0000101100100001011…₂.

If we store R as an S(1,11), it would look like this:

u08-10-9780128036983

Note that in this format, the reciprocal of 23 has five leading zeros. We can store R in eight bits by shifting it left to remove some of the leading zeros. Each shift to the left changes the format of R. After removing the first leading zero bit, we have:

u08-11-9780128036983

After removing the second leading zero bit, we have:

u08-12-9780128036983

After removing the third leading zero bit, we have:

u08-13-9780128036983

Note that the number in the previous format has a “hidden” bit between the radix point and the sign bit. That bit is not actually stored, but is assumed to be identical to the sign bit. Removing the fourth leading zero produces:

u08-14-9780128036983

The number in the previous format has two “hidden” bits between the radix point and the sign bit. Those bits are not actually stored, but are assumed to be identical to the sign bit. Removing the fifth leading zero produces:

u08-15-9780128036983

We can only remove five leading zero bits, because removing one more would change the sign bit from 0 to 1, resulting in a completely different number. Note that the final format has three “hidden” bits between the radix point and the sign bit. These bits are all copies of the sign bit. It is an S(−4,8) number because the sign is four bits to the right of the radix point (resulting in the three “hidden” bits). According to the rules of fixed-point multiplication given earlier, an S(7,0) number x multiplied by an S(−4,8) number R will yield an S(4,8) number y. The value y will be 2^3 × (x/23), because we have three “hidden” bits to the right of the radix point. Therefore,

x/23 = R × x × 2^(−3),

indicating that after the multiplication, we must shift the result right by three bits to restore the radix. Since 1/23 is positive, the number R must be increased by one to avoid round-off error. Therefore, we will use R + 1 = 01011010 = 90₁₀ in our multiply operation. To calculate y = 101₁₀ ÷ 23₁₀, we can multiply and perform a shift as follows:

f08-38-9780128036983

Because our task is to implement integer division, everything to the right of the radix point can be immediately discarded, keeping only the upper eight bits as the integer portion of the result. The integer portion, 100011₂, shifted right three bits, is 100₂ = 4₁₀. If the modulus is required, it can be calculated as: 101 − (4 × 23) = 9. Some processors, such as the Motorola HC11, have a special multiply instruction which keeps only the upper half of the result. This method would be especially efficient on that processor. Listing 8.2 shows how the 8-bit division code would be implemented in ARM assembly. Listing 8.3 shows an alternate implementation which uses shift and add operations rather than a multiply.

f08-03-9780128036983
Listing 8.2 Dividing x by 23
f08-04-9780128036983
Listing 8.3 Dividing x by 23 Using Only Shift and Add

Division by constant −50

The procedure is exactly the same for dividing by a negative constant. Suppose we want efficient code to calculate x ÷ (−50) using 16-bit signed integers. We first convert 1/50 into binary:

1/50 = 0.0000010100011110…₂ (the fractional bit pattern repeats).

The two’s complement of 1/50 is

−1/50 = 1.1111101011100001…₂.

We can represent −1/50 as the following S(1,21) fixed-point number:

u08-16-9780128036983

Note that the upper seven bits are all one. We can remove six of those bits and adjust the format as follows. After removing the first leading one, the reciprocal is:

u08-17-9780128036983

Removing another leading one changes the format to:

u08-18-9780128036983

On the next step, the format is:

u08-19-9780128036983

Note that we now have a “hidden” bit between the radix point and the sign bit. The hidden bit is not actually part of the number that we store and use in the computation, but it is assumed to be the same as the sign bit.

After three more leading ones are removed, the format is:

u08-20-9780128036983

Note that there are four “hidden” bits between the radix point and the sign. Since the reciprocal −1/50 is negative, we do not need to round by adding one to the number R. Therefore, we will use R = 1010111000010101₂ = AE15₁₆ in our multiply operation.

Since we are using 16-bit integer operations, the dividend, x, will be an S(15,0). The product of an S(15,0) and an S(−5,16) will be an S(11,16). We will remove the 16 fractional bits by shifting right. The four “hidden” bits indicate that the result must be shifted an additional four bits to the right, resulting in a total shift of 20 bits. Listing 8.4 shows how the 16-bit division code would be implemented in ARM assembly.

f08-05-9780128036983
Listing 8.4 Dividing x by −50

8.5 Floating Point Numbers

Sometimes we need more range than we can easily get from fixed-point representations. One approach to solving this problem is to create an aggregate data type that can represent a fractional number by having fields for an exponent, a sign bit, and an integer mantissa. For example, in C, we could represent a fractional number using the data structure shown in Listing 8.5. That data structure, along with some subroutines for addition, subtraction, multiplication, and division, would provide the capability to perform arithmetic without explicitly tracking the radix point. The subroutines for the basic arithmetic operations could do that, thereby freeing the programmer to work at a higher level.

f08-06-9780128036983
Listing 8.5 Inefficient representation of a binimal.

The structure shown in Listing 8.5 is a rather inefficient way to represent a fractional number, and may create different data structures on different machines. The sign only requires one bit, and the sizes of the exponent and mantissa fields depend upon the machine on which the code is compiled. We really only need one bit for the sign, eight bits for the exponent, and 23 bits for the mantissa.

The C language includes the notion of bit fields. This allows the programmer to specify exactly how many bits are to be used for each field within a struct. Listing 8.6 shows a C data structure that consumes 32 bits on all machines and architectures. It provides the same fields as the structure in Listing 8.5, but specifies exactly how many bits each field consumes.

f08-07-9780128036983
Listing 8.6 Efficient representation of a binimal.

The compiler will compress this data structure into 32 bits, regardless of the natural word size of the machine.

The method of representing fractional numbers as a sign, exponent, and mantissa is very powerful, and the IEEE has set standards for various floating point formats. These formats can be described using bit fields in C, as described above. Many processors have hardware that is specifically designed to perform arithmetic using the standard IEEE formatted data. The following sections highlight the most common of the IEEE numerical formats.

The IEEE standard specifies the bitwise representation for numbers, and specifies parameters for how arithmetic is to be performed. The IEEE standard for numbers includes the possibility of having numbers that cannot be easily represented. For example, any quantity that is greater than the most positive representable value is positive infinity, and any quantity that is less than the most negative representable value is negative infinity. There are special bit patterns to encode these quantities. The programmer or hardware designer is responsible for ensuring that their implementation conforms to the IEEE standards. The following sections describe some of the IEEE standard data formats.

8.5.1 IEEE 754 Half-Precision

The half-precision format gives a 16-bit encoding for fractional numbers with a small range and low precision. There are situations where this format is adequate. If the computation is being performed on a very small machine, then using this format may result in significantly better performance than could be attained using one of the larger IEEE formats. However, in most situations, the programmer can achieve better performance and/or precision by using a fixed-point representation. The format is as follows:

u08-21-9780128036983

 The Significand (a.k.a. “Mantissa”) is stored using a sign-magnitude coding, with bit 15 being the sign bit.

 The exponent is an excess-15 number. That is, the number stored is 15 greater than the actual exponent.

 There are 10 bits of significand, but there are 11 bits of significand precision. There is a “hidden” bit, m10, between m9 and e0. When a number is stored in this format, it is shifted until its leftmost non-zero bit is in the hidden bit position, and the hidden bit is not actually stored. The exception to this rule is when the number is zero or very close to zero. The radix point is assumed to be between the hidden bit and the first bit stored. The radix point is then shifted by the exponent.

Table 8.1 shows how to interpret IEEE 754 Half-Precision numbers. The exponents 00000 and 11111 have special meaning. The value 00000 is used to represent zero and numbers very close to zero, and the exponent value 11111 is used to represent infinity and NaN. NaN, which is the abbreviation for not a number, is a value representing an undefined or unrepresentable value. One way to get NaN as a result is to divide infinity by infinity. Another is to divide zero by zero. The NaN value can indicate that there is a bug in the program, or that a calculation must be performed using a different method.

Table 8.1

Format for IEEE 754 half-precision

Exponent       Significand = 0     Significand ≠ 0    Equation
00000          ±0                  subnormal          (−1)^sign × 2^(−14) × 0.significand
00001…11110    normalized value    normalized value   (−1)^sign × 2^(exp−15) × 1.significand
11111          ±∞                  NaN

Subnormal means that the value is too close to zero to be completely normalized. The minimum strictly positive (subnormal) value is 2^(−24) ≈ 5.96 × 10^(−8). The minimum positive normal value is 2^(−14) ≈ 6.10 × 10^(−5). The maximum exactly representable value is (2 − 2^(−10)) × 2^15 = 65504.

Examples

The following bit value:

u08-22-9780128036983

represents

+1.1000101011₂ × 2^(01011₂ − 01111₂) = 1.1000101011₂ × 2^(−4) = 0.00011000101011₂ ≈ 0.09637₁₀.

The following bit value:

u08-23-9780128036983

represents

−1.0000100101₂ × 2^(11001₂ − 01111₂) = −1.0000100101₂ × 2^10 = −10000100101.0₂ = −1061₁₀.

8.5.2 IEEE 754 Single-Precision

The single precision format provides a 23-bit mantissa and an 8-bit exponent, which is enough to represent a reasonably large range with reasonable precision. This type can be stored in 32 bits, so it is relatively compact. At the time that the IEEE standards were defined, most machines used a 32-bit word, and were optimized for moving and processing data in 32-bit quantities. For many applications this format represents a good trade-off between performance and precision.

u08-24-9780128036983

8.5.3 IEEE 754 Double-Precision

The double-precision format was designed to provide enough range and precision for most scientific computing requirements. It provides an 11-bit exponent and a 53-bit mantissa (including the hidden bit). When the IEEE 754 standard was introduced, this format was not supported by most hardware. That has changed. Most modern floating point hardware is optimized for the IEEE 754 double-precision standard, and most modern processors are designed to move 64-bit or larger quantities. On modern floating-point hardware, this is the most efficient representation.

However, processing large arrays of double-precision data requires twice as much memory, and twice as much memory bandwidth, as single-precision.

u08-25-9780128036983

8.5.4 IEEE 754 Quad-Precision

The IEEE 754 Quad-Precision format was designed to provide enough range and precision for very demanding applications. It provides a 15-bit exponent and a 113-bit mantissa (including the hidden bit). This format is still not supported by most hardware. The first hardware floating point unit to support this format was the SPARC V8 architecture. As of this writing, the popular Intel x86 family, including the 64-bit versions of the processor, do not have hardware support for the IEEE 754 quad-precision format. On modern high-end processors such as the SPARC, this may be an efficient representation. However, for mid-range processors such as the Intel x86 family and the ARM, this format is definitely out of their league.

u08-26-9780128036983

8.6 Floating Point Operations

Many processors do not have hardware support for floating point. On those processors, all floating point must be accomplished through software. Processors that do support floating point in hardware must have quite sophisticated circuitry to manage the basic operations on data in the IEEE 754 standard formats. Regardless of whether the operations are carried out in software or hardware, the basic arithmetic operations require multiple steps.

8.6.1 Floating Point Addition and Subtraction

The steps required for addition and subtraction of floating point numbers are the same, regardless of the specific format. The steps for adding or subtracting two floating point numbers a and b are as follows:

1. Extract the exponents Ea and Eb.

2. Extract the significands Ma and Mb, and convert them into 2’s complement numbers, using the signs Sa and Sb.

3. Shift the significand with the smaller exponent right by |EaEb|.

4. Perform addition (or subtraction) on the significands to get the significand of the result, Mr. Remember that the result may require one more significant bit to avoid overflow.

5. If Mr is negative, then take the 2’s complement and set Sr to 1. Otherwise set Sr to 0.

6. Shift Mr until the leftmost 1 is in the “hidden” bit position, and adjust the larger of the two exponents by the shift amount to form the new exponent Er.

7. Combine the sign Sr, the exponent Er, and significand Mr to form the result.

The complete algorithm must also provide for correct handling of infinity and NaN.

8.6.2 Floating Point Multiplication and Division

Multiplication and division of floating point numbers also require several steps. The steps for multiplication and division of two floating point numbers a and b are as follows:

1. Calculate the sign of the result Sr.

2. Extract the exponents Ea and Eb.

3. Extract the significands Ma and Mb.

4. Multiply (or divide) the significands to form Mr.

5. Add (or subtract) the exponents (in excess-N) to get Er.

6. Shift Mr until the leftmost 1 is in the “hidden” bit position, and add the shift amount to Er.

7. Combine the sign Sr, the exponent Er, and significand Mr to form the result.

The complete algorithm must also provide for correct handling of infinity and NaN.

8.7 Computing Sine and Cosine

It has been said, and is commonly accepted, that “you can’t beat the compiler.” The meaning of this statement is that using hand-coded assembly language is futile and/or worthless because the compiler is “smarter” than a human. This statement is a myth, as will now be demonstrated.

There are many mathematical functions that are useful in programming. Two of the most useful functions are sin(x) and cos(x). However, these functions are not always implemented in hardware, particularly for fixed-point representations. If these functions are required for fixed-point computation, then they must be written in software. These two functions have some nice properties that can be exploited. In particular:

 If we have the sin(x) function, then we can calculate cos(x) using the relationship

cos(x) = sin(π/2 − x).  (8.1)

Therefore, we only need to get the sine function working, and then we can implement cosine with only a little extra effort.

 sin(x) is cyclical, so sin(−2π) = sin(0) = sin(2π). This means that we can limit the domain of our function to the range [−π,π].

 sin(x) is symmetric, so that sin(−x) = −sin(x). This means that we can further restrict the domain to [0,π].

 After we restrict the domain to [0,π], we notice another symmetry, sin(x) = sin(π − x) for π/2 ≤ x ≤ π, and we can further restrict the domain to [0,π/2].

 The range of both functions, sin(x) and cos(x), is [−1,1].

If we exploit all of these properties, then we can write a single shared function to be used by both sine and cosine. We will name this function sinq, and choose the following fixed-point formats:

 sinq will accept x as an S(1,30), and

 sinq will return an S(1,30).

These formats were chosen because S(1,30) is a good format for storing a signed number between zero and π/2, and also the optimal format for storing a signed number between one and negative one.

The sine function will map x into the domain accepted by sinq and then call sinq to do the actual work. If the result should be negative, then the sine function will negate it before returning. The cosine function will use the relationship previously mentioned, and call the sine function.

We have now reduced the problem to one of approximating sin(x) within the range [0,π/2]. An approximation to the function sin(x) can be calculated using the Taylor Series:

sin(x) = Σ(n=0…∞) (−1)^n x^(2n+1) / (2n+1)!.  (8.2)

The first few terms of the series should be sufficient to achieve a good approximation. The maximum value possible for the seventh term is (0.5π)^13 / 13! ≈ 0.00000005₁₀, which indicates that our function should be accurate to at least 25 bits using seven terms. If more accuracy is desired, then additional terms can be added.

8.7.1 Formats for the Powers of x

The numerators in the first nine terms of the Taylor series approximation are: x, x^3, x^5, x^7, x^9, x^11, x^13, x^15, and x^17. Given an S(1,30) format for x, we can predict the format for the numerator of each successive term in the Taylor series. If we simply perform successive multiplies, then we would get the following formats for the powers of x:

Term    Format       32-bit
x       S(1,30)      S(1,30)
x^3     S(3,90)      S(3,28)
x^5     S(5,150)     S(5,26)
x^7     S(7,210)     S(7,24)
x^9     S(9,270)     S(9,22)
x^11    S(11,330)    S(11,20)
x^13    S(13,390)    S(13,18)

The middle column in the table shows that the format for x^17 would require 528 bits if all of the fractional bits are retained. Dealing with a number at that level of precision would be slow and impractical. We will, of necessity, need to limit the number of bits used. Since the ARM processor provides a multiply instruction involving two 32-bit numbers, we choose to truncate the numerators to 32 bits. The third column in the table indicates the resulting format for each term if precision is limited to 32 bits.

On further consideration of the Taylor series, we notice that each of the above terms will be divided by a constant. Instead of dividing, we can multiply by the reciprocal of the constant. We will create a similar table holding the formats and constants for the factorial terms. With a bit of luck, the division (implemented as multiplication) in each term will result in a reasonable format for each resulting term.

8.7.2 Formats and Constants for the Factorial Terms

The first term of the Taylor series is x/1!, so we can simply skip the division. The second term is x^3/3! = x^3 × (1/3!) and the third term is x^5/5! = x^5 × (1/5!). We can convert 1/3! to binary as follows:

Multiplication        Integer   Fraction
1/6 × 2 = 2/6         0         2/6
2/6 × 2 = 4/6         0         4/6
4/6 × 2 = 8/6         1         2/6
2/6 × 2 = 4/6         0         4/6
4/6 × 2 = 8/6         1         2/6


Since the pattern repeats, we can conclude that 1/3! = 0.0010101…₂, with the bit pair 01 repeating forever. Since we need a negative number, we take the two's complement, resulting in −1/3! = 111.1101010…₂, with the pair 10 repeating. Represented as an S(1,30), this would be

11.110101010101010101010101010101₂

Since the first four bits are one, we can remove three bits and store it as:

10101010101010101010101010101010₂

In hexadecimal, this is AAAAAAAA₁₆.
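
The repeated-doubling procedure in the table, and the two's complement step, can both be checked with a short script (exact rational arithmetic via Python's fractions module; the magnitude used below is the repeating pattern truncated to 32 bits, which is one more than the floor of 2³³/6):

```python
from fractions import Fraction

f, bits = Fraction(1, 6), []
for _ in range(8):            # first eight fraction bits of 1/3! = 1/6
    f *= 2
    bit = int(f)              # integer part of the doubled fraction
    bits.append(bit)
    f -= bit
print(bits)                   # [0, 0, 1, 0, 1, 0, 1, 0]

# Two's complement of the truncated repeating pattern, as a 32-bit word:
neg = (-(2 ** 33 // 6 + 1)) & 0xFFFFFFFF
print(hex(neg))               # 0xaaaaaaaa
```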

Performing the same operations, we find that 1/5! can be converted to binary as follows:

Multiplication             Integer   Fraction
1/120 × 2 = 2/120          0         2/120
2/120 × 2 = 4/120          0         4/120
4/120 × 2 = 8/120          0         8/120
8/120 × 2 = 16/120         0         16/120
16/120 × 2 = 32/120        0         32/120
32/120 × 2 = 64/120        0         64/120
64/120 × 2 = 128/120       1         8/120


Since the fraction in the seventh row is the same as the fraction in the third row, we know that the table will repeat forever. Therefore, 1/5! = 0.0000001000100010001…₂, with the group 0001 repeating. Since the first six bits to the right of the radix point are all zero, we can remove the first five bits. Also adding one to the least significant bit to account for rounding error yields the following S(−6,32):

01000100010001000100010001000101₂

In hexadecimal, the number to be multiplied is 44444445₁₆. Note that since 1/5! is a positive number, the reciprocal was incremented by one to avoid round-off error. We can apply the same procedure to the remaining terms, resulting in the following table:

Term     Reciprocal Format   Reciprocal Value (Hex)
1/3!     S(−2,32)            AAAAAAAA
1/5!     S(−6,32)            44444445
1/7!     S(−12,32)           97F97F97
1/9!     S(−18,32)           5C778E96
1/11!    S(−25,32)           9466EA60
1/13!    S(−32,32)           5849184F
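
As a sanity check, each entry in this table can be compared against 1/k! scaled into its format. In the sketch below (Python), the scale for S(−n,32) is taken to be 2^(31+n), and the negated entries are the ones belonging to subtracted terms of the series; both are inferences from the surrounding text rather than statements from it:

```python
import math

# (k, n of S(-n,32), stored hex constant, True if stored negated)
table = [(3, 2, 0xAAAAAAAA, True),   (5, 6, 0x44444445, False),
         (7, 12, 0x97F97F97, True),  (9, 18, 0x5C778E96, False),
         (11, 25, 0x9466EA60, True), (13, 32, 0x5849184F, False)]

for k, n, word, negated in table:
    ideal = 2 ** (31 + n) / math.factorial(k)       # 1/k! at the assumed scale
    stored = (2 ** 32 - word) if negated else word  # two's-complement magnitude
    assert abs(stored - ideal) < 2                  # within rounding of ideal
    print(f"1/{k}! -> {word:08X} ok")
```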

8.7.3 Putting it All Together

We want to keep as much precision as is reasonably possible for our intermediate calculations. Using 64 bits of precision for all intermediate calculations will give a good trade-off between performance and precision. The integer portion should never require more than two bits, so we choose an S(2,61) as our intermediate representation. If we combine the previous two tables, we can determine what the format of each complete term will be. This is shown in Table 8.2.

Table 8.2

Result formats for each term

Term   Numerator   Format     Reciprocal   Format      Hex        Result Format
1      x           S(1,30)    (extend to 64 bits and shift right)  S(2,61)
2      x³          S(3,28)    1/3!         S(−2,32)    AAAAAAAA    S(2,61)
3      x⁵          S(5,26)    1/5!         S(−6,32)    44444445    S(0,63)
4      x⁷          S(7,24)    1/7!         S(−12,32)   97F97F97    S(−4,64)
5      x⁹          S(9,22)    1/9!         S(−18,32)   5C778E96    S(−8,64)
6      x¹¹         S(11,20)   1/11!        S(−25,32)   9466EA60    S(−13,64)
7      x¹³         S(13,18)   1/13!        S(−32,32)   5849184F    S(−18,64)


Note that the formats were truncated to fit in a 64-bit result. We can now see that the formats for the seven terms of the Taylor series are reasonably similar. They all fit in exactly 64 bits, and the radix points can be shifted so that they are aligned for addition. In order to make the shifting and adding process easier, we will pre-compute the shift amounts and store them in a look-up table.

Table 8.3 shows the shifts that are necessary to convert each term to an S(2,61) so that it can be added to the running total.

Table 8.3

Shifts required for each term

Term Number   Original Format   Shift Amount   Resulting Format
1             S(1,30)           1              S(2,61)
2             S(2,61)           0              S(2,61)
3             S(0,63)           2              S(2,61)
4             S(−4,64)          6              S(2,61)
5             S(−8,64)          10             S(2,61)
6             S(−13,64)         15             S(2,61)
7             S(−18,64)         20             S(2,61)
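
The shift amounts follow directly from the formats: converting an S(a,b) term to S(2,61) requires a right shift of 2 − a (with the first term widened to 64 bits before its shift). A one-line check of that observation:

```python
# (a, b) of each term's result format, term 1 after widening to 64 bits
formats = [(1, 30), (2, 61), (0, 63), (-4, 64), (-8, 64), (-13, 64), (-18, 64)]
shifts = [2 - a for a, b in formats]
print(shifts)    # [1, 0, 2, 6, 10, 15, 20] -- matches Table 8.3
```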


Note that the seventh term contributes very little to the final 32-bit sum, which is stored in the upper 32 bits of the running total. We now have all of the information that we need in order to implement the function. Listing 8.7 shows how the sine and cosine functions can be implemented in ARM assembly using fixed-point computation, and Listing 8.8 shows a main program which prints a table of values with their sines and cosines.

Listing 8.7 ARM assembly implementation of sin(x) and cos(x) using fixed-point calculations.

Listing 8.8 Example showing how the sin(x) and cos(x) functions can be used to print a table.
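
Listing 8.7 itself is in ARM assembly; as a cross-check of the arithmetic plan above, here is a Python model of the same fixed-point evaluation. The reciprocal constants are the ones tabulated in Section 8.7.2 (as signed 32-bit values, with the subtracted terms stored negated), so this is a sketch of the algorithm rather than a transcription of the listing:

```python
import math

# Reciprocal constants as signed 32-bit values
RECIP = [-1431655766,   # -1/3!  (0xAAAAAAAA)
          1145324613,   #  1/5!  (0x44444445)
         -1745256553,   # -1/7!  (0x97F97F97)
          1551339158,   #  1/9!  (0x5C778E96)
         -1805194656,   # -1/11! (0x9466EA60)
          1481185359]   #  1/13! (0x5849184F)
SHIFT = [0, 2, 6, 10, 15, 20]     # right shifts from Table 8.3, terms 2-7

def fixed_sine(x):
    """x: angle in S(1,30), |x| <= pi/2. Returns sin(x) in S(2,61)."""
    acc = (x << 32) >> 1          # term 1: extend to 64 bits, shift right 1
    x2 = (x * x) >> 31            # x**2 in S(2,29)
    p = (x2 * x) >> 31            # x**3 in S(3,28)
    for r, s in zip(RECIP, SHIFT):
        acc += (p * r) >> s       # term aligned to S(2,61), accumulated
        p = (p * x2) >> 31        # next odd power, two fewer fraction bits
    return acc

x = round(1.0 * (1 << 30))        # 1.0 radian in S(1,30)
print(fixed_sine(x) / 2 ** 61, math.sin(1.0))
```

Python's arbitrary-precision integers stand in for the 32×32→64-bit multiplies that the ARM code performs in hardware; the arithmetic-right-shift semantics match.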

8.7.4 Performance Comparison

In some situations it can be very advantageous to use fixed-point math. For example, when using an ARMv6 or older processor, there may not be a hardware floating point unit available. Table 8.4 shows the CPU time required for running a program to compute the sine function on 10,000,000 random values, using various implementations of the sine function. In each case, the program main() function was written in C. The only difference in the six implementations was the data type (which could be fixed-point, IEEE single precision, or IEEE double precision), and the sine function that was used. The times shown in the table include only the amount of CPU time actually used in the sine function, and do not include the time required for program startup, storage allocation, random number generation, printing results, or program exit. The six implementations are as follows:

Table 8.4

Performance of sine function with various implementations

Optimization   Implementation                      CPU seconds
None           32-bit Fixed Point Assembly          3.85
               32-bit Fixed Point C                18.99
               Single Precision Software Float C   56.69
               Double Precision Software Float C   55.95
               Single Precision VFP C              11.60
               Double Precision VFP C              11.48
Full           32-bit Fixed Point Assembly          3.22
               32-bit Fixed Point C                 5.02
               Single Precision Software Float C   20.53
               Double Precision Software Float C   54.51
               Single Precision VFP C               3.70
               Double Precision VFP C              11.08

32-bit Fixed Point Assembly The sine function is computed using the code shown in Listing 8.7.

32-bit Fixed Point C The sine function is computed using exactly the same algorithm as in Listing 8.7, but it is implemented in C rather than Assembly.

Single Precision Software Float C Sine is computed using the floating point sine function which is provided by the GCC C compiler. The code is compiled for an ARMv6 or earlier processor without hardware floating point support. The C code is written to use IEEE single precision floating point numbers.

Double Precision Software Float C Exactly the same as the previous method, but using IEEE double precision instead of single precision.

Single Precision VFP C Sine is computed using the floating point sine function which is provided by the GCC C compiler. The code is compiled for an ARMv6 or later processor using hardware floating point support. The C code is written to use IEEE single precision floating point numbers.

Double Precision VFP C Same as the previous method, but using IEEE double precision instead of single precision.

Each of the six implementations was compiled both with and without compiler optimizations, resulting in a total of 12 test cases. All cases were run on a standard Raspberry Pi model B with the default CPU clock rate.

From Table 8.4, it is clear that the fixed-point implementation written in assembly beats the code generated by the compiler in every case. The closest that the compiler can get is when it can use the VFP hardware floating point unit and the compiler is run with full optimization. Even in that case the fixed-point assembly implementation is almost 15% faster than the single precision floating point implementation, and has 33% more precision (32 bits versus 24 bits). In the worst case, when a VFP hardware unit is not available, the assembly code beats the compiler by a whopping 638% in speed and 33% in precision for single precision floats, and is 1692% faster than double precision floating point at a cost of 41% in precision. Note that even with floating point hardware support, fixed point in assembly is still 3.44 times as fast as the C compiler code.
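
The percentages quoted above come straight from Table 8.4; recomputing them:

```python
# CPU seconds from Table 8.4, fully optimized builds
fixed_asm = 3.22
ratios = {
    "single VFP":        3.70 / fixed_asm,   # ~1.15x (the "almost 15% faster")
    "single soft float": 20.53 / fixed_asm,  # ~6.4x  (the "638%")
    "double soft float": 54.51 / fixed_asm,  # ~16.9x (the "1692%")
    "double VFP":        11.08 / fixed_asm,  # ~3.44x
}
for name, r in ratios.items():
    print(f"{name}: {r:.2f}x")
```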

Similar results could be obtained on any processor architecture, and any reasonably complex mathematical problem. When developing software for small systems, the developer must weigh the costs and benefits of alternative implementations. For battery-powered systems, it is important to realize that choices of hardware and software can affect power consumption even more strongly than computing performance. First, the power used by a system which includes a hardware floating point processor will be consistently higher than that of a system without one. Second, the reduction in processing time required for the job is closely related to the reduction in power required. Therefore, for battery-operated systems, a fixed-point implementation can greatly extend battery life. The following statements summarize the results from the experiment in this section:

1. A competent assembly programmer can beat the compiler, in some cases by a very large margin.

2. If computational performance is critical, then a well-designed fixed-point implementation will usually outperform even a hardware-accelerated floating point implementation.

3. If there is no hardware support for floating point, then floating point performance is extremely poor, and fixed point will always provide the best performance.

4. If battery life is a consideration, then a fixed-point implementation can have an enormous advantage.

Note also from the table that the assembly language version of the fixed-point sine function beats the identical C version by a wide margin. Section 9.8.2 will demonstrate that a good assembly language programmer who is familiar with the floating point hardware can beat the compiler by an even wider performance margin.

8.8 Ethics Case Study: Patriot Missile Failure

Fixed-point arithmetic is very efficient on modern computers. However it is incumbent upon the programmer to track the radix point at all stages of the computation, and to ensure that a sufficient number of bits are provided on both sides of the radix point. The programmer must ensure that all computations are carried out with the desired level of precision, resolution, accuracy, range, and dynamic range. Failure to do so can have serious consequences.

On February 25, 1991, during the Gulf War, an American Patriot Missile battery in Dhahran, Saudi Arabia, failed to intercept an incoming Iraqi SCUD missile. The SCUD struck an American army barracks, killing 28 soldiers and injuring around 98 other people. The cause was an inaccurate calculation of the time elapsed since the system was last booted.

The hardware clock on the system counted the time in tenths of a second since the last reboot. The current time, in seconds, was calculated by multiplying that count by 1/10. For this calculation, 1/10 was represented as a U(1,23) fixed-point number. Since 1/10 cannot be represented precisely in a fixed number of bits, there was round-off error in the calculations. The small imprecision, when multiplied by a large number, resulted in significant error. The longer the system ran after boot, the larger the error became.

The system determined whether or not it should fire by predicting where the incoming missile would be at a specific time in the future. The time and predicted location were then fed to a second system which was responsible for locking onto the target and firing the Patriot missile. The system would only fire when the missile was at the proper location at the specified time. If the radar did not detect the incoming missile at the correct time and location, then the system would not fire.

At the time of the failure, the Patriot battery had been up for around 100 h. We can estimate the error in the timing calculations by considering how the binary number was stored. The binary representation of 1/10 is 0.000110011001100…₂, with the group 0011 repeating. Note that it is a non-terminating, repeating binimal. The 24-bit register in the Patriot could only hold the following set of bits:

0.00011001100110011001100₂

This resulted in an error of 0.00000000000000000000000110011001100…₂ (twenty-three zeros after the radix point, followed by the repeating group 1100). The error can be computed in base 10 as:

e = 2^(−24) + 2^(−25) + 2^(−28) + 2^(−29) + 2^(−32) + 2^(−33) + ⋯    (8.3)

  = Σ_{i=0}^{∞} [2^(−(4i+24)) + 2^(−(4i+25))]    (8.4)

  ≈ 9.5 × 10^(−8).    (8.5)

To find out how much error was in the total time calculation, we multiply e by the number of tenths of a second in 100 h. This gives 9.5 × 10^(−8) × 100 × 60 × 60 × 10 ≈ 0.34 s. A SCUD missile travels at about 1,676 m/s, so it travels about 570 m in 0.34 s. Because of this, the targeting and firing system was expecting to find the SCUD at a location that was over half a kilometer from where it really was. This was far enough that the incoming SCUD was outside the “range gate” that the Patriot tracked. The system did not detect the SCUD at its predicted location, so it could not lock on and fire the Patriot.
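
The whole chain of numbers in this section can be reproduced in a few lines, using exact rational arithmetic for the stored constant and the speeds as given in the text:

```python
from fractions import Fraction

stored = Fraction(2 ** 23 // 10, 2 ** 23)   # 1/10 truncated to 23 fraction bits
err = Fraction(1, 10) - stored              # round-off error per tick
ticks = 100 * 60 * 60 * 10                  # tenths of a second in 100 hours
drift = float(err * ticks)                  # accumulated clock error, ~0.34 s
print(float(err), drift, drift * 1676)      # ~570 m at SCUD speed
```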

This is an example of how a seemingly insignificant error can lead to a major failure. In this case, it led to loss of life and serious injury. Ironically, one factor that contributed to the problem was that part of the code had been modified to provide more accurate timing calculations, while another part had not. This meant that the inaccuracies did not cancel each other. Had both sections of code been re-written, or neither section changed, then the issue probably would not have surfaced.

The Patriot system was originally designed in 1974 to be mobile and to defend against aircraft, which move much more slowly than ballistic missiles. It was expected that the system would be moved often, and therefore the computer would be rebooted frequently. Also, slow-moving aircraft would be much easier to track, and the error in predicting where one was expected to be would not be significant. The system was modified in 1986 to be capable of shooting down Soviet ballistic missiles. A SCUD missile travels at about twice the speed of the Soviet missiles that the system was re-designed for.

The system was deployed to Saudi Arabia in 1990, and successfully shot down a SCUD missile in January of 1991. In mid-February of 1991, Israeli troops discovered that the system became inaccurate if it was allowed to run for long periods of time. They claimed that the system would become unreliable after 20 hours of operation. The U.S. military did not think the discovery was significant, but on February 16th a software update was released. Unfortunately, the update could not immediately reach all units because of wartime difficulties in transportation. The Army released a memo on February 21st, stating that the system was not to be run for “very long times,” but did not specify how long a “very long time” would be. The software update reached Dhahran one day after the Patriot Missile system failed to intercept a SCUD missile, resulting in the death of 28 Americans and many more injuries.

Part of the reason this error was not found sooner was that the program was written in assembly language, and had been patched several times in its 15-year life. The code was difficult to understand and maintain, and did not conform to good programming practices. The people who worked to modify the code to handle the SCUD missiles were not as familiar with the code as they would have been if it were written more recently, and time was a critical factor. Prolonged testing could have caused a disaster by keeping the system out of the hands of soldiers in a time of war. The people at Raytheon Labs had some tough decisions to make. It cannot be said that Raytheon was guilty of negligence or malpractice. The problem with the system was not necessarily the developers, but that the system was modified often and in inconsistent ways, without complete understanding.

8.9 Chapter Summary

Sometimes it is desirable to perform calculations involving non-integral numbers. The two common ways to represent non-integral numbers in a computer are fixed point and floating point. A fixed point representation allows the programmer to perform calculations with non-integral numbers using only integer operations. With fixed point, the programmer must track the radix point throughout the computation. Floating point representations allow the radix point to be tracked automatically, but require much more complex software and/or hardware. Fixed point will usually provide better performance than floating point, but requires more programming skill.

Fractional numbers in radix notation may not terminate in all bases. Numbers which terminate in base two will also terminate in base ten, but the converse is not true. Programmers should avoid counting using fractions which do not terminate in base two, because it leads to the accumulation of round-off errors.

Exercises

8.1 Perform the following base conversions:

(a) Convert 10110.001₂ to base ten.

(b) Convert 11000.0101₂ to base ten.

(c) Convert 10.125₁₀ to binary.

8.2 Complete the following table (assume all values represent positive fixed-point numbers):

Base 10    Base 2        Base 16    Base 13
49.125
           101011.011
                         AF.3
                                    12


8.3 You are working on a problem involving real numbers between −2 and 2 on a computer that has 16-bit integer registers and no hardware floating point support. You decide to use 16-bit fixed-point arithmetic.

(a) What fixed-point format should you use?

(b) Draw a diagram showing the sign, if any, radix point, integer part, and fractional part.

(c) What is the precision, resolution, accuracy, and range of your format?

8.4 What is the resulting type of each of the following fixed-point operations?

(a) S(24,7)×S(27,15)

(b) S(3,4)÷U(4,20)

8.5 Convert 26.640625₁₀ to a binary U(18,14) representation. Show the ARM assembly code necessary to load that value into register r4.

8.6 For each of the following fractions, indicate whether or not it will terminate in bases 2, 5, 7, and 10.

(a) 13/64

(b) 37/60

(c) 25/74

(d) 39/1250

(e) 17/343

8.7 What is the exact value of the binary number 0011011100011010 when interpreted as an IEEE half-precision number? Give your answer in base ten.

8.8 The “Software Engineering Code of Ethics And Professional Practice” states that a responsible software engineer should “Approve software only if they have well-founded belief that it is safe, meets specifications, passes appropriate tests…” (sub-principle 1.03) and “Ensure adequate testing, debugging, and review of software…on which they work” (sub-principle 3.10).
The software engineering code of ethics also states that a responsible software engineer should “Treat all forms of software maintenance with the same professionalism as new development.”

(a) Explain how the Software Engineering Code of Ethics And Professional Practice was violated by the Patriot Missile system developers.

(b) How should the engineers and managers at Raytheon have responded when they were asked to modify the Patriot Missile System to work outside of its original design parameters?

(c) What other ethical and non-ethical considerations may have contributed to the disaster?

Chapter 9

The ARM Vector Floating Point Coprocessor

Abstract

This chapter begins by giving an overview of the ARM Vector Floating Point (VFP) coprocessor and the ARM VFP register set. Next, it gives an overview of the Floating Point Status and Control Register (FPSCR). It then explains RunFast mode, which gives higher performance but is not fully compliant with the IEEE floating point standards. That is followed by an explanation of vector mode, which can give an additional performance boost in some situations. Then, after a short discussion of the register usage rules, it describes each of the VFP instructions, providing a short description of each one. Next, it presents four implementations of a function to calculate sine using the ARM VFP coprocessor, and shows that they are all significantly faster than the implementation provided by GCC.

Keywords

Floating point; Vector; IEEE Compliance; Performance

Some ARM processors have dedicated hardware to support floating point operations. For ARMv7 and previous architectures, floating point is provided by an optional Vector Floating Point (VFP) coprocessor. Many newer processors also support the NEON extensions, which are covered in Chapter 10. The remainder of this chapter will explain the VFP coprocessor.

9.1 Vector Floating Point Overview

There are four major revisions of the VFP coprocessor:

VFPv1: Obsolete

VFPv2: An optional extension to the ARMv5 and ARMv6 processors. VFPv2 has 16 64-bit FPU registers.

VFPv3: An optional extension to the ARMv7 processors. It is backwards compatible with VFPv2, except that it cannot trap floating-point exceptions. VFPv3-D32 has 32 64-bit FPU registers. Some processors have VFPv3-D16, which supports only 16 64-bit FPU registers. VFPv3 adds several new instructions to the VFP instruction set.

VFPv4: Implemented on some Cortex ARMv7 processors. VFPv4 has 32 64-bit FPU registers. It adds both half-precision extensions and multiply-accumulate instructions to the features of VFPv3. Some processors have VFPv4-D16, which supports only 16 64-bit FPU registers.

Fig. 9.1 shows the 16 ARM integer registers, and the additional registers provided by the VFP coprocessor. Banks four through seven are only present on the VFPv3-D32 and VFPv4-D32 versions of the coprocessor. Note that each register in Banks zero through three can be used to store either one 64-bit number or two 32-bit numbers. For example, double precision register d0 may also be referred to as single precision registers s0 and s1. Each 32-bit VFP register can hold an integer or a single precision floating point number. Registers in Banks four through seven cannot be used as single precision registers.

f09-01-9780128036983
Figure 9.1 ARM integer and vector floating point user program registers.

The VFP adds about 23 new instructions to the ARM instruction set. The exact number of VFP instructions depends on the specific version of the VFP coprocessor. Instructions are provided to:

 transfer floating point values between VFP registers,

 transfer floating-point values between the VFP coprocessor registers and main memory,

 transfer 32-bit values between the VFP coprocessor registers and the ARM integer registers,

 perform addition, subtraction, multiplication, and division, involving two source registers and a destination register,

 compute the square root of a value,

 perform combined multiply-accumulate operations,

 perform conversions between various integer, fixed point, and floating point representations, and

 compare floating-point values.

In addition to performing basic operations involving two source registers and one destination register, VFP instructions can also perform operations involving registers arranged as short vectors (arrays) of up to eight single-precision values or four double-precision values. A single instruction can be used to perform operations on all of the elements of such vectors. This feature can substantially accelerate computation on arrays and matrices of floating point data. This type of data is common in graphics and signal processing applications. Vector mode can reduce code size and increase speed of execution by supporting parallel operations and multiple transfers.

9.2 Floating Point Status and Control Register

The Floating Point Status and Control Register (FPSCR) is similar to the CPSR register. The FPSCR stores status bits from floating point operations in much the same way as the CPSR stores status bits from integer operations. The programmer can also write to certain bits in the FPSCR to control the behavior of the VFP coprocessor. The layout of the FPSCR is shown in Fig. 9.2. The meaning of each field is as follows:

f09-02-9780128036983
Figure 9.2 Bits in the FPSCR.

N The Negative flag is set to one by vcmp if Fd < Fm.

Z The Zero flag is set to one by vcmp if Fd = Fm.

C The Carry flag is set to one by vcmp if Fd = Fm, or Fd > Fm, or Fd and Fm are unordered.

V The oVerflow flag is set to one by vcmp if Fd and Fm are unordered.

QC NEON only. The saturation cumulative flag is set to one by saturating instructions if saturation has occurred.

DN Default NaN enable:

0: Disable Default NaN mode. NaN operands propagate through to the output of a floating-point operation.

1: Enable Default NaN mode. Any operation involving one or more NaNs returns the default NaN.

The default single precision NaN is 7FC00000₁₆ and the default double-precision NaN is 7FF8000000000000₁₆. Default NaN mode does not comply with the IEEE 754 standard, but may increase performance. NEON instructions ignore this bit and always use Default NaN mode.
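
The two default NaN bit patterns can be decoded with a quick script to confirm that they are quiet NaNs (a NaN is the only floating point value that compares unequal to itself):

```python
import struct

# Reinterpret the two default NaN bit patterns as IEEE floating point values
(f32,) = struct.unpack('<f', struct.pack('<I', 0x7FC00000))
(f64,) = struct.unpack('<d', struct.pack('<Q', 0x7FF8000000000000))
print(f32 != f32, f64 != f64)   # both True, so both patterns decode as NaN
```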

FZ Flush-to-Zero enable:

0: Disable Flush-to-Zero mode.

1: Enable Flush-to-Zero mode.

Flush-to-Zero mode replaces subnormal numbers with 0. This does not comply with the IEEE 754 standard, but may increase performance. NEON instructions ignore this bit and always use Flush-to-Zero mode.

RMODE Rounding mode:

00 Round to Nearest (RN).

01 Round towards Plus infinity (RP).

10 Round towards Minus infinity (RM).

11 Round towards Zero (RZ).

NEON instructions ignore these bits and always use Round to Nearest mode.

STRIDE Sets the stride (distance between items) for vector operations:

00 Stride is 1.

01 Reserved.

10 Reserved.

11 Stride is 2.

LEN Sets the vector length for vector operations:

000 Vector length is 1 (scalar mode).

001 Vector length is 2.

010 Vector length is 3.

011 Vector length is 4.

100 Vector length is 5.

101 Vector length is 6.

110 Vector length is 7.

111 Vector length is 8.

IDE Input Denormal (subnormal) exception Enable:

0: Exception disabled.

1: An exception is generated when one or more operands are subnormal.

IXE IneXact exception Enable:

0: Exception disabled.

1: An exception is generated when the result contains more significand bits than the destination format can contain, and must be rounded.

UFE UnderFlow exception Enable:

0: Exception disabled.

1: An exception is generated when the result is closer to zero than can be represented by the destination format.

OFE OverFlow exception Enable:

0: Exception disabled.

1: An exception is generated when the result is farther from zero than can be represented by the destination format.

DZE Division by Zero exception Enable:

0: Exception disabled.

1: An exception is generated by divide instructions when the divisor is zero or subnormal.

IOE Invalid Operation exception Enable:

0: Exception disabled.

1: An exception is generated when the result is not defined, or cannot be represented. For example, adding positive and negative infinity gives an invalid result.

IDC The Input Subnormal Cumulative flag is set to one when an IDE condition has occurred.

IXC The IneXact Cumulative flag is set to one when an IXE condition has occurred.

UFC The UnderFlow Cumulative flag is set to one when a UFE condition has occurred.

OFC The OverFlow Cumulative flag is set to one when an OFE condition has occurred.

DZC The Division by Zero Cumulative flag is set to one when a DZE condition has occurred.

IOC The Invalid Operation Cumulative flag is set to one when an IOE condition has occurred.

The only VFP instruction that can be used to update the status flags in the FPSCR is fcmp, which is similar to the integer cmp instruction. To use the FPSCR flags to control conditional instructions, including conditional VFP instructions, they must first be moved into the CPSR register. Table 9.1 shows the meanings of the FPSCR flags when they are transferred to the CPSR and used for conditional execution on following instructions. The following rules govern how the bits in the FPSCR may be changed by subroutines:

Table 9.1

Condition code meanings for ARM and VFP

<cond>   ARM Data Processing Instruction      VFP fcmp Instruction
AL       Always                               Always
EQ       Equal                                Equal
NE       Not equal                            Not equal, or unordered
GE       Signed greater than or equal         Greater than or equal
LT       Signed less than                     Less than, or unordered
GT       Signed greater than                  Greater than
LE       Signed less than or equal            Less than or equal, or unordered
HI       Unsigned higher                      Greater than, or unordered
LS       Unsigned lower or same               Less than or equal
HS       Carry set/unsigned higher or same    Greater than or equal, or unordered
CS       Same as HS                           Same as HS
LO       Carry clear/unsigned lower           Less than
CC       Same as LO                           Same as LO
MI       Negative                             Less than
PL       Positive or zero                     Greater than or equal, or unordered
VS       Overflow                             Unordered (at least one NaN operand)
VC       No overflow                          Not unordered

1. Bits 27-31, 0-4, and 7 do not need to be preserved.

2. Subroutines may modify bits 8-12, 15, and 22-25 but the practice is discouraged. These bits should only be changed by specific support subroutines which change the global state of the program. If they are modified within a subroutine, then their original value must be restored before the function returns or calls another function.

3. Bits 16–18 and bits 20–21 may be changed by a subroutine, but must be set to zero before the function returns or calls another function.

4. All other bits are reserved for future use and must not be modified.

9.2.1 Performance Versus Compliance

Floating point operations are complex, and there are many special cases, such as dealing with NaNs, infinities, and subnormals. These special cases are a normal part of performing floating point math, but they are relatively infrequent. In order to simplify the hardware, many special situations which occur infrequently are handled by software. When one of these exceptional situations occurs, the VFP hardware sets the appropriate flags in the FPSCR and generates an interrupt. The ARM CPU then executes an interrupt handler to deal with the exceptional situation. When the routine finishes, it returns to the point where the exception occurred and execution resumes just as if the situation had been dealt with by the hardware. This approach is taken by many processor architectures to reduce the complexity, cost, and/or power consumption of the floating point hardware. It also allows the programmer to make a trade-off between performance and strict IEEE 754 compliance.

Full-compliance mode

The support code for dealing with VFP exceptions is included in most ARM-based operating systems. Even bare-metal embedded systems can include the VFP support service routines. With the support code enabled, the VFP coprocessor is fully compliant with the IEEE 754 standard. However, using the fully compliant mode does increase the average run-time for floating point code, and increases the size of the operating system kernel or embedded system code.

RunFast mode

When all of the VFP exceptions are disabled, Default NaN mode is enabled, and Flush-to-Zero is enabled, the VFP is not fully compliant with the IEEE 754 standard. However, floating point code runs significantly faster. For that reason, the state when bits 8–12 and bit 15 are set to zero while bits 24 and 25 are set to one is referred to as RunFast mode. There is some loss of accuracy for very small values, but the hardware no longer has to check for many of the conditions that may stall the floating point pipeline. This results in fewer stalls and much higher throughput in the hardware, as well as eliminating the necessity to handle exceptions in software. Many other floating point architectures have similar modes, so the GCC developers have found it worthwhile to provide programmers with the option of using them. User applications can be compiled to use this mode with GCC by using the -ffast-math and/or -Ofast options during compilation and linking. The startup code in the C standard library will then set the VFP to RunFast mode before calling the main function.
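
The RunFast condition described above (exception-enable bits 8–12 and 15 clear, FZ bit 24 and DN bit 25 set) can be written as a small predicate on the FPSCR value. The bit positions are the ones named in this section; this is a sketch for illustration, not a substitute for the ARM reference manual:

```python
def is_runfast(fpscr):
    exc_enables = (fpscr >> 8) & 0x1F   # bits 8-12: IOE, DZE, OFE, UFE, IXE
    ide = (fpscr >> 15) & 1             # bit 15: IDE
    fz = (fpscr >> 24) & 1              # bit 24: Flush-to-Zero
    dn = (fpscr >> 25) & 1              # bit 25: Default NaN
    return exc_enables == 0 and ide == 0 and fz == 1 and dn == 1

print(is_runfast(0x03000000))   # True: only DN and FZ set
print(is_runfast(0x00000000))   # False: fully compliant mode
```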

9.2.2 Vector Mode

A VFP vector consists of up to eight single-precision registers, or up to four double-precision registers. All of the registers in a vector must be in the same bank. Also, vectors cannot be stored in Bank 0 or Bank 4. For example, registers s8 through s10 could be treated as a vector of three single-precision values. Registers s14 through s17 cannot be treated as a vector because some of those registers are in Bank 1 and others are in Bank 2. Registers d0 through d3 cannot be treated as a vector because they are in Bank 0.

The LEN field in the FPSCR controls the length of vectors that are used for vector operations. In vector operations, the first register in the vector is given as the operand, and the remaining registers are inferred from the settings of LEN and STRIDE. The STRIDE field allows data to be interleaved. For example, if the stride is set to two, and length is set to four, then the vector starting at s8 would consist of registers s8, s10, s12, and s14, while the vector starting at s9 would consist of registers s9, s11, s13, and s15. If a vector runs off the end of a bank, then the address wraps around to the first register in the bank. For example, if length is set to six and stride is set to one, then the vector starting at s13 would consist of s13, s14, s15, s8, s9, and s10, in that order.
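For example, the LEN field (bits 16–18 of the FPSCR) holds the vector length minus one, and the STRIDE field occupies bits 20–21. A sketch that sets the vector length to four with a stride of one might look like this:

```asm
        @ Set LEN = 3 (vectors of length 4) and STRIDE = 0 (stride of 1)
        vmrs    r0, fpscr               @ read the current FPSCR
        bic     r0, r0, #0x00370000     @ clear LEN (bits 16-18) and STRIDE (bits 20-21)
        orr     r0, r0, #0x00030000     @ LEN field = 3
        vmsr    fpscr, r0               @ write the new value back
```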

The vector-capable data-processing instructions have one of the following two forms:

<Op>{<cond>}{.<prec>} Fd, Fm
<Op>{<cond>}{.<prec>} Fd, Fn, Fm

where Op is the VFP instruction, Fd is the destination register (or the first register in a vector), Fn is an operand register (or the first register in a vector), and Fm is an operand register (or the first register in a vector). Most data-processing instructions can operate in scalar mode, mixed mode, or vector mode. The mode depends on the LEN bits in the FPSCR, as well as on which register banks contain the destination and operand(s).

• The operation is scalar if the LEN field is set to zero (scalar mode) or the destination operand, Fd, is in Bank 0 or Bank 4. The operation acts on Fm (and Fn if the operation uses two operands) and places the result in Fd.

• The operation is mixed if the LEN field is not set to zero and Fm is in Bank 0 or Bank 4 but Fd is not. If the operation has only one operand, then the operation is applied to Fm and copies of the result are stored into each register in the destination vector. If the operation has two operands, then it is applied with the scalar Fm and each element in the vector starting at Fn, and the result is stored in the vector beginning at Fd.

• The operation is vector if the LEN field is not set to zero and neither Fd nor Fm is in Bank 0 or Bank 4. If the operation has only one operand, then the operation is applied to the vector starting at Fm and the results are placed in the vector starting at Fd. If the operation has two operands, then it is applied with corresponding elements from the vectors starting at Fm and Fn, and the result is stored in the vector beginning at Fd.
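The three cases can be illustrated with a short sketch, assuming the LEN field has been set to four and STRIDE to one:

```asm
        @ Mixed: s0 is in Bank 0, so it is treated as a scalar
        vmul.f32  s16, s8, s0     @ {s16-s19} = {s8-s11} * s0
        @ Vector: neither source nor destination is in a scalar bank
        vadd.f32  s24, s16, s8    @ {s24-s27} = {s16-s19} + {s8-s11}
```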

9.3 Register Usage Rules

As with the integer registers, there are rules for using the VFP registers. These rules are a convention, and following the convention ensures interoperability between code written by different programmers and compilers. Registers s16 through s31 are non-volatile. This implies that d8 through d15 are also non-volatile, since they are really the same registers. The contents of these registers must be preserved across subroutine calls. The remaining registers (s0 through s15, also known as d0 through d7) are volatile. They are used for passing arguments, returning results, and for holding local variables. They do not need to be preserved by subroutines. If registers d16 through d31 are present, then they are also considered volatile.
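For example, a leaf subroutine that touches only the volatile registers needs no save and restore code at all. The following hypothetical function (it uses instructions covered later in this chapter) receives two single precision arguments in s0 and s1 and returns its result in s0, following the convention above:

```asm
        .global hyp
        .text
hyp:    @ float hyp(float x, float y) { return sqrtf(x*x + y*y); }
        vmul.f32  s0, s0, s0      @ x * x
        vmul.f32  s1, s1, s1      @ y * y
        vadd.f32  s0, s0, s1      @ x*x + y*y
        vsqrt.f32 s0, s0          @ square root of the sum
        bx        lr              @ only volatile registers were used,
                                  @ so nothing had to be preserved
```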

In addition to the FPSCR, all VFP implementations contain at least two additional system registers. The Floating-point System ID register (FPSID) is a read-only register whose value indicates which VFP implementation is being provided. The contents of the FPSID can be transferred to an ARM integer register, then examined to determine which VFP version is available. There is also a Floating-point Exception register (FPEXC). Two bits of the FPEXC register provide system-level status and control. The remaining bits of this register are defined by the sub-architecture. These additional system registers should not be accessed by user applications.

9.4 Load/Store Instructions

The VFP provides several instructions for moving data between memory and the VFP registers. There are instructions for loading and storing single and double precision registers, and for moving multiple registers to or from memory. All of the load and store instructions require a memory address to be in one of the ARM integer registers.

9.4.1 Load/Store Single Register

The following instructions are used to load or store a single VFP register:

vldr Load VFP Register, and

vstr Store VFP Register.

Syntax

v<op>r{<cond>}{.<prec>} Fd, [Rn{,#offset}]

v<op>r{<cond>}{.<prec>} Fd, =label

• <op> may be either ld or st.

• Fd may be any single or double precision register.

• Rn may be any ARM integer register.

• <cond> is an optional condition code.

• <prec> may be either f32 or f64.

Operations

Name    Effect                     Description
vldr    Fd ← Mem[Rn + offset]      Load Fd using Rn as a pointer
vstr    Mem[Rn + offset] ← Fd      Store Fd using Rn as a pointer

Examples

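The published example figure is not reproduced here; a short illustrative fragment (the label and register choices are arbitrary) might look like this:

```asm
        .data
val:    .single 3.5               @ a single precision value
        .double 2.25              @ a double precision value at val+4
        .text
        ldr      r4, =val         @ r4 holds the address of val
        vldr.f32 s1, [r4]         @ load the single into s1
        vldr.f64 d2, [r4, #4]     @ load the double into d2
        vstr.f32 s1, [r4, #12]    @ store s1 twelve bytes past val
```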

9.4.2 Load/Store Multiple Registers

These instructions load or store multiple floating-point registers:

vldm Load Multiple VFP Registers, and

vstm Store Multiple VFP Registers.

As with the integer ldm and stm instructions, there are multiple versions for use in moving data and accessing stacks.

Syntax

v<op>m<mode>{<cond>}{.<prec>} Rn{!},<list>

vpush{<cond>}{.<prec>} <list>

vpop{<cond>}{.<prec>} <list>

• <op> may be either ld or st.

• <mode> is one of

ia Increment address after each transfer.

db Decrement address before each transfer.

• Rn may be any ARM integer register.

• <cond> is an optional condition code.

• <prec> may be either f32 or f64.

• <list> may be any set of contiguous single precision registers, or any set of contiguous double precision registers.

• If <mode> is db then the ! is required.

• vpop <list> is equivalent to vldmia sp!, <list>.

• vpush <list> is equivalent to vstmdb sp!, <list>.

Operations

vldmia: Load multiple registers from memory starting at the address in Rd, incrementing the address after each load:

    addr ← Rd
    for each i in register_list do
        i ← Mem[addr]
        if single then addr ← addr + 4 else addr ← addr + 8
    end for
    if ! is present then Rd ← addr

vstmia: Store multiple registers in memory starting at the address in Rd, incrementing the address after each store:

    addr ← Rd
    for each i in register_list do
        Mem[addr] ← i
        if single then addr ← addr + 4 else addr ← addr + 8
    end for
    if ! is present then Rd ← addr

vldmdb: Load multiple registers from memory ending at the address in Rd, decrementing the address before each load:

    addr ← Rd
    for each i in register_list do
        if single then addr ← addr − 4 else addr ← addr − 8
        i ← Mem[addr]
    end for
    Rd ← addr

vstmdb: Store multiple registers in memory ending at the address in Rd, decrementing the address before each store:

    addr ← Rd
    for each i in register_list do
        if single then addr ← addr − 4 else addr ← addr − 8
        Mem[addr] ← i
    end for
    Rd ← addr

Examples

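The example figure is not reproduced; fragments like the following (with arbitrarily chosen registers) show the common uses:

```asm
        vpush   {d8-d10}          @ save three doubles (vstmdb sp!, {d8-d10})
        vpop    {d8-d10}          @ restore them       (vldmia sp!, {d8-d10})
        vldmia  r0, {s0-s3}       @ load four singles from [r0]; r0 is unchanged
        vstmia  r1!, {d4-d5}      @ store two doubles; r1 advances past them
```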

9.5 Data Processing Instructions

These operations are vector-capable. For details on how to use vector mode, refer to Section 9.2.2. Instructions are provided to perform the four basic arithmetic functions, plus absolute value, negation, and square root. There are also special forms of the multiply instructions that perform multiply-accumulate.

9.5.1 Copy, Absolute Value, Negate, and Square Root

The unary operations require one source operand and a destination register. The source and destination can be the same register. There are four unary operations:

vcpy Copy VFP Register (equivalent to move),

vabs Absolute Value,

vneg Negate, and

vsqrt Square Root.

Syntax

v<op>{<cond>}.<prec> Fd, Fm

• <op> is one of cpy, abs, neg, or sqrt.

• <cond> is an optional condition code.

• <prec> may be either f32 or f64.

Operations

Name     Effect        Description
vcpy     Fd ← Fm       Copy
vabs     Fd ← |Fm|     Absolute Value
vneg     Fd ← −Fm      Negate
vsqrt    Fd ← √Fm      Square Root

Examples

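The example figure is not reproduced; an illustrative sketch (register choices are arbitrary):

```asm
        vcpy.f32  s4, s5          @ s4 = s5
        vabs.f32  s4, s4          @ s4 = |s4|
        vneg.f64  d3, d2          @ d3 = -d2
        vsqrt.f64 d3, d3          @ d3 = square root of d3
```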

9.5.2 Add, Subtract, Multiply, and Divide

The basic mathematical operations require two source operands and one destination. There are five basic mathematical operations:

vadd Add,

vsub Subtract,

vmul Multiply,

vnmul Negate and Multiply, and

vdiv Divide.

Syntax

v<op>{<cond>}.<prec> Fd, Fn, Fm

• <op> is one of add, sub, mul, nmul, or div.

• <cond> is an optional condition code.

• <prec> may be either f32 or f64.

Operations

Name     Effect               Description
vadd     Fd ← Fn + Fm         Add
vsub     Fd ← Fn − Fm         Subtract
vmul     Fd ← Fn × Fm         Multiply
vnmul    Fd ← −(Fn × Fm)      Negate and multiply
vdiv     Fd ← Fn ÷ Fm         Divide

Examples

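The example figure is not reproduced; an illustrative sketch (register choices are arbitrary):

```asm
        vadd.f32  s2, s0, s1      @ s2 = s0 + s1
        vsub.f64  d4, d5, d6      @ d4 = d5 - d6
        vmul.f32  s2, s2, s2      @ s2 = s2 squared
        vnmul.f64 d4, d5, d6      @ d4 = -(d5 * d6)
        vdiv.f32  s2, s2, s3      @ s2 = s2 / s3
```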

9.5.3 Compare

The compare instruction subtracts the value in Fm from the value in Fd and sets the flags in the FPSCR based on the result. The comparison operation will raise an exception if one of the operands is a signaling NaN. There is also a version of the instruction that will raise an exception if either operand is any type of NaN. The two comparison instructions are:

vcmp Compare, and

vcmpe Compare with Exception.

Syntax

vcmp{e}{<cond>}.<prec> Fd, Fm

• If e is present, an exception is raised if either operand is any kind of NaN. Otherwise, an exception is raised only if either operand is a signaling NaN.

• <cond> is an optional condition code.

• <prec> may be either f32 or f64.

Operations

Name       Effect                          Description
vcmp{e}    FPSCR flags ← flags(Fd − Fm)    Compare two registers

Examples

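The example figure is not reproduced. A typical idiom is to copy the VFP flags into the integer APSR (using the vmrs instruction covered later in this chapter) so that ordinary conditional branches can test the result; the label below is illustrative:

```asm
        vcmp.f64 d0, d1            @ set the FPSCR flags from d0 - d1
        vmrs     APSR_nzcv, fpscr  @ copy the flags to the integer APSR
        bgt      bigger            @ taken when d0 > d1
        vcpy.f64 d0, d1            @ otherwise make d0 a copy of d1
bigger:
```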

9.6 Data Movement Instructions

With the addition of all of the VFP registers, there are many more possibilities for how data can be moved. There are many more registers, and VFP registers may be 32 or 64 bits wide. This results in several possible combinations for moving data among all of the registers. The VFP instruction set includes instructions for moving data between two VFP registers, between VFP and integer registers, and between the various system registers.

9.6.1 Moving Between Two VFP Registers

The most basic move instruction involving VFP registers simply moves data between two floating point registers. The instruction is:

vmov Move Between VFP Registers.

Syntax

vmov{<cond>}{.<prec>} Fd, Fm

• F can be s or d.

• Fd and Fm must be the same size.

• <cond> is an optional condition code.

• <prec> is either f32 or f64.

Operations

Name     Effect       Description
vmov     Fd ← Fm      Move Fm to Fd

Examples

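The example figure is not reproduced; an illustrative sketch (register choices are arbitrary):

```asm
        vmov.f32   s3, s9         @ copy s9 to s3
        vmov.f64   d1, d7         @ copy d7 to d1
        vmovmi.f32 s3, s10        @ conditional: copy only if the N flag is set
```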

9.6.2 Moving Between VFP Register and One Integer Register

This version of the move instruction allows 32 bits of data to be moved between an ARM integer register and a floating point register. The instruction is:

vmov Move Between VFP and One ARM Integer Register.

Syntax

vmov{<cond>} Rd, Sn

vmov{<cond>} Sn, Rd

• Rd is an ARM integer register.

• Sn is a VFP single precision register.

• <cond> is an optional condition code.

Operations

Name           Effect       Description
vmov Rd,Sn     Rd ← Sn      Move Sn to Rd
vmov Sn,Rd     Sn ← Rd      Move Rd to Sn

Examples

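The example figure is not reproduced; an illustrative sketch (register choices are arbitrary). Note that the bits are copied unchanged; no conversion is performed:

```asm
        vmov r0, s5               @ copy the 32 bits in s5 to r0
        vmov s5, r1               @ copy the 32 bits in r1 to s5
```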

9.6.3 Moving Between VFP Register and Two Integer Registers

This version of the move instruction is used to transfer 64 bits of data between ARM integer registers and floating point registers:

vmov Move Between VFP and Two ARM Integer Registers.

Syntax

vmov{<cond>} destination(s), source(s)

• Source and destination must be VFP or integer registers. One of them must be a set of ARM integer registers, and the other must be VFP coprocessor registers. The following table shows the possible choices for sources and destinations.

ARM Integer    Floating Point
Rl, Rh         Dd
Rl, Rh         Sd, Sd’

• Sd and Sd’ must be adjacent, and Sd’ must be the higher-numbered register.

• <cond> is an optional condition code.

Operations

Name                 Effect                Description
vmov Dd,Rl,Rh        Dd ← Rh:Rl            Move Rh and Rl to Dd
vmov Rl,Rh,Dm        Rh:Rl ← Dm            Move Dm to Rh and Rl
vmov Sd,Sd’,Rl,Rh    Sd ← Rl, Sd’ ← Rh     Move Rl and Rh to Sd and Sd’
vmov Rl,Rh,Sd,Sd’    Rl ← Sd, Rh ← Sd’     Move Sd and Sd’ to Rl and Rh

Examples

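The example figure is not reproduced; an illustrative sketch (register choices are arbitrary):

```asm
        vmov d0, r2, r3           @ d0 = r3:r2 (r2 supplies the low word)
        vmov r2, r3, d1           @ r2 = low word of d1, r3 = high word
        vmov s4, s5, r0, r1       @ s4 = r0, s5 = r1
        vmov r0, r1, s4, s5       @ r0 = s4, r1 = s5
```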

9.6.4 Move Between ARM Register and VFP System Register

There are two instructions which allow the programmer to examine and change bits in the VFP system register(s):

vmrs Move From VFP System Register to ARM Register, and

vmsr Move From ARM Register to VFP System Register.

User programs should only access the FPSCR to check the flags and control vector mode.

Syntax

vmrs{<cond>} Rd, VFPsysreg

vmsr{<cond>} VFPsysreg, Rd

• VFPsysreg can be any of the VFP system registers.

• Rd can be APSR_nzcv or any ARM integer register.

• <cond> is an optional condition code.

Operations

Name     Effect              Description
vmrs     Rd ← VFPsysreg      Move data from VFP system register to integer register
vmsr     VFPsysreg ← Rd      Move data from integer register to VFP system register

Examples

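The example figure is not reproduced; an illustrative sketch. The FPSCR bit masks follow the field positions described in this chapter:

```asm
        vmrs r0, fpscr                 @ read the FPSCR into r0
        bic  r0, r0, #0x00370000       @ clear LEN and STRIDE (scalar mode)
        vmsr fpscr, r0                 @ write the FPSCR back
        vmrs APSR_nzcv, fpscr          @ or copy just the flags to the APSR
```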

9.7 Data Conversion Instructions

The ARM VFP provides several instructions for converting between various floating point and integer formats. Some VFP versions also have instructions for converting between fixed point and floating point formats.

9.7.1 Convert Between Floating Point and Integer

These instructions are used to convert integers to single or double precision floating point, or for converting single or double precision to integer:

vcvt Convert Between Floating Point and Integer

vcvtr Convert Floating Point to Integer with Rounding

These instructions always use a single precision register for the integer, but the floating point argument can be single precision or double precision. Some versions of the VFP do not support the double precision versions.

Syntax

vcvt{r}{<cond>}.<type>.f64 Sd, Dm

vcvt{r}{<cond>}.<type>.f32 Sd, Sm

vcvt{<cond>}.f64.<type> Dd, Sm

vcvt{<cond>}.f32.<type> Sd, Sm

• The optional r makes the operation use the rounding mode specified in the FPSCR. The default is to round toward zero.

• <cond> is an optional condition code.

• The <type> can be either u32 or s32 to specify unsigned or signed integer.

• These instructions can also convert from fixed point to floating point if followed by an appropriate vmul.

Operation

Opcode          Effect               Description
vcvt.f64.s32    Dd ← double(Sm)      Convert signed integer to double
vcvt.f32.s32    Sd ← single(Sm)      Convert signed integer to single
vcvt.f64.u32    Dd ← double(Sm)      Convert unsigned integer to double
vcvt.f32.u32    Sd ← single(Sm)      Convert unsigned integer to single
vcvt.s32.f32    Sd ← int(Sm)         Convert single to signed integer
vcvt.u32.f32    Sd ← unsigned(Sm)    Convert single to unsigned integer
vcvt.s32.f64    Sd ← int(Dm)         Convert double to signed integer
vcvt.u32.f64    Sd ← unsigned(Dm)    Convert double to unsigned integer

Examples

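The example figure is not reproduced; an illustrative sketch (register choices are arbitrary):

```asm
        mov  r0, #42
        vmov s0, r0               @ put the integer 42 into s0
        vcvt.f64.s32 d1, s0       @ d1 = 42.0
        vcvt.f32.s32 s1, s0       @ s1 = 42.0 (single precision)
        vcvt.s32.f64 s2, d1       @ s2 = 42 (rounded toward zero)
```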

9.7.2 Convert Between Fixed Point and Single Precision

VFPv3 and higher coprocessors have additional instructions used for converting between fixed point and single precision floating point:

vcvt Convert To or From Fixed Point.

Syntax

vcvt{<cond>}.<td>.f32 Sd, Sm, #fbits

vcvt{<cond>}.f32.<td> Sd, Sm, #fbits

• <cond> is an optional condition code.

• <td> specifies the type and size of the fixed point number, and must be one of the following:

s32 signed 32 bit value,

u32 unsigned 32 bit value,

s16 signed 16 bit value, or

u16 unsigned 16 bit value.

• The #fbits operand specifies the number of fraction bits in the fixed point number, and must be less than or equal to the size of the fixed point number indicated by <td>.

Operations

Name            Effect                Description
vcvt.s32.f32    Sd ← fixed32(Sm)      Convert single precision to 32-bit signed fixed point
vcvt.u32.f32    Sd ← ufixed32(Sm)     Convert single precision to 32-bit unsigned fixed point
vcvt.s16.f32    Sd ← fixed16(Sm)      Convert single precision to 16-bit signed fixed point
vcvt.u16.f32    Sd ← ufixed16(Sm)     Convert single precision to 16-bit unsigned fixed point
vcvt.f32.s32    Sd ← single(Sm)       Convert signed 32-bit fixed point to single precision
vcvt.f32.u32    Sd ← single(Sm)       Convert unsigned 32-bit fixed point to single precision
vcvt.f32.s16    Sd ← single(Sm)       Convert signed 16-bit fixed point to single precision
vcvt.f32.u16    Sd ← single(Sm)       Convert unsigned 16-bit fixed point to single precision

Examples

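The example figure is not reproduced. In this sketch, s0 is assumed to hold an S15.16 fixed point value (16 fraction bits); note that these conversions are commonly written with the source and destination being the same register:

```asm
        vcvt.f32.s32 s0, s0, #16  @ convert the fixed point value to single precision
        vcvt.s32.f32 s0, s0, #16  @ convert back to S15.16 fixed point
```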

9.8 Floating Point Sine Function

A fixed point implementation of the sine function was discussed in Section 8.7, and shown to be superior to the floating point sine function provided by GCC. Now that we have covered the VFP instructions, we can write an assembly version using floating point which also performs better than the routines provided by GCC.

9.8.1 Sine Function Using Scalar Mode

Listing 9.1 shows a single precision floating point implementation of the sine function, using the ARM VFPv3 instruction set. It works in a similar way to the previous fixed point code. There is a table of constants, each of which is the reciprocal of one of the factorial divisors in the Taylor series for sine. The subroutine calculates the powers of x one-by-one, and multiplies each power by the next constant in the table, summing the results as it goes. Note that the table of constants is shorter than the fixed point version of the code, because there are fewer bits of precision in a single precision floating point number than there are in the fixed point representation that was used previously.

Listing 9.1 Simple scalar implementation of the sin x function using IEEE single precision.

Listing 9.2 shows a double precision floating point implementation of the sine function, using the ARM VFPv3 instruction set. Again, there is a table of constants, each of which is the reciprocal of one of the factorial divisors in the Taylor series for sine. The subroutine calculates the powers of x one-by-one, and multiplies each power by the next constant in the table, summing the results as it goes. Note that the table of constants is longer than the fixed point version of the code, because there are more bits of precision in a double precision floating point number than there are in the fixed point representation that was used previously.

Listing 9.2 Simple scalar implementation of the sin x function using IEEE double precision.

9.8.2 Sine Function Using Vector Mode

The previous implementations are already faster than the implementations provided by GCC. However, it may be possible to gain a little more performance by using VFP vector mode. In the single precision code, there are five terms to be added. Since single precision vectors can have up to eight elements, the code should not require any loop at all.

Listing 9.3 shows a single precision floating point implementation of the sine function, using the ARM VFPv3 instruction set in vector mode. It performs the same operations as the previous implementation, but instead of using a loop, all of the data is pre-loaded into vector banks and then a vector multiply operation is performed. The processor is then returned to scalar mode, and the summation is performed. This implementation is slightly faster than the previous version.

Listing 9.3 Vector implementation of the sin x function using IEEE single precision.

Listing 9.4 shows a double precision floating point implementation of the sine function, using the ARM VFPv3 instruction set in vector mode. It performs the same operations as the previous implementation, but performs the nine multiplications in three groups of three, using vector operations. Also, computing the powers of x is done within the loop, using a vector multiply. In this case, the vector code is significantly faster than the scalar version.

Listing 9.4 Vector implementation of the sin x function using IEEE double precision.

9.8.3 Performance Comparison

Table 9.2 shows the performance of various implementations of the sine function, with and without compiler optimization. The Single Precision C and Double Precision C implementations are the standard implementations provided by GCC.

Table 9.2

Performance of sine function with various implementations

OptimizationImplementationCPU seconds
NoneSingle Precision Scalar Assembly2.96
Single Precision Vector Assembly2.63
Single Precision C8.75
Double Precision Scalar Assembly4.59
Double Precision Vector Assembly3.75
Double Precision C9.21
FullSingle Precision Scalar Assembly2.16
Single Precision Vector Assembly2.06
Single Precision C2.59
Double Precision Scalar Assembly3.88
Double Precision Vector Assembly3.16
Double Precision C8.49

When compiler optimization is not used, the single precision scalar VFP implementation achieves a speedup of about 2.96, and the vector implementation achieves a speedup of about 3.33 compared to the GCC implementation. The double precision scalar VFP implementation achieves a speedup of about 2.01, and the vector implementation achieves a speedup of about 2.46 compared to the GCC implementation.

When the best possible compiler optimization is used (-Ofast), the single precision scalar VFP implementation achieves a speedup of about 1.20, and the vector implementation achieves a speedup of about 1.26 compared to the GCC implementation. The double precision scalar VFP implementation achieves a speedup of about 2.19, and the vector implementation achieves a speedup of about 2.69 compared to the GCC implementation.

In most cases, the assembly versions were significantly faster than the functions provided by GCC. GCC with full optimization using single-precision numbers was competitive, but the assembly language vector implementation still beat it by over 25%. It is clear that writing some functions in assembly can result in large performance gains.

9.9 Alphabetized List of VFP Instructions

NamePageOperation
vabs277Absolute Value
vadd278Add
vcmp279Compare
vcmpe279Compare with Exception
vcpy277Copy VFP Register
vcvt283Convert Between Floating Point and Integer
vcvt284Convert To or From Fixed Point
vcvtr283Convert Floating Point to Integer with Rounding
vdiv278Divide
vldm275Load Multiple VFP Registers
vldr274Load VFP Register
vmov280Move Between VFP and One ARM Integer Register
vmov281Move Between VFP and Two ARM Integer Registers
vmov279Move Between VFP Registers
vmrs282Move From VFP System Register to ARM Register
vmsr282Move From ARM Register to VFP System Register
vmul278Multiply
vneg277Negate
vnmul278Negate and Multiply
vsqrt277Square Root
vstm275Store Multiple VFP Registers
vstr274Store VFP Register
vsub278Subtract

9.10 Chapter Summary

The ARM VFP coprocessor adds a great deal of power to the ARM architecture. The register set is expanded to hold up to four times the amount of data that can be held in the ARM integer registers. The additional instructions allow the programmer to deal directly with the most common IEEE 754 formats for floating point numbers. The ability to treat groups of registers as vectors adds a significant performance improvement. Access to the vector features is only possible through assembly language. The GCC compiler is not capable of using these advanced features, which gives the assembly programmer a big advantage when high-performance code is needed.

Exercises

9.1 How many registers does the VFP coprocessor add to the ARM architecture?

9.2 What is the purpose of the FZ, DN, and IDE, IXE, UFE, OFE, DZE, and IOE bits in the FPSCR? What is it called when FZ and DN are set to one and all of the others are set to zero?

9.3 If a VFP coprocessor is present, how are floating point parameters passed to subroutines? How is a pointer to a floating point value (or array of values) passed to a subroutine?

9.4 Write the following C code in ARM assembly:

[C code listing not reproduced]

9.5 In the previous exercise, the C code contains a subtle bug.

a. What is the bug?

b. Show two ways to fix the code in ARM assembly. Hint: One way is to change the amount of the increment, which will change the number of times that the loop executes.

9.6 The fixed point sine function from the previous chapter was not compared directly to the hand-coded VFP implementation. Based on the information in Tables 9.2 and 8.4, would you expect the fixed point sine function from the previous chapter to beat the hand-coded assembly VFP sine function in this chapter? Why or why not?

9.7 3-D objects are often stored as an array of points, where each point is a vector (array) consisting of four values, x, y, z, and the constant 1.0. Rotation, translation, scaling and other operations are accomplished by multiplying each point by a 4 × 4 transformation matrix. The following C code shows the data types and the transform operation:

[C code listing not reproduced]

Write the equivalent ARM assembly code.

9.8 Optimize the ARM assembly code you wrote in the previous exercise. Use vector mode if possible.

9.9 Since the fourth element of the point is always 1.0, there is no need to actually store it. This will reduce memory requirements by about 25%, and require one fewer multiply. The C code would look something like this:

[C code listing not reproduced]

Write optimal ARM VFP code to implement this function.

9.10 The function in the previous problem would typically be called multiple times to process an array of points, as in the following function:

[C code listing not reproduced]

This could be somewhat inefficient. Re-write this function in assembly so that the transformation of each point is done without resorting to a function call. Make your code as efficient as possible.

Chapter 10

The ARM NEON Extensions

Abstract

This chapter begins with an overview of the NEON extensions and explains the relationship between VFP and NEON. The NEON registers are described, and the syntax for NEON instructions is explained. Next, each of the NEON instructions is explained, with short examples. In some cases, extended examples and figures are provided to help explain the operation of complex instructions. After all of the instructions have been covered, another implementation of sine is presented and compared with the previous implementations and with the GCC sine function. It is shown that NEON gives a significant performance advantage over VFP, and that hand-coded assembly is much faster than the sin function provided by the compiler.

Keywords

Single instruction multiple data (SIMD); Vector; Vector element; Instruction level parallelism; Lane

The ARM VFP coprocessor has been replaced or augmented by the NEON architecture on ARMv7 and higher systems. NEON extends the VFP instruction set with about 125 instructions and pseudo-instructions to support not only floating point, but also integer and fixed point. NEON also supports Single Instruction, Multiple Data (SIMD) operations. All NEON processors have the full set of 32 double precision VFP registers, but NEON adds the ability to view the register set as 16 128-bit (quadruple-word) registers, named q0 through q15.

A single NEON instruction can operate on up to 128 bits, which may represent multiple integer, fixed point, or floating point numbers. For example, if two of the 128-bit registers each contain eight 16-bit integers, then a single NEON instruction can add all eight integers from one register to the corresponding integers in the other register, resulting in eight simultaneous additions. For certain applications, this SIMD architecture can result in extremely fast and efficient implementations. NEON is particularly useful at handling streaming video and audio, but also can give very good performance on floating point intensive tasks. NEON instructions perform parallel operations on vectors. NEON deprecates the use of VFP vector mode covered in Section 9.2.2. On most NEON systems, using the VFP vector mode will result in an exception, which transfers control to the support code which emulates vector mode in software. This causes a severe performance penalty, so VFP vector mode should not be used on NEON systems.

Fig. 10.1 shows the ARM integer, VFP, and NEON register set. NEON views each register as containing a vector of 1, 2, 4, 8, or 16 elements, all of the same size and type. Individual elements of each vector can also be accessed as scalars. A scalar can be 8 bits, 16 bits, 32 bits, or 64 bits. The instruction syntax is extended to refer to scalars using an index, x, in a doubleword register. Dm[x] is element x in register Dm. The size of the elements is given as part of the instruction. Instructions that access scalars can access any element in the register bank.
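The scalar syntax can be sketched as follows (register and element choices are arbitrary):

```asm
        vmul.f32 d0, d1, d2[0]    @ multiply each float in d1 by element 0 of d2
        vmov.32  r0, d5[1]        @ copy 32-bit element 1 of d5 into r0
```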

Figure 10.1 ARM integer and NEON user program registers.

10.1 NEON Intrinsics

The GCC compiler gives C (and C++) programs direct access to the NEON instructions through the NEON intrinsics. The intrinsics are a large set of functions that are built into the compiler. Most of the intrinsic functions map to a single NEON instruction. Additional functions are provided for typecasting (reinterpreting) NEON vectors, so that the C compiler does not complain about mismatched types. It is usually shorter and more efficient to write the NEON code directly as assembly language functions and link them to the C code, but doing so requires knowledge of assembly language.

10.2 Instruction Syntax

Some instructions require specific register types. Other instructions allow the programmer to choose single word, double word, or quad word registers. If the instruction requires single precision registers, then the registers are specified as Sd for the destination register, Sn for the first operand register, and Sm for the second operand register. If the instruction requires only two registers, then Sn is not used. The lower-case letter is replaced with a valid register number. The register name is not case sensitive, so S10 and s10 are both valid names for single precision register 10.

The syntax of the NEON instructions can be described using a relatively simple notation. The notation consists of the following elements:

{item} Braces around an item indicate that the item is optional. For example, many operations have an optional condition, which is written as {<cond>}.

Ry An ARM integer register. y can be any number in the range 0–15.

Sy A 32-bit or single precision register. y can be any number in the range 0–31.

Dy A 64-bit or double precision register. y can be any number in the range 0–31.

Qy A quad word register. y can be any number in the range 0–15.

Fy A VFP register. F must be either s for a single word register, or d for a double word register. y can be any valid register number.

Ny A NEON or VFP register. N must be either s for a single word register, d for a double word register, or q for a quad word register. y can be any valid register number.

Vy A NEON vector register. V must be replaced with d for a double word register, or q for a quad word register. y can be any valid register number.

Vy[x] A NEON scalar (vector element). The size of the scalar is defined as part of the instruction. V must be replaced with d for a double word register, or q for a quad word register. y can be any valid register number. x specifies which scalar element of Vy is to be used. Valid values for x can be deduced by the size of Vy and the size of the scalars that the instruction uses.

<op> Operation specific part of a general instruction format

<n> An integer usually indicating a specific instruction version

<size> An integer indicating the number of bits used

<cond> ARM condition code from Table 3.2

<type> Many instructions operate on one or more of the following specific data types:

i8 Untyped 8 bits

i16 Untyped 16 bits

i32 Untyped 32 bits

i64 Untyped 64 bits

s8 Signed 8-bit integer

s16 Signed 16-bit integer

s32 Signed 32-bit integer

s64 Signed 64-bit integer

u8 Unsigned 8-bit integer

u16 Unsigned 16-bit integer

u32 Unsigned 32-bit integer

u64 Unsigned 64-bit integer

f16 IEEE 754 half precision floating point

f32 IEEE 754 single precision floating point

f64 IEEE 754 double precision floating point

<list> A brace-delimited list of up to four NEON registers, vectors, or scalars. The general form is {Dn,D(n+a),D(n+2a),D(n+3a)} where a is either 1 or 2.

<align> Specifies the memory alignment of structured data for certain load and store operations.

<imm> An immediate value. The required format for immediate values depends on the instruction.

<fbits> Specifies the number of fraction bits in fixed point numbers.

The following function definitions are used in describing the effects of many of the instructions:

⌊x⌋ The floor function maps a real number, x, to the largest integer that is not greater than x.

sat(x) The saturate function limits the value of x to the highest or lowest value that can be stored in the destination register.

round(x) The round function maps a real number, x, to the nearest integer.

narrow(x) The narrow function reduces a 2n bit number to an n bit number, by taking the n least significant bits.

extend(x) The extend function converts an n bit number to a 2n bit number, performing zero extension if the number is unsigned, or sign extension if the number is signed.
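The saturate, narrow, and extend functions can be sketched in Python. This is only a model of the definitions above; the function names and the two's-complement convention for extend are choices made here, not part of the instruction set.

```python
# Python models of the helper functions used in the instruction
# descriptions (floor and round are built into Python already).

def sat(x, lo, hi):
    """Saturate: clamp x to the range [lo, hi] of the destination."""
    return max(lo, min(hi, x))

def narrow(x, n):
    """Narrow: keep only the n least significant bits of a 2n-bit value."""
    return x & ((1 << n) - 1)

def extend(x, n, signed):
    """Extend an n-bit value to a wider register: sign-extend when
    signed, zero-extend otherwise (modeled with Python integers)."""
    if signed and (x & (1 << (n - 1))):
        return x - (1 << n)   # reinterpret the n-bit pattern as negative
    return x
```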

10.3 Load and Store Instructions

These instructions can be used to perform interleaving of data when structured data is loaded or stored. The data should be properly aligned for best performance. These instructions are very useful for common multimedia data types.

For example, image data is typically stored in arrays of pixels, where each pixel is a small data structure such as the pixel struct shown in Listing 5.37. Since each pixel is three bytes and a d register is eight bytes, loading a single pixel into one register would be inefficient. It would be much better to load multiple pixels at once, but a whole number of pixels will not fit exactly in one register. It takes three doubleword or quadword registers to hold a whole number of pixels without wasting space, as shown in Fig. 10.2. This is the way the data would be loaded using a VFP vldr or vldm instruction. Many image processing operations work best if each color “channel” is processed separately. The NEON load and store vector instructions can be used to split the image data into color channels, where each channel is stored in a different register, as shown in Fig. 10.3.

Figure 10.2 Pixel data interleaved in three doubleword registers.
Figure 10.3 Pixel data de-interleaved in three doubleword registers.

Other examples of interleaved data include stereo audio, which is two interleaved channels, and surround sound, which may have up to nine interleaved channels. In all of these cases, most processing operations are simplified when the data is separated into non-interleaved channels.
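The de-interleaving that a NEON structure load performs can be modeled in Python. This is only a sketch of the concept, using hypothetical pixel values; a real vld3.8 would do the same split in hardware, three registers at a time.

```python
# Model of de-interleaving RGB pixel data, as a vld3.8 would do:
# eight 3-byte pixels are split into separate R, G, and B channels.
pixels = bytes([v for p in range(8) for v in (p, p + 100, p + 200)])

red   = list(pixels[0::3])   # every third byte, starting at offset 0
green = list(pixels[1::3])   # offset 1: the green channel
blue  = list(pixels[2::3])   # offset 2: the blue channel
```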

10.3.1 Load or Store Single Structure Using One Lane

These instructions are used to load and store structured data across multiple registers:

vld<n> Load Structured Data, and

vst<n> Store Structured Data.

They can be used for interleaving or deinterleaving the data as it is loaded or stored, as shown in Fig. 10.3.

Syntax

 v<op><n>.<size> <list>,[Rn{:<align>}]{!}

 v<op><n>.<size> <list>,[Rn{:<align>}],Rm

 <op> must be either ld or st.

 <n> must be one of 1, 2, 3, or 4.

 <size> must be one of 8, 16, or 32.

 <list> specifies the list of registers. There are four list formats:

1. {Dd[x]}

2. {Dd[x], D(d+a)[x]}

3. {Dd[x], D(d+a)[x], D(d+2a)[x]}

4. {Dd[x], D(d+a)[x], D(d+2a)[x], D(d+3a)[x]}

where a can be either 1 or 2. Every register in the list must be in the range d0-d31.

 Rn is the ARM register containing the base address. Rn cannot be pc.

 <align> specifies an optional alignment. If <align> is not specified, then standard alignment rules apply.

 The optional ! indicates that Rn is updated after the data is transferred. This is similar to the ldm and stm instructions.

 Rm is an ARM register containing an offset from the base address. If Rm is present, Rn is updated to Rn + Rm after the address is used to access memory. Rm cannot be sp or pc.

Table 10.1 shows all valid combinations of parameters for these instructions. Note that the same vector element (scalar) x must be used in each register. Up to four registers can be specified. If the structure has more than four fields, then these instructions can be used repeatedly to load or store all of the fields.

Table 10.1

Parameter combinations for loading and storing a single structure

<n> | <size> | <list> | <align> | Alignment
1 | 8 | Dd[x] | none | Standard only
  | 16 | Dd[x] | 16 | 2 byte
  | 32 | Dd[x] | 32 | 4 byte
2 | 8 | Dd[x], D(d+1)[x] | 16 | 2 byte
  | 16 | Dd[x], D(d+1)[x] | 32 | 4 byte
  |    | Dd[x], D(d+2)[x] | 32 | 4 byte
  | 32 | Dd[x], D(d+1)[x] | 64 | 8 byte
  |    | Dd[x], D(d+2)[x] | 64 | 8 byte
3 | 8 | Dd[x], D(d+1)[x], D(d+2)[x] | none | Standard only
  | 16 or 32 | Dd[x], D(d+1)[x], D(d+2)[x] | none | Standard only
  |          | Dd[x], D(d+2)[x], D(d+4)[x] | none | Standard only
4 | 8 | Dd[x], D(d+1)[x], D(d+2)[x], D(d+3)[x] | 32 | 4 byte
  | 16 | Dd[x], D(d+1)[x], D(d+2)[x], D(d+3)[x] | 64 | 8 byte
  |    | Dd[x], D(d+2)[x], D(d+4)[x], D(d+6)[x] | 64 | 8 byte
  | 32 | Dd[x], D(d+1)[x], D(d+2)[x], D(d+3)[x] | 64 or 128 | (<align> ÷ 8) bytes
  |    | Dd[x], D(d+2)[x], D(d+4)[x], D(d+6)[x] | 64 or 128 | (<align> ÷ 8) bytes

Operations

Name: vld<n>
Effect:
  tmp ← Rn
  incr ← <size> ÷ 8
  for each D in Dregs(<list>) do
    D[x] ← Mem[tmp]
    tmp ← tmp + incr
  end for
  if ! is present then
    Rn ← tmp
  else if Rm is specified then
    Rn ← Rn + Rm
  end if
Description: Load one or more data items into a single lane of one or more registers.

Name: vst<n>
Effect:
  tmp ← Rn
  incr ← <size> ÷ 8
  for each D in Dregs(<list>) do
    Mem[tmp] ← D[x]
    tmp ← tmp + incr
  end for
  if ! is present then
    Rn ← tmp
  else if Rm is specified then
    Rn ← Rn + Rm
  end if
Description: Store one or more data items from a single lane of one or more registers.

Examples

Fig. 10.11 shows examples of these instructions.
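The addressing behavior of a single-lane load, including the optional writeback forms, can be sketched in Python. This is a simplified model assuming one-byte elements; the function name and list-of-lists register representation are inventions for illustration only.

```python
# Sketch of vld<n> single-lane: load one structure of len(regs)
# fields from memory into lane x of the listed "registers",
# with optional base-register writeback.
def vldn_lane(mem, rn, regs, x, writeback=False, rm=None):
    tmp = rn
    for d in regs:          # one register per structure field
        d[x] = mem[tmp]
        tmp += 1            # incr = <size> / 8 = 1 for byte elements
    if writeback:
        return tmp          # the "!" form: Rn <- tmp
    if rm is not None:
        return rn + rm      # post-indexed form: Rn <- Rn + Rm
    return rn               # Rn unchanged
```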

10.3.2 Load Copies of a Structure to All Lanes

This instruction is used to load multiple copies of structured data across multiple registers:

vld<n> Load Copies of Structured Data.

The data is copied to all lanes. This instruction is useful for initializing vectors for use in later instructions.

Syntax

 vld<n>.<size> <list>,[Rn{:<align>}]{!}

 vld<n>.<size> <list>,[Rn{:<align>}],Rm

 <n> must be one of 1, 2, 3, or 4.

 <size> must be one of 8, 16, or 32.

 <list> specifies the list of registers. There are four list formats:

1. {Dd[]}

2. {Dd[], D(d+a)[]}

3. {Dd[], D(d+a)[], D(d+2a)[]}

4. {Dd[], D(d+a)[], D(d+2a)[], D(d+3a)[]}

where a can be either 1 or 2. Every register in the list must be in the range d0-d31.

 Rn is the ARM register containing the base address. Rn cannot be pc.

 <align> specifies an optional alignment. If <align> is not specified, then standard alignment rules apply.

 The optional ! indicates that Rn is updated after the data is transferred. This is similar to the ldm and stm instructions.

 Rm is an ARM register containing an offset from the base address. If Rm is present, Rn is updated to Rn + Rm after the address is used to access memory. Rm cannot be sp or pc.

Table 10.2 shows all valid combinations of parameters for this instruction. Note that the vector element number is not specified, but the brackets [] must be present. Up to four registers can be specified. If the structure has more than four fields, then this instruction can be repeated to load or store all of the fields.

Table 10.2

Parameter combinations for loading copies of a structure to all lanes

<n> | <size> | <list> | <align> | Alignment
1 | 8 | Dd[] | none | Standard only
  |   | Dd[], D(d+1)[] | none | Standard only
  | 16 | Dd[] | 16 | 2 byte
  |    | Dd[], D(d+1)[] | 16 | 2 byte
  | 32 | Dd[] | 32 | 4 byte
  |    | Dd[], D(d+1)[] | 32 | 4 byte
2 | 8 | Dd[], D(d+1)[] | 8 | 1 byte
  |   | Dd[], D(d+2)[] | 8 | 1 byte
  | 16 | Dd[], D(d+1)[] | 16 | 2 byte
  |    | Dd[], D(d+2)[] | 16 | 2 byte
  | 32 | Dd[], D(d+1)[] | 32 | 4 byte
  |    | Dd[], D(d+2)[] | 32 | 4 byte
3 | 8, 16, or 32 | Dd[], D(d+1)[], D(d+2)[] | none | Standard only
  |              | Dd[], D(d+2)[], D(d+4)[] | none | Standard only
4 | 8 | Dd[], D(d+1)[], D(d+2)[], D(d+3)[] | 32 | 4 byte
  |   | Dd[], D(d+2)[], D(d+4)[], D(d+6)[] | 32 | 4 byte
  | 16 | Dd[], D(d+1)[], D(d+2)[], D(d+3)[] | 64 | 8 byte
  |    | Dd[], D(d+2)[], D(d+4)[], D(d+6)[] | 64 | 8 byte
  | 32 | Dd[], D(d+1)[], D(d+2)[], D(d+3)[] | 64 or 128 | (<align> ÷ 8) bytes
  |    | Dd[], D(d+2)[], D(d+4)[], D(d+6)[] | 64 or 128 | (<align> ÷ 8) bytes

Operations

Name: vld<n>
Effect:
  tmp ← Rn
  incr ← <size> ÷ 8
  nlanes ← 64 ÷ <size>
  for each D in Dregs(<list>) do
    for 0 ≤ x < nlanes do
      D[x] ← Mem[tmp]
    end for
    tmp ← tmp + incr
  end for
  if ! is present then
    Rn ← tmp
  else if Rm is specified then
    Rn ← Rn + Rm
  end if
Description: Load one data item for each listed register and copy it into all lanes of that register.

Examples

Fig. 10.12 shows examples of this instruction.
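The all-lanes form can be sketched in Python. A minimal model, assuming one-byte elements; the function name is an invention for illustration.

```python
# Sketch of vld1 "all lanes": one element is loaded from memory and
# copied into every lane of the destination register.
def vld1_all_lanes(mem, rn, nlanes):
    value = mem[rn]          # a single load from the base address
    return [value] * nlanes  # every lane receives a copy

# A d register with 8-bit elements has eight lanes.
d5 = vld1_all_lanes([7, 8, 9], 1, 8)
```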

10.3.3 Load or Store Multiple Structures

These instructions are used to load and store multiple data structures across multiple registers with interleaving or deinterleaving:

vld<n> Load Multiple Structured Data, and

vst<n> Store Multiple Structured Data.

Syntax

 v<op><n>.<size> <list>,[Rn{:<align>}]{!}

 v<op><n>.<size> <list>,[Rn{:<align>}],Rm

 <op> must be either ld or st.

 <n> must be one of 1, 2, 3, or 4.

 <size> must be one of 8, 16, or 32.

 <list> specifies the list of registers. There are four list formats:

1. {Dd}

2. {Dd, D(d+a)}

3. {Dd, D(d+a), D(d+2a)}

4. {Dd, D(d+a), D(d+2a), D(d+3a)}

where a can be either 1 or 2. Every register in the list must be in the range d0-d31.

 Rn is the ARM register containing the base address. Rn cannot be pc.

 <align> specifies an optional alignment. If <align> is not specified, then standard alignment rules apply.

 The options ! indicates that Rn is updated after the data is transferred, similar to the ldm and stm instructions.

 Rm is an ARM register containing an offset from the base address. If Rm is present, Rn is updated to Rn + Rm after the address is used to access memory. Rm cannot be sp or pc.

Table 10.3 shows all valid combinations of parameters for these instructions. Note that no scalar is specified; the instructions operate on all of the vector elements. Up to four registers can be specified. If the structure has more than four fields, then these instructions can be repeated to load or store all of the fields.

Table 10.3

Parameter combinations for loading or storing multiple structures

<n> | <size> | <list> | <align> | Alignment
1 | 8, 16, 32, or 64 | Dd | 64 | 8 bytes
  |                  | Dd, D(d+1) | 64 or 128 | (<align> ÷ 8) bytes
  |                  | Dd, D(d+1), D(d+2) | 64 | 8 bytes
  |                  | Dd, D(d+1), D(d+2), D(d+3) | 64, 128, or 256 | (<align> ÷ 8) bytes
2 | 8, 16, or 32 | Dd, D(d+1) | 64 or 128 | (<align> ÷ 8) bytes
  |              | Dd, D(d+2) | 64 or 128 | (<align> ÷ 8) bytes
  |              | Dd, D(d+1), D(d+2), D(d+3) | 64, 128, or 256 | (<align> ÷ 8) bytes
3 | 8, 16, or 32 | Dd, D(d+1), D(d+2) | 64 | 8 bytes
  |              | Dd, D(d+2), D(d+4) | 64 | 8 bytes
4 | 8, 16, or 32 | Dd, D(d+1), D(d+2), D(d+3) | 64, 128, or 256 | (<align> ÷ 8) bytes
  |              | Dd, D(d+2), D(d+4), D(d+6) | 64, 128, or 256 | (<align> ÷ 8) bytes

Operations

Name: vld<n>
Effect:
  tmp ← Rn
  incr ← <size> ÷ 8
  nlanes ← 64 ÷ <size>
  for 0 ≤ x < nlanes do
    for each D in <list> do
      D[x] ← Mem[tmp]
      tmp ← tmp + incr
    end for
  end for
  if ! is present then
    Rn ← tmp
  else if Rm is specified then
    Rn ← Rn + Rm
  end if
Description: Load multiple structures, de-interleaving the fields into the lanes of the listed registers.

Name: vst<n>
Effect:
  tmp ← Rn
  incr ← <size> ÷ 8
  nlanes ← 64 ÷ <size>
  for 0 ≤ x < nlanes do
    for each D in <list> do
      Mem[tmp] ← D[x]
      tmp ← tmp + incr
    end for
  end for
  if ! is present then
    Rn ← tmp
  else if Rm is specified then
    Rn ← Rn + Rm
  end if
Description: Store multiple structures, interleaving the fields from the lanes of the listed registers.

Examples

Fig. 10.13 shows examples of these instructions.
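The store direction can be sketched in Python as the inverse of the de-interleaving load: three channel vectors are interleaved back into packed structures, as a vst3 would do. The function name and example data are inventions for illustration.

```python
# Sketch of a vst3-style store: three channel "registers" are
# interleaved lane by lane into packed 3-byte structures.
def vst3(r, g, b):
    out = []
    for i in range(len(r)):          # one structure per lane
        out.extend([r[i], g[i], b[i]])
    return bytes(out)
```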

10.4 Data Movement Instructions

Because they use the same set of registers, VFP and NEON share some instructions for loading, storing, and moving registers. The shared instructions are vldr, vstr, vldm, vstm, vpop, vpush, vmov, vmrs, and vmsr. These were explained in Chapter 9. NEON extends the vmov instructions to allow specification of NEON scalars and quadwords, and adds the ability to perform one’s complement during a move.

10.4.1 Moving Between NEON Scalar and Integer Register

This version of the move instruction allows data to be moved between the NEON registers and the ARM integer registers as 8-bit, 16-bit, or 32-bit NEON scalars:

vmov Move Between NEON and ARM.

Syntax

 vmov{<cond>}.<size> Dn[x],Rd

 vmov{<cond>}.<type> Rd,Dn[x]

 <cond> is an optional condition code.

 <size> must be 8, 16, or 32, and specifies the number of bits that are to be moved.

 The <type> must be u8, u16, u32, s8, s16, s32, or f32, and specifies the number of bits that are to be moved and whether or not the result should be sign-extended in the ARM integer destination register.

Operations

Name | Effect | Description
vmov Dn[x],Rd | Dn[x] ← Rd | Move the least significant <size> bits of Rd to NEON scalar Dn[x].
vmov Rd,Dn[x] | Rd ← Dn[x] | Move NEON scalar Dn[x] to Rd, sign- or zero-extending it as specified by <type>.

Examples

Fig. 10.14 shows examples of this instruction.

10.4.2 Move Immediate Data

NEON extends the VFP vmov instruction to include the ability to move an immediate value, or the one’s complement of an immediate value, to every element of a register. The instructions are:

vmov Move Immediate, and

vmvn Move Immediate NOT.

Syntax

 v<op>.<type> Vd, #<imm>

 <op> must be either mov or mvn.

 <type> must be i8, i16, i32, f32, or i64, and specifies the size of items in the vector.

 V can be s, d, or q.

 <imm> is an immediate value that matches <type>, and is copied to every element in the vector. The following table shows valid formats for imm:

<type> | vmov | vmvn
i8 | 0xXY | 0xXY
i16 | 0x00XY | 0xFFXY
    | 0xXY00 | 0xXYFF
i32 | 0x000000XY | 0xFFFFFFXY
    | 0x0000XY00 | 0xFFFFXYFF
    | 0x00XY0000 | 0xFFXYFFFF
    | 0xXY000000 | 0xXYFFFFFF
i64 | 0xABCDEFGH | 0xABCDEFGH (each letter represents a byte, and must be either FF or 00)
f32 | Any number that can be written as ±n × 2^(−r), where n and r are integers such that 16 ≤ n ≤ 31 and 0 ≤ r ≤ 7

Operations

Name | Effect | Description
vmov | Vd[] ← imm | Copy the immediate value to all elements of Vd.
vmvn | Vd[] ← ¬imm | Copy the one's complement of the immediate value to all elements of Vd.

Examples

Fig. 10.15 shows examples of these instructions.
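The replication of an immediate (or its one's complement) into every element can be sketched in Python. The function names and masking convention are modeling choices, not part of the instruction set.

```python
# Model of vmov/vmvn immediate: the immediate, or its one's
# complement, is replicated into every element of the vector.
def vmov_imm(imm, bits, nelems):
    return [imm & ((1 << bits) - 1)] * nelems

def vmvn_imm(imm, bits, nelems):
    mask = (1 << bits) - 1
    return [~imm & mask] * nelems   # one's complement within the element
```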

10.4.3 Change Size of Elements in a Vector

It is sometimes useful to increase or decrease the number of bits per element in a vector. NEON provides these instructions to convert a doubleword vector with elements of size y to a quadword vector with size 2y, or to perform the inverse operation:

vmovl Move and Lengthen,

vmovn Move and Narrow,

vqmovn Saturating Move and Narrow, and

vqmovun Saturating Move and Narrow Unsigned.

Syntax

 vmovl.<type> Qd, Dm

 v{q}movn.<type> Dd, Qm

 vqmovun.<type> Dd, Qm

 The valid choices for <type> are given in the following table:

OpcodeValid Types
vmovls8, s16, s32, u8, u16, or u32
vmovni8, i16, or i32
vqmovns8, s16, s32, u8, u16, or u32
vqmovuns8, s16, or s32

 q indicates that the results are saturated.

Operations

Name: vmovl
Effect:
  for 0 ≤ i < (64 ÷ <size>) do
    Qd[i] ← extend(Dm[i])
  end for
Description: Sign or zero extend (depending on <type>) each element of a doubleword vector to twice its length.

Name: v{q}movn
Effect:
  for 0 ≤ i < (64 ÷ <size>) do
    if q is present then
      Dd[i] ← sat(Qm[i])
    else
      Dd[i] ← narrow(Qm[i])
    end if
  end for
Description: Copy the least significant half of each element of a quadword vector to the corresponding element of a doubleword vector. If q is present, then the value is saturated.

Name: vqmovun
Effect:
  for 0 ≤ i < (64 ÷ <size>) do
    Dd[i] ← sat(Qm[i])
  end for
Description: Copy each element of the operand vector to the corresponding element of the destination vector. The destination element is unsigned, and the value is saturated.

Examples

Fig. 10.16 shows examples of these instructions.
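Saturating narrowing can be sketched in Python for one concrete case. This models vqmovn.s16 only (signed 16-bit elements narrowed to signed 8-bit); the function name is an invention for illustration.

```python
# Model of vqmovn.s16: each signed 16-bit element is saturated to
# the signed 8-bit range [-128, 127] and then narrowed.
def vqmovn_s16(q):
    return [max(-128, min(127, x)) for x in q]
```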

10.4.4 Duplicate Scalar

The duplicate instruction copies a scalar into every element of the destination vector. The scalar can be in a NEON register or an ARM integer register. The instruction is:

vdup Duplicate Scalar.

Syntax

 vdup.<size> Vd, Rm

 vdup.<size> Vd, Dm[x]

 <size> must be one of 8, 16 or 32.

 V can be d or q.

 Rm cannot be r15.

Operations

Name | Effect | Description
vdup.<size> Vd, Rm | Vd[] ← Rm | Copy the <size> least significant bits of Rm to all elements of Vd.
vdup.<size> Vd, Dm[x] | Vd[] ← Dm[x] | Copy element x of Dm to all elements of Vd.

Examples

Fig. 10.17 shows examples of this instruction.

10.4.5 Extract Elements

This instruction extracts 8-bit elements from two vectors and concatenates them. Fig. 10.4 gives an example of what this instruction does. The instruction is:

Figure 10.4 Example of vext.8 d12,d4,d9,#5.

vext Extract Elements.

Syntax

 vext.<size> Vd, Vn, Vm, #<imm>

 <size> must be one of 8, 16, 32, or 64.

 V can be d or q.

 <imm> is the number of elements to extract from the bottom of Vm. The remaining elements required to fill Vd are taken from the top of Vn.

Operation

Name: vext
Effect:
  if V is d then
    size ← 8
  else
    size ← 16
  end if
  for 0 ≤ i < imm do
    Vd[i + size − imm] ← Vm[i]
  end for
  for imm ≤ i < size do
    Vd[i − imm] ← Vn[i]
  end for
Description: Concatenate the bottom imm elements of the second operand onto the top elements of the first operand.

Examples

Fig. 10.18 shows examples of this instruction.
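The extract operation can be sketched in Python with list slicing, where index 0 is the bottom (least significant) element. A minimal model of the byte form; the function name is an invention for illustration.

```python
# Model of vext.8 Dd,Dn,Dm,#imm: the result is the top (8 - imm)
# bytes of Dn followed by the bottom imm bytes of Dm.
def vext8(dn, dm, imm):
    return dn[imm:] + dm[:imm]
```

With imm = 5, this reproduces the arrangement illustrated for `vext.8 d12,d4,d9,#5` in Fig. 10.4.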

10.4.6 Reverse Elements

This instruction reverses the order of data in a register:

vrev Reverse Elements.

One use of this instruction is for converting data from big-endian to little-endian order, or from little-endian to big-endian order. It could also be useful for swapping data and transforming matrices. Fig. 10.5 shows three examples.

Figure 10.5 Examples of the vrev instruction. (A) vrev16.8 d3,d4; (B) vrev32.16 d8,d9; (C) vrev32.8 d5,d7.

Syntax

 vrev<n>.<size> Vd, Vm

 <n> can be 16, 32, or 64.

 <size> is either 8, 16, or 32 and indicates the size of the elements to be reversed. <size> must be less than <n>.

 V can be q or d.

Operation

Name: vrev
Effect:
  n ← number of groups
  g ← number of elements in each group
  for 0 ≤ i < n do
    for 0 ≤ j < g do
      Vd[i×g + j] ← Vm[i×g + (g − j − 1)]
    end for
  end for
Description: Reverse the order of the <size>-bit elements within every group of <n> bits.

Examples

Fig. 10.19 shows examples of this instruction.
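The group-wise reversal can be sketched in Python, with g the number of elements per group (for example, vrev16.8 has g = 2 eight-bit elements per 16-bit group). The function name is an invention for illustration.

```python
# Model of vrev<n>.<size>: reverse the order of the elements
# within each group of g consecutive elements.
def vrev(vec, g):
    out = []
    for i in range(0, len(vec), g):      # one group at a time
        out.extend(reversed(vec[i:i + g]))
    return out
```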

10.4.7 Swap Vectors

This instruction simply swaps two NEON registers:

vswp Swap Vectors.

Syntax

 vswp{.<type>} Vd, Vm

 <type> can be any NEON data type. The assembler ignores the type, but it can be useful to the programmer as extra documentation.

 V can be q or d.

Operation

Name | Effect | Description
vswp | Vd ↔ Vm | Swap the contents of the two registers.

Examples

Fig. 10.20 shows examples of this instruction.

10.4.8 Transpose Matrix

This instruction transposes 2 × 2 matrices:

vtrn Transpose Matrix.

Fig. 10.6 shows two examples of this instruction. Larger matrices can be transposed using a divide-and-conquer approach.

Figure 10.6 Examples of the vtrn instruction. (A) vtrn.8 d14,d15; (B) vtrn.32 d31,d15.

Syntax

 vtrn.<size> Vd, Vm

 <size> is either 8, 16, or 32 and indicates the size of the elements in the matrix (or matrices).

 V can be q or d.

Operation

Name: vtrn
Effect:
  n ← number of elements
  for 0 ≤ i < n by 2 do
    tmp ← Vm[i]
    Vm[i] ← Vd[i+1]
    Vd[i+1] ← tmp
  end for
Description: Treat the two vectors as an array of 2 × 2 matrices and transpose each matrix.

Examples

Fig. 10.21 shows examples of this instruction.
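The transpose swap can be sketched in Python: treating the two vectors as the rows of 2 × 2 matrices, element i+1 of the first vector is exchanged with element i of the second for every even i. The function name is an invention for illustration.

```python
# Model of vtrn.<size> Vd,Vm: swap Vd[i+1] with Vm[i] for even i,
# transposing each 2x2 sub-matrix in place.
def vtrn(vd, vm):
    for i in range(0, len(vd), 2):
        vd[i + 1], vm[i] = vm[i], vd[i + 1]
    return vd, vm
```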

Fig. 10.7 shows how the vtrn instruction can be used to transpose a 3 × 3 matrix. Transposing a 4 × 4 matrix requires the transposition of 13 2 × 2 matrices. However, this instruction can operate on multiple 2 × 2 sub-matrices in parallel, and can group elements into different sized sub-matrices. There is also a very useful swap instruction that can exchange the rows of a matrix. Using the swap and transpose instructions, transposing a 4 × 4 matrix of 16-bit elements can be done with only four instructions, as shown in Fig. 10.8.

Figure 10.7 Transpose of a 3 × 3 matrix.
Figure 10.8 Transpose of a 4 × 4 matrix of 32-bit numbers.

10.4.9 Table Lookup

The table lookup instructions use indices held in one vector to lookup values from a table held in one or more other vectors. The resulting values are stored in the destination vector. The table lookup instructions are:

vtbl Table Lookup, and

vtbx Table Lookup with Extend.

Syntax

 v<op>.8 Dd, <list>, Dm

 <op> is one of tbl or tbx

 <list> specifies the list of registers. There are five list formats:

1. {Dn},

2. {Dn, D(n+1)},

3. {Dn, D(n+1), D(n+2)},

4. {Dn, D(n+1), D(n+2), D(n+3)}, or

5. {Qn, Q(n+1)}.

 Dm is the register holding the indices.

 The table can contain up to 32 bytes.

Operations

Name: vtbl
Effect:
  Minr ← first register in <list>
  Maxr ← last register in <list>
  for 0 ≤ i < 8 do
    r ← Minr + (Dm[i] ÷ 8)
    if r > Maxr then
      Dd[i] ← 0
    else
      e ← Dm[i] mod 8
      Dd[i] ← Dr[e]
    end if
  end for
Description: Use the indices in Dm to look up values in the table and store them in Dd. If an index is out of range, zero is stored in the corresponding destination element.

Name: vtbx
Effect:
  Minr ← first register in <list>
  Maxr ← last register in <list>
  for 0 ≤ i < 8 do
    r ← Minr + (Dm[i] ÷ 8)
    if r ≤ Maxr then
      e ← Dm[i] mod 8
      Dd[i] ← Dr[e]
    end if
  end for
Description: Use the indices in Dm to look up values in the table and store them in Dd. If an index is out of range, the corresponding destination element is unchanged.

Examples

Fig. 10.22 shows examples of these instructions.
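The lookup semantics can be sketched in Python with the table flattened into a single list of bytes. This models the vtbl/vtbx difference for out-of-range indices; the function name and arguments are inventions for illustration.

```python
# Model of vtbl/vtbx: each index in dm selects a byte from the
# table.  Out-of-range indices produce 0 (vtbl) or leave the old
# destination value unchanged (vtbx, when dd_old is supplied).
def table_lookup(table, dm, dd_old=None):
    out = []
    for i, idx in enumerate(dm):
        if idx < len(table):
            out.append(table[idx])
        elif dd_old is None:
            out.append(0)          # vtbl: out of range gives zero
        else:
            out.append(dd_old[i])  # vtbx: out of range keeps old value
    return out
```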

10.4.10 Zip or Unzip Vectors

These instructions are used to interleave or deinterleave the data from two vectors:

vzip Zip Vectors, and

vuzp Unzip Vectors.

Fig. 10.9 gives an example of the vzip instruction. The vuzp instruction performs the inverse operation.

Figure 10.9 Example of vzip.8 d9,d4.

Syntax

 v<op>.<size> Vd, Vm

 <op> is either zip or uzp.

 <size> is either 8, 16, or 32 and indicates the size of the elements in the matrix (or matrices).

 V can be q or d.

Operations

Name: vzip
Effect:
  n ← number of elements
  for 0 ≤ i < (n ÷ 2) do
    tmp1[2×i] ← Vm[i]
    tmp1[2×i + 1] ← Vd[i]
  end for
  for (n ÷ 2) ≤ i < n do
    tmp2[2×i − n] ← Vm[i]
    tmp2[2×i − n + 1] ← Vd[i]
  end for
  Vm ← tmp1
  Vd ← tmp2
Description: Interleave the data from two vectors. tmp1 and tmp2 are vectors of suitable size.

Name: vuzp
Effect:
  n ← number of elements
  for 0 ≤ i < (n ÷ 2) do
    tmp1[i] ← Vm[2×i]
    tmp2[i] ← Vm[2×i + 1]
  end for
  for (n ÷ 2) ≤ i < n do
    tmp1[i] ← Vd[2×i − n]
    tmp2[i] ← Vd[2×i − n + 1]
  end for
  Vm ← tmp1
  Vd ← tmp2
Description: De-interleave the data from two vectors. tmp1 and tmp2 are vectors of suitable size.

Examples

Fig. 10.23 shows examples of these instructions.
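The zip/unzip pair can be sketched in Python, with the operand ordering simplified relative to the register-level definition; the point is that one operation inverts the other. The function names are inventions for illustration.

```python
# Model of vzip/vuzp on two n-element vectors: vzip interleaves
# the elements, and vuzp separates them again.
def vzip(a, b):
    z = [x for pair in zip(a, b) for x in pair]   # interleave lanes
    n = len(a)
    return z[:n], z[n:]                           # low half, high half

def vuzp(a, b):
    z = a + b
    return z[0::2], z[1::2]                       # even lanes, odd lanes
```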

10.5 Data Conversion

When high precision is not required, the IEEE half-precision format can be used to store floating point numbers in memory. This can reduce memory requirements by up to 50%, and can also result in a significant performance improvement, since only half as much data needs to be moved between the CPU and main memory. However, on most processors half-precision data must be converted to single precision before it is used in calculations. NEON provides enhanced versions of the vcvt instruction which support conversion to and from IEEE half precision. There are also versions of vcvt which operate on vectors and perform integer or fixed-point to floating-point conversions.

10.5.1 Convert Between Fixed Point and Single-Precision

This instruction can be used to perform a data conversion between single precision and fixed point on each element in a vector:

vcvt Convert Data Format.

The elements in the vector must be 32-bit single precision floating point values or 32-bit integers. Fixed point (or integer) arithmetic operations can be up to twice as fast as floating point operations. In some cases it is much more efficient to convert to fixed point, perform the calculations, and then convert the results back to floating point.

Syntax

 vcvt{<cond>}.<type>.f32 Sd, Sm{, #<fbits>}

 vcvt{<cond>}.f32.<type> Sd, Sm{, #<fbits>}

 <cond> is an optional condition code.

 <type> must be either s32 or u32.

 The optional <fbits> operand specifies the number of fraction bits for a fixed point number, and must be between 0 and 32. If it is omitted, then it is assumed to be zero.

Operations

Name | Effect | Description
vcvt.s32.f32 | Fd[] ← fixed(Fm[]) | Convert single precision to 32-bit signed fixed point or integer.
vcvt.u32.f32 | Fd[] ← ufixed(Fm[]) | Convert single precision to 32-bit unsigned fixed point or integer.
vcvt.f32.s32 | Fd[] ← single(Fm[]) | Convert signed 32-bit fixed point or integer to single precision.
vcvt.f32.u32 | Fd[] ← single(Fm[]) | Convert unsigned 32-bit fixed point or integer to single precision.

Examples

Fig. 10.24 shows examples of this instruction.
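The fixed-point conversion can be sketched in Python, assuming truncation toward zero for the float-to-fixed direction (the rounding mode of the real instruction depends on processor configuration). The function names are inventions for illustration.

```python
# Model of vcvt between f32 and signed fixed point with fbits
# fraction bits: scale by 2**fbits to convert to fixed point,
# and divide to convert back.
def float_to_fixed(x, fbits):
    return int(x * (1 << fbits))   # int() truncates toward zero

def fixed_to_float(x, fbits):
    return x / (1 << fbits)
```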

10.5.2 Convert Between Half-Precision and Single-Precision

NEON systems with the half-precision extension provide the following instruction to perform conversion between single precision and half precision floating point formats:

vcvt Convert Between Half and Single.

Syntax

 vcvt<op>{<cond>}.f16.f32 Sd, Sm

 vcvt<op>{<cond>}.f32.f16 Sd, Sm

 The <op> must be either b or t and specifies whether the top or bottom half of the register should be used for the half-precision number.

 <cond> is an optional condition code.

Operations

Name | Effect | Description
vcvtb.f16.f32 | Sd ← half(Sm) | Convert single precision to half precision and store in the bottom half of the destination.
vcvtt.f16.f32 | Sd ← half(Sm) | Convert single precision to half precision and store in the top half of the destination.
vcvtb.f32.f16 | Sd ← single(Sm) | Convert the half precision number in the bottom half of the source to single precision.
vcvtt.f32.f16 | Sd ← single(Sm) | Convert the half precision number in the top half of the source to single precision.

Examples

Fig. 10.25 shows examples of this instruction.

10.6 Comparison Operations

NEON adds the ability to perform integer comparisons between vectors. Since there are multiple pairs of items to be compared, the comparison instructions set one element in a result vector for each pair of items. After the comparison operation, each element of the result vector will have every bit set to zero (for false) or one (for true). Note that if the elements of the result vector are interpreted as signed two’s-complement numbers, then the value 0 represents false and the value − 1 represents true.

10.6.1 Vector Compare

The following instructions perform comparisons of all of the corresponding elements of two vectors in parallel:

vceq Compare Equal,

vcge Compare Greater Than or Equal,

vcgt Compare Greater Than,

vcle Compare Less Than or Equal, and

vclt Compare Less Than.

The vector compare instructions compare each element of a vector with the corresponding element in a second vector, and set an element in the destination vector for each comparison. If the comparison is true, then all bits in the result element are set to one. Otherwise, all bits in the result element are set to zero. Note that summing the elements of the result vector (as signed integers) gives the two's complement of the number of comparisons that were true.

Note: vcle and vclt are actually pseudo-instructions. They are equivalent to vcge and vcgt, respectively, with the operands reversed.

Syntax

 vc<op>.<type> Vd, Vn, Vm

 vc<op>.<type> Vd, Vn, #0

 <op> must be one of eq, ge, gt, le, or lt.

 If <op> is eq, then <type> must be i8, i16, i32, or f32.

 If <op> is not eq and the third operand is #0, then <type> must be s8, s16, s32, or f32.

 If <op> is not eq and the third operand is a register, then <type> must be s8, s16, s32, u8, u16, u32, or f32.

 The result data type is determined from the following table:

Operand Type | Result Type
i32, s32, u32, or f32 | i32
i16, s16, or u16 | i16
i8, s8, or u8 | i8

 If the third operand is #0, then it is taken to be a vector of the correct size in which every element is zero.

 V can be d or q.

Operations

Name: vc<op>
Effect:
  for 0 ≤ i < vector_length do
    if Vn[i] <op> Vm[i] then
      Vd[i] ← 11…1
    else
      Vd[i] ← 00…0
    end if
  end for
Description: Compare each scalar in Vn to the corresponding scalar in Vm (or to zero). Set the corresponding scalar in Vd to all ones if <op> is true, and to all zeros otherwise.

Examples

Fig. 10.26 shows examples of these instructions.
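The all-ones/all-zeros result convention can be sketched in Python for one case. Using -1 for the true result mirrors the signed interpretation described above, so summing the result counts (negatively) how many comparisons were true; the function name is an invention for illustration.

```python
# Model of a vector compare such as vcgt: each element pair sets
# the result element to all ones (-1 as a signed value) or zero.
def vcgt(vn, vm):
    return [-1 if a > b else 0 for a, b in zip(vn, vm)]
```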

10.6.2 Vector Absolute Compare

The following instructions perform comparisons between the absolute values of all of the corresponding elements of two vectors in parallel:

vacgt Absolute Compare Greater Than, and

vacge Absolute Compare Greater Than or Equal.

The vector absolute compare instruction compares the absolute value of each element of a vector with the absolute value of the corresponding element in a second vector, and sets an element in the destination vector for each comparison. If the comparison is true, then all bits in the result element are set to one. Otherwise, all bits in the result element are set to zero. Note that summing the elements of the result vector (as signed integers) will give the two’s complement of the number of comparisons which were true.

Syntax

 vac<op>.f32 Vd, Vn, Vm

 <op> must be either ge or gt.

 V can be d or q.

 The operand element type must be f32.

 The result element type is i32.

Operations

Name: vac<op>
Effect:
  for 0 ≤ i < vector_length do
    if |Vn[i]| <op> |Vm[i]| then
      Vd[i] ← 11…1
    else
      Vd[i] ← 00…0
    end if
  end for
Description: Compare the absolute value of each scalar in Vn to the absolute value of the corresponding scalar in Vm. If the comparison is true, then set all bits in the corresponding scalar of Vd to one. Otherwise set all bits in the corresponding scalar of Vd to zero.

Examples

Fig. 10.27 shows examples of these instructions.

10.6.3 Vector Test Bits

NEON provides the following vector version of the ARM tst instruction:

vtst Test Bits.

The vector test bits instruction performs a logical AND operation between each element of a vector and the corresponding element in a second vector. If the result is not zero, then every bit in the corresponding element of the result vector is set to one. Otherwise, every bit in the corresponding element of the result vector is set to zero.

Syntax

 vtst.<size> Vd, Vn, Vm

 V can be d or q.

 <size> must be one of 8, 16 or 32

 The result element type is defined by the following table:

<size>Result Type
32i32
16i16
8i8

Operations

Name: vtst
Effect:
  for 0 ≤ i < vector_length do
    if (Vn[i] ∧ Vm[i]) ≠ 0 then
      Vd[i] ← 11…1
    else
      Vd[i] ← 00…0
    end if
  end for
Description: Perform a logical AND between each scalar in Vn and the corresponding scalar in Vm. Set the corresponding scalar in Vd to all ones if the result is not zero, and to all zeros otherwise.

Examples

Fig. 10.28 shows examples of this instruction.

10.7 Bitwise Logical Operations

NEON adds the ability to perform integer and bitwise logical operations on the VFP register set. Recall that integer operations can also be used on fixed-point data. These operations add a great deal of power to the ARM processor.

10.7.1 Bitwise Logical Operations

NEON includes vector versions of the following five basic logical operations:

vand Bitwise AND,

veor Bitwise Exclusive-OR,

vorr Bitwise OR,

vorn Bitwise Complement and OR, and

vbic Bit Clear.

All of them involve two source operands and a destination register.

Syntax

 v<op>{.<type>} Vd, Vn, Vm

 <op> must be one of and, eor, orr, orn, or bic.

 V must be either q or d.

 <type> must be i8, i16, i32, or i64. For these bitwise logical operations, the type does not affect the result.

Operations

Name | Effect | Description
vand | Vd ← Vn ∧ Vm | Logical AND
veor | Vd ← Vn ⊕ Vm | Exclusive OR
vorr | Vd ← Vn ∨ Vm | Logical OR
vorn | Vd ← Vn ∨ ¬Vm | Logical OR with the complement of the second operand
vbic | Vd ← Vn ∧ ¬Vm | Bit Clear (AND with the complement of the second operand)

Examples

Fig. 10.29 shows examples of these instructions.

10.7.2 Bitwise Logical Operations with Immediate Data

It is often useful to clear and/or set specific bits in a register. The NEON instruction set provides the following vector versions of the logical OR and bit clear instructions:

vorr Bitwise OR Immediate, and

vbic Bit Clear Immediate.

Syntax

 v<op>.<type> Vd, #<imm>

 <op> must be either orr, or bic.

 V must be either q or d to specify whether the operation involves quadwords or doublewords.

 <type> must be i16 or i32.

 <imm> is a 16-bit or 32-bit immediate value, which is interpreted as a pattern for filling the immediate operand. The following table shows acceptable patterns for <imm>, based on what was chosen for <type>:

<type> | Acceptable patterns for <imm>
i16 | 0x00XY or 0xXY00
i32 | 0x000000XY, 0x0000XY00, 0x00XY0000, or 0xXY000000

Operations

Name | Effect | Description
vorr | Vd ← Vd ∨ (imm:…:imm) | Logical OR with the immediate pattern replicated to fill the register.
vbic | Vd ← Vd ∧ ¬(imm:…:imm) | Bit Clear with the immediate pattern replicated to fill the register.

Examples

Fig. 10.30 shows examples of these instructions.

10.7.3 Bitwise Insertion and Selection

NEON provides three instructions which can be used to combine the bits in two registers or to extract specific bits from a register, according to a pattern:

vbit Bitwise Insert,

vbif Bitwise Insert if False, and

vbsl Bitwise Select.

Syntax

 v<op>{.<type>} Vd, Vn, Vm

 <op> can be bif, bit, or bsl.

 V can be d or q.

 The <type> must be i8, i16, i32, or i64, and specifies the size of items in the vectors. Note that for these bitwise logical operations the type does not matter, so the assembler ignores it. However, it can be useful to the programmer as extra documentation.

Operations

Name  Effect  Description
vbit  Fd ← (Fd ∧ ¬Fm) ∨ (Fn ∧ Fm)  Insert each bit from the first operand into the destination if the corresponding bit of the second operand is 1
vbif  Fd ← (Fd ∧ Fm) ∨ (Fn ∧ ¬Fm)  Insert each bit from the first operand into the destination if the corresponding bit of the second operand is 0
vbsl  Fd ← (Fd ∧ Fn) ∨ (¬Fd ∧ Fm)  Select each bit for the destination from the first operand if the corresponding bit of the destination is 1, or from the second operand if the corresponding bit of the destination is 0

Examples

f10-31-9780128036983
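The three merge patterns can be sketched on a single 8-bit lane in C (invented helper names, not ARM syntax), which makes the role of each operand explicit:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar models of the NEON insertion/selection instructions. */
static uint8_t vbit8(uint8_t d, uint8_t n, uint8_t m) {
    return (uint8_t)((d & ~m) | (n & m));   /* take bits of n where m is 1 */
}
static uint8_t vbif8(uint8_t d, uint8_t n, uint8_t m) {
    return (uint8_t)((d & m) | (n & ~m));   /* take bits of n where m is 0 */
}
static uint8_t vbsl8(uint8_t d, uint8_t n, uint8_t m) {
    return (uint8_t)((d & n) | (~d & m));   /* d selects between n and m   */
}
```

Note that the three instructions are the same multiplexer with the selector wired to a different register.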

10.8 Shift Instructions

The NEON shift instructions operate on vectors. Shifts are often used for multiplication and division by powers of two. The result of a left shift may be too large to fit in the destination element, resulting in overflow, while a right shift is equivalent to division by a power of two, with the result truncated. In some cases, it is preferable to round the result of a division rather than truncate it. NEON therefore provides versions of the shift instructions which perform saturation and/or rounding of the result.

10.8.1 Shift Left by Immediate

These instructions shift each element in a vector left by an immediate value:

vshl Shift Left Immediate,

vqshl Saturating Shift Left Immediate,

vqshlu Saturating Shift Left Immediate Unsigned, and

vshll Shift Left Immediate Long.

Overflow conditions can be avoided by using the saturating version, or by using the long version, in which case the destination is twice the size of the source.

Syntax

 vshl.<type> Vd, Vm, #<imm>

 vqshl{u}.<type> Vd, Vm, #<imm>

 vshll.<type> Qd, Dm, #<imm>

 If u is present, then the results are unsigned.

 The valid choices for <type> are given in the following table:

Opcode  Valid Types
vshl    i8, i16, i32, i64, s8, s16, or s32
vqshl   s8, s16, s32, s64, u8, u16, u32, or u64
vqshlu  s8, s16, s32, or s64
vshll   u8, u16, u32, u64, s8, s16, or s32

Operations

Name  Effect  Description
vshl

Vd[] ← Vm[] ≪ imm

Each element of Vm is shifted left by the immediate value and stored in the corresponding element of Vd. Bits shifted past the end of an element are lost.
vshll

Qd[] ← Dm[] ≪ imm

Each element of Dm is shifted left by the immediate value and stored in the corresponding element of Qd. The values are sign or zero extended, depending on <type>.
vqshl{u}

Vd[] ← saturate(Vm[] ≪ imm)

Each element of Vm is shifted left by the immediate value and stored in the corresponding element of Vd. If the result of the shift is outside the range of the destination element, then the value is saturated. If u was specified, then the destination is unsigned. Otherwise, it is signed.

Examples

f10-34-9780128036983
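The saturating behavior can be sketched on one signed 8-bit lane in C (a model, not the book's code; the helper name is invented): the shift is performed at a wider width so that no bits are lost, and the result is then clamped into the range of the destination element.

```c
#include <assert.h>
#include <stdint.h>

/* Model of vqshl.s8 on a single lane. */
static int8_t qshl_s8(int8_t x, unsigned imm) {
    int32_t wide = (int32_t)x * (1 << imm);  /* left shift = multiply by 2^imm */
    if (wide > INT8_MAX) return INT8_MAX;    /* saturate high */
    if (wide < INT8_MIN) return INT8_MIN;    /* saturate low  */
    return (int8_t)wide;
}
```

For example, shifting 64 left by one would produce 128, which does not fit in a signed byte, so the result saturates at 127.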

10.8.2 Shift Left or Right by Variable

These instructions shift each element in a vector, using the least significant byte of the corresponding element of a second vector as the shift amount:

vshl Shift Left or Right by Variable,

vrshl Shift Left or Right by Variable and Round,

vqshl Saturating Shift Left or Right by Variable, and

vqrshl Saturating Shift Left or Right by Variable and Round.

If the shift value is positive, the operation is a left shift. If the shift value is negative, then it is a right shift. A shift value of zero is equivalent to a move. If the operation is a right shift, and r is specified, then the result is rounded rather than truncated. Results are saturated if q is specified.

Syntax

 v{q}{r}shl.<type> Vd, Vn, Vm

 If q is present, then the results are saturated.

 If r is present, then right shifted values are rounded rather than truncated.

 V can be d or q.

 <type> must be one of s8, s16, s32, s64, u8, u16, u32, or u64.

Operations

Name  Effect  Description
v{q}{r}shl

if q is present then

 if r is present then

 Vd[] ← saturate(round(Vn[] ≪ Vm[]))

 else

 Vd[] ← saturate(Vn[] ≪ Vm[])

 end if

else

 if r is present then

 Vd[] ← round(Vn[] ≪ Vm[])

 else

 Vd[] ← Vn[] ≪ Vm[]

 end if

end if

Each element of Vn is shifted by the value in the least significant byte of the corresponding element of Vm and stored in the corresponding element of Vd. A negative shift value shifts right. Bits shifted past the end of an element are lost.

Examples

f10-35-9780128036983
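A one-lane C model (a sketch with an invented helper name, assuming unsigned elements) shows how the sign of the shift count selects the direction, and how a rounded right shift works by adding half of the discarded weight before shifting:

```c
#include <assert.h>
#include <stdint.h>

/* Model of vrshl.u8 on one lane: a signed shift count. */
static uint8_t rshl_u8(uint8_t x, int shift) {
    if (shift >= 0)
        return (uint8_t)(x << shift);       /* left shift; excess bits are lost */
    unsigned s = (unsigned)(-shift);
    /* rounded right shift: add half the LSB weight, then shift */
    return (uint8_t)(((unsigned)x + (1u << (s - 1))) >> s);
}
```

For example, 7 shifted right by one with rounding gives 4 (7/2 = 3.5 rounds up), where truncation would give 3.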

10.8.3 Shift Right by Immediate

These instructions shift each element in a vector right by an immediate value:

vshr Shift Right Immediate,

vrshr Shift Right Immediate and Round,

vshrn Shift Right Immediate and Narrow,

vrshrn Shift Right Immediate Round and Narrow,

vsra Shift Right and Accumulate Immediate, and

vrsra Shift Right Round and Accumulate Immediate.

Syntax

 v{r}shr{<cond>}.<type> Vd, Vm, #<imm>

 v{r}shrn{<cond>}.<type> Vd, Vm, #<imm>

 v{r}sra{<cond>}.<type> Vd, Vm, #<imm>

 V can be d or q.

 If r is present, then right shifted values are rounded rather than truncated.

 <cond> is an optional condition code.

 The valid choices for <type> are given in the following table:

Opcode    Valid Types
v{r}shr   u8, u16, u32, u64, s8, s16, s32, or s64
v{r}shrn  i16, i32, or i64
v{r}sra   u8, u16, u32, u64, s8, s16, s32, or s64

Operations

Name  Effect  Description
v{r}shr

if r is present then

 Vd[] ← round(Vm[] ≫ imm)

else

 Vd[] ← Vm[] ≫ imm

end if

Each element of Vm is shifted right with sign or zero extension by the immediate value and stored in the corresponding element of Vd. Results can optionally be rounded.
v{r}shrn

if r is present then

 Dd[] ← round(Qm[] ≫ imm)

else

 Dd[] ← Qm[] ≫ imm

end if

Each element of Qm is shifted right by the immediate value, optionally rounded, then narrowed and stored in the corresponding element of Dd.
v{r}sra

if r is present then

 Vd[] ← Vd[] + round(Vm[] ≫ imm)

else

 Vd[] ← Vd[] + (Vm[] ≫ imm)

end if

Each element of Vm is shifted right with sign or zero extension by the immediate value and accumulated into the corresponding element of Vd. Results can optionally be rounded.

Examples

f10-36-9780128036983
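The round-then-narrow behavior of vrshrn can be sketched on one 16-bit lane in C (an invented helper, not ARM syntax). Note that these instructions do not saturate; narrowing simply keeps the low bits, as the last test value shows.

```c
#include <assert.h>
#include <stdint.h>

/* Model of vrshrn.i16 on one lane: shift right with rounding, then
   narrow to 8 bits by truncation. */
static uint8_t rshrn_16(uint16_t x, unsigned imm) {
    uint32_t rounded = ((uint32_t)x + (1u << (imm - 1))) >> imm;  /* round */
    return (uint8_t)rounded;                                      /* keep the low 8 bits */
}
```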

10.8.4 Saturating Shift Right by Immediate

These instructions shift each element in a quad word vector right by an immediate value:

vqshrn Saturating Shift Right Immediate,

vqrshrn Saturating Shift Right Immediate Round,

vqshrun Saturating Shift Right Immediate Unsigned, and

vqrshrun Saturating Shift Right Immediate Round Unsigned.

The result is optionally rounded, then saturated, narrowed, and stored in a double word vector.

Syntax

 vq{r}shr{u}n.<type> Dd, Qm, #<imm>

 If r is present, then right shifted values are rounded rather than truncated.

 If u is present, then the results are unsigned, regardless of the type of elements in Qm.

 The valid choices for <type> are given in the following table:

Opcode      Valid Types
vq{r}shrn   u16, u32, u64, s16, s32, or s64
vq{r}shrun  s16, s32, or s64

 <imm> is the amount that elements are to be shifted, and must be between zero and one less than the number of bits in <type>.

Operations

Name  Effect  Description
vq{r}shrn

if r is present then

 Dd[] ← saturate(round(Qm[] ≫ imm))

else

 Dd[] ← saturate(Qm[] ≫ imm)

end if

Each element of Qm is shifted right with sign extension by the immediate value, optionally rounded, then saturated and narrowed, and stored in the corresponding element of Dd.
vq{r}shrun

if r is present then

 Dd[] ← saturate(round(Qm[] ≫ imm))

else

 Dd[] ← saturate(Qm[] ≫ imm)

end if

Each element of Qm is shifted right by the immediate value, optionally rounded, then saturated to the unsigned range and narrowed, and stored in the corresponding element of Dd.

Examples

f10-37-9780128036983
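The difference from the plain narrowing shift is the clamp. A one-lane C sketch (invented helper name; the arithmetic right shift of a signed value behaves as on mainstream compilers) of vqshrn.s16:

```c
#include <assert.h>
#include <stdint.h>

/* Model of vqshrn.s16 on one lane: arithmetic shift right, then
   saturate into the signed 8-bit range while narrowing. */
static int8_t qshrn_s16(int16_t x, unsigned imm) {
    int32_t v = (int32_t)x >> imm;       /* arithmetic shift, sign preserved */
    if (v > INT8_MAX) return INT8_MAX;   /* clamp instead of discarding high bits */
    if (v < INT8_MIN) return INT8_MIN;
    return (int8_t)v;
}
```

Where vshrn would silently drop the high bits of 0x123, the saturating form pins the result at 127.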

10.8.5 Shift and Insert

These instructions perform bitwise shifting of each element in a vector, then combine the results with the contents of the destination register:

vsli Shift Left and Insert,

vsri Shift Right and Insert.

Fig. 10.10 provides an example.

f10-10-9780128036983
Figure 10.10 Effects of vsli.32 d4,d9,#6.

Syntax

 vs<dir>i.<size> Vd, Vm, #<imm>

 <dir> must be l for a left shift, or r for a right shift.

 <size> must be 8, 16, 32, or 64.

 <imm> is the amount that elements are to be shifted, and must be between zero and <size> − 1 for vsli, or between one and <size> for vsri.

Operations

Name  Effect  Description
vsli

mask ← (1 ≪ imm) − 1

Vd[] ← (mask ∧ Vd[]) ∨ (Vm[] ≪ imm)

Each element of Vm is shifted left and combined with the lower <imm> bits of the corresponding element of Vd.
vsri

mask ← ¬((1 ≪ (size − imm)) − 1)

Vd[] ← (mask ∧ Vd[]) ∨ (Vm[] ≫ imm)

Each element of Vm is shifted right and combined with the upper <imm> bits of the corresponding element of Vd.

Examples

f10-38-9780128036983
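The masking can be sketched on one 8-bit lane in C (invented helper names, assuming unsigned lanes): the shifted element is merged with the bits of the destination that the shift leaves untouched.

```c
#include <assert.h>
#include <stdint.h>

/* Model of vsli on an 8-bit lane: the low imm bits of d survive. */
static uint8_t sli8(uint8_t d, uint8_t m, unsigned imm) {
    uint8_t mask = (uint8_t)((1u << imm) - 1);
    return (uint8_t)((d & mask) | (uint8_t)(m << imm));
}

/* Model of vsri on an 8-bit lane: the high imm bits of d survive. */
static uint8_t sri8(uint8_t d, uint8_t m, unsigned imm) {
    uint8_t mask = (uint8_t)~((1u << (8 - imm)) - 1);
    return (uint8_t)((d & mask) | (m >> imm));
}
```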

10.9 Arithmetic Instructions

NEON provides several instructions for addition, subtraction, and multiplication, but does not provide a divide instruction. Whenever possible, division should be performed by multiplying by the reciprocal. When dividing by constants, the reciprocal can be calculated in advance, as shown in Chapter 8. For dividing by variables, NEON provides instructions for quickly calculating the reciprocals of all elements in a vector. In most cases, this is faster than using a divide instruction. When division is absolutely unavoidable, the VFP divide instructions can be used.

10.9.1 Vector Add and Subtract

The following eight instructions perform vector addition and subtraction:

vadd Add

vqadd Saturating Add

vaddl Add Long

vaddw Add Wide

vsub Subtract

vqsub Saturating Subtract

vsubl Subtract Long

vsubw Subtract Wide

The Vector Add (vadd) instruction adds corresponding elements in two vectors and stores the results in the corresponding elements of the destination register. The Vector Subtract (vsub) instruction subtracts elements in one vector from corresponding elements in another vector and stores the results in the corresponding elements of the destination register. Other versions allow mismatched operand and destination sizes, and the saturating versions prevent overflow by limiting the range of the results.

Syntax

 v{q}<op>.<type> Vd, Vn, Vm

 v<op>l.<type> Qd, Dn, Dm

 v<op>w.<type> Qd, Qn, Dm

 <op> is either add or sub.

 The valid choices for <type> are given in the following table:

Opcode  Valid Types
v<op>   i8, i16, i32, i64, or f32
vq<op>  s8, s16, s32, s64, u8, u16, u32, or u64
v<op>l  s8, s16, s32, u8, u16, or u32
v<op>w  s8, s16, s32, u8, u16, or u32

Operations

Name  Effect  Description
v<op>

Vd[] ← Vn[] <op> Vm[]

The operation is applied to corresponding elements of Vn and Vm. The results are stored in the corresponding elements of Vd.
vq<op>

Vd[] ← saturate(Vn[] <op> Vm[])

The operation is applied to corresponding elements of Vn and Vm. The results are saturated, then stored in the corresponding elements of Vd.
v<op>l

Qd[] ← Dn[] <op> Dm[]

The elements of Dn and Dm are sign or zero extended, then the operation is applied to corresponding elements. The results are stored in the corresponding elements of Qd.
v<op>w

Qd[] ← Qn[] <op> Dm[]

The elements of Dm are sign or zero extended, then the operation is applied with corresponding elements of Qn. The results are stored in the corresponding elements of Qd.

Examples

f10-39-9780128036983
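The saturating form is the interesting one, since plain vadd simply wraps. A one-lane C sketch of vqadd.s8 (invented helper name):

```c
#include <assert.h>
#include <stdint.h>

/* Model of vqadd.s8 on one lane: add at a wider width, then clamp. */
static int8_t qadd_s8(int8_t a, int8_t b) {
    int sum = (int)a + (int)b;           /* int arithmetic cannot overflow here */
    if (sum > INT8_MAX) return INT8_MAX; /* saturate instead of wrapping */
    if (sum < INT8_MIN) return INT8_MIN;
    return (int8_t)sum;
}
```

With wrapping addition, 100 + 100 in a signed byte would come out negative; the saturating form pins it at 127 instead.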

10.9.2 Vector Add and Subtract with Narrowing

These instructions add or subtract the corresponding elements of two vectors, and narrow by taking the most significant half of the result:

vaddhn Add and Narrow

vraddhn Add, Round, and Narrow

vsubhn Subtract and Narrow

vrsubhn Subtract, Round, and Narrow

The results are stored in the corresponding elements of the destination register. Results can be optionally rounded instead of truncated.

Syntax

 v{r}<op>hn.<type> Dd, Qn, Qm

 <op> is either add or sub.

 If <r> is specified, then the result is rounded instead of truncated.

 <type> must be either i16, i32, or i64.

Operations

Name  Effect  Description
v{r}<op>hn

shift ← size ÷ 2

if r is present then

 Dd[] ← round((Qn[] <op> Qm[]) ≫ shift)

else

 Dd[] ← (Qn[] <op> Qm[]) ≫ shift

end if

The operation is applied to corresponding elements of Qn and Qm. The results are optionally rounded, then narrowed by taking the most significant half, and stored in the corresponding elements of Dd.

Examples

f10-40-9780128036983

10.9.3 Add or Subtract and Divide by Two

These instructions add or subtract corresponding elements from two vectors then shift the result right by one bit:

vhadd Halving Add

vrhadd Halving Add and Round

vhsub Halving Subtract

The results are stored in corresponding elements of the destination vector. If the operation is addition, then the results can be optionally rounded.

Syntax

 v{r}hadd.<type> Vd, Vn, Vm

 vhsub.<type> Vd, Vn, Vm

 If <r> is specified, then the result is rounded instead of truncated.

 <type> must be either s8, s16, s32, u8, u16, or u32.

Operations

Name  Effect  Description
v{r}hadd

if r is present then

 Vd[] ← (Vn[] + Vm[] + 1) ≫ 1

else

 Vd[] ← (Vn[] + Vm[]) ≫ 1

end if

The corresponding elements of Vn and Vm are added together, optionally rounded, then shifted right one bit. Results are stored in the corresponding elements of Vd.
vhsub

Vd[] ← (Vn[] − Vm[]) ≫ 1

The elements of Vm are subtracted from the corresponding elements of Vn. Results are shifted right one bit and stored in the corresponding elements of Vd.

Examples

f10-41-9780128036983
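The halving add is a common way to average two values without overflow. A one-lane C sketch (invented helper names) computes the sum at a wider width, so the carry bit survives the halving:

```c
#include <assert.h>
#include <stdint.h>

/* Models of vhadd.u8 and vrhadd.u8 on one lane. */
static uint8_t hadd_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(((unsigned)a + b) >> 1);      /* truncated average */
}
static uint8_t rhadd_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(((unsigned)a + b + 1) >> 1);  /* rounded average */
}
```

Note that 250 and 250 average correctly even though their sum does not fit in a byte.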

10.9.4 Add Elements Pairwise

These instructions add vector elements pairwise:

vpadd Add Pairwise

vpaddl Add Pairwise Long

vpadal Add Pairwise and Accumulate Long

The long versions can be used to prevent overflow.

Syntax

 vpadd.<type> Dd, Dn, Dm

 vp<op>l.<type> Vd, Vm

 <op> must be either add or ada.

 The valid choices for <type> are given in the following table:

Opcode   Valid Types
vpadd    i8, i16, i32, or f32
vp<op>l  s8, s16, s32, u8, u16, or u32

Operations

Name  Effect  Description
vpadd

n ← # of elements

for 0 ≤ i < (n ÷ 2) do

 Dd[i] ← Dn[2i] + Dn[2i + 1]

end for

for (n ÷ 2) ≤ i < n do

 j ← i − (n ÷ 2)

 Dd[i] ← Dm[2j] + Dm[2j + 1]

end for

Add the elements of two vectors pairwise and store the results in another vector.
vpaddl

n ← # of elements

for 0 ≤ i < (n ÷ 2) do

 Vd[i] ← Vm[2i] + Vm[2i + 1]

end for

Add the elements of a vector pairwise, widen, and store the results in another vector.
vpadal

n ← # of elements

for 0 ≤ i < (n ÷ 2) do

 Vd[i] ← Vd[i] + Vm[2i] + Vm[2i + 1]

end for

Add the elements of a vector pairwise, widen, and accumulate the results in another vector.

Examples

f10-43-9780128036983
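The widening step of vpaddl is what prevents overflow. One pair can be sketched in C (invented helper; two signed bytes are passed packed into a 16-bit value purely for convenience):

```c
#include <assert.h>
#include <stdint.h>

/* Model of one vpaddl.s8 pair: two adjacent signed bytes are sign
   extended and summed into one 16-bit lane. */
static int16_t paddl_pair_s8(uint16_t packed) {
    int8_t lo = (int8_t)(packed & 0xFF);   /* lower-numbered lane */
    int8_t hi = (int8_t)(packed >> 8);     /* adjacent lane */
    return (int16_t)((int16_t)lo + (int16_t)hi);  /* widened sum cannot overflow */
}
```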

10.9.5 Absolute Difference

These instructions subtract the elements of one vector from another and store or accumulate the absolute value of the results:

vaba Absolute Difference and Accumulate

vabal Absolute Difference and Accumulate Long

vabd Absolute Difference

vabdl Absolute Difference Long

The long versions can be used to prevent overflow.

Syntax

v<op>.<type> Vd, Vn, Vm

v<op>l.<type> Qd, Dn, Dm

 <op> is either aba or abd.

 The valid choices for <type> are given in the following table:

OpcodeValid Types
vabds8, s16, s32, u8, u16, u32, or f32
vabas8, s16, s32, u8, u16, or u32
vabdls8, s16, s32, u8, u16, or u32
vabals8, s16, s32, u8, u16, or u32

Operations

Name  Effect  Description
vabd

Vd[] ← |Vn[] − Vm[]|

Subtract corresponding elements and take the absolute value
vaba

Vd[] ← Vd[] + |Vn[] − Vm[]|

Subtract corresponding elements and take the absolute value. Accumulate the results
vabdl

Qd[] ← |Dn[] − Dm[]|

Extend and subtract corresponding elements, then take the absolute value
vabal

Qd[] ← Qd[] + |Dn[] − Dm[]|

Extend and subtract corresponding elements, then take the absolute value. Accumulate the results

Examples

f10-45-9780128036983
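A one-lane C sketch of the unsigned forms (invented helper names). Subtracting the smaller value from the larger avoids the wraparound that a naive a − b would produce for unsigned operands:

```c
#include <assert.h>
#include <stdint.h>

/* Models of vabd.u8 and vaba.u8 on one lane. */
static uint8_t abd_u8(uint8_t a, uint8_t b) {
    return a > b ? (uint8_t)(a - b) : (uint8_t)(b - a);  /* |a - b| */
}
static uint8_t aba_u8(uint8_t acc, uint8_t a, uint8_t b) {
    return (uint8_t)(acc + abd_u8(a, b));  /* accumulation wraps; vabal avoids this */
}
```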

10.9.6 Absolute Value and Negate

These operations compute the absolute value or negate each element in a vector:

vabs Absolute Value

vneg Negate

vqabs Saturating Absolute Value

vqneg Saturating Negate

The saturating versions can be used to prevent overflow.

Syntax

 v{q}<op>.<type> Vd, Vm

 If q is present then results are saturated.

 <op> is either abs or neg.

 The valid choices for <type> are given in the following table:

Opcode  Valid Types
vabs    s8, s16, s32, or f32
vneg    s8, s16, s32, or f32
vqabs   s8, s16, or s32
vqneg   s8, s16, or s32

Operations

Name  Effect  Description
v{q}abs

if q is present then

 Vd[] ← saturate(|Vm[]|)

else

 Vd[] ← |Vm[]|

end if

Copy the absolute value of each element of Vm to the corresponding element of Vd, optionally saturating the result
v{q}neg

if q is present then

 Vd[] ← saturate(−Vm[])

else

 Vd[] ← −Vm[]

end if

Copy the negated value of each element of Vm to the corresponding element of Vd, optionally saturating the result

Examples

f10-46-9780128036983

10.9.7 Get Maximum or Minimum Elements

The following four instructions select the maximum or minimum elements and store the results in the destination vector:

vmax Maximum

vmin Minimum

vpmax Pairwise Maximum

vpmin Pairwise Minimum

Syntax

 v<op>.<type> Vd, Vn, Vm

 vp<op>.<type> Dd, Dn, Dm

 <op> is either max or min.

 <type> must be one of s8, s16, s32, u8, u16, u32, or f32.

Operations

Name  Effect  Description
vmax

n ← # of elements

for 0 ≤ i < n do

 if Vn[i] > Vm[i] then

 Vd[i] ← Vn[i]

 else

 Vd[i] ← Vm[i]

 end if

end for

Compare corresponding elements and copy the greater of each pair into the corresponding element in the destination vector
vpmax

n ← # of elements

for 0 ≤ i < (n ÷ 2) do

 if Dn[2i] > Dn[2i + 1] then

 Dd[i] ← Dn[2i]

 else

 Dd[i] ← Dn[2i + 1]

 end if

end for

for 0 ≤ i < (n ÷ 2) do

 if Dm[2i] > Dm[2i + 1] then

 Dd[i + (n ÷ 2)] ← Dm[2i]

 else

 Dd[i + (n ÷ 2)] ← Dm[2i + 1]

 end if

end for

Compare elements pairwise and copy the greater of each pair into an element of the destination vector
vmin

n ← # of elements

for 0 ≤ i < n do

 if Vn[i] < Vm[i] then

 Vd[i] ← Vn[i]

 else

 Vd[i] ← Vm[i]

 end if

end for

Compare corresponding elements and copy the lesser of each pair into the corresponding element in the destination vector
vpmin

n ← # of elements

for 0 ≤ i < (n ÷ 2) do

 if Dn[2i] < Dn[2i + 1] then

 Dd[i] ← Dn[2i]

 else

 Dd[i] ← Dn[2i + 1]

 end if

end for

for 0 ≤ i < (n ÷ 2) do

 if Dm[2i] < Dm[2i + 1] then

 Dd[i + (n ÷ 2)] ← Dm[2i]

 else

 Dd[i + (n ÷ 2)] ← Dm[2i + 1]

 end if

end for

Compare elements pairwise and copy the lesser of each pair into an element of the destination vector

Examples

f10-47-9780128036983

10.9.8 Count Bits

These instructions can be used to count leading sign bits or zeros, or to count the number of bits that are set for each element in a vector:

vcls Count Leading Sign Bits

vclz Count Leading Zero Bits

vcnt Count Set Bits

Syntax

 v<op>.<type> Vd, Vm

 <op> is either cls, clz or cnt.

 The valid choices for <type> are given in the following table:

Opcode  Valid Types
vcls    s8, s16, or s32
vclz    u8, u16, or u32
vcnt    i8

Operations

Name  Effect  Description
vcls

n ← # of elements

for 0 ≤ i < n do

 Vd[i] ← leading_sign_bits(Vm[i])

end for

Count the number of consecutive bits that are the same as the sign bit for each element in Vm, and store the counts in the corresponding elements of Vd
vclz

n ← # of elements

for 0 ≤ i < n do

 Vd[i] ← leading_zero_bits(Vm[i])

end for

Count the number of leading zero bits for each element in Vm, and store the counts in the corresponding elements of Vd
vcnt

n ← # of elements

for 0 ≤ i < n do

 Vd[i] ← count_one_bits(Vm[i])

end for

Count the number of bits in Vm that are set to one, and store the counts in the corresponding elements of Vd

Examples

f10-48-9780128036983
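The three counts can be sketched on one 8-bit lane in C (invented helper names; vcls counts the bits after the sign bit that match it, so it never counts the sign bit itself):

```c
#include <assert.h>
#include <stdint.h>

static int clz8(uint8_t x) {                 /* vclz: leading zero bits */
    int n = 0;
    for (int bit = 7; bit >= 0 && !((x >> bit) & 1); bit--) n++;
    return n;
}

static int cls8(int8_t x) {                  /* vcls: bits below bit 7 that equal the sign */
    uint8_t u = (uint8_t)x;
    int sign = (u >> 7) & 1, n = 0;
    for (int bit = 6; bit >= 0 && (int)((u >> bit) & 1) == sign; bit--) n++;
    return n;
}

static int cnt8(uint8_t x) {                 /* vcnt: population count */
    int n = 0;
    for (; x != 0; x &= (uint8_t)(x - 1)) n++;  /* clear the lowest set bit each pass */
    return n;
}
```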

10.10 Multiplication and Division

There is no vector divide instruction in NEON. Division is accomplished with multiplication by the reciprocals of the divisors. The reciprocals are found by making an initial estimate, then using the Newton-Raphson method to improve the approximation. This can actually be faster than using a hardware divider. NEON supports single precision floating point and unsigned fixed point reciprocal calculation. Fixed point reciprocals provide higher precision. Division using the NEON reciprocal method may not provide the best precision possible. If the best possible precision is required, then the VFP divide instruction should be used.

10.10.1 Multiply

These instructions are used to multiply the corresponding elements from two vectors:

vmul Multiply

vmla Multiply Accumulate

vmls Multiply Subtract

vmull Multiply Long

vmlal Multiply Accumulate Long

vmlsl Multiply Subtract Long

The long versions can be used to avoid overflow.

Syntax

 v<op>.<type> Vd, Vn, Vm

 v<op>l.<type> Qd, Dn, Dm

 <op> is either mul, mla, or mls.

 The valid choices for <type> are given in the following table:

Opcode  Valid Types
vmul    p8, i8, i16, or i32
vmla    i8, i16, or i32
vmls    i8, i16, or i32
vmull   p8, s8, s16, s32, u8, u16, or u32
vmlal   s8, s16, s32, u8, u16, or u32
vmlsl   s8, s16, s32, u8, u16, or u32

Operations

Name  Effect  Description
vmul

Vd[] ← Vn[] × Vm[]

Multiply corresponding elements from two vectors and store the results in a third vector
vmla

Vd[] ← Vd[] + (Vn[] × Vm[])

Multiply corresponding elements from two vectors and add the results to a third vector
vmls

Vd[] ← Vd[] − (Vn[] × Vm[])

Multiply corresponding elements from two vectors and subtract the results from a third vector
vmull

Qd[] ← Dn[] × Dm[]

Multiply corresponding elements from two vectors and store the widened results in a third vector
vmlal

Qd[] ← Qd[] + (Dn[] × Dm[])

Multiply corresponding elements from two vectors and add the widened results to a third vector
vmlsl

Qd[] ← Qd[] − (Dn[] × Dm[])

Multiply corresponding elements from two vectors and subtract the widened results from a third vector

Examples

f10-49-9780128036983

10.10.2 Multiply by Scalar

These instructions are used to multiply each element in a vector by a scalar:

vmul Multiply by Scalar

vmla Multiply Accumulate by Scalar

vmls Multiply Subtract by Scalar

vmull Multiply Long by Scalar

vmlal Multiply Accumulate Long by Scalar

vmlsl Multiply Subtract Long by Scalar

The long versions can be used to avoid overflow.

Syntax

v<op>.<type> Vd, Vn, Dm[x]

v<op>l.<type> Qd, Dn, Dm[x]

 <op> is either mul, mla, or mls.

 The valid choices for <type> are given in the following table:

Opcode  Valid Types
vmul    i16, i32, or f32
vmla    i16, i32, or f32
vmls    i16, i32, or f32
vmull   s16, s32, u16, or u32
vmlal   s16, s32, u16, or u32
vmlsl   s16, s32, u16, or u32

 x must be valid for the chosen <type>.

Operations

Name  Effect  Description
vmul

Vd[] ← Vn[] × Dm[x]

Multiply each element of a vector by a scalar and store the results in a third register
vmla

Vd[] ← Vd[] + (Vn[] × Dm[x])

Multiply each element of a vector by a scalar and add the results to a third register
vmls

Vd[] ← Vd[] − (Vn[] × Dm[x])

Multiply each element of a vector by a scalar and subtract the results from a third register
vmull

Qd[] ← Dn[] × Dm[x]

Multiply each element of a vector by a scalar and store the widened results in a third register
vmlal

Qd[] ← Qd[] + (Dn[] × Dm[x])

Multiply each element of a vector by a scalar and add the widened results to a third register
vmlsl

Qd[] ← Qd[] − (Dn[] × Dm[x])

Multiply each element of a vector by a scalar and subtract the widened results from a third register

Examples

f10-50-9780128036983

10.10.3 Fused Multiply Accumulate

A fused multiply accumulate operation does not perform rounding between the multiply and add operations. The two operations are fused into one. NEON provides the following fused multiply accumulate instructions:

vfma Fused Multiply Accumulate

vfnma Fused Negate Multiply Accumulate

vfms Fused Multiply Subtract

vfnms Fused Negate Multiply Subtract

Using the fused multiply accumulate can result in improved speed and accuracy for many computations that involve the accumulation of products.

Syntax

 <op>{<cond>}.<prec> Fd, Fn, Fm

<op> is one of vfma, vfnma, vfms, or vfnms.

<cond> is an optional condition code.

<prec> may be either f32 or f64.

Operations

Name  Effect  Description
vfma   Fd ← Fd + (Fn × Fm)   Multiply and accumulate
vfnma  Fd ← −Fd − (Fn × Fm)  Negate, multiply, and accumulate
vfms   Fd ← Fd − (Fn × Fm)   Multiply and subtract
vfnms  Fd ← −Fd + (Fn × Fm)  Negate, multiply, and subtract

Examples

f10-51-9780128036983

10.10.4 Saturating Multiply and Double (Low)

These instructions perform multiplication, double the results, and perform saturation:

vqdmull Saturating Multiply Double (Low)

vqdmlal Saturating Multiply Double Accumulate (Low)

vqdmlsl Saturating Multiply Double Subtract (Low)

Syntax

 vqd<op>l.<type> Qd, Dn, Dm

 vqd<op>l.<type> Qd, Dn, Dm[x]

 <op> is either mul, mla, or mls.

 <type> must be either s16 or s32.

Operations

Name  Effect  Description
vqdmull

if second operand is scalar then

 Qd[] ← saturate(2 × Dn[] × Dm[x])

else

 Qd[] ← saturate(2 × Dn[] × Dm[])

end if

Multiply elements, double the results, and store in the destination vector with saturation
vqdmlal

if second operand is scalar then

 Qd[] ← saturate(Qd[] + 2 × Dn[] × Dm[x])

else

 Qd[] ← saturate(Qd[] + 2 × Dn[] × Dm[])

end if

Multiply elements, double the results, and add to the destination vector with saturation
vqdmlsl

if second operand is scalar then

 Qd[] ← saturate(Qd[] − 2 × Dn[] × Dm[x])

else

 Qd[] ← saturate(Qd[] − 2 × Dn[] × Dm[])

end if

Multiply elements, double the results, and subtract from the destination vector with saturation

Examples

f10-52-9780128036983

10.10.5 Saturating Multiply and Double (High)

These instructions perform multiplication, double the results, perform saturation, and store the high half of the results:

vqdmulh Saturating Multiply Double (High)

vqrdmulh Saturating Multiply Double (High) and Round

Syntax

 vq{r}dmulh.<type> Vd, Vn, Vm

 vq{r}dmulh.<type> Vd, Vn, Dm[x]

 <type> must be either s16 or s32.

Operations

Name  Effect  Description
vqdmulh

n ← size of <type>

if second operand is scalar then

 Vd[] ← saturate((2 × Vn[] × Dm[x]) ÷ 2^n)

else

 Vd[] ← saturate((2 × Vn[] × Vm[]) ÷ 2^n)

end if

Multiply elements, double the results, and store the high half in the destination vector with saturation
vqrdmulh

n ← size of <type>

if second operand is scalar then

 Vd[] ← saturate(round((2 × Vn[] × Dm[x]) ÷ 2^n))

else

 Vd[] ← saturate(round((2 × Vn[] × Vm[]) ÷ 2^n))

end if

Multiply elements, double the results, round, and store the high half in the destination vector with saturation

Examples

f10-53-9780128036983
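This is the standard way to multiply fixed-point fractions: doubling compensates for the duplicated sign bit of the S(0,15)-style format, and keeping the high half rescales the product. A one-lane C sketch of vqdmulh.s16 (invented helper name):

```c
#include <assert.h>
#include <stdint.h>

/* Model of vqdmulh.s16 on one lane: multiply, double, keep the high
   16 bits, and saturate. Only INT16_MIN x INT16_MIN can overflow. */
static int16_t qdmulh_s16(int16_t a, int16_t b) {
    int64_t doubled = 2LL * a * b;   /* widen: 2 * (-2^15)^2 overflows 32 bits */
    int64_t high = doubled >> 16;    /* take the most significant half */
    if (high > INT16_MAX) return INT16_MAX;
    if (high < INT16_MIN) return INT16_MIN;
    return (int16_t)high;
}
```

Interpreting 16384 as the fraction 0.5, the product of 0.5 and 0.5 comes out as 8192, which is 0.25 in the same format.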

10.10.6 Estimate Reciprocals

These instructions perform the initial estimates of the reciprocal values:

vrecpe Reciprocal Estimate

vrsqrte Reciprocal Square Root Estimate

These work on floating point and unsigned fixed point vectors. The estimates from this instruction are accurate to within about eight bits. If higher accuracy is desired, then the Newton-Raphson method can be used to improve the initial estimates. For more information, see the Reciprocal Step instruction.

Syntax

 v<op>.<type> Vd, Vm

 <op> is either recpe or rsqrte.

 <type> must be either u32, or f32.

 If <type> is u32, then the elements are assumed to be U(1,31) fixed point numbers; the integer part must be zero, and the most significant fraction bit (bit 30) must be 1. The vclz and shift by variable instructions can be used to put the data in the correct format.

 The result elements are always f32.

Operations

Name  Effect  Description
vrecpe

n ← # of elements

for 0 ≤ i < n do

 Vd[i] ← estimate(1 ÷ Vm[i])

end for

Find an approximate reciprocal of each element in a vector
vrsqrte

n ← # of elements

for 0 ≤ i < n do

 Vd[i] ← estimate(1 ÷ √Vm[i])

end for

Find an approximate reciprocal square root of each element in a vector

Examples

f10-54-9780128036983

10.10.7 Reciprocal Step

These instructions are used to perform one Newton-Raphson step for improving the reciprocal estimates:

vrecps Reciprocal Step

vrsqrts Reciprocal Square Root Step

For each element in the vector, the following equation can be used to improve the estimates of the reciprocals:

x_{n+1} = x_n (2 − d·x_n),

where x_n is the estimated reciprocal from the previous step, and d is the number for which the reciprocal is desired. This equation converges to 1/d if x_0 is obtained using vrecpe on d. The vrecps instruction computes

2 − d·x_n,

so one additional multiplication is required to complete the update step. The initial estimate x_0 must be obtained using the vrecpe instruction.

For each element in the vector, the following equation can be used to improve the estimates of the reciprocals of the square roots:

x_{n+1} = x_n (3 − d·x_n²) ÷ 2,

where x_n is the estimated reciprocal square root from the previous step, and d is the number for which the reciprocal square root is desired. This equation converges to 1/√d if x_0 is obtained using vrsqrte on d. The vrsqrts instruction computes

(3 − d·x_n²) ÷ 2,

so two additional multiplications are required to complete the update step. The initial estimate x_0 must be obtained using the vrsqrte instruction.
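The two iterations above can be sketched in C (a model, not NEON code: each helper folds in the extra multiplications that, on hardware, follow vrecps or vrsqrts, and a crude constant guess stands in for the vrecpe/vrsqrte table lookup):

```c
#include <assert.h>

/* One Newton-Raphson step for 1/d: x' = x * (2 - d*x). */
static float recip_step(float d, float x) {
    return x * (2.0f - d * x);
}

/* One Newton-Raphson step for 1/sqrt(d): x' = x * (3 - d*x*x) / 2. */
static float rsqrt_step(float d, float x) {
    return x * ((3.0f - d * x * x) / 2.0f);
}
```

Starting from a rough guess such as 0.3 for 1/3, three steps already agree with the true value to single-precision accuracy; the error roughly squares on every iteration.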

Syntax

 v<op>.<type> Vd, Vn, Vm

 <op> is either recps or rsqrts.

 <type> must be either u32, or f32.

Operations

Name  Effect  Description
vrecps

n ← # of elements

for 0 ≤ i < n do

 Vd[i] ← 2 − Vn[i] × Vm[i]

end for

Perform most of the Newton-Raphson reciprocal improvement step
vrsqrts

n ← # of elements

for 0 ≤ i < n do

 Vd[i] ← (3 − Vn[i] × Vm[i]) ÷ 2

end for

Perform most of the Newton-Raphson reciprocal square root improvement step

Examples

f10-55-9780128036983

10.11 Pseudo-Instructions

The GNU assembler supports five pseudo-instructions for NEON. Two of them are vcle and vclt, which were covered in Section 10.6.1. The other three are explained in the following sections.

10.11.1 Load Constant

This pseudo-instruction loads a constant value into every element of a NEON vector, or into a VFP single-precision or double-precision register:

vldr Load Constant.

This pseudo-instruction will use vmov if possible. Otherwise, it will create an entry in the literal pool and use vldr.

Syntax

 vldr{<cond>}.<type> Vd, =<imm>

 <cond> is an optional condition code.

 <type> must be one of i8, i16, i32, i64, s8, s16, s32, s64, u8, u16, u32, u64, f32, or f64.

 <imm> is a value appropriate for the specified <type>.

Operations

Name  Effect  Description
vldr

Vd ← <imm>

Load a constant

Examples

f10-56-9780128036983

10.11.2 Bitwise Logical Operations with Immediate Data

It is often useful to clear and/or set specific bits in a register. The following pseudo-instructions can provide bitwise logical operations:

vand Bitwise AND Immediate

vorn Bitwise Complement and OR Immediate

Syntax

 v<op>.<type> Vd, #<imm>

 <op> must be either and or orn.

 V must be either q or d to specify whether the operation involves quadwords or doublewords.

 <type> must be i8, i16, i32, or i64.

 <imm> is a 16-bit or 32-bit immediate value, which is interpreted as a pattern for filling the immediate operand. The following table shows acceptable patterns for <imm>, based on what was chosen for <type>:

<type>
i8, i16   i32, i64
0xFFXY    0xFFFFFFXY
0xXYFF    0xFFFFXYFF
          0xFFXYFFFF
          0xXYFFFFFF

Operations

Name  Effect  Description
vand  Vd ← Vd ∧ (imm:imm)   Logical AND
vorn  Vd ← Vd ∨ ¬(imm:imm)  Complement and Logical OR

Examples

f10-57-9780128036983

10.11.3 Vector Absolute Compare

The following pseudo-instructions perform comparisons between the absolute values of all of the corresponding elements of two vectors in parallel:

vacle Absolute Compare Less Than or Equal

vaclt Absolute Compare Less Than

The vector absolute compare instruction compares the absolute value of each element of a vector with the absolute value of the corresponding element in a second vector, and sets an element in the destination vector for each comparison. If the comparison is true, then all bits in the result element are set to one. Otherwise, all bits in the result element are set to zero. Note that summing the elements of the result vector (as signed integers) will give the two’s complement of the number of comparisons which were true.

Syntax

 vac<op>.f32 Vd, Vn, Vm

 <op> must be either le or lt.

 V can be d or q.

 The operand element type must be f32.

 The result element type is i32.

Operations

Name  Effect  Description
vac<op>

for 0 ≤ i < vector_length do

 if |Fn[i]| <op> |Fm[i]| then

 Fd[i] ← 11…1

 else

 Fd[i] ← 00…0

 end if

end for

Compare the absolute value of each scalar in Fn to the absolute value of the corresponding scalar in Fm. If the comparison is true, then set all bits in the corresponding scalar in Fd to one. Otherwise, set all bits in the corresponding scalar in Fd to zero.

Examples

f10-58-9780128036983

10.12 Performance Mathematics: A Final Look at Sine

In Chapter 9, four versions of the sine function were given, using scalar and VFP vector modes for single precision and double precision. Those implementations are already faster than the implementations provided by GCC. However, it may be possible to gain a little more performance by taking advantage of the NEON architecture. All versions of NEON are guaranteed to have a very large register set, and that fact can be used to attain better performance.

10.12.1 Single Precision

Listing 10.1 shows a single precision floating point implementation of the sine function, using the ARM NEON instruction set. It performs the same operations as the previous implementations of the sine function, but performs many of the calculations in parallel. This implementation is slightly faster than the previous version.

Listing 10.1 NEON implementation of the sin x function using single precision.

10.12.2 Double Precision

Listing 10.2 shows a double precision floating point implementation of the sine function. This code is intended to run on ARMv7 and earlier NEON/VFP systems with the full set of 32 double-precision registers. NEON systems prior to ARMv8 do not have NEON SIMD instructions for double precision operations. This implementation is faster than Listing 9.4 because it uses a large number of registers, does not contain a loop, and is written carefully so that multiple instructions can be at different stages in the pipeline at the same time. This technique of gaining performance is known as loop unrolling.

Listing 10.2 NEON implementation of the sin x function using double precision.

10.12.3 Performance Comparison

Table 10.4 compares the implementations from Listings 10.1 and 10.2 with the VFP vector implementations from Chapter 9 and the sine function provided by GCC. Notice that in every case, using vector mode VFP instructions is slower than the scalar VFP version. As mentioned previously, vector mode is deprecated on NEON processors and is emulated in software. Although vector mode is still supported, using it results in greatly reduced performance: each vector instruction causes the operating system to take over and substitute a series of scalar floating point operations on the fly, so most of the time is spent in the operating system software emulating the VFP hardware vector mode.

Table 10.4

Performance of sine function with various implementations

Optimization  Implementation                        CPU seconds
None          Single Precision VFP scalar Assembly  1.74
              Single Precision VFP vector Assembly  27.09
              Single Precision NEON Assembly        1.32
              Single Precision C                    4.36
              Double Precision VFP scalar Assembly  2.83
              Double Precision VFP vector Assembly  106.46
              Double Precision NEON Assembly        2.24
              Double Precision C                    4.59
Full          Single Precision VFP scalar Assembly  1.11
              Single Precision VFP vector Assembly  27.15
              Single Precision NEON Assembly        0.96
              Single Precision C                    1.69
              Double Precision VFP scalar Assembly  2.56
              Double Precision VFP vector Assembly  107.53
              Double Precision NEON Assembly        2.05
              Double Precision C                    4.27

When compiler optimization is not used, the single precision scalar VFP implementation achieves a speedup of about 2.51, and the NEON implementation achieves a speedup of about 3.30 compared to the GCC implementation. The double precision scalar VFP implementation achieves a speedup of about 1.62, and the loop-unrolled NEON implementation achieves a speedup of about 2.05 compared to the GCC implementation.

When the best possible compiler optimization is used (-Ofast), the single precision scalar VFP implementation achieves a speedup of about 1.52, and the NEON implementation achieves a speedup of about 1.76 compared to the GCC implementation. The double precision scalar VFP implementation achieves a speedup of about 1.67, and the loop-unrolled NEON implementation achieves a speedup of about 2.08 compared to the GCC implementation. The single precision NEON version was 1.16 times as fast as the VFP scalar version and the double precision NEON implementation was 1.25 times as fast as the VFP scalar implementation.

Although the VFP versions of the sine function ran without modification on the NEON processor, rewriting them for NEON resulted in significant performance improvement. Performance of the vectorized VFP code running on a NEON processor was abysmal. The take-away lesson is that a programmer can improve performance by writing some functions in assembly that are specifically targeted to run on a specific platform. However, assembly code which improves performance on one platform may actually result in very poor performance on a different platform. To achieve optimal or near-optimal performance, it is important for the programmer to be aware of exactly which hardware platform is being used.

10.13 Alphabetized List of NEON Instructions

Name      Page  Operation
vaba      339   Absolute Difference and Accumulate
vabal     339   Absolute Difference and Accumulate Long
vabd      339   Absolute Difference
vabdl     339   Absolute Difference Long
vabs      340   Absolute Value
vacge     324   Absolute Compare Greater Than or Equal
vacgt     324   Absolute Compare Greater Than
vacle     353   Absolute Compare Less Than or Equal
vaclt     353   Absolute Compare Less Than
vadd      335   Add
vaddhn    336   Add and Narrow
vaddl     335   Add Long
vaddw     335   Add Wide
vand      326   Bitwise AND
vand      352   Bitwise AND Immediate
vbic      326   Bit Clear
vbic      327   Bit Clear Immediate
vbif      328   Bitwise Insert if False
vbit      328   Bitwise Insert
vbsl      328   Bitwise Select
vceq      323   Compare Equal
vcge      323   Compare Greater Than or Equal
vcgt      323   Compare Greater Than
vcle      323   Compare Less Than or Equal
vcls      342   Count Leading Sign Bits
vclt      323   Compare Less Than
vclz      342   Count Leading Zero Bits
vcnt      342   Count Set Bits
vcvt      322   Convert Between Half and Single
vcvt      321   Convert Data Format
vdup      312   Duplicate Scalar
veor      326   Bitwise Exclusive-OR
vext      313   Extract Elements
vfma      346   Fused Multiply Accumulate
vfms      346   Fused Multiply Subtract
vfnma     346   Fused Negate Multiply Accumulate
vfnms     346   Fused Negate Multiply Subtract
vhadd     337   Halving Add
vhsub     337   Halving Subtract
vld<n>    305   Load Copies of Structured Data
vld<n>    307   Load Multiple Structured Data
vld<n>    303   Load Structured Data
vldr      351   Load Constant
vmax      341   Maximum
vmin      341   Minimum
vmla      343   Multiply Accumulate
vmla      345   Multiply Accumulate by Scalar
vmlal     344   Multiply Accumulate Long
vmlal     345   Multiply Accumulate Long by Scalar
vmls      343   Multiply Subtract
vmls      345   Multiply Subtract by Scalar
vmlsl     344   Multiply Subtract Long
vmlsl     345   Multiply Subtract Long by Scalar
vmov      310   Move Immediate
vmov      309   Move Between NEON and ARM
vmovl     311   Move and Lengthen
vmovn     311   Move and Narrow
vmul      343   Multiply
vmul      345   Multiply by Scalar
vmull     343   Multiply Long
vmull     345   Multiply Long by Scalar
vmvn      310   Move Immediate Negative
vneg      340   Negate
vorn      326   Bitwise Complement and OR
vorn      352   Bitwise Complement and OR Immediate
vorr      326   Bitwise OR
vorr      327   Bitwise OR Immediate
vpadal    338   Add Pairwise and Accumulate Long
vpadd     338   Add Pairwise
vpaddl    338   Add Pairwise Long
vpmax     341   Pairwise Maximum
vpmin     341   Pairwise Minimum
vqabs     340   Saturating Absolute Value
vqadd     335   Saturating Add
vqdmlal   347   Saturating Multiply Double Accumulate (Low)
vqdmlsl   347   Saturating Multiply Double Subtract (Low)
vqdmulh   348   Saturating Multiply Double (High)
vqdmull   347   Saturating Multiply Double (Low)
vqmovn    311   Saturating Move and Narrow
vqmovun   311   Saturating Move and Narrow Unsigned
vqneg     340   Saturating Negate
vqrdmulh  348   Saturating Multiply Double (High) and Round
vqrshl    330   Saturating Shift Left or Right by Variable and Round
vqrshrn   332   Saturating Shift Right Immediate Round
vqrshrun  333   Saturating Shift Right Immediate Round Unsigned
vqshl     329   Saturating Shift Left Immediate
vqshl     330   Saturating Shift Left or Right by Variable
vqshlu    329   Saturating Shift Left Immediate Unsigned
vqshrn    332   Saturating Shift Right Immediate
vqshrun   333   Saturating Shift Right Immediate Unsigned
vqsub     335   Saturating Subtract
vraddhn   336   Add, Round, and Narrow
vrecpe    348   Reciprocal Estimate
vrecps    349   Reciprocal Step
vrev      314   Reverse Elements
vrhadd    337   Halving Add and Round
vrshl     330   Shift Left or Right by Variable and Round
vrshr     331   Shift Right Immediate and Round
vrshrn    331   Shift Right Immediate Round and Narrow
vrsqrte   348   Reciprocal Square Root Estimate
vrsqrts   349   Reciprocal Square Root Step
vrsra     331   Shift Right Round and Accumulate Immediate
vrsubhn   336   Subtract, Round, and Narrow
vshl      329   Shift Left Immediate
vshl      330   Shift Left or Right by Variable
vshll     329   Shift Left Immediate Long
vshr      331   Shift Right Immediate
vshrn     331   Shift Right Immediate and Narrow
vsli      334   Shift Left and Insert
vsra      331   Shift Right and Accumulate Immediate
vsri      334   Shift Right and Insert
vst<n>    307   Store Multiple Structured Data
vst<n>    303   Store Structured Data
vsub      335   Subtract
vsubhn    336   Subtract and Narrow
vsubl     335   Subtract Long
vsubw     335   Subtract Wide
vswp      315   Swap Vectors
vtbl      318   Table Lookup
vtbx      318   Table Lookup with Extend
vtrn      316   Transpose Matrix
vtst      325   Test Bits
vuzp      319   Unzip Vectors
vzip      319   Zip Vectors


10.14 Chapter Summary

NEON can dramatically improve performance of algorithms that can take advantage of data parallelism. However, compiler support for automatically vectorizing and using NEON instructions is still immature. NEON intrinsics allow C and C++ programmers to access NEON instructions by making them look like C functions. It is usually just as easy, and more concise, to write NEON assembly code as it is to use the intrinsic functions. A careful assembly language programmer can usually beat the compiler, sometimes by a wide margin. The greatest gains usually come from converting an algorithm to avoid floating point, and taking advantage of data parallelism.

Exercises

10.1 What is the advantage of using IEEE half-precision? What is the disadvantage?

10.2 NEON achieved relatively modest performance gains on the sine function, when compared to VFP.

(a) Why?

(b) List some tasks for which NEON could significantly outperform VFP.

10.3 There are some limitations on the size of the structure that can be loaded or stored using the vld<n> and vst<n> instructions. What are the limitations?

10.4 The sine function in Listing 10.2 uses a technique known as “loop unrolling” to achieve higher performance. Give at least three reasons why this code is more efficient than using a loop.

10.5 Reimplement the fixed-point sine function from Listing 8.7 using NEON instructions. Hint: you should not need to use a loop. Compare the performance of your NEON implementation with the performance of the original implementation.

10.6 Reimplement Exercise 9.10 using NEON instructions.

10.7 Fixed point operations may be faster than floating point operations. Modify your code from the previous exercise so that it uses the following definitions for points and transformation matrices:

[Figure: fixed-point definitions for points and transformation matrices]

Use saturating instructions and/or any other techniques necessary to prevent overflow. Compare the performance of the two implementations.

Part III

Accessing Devices

Chapter 11

Devices

Abstract

This chapter starts with a high-level explanation of how devices may be accessed in a modern computer system, and then explains that most devices on modern architectures are memory-mapped. Next, it explains how memory-mapped devices can be accessed by user processes under Linux, by making use of the mmap system call. Code examples show how several devices can be mapped into the memory of a user-level program on the Raspberry Pi and pcDuino. Finally, the General Purpose I/O devices on both systems are explained, giving the reader the opportunity to compare two different devices that perform almost exactly the same functions.

Keywords

Device; Memory map; General purpose I/O (GPIO); I/O Pin; Header; Pull-up and pull-down resistor; LED; Switch

As mentioned in Chapter 1, a computer system consists of three main parts: the CPU, memory, and devices. The typical computing system has many devices of various types for performing specific functions. Some devices, such as data caches, are closely coupled to the CPU, and are typically controlled by executing special CPU instructions that can only be accessed in assembly language. However, most of the devices on a typical system are accessed and controlled through the system data bus. These devices appear to the programmer to be ordinary memory locations. The hardware in the system bus decodes the addresses coming from the CPU, and some addresses correspond to devices rather than memory. Fig. 11.1 shows the memory layout for a typical system. The exact locations of the devices and memory are chosen by the system hardware designers. From the programmer’s standpoint, writing data to certain memory addresses results in the data being transferred to a device rather than stored in memory. The programmer must read documentation on the hardware design to determine exactly where the devices are in memory.

Figure 11.1 Typical hardware address mapping for memory and devices.

11.1 Accessing Devices Directly Under Linux

There are devices that allow data to be read or written from external sources, devices that can measure time, devices for moving data from one location in memory to another, devices for modifying the addresses of memory regions, and devices for even more esoteric purposes. Some devices are capable of sending signals to the CPU to indicate that they need attention, while others simply wait for the CPU to check on their status.

A modern computer system, such as the Raspberry Pi, has dozens or even hundreds of devices. Programmers write device driver software for each device. A device driver provides a few standard function calls for each device, so that it can be used easily. The specific set of functions depends on the type of device and the design of the operating system. Operating system designers strive to define a small set of device types, and to define a standard software interface for each type in order to make devices interchangeable.

Devices are typically controlled by writing specific values to the device’s internal device registers. For the ARM processor, access to most device registers is accomplished using the load and store instructions. Each device is assigned a base address in memory. This address corresponds with the first register inside the device. The device may also have other registers that are accessible at some pre-defined offset address from the base address. Some registers are read-only, some are write-only, and some are read-write. To use the device, the programmer must read from, and write appropriate data to, the correct device registers. For every device, there is a programmer’s model and documentation explaining what each register in the device does. Some devices are well designed, easy to use, and well documented. Some devices are not, and the programmer must work harder to write software to use them.

Linux is a powerful, multiuser, multitasking operating system. The Linux kernel manages all of the devices and protects them from direct access by user programs. User programs are intended to access devices by making system calls. The kernel accesses the devices on behalf of the user programs, ensuring that an errant user program cannot misuse the devices and other resources on the system. Attempting to directly access the registers in any device will result in an exception. The kernel will take over and kill the offending process.

However, our programs will need direct access to the device registers. Linux allows user programs to gain direct access through the mmap() system call. Listing 11.1 shows how four devices can be mapped into the memory space of a user program on a Raspberry Pi. In most cases, the user program will need administrator privileges in order to perform the mapping. The operating system does not usually give permission for ordinary users to access devices directly. However, Linux does provide the ability to change permissions on /dev/mem, or for user programs to run with elevated privileges.

Listing 11.1 Function to map devices into the user program memory on a Raspberry Pi.

Listing 11.2 shows how four devices can be mapped into the memory space of a user program on a pcDuino. The devices are equivalent to the devices mapped in Listing 11.1. Some of the devices are described in the following sections of this chapter. The pcDuino devices and Raspberry Pi devices operate differently, but provide similar functionality. Note that most of the code is the same for both listings. The only real differences between Listings 11.1 and 11.2 are the names of the devices and their hardware addresses.

Listing 11.2 Function to map devices into the user program memory space on a pcDuino.

11.2 General Purpose Digital Input/Output

One type of device, commonly found on embedded systems, is the General Purpose I/O (GPIO) device. Although there are many variations on this device provided by different manufacturers, they all provide similar capabilities. The device provides a set of input and/or output bits, which allow signals to be transferred to or from the outside world. Each bit of input or output in a GPIO device is generally referred to as a pin, and a group of pins is referred to as a GPIO port. Ports commonly support 8 bits of input or output, but some devices have 16 or 32 bit ports. Some GPIO devices support multiple ports, and some systems have multiple GPIO devices in them.

A system with a GPIO device usually has some type of connector or wires that allow external inputs or outputs to be connected to the system. For example, the IBM PC has a type of GPIO device that was originally intended for communications with a parallel printer. On that platform, the GPIO device is commonly referred to as the parallel printer port.

Some GPIO devices, such as the one on the IBM PC, are arranged as sets of pins that can be switched as a group to either input or output. In many modern GPIO devices, each pin can be individually configured to accept or source different input and output voltages. On some devices, the amount of drive current available can be configured. Some include the ability to configure built-in pull-up and/or pull-down resistors. On most older GPIO devices, the input and output voltages are typically limited to the supply voltage of the GPIO device, and the device may be damaged by greater voltages. Newer GPIO devices generally can tolerate 5 V on inputs, regardless of the supply voltage of the device.

GPIO devices are very common in systems that are intended to be used for embedded applications. For most GPIO devices:

 individual pins or groups of pins can be configured,

 pins can be configured to be input or output,

 pins can be disabled so that they are neither input nor output,

 input values can be read by the CPU (typically high=1, low=0),

 output values can be read or written by the CPU, and

 input pins can be configured to generate interrupt requests.

Some GPIO devices may also have more advanced features, such as the ability to use Direct Memory Access (DMA) to send data without requiring the CPU to move each byte or word. Fig. 11.2 shows two common ways to use GPIO pins. Fig. 11.2A shows a GPIO pin that has been configured for input, and connected to a push-button switch. When the switch is open, the pull-up resistor pulls the voltage on the pin to a high state. When the switch is closed, the pin is pulled to a low state and some current flows through the pull-up resistor to ground. Typically, the pull-up resistor would be around 10 kΩ. The specific value is not critical, but it must be high enough to limit the current to a small amount when the switch is closed. Fig. 11.2B shows a GPIO pin that is configured as an output and is being used to drive an LED. When a 1 is output on the pin, it is at the same voltage as Vcc (the power supply voltage), and no current flows. The LED is off. When a 0 is output on the pin, current is drawn through the resistor and the LED, and through the pin to ground. This causes the LED to be illuminated. Selection of the resistor is not critical, but it must be small enough to light the LED without allowing enough current to destroy either the LED or the GPIO circuitry. This is typically around 1 kΩ. Note that, in general, GPIO pins can sink more current than they can source, so it is most common to connect LEDs and other devices in the way shown.

Figure 11.2 GPIO pins being used for input and output. (A) GPIO pin being used as input to read the state of a push-button switch. (B) GPIO pin being used as output to drive an LED.

11.2.1 Raspberry Pi GPIO

The Broadcom BCM2835 system-on-chip contains 54 GPIO pins that are split into two banks. The GPIO pins are named using the following format: GPIOx, where x is a number between 0 and 53. The GPIO pins are highly configurable. Each pin can be used for general purpose I/O, or can be configured to serve up to six pre-defined alternate functions. Configuring a GPIO pin for an alternate function usually allows some other device within the BCM2835 to use the pin. For example, GPIO4 can be used

 for general purpose I/O,

 to send the signal generated by General Purpose Clock 0 to external devices,

 to send bit one of the Secondary Address Bus to external devices, or

 to receive JTAG data for programming the firmware of the device.

The last eight GPIO pins, GPIO46–GPIO53, have no alternate functions and are used only for GPIO.

In addition to the alternate function, all GPIO pins can be configured individually as input or output. When configured as input, a pin can also be configured to detect when the signal changes, and to send an interrupt to the ARM CPU. Each input pin also has internal pull-up and pull-down resistors, which can be enabled or disabled by the programmer.

The GPIO pins on the BCM2835 SOC are very flexible and are quite complex, but are well designed and not difficult to program, once the programmer understands how the pins operate and what the various registers do. There are 41 registers that control the GPIO pins. The base address for the GPIO device is 0x20200000. The 41 registers and their offsets from the base address are shown in Table 11.1.

Table 11.1

Raspberry Pi GPIO register map

Offset  Name       Description                            Size  R/W
0x00    GPFSEL0    GPIO Function Select 0                 32    R/W
0x04    GPFSEL1    GPIO Function Select 1                 32    R/W
0x08    GPFSEL2    GPIO Function Select 2                 32    R/W
0x0C    GPFSEL3    GPIO Function Select 3                 32    R/W
0x10    GPFSEL4    GPIO Function Select 4                 32    R/W
0x14    GPFSEL5    GPIO Function Select 5                 32    R/W
0x1C    GPSET0     GPIO Pin Output Set 0                  32    W
0x20    GPSET1     GPIO Pin Output Set 1                  32    W
0x28    GPCLR0     GPIO Pin Output Clear 0                32    W
0x2C    GPCLR1     GPIO Pin Output Clear 1                32    W
0x34    GPLEV0     GPIO Pin Level 0                       32    R
0x38    GPLEV1     GPIO Pin Level 1                       32    R
0x40    GPEDS0     GPIO Pin Event Detect Status 0         32    R/W
0x44    GPEDS1     GPIO Pin Event Detect Status 1         32    R/W
0x4C    GPREN0     GPIO Pin Rising Edge Detect Enable 0   32    R/W
0x50    GPREN1     GPIO Pin Rising Edge Detect Enable 1   32    R/W
0x58    GPFEN0     GPIO Pin Falling Edge Detect Enable 0  32    R/W
0x5C    GPFEN1     GPIO Pin Falling Edge Detect Enable 1  32    R/W
0x64    GPHEN0     GPIO Pin High Detect Enable 0          32    R/W
0x68    GPHEN1     GPIO Pin High Detect Enable 1          32    R/W
0x70    GPLEN0     GPIO Pin Low Detect Enable 0           32    R/W
0x74    GPLEN1     GPIO Pin Low Detect Enable 1           32    R/W
0x7C    GPAREN0    GPIO Pin Async. Rising Edge Detect 0   32    R/W
0x80    GPAREN1    GPIO Pin Async. Rising Edge Detect 1   32    R/W
0x88    GPAFEN0    GPIO Pin Async. Falling Edge Detect 0  32    R/W
0x8C    GPAFEN1    GPIO Pin Async. Falling Edge Detect 1  32    R/W
0x94    GPPUD      GPIO Pin Pull-up/down Enable           32    R/W
0x98    GPPUDCLK0  GPIO Pin Pull-up/down Enable Clock 0   32    R/W
0x9C    GPPUDCLK1  GPIO Pin Pull-up/down Enable Clock 1   32    R/W


Setting the GPIO pin function

The first six 32-bit registers in the device are used to select the function for each of the 54 GPIO pins. The function of each pin is controlled by a group of three bits in one of these registers. The mapping is very regular. Bits 0–2 of GPFSEL0 control the function of GPIO pin 0. Bits 3–5 of GPFSEL0 control the function of GPIO pin 1, and so on, up to bits 27–29 of GPFSEL0, which control the function of GPIO pin 9. The next pin, pin 10, is controlled by bits 0–2 of GPFSEL1. The pins are assigned in sequence through the remaining bits, until bits 27–29, which control GPIO pin 19. The remaining four GPFSEL registers control the remaining GPIO pins. Note that bits 30 and 31 of all of the GPFSEL registers are not used, and most of the bits in GPFSEL5 are not assigned to any pin. The meaning of each combination of the three bits is shown in Table 11.2. Note that the encoding is not as simple as one might expect.

Table 11.2

GPIO pin function select bits

Bits (MSB–LSB)  Function
000             Pin is an input
001             Pin is an output
100             Pin performs alternate function 0
101             Pin performs alternate function 1
110             Pin performs alternate function 2
111             Pin performs alternate function 3
011             Pin performs alternate function 4
010             Pin performs alternate function 5

The procedure for setting the function of a GPIO pin is as follows:

 Determine which GPFSEL register controls the desired pin.

 Determine which bits of that GPFSEL register are used.

 Determine what the bit pattern should be.

 Read the GPFSEL register.

 Clear the correct bits using the bic instruction.

 Set them to the correct pattern using the orr instruction.

For example, Listing 11.3 shows the sequence of code which would be used to set GPIO pin 26 to alternate function 1.

Listing 11.3 ARM assembly code to set GPIO pin 26 to alternate function 1.

Setting GPIO output pins

To use a GPIO pin for output, the function select bits for that pin must be set to 001. Once that is done, the output can be driven high or low by using the GPSET and GPCLR registers. GPIO pin 0 is set to a high output by writing a 1 to bit 0 of GPSET0, and it is set to low output by writing a 1 to bit 0 of GPCLR0. GPIO pin 1 is similarly controlled by bit 1 in GPSET0 and GPCLR0. Each of the GPIO pins numbered 0 through 31 is assigned one bit in GPSET0 and one bit in GPCLR0. GPIO pin 32 is assigned to bit 0 of GPSET1 and GPCLR1, GPIO pin 33 is assigned to bit 1 of GPSET1 and GPCLR1, and so on. Since there are only 54 GPIO pins, bits 22–31 of GPSET1 and GPCLR1 are not used. The programmer can set or clear several outputs simultaneously by writing the appropriate bits in the GPSET and GPCLR registers.

Reading GPIO input pins

To use a GPIO pin for input, the function select bits for that pin must be set to 000. Once that is done, the input can be read at any time by reading the appropriate GPLEV register and examining the bit that corresponds with the input pin. GPIO pin 0 is read as bit 0 of GPLEV0, and GPIO pin 1 is similarly read as bit 1 of GPLEV0. Each of the GPIO pins numbered 0 through 31 is assigned one bit in GPLEV0. GPIO pin 32 is assigned to bit 0 of GPLEV1, GPIO pin 33 is assigned to bit 1 of GPLEV1, and so on. Since there are only 54 GPIO pins, bits 22–31 of GPLEV1 are not used. The programmer can read the status of several inputs simultaneously by reading one of the GPLEV registers and examining the bits corresponding to the appropriate pins.

Enabling internal pull-up or pull-down

Input pins can be configured with internal pull-up or pull-down resistors. This can simplify the design of the system. For instance, Fig. 11.2A shows a push-button switch connected to an input with an external pull-up resistor. That resistor is unnecessary if the internal pull-up for that pin is enabled.

Enabling the pull-up or pull-down is a two step process. The first step is to configure the type of change to be made, and the second step is to perform that change on the selected pin(s). The first step is accomplished by writing to the GPPUD register. The valid binary control codes are shown in Table 11.3.

Table 11.3

GPPUD control codes

Code  Function
00    Disable pull-up and pull-down
01    Enable pull-down
10    Enable pull-up

Once the GPPUD register is configured, the selected operation can be performed on multiple pins by writing to one or both of the GPPUDCLK registers. GPIO pins are assigned to bits in these two registers in the same way as the pins are assigned in the GPLEV, GPSET, and GPCLR registers. Writing 1 to bit 0 of GPPUDCLK0 will configure the pull-up or pull-down for GPIO pin 0, according to the control code that is currently in the GPPUD register.

Detecting GPIO events

The GPEDS registers are used for detecting events that have occurred on the GPIO pins. For instance, a pin may have transitioned from low to high, and back to low. If the CPU does not read the GPLEV register often enough, then such an event could be missed. The GPEDS registers can be configured to capture such events so that the CPU can detect that they occurred.

GPIO pins are assigned to bits in these two registers in the same way as the pins are assigned in the GPLEV, GPSET, and GPCLR registers. If bit 0 of GPEDS0 is set, then that indicates that an event has occurred on GPIO pin 0. Writing a 0 to that bit will clear the bit and allow the event detector to detect another event. Each pin can be configured to detect specific types of events by writing to the GPREN, GPHEN, GPLEN, GPAREN, and GPAFEN registers. For more information, refer to the BCM2835 ARM Peripherals manual.

GPIO pins available on the Raspberry Pi

The Raspberry Pi provides access to several of the 54 GPIO pins through the expansion header. The expansion header is a group of physical pins located in the corner of the Raspberry Pi board. Fig. 11.3 shows where the header is located on the Raspberry Pi. Wires can be connected to these pins and then the GPIO device can be programmed to send and/or receive digital information. Fig. 11.4 shows which signals are attached to the various pins. Some of the pins are used to provide power and ground to the external devices.

Figure 11.3 The Raspberry Pi expansion header location.
Figure 11.4 The Raspberry Pi expansion header pin assignments.

Table 11.4 shows some useful alternate functions available on each pin of the Raspberry Pi expansion header. Many of the alternate functions available on these pins are not really useful. Those functions have been left out of the table. The most useful alternate functions are probably GPIO 14 and 15, which can be used for serial communication, and GPIO 18, which can be used for pulse width modulation. Pulse width modulation is covered in Section 12.2, and serial communication is covered in Section 13.2. The Serial Peripheral Interface (SPI) functions could also be useful for connecting the Raspberry Pi to other devices which support SPI. Also, the SDA and SCL functions could be used to communicate with I2C devices.

Table 11.4

Raspberry Pi expansion header useful alternate functions

           Alternate Function
Pin        0            5
GPIO 2     SDA1         —
GPIO 3     SCL1         —
GPIO 4     GPCLK0       —
GPIO 7     SPI0_CE1_N   —
GPIO 8     SPI0_CE0_N   —
GPIO 9     SPI0_MISO    —
GPIO 10    SPI0_MOSI    —
GPIO 11    SPI0_SCLK    —
GPIO 14    TXD0         TXD1
GPIO 15    RXD0         RXD1
GPIO 18    PCM_CLK      PWM0


11.2.2 pcDuino GPIO

The AllWinner A10/A20 system-on-chip contains 175 GPIO pins, which are arranged in nine ports. Each of the nine ports is identified by a letter between “A” and “I.” The ports are part of the PIO device, which is mapped at address 0x01C20800. The GPIO pins are named using the following format: PNx, where N is a letter between “A” and “I” indicating the port, and x is a number indicating a pin on the given port. The assignment of pins to ports is somewhat irregular, as shown in Table 11.5. Some ports have as many as 28 physical pins, while others have as few as six. However, the layout of the registers in the device is very regular. Given any port and pin combination, finding the correct registers, and the sets of bits within the registers, is very straightforward.

Table 11.5

Number of pins available on each of the AllWinner A10/A20 PIO ports

Port  Pins
A     18
B     24
C     25
D     28
E     12
F     6
G     12
H     28
I     22

Each of the nine ports is controlled by a set of nine registers, for a total of 81 registers. There are seven additional registers that can be used to configure pins as interrupt sources. Interrupt processing is explained in Section 14.2. Together, the port and interrupt registers make a total of 88 registers for the GPIO device. The complete register map, with the offset of each register from the device base address, is shown in Table 11.6.

Table 11.6

Registers in the AllWinner GPIO device

Offset   Name            Description
000₁₆    PA_CFG0         Function select for Port A, Pins 0–7
004₁₆    PA_CFG1         Function select for Port A, Pins 8–15
008₁₆    PA_CFG2         Function select for Port A, Pins 16–17
00C₁₆    PA_CFG3         Not used
010₁₆    PA_DAT          Port A Data Register
014₁₆    PA_DRV0         Port A Multi-driving, Pins 0–15
018₁₆    PA_DRV1         Port A Multi-driving, Pins 16–17
01C₁₆    PA_PULL0        Port A Pull-Up/-Down, Pins 0–15
020₁₆    PA_PULL1        Port A Pull-Up/-Down, Pins 16–17
024₁₆    PB_CFG0         Function select for Port B, Pins 0–7
028₁₆    PB_CFG1         Function select for Port B, Pins 8–15
02C₁₆    PB_CFG2         Function select for Port B, Pins 16–23
030₁₆    PB_CFG3         Not used
034₁₆    PB_DAT          Port B Data Register
038₁₆    PB_DRV0         Port B Multi-driving, Pins 0–15
03C₁₆    PB_DRV1         Port B Multi-driving, Pins 16–23
040₁₆    PB_PULL0        Port B Pull-Up/-Down, Pins 0–15
044₁₆    PB_PULL1        Port B Pull-Up/-Down, Pins 16–23
048₁₆    PC_CFG0         Function select for Port C, Pins 0–7
04C₁₆    PC_CFG1         Function select for Port C, Pins 8–15
050₁₆    PC_CFG2         Function select for Port C, Pins 16–23
054₁₆    PC_CFG3         Function select for Port C, Pin 24
058₁₆    PC_DAT          Port C Data Register
05C₁₆    PC_DRV0         Port C Multi-driving, Pins 0–15
060₁₆    PC_DRV1         Port C Multi-driving, Pins 16–23
064₁₆    PC_PULL0        Port C Pull-Up/-Down, Pins 0–15
068₁₆    PC_PULL1        Port C Pull-Up/-Down, Pins 16–23
06C₁₆    PD_CFG0         Function select for Port D, Pins 0–7
070₁₆    PD_CFG1         Function select for Port D, Pins 8–15
074₁₆    PD_CFG2         Function select for Port D, Pins 16–23
078₁₆    PD_CFG3         Function select for Port D, Pins 24–27
07C₁₆    PD_DAT          Port D Data Register
080₁₆    PD_DRV0         Port D Multi-driving, Pins 0–15
084₁₆    PD_DRV1         Port D Multi-driving, Pins 16–27
088₁₆    PD_PULL0        Port D Pull-Up/-Down, Pins 0–15
08C₁₆    PD_PULL1        Port D Pull-Up/-Down, Pins 16–27
090₁₆    PE_CFG0         Function select for Port E, Pins 0–7
094₁₆    PE_CFG1         Function select for Port E, Pins 8–11
098₁₆    PE_CFG2         Not used
09C₁₆    PE_CFG3         Not used
0A0₁₆    PE_DAT          Port E Data Register
0A4₁₆    PE_DRV0         Port E Multi-driving, Pins 0–11
0A8₁₆    PE_DRV1         Not used
0AC₁₆    PE_PULL0        Port E Pull-Up/-Down, Pins 0–11
0B0₁₆    PE_PULL1        Not used
0B4₁₆    PF_CFG0         Function select for Port F, Pins 0–5
0B8₁₆    PF_CFG1         Not used
0BC₁₆    PF_CFG2         Not used
0C0₁₆    PF_CFG3         Not used
0C4₁₆    PF_DAT          Port F Data Register
0C8₁₆    PF_DRV0         Port F Multi-driving, Pins 0–5
0CC₁₆    PF_DRV1         Not used
0D0₁₆    PF_PULL0        Port F Pull-Up/-Down, Pins 0–5
0D4₁₆    PF_PULL1        Not used
0D8₁₆    PG_CFG0         Function select for Port G, Pins 0–7
0DC₁₆    PG_CFG1         Function select for Port G, Pins 8–11
0E0₁₆    PG_CFG2         Not used
0E4₁₆    PG_CFG3         Not used
0E8₁₆    PG_DAT          Port G Data Register
0EC₁₆    PG_DRV0         Port G Multi-driving, Pins 0–11
0F0₁₆    PG_DRV1         Not used
0F4₁₆    PG_PULL0        Port G Pull-Up/-Down, Pins 0–11
0F8₁₆    PG_PULL1        Not used
0FC₁₆    PH_CFG0         Function select for Port H, Pins 0–7
100₁₆    PH_CFG1         Function select for Port H, Pins 8–15
104₁₆    PH_CFG2         Function select for Port H, Pins 16–23
108₁₆    PH_CFG3         Function select for Port H, Pins 24–27
10C₁₆    PH_DAT          Port H Data Register
110₁₆    PH_DRV0         Port H Multi-driving, Pins 0–15
114₁₆    PH_DRV1         Port H Multi-driving, Pins 16–27
118₁₆    PH_PULL0        Port H Pull-Up/-Down, Pins 0–15
11C₁₆    PH_PULL1        Port H Pull-Up/-Down, Pins 16–27
120₁₆    PI_CFG0         Function select for Port I, Pins 0–7
124₁₆    PI_CFG1         Function select for Port I, Pins 8–15
128₁₆    PI_CFG2         Function select for Port I, Pins 16–21
12C₁₆    PI_CFG3         Not used
130₁₆    PI_DAT          Port I Data Register
134₁₆    PI_DRV0         Port I Multi-driving, Pins 0–15
138₁₆    PI_DRV1         Port I Multi-driving, Pins 16–21
13C₁₆    PI_PULL0        Port I Pull-Up/-Down, Pins 0–15
140₁₆    PI_PULL1        Port I Pull-Up/-Down, Pins 16–21
200₁₆    PIO_INT_CFG0    PIO Interrupt Configure Register 0
204₁₆    PIO_INT_CFG1    PIO Interrupt Configure Register 1
208₁₆    PIO_INT_CFG2    PIO Interrupt Configure Register 2
20C₁₆    PIO_INT_CFG3    PIO Interrupt Configure Register 3
210₁₆    PIO_INT_CTL     PIO Interrupt Control Register
214₁₆    PIO_INT_STATUS  PIO Interrupt Status Register
218₁₆    PIO_INT_DEB     PIO Interrupt Debounce Register


The GPIO pins are highly configurable. Each pin can be used for general purpose I/O, or can be configured to serve one of up to six pre-defined alternate functions. Configuring a GPIO pin for an alternate function usually allows some other device within the A10/A20 SOC to use the pin. For example, PB2 (pin 2 of port B) can be used for general purpose I/O, or it can output the signal from a Pulse Width Modulator (PWM) device (explained in Section 12.2). Each input pin also has internal pull-up and pull-down resistors, which can be enabled or disabled by the programmer.

Setting the GPIO pin function

The first four registers for each port are used to configure the functions for each of the pins. The function of each pin is controlled by three bits in one of the four configuration registers. Pins 0–7 are controlled using configuration register 0. Pins 8–15 are controlled by configuration register 1, and so on. The assignment of pins to control bits is shown in Fig. 11.5. Note that eight pins are controlled by each register, and there is an unused bit between each group of three bits.

Figure 11.5 Bit-to-pin assignments for PIO control registers.

Each GPIO pin can be configured by writing a 3-bit code to the appropriate location in the correct port configuration register. The meaning of each possible code is shown in Table 11.7. For example, to configure port A, pin 10 (PA10) for output, the 3-bit code 001 must be written to bits 8–10 of the PA_CFG1 register, without changing any other bit in the register. Listing 11.4 shows how this operation can be accomplished.

Table 11.7

Allwinner A10/A20 GPIO pin function select bits

MSB–LSB  Function
000      Pin is an input
001      Pin is an output
010      Pin performs alternate function 0
011      Pin performs alternate function 1
100      Pin performs alternate function 2
101      Pin performs alternate function 3
110      Pin performs alternate function 4
111      Pin performs alternate function 5
Listing 11.4 ARM assembly code to configure PA10 for output.

Reading and setting GPIO pins

An output pin can be set to a high state by setting the corresponding bit in the correct port data register. Likewise, the pin can be set to a low state by clearing its corresponding bit. Care must be taken to avoid changing any other bits in the port data register. Listing 11.5 shows how to set a pin to output a high state. To set the pin output to a low state, the orr instruction would be replaced with a bic instruction.

Listing 11.5 ARM assembly code to set PA10 to output a high state.

To determine the current state of an output pin or read an input pin, the programmer can read the contents of the correct port data register and use bitwise logical operations to isolate the appropriate bit. For example, to read the state of pin 14 of port I (PI14), the programmer would read the PI_DAT register and mask all bits except bit 14. Listing 11.6 shows how this operation can be accomplished. Another method would be to use the tst instruction, rather than the ands instruction, to set the CPSR flags.

Listing 11.6 ARM assembly code to read the state of PI14.

Enabling internal pull-up or pull-down

Input pins can be configured with internal pull-up or pull-down resistors. This can simplify the design of the system. For instance, Fig. 11.2a shows a push-button switch connected to an input with an external pull-up resistor. That resistor is unnecessary if the internal pull-up for that pin is enabled. Each pin is assigned two bits in one of the port pull-up/-down registers. The pull-up and pull-down resistors for pin 0 on port B are controlled using bits 0 and 1 of the PB_PULL0 register. Likewise, the pull-up and pull-down resistors for pin 19 of port C are controlled using bits 6 and 7 of the PC_PULL1 register. Table 11.8 shows the bit patterns used to configure the pull-up and pull-down resistors for a pin.

Table 11.8

Pull-up and pull-down resistor control codes

Code  Function
00    Disable pull-up and pull-down
01    Enable pull-up
10    Enable pull-down
11    Reserved

Detecting GPIO events

When configured as an input, most of the pins on the pcDuino can be configured to generate an interrupt, which notifies the CPU that an event has occurred. Configuring interrupts is accomplished using the PIO_INT registers, but is beyond the scope of this chapter.

GPIO pins available on the pcDuino

The pcDuino provides access to several of the 175 GPIO pins through the expansion headers. Fig. 11.6 shows where the headers are located on the pcDuino. Wires can be plugged into the holes in these headers and then the GPIO device can be programmed to send and/or receive digital and/or analog signals. The physical layout of the pcDuino header makes it compatible with a wide range of expansion modules designed for the Arduino family of microcontroller boards.

Figure 11.6 The pcDuino header locations.

Some of the header holes can provide power and ground to external devices. Analog signals can be read into the pcDuino using the ADC header connections. Fig. 11.7 shows the pcDuino names for the signals that are available on the headers. Table 11.9 shows how the pcDuino header signal names are mapped to the actual port pins on the AllWinner A10/A20 chip. It also shows the most useful alternate functions available on each of the pins; alternate functions with little practical use have been omitted. Note that the pcDuino and the Raspberry Pi both provide pins to perform PWM, UART communications, and SPI.

Figure 11.7 The pcDuino header pin assignments.

Table 11.9

pcDuino GPIO pins and function select code assignments.

pcDuino Pin Name    Port  Pin  Alternate Functions (select codes 010–110)
UART-Rx (GPIO0)     I     19   UART2_RX, EINT31
UART-Tx (GPIO1)     I     18   UART2_TX, EINT30
GPIO3 (GPIO2)       H     7    UART5_RX, EINT7
PWM0 (GPIO3)        H     6    UART5_TX, EINT6
GPIO4               H     8    EINT8
PWM1 (GPIO5)        B     2    PWM0
PWM2 (GPIO6)        I     3    PWM1
GPIO7               H     9    EINT9
GPIO8               H     10   EINT10
PWM3 (GPIO9)        H     5    EINT5
SPI_CS (GPIO10)     I     10   SPI0_CS0, UART5_TX, EINT22
SPI_MOSI (GPIO11)   I     12   SPI0_MOSI, UART6_TX, CLK_OUT_A, EINT24
SPI_MISO (GPIO12)   I     13   SPI0_MISO, UART6_RX, CLK_OUT_B, EINT25
SPI_CLK (GPIO13)    I     11   SPI0_CLK, UART5_RX, EINT23


11.3 Chapter Summary

All input and output is accomplished through devices. There are many types of devices, and each device has its own set of registers which are used to control it. The programmer must understand the operation of the device and the use of each register in order to use the device at a low level. Computer system manufacturers usually provide documentation containing the information necessary for low-level programming. The quality of that documentation varies greatly, and a general understanding of the various types of devices can help in deciphering poor or incomplete documentation.

There are two major tasks that require programming devices at the register level: writing operating system drivers and programming very small embedded systems. Operating systems provide an abstract view of each device through drivers, which allows programmers to use devices more easily. However, someone must write each driver, and that person must have intimate knowledge of the device. On very small systems, there may be no driver available; in that case, the device must be accessed directly. Even when an operating system provides a driver, it is sometimes necessary or desirable for the programmer to access the device directly. For example, some devices may provide modes of operation or capabilities that are not supported by the operating system driver. Linux provides a mechanism which allows the programmer to map a physical device into the program’s memory space, thereby gaining access to the raw device registers.

Exercises

11.1 Explain the relationships and differences between device registers, memory locations, and CPU registers.

11.2 Why is it necessary to map the device into user program memory before accessing it under Linux? Would this step be necessary under all operating systems or in the case where there is no operating system and our code is running on the “bare metal?”

11.3 What is the purpose of a GPIO device?

11.4 The Raspberry Pi and the pcDuino have very different GPIO devices.

(a) Are they functionally equivalent?

(b) Are they equally programmer-friendly?

(c) If you have answered no to either of the previous questions, then what are the differences?

11.5 Draw a circuit diagram showing how to connect:

(a) a pushbutton switch to GPIO 23 and an LED to GPIO 27 on the Raspberry Pi, and

(b) a pushbutton switch to GPIO 12 and an LED to GPIO 13 on the pcDuino.

11.6 Assuming the systems are wired according to the previous exercise, write two functions. One function must initialize the GPIO pins, and the other function must read the state of the switch and turn the LED on if the button is pressed, and off if the button is not pressed. Write the two functions for

(a) a Raspberry Pi, and

(b) a pcDuino.

11.7 Write the code necessary to route the output from PWM0 to GPIO 18 on a Raspberry Pi.

11.8 Write the code necessary to route the output from PWM0 to GPIO 5 on a PcDuino.

Chapter 12

Pulse Modulation

Abstract

This chapter begins by explaining pulse density and pulse width modulation in general terms. It then introduces and describes the PWM device on the Raspberry Pi. Following that, it covers the pcDuino PWM device. This gives the reader another opportunity to see two different devices which both perform essentially the same functions.

Keywords

Pulse width modulation; Pulse density modulation; Digital to analog; Low pass filter

The GPIO device provides a method for sending digital signals to external devices. This is useful for controlling devices that have only two states: on and off. In some situations, however, it is useful to be able to turn a device on at varying levels. For instance, it may be necessary to run a motor at any required speed, or to control the brightness of a light source. One way this can be accomplished is through pulse modulation.

The basic idea is that the computer sends a stream of pulses to the device. The device acts as a low-pass filter, which averages the digital pulses into an analog voltage. By varying the percentage of time that the pulses are high versus low, the computer can control how much average energy is sent to the device. That percentage is known as the duty cycle, and varying the duty cycle is referred to as modulation. There are two major types of pulse modulation: pulse density modulation (PDM) and pulse width modulation (PWM). Most pulse modulation devices are configured in three steps as follows:

1. The base frequency of the clock that drives the PWM device is configured. This step is usually optional.

2. The mode of operation for the pulse modulation device is configured by writing to one or more configuration registers in the pulse modulation device.

3. The cycle time is set by writing a “range” value into a register in the pulse modulation device. This value is usually set as a multiple of the base clock cycle time.

Once the device is configured, the duty cycle can be changed easily by writing to one or more registers in the pulse modulation device.

12.1 Pulse Density Modulation

With PDM, also known as pulse frequency modulation (PFM), the duration of the positive pulses does not change, but the time between them (the pulse density) is modulated. When using PDM devices, the programmer typically sets the device cycle time tc in a register, then uses another register to specify the number of pulses d that are to be sent during a device cycle. The number of pulses is typically referred to as the duty cycle and must be chosen such that 0 ≤ d ≤ tc. For instance, if tc = 1024, then the device cycle time is 1024 times the cycle time of the clock that drives the device. If d = 512, then the device will send 512 pulses, evenly spaced, during the device cycle. Each pulse will have the same duration as one base clock cycle. The device will continue to output this pulse pattern until d is changed.

Fig. 12.1 shows a signal that is being sent using PDM, and the resulting set of pulses. Each pulse transfers a fixed amount of energy to the device. When the pulses arrive at the device, they are effectively filtered using a low pass filter. The resulting received signal is also shown. Notice that the received signal has a delay, or phase shift, caused by the low-pass filtering. This approach is suitable for controlling certain types of devices, such as lights and speakers.

Figure 12.1 Pulse density modulation.

However, when driving such devices directly with the digital pulses, care must be taken that the minimum frequency of pulses remains above the threshold that can be detected by human senses. For instance, when driving a speaker, the minimum pulse frequency must be high enough that the individual pulses cannot be distinguished by the human ear; this minimum frequency is around 40 kHz. Likewise, when driving an LED directly, the minimum frequency must be high enough that the eye cannot detect the individual pulses; otherwise they will be seen as flicker. That minimum frequency is around 70 Hz. To reduce or eliminate this problem, designers may add a low-pass filter between the PWM device and the device that is being driven.

12.2 Pulse Width Modulation

In PWM, the frequency of the pulses remains fixed, but the duration of the positive pulse (the pulse width) is modulated. When using PWM devices, the programmer typically sets the device cycle time tc in a register, then uses another register to specify the number of base clock cycles, d, for which the output should be high. The percentage (d / tc) × 100 is typically referred to as the duty cycle, and d must be chosen such that 0 ≤ d ≤ tc. For instance, if tc = 1024, then the device cycle time is 1024 times the cycle time of the clock that drives the device. If d = 512, then the device will output a high signal for 512 clock cycles, then output a low signal for 512 clock cycles. It will continue to repeat this pattern of pulses until d is changed.

Fig. 12.2 shows a signal that is being sent using PWM. The pulses are also shown. Each pulse transfers some energy to the device. The width of each pulse determines how much energy is transferred. When the pulses arrive at the device, they are effectively filtered using a low-pass filter. The resulting received signal is shown by the dashed line. As with PDM, the received signal has a delay, or phase shift, caused by the low-pass filtering.

Figure 12.2 Pulse width modulation.

One advantage of PWM over PDM is that the digital circuit is not as complex. Another advantage of PWM over PDM is that the frequency of the pulses does not vary, so it is easier for the programmer to set the base frequency high enough that the individual pulses cannot be detected by human senses. Also, when driving motors it is usually necessary to match the pulse frequency to the size and type of motor. Mismatching the frequency can cause loss of efficiency as well as overheating of the motor and drive electronics. In severe cases, this can cause premature failure of the motor and/or drive electronics. With PWM, it is easier for the programmer to control the base frequency, and thereby avoid those problems.

12.3 Raspberry Pi PWM Device

The Broadcom BCM2835 system-on-chip includes a device that can create two PWM signals. One of the signals (PWM0) can be routed through GPIO pin 18 (alternate function 5), where it is available on the Raspberry Pi expansion header at pin 12. PWM0 can also be routed through GPIO pin 40; on the Raspberry Pi, that signal is sent through a low-pass filter and then to the audio output port as the right stereo channel. The other signal (PWM1) can be routed through GPIO pin 45, from which it is sent through a low-pass filter and then to the audio output port as the left stereo channel. So, both PWM channels are accessible, but PWM1 is only accessible through the audio output port after it has been low-pass filtered; only the raw PWM0 signal is available on the expansion header.

There are three modes of operation for the BCM2835 PWM device:

1. PDM mode,

2. PWM mode, and

3. serial transmission mode.

The following paragraphs explain how the device can be used in basic PWM mode, which is the simplest and most straightforward mode for this device. Information on how to use the PDM and serial transmission modes, the FIFO, and DMA is available in the BCM2835 ARM Peripherals manual.

The base address of the PWM device is 2020C000₁₆, and it contains eight registers. Table 12.1 shows the offset, name, and a short description of each register. The mode of operation is selected for each channel independently by writing the appropriate bits in the PWMCTL register. The base clock frequency is controlled by the clock manager device, which is explained in Section 13.1. By default, the system startup code sets the base clock for the PWM device to 100 MHz.

Table 12.1

Raspberry Pi PWM register map

Offset  Name     Description            Size  R/W
00₁₆    PWMCTL   PWM Control            32    R/W
04₁₆    PWMSTA   PWM FIFO Status        32    R/W
08₁₆    PWMDMAC  PWM DMA Configuration  32    R/W
10₁₆    PWMRNG1  PWM Channel 1 Range    32    R/W
14₁₆    PWMDAT1  PWM Channel 1 Data     32    R/W
18₁₆    PWMFIF1  PWM FIFO Input         32    R/W
20₁₆    PWMRNG2  PWM Channel 2 Range    32    R/W
24₁₆    PWMDAT2  PWM Channel 2 Data     32    R/W


Table 12.2 shows the names and short descriptions of the bits in the PWMCTL register. There are 8 bits used for controlling channel 1 and 8 bits for controlling channel 2. PWENn is the master enable bit for channel n. Setting that bit to 0 disables the PWM channel, while setting it to 1 enables the channel. MODEn is used to select whether the channel is in serial transmission mode or in the PDM/PWM mode. If MODEn is set to 0, then MSENn is used to choose whether channel n is in PDM mode or PWM mode. If MODEn is set to 1, then RPTLn, SBITn, USEFn, and CLRFn are used to manage the operation of the FIFO for channel n. POLAn is used to enable or disable inversion of the output signal for channel n.

Table 12.2

Raspberry Pi PWM control register bits

Bit    Name    Description            Values
0      PWEN1   Channel 1 Enable       0: Channel is disabled; 1: Channel is enabled
1      MODE1   Channel 1 Mode         0: PDM or PWM mode; 1: Serial mode
2      RPTL1   Channel 1 Repeat Last  0: Transmission stops when FIFO empty; 1: Last data are sent repeatedly
3      SBIT1   Channel 1 Silence Bit  0: Output goes low when not transmitting; 1: Output goes high when not transmitting
4      POLA1   Channel 1 Polarity     0: 0 is low voltage and 1 is high voltage; 1: 1 is low voltage and 0 is high voltage
5      USEF1   Channel 1 Use FIFO     0: Data register is used; 1: FIFO is used
6      CLRF1   Channel 1 Clear FIFO   Write 0: No effect; Write 1: Causes FIFO to be emptied
7      MSEN1   Channel 1 PWM Enable   0: PDM mode; 1: PWM mode
8      PWEN2   Channel 2 Enable       0: Channel is disabled; 1: Channel is enabled
9      MODE2   Channel 2 Mode         0: PDM or PWM mode; 1: Serial mode
10     RPTL2   Channel 2 Repeat Last  0: Transmission stops when FIFO empty; 1: Last data are sent repeatedly
11     SBIT2   Channel 2 Silence Bit  0: Output goes low when not transmitting; 1: Output goes high when not transmitting
12     POLA2   Channel 2 Polarity     0: 0 is low voltage and 1 is high voltage; 1: 1 is low voltage and 0 is high voltage
13     USEF2   Channel 2 Use FIFO     0: Data register is used; 1: FIFO is used
14     Unused  Reserved
15     MSEN2   Channel 2 PWM Enable   0: PDM mode; 1: PWM mode
16–31  Unused  Reserved


The PWMRNGn registers are used to define the base period for the corresponding channel. In PDM mode, evenly distributed pulses are sent within a period of length defined by this register, and the number of pulses sent during the base period is controlled by writing to the corresponding PWMDATn register. In PWM mode, the PWMRNGn register defines the base frequency for the pulses, and the duty cycle is controlled by writing to the corresponding PWMDATn register. Example 12.1 gives an overview of the steps needed to configure PWM0 for use in PWM mode.

Example 12.1

Example of Determining Clock Values on the Raspberry Pi

Suppose we wish to use PWM0 to perform PWM with a base frequency of 100 kHz and the ability to control the duty cycle with a resolution of 0.1%. The steps would be as follows:

1. Verify that the clock manager device is configured to send a 100 MHz clock to the pulse modulator device through PWM_CLK.

2. To obtain a frequency of 100 kHz from a 100-MHz clock, it is necessary to divide by 1000. Therefore the second step is to store 1000 in the PWMRNG1 register.

3. Before enabling the PWM channel, it is prudent to initialize the duty cycle. The safest initial value is 0%, or completely off. This is accomplished by writing zero to the PWMDAT1 register.

4. Enable PWM channel 1 to operate in PWM mode by setting bit zero (PWEN1) of PWMCTL to 1, bit one (MODE1) to 0, bit five (USEF1) to 0, and bit seven (MSEN1) to 1.

Once this initialization is performed, we can set or change the duty cycle at any time by writing a value between 0 and 1000 to the PWMDAT1 register.

12.4 pcDuino PWM Device

The AllWinner A10/A20 SOCs have a hardware PWM device which is capable of generating two PWM signals. The PWM device is driven by the OSC24M signal, which is generated by the Clock Control Unit (CCU) in the AllWinner SOC. This base clock runs at 24 MHz by default, and changing the base frequency could affect many other devices in the system. The base clock can be divided by one of 11 predefined values using a prescaler built into the PWM device. Each of the two channels has its own prescaler. Table 12.3 shows the possible settings for the prescalers.

Table 12.3

Prescaler bits in the pcDuino PWM device

Value             Effect
0000              Base clock is divided by 120
0001              Base clock is divided by 180
0010              Base clock is divided by 240
0011              Base clock is divided by 360
0100              Base clock is divided by 480
0101, 0110, 0111  Not used
1000              Base clock is divided by 1200
1001              Base clock is divided by 2400
1010              Base clock is divided by 3600
1011              Base clock is divided by 4800
1100              Base clock is divided by 7200
1101, 1110        Not used
1111              Base clock is divided by 1

There are two modes of operation for the PWM device. In the first mode, the device operates as a standard PWM device, as described in Section 12.2. In the second mode, it sends a single pulse and then waits until it is triggered again by the CPU; in this mode, it acts as a monostable multivibrator, also known as a one-shot multivibrator, or just one-shot. The duration of the pulse is controlled using the prescaler and the period register.

The PWM device is mapped at address 01C20C00₁₆. Table 12.4 shows the registers and their offsets from the base address. All of the device configuration is done through a single control register, which can also be read in order to determine the status of the device. The bits in the control register are shown in Table 12.5.

Table 12.4

pcDuino PWM register map

Offset  Name            Description
200₁₆   PWMCTL          PWM Control
204₁₆   PWM_CH0_PERIOD  PWM Channel 0 Period
208₁₆   PWM_CH1_PERIOD  PWM Channel 1 Period

Table 12.5

pcDuino PWM control register bits

Bit    Name             Description         Values
3–0    CH0_PRESCAL      Channel 0 Prescale  These bits must be set before the PWM Channel 0 clock is enabled. See Table 12.3.
4      CH0_EN           Channel 0 Enable    0: Channel disabled; 1: Channel enabled
5      CH0_ACT_STA      Channel 0 Polarity  0: Channel is active low; 1: Channel is active high
6      SCLK_CH0_GATING  Channel 0 Clock     0: Clock disabled; 1: Clock enabled
7      CH0_PUL_START    Start Pulse         If configured for pulse mode, writing a 1 causes the PWM device to emit a single pulse.
8      PWM0_BYPASS      Bypass PWM          0: Output PWM device signal; 1: Output base clock
9      SCLK_CH0_MODE    Select Mode         0: PWM mode; 1: Pulse mode
14–10  Unused
18–15  CH1_PRESCAL      Channel 1 Prescale  These bits must be set before the PWM Channel 1 clock is enabled. See Table 12.3.
19     CH1_EN           Channel 1 Enable    0: Channel disabled; 1: Channel enabled
20     CH1_ACT_STA      Channel 1 Polarity  0: Channel is active low; 1: Channel is active high
21     SCLK_CH1_GATING  Channel 1 Clock     0: Clock disabled; 1: Clock enabled
22     CH1_PUL_START    Start Pulse         If configured for pulse mode, writing a 1 causes the PWM device to emit a single pulse.
23     PWM1_BYPASS      Bypass PWM          0: Output PWM device signal; 1: Output base clock
24     SCLK_CH1_MODE    Select Mode         0: PWM mode; 1: Pulse mode
27–25  Unused
28     PWM0_RDY         CH0 Period Ready    0: PWM0 Period register is ready; 1: PWM0 Period register is busy
29     PWM1_RDY         CH1 Period Ready    0: PWM1 Period register is ready; 1: PWM1 Period register is busy
31–30  Unused


Before enabling a PWM channel, the period register for that channel should be initialized. The two period registers are each organized as two 16-bit numbers. The upper 16 bits control the total number of clock cycles in one period. In other words, they control the base frequency of the PWM signal. The PWM frequency is calculated as

f = OSC24M / (PSC × (N + 1)),

where OSC24M is the frequency of the base clock (the default is 24 MHz), PSC is the prescale value set in the channel prescale bits of the PWM control register, and N is the value stored in the upper 16 bits of the channel period register. For example, with the default 24-MHz clock, choosing PSC = 240 and N = 99 gives f = 24,000,000 / (240 × 100) = 1 kHz.

The lower 16 bits of the channel period register control the duty cycle. The duty cycle (expressed as % of full on) can be calculated as

d = (D / N) × 100,

where N is the value stored in the upper 16 bits of the channel period register, and D is the value stored in the lower 16 bits of the channel period register. Note that the condition D ≤ N must always remain true. If the programmer allows D to become greater than N, the results are unpredictable.

The procedure for configuring the AllWinner A10/A20 PWM device is as follows:

1. Disable the desired channel:

(a) Read the PWM control register into x.

(b) Clear all of the bits in x for the desired PWM channel.

(c) Write x back to the PWM control register.

2. Initialize the period register for the desired channel.

(a) Calculate the desired value for N.

(b) Let D = 0.

(c) Let y = N × 2¹⁶ + D.

(d) Write y to the desired channel period register.

3. Set the prescaler.

(a) Select the four-bit code for the desired divisor from Table 12.3.

(b) Set the prescaler code bits in x.

(c) Write x back to the PWM control register.

4. Enable the PWM device.

(a) Set the appropriate bits in x to enable the desired channel, select the polarity, and enable the clock.

(b) Write x to the PWM control register.

Once the control register is configured, the duty cycle can be controlled by calculating a new value for D and then writing y = N × 2¹⁶ + D to the desired channel period register.

12.5 Chapter Summary

Pulse modulation is a group of methods for generating analog signals using digital equipment, and is commonly used in control systems to regulate the power sent to motors and other devices. Pulse modulation techniques can have very low power loss compared to other methods of controlling analog devices, and the circuitry required is relatively simple.

The cycle frequency must be programmed to match the application. Typically, 10 Hz is adequate for controlling an electric heating element, while 120 Hz would be more appropriate for controlling an incandescent light bulb. Large electric motors may be controlled with a cycle frequency as low as 100 Hz, while smaller motors may need frequencies around 10,000 Hz. It can take some experimentation to find the best frequency for any given application.

Exercises

12.1 Write ARM assembly programs to configure PWM0 and the GPIO device to send a signal out on Raspberry Pi header pin 12 with:

(a) period of 1 ms and duty cycle of 25%, and

(b) frequency of 150 Hz and duty cycle of 63%.

12.2 Write ARM assembly programs to configure PWM0 and the GPIO device to send a signal out on the pcDuino PWM1/GPIO5 pin with:

(a) period of 1 ms and duty cycle of 25%, and

(b) frequency of 150 Hz and duty cycle of 63%.

Chapter 13

Common System Devices

Abstract

This chapter briefly describes some of the devices which are present in most modern computer systems. It then describes in detail the clock management devices on the Raspberry Pi and the pcDuino. Next, it gives an explanation of asynchronous serial communications, and explains how there is some tolerance for mismatch between the clock rates of the transmitter and the receiver. It then explains the Universal Asynchronous Receiver/Transmitter (UART) device. Next it covers in detail the UART devices present on the Raspberry Pi and the pcDuino. Once again, the reader is given the opportunity to compare two different devices which perform almost precisely the same functions.

Keywords

Universal asynchronous receiver/transmitter (UART); Clock manager; Serial communications; RS232

There are some classes of devices that are found in almost every system, including the smallest embedded systems. Such common devices include hardware for managing the clock signals sent to other devices, and serial communications (typically RS232). Most mid-sized or large systems also include devices for managing virtual memory, managing the cache, driving a display, interfacing with keyboard and mouse, accessing disk and other storage devices, and networking. Small embedded systems may have devices for converting analog signals to digital and vice versa, pulse width modulation, and other purposes. Some systems, such as the Raspberry Pi and pcDuino, have all or most of the devices of large systems, as well as most of the devices found on embedded systems. In this chapter, we look at two devices found on almost every system.

13.1 Clock Management Device

Very simple computer systems can be driven by a single clock. Most devices, including the CPU, are designed as state machines. The clock device sends a square-wave signal at a fixed frequency to all devices that need it. The clock signal tells the devices when to transition to the next state. Without the clock signal, none of the devices would do anything.

More complex computers may contain devices which need to run at different rates. This requires the system to have separate clock signals for each device (or group of devices). System designers often solve this problem by adding a clock manager device to the system. This device allows the programmer to configure the clock signals that are sent to the other devices in the system. Fig. 13.1 shows a typical system. The clock manager, just like any other device, is configured by the CPU writing data to its registers using the system bus.

Figure 13.1 Typical system with a clock management device.

13.1.1 Raspberry Pi Clock Manager

The BCM2835 system-on-chip contains an ARM CPU and several devices. Some of the devices need their own clock to drive their operation at the correct frequency. Some devices, such as serial communications receivers and transmitters, need configurable clocks so that the programmer has control over the speed of the device. To provide this flexibility and allow the programmer to have control over the clocks for each device, the BCM2835 includes a clock manager device, which can be used to configure the clock signals driving the other devices in the system.

The Raspberry Pi has a 19.2 MHz oscillator which can be used as a base frequency for any of the clocks. The BCM2835 also has three phase-locked-loop circuits that boost the oscillator to higher frequencies. Table 13.1 shows the frequencies that are available from various sources. Each device clock can be driven by one of the PLLs, the external 19.2 MHz oscillator, a signal from the HDMI port, or either of two test/debug inputs.

Table 13.1

Clock sources available for the clocks provided by the clock manager

Number  Name            Frequency  Note
0       GND             0 Hz       Clock is stopped
1       oscillator      19.2 MHz
2       testdebug0      Unknown    Used for system testing
3       testdebug1      Unknown    Used for system testing
4       PLLA            650 MHz    May not be available
5       PLLC            200 MHz    May not be available
6       PLLD            500 MHz
7       HDMI auxiliary  Unknown
8–15    GND             0 Hz       Clock is stopped


Among the clocks controlled by the clock manager device are the core clock (CM_VPU), the system timer clock (PM_TIME) which controls the speed of the system timer, the GPIO clocks which are documented in the Raspberry Pi peripheral documentation, the pulse modulator device clocks, and the serial communications clocks. It is generally not a good idea to modify the settings of any of the clocks without good reason.

The base address of the clock manager device is 20101000₁₆. Some of the clock manager registers are shown in Table 13.2. Each clock is managed by two registers: a control register and a divisor register. The control register is used to enable or disable a clock, to select which source oscillator drives the clock, and to select an optional multistage noise shaping (MASH) filter level. MASH filtering is useful for reducing the perceived noise when a clock is being used to generate an audio signal. In most cases, MASH filtering should not be used.

Table 13.2

Some registers in the clock manager device

Offset  Name         Description
070₁₆   CM_GP0_CTL   GPIO Clock 0 (GPCLK0) Control
074₁₆   CM_GP0_DIV   GPIO Clock 0 (GPCLK0) Divisor
078₁₆   CM_GP1_CTL   GPIO Clock 1 (GPCLK1) Control
07c₁₆   CM_GP1_DIV   GPIO Clock 1 (GPCLK1) Divisor
080₁₆   CM_GP2_CTL   GPIO Clock 2 (GPCLK2) Control
084₁₆   CM_GP2_DIV   GPIO Clock 2 (GPCLK2) Divisor
098₁₆   CM_PCM_CTL   Pulse Code Modulator Clock (PCM_CLK) Control
09c₁₆   CM_PCM_DIV   Pulse Code Modulator Clock (PCM_CLK) Divisor
0a0₁₆   CM_PWM_CTL   Pulse Modulator Device Clock (PWM_CLK) Control
0a4₁₆   CM_PWM_DIV   Pulse Modulator Device Clock (PWM_CLK) Divisor
0f0₁₆   CM_UART_CTL  Serial Communications Clock (UART_CLK) Control
0f4₁₆   CM_UART_DIV  Serial Communications Clock (UART_CLK) Divisor

Table 13.3 shows the meaning of the bits in the control registers for each of the clocks, and Table 13.4 shows the fields in the clock manager divisor registers. The procedure for configuring one of the clocks is:

Table 13.3

Bit fields in the clock manager control registers

Bit(s)  Name    Description
3–0     SRC     Clock source, chosen from Table 13.1
4       ENAB    Writing a 0 causes the clock to shut down. The clock will not stop immediately; the BUSY bit will be 1 while the clock is shutting down. When the BUSY bit becomes 0, the clock has stopped and it is safe to reconfigure it. Writing a 1 to this bit causes the clock to start
5       KILL    Writing a 1 to this bit will stop and reset the clock. This does not shut down the clock cleanly, and could cause a glitch in the clock output
6       –       Unused
7       BUSY    A 1 in this bit indicates that the clock is running
8       FLIP    Writing a 1 to this bit will invert the clock output. Do not change this bit while the clock is running
10–9    MASH    Controls how the clock source is divided. 00: integer division; 01: 1-stage MASH division; 10: 2-stage MASH division; 11: 3-stage MASH division. Do not change this field while the clock is running
23–11   –       Unused
31–24   PASSWD  This field must be set to 5A₁₆ every time the clock control register is written


Table 13.4

Bit fields in the clock manager divisor registers

Bit(s)  Name    Description
11–0    DIVF    Fractional part of divisor. Do not change this while the clock is running
23–12   DIVI    Integer part of divisor. Do not change this while the clock is running
31–24   PASSWD  This field must be set to 5A₁₆ every time the clock divisor register is written

1. Read the desired clock control register.

2. Clear bit 4 in the word that was read, then OR it with 5A000000₁₆ and store the result back to the desired clock control register.

3. Repeatedly read the desired clock control register, until bit 7 becomes 0.

4. Calculate the divisor required and store it into the desired clock divisor register.

5. Create a word to configure and start the clock. Begin with 5A000000₁₆, and set bits 3–0 to select the desired clock source. Set bits 10–9 to select the type of division, and set bit 4 to 1 to enable the clock.

6. Store the control word into the desired clock control register.

Selection of the divisor depends on which clock source is used, what type of division is selected, and the desired output frequency of the clock being configured. For example, to set the PWM clock to 100 kHz, the 19.2 MHz oscillator can be used. Dividing that clock by 192 will produce a 100 kHz clock. To accomplish this, it is necessary to stop the PWM clock as described, store the value 5A0C0000₁₆ in the PWM clock divisor register, and then start the clock by writing 5A000011₁₆ into the PWM clock control register.

13.1.2 pcDuino Clock Control Unit

The AllWinner A10/A20 SOCs have a relatively simple clock manager, which is referred to as the Clock Control Unit (CCU). All of the clock signals in the system are derived from two crystal oscillators: the main oscillator, which runs at 24 MHz, and the real-time-clock oscillator, which runs at 32,768 Hz. The real-time-clock oscillator is used only to provide a signal to the real-time-clock device.

The main clock oscillator drives many of the devices in the system, but there are seven phase-locked-loop circuits in the CCU which provide signals for devices which need clocks that are faster or slower than 24 MHz. Table 13.5 shows which devices are driven by the nine clock signals.

Table 13.5

Clock signals in the AllWinner A10/A20 SOC

Clock Domain  Modules         Frequency       Description
OSC24M        Most modules    24 MHz          Main clock
CPU32_clk     CPU             2 kHz–1.2 GHz   Drives CPU
AHB_clk       AHB devices     8 kHz–276 MHz   Drives some devices
APB_clk       Peripheral bus  500 Hz–138 MHz  Drives some devices
SDRAM_clk     SDRAM           0 Hz–400 MHz    Drives SDRAM memory
USB_clk       USB             480 MHz         Drives USB devices


13.2 Serial Communications

There are basically two methods for transferring data between two digital devices: parallel and serial. Parallel connections use multiple wires to carry several bits at one time, typically including extra wires to carry timing information. Parallel communications are used for transferring large amounts of data over very short distances. However, this approach becomes very expensive when data must be transferred more than a few meters. Serial, on the other hand, uses a single wire to transfer the data bits one at a time. When compared to parallel transfer, the speed of serial transfer typically suffers. However, because it uses significantly fewer wires, the distance may be greatly extended, reliability improved, and cost vastly reduced.

13.2.1 UART

One of the oldest and most common devices for communications between computers and peripheral devices is the Universal Asynchronous Receiver/Transmitter, or UART. The word “universal” indicates that the device is highly configurable and flexible. UARTs allow a receiver and transmitter to communicate without a synchronizing signal.

The logic signal produced by the digital UART typically oscillates between zero volts for a low level and five volts for a high level, and the amount of current that the UART can supply is limited. For transmitting the data over long distances, the signals may go through a level-shifting or amplification stage. The circuit used to accomplish this is typically called a line driver. This circuit boosts the signal provided by the UART and also protects the delicate digital outputs from short circuits and signal spikes. Various standards, such as RS-232, RS-422, and RS-485 define the voltages that the line driver uses. For example, the RS-232 standard specifies that valid signals are in the range of + 3 to + 15 V, or − 3 to − 15 V. The standards also specify the maximum time that is allowable when shifting from a high signal to a low signal and vice versa, the amount of current that the device must be capable of sourcing and sinking, and other relevant design criteria.

The UART transmits data by sending each bit sequentially. The receiving UART re-assembles the bits into the original data. Fig. 13.2 shows how the transmitting UART converts a byte of data into a serial signal, and how the receiving UART samples the signal to recover the original data. Serializing and reassembling the data are accomplished using shift registers. The receiver and transmitter each have their own clocks, which are configured to run at the same speed (or close to the same speed). In the figure, the receiver's clock is running slightly slower than the transmitter's clock, but the data are still received correctly.

Figure 13.2 Transmitter and receiver timings for two UARTS. (A) Waveform of a UART transmitting a byte. (B) Timing of UART receiving a byte.

To transfer a group of bits, called a data frame, the transmitter first sends a start bit. Most UARTs can be configured to transfer between five and eight data bits in each group. The transmitting and receiving UARTs must be configured to use the same number of data bits. After each group of data bits, the transmitter will return the signal to the low state and keep it there for some minimum period. This period is usually the time that it would take to send two bits of data, and is referred to as the two stop bits. The stop bits give the receiver some time to process the received byte and prepare for the next start bit. Fig. 13.2A shows what a typical RS-232 signal would look like when transferring the value 56₁₆ (the ASCII "V" character). The UART enters the idle state only if there is not another byte immediately ready to send. If the transmitter has another byte to send, then the next start bit can begin at the end of the second stop bit.

Note that it is impossible to ensure that the receiver and transmitter have clocks which are running at exactly the same speed, unless they use the same clock signal. Fig. 13.2B shows how the receiver can reassemble the original data, even with a slightly different clock rate. When the start bit is detected by the receiver, it prepares to receive the data bits, which will be sent by the transmitter at an expected rate (within some tolerance). The receive circuitry of most UARTs is driven by a clock that runs 16 times as fast as the baud rate. The receive circuitry uses this faster clock to latch each bit in the middle of its expected time period. In Fig. 13.2B, the receiver clock is running slower than the transmitter clock. By the end of the data frame, the sample time is very far from the center of the bit, but the correct value is still received. If the clocks differed by much more, or if more than eight data bits were sent, then it is very likely that incorrect data would be received. Thus, as long as their clocks are synchronized within some tolerance (which is dependent on the number of data bits and the baud rate), the data will be received correctly.

The RS-232 standard allows point-to-point communication between two devices for limited distances. With the RS-232 standard, simple one-way communications can be accomplished using only two wires: One to carry the serial bits, and another to provide a common ground. For bi-directional communication, three wires are required. In addition, the RS-232 standard specifies optional hand-shaking signals, which the UARTs can use to signal their readiness to transmit or receive data. The RS-422 and RS-485 standards allow multiple devices to be connected using only two wires.

The first UART device to enjoy widespread use was the 8250. The original version had 12 registers for configuration, sending, and receiving data. The most important registers are the ones that allow the programmer to set the transmit and receive bit rates, or baud. One baud is one bit per second. The baud is set by storing a 16-bit divisor in two of the registers in the UART. The chip is driven by an external clock, and the divisor is used to reduce the frequency of the external clock to a frequency that is appropriate for serial communication. For example, if the external clock runs at 1 MHz, and the required baud is 1200, then the divisor must be 833.33…. Note that the divisor can only be an integer, so the device cannot achieve exactly 1200 baud. However, as explained previously, the sending and receiving devices do not have to agree precisely on the baud. During the transmission and reception of a byte, 1200.48 baud is close enough that the bits will be received correctly even if the other end is running slightly below 1200 baud. In the 8250, there was only one 8-bit register for sending data and only one 8-bit register for receiving data. The UART could send an interrupt to the CPU after each byte was transmitted or received. When receiving, the CPU had to respond to the interrupt very quickly. If the current byte was not read quickly enough by the CPU, it would be overwritten by the subsequent incoming byte. When transmitting, the CPU needed to respond quickly to interrupts to provide the next byte to be sent, or the transmission rate would suffer.

The next generation of UART device was the 16550A. This device is the model for most UART devices today. It features 16-byte input and output buffers and the ability to trigger interrupts when a buffer is partially full or partially empty. This allows the CPU to move several bytes of data at a time and results in much lower CPU overhead and much higher data transmission and reception rates. The 16550A also supports much higher baud rates than the 8250.

13.2.2 Raspberry Pi UART0

The BCM2835 system-on-chip provides two UART devices: UART0 and UART1. UART1 is a "mini UART" that is part of the auxiliary peripherals block, and is not recommended for use as a UART. UART0 is a PL011 UART, which is based on the industry standard 16550A UART. The major differences are that the PL011 allows greater flexibility in configuring the interrupt trigger levels, the registers appear in different locations, and the locations of bits in some of the registers are different. So, although it operates very much like a 16550A, things have been moved to different locations. The transmit and receive lines can be routed through GPIO pin 14 and GPIO pin 15, respectively. UART0 has 18 registers, starting at its base address of 20201000₁₆. Table 13.6 shows the name, location, and a brief description for each of the registers.

Table 13.6

Raspberry Pi UART0 register map

Offset  Name         Description
00₁₆    UART_DR      Data Register
04₁₆    UART_RSRECR  Receive Status Register/Error Clear Register
18₁₆    UART_FR      Flag Register
20₁₆    UART_ILPR    Not in use
24₁₆    UART_IBRD    Integer Baud Rate Divisor
28₁₆    UART_FBRD    Fractional Baud Rate Divisor
2c₁₆    UART_LCRH    Line Control Register
30₁₆    UART_CR      Control Register
34₁₆    UART_IFLS    Interrupt FIFO Level Select Register
38₁₆    UART_IMSC    Interrupt Mask Set/Clear Register
3c₁₆    UART_RIS     Raw Interrupt Status Register
40₁₆    UART_MIS     Masked Interrupt Status Register
44₁₆    UART_ICR     Interrupt Clear Register
48₁₆    UART_DMACR   DMA Control Register
80₁₆    UART_ITCR    Test Control Register
84₁₆    UART_ITIP    Integration Test Input Register
88₁₆    UART_ITOP    Integration Test Output Register
8c₁₆    UART_TDR     Test Data Register

UART_DR: The UART Data Register is used to send and receive data, one byte at a time. Writing to this register adds a byte to the transmit FIFO. Although the register is 32 bits wide, only the 8 least significant bits are used for transmission, and the 12 least significant bits are used for reception. If the FIFO is empty, then the UART will begin transmitting the byte immediately. If the FIFO is full, then the last byte in the FIFO will be overwritten with the new byte that is written to the Data Register. When this register is read, it returns the byte at the top of the receive FIFO, along with four additional status bits that indicate whether any errors were encountered. Table 13.7 specifies the names and use of the bits in the UART Data Register.

Table 13.7

Raspberry Pi UART data register

Bit(s)  Name  Description    Values
7–0     DATA  Data           Read: last data received. Write: data byte to transmit
8       FE    Framing error  0: no error; 1: the received character did not have a valid stop bit
9       PE    Parity error   0: no error; 1: the received character did not have the correct parity, as set in the EPS and SPS bits of the Line Control Register (UART_LCRH)
10      BE    Break error    0: no error; 1: a break condition was detected; the data input line was held low for longer than the time it would take to receive a complete byte, including the start and stop bits
11      OE    Overrun error  0: no error; 1: data was not read quickly enough, and one or more bytes were overwritten in the input buffer
31–12   –     Not used       Write as zero, read as don't care


UART_RSRECR: The UART Receive Status Register/Error Clear Register is used to check the status of the byte most recently read from the UART Data Register, and to check for overrun conditions at any time. The status information for overrun is set immediately when an overrun condition occurs. The Receive Status Register/Error Clear Register provides the same four status bits as the Data Register (but in bits 3–0 rather than bits 11–8). The received data character must be read first from the Data Register, before reading the error status associated with that data character from the RSRECR register. Since the Data Register also contains these 4 bits, this register may not be required, depending on how the software is written. Table 13.8 describes the bits in this register.

Table 13.8

Raspberry Pi UART receive status register/error clear register

Bit(s)  Name  Description    Values
0       FE    Framing error  0: no error; 1: the received character did not have a valid stop bit
1       PE    Parity error   0: no error; 1: the received character did not have the correct parity, as set in the EPS and SPS bits of the Line Control Register (UART_LCRH)
2       BE    Break error    0: no error; 1: a break condition was detected; the data input line was held low for longer than the time it would take to receive a complete byte, including the start and stop bits
3       OE    Overrun error  0: no error; 1: data was not read quickly enough, and one or more bytes were overwritten in the input buffer
31–4    –     Not used       Write as zero, read as don't care


UART_FR: The UART Flag Register can be read to determine the status of the UART. The bits in this register are used mainly when sending and receiving data using the FIFOs. When several bytes need to be sent, the TXFF flag should be checked to ensure that the transmit FIFO is not full before each byte is written to the data register. When receiving data, the RXFE bit can be used to determine whether or not there is more data to be read from the FIFO. Table 13.9 describes the flags in this register.

Table 13.9

Raspberry Pi UART flags register bits

Bit(s)  Name  Description          Values
0       CTS   Clear To Send        0: sender indicates they are ready to receive; 1: sender is NOT ready to receive
1       DSR   Data Set Ready       Not implemented: write as zero, read as don't care
2       DCD   Data Carrier Detect  Not implemented: write as zero, read as don't care
3       BUSY  UART is busy         0: UART is not transmitting data; 1: UART is transmitting a byte
4       RXFE  Receive FIFO Empty   0: receive FIFO contains bytes that have been received; 1: receive FIFO is empty
5       TXFF  Transmit FIFO Full   0: there is room for at least one more byte in the transmit FIFO; 1: transmit FIFO is full, so do not write to the data register at this time
6       RXFF  Receive FIFO Full    0: there is still some space in the receive FIFO; 1: there is no more room in the receive FIFO
7       TXFE  Transmit FIFO Empty  0: there is at least one byte waiting to be transmitted; 1: there are no bytes waiting to be transmitted
8       RI    Ring Indicator       Not implemented: write as zero, read as don't care
31–9    –     Not used             Write as zero, read as don't care


UART_ILPR: This is the IrDA register, which is supported by some PL011 UARTs. IrDA stands for the Infrared Data Association, which is a group of companies that cooperate to provide specifications for a complete set of protocols for wireless infrared communications. The name “IrDA” also refers to that set of protocols. IrDA is not implemented on the Raspberry Pi UART. Writing to this register has no effect and reading returns 0.

UART_IBRD and UART_FBRD: UART_FBRD is the fractional part of the baud rate divisor value, and UART_IBRD is the integer part. The baud rate divisor is calculated as follows:

BAUDDIV = UARTCLK / (16 × Baud rate)        (13.1)

where UARTCLK is the frequency of the UART_CLK that is configured in the Clock Manager device. The default value is 3 MHz. BAUDDIV is stored in two registers. UART_IBRD holds the integer part and UART_FBRD holds the fractional part. Thus BAUDDIV should be calculated as a U(16,6) fixed point number. The contents of the UART_IBRD and UART_FBRD registers may be written at any time, but the change will not have any effect until transmission or reception of the current character is complete. Table 13.10 shows the arrangement of the integer baud rate divisor register, and Table 13.11 shows the arrangement of the fractional baud rate divisor register.

Table 13.10

Raspberry Pi UART integer baud rate divisor

Bit(s)  Name  Description                Values
15–0    IBRD  Integer Baud Rate Divisor  See Eq. (13.1)
31–16   –     Not used                   Write as zero, read as don't care


Table 13.11

Raspberry Pi UART fractional baud rate divisor

Bit(s)  Name  Description                   Values
5–0     FBRD  Fractional Baud Rate Divisor  See Eq. (13.1)
31–6    –     Not used                      Write as zero, read as don't care


UART_LCRH: UART_LCRH is the line control register. It is used to configure the communication parameters. This register must not be changed until the UART is disabled by writing zero to bit 0 of UART_CR, and the BUSY flag in UART_FR is clear. Table 13.12 shows the layout of the line control register.

Table 13.12

Raspberry Pi UART line control register bits

Bit(s)  Name  Description         Values
0       BRK   Send Break          0: normal operation; 1: after the current character is sent, take the TXD output to a low level and keep it there
1       PEN   Parity Enable       0: parity checking and generation is disabled; 1: generate and send parity bit and check parity on received data
2       EPS   Even Parity Select  0: odd parity; 1: even parity
3       STP2  Two Stop Bits       0: send one stop bit for each data word; 1: send two stop bits for each data word
4       FEN   FIFO Enable         0: transmit and receive FIFOs are disabled; 1: transmit and receive FIFOs are enabled
6–5     WLEN  Word Length         00: 5 bits per data word; 01: 6 bits; 10: 7 bits; 11: 8 bits
31–7    –     Not used            Write as zero, read as don't care


UART_CR: The UART Control Register is used for configuring, enabling, and disabling the UART. Table 13.13 shows the layout of the control register. To enable transmission, the TXE bit and UARTEN bit must be set to 1. To enable reception, the RXE bit and UARTEN bit must be set to 1. In general, the following steps should be used to configure or re-configure the UART:

Table 13.13

Raspberry Pi UART control register bits

Bit(s)  Name    Description             Values
0       UARTEN  UART Enable             0: UART disabled; 1: UART enabled
1       SIREN   Not used                Write as zero, read as don't care
2       SIRLP   Not used                Write as zero, read as don't care
6–3     –       Not used                Write as zero, read as don't care
7       LBE     Loopback Enable         0: loopback disabled; 1: loopback enabled, so transmitted data is also fed back to the receiver
8       TXE     Transmit Enable         0: transmitter is disabled; 1: transmitter is enabled
9       RXE     Receive Enable          0: receiver is disabled; 1: receiver is enabled
10      DTR     Not used                Write as zero, read as don't care
11      RTS     Complement of nUARTRTS
12      OUT1    Not used                Write as zero, read as don't care
13      OUT2    Not used                Write as zero, read as don't care
14      RTSEN   RTS Enable              0: hardware RTS disabled; 1: hardware RTS enabled
15      CTSEN   CTS Enable              0: hardware CTS disabled; 1: hardware CTS enabled
31–16   –       Not used                Write as zero, read as don't care


(a) Disable the UART.

(b) Wait for the end of transmission or reception of the current character.

(c) Flush the transmit FIFO by setting the FEN bit to 0 in the Line Control Register.

(d) Reprogram the Control Register.

(e) Enable the UART.

Interrupt Control: The UART can signal the CPU by asserting an interrupt when certain conditions occur. This will be covered in more detail in Chapter 14. For now, it is enough to know that there are five additional registers which are used to configure and use the interrupt mechanism.
UART_IFLS defines the FIFO level that triggers the assertion of the interrupt signal. One interrupt is generated when the FIFO reaches the specified level. The CPU must clear the interrupt before another can be generated.
UART_IMSC is the interrupt mask set/clear register. It is used to enable or disable specific interrupts. This register determines which of the possible interrupt conditions are allowed to generate an interrupt to the CPU.
UART_RIS is the raw interrupt status register. It can be read to obtain the raw status of the interrupt conditions, before any masking is performed.
UART_MIS is the masked interrupt status register. It contains the masked status of the interrupts. This is the register that the operating system should use to determine the cause of a UART interrupt.
UART_ICR is the interrupt clear register. Writing to it clears the interrupt conditions. The operating system should use this register to clear interrupts before returning from the interrupt service routine.

UART_DMACR: The DMA control register is used to configure the UART to access memory directly, so that the CPU does not have to move each byte of data to or from the UART. DMA will be explained in more detail in Chapter 14.

Additional Registers: The remaining registers, UART_ITCR, UART_ITIP, and UART_ITOP, are either unimplemented or are used for testing the UART. These registers should not be used.

13.2.3 Basic Programming for the Raspberry Pi UART

Listing 13.1 shows four basic functions for initializing the UART, changing the baud rate, sending a character, and receiving a character using UART0 on the Raspberry Pi. Note that a large part of the code simply defines the location and offset for all of the registers (and bits) that can be used to control the UART.

Listing 13.1 Assembly functions for using the Raspberry Pi UART.

13.2.4 pcDuino UART

The AllWinner A10/A20 SOC includes eight UART devices. They are all fully compatible with the 16550A UART, and also provide some enhancements. All of them provide transmit (TX) and receive (RX) signals. UART0 has the full set of RS232 signals, including RTS, CTS, DTR, DSR, DCD, and RING. UART1 has the RTS and CTS signals. The remaining six UARTs only provide the TX and RX signals. They can all be configured for serial IrDA. Table 13.14 shows the base address for each of the eight UART devices.

Table 13.14

pcDuino UART addresses

Name   Address
UART0  0x01C28000
UART1  0x01C28400
UART2  0x01C28800
UART3  0x01C28C00
UART4  0x01C29000
UART5  0x01C29400
UART6  0x01C29800
UART7  0x01C29C00

When the 16550 UART was designed, 8-bit processors were common, and most of them provided only 16 address bits. Memory was typically limited to 64 kB, and every byte of address space was important. Because of these considerations, the designers of the 16550 decided to limit the number of addresses used to eight, and to use only eight bits of data per address. There are 12 registers in the 16550 UART, but some of them share the same address. For example, there are three registers mapped to an offset address of zero, two registers mapped at offset four, and two registers mapped at offset eight. Bit seven of the Line Control Register determines which of the registers is active at a given address.

Because they are meant to be fully backwards-compatible with the 16550, the AllWinner A10/A20 SOC UART devices also use only 8 bits for each register, and the first 12 registers correspond exactly to those of the 16550 UART. The only differences are that the pcDuino UARTs use word addresses rather than byte addresses, and that they provide four additional registers which are used for IrDA mode. Table 13.15 shows the arrangement of the registers in each of the eight UARTs on the pcDuino. The following sections explain the registers.

Table 13.15

pcDuino UART register offsets

Register Name  Offset  Description
UART_RBR       0x00    UART Receive Buffer Register
UART_THR       0x00    UART Transmit Holding Register
UART_DLL       0x00    UART Divisor Latch Low Register
UART_DLH       0x04    UART Divisor Latch High Register
UART_IER       0x04    UART Interrupt Enable Register
UART_IIR       0x08    UART Interrupt Identity Register
UART_FCR       0x08    UART FIFO Control Register
UART_LCR       0x0C    UART Line Control Register
UART_MCR       0x10    UART Modem Control Register
UART_LSR       0x14    UART Line Status Register
UART_MSR       0x18    UART Modem Status Register
UART_SCH       0x1C    UART Scratch Register
UART_USR       0x7C    UART Status Register
UART_TFL       0x80    UART Transmit FIFO Level Register
UART_RFL       0x84    UART Receive FIFO Level Register
UART_HALT      0xA4    UART Halt TX Register

The baud rate is set using a 16-bit Baud Rate Divisor, according to the following equation:

BAUDDIV = sclk / (16 × Baud rate)  (13.2)

where sclk is the frequency of the UART serial clock, which is configured by the Clock Manager device. The default frequency of the clock is 24 MHz. BAUDDIV is stored in two registers: UART_DLL holds the least significant 8 bits, and UART_DLH holds the most significant 8 bits. Thus BAUDDIV should be calculated as a 16-bit unsigned integer. Note that for high baud rates, it may not be possible to get exactly the rate desired. For example, a baud rate of 115200 would require a divisor of 13.02083…. Since the baud rate divisor can only be given as an integer, the desired rate must be based on a divisor of 13, so the true baud rate will be 24000000/(16 × 13) = 115384.615385, or about 0.16% faster than desired. Although slightly fast, it is well within the tolerance for RS232 communication.
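The divisor arithmetic above is easy to get wrong at the register level, so it is worth checking in code. The following C sketch is illustrative (the function names are this example's own, not part of any library):

```c
#include <stdint.h>

/* Compute the 16-bit baud rate divisor from Eq. (13.2).
   sclk is the UART serial clock in Hz (24 MHz by default on the pcDuino).
   Integer division truncates, exactly as the hardware requires. */
static uint16_t baud_divisor(uint32_t sclk, uint32_t baud)
{
    return (uint16_t)(sclk / (16u * baud));
}

/* The baud rate actually produced by a given divisor. */
static double actual_baud(uint32_t sclk, uint16_t div)
{
    return (double)sclk / (16.0 * div);
}
```

With sclk = 24 MHz, `baud_divisor(24000000, 115200)` yields 13, and `actual_baud(24000000, 13)` reproduces the 115384.6 figure computed above.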

UART_RBR: The UART Receive Buffer Register is used to receive data, 1 byte at a time. If the receive FIFO is enabled, then as the UART receives data, it places the data into a receive FIFO. Reading from this address removes 1 byte from the receive FIFO. If the FIFO becomes full and another data byte arrives, then the new data are lost and an overrun error occurs. Table 13.16 shows the layout of the receive buffer register.

Table 13.16

pcDuino UART receive buffer register

Bit   Name  Description
7–0   RBR   Read only: one byte of received data. Bit 7 of LCR must be zero.
31–8  —     Unused


UART_THR: Writing to the Transmit Holding Register will cause that byte to be transmitted by the UART. If the transmit FIFO is enabled, then the byte will be added to the end of the transmit FIFO. If the FIFO is empty, then the UART will begin transmitting the byte immediately. If the FIFO is full, then the new data byte will be lost. Table 13.17 shows the layout of the transmit holding register.

Table 13.17

pcDuino UART transmit holding register

Bit   Name  Description
7–0   THR   Write only: one byte of data to transmit. Bit 7 of LCR must be zero.
31–8  —     Unused


UART_DLL: The UART Divisor Latch Low register is used to set the least significant byte of the baud rate divisor. When bit 7 of the Line Control Register is set to one, writing to this address will access the DLL register. If bit 7 of the Line Control Register is set to zero, then writing to this address will access the transmit holding register. Table 13.18 shows the layout of the UART_DLL register.

Table 13.18

pcDuino UART divisor latch low register

Bit   Name  Description
7–0   DLL   Write only: least significant eight bits of the Baud Rate Divisor. Bit 7 of LCR must be one.
31–8  —     Unused


UART_DLH: The UART Divisor Latch High register is used to set the most significant byte of the baud rate divisor. When bit 7 of the Line Control Register is set to one, writing to this address will access the DLH register. If bit 7 of the Line Control Register is set to zero, then writing to this address will access the Interrupt Enable Register rather than the Divisor Latch High register. Table 13.19 shows the layout of the UART_DLH register.
If the two Divisor Latch Registers (DLL and DLH) are set to zero, the baud clock is disabled and no serial communications occur. DLH should be set before DLL, and at least eight clock cycles of the UART clock should be allowed to pass before data are transmitted or received.

Table 13.19

pcDuino UART divisor latch high register

Bit   Name  Description
7–0   DLH   Write only: most significant eight bits of the Baud Rate Divisor. Bit 7 of LCR must be one.
31–8  —     Unused

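The DLAB-controlled double-write sequence described above can be sketched in C. The register offsets come from Table 13.15; the function name and the idea of passing the register-block base address as a parameter are this example's own conventions, so that the same code can be exercised against an ordinary array:

```c
#include <stdint.h>

/* Word indices of the pcDuino UART registers (byte offset / 4). */
enum { DLL = 0x00 / 4, DLH = 0x04 / 4, LCR = 0x0C / 4 };
#define LCR_DLAB (1u << 7)   /* Divisor Latch Access Bit */

/* Program the baud rate divisor. In a real program `uart` would be
   (volatile uint32_t *)0x01C28000 for UART0; for testing it can point
   at a plain array standing in for the memory-mapped registers. */
static void uart_set_divisor(volatile uint32_t *uart, uint16_t div)
{
    uart[LCR] |= LCR_DLAB;            /* expose DLL/DLH at offsets 0 and 4 */
    uart[DLH]  = (div >> 8) & 0xFF;   /* high byte first, as the text advises */
    uart[DLL]  = div & 0xFF;
    uart[LCR] &= ~LCR_DLAB;           /* restore access to RBR/THR/IER */
}
```

Clearing DLAB at the end is essential; otherwise later writes intended for the transmit holding register would silently modify the divisor instead.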

UART_FCR: The UART FIFO Control Register is used to enable or disable the receive and transmit FIFOs (buffers), flush their contents, set the levels at which the transmit and receive FIFOs trigger an interrupt, and control Direct Memory Access (DMA). Table 13.20 shows the layout of the UART_FCR register.

Table 13.20

pcDuno UART FIFO control register

Bit   Name    Description
0     FIFOE   FIFO Enable
              0: transmit and receive FIFOs disabled
              1: transmit and receive FIFOs enabled
1     RFIFOR  Receive FIFO Reset: writing a 1 to this bit causes the receive FIFO to be reset, and then continue normal operation
2     XFIFOR  Transmit FIFO Reset: writing a 1 to this bit causes the transmit FIFO to be reset, and then continue normal operation
3     DMAM    DMA Mode:
              0: Mode 0
              1: Mode 1
5–4   TET     Transmit Empty Trigger: these bits control the level at which the Transmit Holding Register Empty interrupt is triggered
              00: the FIFO is completely empty
              01: there are two characters in the FIFO
              10: the FIFO is 25% full
              11: the FIFO is 50% full
              This setting has no effect if THRE_MODE_USER is disabled.
7–6   RT      Receive Trigger: these bits control the level at which the Received Data Available interrupt is triggered
              00: there is one character in the FIFO
              01: the FIFO is 25% full
              10: the FIFO is 50% full
              11: there is room for two more characters in the FIFO
              This setting has no effect if THRE_MODE_USER is disabled.
31–8  —       Unused


UART_LCR: The Line Control Register is used to control the parity, number of data bits, and number of stop bits for the serial port. Bit 7 also controls which registers are mapped at offsets 0, 4, and 8 from the device base address. Table 13.21 shows the layout of the UART_LCR register.

Table 13.21

pcDuino UART line control register

Bit   Name  Description
1–0   DLS   This field controls the number of data bits:
            00: 5 data bits
            01: 6 data bits
            10: 7 data bits
            11: 8 data bits
2     STOP  This bit controls the number of stop bits used for transmitting and receiving data.
            0: 1 stop bit
            1: if DLS is set to 00, then 1.5 stop bits, otherwise 2 stop bits
3     PEN   Parity Enable:
            0: parity disabled
            1: parity enabled
4     EPS   Even Parity Select:
            0: odd parity
            1: even parity
5     —     Unused
6     BC    Break Control: writing a one to this bit causes a break to be sent. This bit must be set to zero for normal operation.
7     DLAB  The Divisor Latch Access Bit controls the behavior of other registers:
            0: the RBR, THR, and IER registers are accessible (RBR is used for read at offset 0, and THR for write at offset 0)
            1: the DLL and DLH registers are accessible
31–8  —     Unused


UART_LSR: The Line Status Register is used to read status information from the UART. Table 13.22 shows the layout of the UART_LSR register.

Table 13.22

pcDuino UART line status register

Bit   Name     Description
0     DR       When the Data Ready bit is set to 1, it indicates that at least one byte is ready to be read from the receive FIFO or RBR.
1     OE       When the Overrun Error bit is set to 1, it indicates that an overrun error occurred for the byte at the top of the receive FIFO.
2     PE       When the Parity Error bit is set to 1, it indicates that a parity error occurred for the byte at the top of the receive FIFO.
3     FE       When the Framing Error bit is set to 1, it indicates that a framing error occurred for the byte at the top of the receive FIFO.
4     BI       When the Break Interrupt bit is set to 1, it indicates that a break has been received.
5     THRE     When the Transmit Holding Register Empty bit is 1, it indicates that there are no bytes waiting to be transmitted, but there may be a byte currently being transmitted.
6     TEMT     When the Transmitter Empty bit is 1, it indicates that there are no bytes waiting to be transmitted and no byte currently being transmitted.
7     FIFOERR  When this bit is 1, an error (PE, FE, or BI) has occurred in the receive FIFO. This bit is cleared when the Line Status Register is read.
31–8  —        Unused

UART_USR: The UART Status Register is used to read information about the status of the transmit and receive FIFOs, and the current state of the receiver and transmitter. Table 13.23 shows the layout of the UART_USR register. This register contains essentially the same information as the status register in the Raspberry Pi UART.

Table 13.23

pcDuino UART status register

Bit   Name  Description
0     BUSY  When the Busy bit is 1, it indicates that the UART is currently busy. When it is 0, the UART is idle or inactive.
1     TFNF  When the Transmit FIFO Not Full bit is 1, it indicates that at least one more byte can be safely written to the transmit FIFO.
2     TFE   When the Transmit FIFO Empty bit is 1, it indicates that there are no bytes remaining in the transmit FIFO.
3     RFNE  When the Receive FIFO Not Empty bit is 1, it indicates that at least one byte is waiting to be read from the receive FIFO.
4     RFF   When the Receive FIFO Full bit is 1, it indicates that there is no more room in the receive FIFO. If data are not read before the next character is received, an overrun error will occur.
31–5  —     Unused
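The TFNF bit is what a polling driver checks before each write to the transmit holding register. The following C sketch shows the idea; the function name and the base-pointer convention are this example's own, which also makes it testable against a plain array standing in for the hardware:

```c
#include <stdint.h>

/* Word indices of the relevant pcDuino UART registers (byte offset / 4). */
enum { THR = 0x00 / 4, USR = 0x7C / 4 };
#define USR_TFNF (1u << 1)   /* transmit FIFO not full */
#define USR_RFNE (1u << 3)   /* receive FIFO not empty */

/* Busy-wait until the transmit FIFO has room, then queue one byte. */
static void uart_put_byte(volatile uint32_t *uart, uint8_t byte)
{
    while ((uart[USR] & USR_TFNF) == 0)
        ;                     /* spin until there is room in the FIFO */
    uart[THR] = byte;
}
```

A receive routine is symmetric: spin until RFNE is set, then read one byte from the RBR. Polling wastes CPU cycles, which is one motivation for the interrupt mechanism discussed in Chapter 14.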

UART_TFL: The UART Transmit FIFO Level register allows the programmer to determine exactly how many bytes are currently in the transmit FIFO. Table 13.24 shows the layout of the UART_TFL register.

Table 13.24

pcDuino UART transmit FIFO level register

Bit   Name  Description
6–0   TFL   The Transmit FIFO Level field contains an integer indicating the number of bytes currently in the transmit FIFO.
31–7  —     Unused

UART_RFL: The UART Receive FIFO Level register allows the programmer to determine exactly how many bytes are currently in the receive FIFO. Table 13.25 shows the layout of the UART_RFL register.

Table 13.25

pcDuino UART receive FIFO level register

Bit   Name  Description
6–0   RFL   The Receive FIFO Level field contains an integer indicating the number of bytes currently in the receive FIFO.
31–7  —     Unused

UART_HALT: The UART transmit halt register is used to halt the UART so that it can be reconfigured. After the configuration is performed, it is then used to signal the UART to restart with the new settings. It can also be used to invert the receive and transmit polarity. Table 13.26 shows the layout of the UART_HALT register.

Table 13.26

pcDuino UART transmit halt register

Bit   Name           Description
0     —              Unused
1     CHCFG_AT_BUSY  Setting this bit to 1 causes the UART to allow changing the Line Control Register (except the DLAB bit) and allows setting the baud rate even when the UART is busy. When this bit is set to 0, changes can only occur when the BUSY bit in the UART Status Register is 0.
2     CHANGE_UPDATE  After writing 1 to CHCFG_AT_BUSY and performing the configuration, 1 should be written to this bit to signal that the UART should restart with the new configuration. This bit will stay at 1 while the new configuration is loaded, and go back to 0 when the restart is complete.
3     —              Unused
4     SIR_TX_INVERT  This bit allows the polarity of the transmitter to be inverted.
                     0: normal polarity
                     1: polarity inverted
5     SIR_RX_INVERT  This bit allows the polarity of the receiver to be inverted.
                     0: normal polarity
                     1: polarity inverted
31–6  —              Unused


Interrupt Control: The UART can signal the CPU by asserting an interrupt when certain conditions occur. This will be covered in more detail in Chapter 14. For now, it is enough to know that there are several additional registers which are used to configure and use the interrupt mechanism.
UART_IFLS defines the FIFO level that triggers the assertion of the interrupt signal. One interrupt is generated when the FIFO reaches the specified level. The CPU must clear the interrupt before another can be generated.
UART_IER is the interrupt enable register. It is used to enable or disable the generation of interrupts for specific conditions.
UART_IIR is the Interrupt Identity Register. When an interrupt occurs, the CPU can read this register to determine what caused the interrupt.

Additional Registers: There are several additional registers which are not needed for basic use of the UART.
UART_MCR is the Modem Control Register. It is used to configure the port for IrDA mode, enable Automatic Flow Control, and manage the RS-232 RTS and DTR hardware handshaking signals for the ports in which they are implemented. The default configuration disables these extra features.
UART_MSR is the Modem Status Register, which is used to read the state of the RS-232 modem control and status lines on ports that implement them. This register can be ignored unless a telephone modem is being used on the port.
UART_SCH is the Scratch Register. It provides 8 bits of storage for temporary data values. In the days of 8- and 16-bit computers, when the 16550 UART was designed, this extra byte of storage was useful.

13.3 Chapter Summary

Most modern computer systems have some type of Universal Asynchronous Receiver/Transmitter. These are serial communications devices, and are meant to provide communications with other systems using RS-232 (most commonly) or some other standard serial protocol. Modern systems often have a large number of other devices as well. Each device may need its own clock source to drive it at the correct frequency for its operation. The clock sources for all of the devices are often controlled by yet another device: the clock manager.

Although two systems may have different UARTs, these devices perform the same basic functions. The specifics about how they are programmed will vary from one system to another. However, there is always enough similarity between devices of the same class that a programmer who is familiar with one specific device can easily learn to program another similar device. The more experience a programmer has, the less time it takes to learn how to control a new device.

Exercises

13.1 Write a function for setting the PWM clock on the Raspberry Pi to 2 MHz.

13.2 The UART_GET_BYTE function in Listing 13.1 contains skeleton code for handling errors, but does not actually do anything when errors occur. Describe at least two ways that the errors could be handled.

13.3 Listing 13.1 provides four functions for managing the UART on the Raspberry Pi. Write equivalent functions for the pcDuino UART.

Chapter 14

Running Without an Operating System

Abstract

This chapter starts by describing the extra responsibilities that the programmer must assume when writing code to run without an operating system (bare metal). It then explains privileged and user modes and describes all of the privileged modes available on the ARM processor. Next, it gives an overview of exception processing, and provides example code for setting up the vector table stubs for exception handling functions on the ARM processor. Next, it describes the boot processes on the Raspberry Pi and the pcDuino. After that, it shows how to write a basic bare metal program, without any exception processing. The chapter finishes by showing a more efficient version of the bare metal program using an interrupt.

Keywords

Bare metal; Exception; Vector table; Exception handler; Sleep mode; User mode; Privileged mode; Startup code; Linker script; Boot loader; Interrupt

The previous chapters assumed that the software would be running in user mode under an operating system. Sometimes, it is necessary to write assembly code to run on “bare metal,” which simply means: without an operating system. For example, when we write an operating system kernel, it must run on bare metal and a significant part of the code (especially during the boot process) must be written in assembly language. Coding on bare metal is useful to deeply understand how the hardware works and what happens in the lowest levels of an operating system. There are some significant differences between code that is meant to run under an operating system and code that is meant to run on bare metal.

The operating system takes care of many details for the programmer. For instance, it sets up the stack, text, and data sections, initializes static variables, provides an interface to input and output devices, and gives the programmer an abstracted view of the machine. When accessing data on a disk drive, the programmer uses the file abstraction. The underlying hardware only knows about blocks of data. The operating system provides the data structures and operations which allow the programmer to think of data in terms of files and streams of bytes. A user program may be scattered in physical memory, but the hardware memory management unit, managed by the operating system, allows the programmer to view memory as a simple memory map (such as shown in Fig. 1.7). The programmer uses system calls to access the abstractions provided by the operating system. On bare metal, there are no abstractions, unless the programmer creates them.

However, there are some software packages to help bare-metal programmers. For example, Newlib is a C standard library intended for use in bare-metal programs. Its major features are that:

 it implements the hardware-independent parts of the standard C library,

 for I/O, it relies on only a few low-level functions that must be implemented specifically for the target hardware, and

 many target machines are already supported in the Newlib source code.

To support a new machine, the programmer only has to write a few low-level functions in C and/or Assembly to initialize the system and perform low-level I/O on the target hardware.

14.1 ARM CPU Modes

Many early computers were not capable of protecting the operating system from user programs. That problem was solved mostly by building CPUs that support multiple “levels of privilege” for running programs. Almost all modern CPUs have the ability to operate in at least two modes:

User mode is the mode that normal user programs use when running under an operating system, and

Privileged mode is reserved for operating system code. There are operations that can be performed in privileged mode which cannot be performed in user mode.

The ARM processor provides six privileged modes and one user mode. Five of the privileged modes have their own stack pointer (r13) and link register (r14). When the processor mode is changed, the corresponding link register and stack pointer become active, “replacing” the user stack pointer and link register.

In any of the six privileged modes, the link registers and stack pointers of the other modes can be accessed. The privileged mode stack pointers and link registers are not accessible from user mode. One of the privileged modes, FIQ, has five additional registers which become active when the processor enters FIQ mode. These registers “replace” registers r8 through r12. Additionally, five of the privileged modes have a Saved Program Status Register (SPSR). When entering those privileged modes, the CPSR is copied into the corresponding SPSR. This allows the CPSR to be restored to its original contents when the privileged code returns to the previously active mode. The full register set for all modes is shown in Table 14.1. Registers r0 through r7 and the program counter are shared by all modes. Some processors have an additional monitor mode, provided by the ARM Security Extensions (TrustZone).

Table 14.1

The ARM user and system registers

usr/sys     svc        abt        und        irq        fiq
r0–r7       (shared by all modes)
r8                                                      r8_fiq
r9                                                      r9_fiq
r10                                                     r10_fiq
r11 (fp)                                                r11_fiq
r12 (ip)                                                r12_fiq
r13 (sp)    r13_svc    r13_abt    r13_und    r13_irq    r13_fiq
r14 (lr)    r14_svc    r14_abt    r14_und    r14_irq    r14_fiq
r15 (pc)    (shared by all modes)
CPSR        (shared by all modes)
—           SPSR_svc   SPSR_abt   SPSR_und   SPSR_irq   SPSR_fiq


All of the bits of the Program Status Register (PSR) are shown in Fig. 14.1. The processor mode is selected by writing a bit pattern into the mode bits (M[4:0]) of the PSR. The bit pattern assignment for each processor mode is shown in Table 14.2. Not all combinations of the mode bits define a valid processor mode. An illegal value programmed into M[4:0] causes the processor to enter an unrecoverable state. If this occurs, a hardware reset must be used to re-start the processor. Programs running in user mode cannot modify these bits directly. User programs can only change the processor mode by executing the software interrupt (swi) instruction (also known as the svc instruction), which automatically gives control to privileged code in the operating system. The hardware is carefully designed so that the user program cannot run its own code in privileged mode.

Figure 14.1 The ARM program status register.

Table 14.2

Mode bits in the PSR

M[4:0]  Mode  Name                   Register Set
10000   usr   User                   R0–R14, CPSR, PC
10001   fiq   Fast Interrupt         R0–R7, R8_fiq–R14_fiq, CPSR, SPSR_fiq, PC
10010   irq   Interrupt Request      R0–R12, R13_irq, R14_irq, CPSR, SPSR_irq, PC
10011   svc   Supervisor             R0–R12, R13_svc, R14_svc, CPSR, SPSR_svc, PC
10111   abt   Abort                  R0–R12, R13_abt, R14_abt, CPSR, SPSR_abt, PC
11011   und   Undefined Instruction  R0–R12, R13_und, R14_und, CPSR, SPSR_und, PC
11111   sys   System                 R0–R14, CPSR, PC

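The mode field can also be decoded in software, which is handy when writing a debugger or a fault handler that prints the saved PSR. The following C helper is a sketch based on Table 14.2 (the function name is illustrative):

```c
#include <stdint.h>

/* Map the M[4:0] field of a PSR value to its mode name.
   Returns a null pointer for bit patterns that are not valid modes. */
static const char *cpsr_mode_name(uint32_t cpsr)
{
    switch (cpsr & 0x1F) {       /* M[4:0] occupies the low five bits */
    case 0x10: return "usr";
    case 0x11: return "fiq";
    case 0x12: return "irq";
    case 0x13: return "svc";
    case 0x17: return "abt";
    case 0x1B: return "und";
    case 0x1F: return "sys";
    default:   return 0;         /* illegal: would hang a real CPU */
    }
}
```

For example, a CPSR value ending in 0xD3 decodes to svc, since 0xD3 & 0x1F is 0b10011.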

The swi instruction does not really cause an interrupt, but the hardware and operating system handle it in a very similar way. The software interrupt is used by user programs to request that the operating system perform some task on their behalf. Another general class of interrupt is the “hardware interrupt.” This class of interrupt may occur at any time and is used by hardware devices to signal that they require service. Another type of interrupt may be generated within the CPU when certain conditions arise, such as attempting to execute an unknown instruction. These are generally known as “exceptions” to distinguish them from hardware interrupts. On the ARM processor, there are three bits in the CPSR which affect interrupt processing:

I: when set to one, normal hardware interrupts are disabled,

F: when set to one, fast hardware interrupts are disabled, and

A: (only on ARMv6 and later processors) when set to one, imprecise aborts are disabled (this is an abort on a memory write that has been held in a write buffer in the processor and not written to memory until later, perhaps after another abort).

Programs running in user mode cannot modify these bits. Therefore, the operating system gains control of the CPU whenever an interrupt occurs and the user program cannot disable interrupts and continue to run. Most operating systems use a hardware timer to generate periodic interrupts, thus they are able to regain control of the CPU every few milliseconds.
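Assuming the standard ARM CPSR bit positions (F at bit 6, I at bit 7, A at bit 8), these flags can be tested with simple masks. The helper names below are this example's own:

```c
#include <stdint.h>

#define CPSR_F (1u << 6)   /* 1: fast (FIQ) interrupts disabled */
#define CPSR_I (1u << 7)   /* 1: normal (IRQ) interrupts disabled */
#define CPSR_A (1u << 8)   /* 1: imprecise aborts disabled (ARMv6+) */

/* Note the inverted sense: the bit being SET means disabled. */
static int irqs_enabled(uint32_t cpsr) { return (cpsr & CPSR_I) == 0; }
static int fiqs_enabled(uint32_t cpsr) { return (cpsr & CPSR_F) == 0; }
```

A typical reset-time CPSR value such as 0x1D3 (svc mode with I and F set) therefore reports both interrupt classes as disabled until privileged code clears those bits.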

14.2 Exception Processing

Most of the privileged modes are entered automatically by the hardware when certain exceptional circumstances occur. For example, when a hardware device needs attention, it can signal the processor by causing an interrupt. When this occurs, the processor immediately enters IRQ mode and begins executing the IRQ exception handler function. Some devices can cause a fast interrupt, which causes the processor to immediately enter FIQ mode and begin executing the FIQ exception handler function. There are six possible exceptions that can occur, each one corresponding to one of the six privileged modes. Each exception must be handled by a dedicated function, with one additional function required to handle CPU reset events. The first instruction of each of these seven exception handlers is stored in a vector table at a known location in memory (usually address 0). When an exception occurs, the CPU automatically loads the appropriate instruction from the vector table and executes it. Table 14.3 shows the address, exception type, and the mode that the processor will be in, for each entry in ARM vector table. The vector table usually contains branch instructions. Each branch instruction will jump to the correct function for handling a specific exception type. Listing 14.1 shows a short section of assembly code which provides definitions for the ARM CPU modes.

Table 14.3

ARM vector table

Address      Exception               Mode
0x00000000   Reset                   svc
0x00000004   Undefined Instruction   und
0x00000008   Software Interrupt      svc
0x0000000C   Prefetch Abort          abt
0x00000010   Data Abort              abt
0x00000014   Reserved                —
0x00000018   Interrupt Request       irq
0x0000001C   Fast Interrupt Request  fiq
Listing 14.1 Definitions for ARM CPU modes.

Many bare-metal programs consist of a single thread of execution running in user mode to perform some task. This main program is occasionally interrupted by the occurrence of some exception. The exception is processed, and then control returns to the main thread. Fig. 14.2 shows the sequence of events when an exception occurs in such a system. The main program typically would be running with the CPU in user mode. When the exception occurs, the CPU executes the corresponding instruction in the vector table, which branches to the exception handler. The exception handler must save any registers that it is going to use, execute the code required to handle the exception, then restore the registers. When it returns to the user mode process, everything will be as it was before the exception occurred. The user mode program continues executing as if the exception never occurred.

Figure 14.2 Basic exception processing.

More complex systems may have multiple tasks, threads of execution, or user processes running concurrently. In a single-processor system, only one task, thread, or user process can actually be executing at any given instant, but when an exception occurs, the exception handler may change the currently active task, thread, or user process. This is the basis for all modern multiprocessing systems. Fig. 14.3 shows how an exception may be processed on such a system. It is common on multi-processing systems for a timer device to be used to generate periodic interrupts, which allows the currently active task, thread, or user process to be changed at a fixed frequency.

Figure 14.3 Exception processing with multiple user processes.

When any exception occurs, it causes the ARM CPU hardware to perform a very well-defined sequence of actions:

1. The CPSR is copied into the SPSR for the mode corresponding to the type of exception that has occurred.

2. The CPSR mode bits are changed, switching the CPU into the appropriate privileged mode.

3. The banked registers for the new mode become active.

4. The I bit of the CPSR is set, which disables normal interrupts.

5. If the exception was an FIQ, or if a reset has occurred, then the F bit is also set, disabling fast interrupts.

6. The program counter is copied to the link register for the new mode.

7. The program counter is loaded with the address in the vector table corresponding with the exception that has occurred.

8. The processor then fetches the next instruction using the program counter as usual. However, the program counter has been set so that it loads an instruction from the vector table.

The instruction in the vector table should cause the CPU to branch to a function which handles the exception. At the end of that function, the program counter must be loaded with the address of the instruction where the exception occurred, and the SPSR must be copied back into the CPSR. That will cause the processor to branch back to where it was when the exception occurred, and return to the mode that it was in at that time.

14.2.1 Handling Exceptions

Listing 14.2 shows in detail how the vector table is initialized. The vector table contains eight identical instructions. These instructions load the program counter, which causes a branch. In each case, the program counter is loaded with the value at the memory location 32 bytes beyond the corresponding load instruction. An offset of 24 is used in the instruction because the program counter will have advanced 8 bytes by the time the load instruction executes. The addresses of the exception handlers are stored in a second table, which begins at an address 32 bytes after the first load instruction. Thus, each instruction in the vector table loads a unique address into the program counter. Note that one of the slots in the vector table is not used and is reserved by ARM for future use. That slot is treated like all of the others, but it will never be used on any current ARM processor.

Listing 14.2 Function to set up the ARM exception table.

Listing 14.3 shows the stub functions for each of the exception handlers.

Listing 14.3 Stubs for the exception handlers.

Note that the return sequence depends on the type of exception. For some exceptions, the return address must be adjusted, because the program counter may have been advanced past the instruction where the exception occurred. These stub functions simply return the processor to the mode and location at which the exception occurred. To be useful, they will need to be extended significantly. Note that these functions all return using a data processing instruction with the optional s suffix specified and with the program counter as the destination register. This special form of data processing instruction indicates that the SPSR should be copied into the CPSR at the same time that the program counter is loaded with the return address. Thus, the function returns to the point where the exception occurred, and the processor switches back into the mode that it was in when the exception occurred.

A special form of the ldm instruction can also be used to return from an exception processing function. In order to use that method, the exception handler should start by adjusting the link register (depending on the type of exception) and then pushing it onto the stack. The handler should also push any other registers that it will need to use. At the end of the function, an ldmfd is used to restore the registers, but instead of restoring the link register, it loads the program counter. Also, a caret (^) is added to the end of the register list. Listing 14.4 shows the skeleton for an exception handler function using this method.

Listing 14.4 Skeleton for an exception handler.

14.3 The Boot Process

In order to create a bare-metal program, we must understand what the processor does when power is first applied or after a reset. The ARM CPU begins to execute code at a predetermined address. Depending on the configuration of the ARM processor, the program counter starts either at address 0 or 0xFFFF0000. In order for the system to work, the startup code must be at the correct address when the system starts up.

On the Raspberry Pi, when power is first applied, the ARM CPU is disabled and the graphics processing unit (GPU) is enabled. The GPU runs a program that is stored in ROM. That program, called the first stage boot loader, reads the second stage boot loader from a file named bootcode.bin on the SD card. That program enables the SDRAM, and then loads the third stage boot loader, start.elf. At this point, some basic hardware configuration is performed, and then the kernel is loaded to address 0x8000 from the kernel.img file on the SD card. Once the kernel image file is loaded, a “b #0x8000” instruction is placed at address 0, and the ARM CPU is enabled. The ARM CPU executes the branch instruction at address 0, then immediately jumps to the kernel code at address 0x8000.

To run a bare-metal program on the Raspberry Pi, it is only necessary to build an executable image and store it as kernel.img on the SD card. Then, the boot process will load the bare-metal program instead of the Linux kernel image. Care must be taken to ensure that the linker prepares the program to run at address 0x8000 and places the first executable instruction at the beginning of the image file. It is also important to make a copy of the original kernel image so that it can be restored (using another computer). If the original kernel image is lost, then there will be no way to boot Linux until it is replaced.

The pcDuino uses u-boot, which is a highly configurable open-source boot loader. The boot loader is configured to attempt booting from the SD card. If a bootable SD card is detected, then it is used. Otherwise, the pcDuino boots from its internal NAND flash. In either case, u-boot finds the Linux kernel image file, named uImage, loads it at address 0x40008000, and then jumps to that location. The easiest way to run bare-metal code on the pcDuino is to create a duplicate of the operating system on an SD card, then replace the uImage file with another executable image. Care must be taken to ensure that the linker prepares the program to run at address 0x40008000 and places the first executable instruction at the beginning of the image file. If the SD card is inserted, then the bare-metal code will be loaded. Otherwise, it will boot normally from the NAND flash memory.

14.4 Writing a Bare-Metal Program

A bare-metal program should be divided into several files. Some of the code may be written in assembly, and other parts in C or some other language. The initial startup code, and the entry and exit from exception handlers, must be written in assembly. However, it may be much more productive to write the main program and the remainder of the exception handlers as C functions and have the assembly code call them.

14.4.1 Startup Code

Other than the code being loaded at different addresses, there is very little difference between getting bare-metal code running on the Raspberry Pi and the pcDuino. For either platform, the bare-metal program must include some start-up code. The startup code will:

 initialize the stack pointers for all of the modes,

 set up interrupt and exception handling,

 initialize the .bss section,

 configure the CPU and critical systems (optional),

 set up memory management (optional),

 set up process and/or thread management (optional),

 initialize devices (optional), and

 call the main function.

The startup code requires some knowledge of the target platform, and must be at least partly written in assembly language. Listing 14.5 shows a function named _start which sets up the stacks, initializes the .bss section, calls a function to set up the vector table, then calls the main function:

Listing 14.5 ARM startup code.

The first task for the startup code is to ensure that the stack pointer for each processor mode is initialized. When an exception or interrupt occurs, the processor will automatically change into the appropriate mode and begin executing an exception handler, using the stack pointer for that mode. Hardware interrupts can be disabled, but some exceptions cannot be disabled. In order to guarantee correct operation, a stack must be set up for each processor mode, and an exception handler must be provided. The exception handler does not actually have to do anything.

On the Raspberry Pi, memory is mapped to begin at address 0, and all models have at least 256 MB of memory. Therefore, it is safe to assume that the last valid memory address is 0x0FFFFFFF. If each mode is given 4 kB of stack space, then all of the stacks together will consume 32 kB, and the initial stack addresses can be easily calculated. Since the C compiler uses a full descending stack, the initial stack pointers can be assigned addresses 0x10000000, 0x0FFFF000, 0x0FFFE000, etc.
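The address arithmetic above can be sketched in C. The 4 kB stack size and starting address follow the scheme just described for the Raspberry Pi; the function itself is only an illustration (real startup code computes these addresses in assembly):

```c
#define STACK_TOP   0x10000000u  /* address just past the stack region */
#define STACK_SIZE  0x1000u      /* 4 kB of stack per processor mode */

/* Initial stack pointer for the nth processor mode (0, 1, 2, ...).
   With a full descending stack, each pointer is the address just
   above that mode's 4 kB region. */
unsigned int initial_sp(unsigned int mode_index)
{
    return STACK_TOP - mode_index * STACK_SIZE;
}
```

For the pcDuino, the same calculation works with STACK_TOP changed to 0x50000000, which is what makes a single piece of stack-setup code usable on both boards.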

For the pcDuino, there is a small amount of memory mapped at address 0, but most of the available memory is in the region between 0x40000000 and 0xBFFFFFFF. The pcDuino has at least 1 GB of memory. One possible way to assign the stack locations is: 0x50000000, 0x4FFFF000, 0x4FFFE000, etc. This assignment of addresses will make it easy to write one piece of code to set up the stacks for either the Raspberry Pi or the pcDuino.

After initializing the stacks, the startup code must set all bytes in the .bss section to zero. Recall that the .bss section holds data that is initialized to zero, but the program file does not actually contain the zeros. Programs running under an operating system can rely on the C standard library to initialize the .bss section; a bare-metal program that is not linked with a C library must zero the section itself.
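The zeroing step can be sketched in C. On a real system the start and end addresses would come from symbols defined in the linker script (names such as __bss_start__ and __bss_end__ are a common convention, not a requirement), and startup code would typically clear a word at a time; a byte loop is shown for clarity:

```c
/* Zero every byte in the half-open range [start, end),
   e.g. the .bss region delimited by linker-provided symbols. */
void zero_region(unsigned char *start, unsigned char *end)
{
    while (start < end)
        *start++ = 0;
}
```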

14.4.2 Main Program

The final part of this bare-metal program is the main function. Listing 14.6 shows a very simple main program which reads three GPIO pins that have pushbuttons connected to them and controls three other pins that have LEDs connected to them. When a button is pressed, the LED associated with it is illuminated. The only real difference between the pcDuino and Raspberry Pi versions of this program is in the functions which drive the GPIO device. Therefore, those functions have been removed from the main program file. This makes the main program portable; it can run on the pcDuino or the Raspberry Pi. It could also run on any other ARM system, with the addition of another file to implement the mappings and functions for using the GPIO device on that system.
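The structure of such a main program can be sketched in C using the GPIO library interface described in Exercise 14.9. The pin numbers here are arbitrary, and the GPIO functions are host-side stubs standing in for the platform-specific driver file, so the sketch can be compiled and exercised anywhere:

```c
/* Stub GPIO layer standing in for the platform-specific driver file.
   On real hardware these functions would read and write device registers. */
static int pin_state[64];
int  GPIO_get_pin(int pin)            { return pin_state[pin]; }
void GPIO_set_pin(int pin, int state) { pin_state[pin] = state; }
void GPIO_dir_input(int pin)          { (void)pin; }
void GPIO_dir_output(int pin)         { (void)pin; }

static const int button[3] = { 0, 1, 2 };  /* arbitrary pin numbers */
static const int led[3]    = { 3, 4, 5 };

/* Configure the button pins as inputs and the LED pins as outputs. */
void setup_pins(void)
{
    for (int i = 0; i < 3; i++) {
        GPIO_dir_input(button[i]);
        GPIO_dir_output(led[i]);
    }
}

/* One pass of the main loop: copy each button's state to its LED. */
void poll_once(void)
{
    for (int i = 0; i < 3; i++)
        GPIO_set_pin(led[i], GPIO_get_pin(button[i]));
}
```

On the target, main would call setup_pins once and then call poll_once in an endless loop; only the driver file behind the four GPIO functions changes between platforms.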

Listing 14.6 A simple main program.

14.4.3 The Linker Script

When compiling the program, it is necessary to perform a few extra steps to ensure that the program is ready to be loaded and run by the boot code. The last step in compiling a program is to link all of the object files together, possibly also including some object files from system libraries. A linker script is a file that tells the linker which sections to include in the output file, as well as which order to put them in, what type of file is to be produced, and what is to be the address of the first instruction. The default linker script used by GCC creates an ELF executable file, which includes startup code from the C library and also includes information which tells the loader where the various sections reside in memory. The default linker script creates a file that can be loaded by the operating system kernel, but which cannot be executed on bare metal.

For a bare-metal program, the linker must be configured to link the program so that the first instruction of the startup function is given the correct address in memory. This address depends on how the boot loader will load and execute the program. On the Raspberry Pi this address is 0x8000, and on the pcDuino this address is 0x40008000. The linker will automatically adjust any other addresses as it links the code together. The most efficient way to accomplish this is by providing a custom linker script to be used instead of the default system script. Additionally, either the linker must be instructed to create a flat binary file, rather than an ELF executable file, or a separate program (objcopy) must be used to convert the ELF executable into a flat binary file.

Listing 14.7 is an example of a linker script that can be used to create a bare-metal program. The first line is just a comment. The second line specifies the entry point: a function named _start is where the program will begin execution. Next, the file specifies the sections that the output file will contain, and for each output section it lists the input sections that compose it.

Listing 14.7 A sample Gnu linker script.

The first output section is the .text section, and it is composed of any sections whose names end in .text.boot followed by any sections whose names end in .text. In Listing 14.5, the _start function was placed in the .text.boot section, and it is the only thing in that section. Therefore the linker will put the _start function at the very beginning of the program. The remaining text sections will be appended, and then the remaining sections, in the order that they appear. After the sections are concatenated together, the linker will make a pass through the resulting file, correcting the addresses of branch and load instructions as necessary so that the program will execute correctly.
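A script with the structure just described might look like the following sketch. The .text.boot section name matches the startup code in Listing 14.5, and the load address 0x8000 is for the Raspberry Pi (0x40008000 would be used on the pcDuino); the __bss_start__ and __bss_end__ symbol names are an assumed convention for the startup code to find the region it must zero:

```ld
/* Minimal bare-metal linker script (sketch) */
ENTRY(_start)

SECTIONS
{
    . = 0x8000;                  /* load address expected by the boot loader */
    .text : { *(.text.boot) *(.text) }
    .rodata : { *(.rodata) }
    .data : { *(.data) }
    .bss : {
        __bss_start__ = .;       /* assumed symbol names for startup code */
        *(.bss) *(COMMON)
        __bss_end__ = .;
    }
}
```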

14.4.4 Putting it All Together

Compiling a program that consists of multiple source files, a custom linker script, and special commands to create an executable image can become tedious. The make utility was created specifically to help in this situation. Listing 14.8 shows a make script that combines all of the elements of the program and produces a uImage file for the pcDuino and a kernel.img file for the Raspberry Pi. Listing 14.9 shows how the program can be built by typing “make” at the command line.
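A make script for this kind of build might be structured as follows. The cross-compiler prefix, file names, and linker-script name are placeholders that would be adjusted for the actual toolchain and project; the key steps are linking with the custom script and converting the ELF file to a flat binary with objcopy:

```make
# Sketch of a bare-metal build (toolchain prefix and file names are assumptions)
CROSS = arm-none-eabi-
OBJS  = startup.o main.o gpio.o

kernel.img: program.elf
	$(CROSS)objcopy -O binary program.elf kernel.img

program.elf: $(OBJS) kernel.ld
	$(CROSS)ld -T kernel.ld -o program.elf $(OBJS)

%.o: %.s
	$(CROSS)as -o $@ $<

%.o: %.c
	$(CROSS)gcc -c -ffreestanding -o $@ $<

clean:
	rm -f $(OBJS) program.elf kernel.img
```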

Listing 14.8 A sample make file.
Listing 14.9 Running make to build the image.

14.5 Using an Interrupt

The main program shown in Listing 14.6 is extremely wasteful because it runs the CPU in a loop, repeatedly checking the status of the GPIO pins. It uses far more CPU time (and electrical power) than is necessary. In reality, the pins are unlikely to change state very often, and it is sufficient to check them a few times per second. It only takes a few nanoseconds to check the input pins and set the output pins, so the CPU only needs to be running for a few nanoseconds at a time, a few times per second.

A much more efficient implementation would set up a timer to send interrupts at a fixed frequency. Then the main loop can check the buttons, set the outputs, and put the CPU to sleep. Listing 14.10 shows the main program, modified to put the processor to sleep after each iteration of the main loop. The only difference between this main function and the one in Listing 14.6 is the addition of a wfi instruction at line 43. The new implementation will consume far less electrical power and allow the CPU to run cooler, thereby extending its life. However, some additional work must be performed in order to set up the timer and interrupt system before the main function is called.

Listing 14.10 An improved main program.

14.5.1 Startup Code

Some changes must be made to the startup code in Listing 14.5 so that, after setting up the vector table, it calls a function to initialize the interrupt controller and then calls another function to set up the timer. Listing 14.11 shows the modified startup function.

Lines 50 through 57 have been added to initialize the interrupt controller, enable the timer, and change the CPU into user mode before calling main. Of course, the hardware timers and interrupt controllers on the pcDuino and Raspberry Pi are very different.

14.5.2 Interrupt Controllers

The pcDuino has an ARM Generic Interrupt Controller (GIC-400) device to manage interrupts. The GIC device can handle a large number of interrupts. Each one is a separate input signal to the GIC. The GIC hardware prioritizes each input, and assigns each one a unique integer identifier. When the CPU receives an interrupt, it simply reads the GIC to determine which hardware device signaled the interrupt, calls the function which handles that device, then writes to one of the GIC registers to indicate that the interrupt has been processed. Listing 14.12 provides a few basic functions for managing this device.
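The read–dispatch–acknowledge cycle described above can be sketched in C. The register layout here is a simplified stand-in for illustration: on the real GIC-400, the interrupt acknowledge and end-of-interrupt registers live in the CPU interface at a platform-specific base address, with the interrupt ID in the low bits of the acknowledge register:

```c
/* Simplified model of the two GIC CPU-interface registers used in the cycle.
   Real hardware maps these at a platform-specific base address. */
typedef struct {
    volatile unsigned int iar;   /* interrupt acknowledge register */
    volatile unsigned int eoir;  /* end-of-interrupt register */
} gic_cpu_if;

enum { GIC_SPURIOUS_ID = 1023 }; /* ID reported when nothing is pending */

/* Read the ID of the pending interrupt, or -1 if it was spurious. */
int gic_acknowledge(gic_cpu_if *gic)
{
    unsigned int id = gic->iar & 0x3FFu;  /* ID lives in the low 10 bits */
    return (id == GIC_SPURIOUS_ID) ? -1 : (int)id;
}

/* Tell the controller that the given interrupt has been processed. */
void gic_end_of_interrupt(gic_cpu_if *gic, unsigned int id)
{
    gic->eoir = id;
}

/* Self-check against a fake register block (interrupt 54 is arbitrary). */
int gic_demo(void)
{
    gic_cpu_if fake = { 54u, 0u };
    int id = gic_acknowledge(&fake);
    gic_end_of_interrupt(&fake, (unsigned int)id);
    return (id == 54 && fake.eoir == 54u);
}
```

An IRQ handler built on this layer simply calls gic_acknowledge, dispatches to the driver for that ID, and finishes with gic_end_of_interrupt.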

Listing 14.12 Functions to manage the pcDuino interrupt controller.

The Raspberry Pi has a much simpler interrupt controller. It can enable and disable interrupt sources, and it requires the programmer to read up to three registers to determine the source of an interrupt. For our purposes, we only need to manage the ARM timer interrupt. Listing 14.13 provides a few basic functions for using this device to enable the timer interrupt. Extending these functions to provide functionality equal to the GIC would not be very difficult, but it would take some time. It would be necessary to set up a mapping from the interrupt bits in the interrupt controller's registers to integer values, so that each interrupt source has a unique identifier, and then write the functions to use those identifiers. The result would be a software implementation of capabilities equivalent to the GIC.
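The mapping just described, from interrupt bits to integer identifiers, can be sketched in C: given the contents of one pending register, treat each bit position as an identifier and return the lowest pending one. The id_base parameter is how a real driver might offset the bits of the controller's three registers into a single ID space (this is an illustration of the idea, not the book's driver code):

```c
/* Return the identifier (id_base + bit position) of the lowest pending
   interrupt in one 32-bit pending register, or -1 if none is pending. */
int lowest_pending_id(unsigned int pending_bits, int id_base)
{
    for (int bit = 0; bit < 32; bit++)
        if (pending_bits & (1u << bit))
            return id_base + bit;
    return -1;
}
```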

Listing 14.13 Functions to manage the Raspberry Pi interrupt controller.

Note that although the devices are very different internally, they perform basically the same function. With the addition of the software driver layers implemented in Listings 14.12 and 14.13, the devices become interchangeable, and other parts of the bare-metal program do not have to be changed when porting from one platform to the other.

Listing 14.11 ARM startup code with timer interrupt.

14.5.3 Timers

The pcDuino provides several timers that could be used; Timer0 was chosen arbitrarily. Listing 14.14 provides a few basic functions for managing this device.

Listing 14.14 Functions to manage the pcDuino Timer0 device.

The Raspberry Pi also provides several timers that could be used, but the ARM timer is the easiest to configure. Listing 14.15 provides a few basic functions for managing this device:

Listing 14.15 Functions to manage the Raspberry Pi timer0 device.

14.5.4 Exception Handling

The final step in writing the bare-metal code to operate in an interrupt-driven fashion is to modify the IRQ handler from Listing 14.3. Listing 14.16 shows a new version of the IRQ exception handler which checks and clears the timer interrupt, then returns to the location and CPU mode that were current when the interrupt occurred. This code works for both platforms.

Listing 14.16 IRQ handler to clear the timer interrupt.

14.5.5 Building the Interrupt-Driven Program

Finally, the make file must be modified to include the new source files that were added to the program. Listing 14.17 shows the modified make script; the only change is that two extra object files have been added. When make is run, those files will be compiled and linked with the program. Listing 14.18 shows how the program can be built by typing “make” at the command line.

Listing 14.17 A sample make file.

14.6 ARM Processor Profiles

Since its introduction in 1985 as the Acorn RISC Machine, the ARM processor has gone through many changes. Over the years, ARM processors have maintained a good balance of simplicity, performance, and efficiency. Although originally intended as a desktop processor, the ARM architecture has been more successful than any other architecture in embedded applications, at least partially because of good choices made by its original designers. Those architectural decisions resulted in a processor that provides relatively high computing power with a relatively small number of transistors, which also results in relatively low power consumption.

Today, there are almost 20 variants of the ARMv7 architecture, targeted at everything from smart sensors to desktops and servers, and sales of ARM-based processors outnumber those of all other processor architectures combined. Historically, ARM assigned numbers to successive versions of the architecture. With ARMv7, they introduced a simpler scheme for describing different versions of the processor, dividing their processor families into three major profiles:

ARMv7-A: Applications processors are capable of running a full, multiuser, virtual memory, multiprocessing operating system.

ARMv7-R: Real-time processors are for embedded systems that may need powerful processors, cache, and/or large amounts of memory.

ARMv7-M: Microcontroller processors only execute Thumb instructions and are intended for use in very small cost-sensitive embedded systems. They provide low cost, low power, and small size, and may not have hardware floating point or other high-performance features.

In 2014, ARM introduced the ARMv8 architecture. This is the first radical change in the ARM architecture in over 30 years. The new architecture extends the register set to thirty-one 64-bit general purpose registers and has a completely new instruction set. Compatibility with ARMv7 and earlier code is supported by switching the processor into 32-bit mode, so that it executes the 32-bit ARM instruction set. This is somewhat similar to the way that the Thumb instructions are supported on 32-bit ARM cores, except that the switch to 32-bit code can only be made while the processor is in a privileged mode, as it drops back to an unprivileged mode.

Listing 14.18 Running make to build the image.

14.7 Chapter Summary

Writing bare-metal programs can be a daunting task. However, that task can be made easier by writing and testing code under an operating system before attempting to run it bare metal. There are some functions which cannot be tested in this way. In those cases, it is best to keep those functions as simple as possible. Once the program works on bare metal, extra capabilities can be added.

Interrupt-driven processing is the basis for all modern operating systems. The system timer allows the O/S to take control periodically and select a different process to run on the CPU. Interrupts allow hardware devices to do their jobs independently and signal the CPU when they need service. The ability to restrict user access to devices and certain processor features provides the basis for a secure and robust system.

Exercises

14.1 What are the advantages of a CPU which supports user mode and privileged mode over a CPU which does not?

14.2 What are the six privileged modes supported by the ARM architecture?

14.3 The interrupt handling mechanism is somewhat complex and requires significant programming effort to use. Why is it preferred over simply having the processor poll I/O devices?

14.4 Where does program control transfer to when a hardware interrupt occurs?

14.5 What is the purpose of the Undefined Instruction exception? How can it be used to allow an older processor to run programs that have new instructions? What other uses does it have?

14.6 What is an swi instruction? What is its use in operating systems? What is the key difference between an swi instruction and an interrupt?

14.7 Which of the following operations should be allowed only in privileged mode? Briefly explain your decision for each one.

(a) Execute an swi instruction.

(b) Disable all interrupts.

(c) Read the time-of-day clock.

(d) Receive a packet of data from the network.

(e) Shutdown the computer.

14.8 The main program in Listing 14.10 has two different methods to put the processor to sleep waiting for an interrupt. One method is for the Raspberry Pi, while the other is for the pcDuino. In order to compile the code, the correct lines must be uncommented and the unneeded lines must be commented out or removed. Explain two ways to change the code so that exactly the same main program can be used on both systems.

14.9 The programs in this chapter assumed the existence of libraries of functions for controlling the GPIO pins on the Raspberry Pi and the pcDuino. Both libraries provide the same high-level functions, but one operates on the Raspberry Pi GPIO device and the other operates on the pcDuino GPIO device. The C prototypes for the functions are: int GPIO_get_pin(int pin), void GPIO_set_pin(int pin, int state), void GPIO_dir_input(int pin), and void GPIO_dir_output(int pin). Write these libraries in ARM assembly language for both platforms.

14.10 Write an interrupt-driven program to read characters from the serial port on either the Raspberry Pi or the pcDuino. The UART on either system can be configured to send an interrupt when a character is received.
When a character is received through the UART and an interrupt occurs, the character should be echoed by transmitting it back to the sender. The character should also be stored in a buffer. If the character received is a newline (“\n”), or if the buffer becomes full, then the contents of the buffer should be transmitted through the UART. Then the buffer should be cleared and prepared to receive more characters.

Index

Note: Page numbers followed by b indicate boxes, f indicate figures and t indicate tables.

A

Absolute difference 339–340
Absolute value 340–341
Abstract data type (ADT) 
in assembly language 138–139
big integer ADT 195–196, 211
in C header file 138
implementation of 137
interface 137
Therac-25 
design flaws 163–165
history of 162–163
X-ray therapy 161
use of 137
word frequency counts 
better performance 150–161
C header for 141–142
C implementation 141–142, 145
C program to compute 140–141
makefile for 141–142, 146
revised makefile for 148–150
sorting by 147–150
wl_print_numerical function 147–150, 157–161
Accessing devices, Linux 365–376
Acorn Archimedes™ 8
Acorn RISC Machine (ARM) processor 8–9
Addition 
in decimal and binary 173b
fixed-point operation 231–232
floating point operation 246–247
subtraction by 172
vector 335–337
VFP 278
American Standard Code for Information Interchange (ASCII) 
control characters 20, 21t
converting character strings to ASCII codes 21–23, 23t, 24t
interpreting data as ASCII strings 23–24, 24t
ISO extensions to ASCII 24–25, 25t
unicode and UTF-8 25–28, 27t
Arbitrary base 
base ten to 11
to decimal, conversion 220–223
Arithmetic and logic unit (ALU) 54–55
Arithmetic instructions, ARM 83–85
Arithmetic instructions, NEON 335–343
absolute difference 339–340
absolute value and negate 340–341
add vector elements pairwise 338–339
count bits 342–343
select maximum/minimum elements 341–342
vector addition and subtraction 335–337
ARM assembly 
automatic variables 118–119
calling scanf and printf 110–111
complex selection 103–104
function call using stack 115–116
for loop re-written as a post-test loop 107–108
post-test loop 106, 108
pre-test loop 105–107
program 36
reverse function implementation 121–122
simple function call 114
structured data type 124–126
unconditional loop 104–105
ARM condition modifiers 59t
ARM CPU modes 432–435
ARM instruction set architecture 95–96
data processing instructions 79–80
arithmetic operations 83–85
comparison operations 81–82
data movement operations 86–87
division operations 89–90
logical operations 85–86
multiply operations with 32-bit results 87–88
multiply operations with 64-bit results 88–89
Operand2 80, 80t, 81t
pseudo-instructions, ARM 93
no operation 93–94
shifts 94–95
special instructions 
accessing CPSR and SPSR 91
count leading zeros 90
software interrupt 91–92
thumb mode 92–93
ARM processor 
architecture 54f
ARM user registers 55–58, 56f, 57f
branch instructions 70
branch 70–71
branch and link 71–72
load/store instructions 60–61
addressing modes 61–63, 61t
exclusive load/store 69–70
multiple register 65–68
single register 64
swap 68–69
profiles 461–464
pseudo-instructions 73
load address 75–76
load immediate 73–75
ARM user program registers 112f
Assembler 38–40
Assembly language 3
ADTs 138–139
reason to learn 4–8
Atomic Energy of Canada Limited (AECL) 161–162

B

Bare-metal programs 
coding on 431
compiling 449
exception processing 435–436
features 432
linker script 447–448
main program 445–447
Raspberry Pi 442
startup code 443–445
writing 442–449
Base address 
clock manager device 407
for GPIO device 378–379
in memory 367
PWM device 398
Big integer ADT 195–216
bigint_adc function 213–216
C source code file 211
factorial function calculation 212
header file 196
Binary division 
constant 190–194
flowchart for 183f
large numbers 194–195
power of two 181
64-bit functions, signed and unsigned 190
32-bit functions, signed and unsigned 190
variable 182–186
Binary multiplication 
algorithm for 175
large numbers 179–181, 180f
power of two 173
signed multiplication 178–179, 179f, 180b
64-bit numbers 175–176
32-bit numbers 176–177
of two variables 173–176
variable by constant 177–178
Binary tree, of word frequency 151f
index added 157f
sorted index 158f
Binimals 223–224
non-terminating, repeating 223b
terminating 224
Bitwise logical operations, NEON 326–327
with immediate data 327–328, 352–353
insertion and selection 328–329
Boot loader 442, 447
Boot process 442
Branch instructions, ARM processor 70
branch 70–71
branch and link 71–72

C

Central processing unit (CPU) 
components and data paths 54–55
description 3–4
C language 
array of integers 124
array of structured data 127
calling scanf and printf 110
complex selection 103
larger function call 114
for loop 106
program 36
using recursion to reverse a string 120–121
Clock Control Unit (CCU) 409
Clock management device 405–409, 406f
control registers 408t
divisor registers 408t
pcDuino CCU 409
Raspberry Pi 406–409
registers 407t
Communications 
parallel 409
serial 409–429
pcDuino UART 422–429
Raspberry Pi UART0 413–422
UART 410–412
Compare instruction 
ARM 81–82
vector 323–324
vector absolute 353–354
VFP 279
Compilation sequence 5, 6f
Compiler, GNU C 38–40
Complex Instruction Set Computing (CISC) processor 8
Computer data 9
base conversion 
base b to decimal 11–12, 12b
base conversion 10t, 11–15
bases, powers-of-two 14–15, 14f
conversion between arbitrary bases 13b
decimal to base b 12, 13b
characters 20–28, 21t, 22t
non-printing 20–21
printing 20, 22t
ISO 24–25
Unicode and UTF-8 25–28
integers 15, 16f
complement representation 16–19, 17f, 18b, 19b
excess-(2^(n−1)−1) representation 16
sign-magnitude representation 15
natural numbers 9–11
Conditional assembly 46–47
Control registers 
clock management device 407, 408t
pcDuino UART FIFO 425t
Raspberry Pi UART 416, 417t
Cosine function 
ARM assembly implementation 251, 257
battery powered systems 260
double precision software float C 259
double precision VFP C 260
factorial terms, formats and constants for 249–251
formats for powers of x 248–249
intermediate calculations 251
performance comparison 259–260
performance implementations 259t
properties 247–248
single precision software float C 259
single precision VFP C 259
table printing 251, 258
32-bit fixed point assembly 259
32-bit fixed point C 259
Count bits 342–343
Current Program Status Register (CPSR) 57–58
accessing 91
flag bits 58, 58t

D

Data conversion instructions 
NEON 321–322
fixed point and single-precision 321–322
half-precision and single-precision 322
vector floating point 
fixed point to single precision 284–285
floating point to integer 282–284
Data frame 410
Data movement instructions 
ARM 86–87
NEON 309–320
change size of elements in vector 311–312
duplicate scalar 312–313
extract elements 313–314
move immediate data 310–311
moving between NEON scalar and integer register 309–310
reverse elements 314–315
swap vectors 315–316
table lookup 317–319
transpose matrix 316–317
zip/unzip vectors 319–320
vector floating point 279–282
ARM register and VFP system register 282
between two VFP registers 279–280
VFP register and one integer register 280–281
VFP register and two integer registers 281
Data processing instructions, ARM 79–80
arithmetic operations 83–85
comparison operations 81–82
data movement operations 86–87
division operations 89–90
logical operations 85–86
multiply operations 
with 64-bit results 88–89
with 32-bit results 87–88
Operand2 80, 80t, 81t
vector floating point 277–279
compare instruction 279
mathematical operations 278
unary operations 277–278
Data register, Raspberry Pi UART 413, 414t
Data section, memory 28–29
Decimal 223–224
to arbitrary base, conversion 220–223
terminating 224
Direct Memory Access (DMA) 377–378
control register 418
Division 
binary 
constant 190–194
flowchart for 183f
large numbers 194–195
power of two 181
64-bit functions, signed and unsigned 190
32-bit functions, signed and unsigned 190
variable 182–186
by constant 236–241
in decimal and binary 181f
fixed-point operation 234–236
floating point operation 247
maintaining precision 236
mixed 235
NEON 343
results of 234–235
signed 235
unsigned 235
of variable by constant 193
VFP 278
Divisor registers 
clock management device 408t
divisor latch high register 424, 425t
divisor latch low register 424, 424t
Double-precision floating point number 
IEEE 754 245–246
sine function 355, 357
Duty cycle 395

E

Exception handling 438–441, 461
skeleton for 441
stub functions 438–441
Exception processing 434–441, 436f
ARM vector table 434–435, 435t
bare-metal programs 435–436
handling exceptions 438–441
skeleton for 441
stub functions 438–441
with multiple user processes 436, 437f
Executing program, memory layout of 28–31, 29f, 30f
Extract elements 313–314

F

Fault Tree Analysis 162
FIFO control register 425t
Fixed-point numbers 
interpreting 226–230
properties of 230–231
Q notation 230
signed 227–228
two’s complement 229
unsigned 226, 228
Fixed-point operation 
addition 231–232
division 
by constant 236–241
maintaining precision 236
mixed 235
results of 234–235
signed 235
unsigned 235
multiplication 232–233
to single-precision 284–285, 321–322
subtraction 231–232
Flags register 414, 415t
Floating-point Exception register (FPEXC) 274
Floating point numbers 
binimal representation 242–243
IEEE 754 
double-precision 245–246
half-precision 243–245
quad-precision 246
single-precision 245
to integer 282–284
Floating point operations 
addition 246–247
division 247
multiplication 247
subtraction 246–247
Floating Point Status and Control Register (FPSCR) 268–273
bits in 268–269, 268f
performance vs. compliance 271–272
vector mode 272–273
Floating-point System ID register (FPSID) 274
Fractional baud rate divisor 414, 416t
Fractional numbers, base conversion 223–225
arbitrary base to decimal 220
decimal to arbitrary base 220–223
powers-of-two 222–223
Full-compliance mode 272
Fused multiply accumulate operation 346

G

General Purpose I/O (GPIO) device 376–392, 395
applications 377–378
features 377–378
GPIO pin event detect status registers 382
GPIO pin pull-up/down registers 381–382
input and output 378f
parallel printer port 377
pcDuino 382–392
detecting GPIO events 390
enabling internal pull-up/pull-down 389–390
function select code assignments 392t
GPIO pins available on 390–392
header pin assignments 391f
reading and setting GPIO pins 388–389
setting GPIO pin function 384–385
pin function select bits 380t
port 376–377
Raspberry Pi 378–382
detecting GPIO events 382
enabling internal pull-up/pull-down 381–382
GPIO pins available on 382
header pin assignments 384f
reading GPIO input pins 381
setting GPIO output pins 380–381
setting GPIO pin function 379–380
Generic Interrupt Controller (GIC) device 449–451
GNU assembler (GAS) 35, 40
directives 40
allocating space for variables and constants 41–43, 42f
conditional assembly 46–47
current section selection 40–41
filling and aligning 43–45
including other source files 47–48
macros 48–50
setting and manipulating symbols 45–47
program structure 36, 38
assembler directives 36–38
assembly instructions 36, 38
comments 37
labels 37
GNU C compiler 38–40, 57

H

Half-precision floating point number 
IEEE 754 243–245
to single-precision 322
Hardware interrupt 434
High-level language 
description 4–5
structured data type 73–74
Hindu-Arabic number system 9–10

I

IBM PC 377
Image data type 138–139
Immediate data 
bitwise logical operations with 327–328, 352–353
data movement NEON instructions 310–311
Information hiding 137
Instruction components 58
immediate values 59–60, 60t
setting and using condition flags 58–59, 58t
Instruction set architecture (ISA) 53
Instruction stream 3
Integer baud rate divisor 414, 416t
Integer mathematics 
big integer ADT 195–216
binary division 
constant 190–194
large numbers 194–195
power of two 181
variable 182–186
binary multiplication by 
large numbers 179–181
power of two 173
signed multiplication 178–179, 180b
two variables 173–176
variable by constant 177–178
division 236, 239
floating point to 282–284
overflow 171
subtraction by addition 172
Integer register 
moving between NEON scalar and 309–310
VFP register and 280–281
Interrupt clear register 418
Interrupt controllers 449–451
Interrupt-driven program 461
Interrupt enable register 429
Interrupt Identity Register 429
Interrupt mask set/clear register 417

L

Least significant bit (LSB) 11
LED, GPIO device 377–378
Line control register 
pcDuino UART 425, 426t
Raspberry Pi UART 416, 416t
Line driver 410
Line status register 426, 427t
Linked list 
index creation 147, 157f
re-ordering 147
sorted index 158f
sorting 147
Linker 38–40, 46
Linker script 447–448
Linux, accessing devices under 365–376
Load and store instructions 60–61
ARM 55–58
addressing modes 61–63, 61t
exclusive load/store 69–70
multiple register 65–68
NEON 302–309
load copies of structure to all lanes 305–307
multiple structures data 307–309
single structure using one lane 303–305, 304t
single register 64
swap 68–69
Load constant 351–352
Loop unrolling 355
Low pass filter 395–396, 398

M

Macros, GNU assembly directives 48–50
Masked interrupt status register 418
Mathematical operations, VFP 278
Memory 
base address in 367
of executing program 28–31, 29f, 30f
hardware address mapping for 366f
on Raspberry Pi 372
Modem Control Register 429
Modem Scratch Register 429
Modem Status Register 429
Monostable multivibrator 400
Most significant bit (MSB) 11
Multiplication 
binary 
algorithm for 175
large numbers 179–181, 180f
power of two 173
signed multiplication 178–179, 179f, 180b
64-bit numbers 175–176
32-bit numbers 176–177
of two variables 173–176
variable by constant 177–178
in decimal and binary 174b
fixed-point operation 232–233
floating point operation 247
mixed 233
NEON 343–351
estimate reciprocals 348–349
fused multiply accumulate 346
reciprocal step 349–351
saturating multiply and double 347–348
by scalar 345–346
signed 233
unsigned 233
VFP 278
Multistage noise shaping (MASH) filtering 407

N

NEON instructions 298–299, 358–361
arithmetic instructions 335–343
absolute difference 339–340
absolute value and negate 340–341
add vector elements pairwise 338–339
count bits 342–343
select maximum/minimum elements 341–342
vector addition and subtraction 335–337
bitwise logical operations 326–327
with immediate data 327–328
insertion and selection 328–329
comparison operations 322–326
vector absolute compare 324–325
vector comparison 323–324
vector test bits 325–326
data conversion between 
fixed point and single-precision 321–322
half-precision and single-precision 322
data movement instructions 309–320
change size of elements in vector 311–312
duplicate scalar 312–313
extract elements 313–314
move immediate data 310–311
moving between NEON scalar and integer register 309–310
reverse elements 314–315
swap vectors 315–316
table lookup 317–319
transpose matrix 316–317
zip/unzip vectors 319–320
intrinsics functions 299
load and store instructions 302–309
load copies of structure to all lanes 305–307, 308t
multiple structures 306t, 307–309
single structure using one lane 303–305, 304t
multiplication and division 343–351
estimate reciprocals 348–349
fused multiply accumulate 346
reciprocal step 349–351
saturating multiply and double 347–348
by scalar 345–346
pseudo-instructions 351–354
bitwise logical operations with immediate data 352–353
load constant 351–352
vector absolute compare 353–354
shift instructions 329–334
saturating shift right by immediate 332–333
shift and insert 333–334
shift left by immediate 329–330
shift left/right by variable 330–331
shift right by immediate 331–332
sine function 354–358, 357t
double precision 355, 357
performance comparison 357–358, 357t
single precision 354–355
syntax of 299–302
user program registers 300f
Newlib 432
Newton-Raphson method 343, 348–349
for improving reciprocal estimates 349–350
Non-integral mathematics 
fixed-point numbers 
interpreting 226–230
properties of 230–231
Q notation 230
fixed-point operations 
addition and subtraction 231–232
division 234–241
multiplication 232–233
floating point numbers 
double-precision, IEEE 754 245–246
half-precision, IEEE 754 243–245
quad-precision, IEEE 754 246
single-precision, IEEE 754 245
floating point operations 
addition and subtraction 246–247
multiplication and division 247
fractional numbers, base conversion 
arbitrary base to decimal 220
decimal to arbitrary base 220–223
fractions and bases 223–225
Patriot missile failure 261–263
sine and cosine function 
factorial terms, formats and constants 249–251
formats for powers of x 248–249
performance comparison 259–260
table printing 258
using fixed-point calculations 257

O

Operand2 80, 80t, 81t
Operating system 431–432
designers 365–366

P

Parallel communications 409
Patriot missile failure 261–263
pcDuino 382–392
bare-metal programs 
linker script 447
main program 445–447
startup code 445
boot process 442
Clock Control Unit 409
GPIO 
detecting events 390
enabling internal pull-up/pull-down 389–390
function select code assignments 392t
header locations 390f
header pin assignments 391f
pin function setting 384–385
pins available on 390–392
reading and setting GPIO pins 388–389
user program memory space on 372, 376
interrupt controllers 449–451, 457
PWM device 400–403
configuring 403
control register bits 402t
prescaler bits 401t
register map 401t
timer0 device 458, 460
UART 422–429
addresses 422t
divisor latch high register 424, 425t
divisor latch low register 424, 424t
FIFO control register 425t
interrupt control 429
interrupt enable register 429
Interrupt Identity Register 429
line control register 425, 426t
line status register 426, 427t
Modem Control Register 429
Modem Scratch Register 429
Modem Status Register 429
receive buffer register 423, 424t
receive FIFO level register 426, 428t
register offsets 423t
status register 426, 427t
transmit FIFO level register 426, 428t
transmit halt register 428, 428t
transmit holding register 424, 424t
PDP-11 163
Privileged mode 432–433
Program Status Register (PSR) 433–434
mode bits 434t
Pseudo-instructions, ARM processor 73, 93
load address 75–76
load immediate 73–75
NEON 351–354
bitwise logical operations with immediate data 352–353
load constant 351–352
vector absolute compare 353–354
no operation 93–94
shifts 94–95
Pulse density modulation (PDM) 396, 396f
Pulse frequency modulation (PFM) 396, 396f
Pulse modulation 
pcDuino PWM device 400–403
PDM 396, 396f
PWM 397, 397f
Raspberry Pi PWM device 398–400, 400b
types 395
Pulse width modulation (PWM) 397, 397f
pcDuino PWM device 400–403
Raspberry Pi PWM device 398–400, 400b

Q

Q notation 230
Quad-precision floating point number 246

R

Radix point 220
Radix ten Hindu-Arabic system 10
Raspberry Pi 365–367
bare-metal programs 442
linker script 447
main program 445–447
startup code 445
clock management device 406–409
GPIO 378–382
detecting events 382
enabling internal pull-up/pull-down 381–382
header pin assignments 384f
output pins setting 380–381
pin alternate functions 385t
pin function setting 379–380
pins available on 382
reading input pins 381
register 379t
user program memory on 372
header location 383f
interrupt controllers 441, 451
PWM device 398–400, 400b
clock values on 400
control register bits 399t
register map 398t
timer0 device 458–461
UART 413–418
assembly functions for 422
basic programming for 418–422
control register 416, 417t
data register 413, 414t
DMA control register 418
flags register bits 414, 415t
fractional baud rate divisor 414, 416t
integer baud rate divisor 414, 416t
interrupt clear register 418
interrupt control 417
interrupt mask set/clear register 417
line control register bits 416, 416t
masked interrupt status register 418
raw interrupt status register 418
receive status register/error clear register 414, 415t
registers 413t
Raw interrupt status register 418
Receive buffer register, UART 423, 424t
Receive FIFO level register, UART 426, 428t
Receive status register/error clear register 414, 415t
Reciprocals 
estimate 348–349
step 349–351
Reduced Instruction Set Computing (RISC) processor 8
Reverse elements 314–315
RS-232 standard 410, 412
RS-422 standard 410, 412
RS-485 standard 410, 412
RunFast mode 272

S

Saved Program Status Register (SPSR) 432–433
Scalar 
duplication 312–313
multiplication by 345–346
sine function using 285–286
Serial communications 409–429
pcDuino UART 422–429
addresses 422t
divisor latch high register 424, 425t
divisor latch low register 424, 424t
FIFO control register 425t
interrupt control 429
line control register 425, 426t
line status register 426, 427t
receive buffer register 423, 424t
receive FIFO level register 426, 428t
register offsets 423t
status register 426, 427t
transmit FIFO level register 426, 428t
transmit halt register 428, 428t
transmit holding register 424, 424t
Raspberry Pi UART0 413–418
assembly functions for 422
basic programming for 418–422
control register 416, 417t
data register 413, 414t
flags register bits 414, 415t
fractional baud rate divisor 414, 416t
integer baud rate divisor 414, 416t
interrupt control 417
line control register bits 416, 416t
receive status register/error clear register 414, 415t
register map 413t
UART 410–412
Serial Peripheral Interface (SPI) functions 382
Shift instructions, NEON 329–334
saturating shift right by immediate 332–333
shift and insert 333–334
shift left by immediate 329–330
shift left/right by variable 330–331
shift right by immediate 331–332
Sine function 
ARM assembly implementation 251, 257
battery powered systems 260
double precision software float C 259
double precision VFP C 260
factorial terms, formats and constants for 249–251
formats for powers of x 248–249
intermediate calculations 251
double precision 355, 357
performance comparison 357–358, 357t
single precision 354–355
performance comparison 259–260
performance implementations 259t
properties 247–248
scalar implementation 286–287
single precision software float C 259
single precision VFP C 259
sinq 248
table printing 251, 258
32-bit fixed point assembly 259
32-bit fixed point C 259
vector implementation 289, 291
VFP 
performances 291, 292t
scalar mode 285–286
vector mode 287–291
Single instruction multiple data (SIMD) instructions 5
Single-precision floating point number 
fixed point to 284–285, 321–322
half-precision to 322
IEEE 754 245
sine function 354–355
Sorting 
linked list 147
by word frequency 147–150
Spaghetti code 100
Special instructions, ARM 
accessing CPSR and SPSR 91
count leading zeros 90
software interrupt 91–92
thumb mode 92–93
Stack and Heap segments 28–29
Status register 
ARM process 433f
pcDuino UART 426, 427t
Structured programming 
aggregate data types 123–131
arrays 124–125
arrays of structured data 126–131
structured data 124–126
description 99–100
iteration 104–108
for loop 106–108
post-test loop 106
pre-test loop 105
selection 101–104
complex selection 103–104
using branch instructions 102
using conditional execution 101–102
sequencing 100–101
subroutines 108–122
advantages 109
automatic variables 118–119
calling 113–117
disadvantages 110
passing parameters 110–113
recursive functions 119–122
standard C library functions 110
writing 117–118
Subtraction 
by addition 172
in decimal and binary 173b
fixed-point operation 231–232
floating point operation 246–247
ten’s complement 172b
vector 335–337
VFP 278
Swap vectors 315–316

T

Table lookup 317–319
Text section, memory 28–29
Therac-25 
for cancer 161
design flaws 163–165
double pass accelerator 161
history of 162–163
overdose 162–163
X-ray therapy 161
Three-address instruction 80
Transmit FIFO level register 426, 428t
Transmit halt register 428, 428t
Transmit holding register 424, 424t
Transpose matrix 316–317

U

UCS Transformation Format-8-bit (UTF-8) 26–27
Unary operations 277–278
Universal Asynchronous Receiver/Transmitter (UART) 410–412
line driver 410
pcDuino 422–429
addresses 422t
divisor latch high register 424, 425t
divisor latch low register 424, 424t
FIFO control register 425t
interrupt control 429
line control register 425, 426t
line status register 426, 427t
receive buffer register 423, 424t
receive FIFO level register 426, 428t
register offsets 423t
status register 426, 427t
transmit FIFO level register 426, 428t
transmit halt register 428, 428t
transmit holding register 424, 424t
Raspberry Pi 413–418
assembly functions for 422
basic programming for 418–422
control register 416, 417t
data register 413, 414t
flags register bits 414, 415t
fractional baud rate divisor 414, 416t
integer baud rate divisor 414, 416t
interrupt control 417
line control register bits 416, 416t
receive status register/error clear register 414, 415t
register map 413t
standards 410
transmitter and receiver timings for 411f
Universal Character Set (UCS) code 26
Unzip vectors 319–320
User mode 432

V

Vector absolute comparison 324–325, 353–354
Vector floating point (VFP) 
code meanings for 271t
compare instruction 279
coprocessor 266–268
data conversion instructions 282–285
data movement instructions between 279–282
ARM register and VFP system register 282
two VFP registers 279–280
VFP register and one integer register 280–281
VFP register and two integer registers 281
data processing instructions 277–279
compare instruction 279
mathematical operations 278
unary operations 277–278
FPSCR 268–273
instructions 292
load and store instructions 274–277
overview 266–268
register usage rules 273–274
sine function 
performance 291, 292t
using scalar mode 285–286
using vector mode 287–291
user program registers 267f
Vectors 268
addition and subtraction 335–337
change size of elements 311–312
comparison operation 323–324
FPSCR 272–273
sine function using 287–291
swapping 315–316
unzip 319–320
Vector table 434–435, 435t
Vector test bits 325–326

W

wl_print_numerical function 147–150
Word frequency counts, ADT 
better performance 150–161
binary tree of 151f, 157f, 158f
C header for 141–142
C implementation 141–142, 145, 150–151, 157
C program to compute 140–141
makefile for 141–142, 146
revised makefile for 148–150
sorting by 147–150

Z

Zip vectors 319–320


List of Tables

Table 1.1 Values represented by two bits 9

Table 1.2 The first 21 integers (starting with 0) in various bases 10

Table 1.3 The ASCII control characters 21

Table 1.4 The ASCII printable characters 22

Table 1.5 Binary equivalents for each character in “Hello World” 23

Table 1.6 Binary, hexadecimal, and decimal equivalents for each character in “Hello World” 24

Table 1.7 Interpreting a hexadecimal string as ASCII 24

Table 1.8 Variations of the ISO 8859 standard 25

Table 1.9 UTF-8 encoding of the ISO/IEC 10646 code points 27

Table 3.1 Flag bits in the CPSR register 58

Table 3.2 ARM condition modifiers 59

Table 3.3 Legal and illegal values for #<immediate|symbol> 60

Table 3.4 ARM addressing modes 61

Table 3.5 ARM shift and rotate operations 61

Table 4.1 Shift and rotate operations in Operand2 80

Table 4.2 Formats for Operand2 81

Table 8.1 Format for IEEE 754 half-precision 244

Table 8.2 Result formats for each term 252

Table 8.3 Shifts required for each term 252

Table 8.4 Performance of sine function with various implementations 259

Table 9.1 Condition code meanings for ARM and VFP 271

Table 9.2 Performance of sine function with various implementations 292

Table 10.1 Parameter combinations for loading and storing a single structure 304

Table 10.2 Parameter combinations for loading multiple structures 306

Table 10.3 Parameter combinations for loading copies of a structure 308

Table 10.4 Performance of sine function with various implementations 357

Table 11.1 Raspberry Pi GPIO register map 379

Table 11.2 GPIO pin function select bits 380

Table 11.3 GPPUD control codes 381

Table 11.4 Raspberry Pi expansion header useful alternate functions 385

Table 11.5 Number of pins available on each of the AllWinner A10/A20 PIO ports 385

Table 11.6 Registers in the AllWinner GPIO device 386

Table 11.7 AllWinner A10/A20 GPIO pin function select bits 388

Table 11.8 Pull-up and pull-down resistor control codes 389

Table 11.9 pcDuino GPIO pins and function select code assignments 392

Table 12.1 Raspberry Pi PWM register map 398

Table 12.2 Raspberry Pi PWM control register bits 399

Table 12.3 Prescaler bits in the pcDuino PWM device 401

Table 12.4 pcDuino PWM register map 401

Table 12.5 pcDuino PWM control register bits 402

Table 13.1 Clock sources available for the clocks provided by the clock manager 407

Table 13.2 Some registers in the clock manager device 407

Table 13.3 Bit fields in the clock manager control registers 408

Table 13.4 Bit fields in the clock manager divisor registers 408

Table 13.5 Clock signals in the AllWinner A10/A20 SOC 409

Table 13.6 Raspberry Pi UART0 register map 413

Table 13.7 Raspberry Pi UART data register 414

Table 13.8 Raspberry Pi UART receive status register/error clear register 415

Table 13.9 Raspberry Pi UART flags register bits 415

Table 13.10 Raspberry Pi UART integer baud rate divisor 416

Table 13.11 Raspberry Pi UART fractional baud rate divisor 416

Table 13.12 Raspberry Pi UART line control register bits 416

Table 13.13 Raspberry Pi UART control register bits 417

Table 13.14 pcDuino UART addresses 422

Table 13.15 pcDuino UART register offsets 423

Table 13.16 pcDuino UART receive buffer register 424

Table 13.17 pcDuino UART transmit holding register 424

Table 13.18 pcDuino UART divisor latch low register 424

Table 13.19 pcDuino UART divisor latch high register 425

Table 13.20 pcDuino UART FIFO control register 425

Table 13.21 pcDuino UART line control register 426

Table 13.22 pcDuino UART line status register 427

Table 13.23 pcDuino UART status register 427

Table 13.24 pcDuino UART transmit FIFO level register 428

Table 13.25 pcDuino UART receive FIFO level register 428

Table 13.26 pcDuino UART transmit halt register 428

Table 14.1 The ARM user and system registers 433

Table 14.2 Mode bits in the PSR 434

Table 14.3 ARM vector table 435

List of Figures

Figure 1.1 Simplified representation of a computer system 4

Figure 1.2 Stages of a typical compilation sequence 6

Figure 1.3 Tables used for converting between binary, octal, and hex 14

Figure 1.4 Four different representations for binary integers 16

Figure 1.5 Complement tables for bases ten and two 17

Figure 1.6 A section of memory 29

Figure 1.7 Typical memory layout for a program with a 32-bit address space 30

Figure 2.1 Equivalent static variable declarations in assembly and C 42

Figure 3.1 The ARM processor architecture 54

Figure 3.2 The ARM user program registers 56

Figure 3.3 The ARM process status register 57

Figure 5.1 ARM user program registers 112

Figure 6.1 Binary tree of word frequencies 151

Figure 6.2 Binary tree of word frequencies with index added 157

Figure 6.3 Binary tree of word frequencies with sorted index 158

Figure 7.1 In signed 8-bit math, 11011001₂ is −39₁₀ 179

Figure 7.2 In unsigned 8-bit math, 11011001₂ is 217₁₀ 179

Figure 7.3 Multiplication of large numbers 180

Figure 7.4 Longhand division in decimal and binary 181

Figure 7.5 Flowchart for binary division 183

Figure 8.1 Examples of fixed-point signed arithmetic 232

Figure 9.1 ARM integer and vector floating point user program registers 267

Figure 9.2 Bits in the FPSCR 268

Figure 10.1 ARM integer and NEON user program registers 300

Figure 10.2 Pixel data interleaved in three doubleword registers 302

Figure 10.3 Pixel data de-interleaved in three doubleword registers 303

Figure 10.4 Example of vext.8 d12,d4,d9,#5 313

Figure 10.5 Examples of the vrev instruction. (A) vrev16.8 d3,d4; (B) vrev32.16 d8,d9; (C) vrev32.8 d5,d7 315

Figure 10.6 Examples of the vtrn instruction. (A) vtrn.8 d14,d15; (B) vtrn.32 d31,d15 316

Figure 10.7 Transpose of a 3 × 3 matrix 317

Figure 10.8 Transpose of a 4 × 4 matrix of 32-bit numbers 318

Figure 10.9 Example of vzip.8 d9,d4 320

Figure 10.10 Effects of vsli.32 d4,d9,#6 334

Figure 11.1 Typical hardware address mapping for memory and devices 366

Figure 11.2 GPIO pins being used for input and output. (A) GPIO pin being used as input to read the state of a push-button switch. (B) GPIO pin being used as output to drive an LED 378

Figure 11.3 The Raspberry Pi expansion header location 383

Figure 11.4 The Raspberry Pi expansion header pin assignments 384

Figure 11.5 Bit-to-pin assignments for PIO control registers 388

Figure 11.6 The pcDuino header locations 390

Figure 11.7 The pcDuino header pin assignments 391

Figure 12.1 Pulse density modulation 396

Figure 12.2 Pulse width modulation 397

Figure 13.1 Typical system with a clock management device 406

Figure 13.2 Transmitter and receiver timings for two UARTs. (A) Waveform of a UART transmitting a byte. (B) Timing of UART receiving a byte 411

Figure 14.1 The ARM process status register 433

Figure 14.2 Basic exception processing 436

Figure 14.3 Exception processing with multiple user processes 437

List of Listings

Listing 2.1 “Hello World” program in ARM assembly 36

Listing 2.2 “Hello World” program in C 37

Listing 2.3 “Hello World” assembly listing 39

Listing 2.4 A listing with misaligned data 43

Listing 2.5 A listing with properly aligned data 45

Listing 2.6 Defining a symbol for the number of elements in an array 47

Listing 5.1 Selection in C 101

Listing 5.2 Selection in ARM assembly using conditional execution 102

Listing 5.3 Selection in ARM assembly using branch instructions 102

Listing 5.4 Complex selection in C 103

Listing 5.5 Complex selection in ARM assembly 104

Listing 5.6 Unconditional loop in ARM assembly 105

Listing 5.7 Pre-test loop in ARM assembly 105

Listing 5.8 Post-test loop in ARM assembly 106

Listing 5.9 for loop in C 106

Listing 5.10 for loop rewritten as a pre-test loop in C 107

Listing 5.11 Pre-test loop in ARM assembly 107

Listing 5.12 for loop rewritten as a post-test loop in C 108

Listing 5.13 Post-test loop in ARM assembly 108

Listing 5.14 Calling scanf and printf in C 111

Listing 5.15 Calling scanf and printf in ARM assembly 111

Listing 5.16 Simple function call in C 114

Listing 5.17 Simple function call in ARM assembly 114

Listing 5.18 A larger function call in C 114

Listing 5.19 A larger function call in ARM assembly 115

Listing 5.20 A function call using the stack in C 115

Listing 5.21 A function call using the stack in ARM assembly 116

Listing 5.22 A function call using stm to push arguments onto the stack 116

Listing 5.23 A small function in C 118

Listing 5.24 A small function in ARM assembly 118

Listing 5.25 A small C function with a register variable 119

Listing 5.26 Automatic variables in ARM assembly 119

Listing 5.27 A C program that uses recursion to reverse a string 120

Listing 5.28 ARM assembly implementation of the reverse function 121

Listing 5.29 Better implementation of the reverse function 122

Listing 5.30 Even better implementation of the reverse function 122

Listing 5.31 String reversing in C using pointers 123

Listing 5.32 String reversing in assembly using pointers 123

Listing 5.33 Initializing an array of integers in C 124

Listing 5.34 Initializing an array of integers in assembly 125

Listing 5.35 Initializing a structured data type in C 125

Listing 5.36 Initializing a structured data type in ARM assembly 126

Listing 5.37 Initializing an array of structured data in C 127

Listing 5.38 Initializing an array of structured data in assembly 128

Listing 5.39 Improved initialization in assembly 129

Listing 5.40 Very efficient initialization in assembly 130

Listing 6.1 Definition of an Abstract Data Type in a C header file 138

Listing 6.2 Definition of the image structure may be hidden in a separate header file 139

Listing 6.3 Definition of an ADT in Assembly 140

Listing 6.4 C program to compute word frequencies 140

Listing 6.5 C header for the wordlist ADT 142

Listing 6.6 C implementation of the wordlist ADT 143

Listing 6.7 Makefile for the wordfreq program 146

Listing 6.8 ARM assembly implementation of wl_print_numerical() 148

Listing 6.9 Revised makefile for the wordfreq program 149

Listing 6.10 C implementation of the wordlist ADT using a tree 151

Listing 6.11 ARM assembly implementation of wl_print_numerical() with a tree 158

Listing 7.1 ARM assembly code for adding two 64-bit numbers 176

Listing 7.2 ARM assembly code for multiplication with a 64-bit result 176

Listing 7.3 ARM assembly code for multiplication with a 32-bit result 177

Listing 7.4 ARM assembly implementation of signed and unsigned 32-bit and 64-bit division functions 187

Listing 7.5 ARM assembly code for division by constant 193 192

Listing 7.6 ARM assembly code for division of a variable by a constant without using a multiply instruction 193

Listing 7.7 Header file for a big integer abstract data type 195

Listing 7.8 C source code file for a big integer abstract data type 196

Listing 7.9 Program using the bigint ADT to calculate the factorial function 211

Listing 7.10 ARM assembly implementation of the bigint_adc function 213

Listing 8.1 Examples of fixed-point multiplication in ARM assembly 233

Listing 8.2 Dividing x by 23 239

Listing 8.3 Dividing x by 23 Using Only Shift and Add 240

Listing 8.4 Dividing x by − 50 242

Listing 8.5 Inefficient representation of a binimal 242

Listing 8.6 Efficient representation of a binimal 243

Listing 8.7 ARM assembly implementation of sin x and cos x using fixed-point calculations 252

Listing 8.8 Example showing how the sin x and cos x functions can be used to print a table 257

Listing 9.1 Simple scalar implementation of the sin x function using IEEE single precision 285

Listing 9.2 Simple scalar implementation of the sin x function using IEEE double precision 286

Listing 9.3 Vector implementation of the sin x function using IEEE single precision 288

Listing 9.4 Vector implementation of the sin x function using IEEE double precision 289

Listing 10.1 NEON implementation of the sin x function using single precision 354

Listing 10.2 NEON implementation of the sin x function using double precision 355

Listing 11.1 Function to map devices into the user program memory on a Raspberry Pi 367

Listing 11.2 Function to map devices into the user program memory space on a pcDuino 372

Listing 11.3 ARM assembly code to set GPIO pin 26 to alternate function 1 381

Listing 11.4 ARM assembly code to configure PA10 for output 388

Listing 11.5 ARM assembly code to set PA10 to output a high state 389

Listing 11.6 ARM assembly code to read the state of PI14 and set or clear the Z flag 389

Listing 13.1 Assembly functions for using the Raspberry Pi UART 418

Listing 14.1 Definitions for ARM CPU modes 435

Listing 14.2 Function to set up the ARM exception table 439

Listing 14.3 Stubs for the exception handlers 440

Listing 14.4 Skeleton for an exception handler 441

Listing 14.5 ARM startup code 443

Listing 14.6 A simple main program 446

Listing 14.7 A sample GNU linker script 448

Listing 14.8 A sample make file 450

Listing 14.9 Running make to build the image 451

Listing 14.10 An improved main program 452

Listing 14.11 ARM startup code with timer interrupt 453

Listing 14.12 Functions to manage the pcDuino interrupt controller 454

Listing 14.13 Functions to manage the Raspberry Pi interrupt controller 457

Listing 14.14 Functions to manage the pcDuino timer0 device 459

Listing 14.15 Functions to manage the Raspberry Pi timer0 device 460

Listing 14.16 IRQ handler to clear the timer interrupt 462

Listing 14.17 A sample make file 463

Listing 14.18 Running make to build the image 464

Preface

This book is intended to be used in a first course in assembly language programming for Computer Science (CS) and Computer Engineering (CE) students. It is assumed that students using this book have already taken courses in programming and data structures, and are competent programmers in at least one high-level language. Many of the code examples in the book are written in C, with an assembly implementation following. The assembly examples can stand on their own, but students who are familiar with C, C++, or Java should find the C examples helpful.

Computer Science and Computer Engineering are very large fields. It is impossible to cover everything that a student may eventually need to know. There are a limited number of course hours available, so educators must strive to deliver degree programs that strike a balance between the number of concepts and skills that students learn and the depth at which they learn them. Obviously, with these competing goals it is difficult to reach consensus on exactly what courses should be included in a CS or CE curriculum.

Traditionally, assembly language courses have consisted of a mechanistic learning of a set of instructions, registers, and syntax. Partially because of this approach, over the years, assembly language courses have been marginalized in, or removed altogether from, many CS and CE curricula. The author feels that this is unfortunate, because a solid understanding of assembly language leads to better understanding of higher-level languages, compilers, interpreters, architecture, operating systems, and other important CS and CE concepts.

One of the goals of this book is to make a course in assembly language more valuable by introducing methods (and a bit of theory) that are not covered in any other CS or CE courses, while using assembly language to implement the methods. In this way, the course in assembly language goes far beyond the traditional assembly language course, and can once again play an important role in the overall CS and CE curricula.

Choice of Processor Family

Because of their ubiquity, x86 based systems have been the platforms of choice for most assembly language courses over the last two decades. The author believes that this is unfortunate, because in every respect other than ubiquity, the x86 architecture is the worst possible choice for learning and teaching assembly language. The newer chips in the family have hundreds of instructions, and irregular rules govern how those instructions can be used. In an attempt to make it possible for students to succeed, typical courses use antiquated assemblers and interface with the antiquated IBM PC BIOS, using only a small subset of the modern x86 instruction set. The programming environment has little or no relevance to modern computing.

Partially because of this tendency to use x86 platforms, and the resulting unnecessary burden placed on students and instructors, as well as the reliance on antiquated and irrelevant development environments, assembly language is often viewed by students as very difficult and lacking in value. The author hopes that this textbook helps students to realize the value of knowing assembly language. The relatively simple ARM processor family was chosen in hopes that the students also learn that although assembly language programming may be more difficult than high-level languages, it can be mastered.

The recent development of very low-cost ARM based Linux computers has caused a surge of interest in the ARM architecture as an alternative to the x86 architecture, which has become increasingly complex over the years. This book should provide a solution for a growing need.

Many students have difficulty with the concept that a register can hold variable x at one point in the program, and hold variable y at some other point. They also often have difficulty with the concept that, before it can be involved in any computation, data has to be moved from memory into the CPU. Using a load-store architecture helps the students to more readily grasp these concepts.

Another common difficulty that students have is in relating the concepts of an address and a pointer variable. You can almost see the little light bulbs light up over their heads when they have the “eureka!” moment and realize that pointers are just variables that hold an address. The author hopes that the approach taken in this book will make it easier for students to have that “eureka!” moment. The author believes that load-store architectures make that realization easier.

Many students also struggle with the concept of recursion, regardless of what language is used. In assembly, the mechanisms involved are exposed and directly manipulated by the programmer. Examples of recursion are scattered throughout this textbook. Again, the clean architecture of the ARM makes it much easier for the students to understand what is going on.

Some students have difficulty understanding the flow of a program, and tend to put many unnecessary branches into their code. Many assembly language courses spend so much time and space on learning the instruction set that they never have time to teach good programming practices. This textbook puts strong emphasis on using structured programming concepts. The relative simplicity of the ARM architecture makes this possible.

One of the major reasons to learn and use assembly language is that it allows the programmer to create very efficient mathematical routines. The concepts introduced in this book will enable students to perform efficient non-integral math on any processor. These techniques are rarely taught because of the time that it takes to cover the x86 instruction set. With the ARM processor, less time is spent on the instruction set, and more time can be spent teaching how to optimize the code.

The combination of the ARM processor and the Linux operating system provides the least costly hardware platform and development environment available. A cluster of 10 Raspberry Pis, or similar hosts, with power supplies and networking, can be assembled for 500 US dollars or less. This cluster can support up to 50 students logging in through ssh. If their client platform supports the X window system, then they can run GUI enabled applications. Alternatively, most low-cost ARM systems can directly drive a display and take input from a keyboard and mouse. With the addition of an NFS server (which itself could be a low-cost ARM system and a hard drive), an entire Linux ARM based laboratory of 20 workstations could be built for 250 US dollars per seat or less. Admittedly, it would not be a high-performance laboratory, but could be used to teach C, assembly, and other languages. The author would argue that inexperienced programmers should learn to program on low-performance machines, because it reinforces a life-long tendency towards efficiency.

General Approach

The approach of this book is to present concepts in different ways throughout the book, slowly building from simple examples towards complex programming on bare-metal embedded systems. Students who don’t understand a concept when it is explained in a certain way may easily grasp the concept when it is presented later from a different viewpoint.

The main objective of this book is to provide an improved course in assembly language by replacing the x86 platform with one that is less costly, more ubiquitous, well-designed, powerful, and easier to learn. Since students are able to master the basics of assembly language quickly, it is possible to teach a wider range of topics, such as fixed and floating point mathematics, ethical considerations, performance tuning, and interrupt processing. The author hopes that courses using this book will better prepare students for the junior and senior level courses in operating systems, computer architecture, and compilers.

Companion Website

Please visit the companion web site to access additional resources. Instructors may download the author’s lecture slides and solution manual for the exercises. Students and instructors may also access the laboratory manual and additional code examples. The author welcomes suggestions for additional lecture slides, laboratory assignments, or other materials.

http://booksite.elsevier.com/9780128036983

Acknowledgments

I would like to thank Randy Warner for reading the manuscript, catching errors, and making helpful suggestions. I would also like to thank the following students for suggesting exercises with answers and catching numerous errors in the drafts: Zach Buechler, Preston Cook, Joshua Daybrest, Matthew DeYoung, Josh Dodd, Matt Dyke, Hafiza Farzami, Jeremy Goens, Lawrence Hoffman, Colby Johnson, Benjamin Kaiser, Lauren Keene, Jayson Kjenstad, Murray LaHood-Burns, Derek Lane, Yanlin Li, Luke Meyer, Matthew Mielke, Forrest Miller, Christopher Navarro, Girik Ranchhod, Josh Schweigert, Christian Sieh, Weston Silbaugh, Jacob St. Amand, Njaal Tengesdal, Dylan Thoeny, Michael Vortherms, Dicheng Wu, and Kekoa (Peter) Yamaguchi. Finally, I am also very grateful for my assistants, Scott Logan, Ian Carlson, and Derek Stotz, who gave very valuable feedback during the writing of this book.

Part I

Assembly as a Language

Chapter 1

Introduction

Abstract

This chapter first gives a very high-level description of the major components and functions of a computer system. It then motivates the reader by giving reasons why learning assembly language is important for Computer Scientists and Computer Engineers. It then explains why the ARM processor is a good choice for a first assembly language. Next it explains binary data representations, including various integer formats, ASCII, and Unicode. Finally, it describes the memory sections for a typical program during execution. By the end of the chapter, the groundwork has been laid for learning to program in assembly language.

Keywords

Instruction; Instruction stream; Central processing unit; Memory; Input/output device; High-level language; Assembly language; ARM processor; Binary; Hexadecimal; Decimal; Radix or base system; Base conversion; Sign magnitude; Unsigned; Complement; Excess-n; ASCII; Unicode; UTF-8; Stack; Heap; Data section; Text section

An executable computer program is, ultimately, just a series of numbers that have very little or no meaning to a human being. We have developed a variety of human-friendly languages in which to express computer programs, but in order for the program to execute, it must eventually be reduced to a stream of numbers. Assembly language is one step above writing the stream of numbers. The stream of numbers is called the instruction stream. Each number in the instruction stream instructs the computer to perform one (usually small) operation. Although each instruction does very little, the ability of the programmer to specify any sequence of instructions and the ability of the computer to perform billions of these small operations every second makes modern computers very powerful and flexible tools. In assembly language, one line of code usually gets translated into one machine instruction. In high-level languages, a single line of code may generate many machine instructions.

A simplified model of a computer system, as shown in Fig. 1.1, consists of memory, input/output devices, and a central processing unit (CPU), connected together by a system bus. The bus can be thought of as a roadway that allows data to travel between the components of the computer system. The CPU is the part of the system where most of the computation occurs, and the CPU controls the other devices in the system.

f01-01-9780128036983
Figure 1.1 Simplified representation of a computer system.

Memory can be thought of as a series of mailboxes. Each mailbox can hold a single postcard with a number written on it, and each mailbox has a unique numeric identifier. The identifier, x, is called the memory address, and the number stored in the mailbox is called the contents of address x. Some of the mailboxes contain data, and others contain instructions which control what actions are performed by the CPU.

The CPU also contains a much smaller set of mailboxes, which we call registers. Data can be copied from cards stored in memory to cards stored in the CPU, or vice-versa. Once data has been copied into one of the CPU registers, it can be used in computation. For example, in order to add two numbers in memory, they must first be copied into registers on the CPU. The CPU can then add the numbers together and store the result in one of the CPU registers. The result of the addition can then be copied back into one of the mailboxes in the memory.

Modern computers execute instructions sequentially. In other words, the next instruction to be executed is at the memory address immediately following the current instruction. One of the registers in the CPU, the program counter (PC), keeps track of the location from which the next instruction is to be fetched. The CPU follows a very simple sequence of actions. It fetches an instruction from memory, increments the PC, executes the instruction, and then repeats the process with the next instruction. However, some instructions may change the PC, so that the next instruction is fetched from a non-sequential address.

1.1 Reasons to Learn Assembly

There are many high-level programming languages, such as Java, Python, C, and C++ that have been designed to allow programmers to work at a high level of abstraction, so that they do not need to understand exactly what instructions are needed by a particular CPU. For compiled languages, such as C and C++, a compiler handles the task of translating the program, written in a high-level language, into assembly language for the particular CPU on the system. An assembler then converts the program from assembly language into the binary codes that the CPU reads as instructions.

High-level languages can greatly enhance programmer productivity. However, there are some situations where writing assembly code directly is desirable or necessary. For example, assembly language may be the best choice when writing

 the first steps in booting the computer,

 code to handle interrupts,

 low-level locking code for multi-threaded programs,

 code for machines where no compiler exists,

 code which needs to be optimized beyond the limits of the compiler,

 on computers with very limited memory, and

 code that requires low-level access to architectural and/or processor features.

Aside from sheer necessity, there are several other reasons why it is still important for computer scientists to learn assembly language.

One example where knowledge of assembly is indispensable is when designing and implementing compilers for high-level languages. As shown in Fig. 1.2, a typical compiler for a high-level language must generate assembly language as its output. Most compilers are designed to have multiple stages. In the input stage, the source language is read and converted into a graph representation. The graph may be optimized before being passed to the output, or code generation, stage where it is converted to assembly language. The assembly is then fed into the system’s assembler to generate an object file. The object file is linked with other object files (which are often combined into libraries) to create an executable program.

f01-02-9780128036983
Figure 1.2 Stages of a typical compilation sequence.

The code generation stage of a compiler must traverse the graph and emit assembly code. The quality of the assembly code that is generated can have a profound influence on the performance of the executable program. Therefore, the programmer responsible for the code generation portion of the compiler must be well versed in assembly programming for the target CPU.

Some people believe that a good optimizing compiler will generate better assembly code than a human programmer. This belief is not justified. Highly optimizing compilers have lots of clever algorithms, but like all programs, they are not perfect. Outside of the cases that they were designed for, they do not optimize well. Many newer CPUs have instructions which operate on multiple items of data at once. However, compilers rarely make use of these powerful single instruction multiple data (SIMD) instructions. Instead, it is common for programmers to write functions in assembly language to take advantage of SIMD instructions. The assembly functions are assembled into object file(s), then linked with the object file(s) generated from the high-level language compiler.

Many modern processors also have some support for processing vectors (arrays). Compilers are usually not very good at making effective use of the vector instructions. In order to achieve excellent vector performance for audio or video codecs and other time-critical code, it is often necessary to resort to small pieces of assembly code in the performance-critical inner loops. A good example of this type of code is when performing vector and matrix multiplies. Such operations are commonly needed in processing images and in graphical applications. The ARM vector instructions are explained in Chapter 9.

Another reason for assembly is when writing certain parts of an operating system. Although modern operating systems are mostly written in high-level languages, there are some portions of the code that can only be done in assembly. Typical uses of assembly language are when writing device drivers, saving the state of a running program so that another program can use the CPU, restoring the saved state of a running program so that it can resume executing, and managing memory and memory protection hardware. There are many other tasks central to a modern operating system which can only be accomplished in assembly language. Careful design of the operating system can minimize the amount of assembly required, but cannot eliminate it completely.

Another good reason to learn assembly is for debugging. Simply understanding what is going on “behind the scenes” of compiled languages such as C and C++ can be very valuable when trying to debug programs. If there is a problem in a call to a third-party library, sometimes the only way a developer can isolate and diagnose the problem is to run the program under a debugger and step through it one machine instruction at a time. This does not require a deep knowledge of assembly language coding, but at least a passing familiarity with assembly is helpful in that particular case. Analysis of assembly code is an important skill for C and C++ programmers, who may occasionally have to diagnose a fault by looking at the contents of CPU registers and single-stepping through machine instructions.

Assembly language is an important part of the path to understanding how the machine works. Even though only a small percentage of computer scientists will be lucky enough to work on the code generator of a compiler, they all can benefit from the deeper level of understanding gained by learning assembly language. Many programmers do not really understand pointers until they have written assembly language.

Without first learning assembly language, it is impossible to learn advanced concepts such as microcode, pipelining, instruction scheduling, out-of-order execution, threading, branch prediction, and speculative execution. There are many other concepts, especially when dealing with operating systems and computer architecture, which require some understanding of assembly language. The best programmers understand why some language constructs perform better than others, how to reduce cache misses, and how to prevent buffer overruns that destroy security.

Every program is meant to run on a real machine. Even though there are many languages, compilers, virtual machines, and operating systems to enable the programmer to use the machine more conveniently, the strengths and weaknesses of that machine still determine what is easy and what is hard. Learning assembly is a fundamental part of understanding enough about the machine to make informed choices about how to write efficient programs, even when writing in a high-level language.

As an analogy, most people do not need to know a lot about how an internal combustion engine works in order to operate an automobile. A race car driver needs a much better understanding of exactly what happens when he or she steps on the accelerator pedal in order to be able to judge precisely when (and how hard) to do so. Also, who would trust their car to a mechanic who could not tell the difference between a spark plug and a brake caliper? Worse still, should we trust an engineer to build a car without that knowledge? Even in this day of computerized cars, someone needs to know the gritty details, and they are paid well for that knowledge. Knowledge of assembly language is one of the things that defines the computer scientist and engineer.

When learning assembly language, the specific instruction set is not critically important, because what is really being learned is the fine detail of how a typical stored-program machine uses different storage locations and logic operations to convert a string of bits into a meaningful calculation. However, when it comes to learning assembly languages, some processors make it more difficult than it needs to be. Because some processors have an instruction set that is extremely irregular, non-orthogonal, large, and poorly designed, they are not a good choice for learning assembly. The author feels that teaching students their first assembly language on one of those processors should be considered a crime, or at least a form of mental abuse. Luckily, there are processors that are readily available, low-cost, and relatively easy to learn assembly with. This book uses one of them as the model for assembly language.

1.2 The ARM Processor

In the late 1970s, the microcomputer industry was a fierce battleground, with several companies competing to sell computers to small business and home users. One of those companies, based in the United Kingdom, was Acorn Computers Ltd. Acorn’s flagship product, the BBC Micro, was based on the same processor that Apple Computer had chosen for their Apple II™ line of computers: the 8-bit 6502 made by MOS Technology. As the 1980s approached, microcomputer manufacturers were looking for more powerful 16-bit and 32-bit processors. The engineers at Acorn considered the processor chips that were available at the time, and concluded that there was nothing available that would meet their needs for the next generation of Acorn computers.

The only reasonably-priced processors that were available were the Motorola 68000 (a 32-bit processor used in the Apple Macintosh and most high-end Unix workstations) and the Intel 80286 (a 16-bit processor used in less powerful personal computers such as the IBM PC). During the previous decade, a great deal of research had been conducted on developing high-performance computer architectures. One of the outcomes of that research was the development of a new paradigm for processor design, known as Reduced Instruction Set Computing (RISC). One advantage of RISC processors was that they could deliver higher performance with a much smaller number of transistors than the older Complex Instruction Set Computing (CISC) processors such as the 68000 and 80286. The engineers at Acorn decided to design and produce their own processor. They used the BBC Micro to design and simulate their new processor, and in 1987, they introduced the Acorn Archimedes™. The Archimedes™ was arguably the most powerful home computer in the world at that time, with graphics and audio capabilities that IBM PC™ and Apple Macintosh™ users could only dream about. Thus began the long and successful dynasty of the Acorn RISC Machine (ARM) processor.

Acorn never made a big impact on the global computer market. Although Acorn eventually went out of business, the processor that they created has lived on. It was re-named to the Advanced RISC Machine, and is now known simply as ARM. Stewardship of the ARM processor belongs to ARM Holdings plc, which manages the design of new ARM architectures and licenses the manufacturing rights to other companies. ARM Holdings does not manufacture any processor chips, yet more ARM processors are produced annually than all other processor designs combined. Most ARM processors are used as components for embedded systems and portable devices. If you have a smart phone or similar device, then there is a very good chance that it has an ARM processor in it. Because of its enormous market presence, clean architecture, and small, orthogonal instruction set, the ARM is a very good choice for learning assembly language.

Although it dominates the portable device market, the ARM processor has almost no presence in the desktop or server market. However, that may change. In 2012, ARM Holdings announced the ARM64 architecture, which is the first major redesign of the ARM architecture in 30 years. The ARM64 is intended to compete for the desktop and server market with other high-end processors such as the Sun SPARC and Intel Xeon. Regardless of whether or not the ARM64 achieves much market penetration, the original ARM 32-bit processor architecture is so ubiquitous that it clearly will be around for a long time.

1.3 Computer Data

The basic unit of data in a digital computer is the binary digit, or bit. A bit can have a value of zero or one. In order to store numbers larger than 1, bits are combined into larger units. For instance, using two bits, it is possible to represent any number between zero and three. This is shown in Table 1.1. When stored in the computer, all data is simply a string of binary digits. There is more than one way that such a fixed-length string of binary digits can be interpreted.

Table 1.1

Values represented by two bits

Bit 1 Bit 0 Value
0 0 0
0 1 1
1 0 2
1 1 3

Computers have been designed using many different bit group sizes, including 4, 8, 10, 12, and 14 bits. Today most computers recognize a basic grouping of 8 bits, which we call a byte. Some computers can work in units of 4 bits, which is commonly referred to as a nibble (sometimes spelled “nybble”). A nibble is a convenient size because it can exactly represent one hexadecimal digit. Additionally, most modern computers can also work with groupings of 16, 32 and 64 bits. The CPU is designed with a default word size. For most modern CPUs, the default word size is 32 bits. Many processors support 64-bit words, which is increasingly becoming the default size.

1.3.1 Representing Natural Numbers

A numeral system is a writing system for expressing numbers. The most common system is the Hindu-Arabic number system, which is now used throughout the world. Almost from the first day of formal education, children begin learning how to add, subtract, and perform other operations using the Hindu-Arabic system. After years of practice, performing basic mathematical operations using strings of digits between 0 and 9 seems natural. However, there are other ways to count and perform arithmetic, such as Roman numerals, unary systems, and Chinese numerals. With a little practice, it is possible to become as proficient at performing mathematics with other number systems as with the Hindu-Arabic system.

The Hindu-Arabic system is a base ten or radix ten system, because it uses the ten digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. For our purposes, the words radix and base are equivalent, and refer to the number of individual digits available in the numbering system. The Hindu-Arabic system is also a positional system, or a place-value notation, because the value of each digit in a number depends on its position in the number. The radix ten Hindu-Arabic system is only one of an infinite family of closely related positional systems. The members of this family differ only in the radix used (and therefore, the number of characters used). For bases greater than base ten, characters are borrowed from the alphabet and used to represent digits. For example, the first column in Table 1.2 shows the character “A” being used as a single digit representation for the number 10.

Table 1.2

The first 21 integers (starting with 0) in various bases

Base
 16   10    9    8    7    6    5    4    3      2
  0    0    0    0    0    0    0    0    0      0
  1    1    1    1    1    1    1    1    1      1
  2    2    2    2    2    2    2    2    2     10
  3    3    3    3    3    3    3    3   10     11
  4    4    4    4    4    4    4   10   11    100
  5    5    5    5    5    5   10   11   12    101
  6    6    6    6    6   10   11   12   20    110
  7    7    7    7   10   11   12   13   21    111
  8    8    8   10   11   12   13   20   22   1000
  9    9   10   11   12   13   14   21  100   1001
  A   10   11   12   13   14   20   22  101   1010
  B   11   12   13   14   15   21   23  102   1011
  C   12   13   14   15   20   22   30  110   1100
  D   13   14   15   16   21   23   31  111   1101
  E   14   15   16   20   22   24   32  112   1110
  F   15   16   17   21   23   30   33  120   1111
 10   16   17   20   22   24   31  100  121  10000
 11   17   18   21   23   25   32  101  122  10001
 12   18   20   22   24   30   33  102  200  10010
 13   19   21   23   25   31   34  103  201  10011
 14   20   22   24   26   32   40  110  202  10100

In base ten, we think of numbers as strings of the 10 digits, “0”–“9”. Each digit counts 10 times the amount of the digit to its right. If we restrict ourselves to integers, then the digit furthest to the right is always the ones digit. It is also referred to as the least significant digit. The digit immediately to the left of the ones digit is the tens digit. To the left of that is the hundreds digit, and so on. The leftmost digit is referred to as the most significant digit. The following equation shows how a number can be decomposed into its constituent digits:

57839₁₀ = 5×10⁴ + 7×10³ + 8×10² + 3×10¹ + 9×10⁰.

Note that the subscript of “10” on 57839₁₀ indicates that the number is given in base ten.

Imagine that we only had 7 digits: 0, 1, 2, 3, 4, 5, and 6. We need 10 digits for base ten, so with only 7 digits we are limited to base seven. In base seven, each digit in the string represents a power of seven rather than a power of ten. We can represent any integer in base seven, but it may take more digits than in base ten. Other than using a different base for the power of each digit, the math works exactly the same as for base ten. For example, suppose we have the following number in base seven: 330425₇. We can convert this number to base ten as follows:

330425₇ = 3×7⁵ + 3×7⁴ + 0×7³ + 4×7² + 2×7¹ + 5×7⁰
        = 50421₁₀ + 7203₁₀ + 0₁₀ + 196₁₀ + 14₁₀ + 5₁₀
        = 57839₁₀

Base two, or binary, is the “native” number system for modern digital systems. The reason for this is mainly that it is relatively easy to build circuits with two stable states: on and off (or 1 and 0). Building circuits with more than two stable states is much more difficult and expensive, and any computation that can be performed in a higher base can also be performed in binary. The least significant (rightmost) digit in binary is referred to as the least significant bit, or LSB, while the leftmost binary digit is referred to as the most significant bit, or MSB.

1.3.2 Base Conversion

The most common bases used by programmers are base two (binary), base eight (octal), base ten (decimal) and base sixteen (hexadecimal). Octal and hexadecimal are common because, as we shall see later, they can be translated quickly and easily to and from base two, and are often easier for humans to work with than base two. Note that for base sixteen, we need 16 characters. We use the digits 0 through 9 plus the letters A through F. Table 1.2 shows the equivalents for all numbers between 0 and 20 in base two through base ten, and base sixteen.

Before learning assembly language it is essential to know how to convert from any base to any other base. Since we are already comfortable working in base ten, we will use that as an intermediary when converting between two arbitrary bases. For instance, if we want to convert a number in base three to base five, we will do it by first converting the base three number to base ten, then from base ten to base five. By using this two-stage process, we will only need to learn to convert between base ten and any arbitrary base b.

Base b to decimal

Converting from an arbitrary base b to base ten simply involves multiplying each base b digit d by bⁿ, where n is the significance of digit d, and summing all of the results. For example, converting the base five number 3421₅ to base ten is performed as follows:

3421₅ = 3×5³ + 4×5² + 2×5¹ + 1×5⁰
      = 375₁₀ + 100₁₀ + 10₁₀ + 1₁₀
      = 486₁₀

This conversion procedure works for converting any integer from any arbitrary base b to its equivalent representation in base ten. Example 1.1 gives another specific example of how to convert from base b to base ten.

Example 1.1

Converting From an Arbitrary Base to Base Ten

Converting 7362₈ to base ten is accomplished by expanding and summing the terms:

7362₈ = 7×8³ + 3×8² + 6×8¹ + 2×8⁰
      = 7×512 + 3×64 + 6×8 + 2×1
      = 3584 + 192 + 48 + 2
      = 3826₁₀

Decimal to base b

Converting from base ten to an arbitrary base b involves repeated division by the base, b. After each division, the remainder is used as the next more significant digit in the base b number, and the quotient is used as the dividend for the next iteration. The process is repeated until the quotient is zero. For example, converting 5610 to base four is accomplished as follows:

56 ÷ 4 = 14 remainder 0
14 ÷ 4 = 3 remainder 2
3 ÷ 4 = 0 remainder 3

Reading the remainders from last to first yields: 320₄. This result can be double-checked by converting it back to base ten as follows:

320₄ = 3×4² + 2×4¹ + 0×4⁰ = 48 + 8 + 0 = 56₁₀.

Since we arrived at the same number we started with, we have verified that 56₁₀ = 320₄. This conversion procedure works for converting any integer from base ten to any arbitrary base b. Example 1.2 gives another example of converting from base ten to another base b.

Example 1.2

Converting from Base Ten to an Arbitrary Base

Converting 8341₁₀ to base seven is accomplished as follows:

8341 ÷ 7 = 1191 remainder 4
1191 ÷ 7 = 170 remainder 1
170 ÷ 7 = 24 remainder 2
24 ÷ 7 = 3 remainder 3
3 ÷ 7 = 0 remainder 3

Reading the remainders from last to first yields: 33214₇.
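The repeated-division procedure translates directly into C. The helper name decimal_to_base_b is our own; it collects the remainders least significant digit first, then reverses them into the output string:

```c
/* Convert a non-negative value to a digit string in base b (2..10)
   by repeated division. Hypothetical helper, not from the book. */
void decimal_to_base_b(long value, int b, char *out)
{
    char tmp[64];
    int n = 0;
    do {
        tmp[n++] = (char)('0' + value % b);  /* remainder is the next digit */
        value /= b;                          /* quotient feeds the next step */
    } while (value != 0);
    for (int i = 0; i < n; i++)              /* reverse: most significant first */
        out[i] = tmp[n - 1 - i];
    out[n] = '\0';
}
```

Calling decimal_to_base_b(56, 4, buf) stores "320" in buf, matching the worked example above.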

Conversion between arbitrary bases

Although it is possible to perform the division and multiplication steps in any base, most people are much better at working in base ten. For that reason, the easiest way to convert from any base a to any other base b is to use a two-step process. The first step is to convert from base a to decimal. The second step is to convert from decimal to base b. Example 1.3 shows how to convert from any base to any other base.

Example 1.3

Converting from an Arbitrary Base to Another Arbitrary Base

Converting 84834₉ to base 11 is accomplished with two steps. The number is first converted to base ten as follows:

84834₉ = 8×9⁴ + 4×9³ + 8×9² + 3×9¹ + 4×9⁰ = 8×6561 + 4×729 + 8×81 + 3×9 + 4×1 = 52488 + 2916 + 648 + 27 + 4 = 56083₁₀

Then the result is converted to base 11:

56083 ÷ 11 = 5098 remainder 5
5098 ÷ 11 = 463 remainder 5
463 ÷ 11 = 42 remainder 1
42 ÷ 11 = 3 remainder 9
3 ÷ 11 = 0 remainder 3

Reading the remainders from last to first yields: 39155₁₁.

Bases that are powers of two

In addition to the methods above, there is a simple method for quickly converting between base two, base eight, and base sixteen. These shortcuts rely on the fact that 8 and 16 are both powers of two. Because of this, each hexadecimal digit corresponds to exactly four binary digits (bits), and each octal digit corresponds to exactly three bits. This relationship makes it possible to do very fast conversions using the tables shown in Fig. 1.3.

Figure 1.3 Tables used for converting between binary, octal, and hex.

When converting from hexadecimal to binary, all that is necessary is to replace each hex digit with the corresponding binary digits from the table. For example, to convert 5AC4₁₆ to binary, we just replace “5” with “0101,” replace “A” with “1010,” replace “C” with “1100,” and replace “4” with “0100.” So, just by referring to the table, we can immediately see that 5AC4₁₆ = 0101101011000100₂. This method works exactly the same for converting from octal to binary, except that it uses the table on the right side of Fig. 1.3.

Converting from binary to hexadecimal is also very easy using the table. Given a binary number, n, take the four least significant digits of n and find them in the table on the left side of Fig. 1.3. The hexadecimal digit on the matching line of the table is the least significant hex digit. Repeat the process with the next set of four bits and continue until there are no bits remaining in the binary number. For example, to convert 0011100101010111₂ to hexadecimal, just divide the number into groups of four bits, starting on the right, to get: 0011|1001|0101|0111₂. Now replace each group of four bits by looking up the corresponding hex digit in the table on the left side of Fig. 1.3, to convert the binary number to 3957₁₆. In the case where the binary number does not have enough bits, simply pad with zeros in the high-order bits. For example, dividing the number 1001100010011₂ into groups of four yields 1|0011|0001|0011₂ and padding with zeros in the high-order bits results in 0001|0011|0001|0011₂. Looking up the four groups in the table reveals that 0001|0011|0001|0011₂ = 1313₁₆.
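The nibble-at-a-time shortcut can be sketched in C, with a small lookup string standing in for the printed table (to_hex is our own helper name, not from the book):

```c
/* Convert an unsigned value to a hexadecimal string by peeling off
   four bits at a time, mirroring the table-lookup shortcut above.
   Hypothetical helper name. */
void to_hex(unsigned v, char *out)
{
    const char digit[] = "0123456789ABCDEF";
    char tmp[16];
    int n = 0;
    do {
        tmp[n++] = digit[v & 0xF];  /* low four bits select one hex digit */
        v >>= 4;                    /* move on to the next group of four */
    } while (v != 0);
    for (int i = 0; i < n; i++)     /* reverse: most significant digit first */
        out[i] = tmp[n - 1 - i];
    out[n] = '\0';
}
```

For example, 0011100101010111₂ is 14679 in decimal, and to_hex(14679, buf) stores "3957" in buf.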

1.3.3 Representing Integers

The computer stores groups of bits, but the bits by themselves have no meaning. The programmer gives them meaning by deciding what the bits represent, and how they are interpreted. Interpreting a group of bits as unsigned integer data is relatively simple. Each bit is weighted by a power of two, and the value of the group of bits is the sum of the non-zero bits multiplied by their respective weights. However, programmers often need to represent negative as well as non-negative numbers, and there are many possibilities for storing and interpreting integers whose value can be both positive and negative. Programmers and hardware designers have developed several standard schemes for encoding such numbers. The three main methods for storing and interpreting signed integer data are two’s complement, sign-magnitude, and excess-N. Fig. 1.4 shows how the same binary pattern of bits can be interpreted as a number in four different ways.

Figure 1.4 Four different representations for binary integers.

Sign-magnitude representation

The sign-magnitude representation simply reserves the most significant bit to represent the sign of the number, and the remaining bits are used to store the magnitude of the number. This method has the advantage that it is easy for humans to interpret, with a little practice. However, addition and subtraction are slightly complicated. The addition/subtraction logic must compare the sign bits, complement one of the inputs if they are different, implement an end-around carry, and complement the result if there was no carry from the most significant bit. Complements are explained in Section 1.3.3. Because of this complexity, most CPUs do not directly support addition and subtraction of integers in sign-magnitude form. However, this method is commonly used for the mantissa in floating-point numbers, as will be explained in Chapter 8. Another drawback to sign-magnitude is that it has two representations for zero, which can cause problems if the programmer is not careful.

Excess-(2ⁿ⁻¹ − 1) representation

Another method for representing both positive and negative numbers is to use an excess-N representation. With this representation, the number that is stored is N greater than the actual value. This representation is relatively easy for humans to interpret. Addition and subtraction are easily performed using the complement method, which is explained in Section 1.3.3. This representation is the same as unsigned math, with the addition of a bias, which is usually 2ⁿ⁻¹ − 1. So, zero is represented as zero plus the bias. In n = 12 bits, the bias is 2¹¹ − 1 = 2047₁₀, or 011111111111₂. This method is commonly used to store the exponent in floating-point numbers, as will be explained in Chapter 8.
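A minimal C sketch of excess-N for n = 12 bits follows; the function names are our own. The stored pattern is simply the value plus the bias, treated as an unsigned quantity:

```c
/* Excess-N encode/decode for n = 12 bits, so the bias is
   2^(12-1) - 1 = 2047. Hypothetical helper names. */
enum { BIAS = 2047 };

unsigned excess_encode(int value)   { return (unsigned)(value + BIAS); }
int      excess_decode(unsigned s)  { return (int)s - BIAS; }
```

With this scheme, zero encodes to the bias itself (2047₁₀ = 011111111111₂), and larger stored patterns represent larger values, which is one reason the method suits floating-point exponents.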

Complement representation

A very efficient method for dealing with signed numbers involves representing negative numbers as the radix complements of their positive counterparts. The complement is the amount that must be added to something to make it “whole.” For instance, in geometry, two angles are complementary if they add to 90°. In radix mathematics, the complement of a digit x in base b is simply bx. For example, in base ten, the complement of 4 is 10 − 4 = 6.

In complement representation, the most significant digit of a number is reserved to indicate whether or not the number is negative. If the first digit is less than b/2 (where b is the radix), then the number is positive. If the first digit is greater than or equal to b/2, then the number is negative. The first digit is not part of the magnitude of the number, but only indicates the sign of the number. For example, numbers in ten’s complement notation are positive if the first digit is less than 5, and negative if the first digit is greater than 4. This works especially well in binary, since the number is considered positive if the first bit is zero and negative if the first bit is one. The magnitude of a negative number can be obtained by taking the radix complement. Because of the nice properties of the complement representation, it is the most common method for representing signed numbers in digital computers.

Finding the complement: The radix complement of an n digit number y in radix (base) b is defined as

C(y) = bⁿ − y.  (1.1)

For example, the ten’s complement of the four-digit number 8734₁₀ is 10⁴ − 8734 = 1266. In this example, we directly applied the definition of the radix complement from Eq. (1.1). That is easy in base ten, but not so easy in an arbitrary base, because it involves performing a subtraction. However, there is a very simple method for calculating the complement which does not require subtraction. This method finds the diminished radix complement, which is (bⁿ − 1) − y, by substituting each digit with its complement from a complement table. The radix complement is then found by adding one to the diminished radix complement. Fig. 1.5 shows the complement tables for bases ten and two. Examples 1.4 and 1.5 show how the complement is obtained in bases ten and two, respectively. Examples 1.6 and 1.7 show additional conversions between binary and decimal.

Figure 1.5 Complement tables for bases ten and two.

Example 1.4

The Complement in Base Ten

The nine’s complement of the base ten number 593 is found by finding the digit ‘5’ in the complement table, and replacing it with its complement, which is the digit ‘4.’ The digit ‘9’ is replaced with ‘0,’ and ‘3’ is replaced with ‘6.’ Therefore the nine’s complement of 593₁₀ is 406₁₀. Likewise, the nine’s complement of 1000₁₀ is 8999₁₀ and the nine’s complement of 0999₁₀ is 9000₁₀.

The ten’s complement of 726₁₀ is 273₁₀ + 1 = 274₁₀.

Example 1.5

The One’s and Two’s Complement

The one’s complement of a binary number is found in the same way as the nine’s complement of a decimal number, but using the one’s complement table instead of the nine’s complement table. The one’s complement of 01001101₂ is 10110010₂ and the one’s complement of 0000000010110110₂ is 1111111101001001₂. Note that the one’s complement of a base two number is equivalent to the bitwise logical “not” (Boolean complement) operator. This operator is very easy to implement in digital hardware.

The two’s complement is the one’s complement plus one. The two’s complement of 1010100₂ is 0101011₂ + 1 = 0101100₂.
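In C, the one's complement is simply the bitwise NOT operator, and the two's complement adds one to that result. A sketch, with function names of our own choosing, using uint8_t to fix the width at exactly eight bits:

```c
#include <stdint.h>

/* Hypothetical helper names, not from the book. The cast back to
   uint8_t discards any carry out of the eighth bit. */
uint8_t ones_complement(uint8_t x) { return (uint8_t)~x; }
uint8_t twos_complement(uint8_t x) { return (uint8_t)(~x + 1); }
```

Taking the two's complement twice returns the original value, which is why the same operation both negates a number and recovers the magnitude of a negative one.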

Example 1.6

Conversion from Binary to Decimal

Suppose we want to convert a signed binary number to decimal.

1. If the most significant bit is ‘1’, then

a. Find the two’s complement

b. Convert the result to base 10

c. Add a negative sign

2. else

a. Convert the number to base 10

Number            One’s Complement  Two’s Complement  Base 10  Negative
11010010          00101101          00101110          46       −46
1111111100010110  0000000011101001  0000000011101010  234      −234
01110100          Not negative                        116
1000001101010110  0111110010101001  0111110010101010  31914    −31914
0101001111011011  Not negative                        21467

Example 1.7

Conversion from Decimal to Binary

Suppose we want to convert a negative number from decimal to binary.

1. Remove the negative sign

2. Convert the number to binary

3. Take the two’s complement

Base 10  Positive Binary   One’s Complement  Two’s Complement
−46      00101110          11010001          11010010
−234     0000000011101010  1111111100010101  1111111100010110
−116     01110100          10001011          10001100
−31914   0111110010101010  1000001101010101  1000001101010110
−21467   0101001111011011  1010110000100100  1010110000100101

Subtraction using complements One very useful feature of complement notation is that it can be used to perform subtraction by using addition. Given two n digit numbers, x and y, in base b, the difference can be computed as:

z = x − y  (1.2)

  = x + (bⁿ − y) − bⁿ  (1.3)

  = x + C(y) − bⁿ,  (1.4)

where C(y) is the radix complement of y. Assume that x and y are both positive, that y ≤ x, and that both numbers have the same number of digits n (y may have leading zeros). In this case, the result of x + C(y) will always be greater than or equal to bⁿ, but less than 2×bⁿ. This means that the result of x + C(y) will always have a ‘1’ in the digit position with significance n. Dropping that leading ‘1’ is equivalent to subtracting bⁿ, making the result x − y + bⁿ − bⁿ, or just x − y, which is the desired result. This can be reduced to a simple procedure. When x and y are both positive and y ≤ x, the following four steps are performed:

1. pad the subtrahend (y) with leading zeros, as necessary, so that both numbers have the same number of digits (n),

2. find the b’s complement of the subtrahend,

3. add the complement to the minuend,

4. discard the leading ‘1’.
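The four steps can be demonstrated in C with ten's complements. Note that this sketch computes the complement with a subtraction for brevity, whereas hardware would use a digit-complement table plus one; the function name is our own:

```c
/* Subtract y from x (both non-negative, y <= x) using only addition
   of the ten's complement, following the four steps above. n is the
   number of decimal digits to work in. Hypothetical helper name. */
long subtract_by_complement(long x, long y, int n)
{
    long bn = 1;
    for (int i = 0; i < n; i++)  /* compute b^n = 10^n */
        bn *= 10;
    long complement = bn - y;    /* step 2: ten's complement of subtrahend */
    long sum = x + complement;   /* step 3: add complement to minuend */
    return sum - bn;             /* step 4: discard the leading '1' */
}
```

For example, 900 − 726 in four steps: the ten's complement of 726 (in three digits) is 274; 900 + 274 = 1174; dropping the leading 1 gives 174.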

The complement notation provides a very easy way to represent both positive and negative integers using a fixed number of digits, and to perform subtraction by using addition. Since modern computers typically use a fixed number of bits, complement notation provides a very convenient and efficient way to store signed integers and perform mathematical operations on them. Hardware is simplified because there is no need to build a specialized subtractor circuit. Instead, a very simple complement circuit is built and the adder is reused to perform subtraction as well as addition.

1.3.4 Representing Characters

In the previous section, we discussed how the computer stores information as groups of bits, and how we can interpret those bits as numbers in base two. Given that the computer can only store information using groups of bits, how can we store textual information? The answer is that we create a table, which assigns a numerical value to each character in our language.

Early in the development of computers, several computer manufacturers developed such tables, or character coding schemes. These schemes were incompatible and computers from different manufacturers could not easily exchange textual data without the use of translation software to convert the character codes from one coding scheme to another.

Eventually, a standard coding scheme, known as the American Standard Code for Information Interchange (ASCII), was developed. Work on the ASCII standard began on October 6, 1960, with the first meeting of the American Standards Association’s (ASA) X3.2 subcommittee. The first edition of the standard was published in 1963. The standard was updated in 1967 and again in 1986, and has been stable since then. Within a few years of its development, ASCII was accepted by all major computer manufacturers, although some continued to support their own coding schemes as well.

ASCII was designed for American English, and does not support some of the characters that are used by non-English languages. For this reason, ASCII has been extended to create more comprehensive coding schemes. Most modern multilingual coding schemes are based on ASCII, though they support a wider range of characters.

At the time that it was developed, transmission of digital data over long distances was very slow, and usually involved converting each bit into an audio signal which was transmitted over a telephone line using an acoustic modem. In order to maximize performance, the standards committee chose to define ASCII as a 7-bit code. Because of this decision, all textual data could be sent using seven bits rather than eight, resulting in approximately 10% better overall performance when transmitting data over a telephone modem. A possibly unforeseen benefit was that this also provided a way for the code to be extended in the future. Since there are 128 possible values for a 7-bit number, the ASCII standard provides 128 characters. However, 33 of the ASCII characters are non-printing control characters. These characters, shown in Table 1.3, are mainly used to send information about how the text is to be displayed and/or printed. The remaining 95 printable characters are shown in Table 1.4.

Table 1.3

The ASCII control characters

Binary Oct Dec Hex Abbr Glyph Name
000 0000 000 0 00 NUL ˆ@ Null character
000 0001 001 1 01 SOH ˆA Start of header
000 0010 002 2 02 STX ˆB Start of text
000 0011 003 3 03 ETX ˆC End of text
000 0100 004 4 04 EOT ˆD End of transmission
000 0101 005 5 05 ENQ ˆE Enquiry
000 0110 006 6 06 ACK ˆF Acknowledgment
000 0111 007 7 07 BEL ˆG Bell
000 1000 010 8 08 BS ˆH Backspace
000 1001 011 9 09 HT ˆI Horizontal tab
000 1010 012 10 0A LF ˆJ Line feed
000 1011 013 11 0B VT ˆK Vertical tab
000 1100 014 12 0C FF ˆL Form feed
000 1101 015 13 0D CR ˆM Carriage return[g]
000 1110 016 14 0E SO ˆN Shift out
000 1111 017 15 0F SI ˆO Shift in
001 0000 020 16 10 DLE ˆP Data link escape
001 0001 021 17 11 DC1 ˆQ Device control 1 (oft. XON)
001 0010 022 18 12 DC2 ˆR Device control 2
001 0011 023 19 13 DC3 ˆS Device control 3 (oft. XOFF)
001 0100 024 20 14 DC4 ˆT Device control 4
001 0101 025 21 15 NAK ˆU Negative acknowledgement
001 0110 026 22 16 SYN ˆV Synchronous idle
001 0111 027 23 17 ETB ˆW End of transmission Block
001 1000 030 24 18 CAN ˆX Cancel
001 1001 031 25 19 EM ˆY End of medium
001 1010 032 26 1A SUB ˆZ Substitute
001 1011 033 27 1B ESC ˆ[ Escape
001 1100 034 28 1C FS ˆ\ File separator
001 1101 035 29 1D GS ˆ] Group separator
001 1110 036 30 1E RS ˆˆ Record separator
001 1111 037 31 1F US ˆ_ Unit separator
111 1111 177 127 7F DEL ˆ? Delete


Table 1.4

The ASCII printable characters

Binary Oct Dec Hex Glyph
010 0000 040 32 20 (space)
010 0001 041 33 21 !
010 0010 042 34 22 "
010 0011 043 35 23 #
010 0100 044 36 24 $
010 0101 045 37 25 %
010 0110 046 38 26 &
010 0111 047 39 27 '
010 1000 050 40 28 (
010 1001 051 41 29 )
010 1010 052 42 2A *
010 1011 053 43 2B +
010 1100 054 44 2C ,
010 1101 055 45 2D -
010 1110 056 46 2E .
010 1111 057 47 2F /
011 0000 060 48 30 0
011 0001 061 49 31 1
011 0010 062 50 32 2
011 0011 063 51 33 3
011 0100 064 52 34 4
011 0101 065 53 35 5
011 0110 066 54 36 6
011 0111 067 55 37 7
011 1000 070 56 38 8
011 1001 071 57 39 9
011 1010 072 58 3A :
011 1011 073 59 3B ;
011 1100 074 60 3C <
011 1101 075 61 3D =
011 1110 076 62 3E >
011 1111 077 63 3F ?
100 0000 100 64 40 @
100 0001 101 65 41 A
100 0010 102 66 42 B
100 0011 103 67 43 C
100 0100 104 68 44 D
100 0101 105 69 45 E
100 0110 106 70 46 F
100 0111 107 71 47 G
100 1000 110 72 48 H
100 1001 111 73 49 I
100 1010 112 74 4A J
100 1011 113 75 4B K
100 1100 114 76 4C L
100 1101 115 77 4D M
100 1110 116 78 4E N
100 1111 117 79 4F O
101 0000 120 80 50 P
101 0001 121 81 51 Q
101 0010 122 82 52 R
101 0011 123 83 53 S
101 0100 124 84 54 T
101 0101 125 85 55 U
101 0110 126 86 56 V
101 0111 127 87 57 W
101 1000 130 88 58 X
101 1001 131 89 59 Y
101 1010 132 90 5A Z
101 1011 133 91 5B [
101 1100 134 92 5C \
101 1101 135 93 5D ]
101 1110 136 94 5E ˆ
101 1111 137 95 5F _
110 0000 140 96 60 `
110 0001 141 97 61 a
110 0010 142 98 62 b
110 0011 143 99 63 c
110 0100 144 100 64 d
110 0101 145 101 65 e
110 0110 146 102 66 f
110 0111 147 103 67 g
110 1000 150 104 68 h
110 1001 151 105 69 i
110 1010 152 106 6A j
110 1011 153 107 6B k
110 1100 154 108 6C l
110 1101 155 109 6D m
110 1110 156 110 6E n
110 1111 157 111 6F o
111 0000 160 112 70 p
111 0001 161 113 71 q
111 0010 162 114 72 r
111 0011 163 115 73 s
111 0100 164 116 74 t
111 0101 165 117 75 u
111 0110 166 118 76 v
111 0111 167 119 77 w
111 1000 170 120 78 x
111 1001 171 121 79 y
111 1010 172 122 7A z
111 1011 173 123 7B {
111 1100 174 124 7C |
111 1101 175 125 7D }
111 1110 176 126 7E ˜


Non-printing characters

The non-printing characters are used to provide hints or commands to the device that is receiving, displaying, or printing the data. The FF character, when sent to a printer, will cause the printer to eject the current page and begin a new one. The LF character causes the printer or terminal to end the current line and begin a new one. The CR character causes the terminal or printer to move to the beginning of the current line. Many text editing programs allow the user to enter these non-printing characters by using the control key on the keyboard. For instance, to enter the BEL character, the user would hold the control key down and press the G key. This character, when sent to a character display terminal, will cause it to emit a beep. Many of the other control characters can be used to control specific features of the printer, display, or other device that the data is being sent to.

Converting character strings to ASCII codes

Suppose we wish to convert a string of characters, such as “Hello World” to an ASCII representation. We can use an 8-bit byte to store each character. Also, it is common practice to include an additional byte at the end of the string. This additional byte holds the ASCII NUL character, which indicates the end of the string. Such an arrangement is referred to as a null-terminated string.

To convert the string “Hello World” into a null-terminated string, we can build a table with each character on the left and its equivalent binary, octal, hexadecimal, or decimal value (as defined in the ASCII table) on the right. Table 1.5 shows the characters in “Hello World” and their equivalent binary representations, found by looking in Table 1.4. Since most modern computers use 8-bit bytes (or multiples thereof) as the basic storage unit, an extra zero bit is shown in the most significant bit position.

Table 1.5

Binary equivalents for each character in “Hello World”

Character Binary
H 01001000
e 01100101
l 01101100
l 01101100
o 01101111
(space) 00100000
W 01010111
o 01101111
r 01110010
l 01101100
d 01100100
NUL 00000000

Reading the Binary column from top to bottom results in the following sequence of bytes: 01001000 01100101 01101100 01101100 01101111 00100000 01010111 01101111 01110010 01101100 01100100 00000000. To convert the same string to a hexadecimal representation, we can use the shortcut method that was introduced previously to convert each 4-bit nibble into its hexadecimal equivalent, or read the hexadecimal value from the ASCII table. Table 1.6 shows the result of extending Table 1.5 to include hexadecimal and decimal equivalents for each character. The string can now be converted to hexadecimal or decimal simply by reading the correct column in the table. So “Hello World” expressed as a null-terminated string in hexadecimal is “48 65 6C 6C 6F 20 57 6F 72 6C 64 00” and in decimal it is “72 101 108 108 111 32 87 111 114 108 100 0”.

Table 1.6

Binary, hexadecimal, and decimal equivalents for each character in “Hello World”

Character  Binary    Hexadecimal  Decimal
H          01001000  48           72
e          01100101  65           101
l          01101100  6C           108
l          01101100  6C           108
o          01101111  6F           111
(space)    00100000  20           32
W          01010111  57           87
o          01101111  6F           111
r          01110010  72           114
l          01101100  6C           108
d          01100100  64           100
NUL        00000000  00           0
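The table-building exercise can be automated with a short C sketch; string_to_hex is our own helper name. It emits every byte of a null-terminated string, including the terminating NUL, as two hex digits:

```c
#include <stdio.h>

/* Write the hex codes of each byte in s, including the final NUL,
   separated by spaces. out must be large enough to hold the result.
   Hypothetical helper name, not from the book. */
void string_to_hex(const char *s, char *out)
{
    size_t i = 0;
    do {
        sprintf(out + 3 * i, "%02X ", (unsigned char)s[i]);
    } while (s[i++] != '\0');   /* stop after printing the NUL byte */
    out[3 * i - 1] = '\0';      /* trim the trailing space */
}
```

Applied to "Hello World", this produces the same sequence of codes that the table lookup gives by hand.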

Interpreting data as ASCII strings

It is sometimes necessary to convert a string of bytes in hexadecimal into ASCII characters. This is accomplished simply by building a table with the hexadecimal value of each byte in the left column, then looking in the ASCII table for each value and entering the equivalent character representation in the right column. Table 1.7 shows how the ASCII table is used to interpret the hexadecimal string “466162756C6F75732100” as an ASCII string.

Table 1.7

Interpreting a hexadecimal string as ASCII

Hexadecimal ASCII
46 F
61 a
62 b
75 u
6C l
6F o
75 u
73 s
21 !
00 NUL

ISO extensions to ASCII

ASCII was developed to encode all of the most commonly used characters in North American English text. The encoding uses only 128 of the 256 codes that are available in an 8-bit byte. ASCII does not include symbols frequently used in other countries, such as the British pound symbol (£) or accented characters (ü). However, the International Organization for Standardization (ISO) has created several extensions to ASCII to enable the representation of characters from a wider variety of languages.

The ISO has defined a set of related standards known collectively as ISO 8859. ISO 8859 is an 8-bit extension to ASCII which includes the 128 ASCII characters along with an additional 128 characters, such as the British Pound symbol and the American cent symbol. Several variations of the ISO 8859 standard exist for different language families. Table 1.8 provides a brief description of the various ISO standards.

Table 1.8

Variations of the ISO 8859 standard

Name Alias Languages
ISO8859-1 Latin-1 Western European languages
ISO8859-2 Latin-2 Non-Cyrillic Central and Eastern European languages
ISO8859-3 Latin-3 Southern European languages and Esperanto
ISO8859-4 Latin-4 Northern European and Baltic languages
ISO8859-5 Latin/Cyrillic Slavic languages that use a Cyrillic alphabet
ISO8859-6 Latin/Arabic Common Arabic language characters
ISO8859-7 Latin/Greek Modern Greek language
ISO8859-8 Latin/Hebrew Modern Hebrew languages
ISO8859-9 Latin-5 Turkish
ISO8859-10 Latin-6 Nordic languages
ISO8859-11 Latin/Thai Thai language
ISO8859-12 Latin/Devanagari Never completed. Abandoned in 1997
ISO8859-13 Latin-7 Some Baltic languages not covered by Latin-4 or Latin-6
ISO8859-14 Latin-8 Celtic languages
ISO8859-15 Latin-9 Update to Latin-1 that replaces some characters; most notably, it includes the euro symbol (€), which did not exist when Latin-1 was created
ISO8859-16 Latin-10 Covers several languages not covered by Latin-9 and includes the euro symbol (€)

Unicode and UTF-8

Although the ISO extensions helped to standardize text encodings for several languages that were not covered by ASCII, there were still some issues. The first issue is that the display and input devices must be configured for the correct encoding, and displaying or printing documents with multiple encodings requires some mechanism for changing the encoding on-the-fly. Another issue has to do with the lexicographical ordering of characters. Although two languages may share a character, that character may appear in a different place in the alphabets of the two languages. This leads to issues when programmers need to sort strings into lexicographical order. The ISO extensions help to unify character encodings across multiple languages, but do not solve all of the issues involved in defining a universal character set.

In the late 1980s, there was growing interest in developing a universal character encoding for all languages. People from several computer companies worked together and, by 1990, had developed a draft standard for Unicode. In 1991, the Unicode Consortium was formed and charged with guiding and controlling the development of Unicode. The Unicode Consortium has worked closely with the ISO to define, extend, and maintain the international standard for a Universal Character Set (UCS). This standard is known as the ISO/IEC 10646 standard. The ISO/IEC 10646 standard defines the mapping of code points (numbers) to glyphs (characters), but does not specify character collation or other language-dependent properties. UCS code points are commonly written in the form U+XXXX, where XXXX is the numerical code point in hexadecimal. For example, the code point for the ASCII DEL character would be written as U+007F. Unicode extends the ISO/IEC standard and specifies language-specific features.

Originally, Unicode was designed as a 16-bit encoding. It was not fully backward-compatible with ASCII, and could encode only 65,536 code points. Eventually, the Unicode character set grew to encompass 1,112,064 code points, which requires 21 bits per character for a straightforward binary encoding. By early 1992, it was clear that some clever and efficient method for encoding character data was needed.

UTF-8 (UCS Transformation Format-8-bit) was proposed and accepted as a standard in 1993. UTF-8 is a variable-width encoding that can represent every character in the Unicode character set using between one and four bytes. It was designed to be backward compatible with ASCII and to avoid the major issues of previous encodings. Code points in the Unicode character set with lower numerical values tend to occur more frequently than code points with higher numerical values. UTF-8 encodes frequently occurring code points with fewer bytes than those which occur less frequently. For example, the first 128 characters of the UTF-8 encoding are exactly the same as the ASCII characters, requiring only 7 bits to encode each ASCII character. Thus any valid ASCII text is also valid UTF-8 text. UTF-8 is now the most common character encoding for the World Wide Web, and is the recommended encoding for email messages.

In November 2003, UTF-8 was restricted by RFC 3629 to end at code point 10FFFF₁₆. This allows UTF-8 to encode 1,114,112 code points (U+0000 through U+10FFFF), slightly more than the 1,112,064 valid code points defined in the ISO/IEC 10646 standard. Table 1.9 shows how ISO/IEC 10646 code points are mapped to a variable-length encoding in UTF-8. Note that the encoding allows each byte in a stream of bytes to be placed in one of the following three distinct categories:

Table 1.9

UTF-8 encoding of the ISO/IEC 10646 code points

UCS Bits  First Code Point  Last Code Point  Bytes  Byte 1    Byte 2    Byte 3    Byte 4
7         U+0000            U+007F           1      0xxxxxxx
11        U+0080            U+07FF           2      110xxxxx  10xxxxxx
16        U+0800            U+FFFF           3      1110xxxx  10xxxxxx  10xxxxxx
21        U+10000           U+10FFFF         4      11110xxx  10xxxxxx  10xxxxxx  10xxxxxx


1. If the most significant bit of a byte is zero, then it is a single-byte character, and is completely ASCII-compatible.

2. If the two most significant bits in a byte are set to one, then the byte is the beginning of a multi-byte character.

3. If the most significant bit is set to one, and the second most significant bit is set to zero, then the byte is part of a multi-byte character, but is not the first byte in that sequence.

The UTF-8 encoding of the UCS characters has several important features:

Backwards compatible with ASCII: This allows the vast number of existing ASCII documents to be interpreted as UTF-8 documents without any conversion.

Self-synchronization: Because of the way code points are assigned, it is possible to find the beginning of each character by looking only at the top two bits of each byte. This can have important performance implications when performing searches in text.

Encoding of code sequence length: The number of bytes in the sequence is indicated by the pattern of bits in the first byte of the sequence. Thus, the beginning of the next character can be found quickly. This feature can also have important performance implications when performing searches in text.

Efficient code structure: UTF-8 efficiently encodes the UCS code points. The high-order bits of the code point go in the lead byte. Lower-order bits are placed in continuation bytes. The number of bytes in the encoding is the minimum required to hold all the significant bits of the code point.

Easily extended to include new languages: This feature will be greatly appreciated when we contact intelligent species from other star systems.

With UTF-8 encoding, the first 128 characters of the UCS are each encoded in a single byte. The next 1,920 characters require two bytes to encode. The two-byte encoding covers almost all Latin alphabets, and also Arabic, Armenian, Cyrillic, Coptic, Greek, Hebrew, Syriac and Tāna alphabets. It also includes combining diacritical marks, which are used in combination with another character, such as á, ñ, and ö. Most of the Chinese, Japanese, and Korean (CJK) characters are encoded using three bytes. Four bytes are needed for the less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).

Consider the UTF-8 encoding for the British Pound symbol (£), which is UCS code point U+00A3. Since the code point is greater than 7F₁₆, but less than 800₁₆, it will require two bytes to encode. The encoding will be 110xxxxx 10xxxxxx, where the x characters are replaced with the 11 least-significant bits of the code point, which are 00010100011. Thus, the character £ is encoded in UTF-8 as 11000010 10100011 in binary, or C2 A3 in hexadecimal.

The UCS code point for the Euro symbol (€) is U+20AC. Since the code point is between 800₁₆ and FFFF₁₆, it will require three bytes to encode in UTF-8. The three-byte encoding is 1110xxxx 10xxxxxx 10xxxxxx, where the x characters are replaced with the 16 least-significant bits of the code point. In this case the code point in binary is 0010000010101100. Therefore, the UTF-8 encoding for € is 11100010 10000010 10101100 in binary, or E2 82 AC in hexadecimal.
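The variable-length scheme of Table 1.9 can be sketched as a C function; utf8_encode is our own name, not from the book. Each case masks off six bits at a time for the continuation bytes and attaches the required prefix bits to the lead byte:

```c
#include <stdint.h>

/* Encode one code point (up to U+10FFFF) as UTF-8, following the
   byte patterns in Table 1.9. Returns the number of bytes written.
   Hypothetical helper name. */
int utf8_encode(uint32_t cp, uint8_t *out)
{
    if (cp < 0x80) {                        /* 7 bits: 0xxxxxxx */
        out[0] = (uint8_t)cp;
        return 1;
    } else if (cp < 0x800) {                /* 11 bits: 110xxxxx 10xxxxxx */
        out[0] = 0xC0 | (uint8_t)(cp >> 6);
        out[1] = 0x80 | (uint8_t)(cp & 0x3F);
        return 2;
    } else if (cp < 0x10000) {              /* 16 bits: 1110xxxx + two 10xxxxxx */
        out[0] = 0xE0 | (uint8_t)(cp >> 12);
        out[1] = 0x80 | (uint8_t)((cp >> 6) & 0x3F);
        out[2] = 0x80 | (uint8_t)(cp & 0x3F);
        return 3;
    } else {                                /* 21 bits: 11110xxx + three 10xxxxxx */
        out[0] = 0xF0 | (uint8_t)(cp >> 18);
        out[1] = 0x80 | (uint8_t)((cp >> 12) & 0x3F);
        out[2] = 0x80 | (uint8_t)((cp >> 6) & 0x3F);
        out[3] = 0x80 | (uint8_t)(cp & 0x3F);
        return 4;
    }
}
```

Running it on U+00A3 and U+20AC reproduces the C2 A3 and E2 82 AC sequences worked out by hand above.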

In summary, there are three components to modern language support. ISO/IEC 10646 defines a mapping from code points (numbers) to glyphs (characters). UTF-8 defines an efficient variable-length encoding for the code points (text data) in the ISO/IEC 10646 standard. Unicode adds language-specific properties to the ISO/IEC 10646 character set. Together, these three elements currently provide support for textual data in almost every human written language, and they continue to be extended and refined.

1.4 Memory Layout of an Executing Program

Computer memory consists of a number of storage locations, or cells, each of which has a unique numeric address. Addresses are usually written in hexadecimal. Each storage location can contain a fixed number of binary digits. The most common size is one byte. Most computers group bytes together into words. A computer CPU that is capable of accessing a single byte of memory is said to have byte addressable memory. Some CPUs are only capable of accessing memory in word-sized groups. They are said to have word addressable memory.

Fig. 1.6A shows a section of memory containing some data. Each byte has a unique address that is used when data is transferred to or from that memory cell. Most processors can also move data in word-sized chunks. On a 32-bit system, four bytes are grouped together to form a word. There are two ways that this grouping can be done. Systems that store the most significant byte of a word in the smallest address, and the least significant byte in the largest address, are said to be big-endian. The big-endian interpretation of a region of memory is shown in Fig. 1.6B. As shown in Fig. 1.6C, little-endian systems store the least significant byte in the lowest address and the most significant byte in the highest address. Some processors, such as the ARM, can be configured as either little-endian or big-endian. The Linux operating system, by default, configures the ARM processor to run in little-endian mode.

Figure 1.6 A section of memory.

The memory layout for a typical program is shown in Fig. 1.7. The program is divided into four major memory regions, or sections. The programmer specifies the contents of the Text and Data sections. The Stack and Heap segments are defined when the program is loaded for execution. The Stack and Heap may grow and shrink as the program executes, while the Text and Data segments are set to fixed sizes by the compiler, linker, and loader. The Text section contains the executable instructions, while the Data section contains constants and statically allocated variables. The sizes of the Text and Data segments depend on how large the program is, and how much static data storage has been declared by the programmer. The heap contains variables that are allocated dynamically, and the stack is used to store parameters for function calls, return addresses, and local (automatic) variables.

Figure 1.7 Typical memory layout for a program with a 32-bit address space.

In a high-level language, storage space for a variable can be allocated in one of three ways: statically, dynamically, or automatically. Statically allocated variables are allocated from the .data section. The storage space is reserved, and usually initialized, when the program is loaded and begins execution. The address of a statically allocated variable is fixed at the time the program begins running, and cannot be changed. Automatically allocated variables, often referred to as local variables, are stored on the stack. The stack pointer is adjusted down to make space for the newly allocated variable. The address of an automatic variable is always computed as an offset from the stack pointer. Dynamic variables are allocated from the heap, using malloc, new, or a language-dependent equivalent. The address of a dynamic variable is always stored in another variable, known as a pointer, which may be an automatic or static variable, or even another dynamic variable. The four major sections of program memory correspond to executable code, statically allocated variables, dynamically allocated variables, and automatically allocated variables.

1.5 Chapter Summary

There are several reasons for Computer Scientists and Computer Engineers to learn at least one assembly language. There are programming tasks that can only be performed using assembly language, and some tasks can be written to run much more efficiently and/or quickly if written in assembly language. Programmers with assembly language experience tend to write better code even when using a high-level language, and are usually better at finding and fixing bugs.

Although it is possible to construct a computer capable of performing arithmetic in any base, it is much cheaper to build one that works in base two. It is relatively easy to build an electrical circuit with two states, using two discrete voltage levels, but much more difficult to build a stable circuit with 10 discrete voltage levels. Therefore, modern computers work in base two.

Computer data can be viewed as simple bit strings. The programmer is responsible for supplying interpretations to give meaning to those bit strings. A set of bits can be interpreted as a number, a character, or anything that the programmer chooses. There are standard methods for encoding and interpreting characters and numbers. Fig. 1.4 shows some common methods for encoding integers. The most common encodings for characters are UTF-8 and ASCII.

Computer memory can be viewed as a sequence of bytes. Each byte has a unique address. A running program has four regions of memory. One region holds the executable code. The other three regions hold different types of variables.

Exercises

1.1 What is the two’s complement of 11011101?

1.2 Perform the base conversions to fill in the blank spaces in the following table:

Base 10 | Base 2 | Base 16 | Base 21
23      |        |         |
        | 010011 |         |
        |        | ABB     |
        |        |         | 2HE

1.3 What is the 8-bit ASCII binary representation for the following characters?

(a) “A”

(b) “a”

(c) “!”

1.4 What is \ minus ! given that \ and ! are ASCII characters? Give your answer in binary.

1.5 Representing characters:

(a) Convert the string “Super!” to its ASCII representation. Show your result as a sequence of hexadecimal values.

(b) Convert the hexadecimal sequence into a sequence of values in base four.

1.6 Suppose that the string “This is a nice day” is stored beginning at address 4B3269AC₁₆. What are the contents of the byte at address 4B3269B1₁₆ in hexadecimal?

1.7 Perform the following:

(a) Convert 101101₂ to base ten.

(b) Convert 1023₁₀ to base nine.

(c) Convert 1023₁₀ to base two.

(d) Convert 301₁₀ to base 16.

(e) Convert 301₁₀ to base 2.

(f) Represent 301₁₀ as a null-terminated ASCII string (write your answer in hexadecimal).

(g) Convert 3420₅ to base ten.

(h) Convert 2314₅ to base nine.

(i) Convert 116₇ to base three.

(j) Convert 1294₁₁ to base 5.

1.8 Given the following binary string:
01001001 01110011 01101110 00100111 01110100 00100000 01000001 01110011 01110011 01100101 01101101 01100010 01101100 01111001 00100000 01000110 01110101 01101110 00111111 00000000

(a) Convert it to a hexadecimal string.

(b) Convert the first four bytes to a string of base ten numbers.

(c) Convert the first (little-endian) halfword to base ten.

(d) Convert the first (big-endian) halfword to base ten.

(e) If this string of bytes were sent to an ASCII printer or terminal, what would be printed?

1.9 The number 1,234,567 is stored as a 32-bit word starting at address F0439000₁₆. Show the address and contents of each byte of the 32-bit word on a

(a) little-endian system,

(b) big-endian system.

1.10 The ISO/IEC 10646 standard defines 1,112,064 code points (glyphs). Each code point could be encoded using 24 bits, or three bytes. The UTF-8 encoding uses up to four bytes to encode a code point. Give three reasons why UTF-8 is preferred over a simple 3-byte per code point encoding.

1.11 UTF-8 is often referred to as Unicode. Why is this not correct?

1.12 Skilled assembly programmers can convert small numbers between binary, hexadecimal, and decimal in their heads. Without referring to any tables or using a calculator or pencil, fill in the blanks in the following table:

Binary  | Decimal | Hexadecimal
        | 5       |
1010    |         |
        |         | C
        | 23      |
0101101 |         |
        |         | 4B

1.13 What are the differences between a CPU register and a memory location?

Chapter 2

GNU Assembly Syntax

Abstract

This chapter begins with a high-level description of assembly language and the assembler. It then explains the five elements of assembly language syntax, and gives some examples. It then goes into more depth about how the assembler converts assembly language files into object files, which are then linked with other object files to create an executable file. Finally, it explains the most commonly used directives for the GNU assembler, and gives some examples to help relate the assembly code to equivalent C code.

Keywords

Compiler; Assembler; Linker; Labels; Comments; Directives; Instructions; Sections; Symbols

All modern computers consist of three main components: the central processing unit (CPU), memory, and devices. It can be argued that the major factor that distinguishes one computer from another is the CPU architecture. The architecture determines the set of instructions that can be performed by the CPU. The human-readable language which is closest to the CPU architecture is assembly language.

When a new processor architecture is developed, its creators also define an assembly language for the new architecture. In most cases, a precise assembly language syntax is defined and an assembler is created by the processor developers. Because of this, there is no single syntax for assembly language, although most assembly languages are similar in many ways and have certain elements in common.

The GNU assembler (GAS) is a highly portable re-configurable assembler. GAS uses a simple, general syntax that works for a wide variety of architectures. Although the syntax used by GAS for the ARM processor is slightly different from the syntax defined by the developers of the ARM processor, it provides the same capabilities.

2.1 Structure of an Assembly Program

An assembly program consists of four basic elements: assembler directives, labels, assembly instructions, and comments. Assembler directives allow the programmer to reserve memory for the storage of variables, control which program section is being used, define macros, include other files, and perform other operations that control the conversion of assembly instructions into machine code. The assembly instructions are given as mnemonics, or short character strings that are easier for human brains to remember than sequences of binary, octal, or hexadecimal digits. Each assembly instruction may have an optional label, and most assembly instructions require the programmer to specify one or more operands.

Most assembly language programs are written in lines of 80 characters organized into four columns. The first column is for optional labels. The second column is for assembly instructions or assembler directives. The third column is for specifying operands, and the fourth column is for comments. Traditionally, the first two columns are 8 characters wide, the third column is 16 characters wide, and the last column is 48 characters wide. However, most modern assemblers (including GAS) do not require fixed column widths. Listing 2.1 shows a basic “Hello World” program written in GNU ARM Assembly to run under Linux. For comparison, Listing 2.2 shows an equivalent program written in C. The assembly language version of the program is significantly longer than the C version, and will only work on an ARM processor. The C version is at a higher level of abstraction, and can be compiled to run on any system that has a C compiler. Thus, C is referred to as a high-level language, and assembly is a low-level language.

Listing 2.1 "Hello World" program in ARM assembly.
Listing 2.2 "Hello World" program in C.

2.1.1 Labels

Most modern assemblers are called two-pass assemblers because they read the input file twice. On the first pass, the assembler keeps track of the location of each piece of data and each instruction, and assigns an address or numerical value to each label and symbol in the input file. The main goal of the first pass is to build a symbol table, which maps each label or symbol to a numerical value.

On the second pass, the assembler converts the assembly instructions and data declarations into binary, using the symbol table to supply numerical values whenever they are needed. In Listing 2.1, there are two labels: main and str. During assembly, those labels are assigned the value of the address counter at the point where they appear. Labels can be used anywhere in the program to refer to the address of data, functions, or blocks of code. In GNU assembly syntax, labels always end with a colon (:) character.

2.1.2 Comments

There are two basic comment styles: multi-line and single-line. Multi-line comments start with /* and everything is ignored until a matching sequence of */ is found. These comments are exactly the same as multi-line comments in C and C++. In ARM assembly, single line comments begin with an @ character, and all remaining characters on the line are ignored. Listing 2.1 shows both types of comment. In addition, if the name of the file ends in .S, then single line comments can begin with //. If the file name does not end with a capital .S, then the // syntax is not allowed.

2.1.3 Directives

Directives are used mainly to define symbols, allocate storage, and control the behavior of the assembler, allowing the programmer to control how the assembler does its job. The GNU assembler has many directives, but assembly programmers typically need to know only a few of them. All assembler directives begin with a period “.” which is followed by a sequence of letters, usually in lower case. Listing 2.1 uses the .data, .asciz, .text, and .globl directives. The most commonly used directives are discussed later in this chapter. There are many other directives available in the GNU Assembler which are not covered here. Complete documentation is available online as part of the GNU Binutils package.

2.1.4 Assembly Instructions

Assembly instructions are the program statements that will be executed on the CPU. Most instructions cause the CPU to perform one low-level operation. In most assembly languages, operations can be divided into a few major types. Some instructions move data from one location to another. Others perform addition, subtraction, and other computational operations. Another class of instructions is used to perform comparisons and control which part of the program is to be executed next. Chapters 3 and 4 explain most of the assembly instructions that are available on the ARM processor.

2.2 What the Assembler Does

Listing 2.3 shows how the GNU assembler will assemble the “Hello World” program from Listing 2.1. The assembler converts the string on input line 2 into the binary representation of the string. The results are shown in hexadecimal in the Code column of the listing. The first byte of the string is stored at address zero in the .data section of the program, as shown by the 0000 in the Addr column on line 2.

Listing 2.3 "Hello World" assembly listing.

On line 4, the assembler switches to the .text section of the program and begins converting instructions into binary. The first instruction, on line 9, is converted into its 4-byte machine code, 00402DE9₁₆, and stored at location 0000 in the .text section of the program, as shown in the Code and Addr columns on line 6.

Next, the assembler converts the ldr instruction on line 10 into the four-byte machine instruction 0C009FE5₁₆ and stores it at address 0004. It repeats this process with each remaining instruction until the end of the program. The assembler writes the resulting data into a specially formatted file, called an object file. Note that the assembler was unable to locate the printf function. The linker will take care of that. The object file created by the assembler, hello.o, contains the data in the Code column of Listing 2.3, along with information to help the linker to link (or “patch”) the instruction on line 11 so that printf is called correctly.

After creating the object file, the next step in creating an executable program would be to invoke the linker and request that it link hello.o with the C standard library. The linker will generate the final executable file, containing the code assembled from hello.S, along with the printf function and other start-up code from the C standard library. The GNU C compiler is capable of automatically invoking the assembler for files that end in .s or .S, and can also be used to invoke the linker. For example, if Listing 2.1 is stored in a file named hello.S in the current directory, then the command

gcc -o hello hello.S

will run the GNU C compiler, telling it to create an executable program file named hello, and to use hello.S as the source file for the program. The C compiler will notice the .S extension, and invoke the assembler to create an object file which is stored in a temporary file, possibly named hello.o. Then the C compiler will invoke the linker to link hello.o with the C standard library, which provides the printf function and some start-up code which calls the main function. The linker will create an executable file named hello. When the linker has finished, the C compiler will remove the temporary object file.

2.3 GNU Assembly Directives

Each processor architecture has its own assembly language, created by the designers of the architecture. Although there are many similarities between assembly languages, the designers may choose different names for various directives. The GNU assembler supports a relatively large set of directives, some of which have more than one name. This is because it is designed to handle assembling code for many different processors without drastically changing the assembly language designed by the processor manufacturers. We will now cover some of the most commonly used directives for the GNU assembler.

2.3.1 Selecting the Current Section

The instructions and data that make up a program are stored in different sections of the program file. There are several standard sections that the programmer can choose to put code and data in. Sections can also be further divided into numbered subsections. Each section has its own address counter, which is used to keep track of the location of bytes within that section. When a label is encountered, it is assigned the value of the current address counter for the currently active section.

Selecting a section and subsection is done by using the appropriate assembly directive. Once a section has been selected, all of the instructions and/or data will go into that section until another section is selected. The most important directives for selecting a section are:

.data subsection

Instructs the assembler to append the following instructions or data to the data subsection numbered subsection. If the subsection number is omitted, it defaults to zero. This section is normally used for global variables and constants which have labels.

.text subsection

Tells the assembler to append the following statements to the end of the text subsection numbered subsection. If the subsection number is omitted, subsection number zero is used. This section is normally used for executable instructions, but may also contain constant data.

.bss subsection

The bss (short for Block Started by Symbol) section is used for defining data storage areas that should be initialized to zero at the beginning of program execution. The .bss directive tells the assembler to append the following statements to the end of the bss subsection numbered subsection. If the subsection number is omitted, subsection number zero is used. This section is normally used for global variables which need to be initialized to zero. Regardless of what is placed into the section at compile-time, all bytes will be set to zero when the program begins executing. This section does not actually consume any space in the object or executable file. It is really just a request for the loader to reserve some space when the program is loaded into memory.

.section name

In addition to the three common sections, the programmer can create other sections using this directive. However, in order for custom sections to be linked into a program, the linker must be made aware of them. Controlling the linker is covered in Section 14.4.3.

2.3.2 Allocating Space for Variables and Constants

There are several directives that allow the programmer to allocate and initialize static storage space for variables and constants. The assembler supports bytes, integer types, floating point types, and strings. These directives are used to allocate a fixed amount of space in memory and optionally initialize the memory. Some of these directives allow the memory to be initialized using an expression. An expression can be a simple integer, or a C-style expression. The directives for allocating storage are as follows:

.byte expressions

.byte expects zero or more expressions, separated by commas. Each expression is assembled into the next byte. If no expressions are given, then the address counter is not advanced and no bytes are reserved.

.hword expressions
.short expressions

For the ARM processor, these two directives do exactly the same thing. They expect zero or more expressions, separated by commas, and emit a 16-bit number for each expression. If no expressions are given, then the address counter is not advanced and no bytes are reserved.

.word expressions
.long expressions

For the ARM processor, these two directives do exactly the same thing. They expect zero or more expressions, separated by commas. They will emit four bytes for each expression given. If no expressions are given, then the address counter is not advanced and no bytes are reserved.

.ascii "string"

The .ascii directive expects zero or more string literals, each enclosed in quotation marks and separated by commas. It assembles each string (with no trailing ASCII NULL character) into consecutive addresses.

.asciz "string"
.string "string"

The .asciz directive is similar to the .ascii directive, but each string is followed by an ASCII NULL character (zero). The “z” in .asciz stands for zero. .string is just another name for .asciz.

.float flonums
.single flonums

This directive assembles zero or more floating point numbers, separated by commas. On the ARM, they are 4-byte IEEE standard single precision numbers. .float and .single are synonymous.

.double flonums

The .double directive expects zero or more floating point numbers, separated by commas. On the ARM, they are stored as 8-byte IEEE standard double precision numbers.

Fig. 2.1A shows how these directives are used to declare variables and constants. Fig. 2.1B shows the equivalent statements for creating global variables in C or C++. Note that in both cases, the variables created will be visible anywhere within the file that they are declared, but not visible in other files which are linked into the program.

Figure 2.1 Equivalent static variable declarations in assembly and C.

In C, the declaration of an array can be performed by leaving out the number of elements and specifying an initializer, as shown in the last three lines of Fig. 2.1B. In assembly, the equivalent is accomplished by providing a label, a type, and a list of values, as shown in the last three lines of Fig. 2.1A. The syntax is different, but the result is precisely the same.

Listing 2.4 shows how the assembler assigns addresses to these labels. The second column of the listing shows the address (in hexadecimal) that is assigned to each label. The variable i is assigned the first address. Since it is a word variable, the address counter is incremented by four bytes and the next address is assigned to the variable j. The address counter is incremented again, and fmt is assigned the address 0008. The fmt variable consumes seven bytes, so the ch variable gets address 000f. Finally, the array of words named ary begins at address 0012. Note that 12₁₆ = 18₁₀ is not evenly divisible by four, which means that the word variables in ary are not aligned on word boundaries.

Listing 2.4 A listing with mis-aligned data.

2.3.3 Filling and Aligning

On the ARM CPU, data can be moved to and from memory one byte at a time, two bytes at a time (half-word), or four bytes at a time (word). Moving a word between the CPU and memory takes significantly more time if the address of the word is not aligned on a four-byte boundary (one where the least significant two bits are zero). Similarly, moving a half-word between the CPU and memory takes significantly more time if the address of the half-word is not aligned on a two-byte boundary (one where the least significant bit is zero). Therefore, when declaring storage, it is important that words and half-words are stored on appropriate boundaries. The following directives allow the programmer to insert as much space as necessary to align the next item on any boundary desired.

.align abs-expr, abs-expr, abs-expr

Pad the location counter (in the current subsection) to a particular storage boundary. For the ARM processor, the first expression specifies the number of low-order zero bits the location counter must have after advancement. The second expression gives the fill value to be stored in the padding bytes. It (and the comma) may be omitted. If it is omitted, then the fill value is assumed to be zero. The third expression is also optional. If it is present, it is the maximum number of bytes that should be skipped by this alignment directive.

.balign[wl] abs-expr, abs-expr, abs-expr

These directives adjust the location counter to a particular storage boundary. The first expression is the byte-multiple for the alignment request. For example, .balign 16 will insert fill bytes until the location counter is an even multiple of 16. If the location counter is already a multiple of 16, then no fill bytes will be created. The second expression gives the fill value to be stored in the fill bytes. It (and the comma) may be omitted. If it is omitted, then the fill value is assumed to be zero. The third expression is also optional. If it is present, it is the maximum number of bytes that should be skipped by this alignment directive.
The .balignw and .balignl directives are variants of the .balign directive. The .balignw directive treats the fill pattern as a 2-byte word value, and .balignl treats the fill pattern as a 4-byte long word value. For example, “.balignw 4,0x368d” will align to a multiple of four bytes. If it skips two bytes, they will be filled in with the value 0x368d (the exact placement of the bytes depends upon the endianness of the processor).

.skip size, fill
.space size, fill

Sometimes it is desirable to allocate a large area of memory and initialize it all to the same value. This can be accomplished by using these directives. These directives emit size bytes, each of value fill. Both size and fill are absolute expressions. If the comma and fill are omitted, fill is assumed to be zero. For the ARM processor, the .space and .skip directives are equivalent. This directive is very useful for declaring large arrays in the .bss section.

Listing 2.5 shows how the code in Listing 2.4 can be improved by adding an alignment directive at line 6. The directive causes the assembler to emit two zero bytes between the end of the ch variable and the beginning of the ary variable. These extra “padding” bytes cause the following word data to be word aligned, thereby improving performance when the word data is accessed. It is a good practice to always put an alignment directive after declaring character or half-word data.

Listing 2.5 A listing with properly aligned data.

2.3.4 Setting and Manipulating Symbols

The assembler provides support for setting and manipulating symbols that can then be used in other places within the program. The labels that can be assigned to assembly statements and directives are one type of symbol. The programmer can also declare other symbols and use them throughout the program. Such symbols may not have an actual storage location in memory, but they are included in the assembler’s symbol table, and can be used anywhere that their value is required. The most common use for defined symbols is to allow numerical constants to be declared in one place and easily changed. The .equ directive allows the programmer to use a label instead of a number throughout the program. This contributes to readability, and has the benefit that the constant value can then be easily changed every place that it is used, just by changing the definition of the symbol. The most important directives related to symbols are:

.equ symbol, expression
.set symbol, expression

This directive sets the value of symbol to expression. It is similar to the C language #define directive.

.equiv symbol, expression

The .equiv directive is like .equ and .set, except that the assembler will signal an error if the symbol is already defined.

.global symbol
.globl symbol

This directive makes the symbol visible to the linker. If symbol is defined within a file, and this directive is used to make it global, then it will be available to any file that is linked with the one containing the symbol. Without this directive, symbols are visible only within the file where they are defined.

.comm symbol, length

This directive declares symbol to be a common symbol, meaning that if it is defined in more than one file, then all instances should be merged into a single symbol. If the symbol is not defined anywhere, then the linker will allocate length bytes of uninitialized memory. If there are multiple definitions for symbol, and they have different sizes, the linker will merge them into a single instance using the largest size defined.

Listing 2.6 shows how the .equ directive can be used to create a symbol holding the number of elements in an array. The symbol arysize is defined as the value of the current address counter (denoted by the .) minus the value of the ary symbol, divided by four (each word in the array is four bytes). The listing shows all of the symbols defined in this program segment. Note that the four variables are shown to be in the data segment, and the arysize symbol is marked as an “absolute” symbol, which simply means that it is a number and not an address. The programmer can now use the symbol arysize to control looping when accessing the array data. If the size of the array is changed by adding or removing constant values, the value of arysize will change automatically, and the programmer will not have to search through the code to change the original value, 5, to some other value in every place it is used.

Listing 2.6 Defining a symbol for the number of elements in an array.

2.3.5 Conditional Assembly

Sometimes it is desirable to skip assembly of portions of a file. The assembler provides some directives to allow conditional assembly. One use for these directives is to optionally assemble code to aid in debugging.

.if expression

.if marks the beginning of a section of code which is only considered part of the source program being assembled if the argument (which must be an absolute expression) is non-zero. The end of the conditional section of code must be marked by the .endif directive. Optionally, code may be included for the alternative condition by using the .else directive.

.ifdef symbol

Assembles the following section of code if the specified symbol has been defined.

.ifndef symbol

Assembles the following section of code if the specified symbol has not been defined.

.else

Assembles the following section of code only if the condition for the preceding .if or .ifdef was false.

.endif

Marks the end of a block of code that is only assembled conditionally.

2.3.6 Including Other Source Files

.include "file"

This directive provides a way to include supporting files at specified points in the source program. The code from the included file is assembled as if it followed the point of the .include directive. When the end of the included file is reached, assembly of the original file continues. The search paths used can be controlled with the ‘-I’ command line parameter. Quotation marks are required around file. This assembler directive is similar to including header files in C and C++ using the #include compiler directive.

2.3.7 Macros

The directives .macro and .endm allow the programmer to define macros that the assembler expands to generate assembly code. The GNU assembler supports simple macros. Some other assemblers have much more powerful macro capabilities.

.macro macname
.macro macname macargs …

Begin the definition of a macro called macname. If the macro definition requires arguments, their names are specified after the macro name, separated by commas or spaces. The programmer can supply a default value for any macro argument by following the name with ‘=deflt’.

The following begins the definition of a macro called reserve_str, with two arguments. The first argument has a default value, but the second does not:

f02-08-9780128036983
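The figure shows the opening of the definition; based on the surrounding text, it presumably begins with `.macro reserve_str p1=0 p2`. The body below is an illustrative placeholder, not the book's listing:

```asm
        .macro  reserve_str p1=0 p2   @ p1 defaults to 0; p2 must be supplied
        .long   \p1                   @ placeholder body: emit both arguments
        .long   \p2
        .endm
```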

When a macro is called, the argument values can be specified either by position, or by keyword. For example, reserve_str 9,17 is equivalent to reserve_str p2=17,p1=9. After the definition is complete, the macro can be called either as

reserve_str x,y

(with \p1 evaluating to x and \p2 evaluating to y), or as

reserve_str ,y

(with \p1 evaluating as the default, in this case 0, and \p2 evaluating to y). Other examples of valid .macro statements are:

f02-09-9780128036983
f02-10-9780128036983

.endm

End the current macro definition.

.exitm

Exit early from the current macro definition. This is usually used only within a .if or .ifdef directive.

\@

This is a pseudo-variable used by the assembler to maintain a count of how many macros it has executed. That number can be accessed with ‘\@’, but only within a macro definition.

Macro example

The following definition specifies a macro SHIFT that will emit the instruction to shift a given register left by a specified number of bits. If the number of bits specified is negative, then it will emit the instruction to perform a right shift instead of a left shift.

f02-11-9780128036983
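The figure is not reproduced here; a definition consistent with the description (the argument names reg and count are assumptions) might be:

```asm
        .macro  SHIFT reg, count              @ names reg/count are assumed
        .if     \count >= 0
        mov     \reg, \reg, lsl #\count       @ non-negative: shift left
        .else
        mov     \reg, \reg, lsr #(0-\count)   @ negative: shift right instead
        .endif
        .endm
```

With this definition, `SHIFT r0, 4` would expand to `mov r0, r0, lsl #4`, and `SHIFT r1, -2` to a right shift by 2.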

After that definition, the following code:

f02-12-9780128036983

will generate these instructions:

f02-13-9780128036983

The meaning of these instructions will be covered in Chapters 3 and 4.

Recursive macro example

The following definition specifies a macro enum that puts a sequence of numbers into memory by using a recursive macro call to itself:

f02-14-9780128036983

With that definition, ‘enum 0,5’ is equivalent to this assembly input:

f02-15-9780128036983
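The figures are not reproduced here; a definition consistent with the description (closely following a similar example in the GNU assembler manual) might be:

```asm
        .macro  enum first=0, last=5
        .long   \first
        .if     \last - \first           @ non-zero until first reaches last
        enum    "(\first+1)", \last      @ recursive call, first incremented
        .endif
        .endm
```

Under this definition, `enum 0,5` expands to six .long directives holding the values 0 through 5.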

2.4 Chapter Summary

There are four elements to assembly syntax: labels, directives, instructions, and comments. Directives are used mainly to define symbols, allocate storage, and control the behavior of the assembler. The most common assembler directives were introduced in this chapter, but there are many other directives available in the GNU assembler. Complete documentation is available online as part of the GNU Binutils package.

Directives are used to declare statically allocated storage, which is equivalent to declaring global static variables in C. In assembly, labels and other symbols are visible only within the file that they are declared, unless they are explicitly made visible to other files with the .global directive. In C, variables that are declared outside of any function are visible to all files in the program, unless the static keyword is used to make them visible only within the file where they are declared. Thus, both C and assembly support file and global scope for static variables, but with the opposite defaults and different syntax.

Directives can also be used to declare macros. Macros are expanded by the assembler and may generate multiple statements. Careful use of macros can automate some simple tasks, allowing several lines of assembly code to be replaced with a single macro invocation.

Exercises

2.1 What is the difference between

(a) the .data section and .bss section?

(b) the .ascii and .asciz directives?

(c) the .word and the .long directives?

2.2 What is the purpose of the .align assembler directive? What does “.align 2” do in GNU ARM assembly?

2.3 Assembly language has four main elements. What are they?

2.4 Using the directives presented in this chapter, show three different ways to create a null-terminated string containing the phrase “segmentation fault”.

2.5 What is the total memory, in bytes, allocated for the following variables?

f02-16-9780128036983

2.6 Identify the directive(s), label(s), comment(s), and instruction(s) in the following code:

f02-17-9780128036983

2.7 Write assembly code to declare variables equivalent to the following C code:

f02-18-9780128036983

2.8 Show how to store the following text as a single string in assembly language, while making it readable and keeping each line shorter than 80 characters:

The three goals of the mission are:

1) Keep each line of code under 80 characters,

2) Write readable comments,

3) Learn a valuable skill for readability.

2.9 Insert the minimum number of .align directives necessary in the following code so that all word variables are aligned on word boundaries and all halfword variables are aligned on halfword boundaries, while minimizing the amount of wasted space.

f02-19-9780128036983

2.10 Re-order the directives in the previous problem so that no .align directives are necessary to ensure proper alignment. How many bytes of storage were wasted by the original ordering of directives, compared to the new one?

2.11 What are the most important directives for selecting a section?

2.12 Why are .ascii and .asciz directives usually followed by an .align directive, but .word directives are not?

2.13 Using the “Hello World” program shown in Listing 2.1 as a template, write a program that will print your name.

2.14 Listing 2.3 shows that the assembler will assign the location 0x00000000 to the main symbol and also to the str symbol. Why does this not cause problems?

Chapter 3

Load/Store and Branch Instructions

Abstract

This chapter explains how a particular assembly language is related to the architectural design of a particular CPU family. It then gives an overview of the ARM architecture. Next, it describes the ARM register set and data paths, including the Process Status Register, and the flags which are used to control conditional execution. Then it introduces the concept of instructions and operands, and explains immediate data used as an operand. Next it describes the load and store instructions and all of the addressing modes available on the ARM processor. Then it explains the branch and conditional branch instructions. The chapter ends with some examples showing how the branch and link instruction can be used to call functions from the C standard library.

Keywords

Architecture; Instruction set architecture; Data path; Register; Memory; Load; Store; Branch; Address; Addressing mode; Conditional execution; Function or subroutine call

The part of the computer architecture related to programming is referred to as the instruction set architecture (ISA). The ISA includes the set of registers that the user program can access, and the set of instructions that the processor supports, as well as data paths and processing elements within the processor. The first step in learning a new assembly language is to become familiar with the ISA. For most modern computer systems, data must be loaded in a register before it can be used for any data processing instruction, but there are a limited number of registers. Memory provides a place to store data that is not currently needed. Program instructions are also stored in memory and fetched into the CPU as they are needed. This chapter introduces the ISA for the ARM processor.

3.1 CPU Components and Data Paths

The CPU is composed of data storage and computational components connected together by a set of buses. The most important components of the CPU are the registers, where data is stored, and the arithmetic and logic unit (ALU), where arithmetic and logical operations are performed on the data. Some CPUs also have dedicated hardware units for multiplication and/or division. Fig. 3.1 shows the major components of the ARM CPU and the buses that connect the components together. These buses provide pathways for the data to move between the computational and storage components. The organization of the components and buses in a CPU govern what types of operations can be performed.

f03-01-9780128036983
Figure 3.1 The ARM processor architecture.

The set of instructions and addressing modes available on the ARM processor is closely related to the architecture shown in Fig. 3.1. The architecture provides for certain operations to be performed efficiently, and this has a direct relationship to the types of instructions that are supported.

Note that on the ARM, two source registers can be selected for an instruction, using the A and B buses. The data on the B bus is routed through a shifter, and then to the ALU. This allows the second operand of most instructions to be shifted an arbitrary amount before it reaches the ALU. The data on the A bus goes directly to the ALU. Additionally, the A and B buses can provide operands for the multiplier, and the multiplier can provide data for the A and B buses.

Data coming in from memory or an input/output device is fed directly onto the ALU bus. From there, it can be stored in one of the general-purpose registers. Data being written to memory or an input/output device is taken directly from the B bus, which means that store operations can move data from a register, but cannot modify the data on the way to memory or input/output devices.

The address register is a temporary register that is used by the CPU whenever it needs to read or write to memory or I/O devices. It is used every time an instruction is fetched from memory, and is used for all load and store operations. The address register can be loaded from the program counter, for fetching the next instruction. Also the address register can be loaded from the ALU, which allows the processor to support addressing modes where a register is used as a base pointer and an offset is calculated on-the-fly. After its contents are used to access memory or I/O devices, the base address can be incremented and the incremented value can be stored back into a register. This allows the processor to increment the program counter after each instruction, and to implement certain addressing modes where a pointer is automatically incremented after each memory access.

3.2 ARM User Registers

As shown in Fig. 3.2, the ARM processor provides 13 general-purpose registers, named r0 through r12. These registers can each store 32 bits of data. In addition to the 13 general-purpose registers, the ARM has three other special-purpose registers.

f03-02-9780128036983
Figure 3.2 The ARM user program registers.

The program counter, r15, always contains the address of the next instruction that will be executed. The processor increments this register by four, automatically, after each instruction is fetched from memory. By moving an address into this register, the programmer can cause the processor to fetch the next instruction from the new address. This gives the programmer the ability to jump to any address and begin executing code there.

The link register, r14, is used to hold the return address for subroutines. Certain instructions cause the program counter to be copied to the link register, then the program counter is loaded with a new address. These branch-and-link instructions are briefly covered in Section 3.5 and in more detail in Section 5.4.
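As a brief sketch of how these two registers interact (the branch instructions themselves are covered in Section 3.5; myfunc is a label invented for this example):

```asm
        bl      myfunc      @ branch-and-link: r14 (lr) receives the return
                            @ address, r15 (pc) receives the address of myfunc

myfunc: mov     pc, lr      @ moving lr back into the pc returns to the caller
```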

The program stack was introduced in Section 1.4. The stack pointer, r13, is used to hold the address where the stack ends. This is commonly referred to as the top of the stack, although on most systems the stack grows downwards and the stack pointer really refers to the bottom of the stack. The address where the stack ends may change when registers are pushed onto the stack, or when temporary local variables (automatic variables) are allocated or deleted. The use of the stack for storing automatic variables is described in Chapter 5. The use of r13 as the stack pointer is a programming convention. Some instructions (e.g., branches) implicitly modify the program counter and link registers, but there are no special instructions involving the stack pointer. As far as the hardware is concerned, r13 is exactly the same as registers r0 through r12, but all ARM programmers use it for the stack pointer.

Although register r13 is normally used as the stack pointer, it can be used as a general-purpose register if the stack is not used. However the high-level language compilers always use it as the stack pointer, so using it as a general-purpose register will result in code that cannot inter-operate with code generated using high-level languages. The link register, r14, can also be used as a general-purpose register, but its contents are modified by hardware when a subroutine is called. Using r13 and r14 as general-purpose registers is dangerous and strongly discouraged.

There are also two other registers which may have special purposes. As with the stack pointer, these are programming conventions. There are no special instructions involving these registers. The frame pointer (r11) is used by high-level language compilers to track the current stack frame. This is sometimes useful when running your program under a debugger, and can sometimes help the compiler to generate more efficient code for returning from a subroutine. The GNU C compiler can be instructed to use r11 as a general-purpose register by using the -fomit-frame-pointer command line option. The inter-procedure scratch register r12 is used by the C library when calling functions in dynamically linked libraries. The contents may change, seemingly at random, when certain functions (such as printf) are called.

The final register in the ARM user programming model is the Current Program Status Register (CPSR). This register contains bits that indicate the status of the current program, including information about the results of previous operations. Fig. 3.3 shows the bits in the CPSR. The first four bits, N, Z, C, and V are the condition flags. Most instructions can modify these flags, and later instructions can use the flags to modify their operation. Their meaning is as follows:

f03-03-9780128036983
Figure 3.3 The ARM process status register.

Negative: This bit is set to one if the signed result of an operation is negative, and set to zero if the result is positive or zero.

Zero: This bit is set to one if the result of an operation is zero, and set to zero if the result is non-zero.

Carry: This bit is set to one if an add operation results in a carry out of the most significant bit, or if a subtract operation completes without requiring a borrow (on ARM, the carry flag after a subtraction is the inverse of borrow). For shift operations, this flag is set to the last bit shifted out by the shifter.

oVerflow: For addition and subtraction, this flag is set if a signed overflow occurred.

The remaining bits are used by the operating system or for bare-metal programs, and are described in Section 14.1.

3.3 Instruction Components

The ARM processor supports a relatively small set of instructions, grouped into four general types or categories. Most instructions have optional modifiers which can be used to change their behavior. For example, many instructions can have modifiers which set or check condition codes in the CPSR. The combination of basic instructions with optional modifiers results in an extremely rich assembly language. The following sections give a brief overview of the features which are common to instructions in each category. The individual instructions are explained later in this chapter, and in the following chapter.

3.3.1 Setting and Using Condition Flags

As mentioned previously, the CPSR contains four flag bits (bits 28–31), which can be used to control whether or not certain instructions are executed. Most of the data processing instructions have an optional modifier to control whether or not the flag bits are affected when the instruction is executed. For example, the basic instruction for addition is add. When the add instruction is executed, the result is stored in a register, but the flag bits in the CPSR are not affected.

However, the programmer can add the s modifier to the add instruction to create the adds instruction. When it is executed, this instruction will affect the CPSR flag bits. The flag bits can be used by subsequent instructions to control execution and branching. The meaning of the flags depends on the type of instruction that last set the flags. Table 3.1 shows the names and meanings of the four bits depending on the type of instruction that set or cleared them. Most instructions support the s modifier to control setting the flags.

Table 3.1

Flag bits in the CPSR register

Name | Logical Instruction | Arithmetic Instruction
N (Negative) | No meaning | Bit 31 of the result is set. Indicates a negative number in signed operations
Z (Zero) | Result is all zeroes | Result of operation was zero
C (Carry) | After shift operation, ‘1’ was left in carry flag | Result was greater than 32 bits
V (oVerflow) | No meaning | The signed two’s complement result requires more than 32 bits. Indicates a possible corruption of the result


Most ARM instructions can have a condition modifier attached. If present, the modifier controls, at run-time, whether or not the instruction is actually executed. These condition modifiers are added to basic instructions to create conditional instructions. Table 3.2 shows the condition modifiers that can be attached to base instructions. For example, to create an instruction that adds only if the CPSR Z flag is set, the programmer would add the eq condition modifier to the basic add instruction to create the addeq instruction.

Table 3.2

ARM condition modifiers

<cond> | English Meaning
al | always (this is the default <cond>)
eq | Z set (=)
ne | Z clear (≠)
ge | N set and V set, or N clear and V clear (≥)
lt | N set and V clear, or N clear and V set (<)
gt | Z clear, and either N set and V set, or N clear and V clear (>)
le | Z set, or N set and V clear, or N clear and V set (≤)
hi | C set and Z clear (unsigned >)
ls | C clear or Z set (unsigned ≤)
hs | C set (unsigned ≥)
cs | Alternate name for hs
lo | C clear (unsigned <)
cc | Alternate name for lo
mi | N set (result < 0)
pl | N clear (result ≥ 0)
vs | V set (overflow)
vc | V clear (no overflow)

Setting and using condition flags are orthogonal operations. This means that they can be used in combination. Using the previous example, the programmer could add the s modifier to create the addeqs instruction, which executes only if the Z bit is set, and updates the CPSR flags only if it executes.
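As a brief sketch (these data processing instructions are covered in Chapter 4):

```asm
        subs    r0, r0, #1     @ subtract, and update the CPSR flags
        addeq   r1, r1, #1     @ executes only if the result was zero (Z set)
        addeqs  r2, r2, #4     @ same condition, and also updates the flags
                               @ if (and only if) it actually executes
```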

3.3.2 Immediate Values

An immediate value in assembly language is a constant value that is specified by the programmer. Some assembly languages encode the immediate value as part of the instruction. Other assembly languages create a table of immediate values in a literal pool and insert appropriate instructions to access them. ARM assembly language provides both methods.

Immediate values can be specified in decimal, octal, hexadecimal, or binary. Octal values must begin with a zero, and hexadecimal values must begin with “0x”. Likewise, immediate values that start with “0b” are interpreted as binary numbers. Any value that does not begin with zero, 0x, or 0b will be interpreted as a decimal value.

There are two ways that immediate values can be specified in GNU ARM assembly. The =<immediate|symbol> syntax can be used to specify any immediate 32-bit number, or to specify the 32-bit value of any symbol in the program. Symbols include program labels (such as main) and symbols that are defined using .equ and similar assembler directives. However, this syntax can only be used with load instructions, and not with data processing instructions. This restriction is necessary because of the way the ARM machine instructions are encoded. For data processing instructions, there are a limited number of bits that can be devoted to storing immediate data as part of the instruction.

The #<immediate|symbol> syntax is used to specify immediate data values for data processing instructions. It has some restrictions: the assembler must be able to construct the specified value using only eight bits of data, a shift or rotate, and/or a complement. For immediate values that cannot be constructed by shifting or rotating and complementing an 8-bit value, the programmer must use an ldr instruction with the =<immediate|symbol> syntax to specify the value. That method is covered in Section 3.4. Some examples of immediate values are shown in Table 3.3.
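As a sketch of the two syntaxes (the mov and ldr instructions are covered later in this chapter and in Chapter 4; main is an example symbol):

```asm
        mov     r0, #1024       @ legal immediate: 1 shifted left 10 bits
        mov     r1, #-1         @ legal: the one's complement of 0
        ldr     r2, =1021       @ 1021 is not encodable as a #-immediate, so
                                @ the assembler uses the literal-pool syntax
        ldr     r3, =main       @ = can also load the 32-bit address of a symbol
```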

Table 3.3

Legal and illegal values for #<immediate|symbol>

#32 | Ok because it can be stored as an 8-bit value
#1021 | Illegal because the number cannot be created from an 8-bit value using shift or rotate and complement
#1024 | Ok because it is 1 shifted left 10 bits
#0b1011 | Ok because it fits in 8 bits
#-1 | Ok because it is the one’s complement of 0
#0xFFFFFFFE | Ok because it is the one’s complement of 1
#0xEFFFFFFF | Ok because it is the one’s complement of 1 shifted left 28 bits
#strsize | Ok if the value of strsize can be created from an 8-bit value using shift or rotate and complement


3.4 Load/Store Instructions

The ARM processor has a strict separation between instructions that perform computation and those that move data between the CPU and memory. Because of this separation between load/store operations and computational operations, it is a classic example of a load-store architecture. The programmer can transfer bytes (8 bits), half-words (16 bits), and words (32 bits), from memory into a register, or from a register into memory. The programmer can also perform computational operations (such as adding) using two source operands and one register as the destination for the result. All computational instructions assume that the registers already contain the data. Load instructions are used to move data into the registers, while store instructions are used to move data from the registers to memory.

3.4.1 Addressing Modes

Most of the load/store instructions use an <address>, which is one of the six options shown in Table 3.4. The <shift_op> can be any of the shift operations from Table 3.5, and <shift> should be a number between 0 and 31. Although there are really only six addressing modes, there are eleven variations of the assembly language syntax. Four of the variations are simply shorthand notations. One of the variations allows an immediate data value or the address of a label to be loaded into a register, and may result in the assembler generating more than one instruction. The following section describes each addressing mode in detail.

Table 3.4

ARM addressing modes

Syntax | Name
[Rn, #±<offset_12>] | Immediate offset
[Rn, ±Rm, <shift_op> #<shift>] | Scaled register offset
[Rn, #±<offset_12>]! | Immediate pre-indexed
[Rn, ±Rm, <shift_op> #<shift>]! | Scaled register pre-indexed
[Rn], #±<offset_12> | Immediate post-indexed
[Rn], ±Rm, <shift_op> #<shift> | Scaled register post-indexed

Table 3.5

ARM shift and rotate operations

<shift_op> | Meaning
lsl | Logical Shift Left by specified amount
lsr | Logical Shift Right by specified amount
asr | Arithmetic Shift Right by specified amount

Immediate offset: [Rn, #±<offset_12>]
The immediate offset (which may be positive or negative) is added to the contents of Rn. The result is used as the address of the item to be loaded or stored. For example, the following line of code:

 ldr r0, [r1, #12]

calculates a memory address by adding 12 to the contents of register r1. It then loads four bytes of data, starting at the calculated memory address, into register r0. Similarly, the line:

 str r9, [r6, #-8]

subtracts 8 from the contents of r6 and uses that as the address where it stores the contents of r9 in memory.

Register immediate: [Rn]
When using immediate offset mode with an offset of zero, the comma and offset can be omitted. That is, [Rn] is just shorthand notation for [Rn, #0]. This shorthand is referred to as register immediate mode. For example, the following line of code:

 ldr r3, [r2]

uses the contents of register r2 as a memory address and loads four bytes of data, starting at that address, into register r3. Likewise,

 str r8, [r0]

copies the contents of r8 to the four bytes of memory starting at the address that is in r0.

Scaled register offset: [Rn, ±Rm, <shift_op> #<shift>]
Rm is shifted as specified, then added to or subtracted from Rn. The result is used as the address of the item to be loaded or stored. For example,

 ldr r3, [r2, r1, lsl #2]

shifts the contents of r1 left two bits, adds the result to the contents of r2 and uses the sum as an address in memory from which it loads four bytes into r3. Recall that shifting a binary number left by two bits is equivalent to multiplying that number by four. This addressing mode is typically used to access an array, where r2 contains the address of the beginning of the array, and r1 is an integer index. The integer shift amount depends on the size of the objects in the array. To store an item from register r0 into an array of half-words, the following instruction could be used:

 strh r0, [r4, r5, lsl #1]

where r4 holds the address of the first byte of the array, and r5 holds the integer index for the desired array item.

Register offset: [Rn, ±Rm]
When using scaled register offset mode with a shift amount of zero, the comma and shift specification can be omitted. That is, [Rn, ±Rm] is just shorthand notation for [Rn, ±Rm, lsl #0]. This shorthand is referred to as register offset mode.

Immediate pre-indexed: [Rn, #±<offset_12>]!
The address is computed in the same way as immediate offset mode, but after the load or store, the address that was used is stored in Rn. This mode can be used to step through elements in an array, updating a pointer to the next array element before each element is accessed.
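For example, assuming r1 holds a pointer into an array of words:

```asm
        ldr     r0, [r1, #4]!   @ r1 ← r1 + 4 first, then load r0 from
                                @ the updated address
```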

Scaled register pre-indexed: [Rn, ±Rm, <shift_op> #<shift>]!
The address is computed in the same way as scaled register offset mode, but after the load or store, the address that was used is stored in Rn. This mode can be used to step through elements in an array, updating a pointer to the current array element before each access.

Register pre-indexed: [Rn, ±Rm]!
When using scaled register pre-indexed mode with a shift amount of zero, the comma and shift specification can be omitted. That is, [Rn, ±Rm]! is shorthand notation for [Rn, ±Rm, lsl #0]!. This shorthand is referred to as register pre-indexed mode.

Immediate post-indexed: [Rn], #±<offset_12>
Register Rn is used as the address of the value to be loaded or stored. After the value is loaded or stored, the value in Rn is updated by adding the immediate offset, which may be negative or positive. This mode can be used to step through elements in an array, updating a pointer to point at the next array element after each one is accessed.
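For example, assuming r1 and r2 hold pointers into two word arrays:

```asm
        ldr     r0, [r1], #4    @ load the word at r1, then r1 ← r1 + 4
        str     r0, [r2], #4    @ store it at r2, then r2 ← r2 + 4
```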

Scaled register post-indexed: [Rn], ±Rm, <shift_op> #<shift>
Register Rn is used as the address of the value to be loaded or stored. After the value is loaded or stored, the value in Rn is updated by adding or subtracting the contents of Rm shifted as specified. This mode can be used to step through elements in an array, updating a pointer to point at the next array element after each one is accessed.

Register post-indexed: [Rn], ±Rm
When using scaled register post-indexed mode with a shift amount of zero, the comma and shift specification can be omitted. That is, [Rn], ±Rm is shorthand notation for [Rn], ±Rm, lsl #0. This shorthand is referred to as register post-indexed mode.

Load immediate: =<immediate|symbol>
This is really a pseudo-instruction. The assembler will generate a mov instruction if possible. Otherwise it will store the value of immediate or the address of symbol in a “literal table” and generate a load instruction, using one of the previous addressing modes, to load the value into a register. This addressing mode can only be used with the ldr instruction.

The load and store instructions allow the programmer to move data from memory to registers or from registers to memory. The load/store instructions can be grouped into the following types:

 single register,

 multiple register, and

 atomic.

The following sections describe the seven load and store instructions that are available, and all of their variations.

3.4.2 Load/Store Single Register

These instructions transfer a single word, half-word, or byte from a register to memory or from memory to a register:

ldr Load Register, and

str Store Register.

Syntax

 <op>{<cond>}{<size>} Rd, <address>

 <op> is either ldr or str.

 The optional <cond> can be any of the codes from Table 3.2 on page 59 specifying conditional execution.

 The optional <size> is one of:

b unsigned byte

h unsigned half-word

sb signed byte

sh signed half-word

 The <address> is any valid address specifier described in Section 3.4.1.

Operations

Name | Effect | Description
ldr | Rd ← Mem[address] | Load register from memory at address
str | Mem[address] ← Rd | Store register in memory at address

Examples

u03-01-9780128036983

3.4.3 Load/Store Multiple Registers

ARM has two instructions for loading and storing multiple registers:

ldm Load Multiple Registers, and

stm Store Multiple Registers.

These instructions are used to store registers on the program stack, and for copying blocks of data. The ldm and stm instructions each have four variants, and each variant has two equivalent names. So, although there are only two basic instructions, there are sixteen mnemonics. These are the most complex instructions in the ARM assembly language.

Syntax

 <op><variant> Rd{!}, <register_list>{^}

 <op> is either ldm or stm.

 <variant> is chosen from the following tables:

Block Copy Method:
ia | Increment After
ib | Increment Before
da | Decrement After
db | Decrement Before

Stack Type:
ea | Empty Ascending
fa | Full Ascending
ed | Empty Descending
fd | Full Descending


 The optional ! specifies that the address register Rd should be updated after the registers are transferred.

 An optional trailing ˆ can only be used by operating system code. It causes the transfer to affect user registers instead of operating system registers.

There are two equivalent mnemonics for each load/store multiple instruction. For example, ldmia is exactly the same instruction as ldmfd, and stmdb is exactly the same instruction as stmfd. There are two different names so that the programmer can indicate what the instruction is being used for.

The mnemonics in the Block Copy Method table are used when the programmer is using the instructions to move blocks of data. For instance, the programmer may want to copy eight words from one address in memory to another address. One very efficient way to do that is to:

1. load the address of the first byte of the source into a register,

2. load the address of the first byte of the destination into another register,

3. use ldmia (load multiple increment after) to load eight registers from the source address, then

4. use stmia (store multiple increment after) to store the registers to the destination address.

Assuming source and dest are labeled blocks of data declared elsewhere, the following listing shows the exact instructions needed to move eight words from source to dest:

[Listing: copying eight words from source to dest using ldmia and stmia]
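The original listing is reproduced as an image; the four steps above can be sketched as follows (the register choices r0-r9 are illustrative):

```asm
        ldr     r0, =source     @ 1. address of the first source word
        ldr     r1, =dest       @ 2. address of the first destination word
        ldmia   r0, {r2-r9}     @ 3. load eight words from the source address
        stmia   r1, {r2-r9}     @ 4. store the registers to the destination
```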

The mnemonics in the Stack Type table are used when the programmer is performing stack operations. The most common variants are stmfd and ldmfd, which are used for pushing registers onto the program stack and later popping them back off, respectively. In Linux, the C compiler always uses the stmfd and ldmfd versions for accessing the stack. The following code shows how the programmer could save the contents of registers r0-r9 on the stack, use them to perform a block copy, then restore their contents:

[Listing: saving r0-r9 with stmfd, performing a block copy, then restoring with ldmfd]
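A sketch of the save, copy, and restore sequence described above (source and dest are assumed to be word-aligned labels declared elsewhere):

```asm
        stmfd   sp!, {r0-r9}    @ push r0-r9 onto the full descending stack
        ldr     r0, =source
        ldr     r1, =dest
        ldmia   r0, {r2-r9}     @ block copy eight words from source...
        stmia   r1, {r2-r9}     @ ...to dest
        ldmfd   sp!, {r0-r9}    @ pop r0-r9, restoring their contents
```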

Note that in the previous example, after the stmfd sp!, { r0-r9 } instruction, sp will contain the address of the last word on the stack, because the optional ! was used to indicate that the register should be updated.

Operations

ldmia and ldmfd: Load multiple registers from memory, starting at the address in Rd and incrementing the address by four bytes after each load.

    addr ← Rd
    for all i ∈ register_list do
        i ← Mem[addr]
        addr ← addr + 4
    end for
    if ! is present then
        Rd ← addr
    end if

stmia and stmea: Store multiple registers in memory, starting at the address in Rd and incrementing the address by four bytes after each store.

    addr ← Rd
    for all i ∈ register_list do
        Mem[addr] ← i
        addr ← addr + 4
    end for
    if ! is present then
        Rd ← addr
    end if

ldmib and ldmed: Load multiple registers from memory, starting at the address in Rd and incrementing the address by four bytes before each load.

    addr ← Rd
    for all i ∈ register_list do
        addr ← addr + 4
        i ← Mem[addr]
    end for
    if ! is present then
        Rd ← addr
    end if

stmib and stmfa: Store multiple registers in memory, starting at the address in Rd and incrementing the address by four bytes before each store.

    addr ← Rd
    for all i ∈ register_list do
        addr ← addr + 4
        Mem[addr] ← i
    end for
    if ! is present then
        Rd ← addr
    end if

ldmda and ldmfa: Load multiple registers from memory, starting at the address in Rd and decrementing the address by four bytes after each load.

    addr ← Rd
    for all i ∈ register_list do
        i ← Mem[addr]
        addr ← addr − 4
    end for
    if ! is present then
        Rd ← addr
    end if

stmda and stmed: Store multiple registers in memory, starting at the address in Rd and decrementing the address by four bytes after each store.

    addr ← Rd
    for all i ∈ register_list do
        Mem[addr] ← i
        addr ← addr − 4
    end for
    if ! is present then
        Rd ← addr
    end if

ldmdb and ldmea: Load multiple registers from memory, starting at the address in Rd and decrementing the address by four bytes before each load.

    addr ← Rd
    for all i ∈ register_list do
        addr ← addr − 4
        i ← Mem[addr]
    end for
    if ! is present then
        Rd ← addr
    end if

stmdb and stmfd: Store multiple registers in memory, starting at the address in Rd and decrementing the address by four bytes before each store.

    addr ← Rd
    for all i ∈ register_list do
        addr ← addr − 4
        Mem[addr] ← i
    end for
    if ! is present then
        Rd ← addr
    end if

Examples

[Example listings for ldm and stm]

3.4.4 Swap

Multiprogramming and threading require the ability to set and test values atomically. This instruction is used by the operating system or threading libraries to guarantee mutual exclusion:

swp Atomic Load and Store

Note: swp and swpb are deprecated in favor of ldrex and strex, which work on multiprocessor systems as well as uni-processor systems.

Syntax

 swp{<cond>} Rd, Rm, [Rn]

 swp{<cond>}b Rd, Rm, [Rn]

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

Operations

Name   Effect                        Description
swp    Rd ← Mem[Rn]; Mem[Rn] ← Rm    Atomically swap a word: load the old word into Rd and store Rm
swpb   Rd ← Mem[Rn]; Mem[Rn] ← Rm    Atomically swap a byte: load the old byte into Rd and store Rm

Example

[Example listing for swp]
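As a sketch of how swp could be used to take a simple lock (the choice of r2 for the lock address and 1 for "locked" are assumptions, not the book's example):

```asm
        mov     r1, #1          @ value meaning "locked"
        swp     r1, r1, [r2]    @ r1 <- old lock value, Mem[r2] <- 1, atomically
        @ if r1 is now 0, this code acquired the lock
```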

3.4.5 Exclusive Load/Store

These instructions are used by the operating system or threading libraries to guarantee mutual exclusion, even on multiprocessor systems:

ldrex Load Register Exclusive, and

strex Store Register Exclusive.

Exclusive load (ldrex) reads data from memory, tagging the memory address at the same time. Exclusive store (strex) stores data to memory, but only if the tag is still valid. A strex to the same address as the previous ldrex will invalidate the tag. A str to the same address may invalidate the tag (implementation defined). The strex instruction writes a result into a separate register: 0 if the store succeeded, 1 if it failed. This allows the programmer to implement semaphores on uni-processor and multiprocessor systems.

Syntax

 ldrex{<cond>} Rd, [Rn]

 strex{<cond>} Rd, Rm, [Rn]

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

Operations

ldrex: Load register and tag the memory address.

    Rd ← Mem[Rn]
    tag[Rn] ← true

strex: Store Rm in memory if the tag is still valid; Rd receives 0 on success and 1 on failure.

    if tag[Rn] then
        Mem[Rn] ← Rm
        Rd ← 0
    else
        Rd ← 1
    end if

Example

[Example listing for ldrex and strex]
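A hypothetical spin-lock acquire loop shows the ldrex/strex retry pattern (register assignments and the 0 = free, 1 = taken convention are assumptions):

```asm
try:    ldrex   r0, [r1]        @ read the lock word and tag its address
        cmp     r0, #0          @ already taken?
        bne     try             @ yes: spin until it looks free
        mov     r0, #1
        strex   r2, r0, [r1]    @ try to claim it; r2 <- 0 on success
        cmp     r2, #0
        bne     try             @ store failed: another processor got there first
```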

3.5 Branch Instructions

Branch instructions allow the programmer to change the address of the next instruction to be executed. They are used to implement loops, if-then structures, subroutines, and other flow control structures. There are two basic branch instructions:

 Branch, and

 Branch and Link (subroutine call).

3.5.1 Branch

This instruction is used to perform conditional and unconditional branches in program execution:

b Branch.

It is used for creating loops and if-then-else constructs.

Syntax

 b{<cond>} <target_label>

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

 The target_label can be any label in the current file, or any label that is defined as .global or .globl in any file that is linked in.

Operations

Name   Effect                 Description
b      pc ← target_address    Load pc with new address (branch)

Examples

[Example listings for b]
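As an illustrative sketch (the counter value and labels are arbitrary), a conditional branch creates a loop:

```asm
        mov     r0, #5          @ loop counter
loop:   subs    r0, r0, #1      @ decrement and set the CPSR flags
        bne     loop            @ conditional: branch back while r0 is not zero
done:   b       done            @ unconditional: branch to itself forever
```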

3.5.2 Branch and Link

The following instruction is used to call subroutines:

bl Branch and Link.

The branch and link instruction is identical to the branch instruction, except that it copies the address of the next instruction (the return address) into the link register before performing the branch. This allows the programmer to copy the link register back into the program counter at some later point. This is how subroutines are called, and how they return and resume execution at the instruction following the one that called them.

Syntax


 bl{<cond>} <target_address>

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

 The target_address can be any label in the current file, or any label that is defined as .global or .globl in any file that is linked in.

Operations

Name   Effect                          Description
bl     lr ← pc; pc ← target_address    Save pc in lr, then load pc with new address

Examples

[Example listings for bl]
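A minimal sketch of a call and return (the subroutine name and register uses are illustrative, not the book's example):

```asm
main:   mov     r0, #21
        bl      double          @ lr <- return address, pc <- double
        mov     r9, r0          @ execution resumes here after the return

double: add     r0, r0, r0      @ body of the subroutine
        mov     pc, lr          @ return: copy the saved address back into pc
```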

Example 3.1 shows how the bl instruction can be used to call a function from the C standard library to read a single character from standard input. By convention, when a function is called, it will leave its return value in r0. Example 3.2 shows how the bl instruction can be used to call another function from the C standard library to print a message to standard output. By convention, when a function is called, it will expect to find its first argument in r0. There are other rules, which all ARM programmers must follow, regarding which registers are used when passing arguments to functions and procedures. Those rules will be explained fully in Section 5.4.

Example 3.1

Using the bl Instruction to Read a Character

Suppose we want to read a single character from standard input. This can be accomplished in C by calling the getchar () function from the C standard library as follows:

[C listing: c = getchar();]

The above C code assumes that the variable c has been declared to hold the result of the function. In ARM assembly language, functions always return their results in r0. The assembly programmer may then move the result to any register or memory location they choose. In the following example, it is assumed that r9 was chosen to hold the value of the variable c:

[Listing: calling getchar and saving the result in r9]
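The call described above can be sketched as:

```asm
        bl      getchar         @ call the C library function
        mov     r9, r0          @ result returns in r0; keep variable c in r9
```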

Example 3.2

Using the bl Instruction to Print a Message

To print a string to standard output in C, we can use the printf () function from the C standard library as follows:

[C listing: printing a message with printf()]

The C compiler will automatically create a constant array of characters and initialize it to hold the message. Then it will load the address of the first character in the array into register r0 before calling printf (). The printf () function will expect to see an address in r0, which it will assume is the address of the format string to be printed. The function call can be made as follows in ARM assembly:

[Listing: loading the string address into r0 and calling printf]
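A sketch of the call sequence, assuming a hypothetical format string labeled str (the message text is an assumption):

```asm
        .data
str:    .asciz  "Hello\n"       @ hypothetical format string
        .text
        ldr     r0, =str        @ printf expects the string's address in r0
        bl      printf
```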

3.6 Pseudo-Instructions

The assembler provides a small number of pseudo-instructions. From the perspective of the programmer, these instructions are indistinguishable from standard instructions. However, when the assembler encounters a pseudo-instruction, it may substitute a different instruction or generate a short sequence of machine instructions.

3.6.1 Load Immediate

This pseudo-instruction loads a register with any 32-bit value:

ldr Load Immediate

When this pseudo-instruction is encountered, the assembler first determines whether or not it can substitute a mov Rd,#<immediate> or mvn Rd,#<immediate> instruction. If that is not possible, then it reserves four bytes in a “literal pool” and stores the immediate value there. Then, the pseudo-instruction is translated into an ldr instruction using Immediate Offset addressing mode with the pc as the base register.

Syntax

 ldr{<cond>} Rd, =<immediate>

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

 The <immediate> parameter is any valid 32-bit quantity.

Operations

Name   Effect       Description
ldr    Rd ← value   Load register with immediate value

Example

Example 3.3 shows how the assembler generates code from the load immediate pseudo-instruction. Line 2 of the example listing just declares two 32-bit words. They cause the next variable to be given a non-zero address for demonstration purposes, and are not used anywhere in the program, but line 3 declares a string of characters in the data section. The string is located at offset 0x00000008 from the beginning of the data section. The linker is responsible for calculating the actual address, when it assigns a location for the data section. Line 6 shows how a register can be loaded with an immediate value using the mov instruction. The next line shows the equivalent using the ldr pseudo-instruction. Note that the assembler generates the same machine instruction (FD5FE0E3) for both lines.

Example 3.3

Assembly of the Load Immediate Pseudo-Instruction

[Listing: assembly of the load immediate pseudo-instruction, showing the literal pool]

Line 8 shows the ldr pseudo-instruction being used to load a value that cannot be loaded using the mov instruction. The assembler generated a load half-word instruction using the program counter as the base register, and an offset to the location where the value is stored. The value is actually stored in a literal pool at the end of the text segment. The listing has three lines labeled 11. The first line 11 is an instruction. The remaining lines are the literal pool.

On line 9, the programmer used the ldr pseudo-instruction to request that the address of str be loaded into r4. The assembler created a storage location to hold the address of str, and generated a load word instruction using the program counter as the base register and an offset to the location where the address is stored. The address of str is actually stored in the text segment, on the third line 11.

3.6.2 Load Address

These pseudo instructions are used to load the address associated with a label:

adr Load Address

adrl Load Address Long

They are more efficient than the ldr rx,=label instruction, because they are translated into one or two add or subtract operations, and do not require a load from memory. However, the address must be in the same section as the adr or adrl pseudo-instruction, so they cannot be used to load addresses of labels in the .data section.

Syntax

 <op>{<cond>}{s} Rd, label

 <op> is either adr or adrl.

 The adr pseudo-instruction will be translated into one or two pc-relative add or sub instructions.

 The adrl pseudo-instruction will always be translated into two instructions. The second instruction may be a nop instruction.

 The label must be defined in the same file and section where these pseudo-instructions are used.

Operations

Name   Effect                    Description
adr    Rd ← address of label     Load Address
adrl   Rd ← address of label     Load Address Long

Examples

[Example listings for adr and adrl]
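As a sketch (the label and registers are illustrative), both pseudo-instructions load the address of a label in the same section:

```asm
        .text
msg:    .asciz  "hi"            @ label in the .text section
        .align  2
        adr     r4, msg         @ one pc-relative add or sub
        adrl    r5, msg         @ always two instructions (second may be a nop)
```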

3.7 Chapter Summary

The ARM Instruction Set Architecture includes 17 registers and four basic instruction types. This chapter explained the instructions used for

 moving data between memory and registers, and

 branching and calling subroutines.

The load and store operations are used to move data between memory and registers. The basic load and store operations, ldr and str, have a very powerful set of addressing modes. To facilitate moving multiple registers to or from memory, the ARM ISA provides the ldm and stm instructions, which each have several variants. The assembler provides two pseudo-instructions for loading addresses and immediate values.

The ARM processor provides only two types of branch instruction. The bl instruction is used to call subroutines (functions). The b instruction can be used to create loops and to create if-then-else constructs. The ability to append a condition to almost any instruction results in a very rich instruction set.

Exercises

3.1 Which registers hold the stack pointer, return address, and program counter?

3.2 Which is more efficient for loading a constant value, the ldr pseudo-instruction, or the mov instruction? Explain.

3.3 Which two variants of the Store Multiple instruction are used most often, and why?

3.4 The stm and ldm instructions include an optional ‘!’ after the address register. What does it do?

3.5 The following C statement declares an array of four integers, and initializes their values to 7, 3, 21, and 10, in that order.

int nums[]={7,3,21,10};

(a) Write the equivalent in GNU ARM assembly.

(b) Write the ARM assembly instructions to load all four numbers into registers r3, r5, r6, and r9, respectively, using:

i. a single ldm instruction, and

ii. four ldr instructions.

3.6 What is the difference between a memory location and a CPU register?

3.7 How many registers are provided by the ARM Instruction Set Architecture?

3.8 Use ldm and stm to write a short sequence of ARM assembly language to copy 16 words of data from a source address to a destination address. Assume that the source address is already loaded in r0 and the destination address is already loaded in r1. You may use registers r2 through r5 to hold values as needed. Your code is allowed to modify r0 and/or r1.

3.9 Assume that x is an array of integers. Convert the following C statements into ARM assembly language.

(a) x[8] = 100;

(b) x[10] = x[0];

(c) x[9] = x[3];

3.10 Assume that x is an array of integers, and i and j are integers. Convert the following C statements into ARM assembly language.

(a) x[i] = j;

(b) x[j] = x[i];

(c) x[i] = x[j*2];

3.11 What is the difference between the b instruction and the bl instruction? What is each used for?

3.12 What are the meanings of the following instructions?

(a) ldreq

(b) ldrlt

(c) bgt

(d) bne

(e) bge

Chapter 4

Data Processing and Other Instructions

Abstract

This chapter begins by explaining Operand2, which is used by most ARM data processing instructions to specify one of the source operands for the data processing operation. It explains all of the shift operations and how they can be combined with other data processing operations in a single instruction. It then explains each of the data processing instructions, giving a short example showing how they can be used. Short examples, relating the assembly instructions to C statements, are incorporated throughout the chapter. One of the examples shows how to construct a loop. After the data processing instructions are explained, the chapter covers the special instructions and pseudo-instructions.

Keywords

Operand2; Data processing; Shift; Loop; Comparison; Data movement; Three address instruction; Two address instruction

The ARM processor has approximately 25 data processing instructions. The exact number depends on the processor version. For example, older versions of the architecture did not have the six multiply instructions, and the Cortex M3 and newer processors have two division instructions. There are also a few special instructions that are used infrequently to perform operations that are not classified as load/store, branch, or data processing.

4.1 Data Processing Instructions

The data processing instructions operate only on CPU registers, so data must first be moved from memory into a register before processing can be performed. Most of these instructions use two source operands and one destination register. Each instruction performs one basic arithmetical or logical operation. The operations are grouped in the following categories:

 Arithmetic Operations,

 Logical Operations,

 Comparison Operations,

 Data Movement Operations,

 Status Register Operations,

 Multiplication Operations, and

 Division Operations.

4.1.1 Operand2

Most of the data processing instructions require the programmer to specify two source operands and one destination register for the result. Because three items must be specified for these instructions, they are known as three address instructions. The use of the word address in this case has nothing to do with memory addresses. The term three address instruction comes from earlier processor architectures that allow arithmetic operations to be performed with data that is stored in memory rather than registers. The first source operand specifies a register whose contents will be on the A bus in Fig. 3.1. The second source operand will be on the B bus and is referred to as Operand2. Operand2 can be any one of the following three things:

 a register (r0-r15),

 a register (r0-r15) and a shift operation to modify it, or

 a 32-bit immediate value that can be constructed by shifting, rotating, and/or complementing an 8-bit value.

The options for Operand2 allow a great deal of flexibility. Many operations that would require two instructions on most processors can be performed using a single ARM instruction. Table 4.1 shows the mnemonics used for specifying shift operations, which we refer to as <shift_op>.

Table 4.1

Shift and rotate operations in Operand2

Mnemonic   Operation
lsl        Logical Shift Left
lsr        Logical Shift Right
asr        Arithmetic Shift Right
ror        Rotate Right
rrx        Rotate Right with eXtend

The lsl operation shifts each bit left by a specified amount n. Zeros are shifted into the n least significant bits, and the most significant n bits are lost. The lsr operation shifts each bit right by a specified amount n. Zeros are shifted into the n most significant bits, and the least significant n bits are lost. The asr operation shifts each bit right by a specified amount n. The n most significant bits become copies of the sign bit (bit 31), and the least significant n bits are lost. The ror operation rotates each bit right by a specified amount n. The n least significant bits wrap around to become the n most significant bits. The rrx operation rotates one place to the right, but the CPSR carry flag, C, is included. The carry flag and the register together form a 33-bit quantity to be rotated. The carry flag is rotated into the most significant bit of the register, and the least significant bit of the register is rotated into the carry flag. Table 4.2 shows all of the possible forms for Operand2.

Table 4.2

Formats for Operand2

#<immediate|symbol>            A 32-bit immediate value that can be constructed from an 8-bit value
Rm                             Any of the 16 registers r0-r15
Rm, <shift_op> #<shift_imm>    The contents of a register shifted or rotated by an immediate amount between 0 and 31
Rm, <shift_op> Rs              The contents of a register shifted or rotated by an amount specified by the contents of another register
Rm, rrx                        The contents of a register rotated right by one bit through the carry flag
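A few illustrative instructions show the flexibility of Operand2 (the register choices are arbitrary):

```asm
        add     r0, r1, r2, lsl #2   @ r0 <- r1 + (r2 << 2), i.e. r1 + r2*4
        mov     r3, r4, asr #8       @ r3 <- r4 shifted right 8, sign extended
        mov     r5, #0xFF00          @ 0xFF rotated into place: a legal immediate
```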

4.1.2 Comparison Operations

These four comparison operations update the CPSR flags, but have no other effect:

cmp Compare,

cmn Compare Negative,

tst Test Bits, and

teq Test Equivalence.

They each perform an arithmetic or logical operation, but the result of the operation is discarded. Only the CPSR condition flags are affected.

Syntax

 <op>{<cond>} Rn, Operand2

 <op> is one of cmp, cmn, tst, or teq.

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

Operations

Name   Effect           Description
cmp    Rn − operand2    Compare and set CPSR flags
cmn    Rn + operand2    Compare negative and set CPSR flags
tst    Rn ∧ operand2    Test bits and set CPSR flags
teq    Rn ⊕ operand2    Test equivalence and set CPSR flags

Examples

[Example listings for the comparison operations]

Example 4.1 shows how conditional execution and the test instruction can be used together to create an if-then-else structure. Note that in this case, the assembly code is more concise than the C code. That is not generally true.

Example 4.1

Making an If-Then-Else Construct

The following C code adds three to a if a is odd, and adds seven to a if a is even.

[C listing: add 3 to a if a is odd, 7 if a is even]

Assuming that the value of a is currently being stored in register r4, the following ARM assembly code performs the same function:

[Listing: if-then-else using tst and conditional add]
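The structure described above can be sketched as follows, with a in r4 as the text assumes:

```asm
        tst     r4, #1          @ test the least significant bit of a
        addne   r4, r4, #3      @ bit set (a is odd):    a += 3
        addeq   r4, r4, #7      @ bit clear (a is even): a += 7
```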

4.1.3 Arithmetic Operations

There are six basic arithmetic operations:

add Add,

adc Add with Carry,

sub Subtract,

sbc Subtract with Carry,

rsb Reverse Subtract, and

rsc Reverse Subtract with Carry.

All of them involve two 32-bit source operands and a destination register.

Syntax

 <op>{<cond>}{s} Rd, Rn, Operand2

 <op> is one of add, adc, sub, sbc, rsb, or rsc.

 The optional s specifies whether or not the instruction should affect the bits in the CPSR.

 The optional <cond> can be any of the codes from Table 3.2 on page 59 specifying conditional execution.

Operations

Name   Effect                            Description
add    Rd ← Rn + operand2                Add
adc    Rd ← Rn + operand2 + carry        Add with carry
sub    Rd ← Rn − operand2                Subtract
sbc    Rd ← Rn − operand2 + carry − 1    Subtract with carry
rsb    Rd ← operand2 − Rn                Reverse subtract
rsc    Rd ← operand2 − Rn + carry − 1    Reverse subtract with carry

Examples

[Example listings for the arithmetic operations]

Example 4.2 shows a complete program for adding the contents of two statically allocated variables and printing the result. The printf () function expects to find the address of a string in r0. As it prints the string, it finds the %d formatting command, which indicates that the value of an integer variable should be printed. It expects the variable to be stored in r1. Note that the variable sum does not need to be stored in memory. It is stored in r1, where printf () expects to find it.

Example 4.2

Adding the Contents of Two Variables

The following C program will add together two numbers stored in memory and print the result.

[C listing: adding two variables stored in memory and printing the result]

The equivalent ARM assembly program is as follows:

[Listing: equivalent ARM assembly program]

Example 4.3 shows how the compare, branch, and add instructions can be used to create a loop. There are basically three steps for creating a loop: allocating and initializing the loop variable, testing the loop variable, and modifying the loop variable. In general, any of the registers r0-r12 can be used to hold the loop variable. Section 5.4 introduces some considerations for choosing an appropriate register. For now, it is assumed that r0 is available for use as the loop variable for this example.

Example 4.3

Making a Loop

Suppose we want to implement a loop that is equivalent to the following C code:

[C listing: a counting loop]

The loop can be written with the following ARM assembly code:

[Listing: the loop in ARM assembly]
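Since the listing is not reproduced, here is a sketch of the three loop steps with r0 as the loop variable (the bound of 10 is an assumption, not the book's value):

```asm
        mov     r0, #0          @ 1. allocate and initialize the loop variable
loop:   cmp     r0, #10         @ 2. test the loop variable
        bge     done
        @ ... loop body goes here ...
        add     r0, r0, #1      @ 3. modify the loop variable
        b       loop
done:
```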

4.1.4 Logical Operations

There are five basic logical operations:

and Bitwise AND,

orr Bitwise OR,

eor Bitwise Exclusive OR,

orn Bitwise OR NOT, and

bic Bit Clear.

All of them involve two source operands and a destination register.

Syntax

 <op>{<cond>}{s} Rd, Rn, Operand2

 <op> is one of and, eor, orr, orn, or bic.

 The optional s specifies whether or not the instruction should affect the bits in the CPSR.

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

Operations

Name   Effect                 Description
and    Rd ← Rn ∧ operand2     Bitwise AND
orr    Rd ← Rn ∨ operand2     Bitwise OR
eor    Rd ← Rn ⊕ operand2     Bitwise Exclusive OR
orn    Rd ← Rn ∨ ¬operand2    Bitwise OR NOT
bic    Rd ← Rn ∧ ¬operand2    Bit Clear

Examples

[Example listings for the logical operations]

4.1.5 Data Movement Operations

The data movement operations copy data from one register to another:

mov Move,

mvn Move Not, and

movt Move Top.

The movt instruction copies 16 bits of data into the upper 16 bits of the destination register, without affecting the lower 16 bits. It is available on ARMv6T2 and newer processors.

Syntax

 <op>{<cond>}{s} Rd, Operand2

 movt{<cond>} Rd, #immed16

 <op> is one of mov or mvn.

 The optional s specifies whether or not the instruction should affect the bits in the CPSR.

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

Operations

Name   Effect                                  Description
mov    Rd ← operand2                           Copy operand2 to Rd
mvn    Rd ← ¬operand2                          Copy 1's complement of operand2
movt   Rd ← (immed16 ≪ 16) ∨ (Rd ∧ 0xFFFF)     Copy immed16 into upper 16 bits of Rd

Examples

[Example listings for the data movement operations]

4.1.6 Multiply Operations with 32-bit Results

These two instructions perform multiplication using two 32-bit registers to form a 32-bit result:

mul Multiply, and

mla Multiply and Accumulate.

The mla instruction adds a third register to the result of the multiplication.

Syntax

 mul{<cond>}{s} Rd, Rm, Rs

 mla{<cond>}{s} Rd, Rm, Rs, Rn

 The optional s specifies whether or not the instruction should affect the bits in the CPSR.

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

Operations

Name   Effect               Description
mul    Rd ← Rm × Rs         Multiply
mla    Rd ← Rm × Rs + Rn    Multiply and accumulate

Examples

[Example listings for mul and mla]

4.1.7 Multiply Operations with 64-bit Results

These instructions perform multiplication using two 32-bit registers to form a 64-bit result:

smull Signed Multiply Long,

umull Unsigned Multiply Long,

smlal Signed Multiply and Accumulate Long, and

umlal Unsigned Multiply and Accumulate Long.

The smlal and umlal instructions add a 64-bit quantity to the result of the multiplication.

Syntax

 <type><op>l{<cond>}{s} RdLo, RdHi, Rm, Rs

 <type> must be either s for signed or u for unsigned.

 <op> must be either mul, or mla.

 The optional s specifies whether or not the instruction should affect the bits in the CPSR.

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

Operations

Name    Effect                             Description
smull   RdHi:RdLo ← Rm × Rs                Signed Multiply
umull   RdHi:RdLo ← Rm × Rs                Unsigned Multiply
smlal   RdHi:RdLo ← Rm × Rs + RdHi:RdLo    Signed Multiply and Accumulate
umlal   RdHi:RdLo ← Rm × Rs + RdHi:RdLo    Unsigned Multiply and Accumulate

Examples

[Example listings for the long multiply operations]

4.1.8 Division Operations

Some ARM processors have the following instructions to perform division:

sdiv Signed Divide, and

udiv Unsigned Divide.

The divide operations are available on Cortex M3 and newer ARM processors. The processor used on the Raspberry Pi does not have these instructions. The Raspberry Pi 2 does have them.

Syntax

 <type>div{<cond>}{s} Rd, Rm, Rn

 <type> must be either s for signed or u for unsigned.

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

 The optional s specifies whether or not the instruction should affect the bits in the CPSR.

Operations

Name   Effect          Description
sdiv   Rd ← Rm ÷ Rn    Signed Divide
udiv   Rd ← Rm ÷ Rn    Unsigned Divide

Examples

[Example listings for sdiv and udiv]

4.2 Special Instructions

There are a few instructions that do not fit into any of the previous categories. They are used to request operating system services and access advanced CPU features.

4.2.1 Count Leading Zeros

This instruction counts the number of leading zeros in the operand register and stores the result in the destination register:

clz Count Leading Zeros.

Syntax

 clz{<cond>} Rd, Rm

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

Operations

Name   Effect                   Description
clz    Rd ← 31 − ⌊log2(Rm)⌋     Count leading zeros in Rm

Example

[Example listing for clz]

4.2.2 Accessing the CPSR and SPSR

These two instructions allow the programmer to access the status bits of the CPSR and SPSR:

mrs Move Status to Register, and

msr Move Register to Status.

The SPSR is covered in Section 14.1.

Syntax

 mrs{<cond>} Rd, <CPSR|SPSR>{_<fields>}

 msr{<cond>} <CPSR|SPSR>{_<fields>}, Rd

 The optional < fields > is any combination of:

c control field

x extension field

s status field

f flags field

 The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

Operations

Name   Effect              Description
mrs    Rd ← CPSR|SPSR      Move from Status Register
msr    CPSR|SPSR ← Rd      Move to Status Register

Examples

[Example listings for mrs and msr]

4.2.3 Software Interrupt

The following instruction allows a user program to perform a system call to request operating system services:

swi Software Interrupt.

In Unix and Linux, the system calls are documented in the second section of the online manual. Each system call has a unique id number which is defined in the /usr/include/syscall.h file.

Syntax

 swi <syscall_number>

 The <syscall_number> is encoded in the instruction. The operating system may examine it to determine which operating system service is being requested.

 In Linux, <syscall_number> is ignored. The system call number is passed in r7, and up to seven parameters are passed in r0-r6. No Linux system call requires more than seven parameters.

Operations

Name   Effect                              Description
swi    Request operating system service    Perform software interrupt

Example

[Example listing for swi]
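For instance, a program running on 32-bit ARM Linux can terminate by invoking the exit system call (call number 1):

```asm
        mov     r0, #0          @ exit status is the first parameter
        mov     r7, #1          @ system call number for exit goes in r7
        swi     0               @ the number encoded here is ignored by Linux
```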

4.2.4 Thumb Mode

The ARM processor has an alternate mode where it executes a 16-bit instruction set known as Thumb. This instruction allows the programmer to change the processor mode and branch to Thumb code:

bx Branch and Exchange.

The Thumb instruction set is sometimes more efficient than the full ARM instruction set, and may offer advantages on small systems.

Syntax

 bx{<cond>} Rn

 blx{<cond>} Rn

Operations

Name   Effect              Description
bx     pc ← Rn             Branch and exchange: bit 0 of Rn selects the instruction set (1 for Thumb, 0 for ARM). Used to return from a Thumb subroutine.
blx    lr ← pc; pc ← Rn    Branch and link with exchange: the return address is saved in lr, and bit 0 of lr records the caller's instruction set.

Example

[Example listing for bx and blx]

4.3 Pseudo-Instructions

The assembler provides a small number of pseudo-instructions. From the perspective of the programmer, these instructions are indistinguishable from standard instructions. However, when the assembler encounters a pseudo-instruction, it may substitute a different instruction or generate a short sequence of machine instructions.

4.3.1 No Operation

This pseudo-instruction does nothing, but takes one clock cycle to execute:

nop No Operation.

This is equivalent to a mov r0,r0 instruction.

Syntax

 nop

Operations

Name   Effect       Description
nop    No effect    No Operation

Examples

[Example listings for nop]

4.3.2 Shifts

These pseudo-instructions are assembled into mov instructions with an appropriate shift of Operand2:

lsl Logical Shift Left,

lsr Logical Shift Right,

asr Arithmetic Shift Right,

ror Rotate Right, and

rrx Rotate Right with eXtend.

Syntax

 <op>{<cond>}{s} Rd, Rn, Rs

 <op>{<cond>}{s} Rd, Rn, #shift

 rrx{<cond>}{s} Rd, Rn

 <op> must be one of lsl, lsr, asr, or ror.

 Rs is a register holding the shift amount. Only the least significant byte is used.

 shift must be between 1 and 32.

 If the optional s is specified, then the N and Z flags are updated according to the result, and the C flag is updated to the last bit shifted out.

 The optional <cond> can be any of the codes from Table 3.2 on page 59 specifying conditional execution.

Operations

Name   Effect                                          Description
lsl    Rd ← Rn shifted left by shift                   Shift Left
lsr    Rd ← Rn shifted right by shift                  Shift Right
asr    Rd ← Rn shifted right by shift, sign extended   Shift Right with sign extend
rrx    C:Rd ← C:Rd rotated right one place             Rotate Right with eXtend

The rrx operation rotates one place to the right but the CPSR carry flag, C, is included. The carry flag and the register together create a 33-bit quantity to be rotated. The carry flag is rotated into the most significant bit of the register, and the least significant bit of the register is rotated into the carry flag.

Examples

f04-19-9780128036983

4.4 Alphabetized List of ARM Instructions

This chapter and the previous one introduced the core set of ARM instructions. Most of these instructions were introduced with the very first ARM processors. There are approximately 50 additional instructions and pseudo instructions that were introduced with the ARMv6 and later versions of the architecture, or that only appear in specific versions of the ARM. There are also additional instructions available on systems that have the Vector Floating Point (VFP) coprocessor and/or the NEON extensions. The instructions introduced so far are:

Name    Page   Operation
adc     83     Add with Carry
add     83     Add
adr     75     Load Address
adrl    75     Load Address Long
and     85     Bitwise AND
asr     94     Arithmetic Shift Right
b       70     Branch
bic     86     Bit Clear
bl      71     Branch and Link
bx      92     Branch and Exchange
clz     90     Count Leading Zeros
cmn     81     Compare Negative
cmp     81     Compare
eor     85     Bitwise Exclusive OR
ldm     65     Load Multiple Registers
ldr     73     Load Immediate
ldr     64     Load Register
ldrex   69     Load Register Exclusive
lsl     94     Logical Shift Left
lsr     94     Logical Shift Right
mla     87     Multiply and Accumulate
mov     86     Move
movt    86     Move Top
mrs     91     Move Status to Register
msr     91     Move Register to Status
mul     87     Multiply
mvn     86     Move Not
nop     93     No Operation
orn     86     Bitwise OR NOT
orr     85     Bitwise OR
ror     94     Rotate Right
rrx     94     Rotate Right with eXtend
rsb     83     Reverse Subtract
rsc     83     Reverse Subtract with Carry
sbc     83     Subtract with Carry
sdiv    89     Signed Divide
smlal   88     Signed Multiply and Accumulate Long
smull   88     Signed Multiply Long
stm     65     Store Multiple Registers
str     64     Store Register
strex   69     Store Register Exclusive
sub     83     Subtract
swi     91     Software Interrupt
swp     68     Swap Register with Memory
teq     81     Test Equivalence
tst     81     Test Bits
udiv    89     Unsigned Divide
umlal   88     Unsigned Multiply and Accumulate Long
umull   88     Unsigned Multiply Long


4.5 Chapter Summary

The ARM Instruction Set Architecture includes 17 registers and four basic instruction types. This chapter introduced the instructions used for

 moving data from one register to another,

 performing computational operations with two source operands and one destination register,

 multiplication and division,

 performing comparisons, and

 performing special operations.

Most of the data processing instructions are three address instructions, because they involve two source operands and produce one result. For most instructions, the second source operand can be a register, a rotated or shifted register, or an immediate value. This flexibility results in a relatively powerful assembly language. In addition, almost all instructions can be executed conditionally, which, if used properly, results in very efficient and compact code.

Exercises

4.1 If r0 initially contains 1, what will it contain after the third instruction in the sequence below?

f04-20-9780128036983

4.2 What will r0 and r1 contain after each of the following instructions? Give your answers in base 10.

f04-21-9780128036983

4.3 What is the difference between lsr and asr?

4.4 Write the ARM assembly code to load the numbers stored in num1 and num2, add them together, and store the result in numsum. Use only r0 and r1.

4.5 Given the following variable definitions:

f04-22-9780128036983

where you do not know the values of x and y, write a short sequence of ARM assembly instructions to load the two numbers, compare them, and move the largest number into register r0.

4.6 Assuming that a is stored in register r0 and b is stored in register r1, show the ARM assembly code that is equivalent to the following C code.

f04-23-9780128036983

4.7 Without using the mul instruction, give the instructions to multiply r3 by the following constants, leaving the result in r0. You may also use r1 and r2 to hold temporary results, and you do not need to preserve the original contents of r3.

(a) 10

(b) 100

(c) 575

(d) 123

4.8 Assume that r0 holds the least significant 32 bits of a 64-bit integer a, and r1 holds the most significant 32 bits of a. Likewise, r2 holds the least significant 32 bits of a 64-bit integer b, and r3 holds the most significant 32 bits of b. Show the shortest instruction sequences necessary to:

(a) compare a to b, setting the CPSR flags,

(b) shift a left by one bit, storing the result in b,

(c) add b to a, and

(d) subtract b from a.

4.9 Write a loop to count the number of bits in r0 that are set to 1. Use any other registers that are necessary.

4.10 The C standard library provides the open() function, which is documented in the second section of the Linux manual pages. This function is a very small “wrapper” to allow C programmers to access the open() system call. Assembly programmers can access the system call directly. In ARM Linux, the system call number for open() is 5. The values for flag constants used with open() are defined in

/usr/include/bits/fcntl-linux.h.

Write the ARM assembly instructions and directives necessary to make a Linux system call to open a file named input.txt for reading, without using the C standard library. In other words, write the assembly equivalent of open("input.txt", O_RDONLY); using the swi instruction.

Chapter 5

Structured Programming

Abstract

This chapter first introduces the structured programming concepts and describes the principles of good software design. It then shows how the language elements covered in the previous three chapters are used to create the elements required by structured programming, giving comparative examples of these elements in C and assembly language. It covers programming elements for sequencing, selection, and iteration. Then it covers in greater detail how to access the standard C library functions from assembly language, and how to access assembly language functions from C. It then explains how automatic variables are allocated, and covers writing recursive functions in assembly language. Finally, it explains the implementation of C structs and shows how they can be accessed from assembly language, then covers arrays in the same way.

Keywords

Structured programming; Sequencing; Selection; Iteration; Loop; Subroutine; Function; Recursion; Struct; Aggregate data; Array

Before IBM released FORTRAN in 1957, almost all programming was done in assembly language. Part of the reason for this is that nobody knew how to design a good high-level language, nor did they know how to write a compiler to generate efficient code. Early attempts at high-level languages resulted in languages that were not well structured, difficult to read, and difficult to debug. The first release of FORTRAN was not a particularly elegant language by today’s standards, but it did generate efficient code.

In the 1960s, a new paradigm for designing high-level languages emerged. This new paradigm emphasized grouping program statements into blocks of code that execute from beginning to end. These basic blocks have only one entry point and one exit point. Control of which basic blocks are executed, and in what order, is accomplished with highly structured flow control statements. The structured program theorem provides the theoretical basis of structured programming. It states that there are three ways of combining basic blocks: sequencing, selection, and iteration. These three mechanisms are sufficient to express any computable function. It has been proven that all programs can be written using only basic blocks, the pre-test loop, and the if-then-else structure. Although most high-level languages provide additional statements for the convenience of the programmer, they are just "syntactic sugar." Other structured programming concepts include well-formed functions and procedures, pass-by-reference and pass-by-value, separate compilation, and information hiding.

These structured programming languages enabled programmers to become much more productive. Well-written programs that adhere to structured programming principles are much easier to write, understand, debug, and maintain. Most successful high-level languages are designed to enforce, or at least facilitate, good programming techniques. This is not generally true for assembly language. The burden of writing well-structured code lies with the programmer, not with the language.

The best assembly programmers rely heavily on structured programming concepts. Failure to do so results in code that contains unnecessary branch instructions and, in the worst cases, results in something called spaghetti code. Consider a code listing where a line has been drawn from each branch instruction to its destination. If the result looks like someone spilled a plate of spaghetti on the page, then the listing is spaghetti code. If a program is spaghetti code, then the flow of control is difficult to follow. Spaghetti code is much more likely to have bugs and is extremely difficult to debug. If the flow of control is too complex for the programmer to follow, then it cannot be adequately debugged. It is the responsibility of the assembly language programmer to write code that uses a block-structured approach.

Adherence to structured programming principles results in code that has a much higher probability of working correctly. Well-written code also has fewer branch statements, so the proportion of data processing statements to branch statements is higher. High data processing density results in higher throughput of data. In other words, writing code in a structured manner leads to higher efficiency.

5.1 Sequencing

Sequencing simply means executing statements (or instructions) in a linear sequence. When statement n is completed, statement n + 1 will be executed next. Uninterrupted sequences of statements form basic blocks. Basic blocks have exactly one entry point and one exit point. Flow control is used to select which basic block should be executed next.

5.2 Selection

The first control structure that we will examine is the basic selection construct. It is called selection because it selects one of the two (or possibly more) blocks of code to execute, based on some condition. In its most general form, the condition could be computed in a variety of ways, but most commonly it is the result of some comparison operation or the result of evaluating a Boolean expression.

Most languages support selection in the form of an if-then-else statement. Selection can be implemented very easily in ARM assembly language with a two-stage process:

1. perform an operation that updates the CPSR flags, and

2. use conditional execution to select a block of instructions to execute.

Because the ARM architecture supports conditional execution on almost every instruction, there are two basic ways to implement this control structure: by using conditional execution on all instructions in a block, or by using branch instructions. The conditional execution can be applied directly to instructions following the flag update, or to branch instructions that transfer execution to another location. Listing 5.1 shows a typical if-then-else statement in C.

f05-02-9780128036983
Listing 5.1 Selection in C.

5.2.1 Using Conditional Execution

Listing 5.2 shows the ARM code equivalent to Listing 5.1, using conditional execution. The then and else are written with one instruction each on lines 7 and 8. The then section is written as a conditional instruction with the lt condition attached. The else section is a single instruction with the opposite (ge) condition. Therefore only one of the two instructions will actually execute, depending on the results of the cmp instruction. If there are three or fewer instructions in each block that can be selected, then this is the preferred and most efficient method of writing the bodies of the then and else selections.

f05-03-9780128036983
Listing 5.2 Selection in ARM assembly using conditional execution.

5.2.2 Using Branch Instructions

Listing 5.3 shows the ARM code equivalent to Listing 5.1, using branch instructions. Note that this method requires a conditional branch, an unconditional branch, and two labels. If there are more than three instructions in either basic block, then this is the preferred and most efficient method of writing the bodies of the then and else selections.

f05-04-9780128036983
Listing 5.3 Selection in ARM assembly using branch instructions.

5.2.3 Complex Selection

More complex selection structures should be written with care. Listing 5.4 shows a fragment of C code which compares the variables a, b, and c, and sets the variable x to the least of the three values. In C, Boolean expressions use short-circuit evaluation. For example, consider the Boolean AND operator in the expression ((a<b)&&(a<c)). If the first sub-expression evaluates to false, then the truth value of the complete expression can be immediately determined to be false, so the second sub-expression is not evaluated. This usually results in the compiler generating very efficient assembly code. Good programmers can take advantage of short-circuiting by checking array bounds early in a Boolean expression and accessing array elements later in the expression. For example, the expression ((i<15)&&(array[i]<0)) makes sure that the index i is less than 15 before attempting to access the array. If the index is greater than 14, the array access will not take place. This prevents the program from attempting to access the 16th element of an array that has only 15 elements.

f05-05-9780128036983
Listing 5.4 Complex selection in C.

Listing 5.5 shows an ARM assembly code fragment which is equivalent to Listing 5.4. In this code fragment, r0 is used to store a temporary value for the variable x, and the value is only stored to memory once at the end of the fragment of code. The outer if-then-else statement is implemented using branch instructions. The first comparison is performed on line 8. If the comparison evaluates to false, then it immediately branches to the else block of the outer if-then-else statement. But if the first comparison evaluates to true, then it performs the second comparison. Again, if that comparison evaluates to false, then it branches to the else block of the outer if-then-else statement. If both comparisons evaluate to true, then it executes the then block of the outer if-then-else statement, and then branches to the statement following the else block.

f05-06-9780128036983
Listing 5.5 Complex selection in ARM assembly.

The if-then-else statement on line 5 of Listing 5.4 is implemented using conditional execution. The comparison is performed on line 13 of Listing 5.5. Lines 14 and 15 contain instructions that are conditionally executed. Since they have complementary conditions, it is guaranteed that one of them will move a value into r0. The comparison on line 13 determines which statement executes.

Note that the number of comparisons performed will always be minimized, and the number of branches has also been minimized. The only way that line 13 can be reached is if one of the first two comparisons evaluates to false. If line 2 is executed, then no matter which sequence of events occurs, the program fragment will always reach line 16 and a value will be stored in x. Thus, the ARM assembly code fragment in Listing 5.5 can be considered to be a block of code with exactly one entry point and one exit point.

When writing nested selection structures, it is important to maintain a block structure, even if the bodies of the blocks consist of only a single instruction. It is often very helpful to write the algorithm in pseudo-code or a high-level language, such as C or Java, before converting it to assembly. Prolific commenting of the code is also strongly encouraged.

5.3 Iteration

Iteration involves the transfer of control from a statement in a sequence to a previous statement in the sequence. The simplest type of iteration is the unconditional loop, also known as the infinite loop. This type of loop may be used in programs or tasks that should continue running indefinitely. Listing 5.6 shows an ARM assembly fragment containing an unconditional loop. Few high-level languages provide a true unconditional loop, but the high-level programmer can achieve a similar effect by using a conditional loop and specifying a condition that always evaluates to true.

f05-07-9780128036983
Listing 5.6 Unconditional loop in ARM assembly.

5.3.1 Pre-Test Loop

A pre-test loop is a loop in which a test is performed before the block of instructions forming the loop body is executed. If the test evaluates to true, then the loop body is executed. The last instruction in the loop body is a branch back to the beginning of the test. If the test evaluates to false, then execution branches to the first instruction following the loop body. All structured programming languages have a pre-test loop construct. For example, in C, the pre-test loop is called a while loop. In assembly, a pre-test loop is constructed very similarly to an if-then statement. The only difference is that it includes an additional branch instruction at the end of the sequence of instructions that form the body. Listing 5.7 shows a pre-test loop in ARM assembly.

f05-08-9780128036983
Listing 5.7 Pre-test loop in ARM assembly.

5.3.2 Post-Test Loop

In a post-test loop, the test is performed after the loop body is executed. If the test evaluates to true, then execution branches to the first instruction in the loop body. Otherwise, execution continues sequentially. Most structured programming languages have a post-test loop construct. For example, in C, the post-test loop is called a do-while loop. Listing 5.8 shows a post-test loop in ARM assembly. The body of a post-test loop will always be executed at least once.

f05-09-9780128036983
Listing 5.8 Post-test loop in ARM assembly.

5.3.3 For Loop

Many structured programming languages have a for loop construct, which is a type of counting loop. The for loop is not essential, and is only included as a matter of syntactical convenience. In some cases, a for loop is easier to write and understand than an equivalent pre-test or post-test loop. However, with the addition of an if-then construct, any loop can be implemented as a pre-test loop. The following sections show how loops can be converted from one form to another.

Pre-test conversion

Listing 5.9 shows a simple C program with a for loop. The program prints “Hello World” 10 times, appending an integer to the end of each line.

f05-10-9780128036983
Listing 5.9 for loop in C.

In order to write an equivalent program in assembly, the programmer must first rewrite the for loop as a pre-test loop. Listing 5.10 shows the program rewritten so that it is easier to translate into assembly. Note that the initialization of the loop variable has been moved to its own line before the while statement. Also, the loop variable is modified on the last line of the loop body. This is a straightforward conversion from one type of loop to another type. Listing 5.11 shows a translation of the pre-test loop structure into ARM assembly.

f05-11-9780128036983
Listing 5.10 for loop rewritten as a pre-test loop in C.
f05-12-9780128036983
Listing 5.11 Pre-test loop in ARM assembly.

Post-test conversion

If the programmer can guarantee that the body of a for loop will always execute at least once, then the for loop can be converted to an equivalent post-test loop. This form of loop is more efficient, because the loop control variable is tested one fewer time than in a pre-test loop. Also, a post-test loop requires only one label and one conditional branch instruction, whereas a pre-test loop requires two labels, a conditional branch, and an unconditional branch.

Since the loop in Listing 5.9 always executes the body exactly 10 times, we know that the body will always execute at least once. Therefore, the loop can be converted to a post-test loop. Listing 5.12 shows the program rewritten as a post-test loop so that it is easier to translate into assembly. Note that, as in the previous example, the initialization of the loop variable has been moved to its own line before the do-while loop, and the loop variable is modified on the last line of the loop body. This post-test version will produce the same output as the pre-test version. This is a straightforward conversion from one type of loop to an equivalent type. Listing 5.13 shows a straightforward translation of the post-test loop structure into ARM assembly.

f05-13-9780128036983
Listing 5.12 for loop rewritten as a post-test loop in C.
f05-14-9780128036983
Listing 5.13 Post-test loop in ARM assembly.

5.4 Subroutines

A subroutine is a sequence of instructions to perform a specific task, packaged as a single unit. Depending on the particular programming language, a subroutine may be called a procedure, a function, a routine, a method, a subprogram, or some other name. Some languages, such as Pascal, make a distinction between functions and procedures. A function must return a value and must not alter its input arguments or have any other side effects (such as producing output or changing static or global variables). A procedure returns no value, but may alter the value of its arguments or have other side effects.

Other languages, such as C, make no distinction between procedures and functions. In these languages, functions may be described as pure or impure. A function is pure if:

1. the function always evaluates the same result value when given the same argument value(s), and

2. evaluation of the result does not cause any semantically observable side effect or output.

The first condition implies that the result of the function cannot depend on any hidden information or state that may change as program execution proceeds, or between different executions of the program, nor can it depend on any external input from I/O devices. The result value of a pure function does not depend on anything other than the argument values. If the function returns multiple result values, then these two conditions must apply to all returned values. Otherwise the function is impure. Another way to state this is that impure functions have side effects while pure functions have no side effects.

Assembly language does not impose any distinction between procedures and functions, pure or impure. Although every assembly language will provide a way to call subroutines and return from them, it is up to the programmer to decide how to pass arguments to the subroutines and how to pass return values back to the section of code that called the subroutine. Once again, the expert assembly programmer will use structured programming concepts to write efficient, readable, debugable, and maintainable code.

5.4.1 Advantages of Subroutines

Subroutines help programmers to design reliable programs by decomposing a large problem into a set of smaller problems. It is much easier to write and debug a set of small code pieces than it is to work on one large piece of code. Careful use of subroutines will often substantially reduce the cost of developing and maintaining a large program, while increasing its quality and reliability. The advantages of breaking a program into subroutines include:

 enabling reuse of code across multiple programs,

 reducing duplicate code within a program,

 enabling the programming task to be divided between several programmers or teams,

 decomposing a complex programming task into simpler steps that are easier to write, understand, and maintain,

 enabling the programming task to be divided into stages of development, to match various stages of a project, and

 hiding implementation details from users of the subroutine (a programming principle known as information hiding).

5.4.2 Disadvantages of Subroutines

There are two minor disadvantages in using subroutines. First, invoking a subroutine (versus using in-line code) imposes overhead. The arguments to the subroutine must be put into some known location where the subroutine can find them. If the subroutine is a function, then the return value must be put into a known location where the caller can find it. Also, a subroutine typically requires some standard entry and exit code to manage the stack and save and restore the return address.

In most languages, the cost of using subroutines is hidden from the programmer. In assembly, however, the programmer is often painfully aware of the cost, since they have to explicitly write the entry and exit code for each subroutine, and must explicitly write the instructions to pass the data into the subroutine. However, the advantages usually outweigh the costs. Assembly programs can get very large and failure to modularize the code by using subroutines will result in code that cannot be understood or debugged, much less maintained and extended.

5.4.3 Standard C Library Functions

Subroutines may be defined within a program, or a set of subroutines may be packaged together in a library. Libraries of subroutines may be used by multiple programs, and most languages provide some built-in library functions. The C language has a very large set of functions in the C standard library. All of the functions in the C standard library are available to any program that has been linked with the C standard library. Even assembly programs can make use of this library. Linking is done automatically when gcc is used to assemble the program source. All that the programmer needs to know is the name of the function and how to pass arguments to it.

5.4.4 Passing Arguments

Listing 5.14 shows a very simple C program which reads an integer from standard input using scanf and prints the integer to standard output using printf. An equivalent program written in ARM assembly is shown in Listing 5.15. These examples show how arguments can be passed to subroutines in C and equivalently in assembly language.

f05-15-9780128036983
Listing 5.14 Calling scanf and printf in C.
f05-16-9780128036983
Listing 5.15 Calling scanf and printf in ARM assembly.

All processor families have their own standard methods, or function calling conventions, which specify how arguments are passed to subroutines and how function values are returned. The function call standard allows programmers to write subroutines and libraries of subroutines that can be called by other programmers. In most cases, the function calling standards are not enforced by hardware, but assembly programmers and compiler writers conform to the standards in order to make their code accessible to other programmers. The basic subroutine calling rules for the ARM processor are simple:

 The first four arguments go in registers r0-r3.

 Any remaining arguments are pushed to the stack.

If the subroutine returns a value, then it is stored in r0 before the function returns to its caller. Calling a subroutine in ARM assembly usually requires several lines of code. The number of lines required depends on how many arguments the subroutine requires and where the data for those arguments are stored. Some variables may already be in the correct register. Others may need to be moved from one register to another. Still others may need to be pushed onto the stack. Careful programming is required to minimize the amount of work that must be done just to move the subroutine arguments into their required locations.

The ARM register set was introduced in Chapter 3. Some registers have special purposes that are dictated by the hardware design. Others have special purposes that are dictated by programming conventions. Programmers follow these conventions so that their subroutines are compatible with each other. These conventions are simply a set of rules for how registers should be used. In ARM assembly, all registers have alternate names which can be used to help remember the rules for using them. Fig. 5.1 shows an expanded view of the ARM registers, including their alternate names and conventional use.

f05-01-9780128036983
Figure 5.1 ARM user program registers.

Registers r0-r3 are also known as a1-a4, because they are used for passing arguments to subroutines. Registers r4-r11 are also known as v1-v8, because they are used for holding local variables in a subroutine. As mentioned in Section 3.2, register r11 can also be referred to as fp because it is used by the C compiler to track the stack frame, unless the code is compiled using the -fomit-frame-pointer command line option.

The intra-procedure scratch register, r12, is used by the C library when calling dynamically linked functions. If a subroutine does not call any C library functions, then it can use r12 as another register to store local variables. If a C library function is called, it may change the contents of r12. Therefore, if r12 is being used to store a local variable, it should be saved to another register or to the stack before a C library function is called.

5.4.5 Calling Subroutines

The stack pointer (sp), link register (lr), and program counter (pc), along with the argument registers, are all involved in performing subroutine calls. The calling subroutine must place arguments in the argument registers, and possibly on the stack as well. Placing the arguments in their proper locations is known as marshaling the arguments. After marshaling the arguments, the calling subroutine executes the bl instruction, which will modify the program counter and link register. The bl instruction copies the contents of the program counter to the link register, then loads the program counter with the address of the first instruction in the subroutine that is being called. The CPU will then fetch and execute its next instruction from the address in the program counter, which is the first instruction of the subroutine that is being called.

Our first examples of calling a function will involve the printf function from the C standard library. The printf function can be a bit confusing at first, but it is an extremely useful and flexible function for printing formatted output. The printf function examines its first argument to determine how many other arguments have been passed to it. The first argument is a format string, which is a null-terminated ASCII string. The format string may include conversion specifiers, which start with the % character. For each conversion specifier, printf assumes that an argument has been passed in the correct register or location on the stack. The argument is retrieved, converted according to the specified format, and printed. For example, %d prints the matching argument as a signed integer in base 10. Other specifiers include %X to print the matching argument as an integer in hexadecimal, %c to print the matching argument as an ASCII character, and %s to print a zero-terminated string. The integer specifiers can include an optional width and zero-padding specification. For example, %8X will print an integer in hexadecimal using eight characters, padding on the left with spaces. The format string %08X will also print an integer in hexadecimal using eight characters, but will pad on the left with zeros. Similarly, %15d can be used to print an integer in base 10 using spaces to pad the number up to 15 characters, while %015d will print an integer in base 10 using zeros to pad up to 15 characters.

Listing 5.16 shows a call to printf in C. The printf function requires one argument, and can accept more than one. In this case, there is only one argument, the format string. Listing 5.17 shows an equivalent call made in ARM assembly language. The single argument is loaded into r0 in conformance with the ARM subroutine calling convention.

f05-17-9780128036983
Listing 5.16 Simple function call in C.
f05-18-9780128036983
Listing 5.17 Simple function call in ARM assembly.

Passing arguments in registers

Listing 5.18 shows a call to printf in C having four arguments. The format string is the first argument. The format string contains three conversion specifiers, and is followed by three more arguments. Arguments are matched to conversion specifiers according to their positions. The type of each argument matches the type indicated in the conversion specifier. The first conversion specifier is applied to the second argument, the second conversion specifier is applied to the third argument, and the third conversion specifier is applied to the fourth argument. The %d conversion specifiers indicate that the arguments are to be interpreted as integers and printed in base 10.

f05-19-9780128036983
Listing 5.18 A larger function call in C.

Listing 5.19 shows an equivalent call made in ARM assembly language. The arguments are loaded into r0-r3 in conformance with the ARM subroutine calling convention. Note that we assume that formatstr has previously been defined using a .asciz or .string assembler directive or equivalent method. As long as there are four or fewer arguments that must be passed, they can all fit in registers r0-r3 (a.k.a. a1-a4), but when there are more arguments, things become a little more complicated. Any remaining arguments must be passed on the program stack, using the stack pointer r13. Care must be taken to ensure that the arguments are pushed to the stack in the proper order. Also, after the function call, the arguments must be removed from the stack, so that the stack pointer is restored to its original value.

f05-20-9780128036983
Listing 5.19 A larger function call in ARM assembly.

Passing arguments on the stack

Listing 5.20 shows a call to printf in C having more than four arguments. The format string is the first argument. The format string contains five conversion specifiers, which implies that the format string must be followed by five additional arguments. Arguments are matched to conversion specifiers according to their positions. The type of each argument matches the type indicated in the conversion specifier. The first conversion specifier is applied to the second argument, the second conversion specifier is applied to the third argument, the third conversion specifier is applied to the fourth argument, etc. The %d conversion specifiers indicate that the arguments are to be interpreted as integers and printed in base 10.

f05-21-9780128036983
Listing 5.20 A function call using the stack in C.

Listing 5.21 shows an equivalent call made in ARM assembly language. Since there are six arguments, the last two must be pushed to the program stack. Each of these arguments is loaded into r0 in turn, and the pre-indexed addressing mode is used to subtract four bytes from the stack pointer and then store the argument at the new top of the stack. Note that the sixth argument is pushed to the stack first, followed by the fifth argument. The remaining arguments are loaded into r0-r3. Note that we assume that formatstr has previously been defined using a .asciz or .string assembler directive.

f05-22-9780128036983
Listing 5.21 A function call using the stack in ARM assembly.

Listing 5.22 shows how the fifth and sixth arguments can be pushed to the stack using a single stmfd instruction. The sixth argument is loaded into r3 and the fifth argument is loaded into r0, then the stmfd instruction is used to store them on the stack and adjust the stack pointer. A little care must be taken to ensure that the arguments are stored in the correct order on the stack. Remember that the stmfd instruction will always push the lowest-numbered register to the lowest address, and the stack grows downward. Therefore, r3, the sixth argument, will be pushed onto the stack first, making it grow downward by four bytes. Next, r0 is pushed, making the stack grow downward by four more bytes. As in the previous example, the remaining four arguments are loaded into a1-a4.

f05-23-9780128036983
Listing 5.22 A function call using stm to push arguments onto the stack.

After the printf function is called, the fifth and sixth arguments must be popped from the stack. If those values are no longer needed, then there is no need to load them into registers. The quickest way to pop them from the stack is to simply adjust the stack pointer back to its original value. In this case, we pushed two arguments onto the stack, using a total of eight bytes. Therefore, all we need to do is add eight to the stack pointer, thereby restoring its original value.

5.4.6 Writing Subroutines

We have looked at the conventions that are followed for calling functions. Now we will examine these same conventions from the point of view of the function being called. Because of the calling conventions, the programmer writing a function can assume that

 the first four arguments are in r0-r3,

any additional arguments can be accessed with ldr rd, [sp, #offset],

 the calling function will remove arguments from the stack, if necessary,

if the function return type is not void, then the caller expects the return value to be in r0 (and possibly r1, r2, and r3), and

 the return address will be in lr.

Also because of the conventions, there are certain registers that can be used freely while others must be preserved or restored so that the calling function can continue operating correctly. Registers which can be used freely are referred to as volatile, and registers which must be preserved or restored before returning are referred to as non-volatile. When writing a subroutine (function),

 registers r0-r3 and r12 are volatile,

 registers r4-r11 and r13 are non-volatile (they can be used, but their contents must be restored to their original value before the function returns),

 register r14 can be used by the function, but its contents must be saved so that the return address can be loaded into r15 when the function returns to its caller,

 if the function calls another function, then it must save register r14 either on the stack or in a non-volatile register before making the call.

Listing 5.23 shows a small C function that simply returns the sum of its six arguments. The ARM assembly version of that function is shown in Listing 5.24. Note that on line 5, the fifth argument is loaded from the stack, and on line 7, the sixth argument is loaded in a similar way, using an offset from the stack pointer. If the calling function has followed the conventions, then the fifth and sixth arguments will be where they are expected to be in relation to the stack pointer.

f05-24-9780128036983
Listing 5.23 A small function in C.
f05-25-9780128036983
Listing 5.24 A small function in ARM assembly.

5.4.7 Automatic Variables

In block-structured high-level languages, an automatic variable is a variable that is local to a block of code and not declared with static duration. It has a lifetime that lasts only as long as its block is executing. Automatic variables can be stored in one of two ways:

1. the stack is temporarily adjusted to hold the variable, or

2. the variable is held in a register during its entire life.

When writing a subroutine in assembly, it is the responsibility of the programmer to decide what automatic variables are required and where they will be stored. In high-level languages this decision is usually made by the compiler. In some languages, including C, it is possible to request that an automatic variable be held in a register. The compiler will attempt to comply with the request, but it is not guaranteed. Listing 5.25 shows a small function which requests that one of its variables be kept in a register instead of on the stack.

f05-26-9780128036983
Listing 5.25 A small C function with a register variable.

Listing 5.26 shows how the function could be implemented in assembly. Note that the array of integers consumes 80 bytes of storage on the stack, and could not possibly fit into the registers available on the ARM processor. However, the loop control variable can easily be stored in one of the registers for the duration of the function. Also notice that on line 1 the storage for the array is allocated simply by adjusting the stack pointer, and on line 9 the storage is released by restoring the stack pointer to its original contents. It is critical that the stack pointer be restored, no matter how the function returns. Otherwise, the calling function will probably mysteriously fail. For this reason, each function should have exactly one block of instructions for returning. If the function needs to return from some location other than the end, then it should branch to the return block rather than returning directly.

f05-27-9780128036983
Listing 5.26 Automatic variables in ARM assembly.

5.4.8 Recursive Functions

A function that calls itself is said to be recursive. Certain problems are easy to implement recursively, but are more difficult to solve iteratively. A problem exhibits recursive behavior when it can be defined by two properties:

1. a simple base case (or cases), and

2. a set of rules that reduce all other cases toward the base case.

For example, we can define a person’s ancestors recursively as follows:

1. one’s parents are one’s ancestors (base case),

2. the ancestors of one’s ancestors are also one’s ancestors (recursion step).

Recursion is a very powerful concept in programming. Many functions are naturally recursive, and can be expressed very concisely in a recursive way. Numerous mathematical axioms are based upon recursive rules. For example, the formal definition of the natural numbers by the Peano axioms can be formulated as:

1. 0 is a natural number, and

2. each natural number has a successor, which is also a natural number.

Using one base case and one recursive rule, it is possible to generate the set of all natural numbers. Other recursively defined mathematical objects include functions and sets.

Listing 5.27 shows the C code for a small program which uses recursion to reverse the order of characters in a string. The base case where recursion ends is when there are fewer than two characters remaining to be swapped. The recursive rule is that the reverse of a string can be created by swapping the first and last characters and then reversing the string between them. In short, a string is reversed if:

f05-28-9780128036983
Listing 5.27 A C program that uses recursion to reverse a string.

1. the string has a length of zero or one character, or

2. the first and last characters have been swapped and the remaining characters have been reversed.

In Listing 5.27, line 3 checks for the base case. If the string has not been reversed according to the first rule, then the second rule is applied. Lines 5–7 swap the first and last characters, and line 8 recursively reverses the characters between them.

Listing 5.28 shows how the reverse function can be implemented using recursion in ARM assembly. Line 1 saves the link register to the stack and decrements the stack pointer. Next, storage is allocated for an automatic variable. Lines 3 and 4 test for the base case. If the current case is the base case, then the function simply returns (restoring the stack as it goes). Otherwise, the first and last characters are swapped in lines 5 through 10 and a recursive call is made in lines 11 through 13.

f05-29-9780128036983
Listing 5.28 ARM assembly implementation of the reverse function.

The code in Listing 5.28 can be made a bit more efficient. First, the test for the base case can be performed before anything else is done, as shown in Listing 5.29. Also, the local variable tmp can be stored in a volatile register rather than stored on the stack, because it is only needed for lines 4 through 8. It is not needed after the recursive call, so there is really no need to preserve it on the stack. This means that our function can use half as much stack space and will run much faster. This further refined version is shown in Listing 5.30. This version uses ip (r12) as the tmp variable instead of using the stack.

f05-30-9780128036983
Listing 5.29 Better implementation of the reverse function.
f05-31-9780128036983
Listing 5.30 Even better implementation of the reverse function.

The previous examples used the concept of an array of characters to access the string that is being reversed. Listing 5.31 shows how this problem can be solved in C using pointers to the first and last characters rather than array indices. This version only has two parameters in the reverse function, and uses pointer dereferencing rather than array indexing to access each character. Other than that difference, it works the same as the original version. Listing 5.32 shows how the reverse function can be implemented efficiently in ARM assembly. This implementation has the same number of instructions as the previous version, but lines 4 through 7 use a different addressing mode. On the ARM processor, the pointer method and the array index method are equally efficient. However, many processors do not have the rich set of addressing modes available on the ARM. On those processors, the pointer method may be significantly more efficient.

f05-32-9780128036983
Listing 5.31 String reversing in C using pointers.
f05-33-9780128036983
Listing 5.32 String reversing in assembly using pointers.

5.5 Aggregate Data Types

An aggregate data item can be referenced as a single entity, and yet consists of more than one piece of data. Aggregate data types are used to keep related data together, so that the programmer’s job becomes easier. Some examples of aggregate data are arrays, structures or records, and objects. In most programming languages, aggregate data types can be defined to create higher-level structures. Most high-level languages allow aggregates to be composed of basic types as well as other aggregates. Proper use of structured data helps to make programs less complicated and easier to understand and maintain.

In high-level languages, there are several benefits to using aggregates. Aggregates make the relationships between data clear, and allow the programmer to perform operations on blocks of data. Aggregates also make passing parameters to functions simpler and easier to read.

5.5.1 Arrays

The most common aggregate data type is an array. An array contains zero or more values of the same data type, such as characters, integers, floating point numbers, or fixed point numbers. An array may also contain values of another aggregate data type. Every element in an array must have the same type. Each data item in an array can be accessed by its array index.

Listing 5.33 shows how an array can be allocated and initialized in C. Listing 5.34 shows the equivalent code in ARM assembly. Note that in this case, the scaled register offset addressing mode was used to access each element in the array. This mode is often convenient when the size of each element in the array is an integer power of 2. If that is not the case, then it may be necessary to use a different addressing mode. An example of this will be given in Section 5.5.3.

f05-34-9780128036983
Listing 5.33 Initializing an array of integers in C.
f05-35-9780128036983
Listing 5.34 Initializing an array of integers in assembly.

5.5.2 Structured Data

The second common aggregate data type is implemented as the struct in C or the record in Pascal. It is commonly referred to as a structured data type or a record. This data type can contain multiple fields. The individual fields in the structured data may also be referred to as structured data elements, or simply elements. In most high-level languages, each element of a structured data type may be one of the base types, an array type, or another structured data type. Listing 5.35 shows how a struct can be declared, allocated, and initialized in C. Listing 5.36 shows the equivalent code in ARM assembly.

f05-36-9780128036983
Listing 5.35 Initializing a structured data type in C.
f05-37-9780128036983
Listing 5.36 Initializing a structured data type in ARM assembly.

Care must be taken when using assembly to access data structures that were declared in higher-level languages such as C and C++. The compiler will typically pad a data structure to ensure that the data fields are aligned for efficiency. On most systems, it is more efficient for the processor to access word-sized data if the data is aligned to a word boundary. Some processors simply cannot load or store a word from an address that is not on a word boundary, and attempting to do so will result in an exception. The assembly programmer must somehow determine the relative address of each field within the higher-level language structure. One way that this can be accomplished in C is by writing a small function which prints out the offsets to each field in the C structure. The offsets can then be used to access the fields of the structure from assembly language. Another method for finding the offsets is to run the program under a debugger and examine the data structure.

5.5.3 Arrays of Structured Data

It is often useful to create arrays of structured data. For example, a color image may be represented as a two-dimensional array of pixels, where each pixel consists of three integers which specify the amount of red, green, and blue that are present in the pixel. Typically, each of the three values is represented using an unsigned eight bit integer. Image processing software often adds a fourth value, α, specifying the transparency of each pixel.

Listing 5.37 shows how an array of pixels can be allocated and initialized in C. The listing uses the malloc() function from the C standard library to allocate storage for the pixels from the heap (see Section 1.4). Note that the code uses the sizeof operator to determine how many bytes of memory are consumed by a single pixel, then multiplies that by the width and height of the image. Listing 5.38 shows the equivalent code in ARM assembly.

f05-38-9780128036983
Listing 5.37 Initializing an array of structured data in C.
f05-39-9780128036983
Listing 5.38 Initializing an array of structured data in assembly.

Note that the code in Listing 5.38 is far from optimal. It can be greatly improved by combining the two loops into one loop. This will remove the need for the multiply on line 28 and the addition on line 29, and will simplify the code structure. An additional improvement would be to increment the single loop counter by three on each loop iteration, making it very easy to calculate the pointer for each pixel. Listing 5.39 shows the ARM assembly implementation with these optimizations.

f05-40-9780128036983
Listing 5.39 Improved initialization in assembly.

Although the implementation shown in Listing 5.39 is more efficient than the previous version, there are several more improvements that can be made. If we consider that the goal of the code is to allocate some number of bytes and initialize them all to zero, then the code can be written more efficiently. Rather than using three separate store instructions to set 3 bytes to zero on each iteration of the loop, why not use a single store instruction to set four bytes to zero on each iteration? The only problem with this approach is that we must consider the possibility that the array may end in the middle of a word. However, this can be dealt with by using two consecutive loops. The first loop sets one word of the array to zero on each iteration, and the second loop finishes off any remaining bytes. Listing 5.40 shows the results of these additional improvements. This third implementation will run much faster than the previous implementations.

f05-41-9780128036983
Listing 5.40 Very efficient initialization in assembly.

5.6 Chapter Summary

Spaghetti code is the bane of assembly programming, but it can easily be avoided. Although assembly language does not enforce structured programming, it does provide the low-level mechanisms required to write structured programs. The assembly programmer must be aware of, and assiduously practice, proper structured programming techniques. The burden of writing properly structured code blocks, with selection structures and iteration structures, lies with the programmer, and failure to apply structured programming techniques will result in code that is difficult to understand, debug, and maintain.

Subroutines provide a way to split programs into smaller parts, each of which can be written and debugged individually. This allows large projects to be divided among team members. In assembly language, defining and using subroutines is not as easy as in higher level languages. However, the benefits usually outweigh the costs. The C library provides a large number of functions. These can be accessed by an assembly program as long as it is linked with the C standard library.

Assembly provides the mechanisms to access aggregate data types. Arrays can be accessed using various addressing modes on the ARM processor. The pre-indexing and post-indexing modes allow array elements to be accessed using pointers, with the pointers being incremented after each element access. Fields in structured data records can be accessed using immediate offset addressing mode. The rich set of addressing modes available on the ARM processor allows the programmer to use aggregate data types more efficiently than on most processors.

Exercises

5.1 What does it mean for a register to be volatile? Which ARM registers are considered volatile according to the ARM function calling convention?

5.2 Fully explain the differences between static variables and automatic variables.

5.3 In ARM assembly language, write a function that is equivalent to the following C function.

f05-42-9780128036983

5.4 What are the two places where an automatic variable can be stored?

5.5 You are writing a function and you decided to use registers r4 and r5 within the function. Your function will not call any other functions; it is self-contained. Modify the following skeleton structure to ensure that r4 and r5 can be used within the function and are restored to comply with the ARM standards, but without unnecessary memory accesses.

f05-43-9780128036983

5.6 Convert the following C program to ARM assembly, using a post-test loop:

f05-44-9780128036983

5.7 Write a complete ARM function to shift a 64-bit value left by any given amount between 0 and 63 bits. The function should expect its arguments to be in registers r0, r1, and r2. The lower 32 bits of the value are passed in r0, the upper 32 bits of the value are passed in r1, and the shift amount is passed in r2.

5.8 Write a complete subroutine in ARM assembly that is equivalent to the following C subroutine.

f05-45-9780128036983

5.9 Write a complete function in ARM assembly that is equivalent to the following C function.

f05-46-9780128036983f05-47-9780128036983

5.10 Write an ARM assembly function to calculate the average of an array of integers, given a pointer to the array and the number of items in the array. Your assembly function must implement the following C function prototype:

int average(int *array, int number_of_items);

Assume that the processor does not support the div instruction, but there is a function available to divide two integers. You do not have to write this function, but you may need to call it. Its C prototype is:

int divide(int numerator, int denominator);

5.11 Write a complete function in ARM assembly that is equivalent to the following C function. Note that a and b must be allocated on the stack, and their addresses must be passed to scanf so that it can place their values into memory.

f05-48-9780128036983

5.12 The factorial function can be defined as:

x! = 1 if x ≤ 1, and x! = x × (x − 1)! otherwise.

The following C program repeatedly reads x from the user and calculates x!. It quits when it reads end-of-file or when the user enters a negative number or something that is not an integer.
Write this program in ARM assembly.

f05-49-9780128036983

5.13 For large x, the factorial function is slow. However, a lookup table can be added to the function to improve average performance. This technique is commonly known as memoization or tabling, but is sometimes called dynamic programming. The following C implementation of the factorial function uses memoization. Modify your ARM assembly program from the previous problem to include memoization.

f05-50-9780128036983f05-51-9780128036983
Chapter 6

Abstract Data Types

Abstract

This chapter extends the coverage of structured programming to include abstract data types (ADT). It begins by defining abstract data types and giving a small example of an ADT that could be used to read, process, and write Netpbm images. The next section introduces an ADT written in C to perform word frequency counts, and shows how performance can be greatly improved by using better algorithms and/or by writing some functions in assembly language. It also shows how a binary tree structure created by C code can be traversed in assembly language. The chapter ends with an ethics case study about the Therac-25 cancer treatment device.

Keywords

Abstract data type; Word frequency count; Binary tree; Index; Sort; Ethics

An abstract data type (ADT) is composed of data and the operations that work on that data. The ADT is one of the cornerstones of structured programming. Proper use of ADTs has many benefits. Most importantly, abstract data types help to support information hiding. A software module hides information by encapsulating the information into a module or other construct which presents an interface. The interface typically consists of the names of data types provided by the ADT and a set of subroutine definitions, or prototypes, for operating on the data types. The implementation of the ADT is hidden from the client code that uses the ADT.

A common use of information hiding is to hide the physical storage layout for data so that if it is changed, the change is restricted to a small subset of the total program. For example, if a three-dimensional point (x,y,z) is represented in a program with three floating point scalar variables, and the representation is later changed to a single array variable of size three, a module designed with information hiding in mind would protect the remainder of the program from such a change.

Information hiding reduces software development risk by shifting the code’s dependency on an uncertain implementation onto a well-defined interface. Clients of the interface perform operations purely through the interface, which does not change. If the implementation changes, the client code does not have to change.

Encapsulating software and data structures behind an interface allows the construction of objects that mimic the behavior and interactions of objects in the real world. For example, a simple digital alarm clock is a real-world object that most people can use and understand. They can understand what the alarm clock does, and how to use it through the provided interface (buttons and display) without needing to understand every part inside of the clock. If the internal circuitry of the clock were to be replaced with a different implementation, people could continue to use it in the same way, provided that the interface did not change.

6.1 ADTs in Assembly Language

As with all other structured programming concepts, ADTs can be implemented in assembly language. In fact, most high-level compilers convert structured programming code into assembly during compilation. All that is required is that the programmer define the data structure(s), and the set of operations that can be used on the data. Listing 6.1 gives an example of an ADT interface in C. The type Image is not fully defined in the interface. This prevents client software from accessing the internal structure of the image data type. Therefore, programmers using the ADT can modify images only by using the provided functions. Other structured programming and object-oriented programming languages such as C++, Java, Pascal, and Modula 2 provide similar protection for data structures so that client code can access the data structure only through the provided interface. Note that only the pval definition is exposed, indicating to client programs that the red, green, and blue components of a pixel must be a number between 0 and 255. In C, as with other structured programming languages, the implementation of the subroutines can also be hidden by placing them in separate compilation modules. Those modules will have access to the internal structure of the Image data type.

f06-04-9780128036983
Listing 6.1 Definition of an Abstract Data Type in a C header file.

Assembly language does not have the ability to define a data structure as such, but it does provide the mechanisms needed to specify the location of each field with respect to the beginning of a data structure, as well as the overall size of the data structure. With a little thought and effort, it is possible to implement ADTs in assembly language. Listing 6.2 shows the private implementation of the Image data type, which is included by the C files that implement the Image data type. Listing 6.3 shows how the data structures from the previous listings can be defined in assembly language. With those definitions, any of the functions declared in Listing 6.1 can be written in assembly language.

f06-05-9780128036983
Listing 6.2 Definition of the image structure may be hidden in a separate header file.
f06-06-9780128036983
Listing 6.3 Definition of an ADT in Assembly.

6.2 Word Frequency Counts

Counting the frequency of words in written text has several uses. In digital forensics, it can be used to provide evidence as to the author of written communications. Different people have different vocabularies, and use words with differing frequency. Word counts can also be used to classify documents by type. Scientific articles from different fields contain words specific to that field, and historical novels will differ from western novels in word frequency.

Listing 6.4 shows the main function for a simple C program which reads a text file and creates a list of all the words contained in the file, along with their frequency of occurrence. The program has been divided into two parts: the main program, and an ADT which is used to keep track of the words and their frequencies, and to print a table of word frequencies.

f06-07a-9780128036983f06-07b-9780128036983
Listing 6.4 C program to compute word frequencies.

The interface for the ADT is shown in Listing 6.5. There are several ways that the ADT could be implemented. Note that the interface given in the header file does not show the internal fields of the word list data type. Thus, any file which includes this header is allowed to declare pointers to wordlist data types, but cannot access or modify any internal fields. The list of words could be stored in an array, a linked list, a binary tree, or some other data structure. The subroutines could be implemented in C or in some other language, including assembly. Listing 6.6 shows an implementation in C using a linked list. Note that the function for printing the word frequency list in numerical order has not been implemented. It will be written in assembly language. Since the program is split into multiple files, it is a good idea to use the make utility to build the executable program. A basic makefile is shown in Listing 6.7.

f06-08-9780128036983
Listing 6.5 C header for the wordlist ADT.
f06-09a-9780128036983f06-09b-9780128036983f06-09c-9780128036983
Listing 6.6 C implementation of the wordlist ADT.
f06-10-9780128036983
Listing 6.7 Makefile for the wordfreq program.

Suppose we wish to implement one of the functions from Listing 6.6 in ARM assembly language. We would delete the function from the C file, create a new file with the assembly version of the function, and modify the makefile so that the new file is included in the program. The header file and the main program file would not require any changes. The header file provides function prototypes that the C compiler uses to determine how parameters should be passed to the functions. As long as our new assembly function conforms to its C header definition, the program will work correctly.

6.2.1 Sorting by Word Frequency

The linked list is created in alphabetical order, but the wl_print_numerical() function is required to print it sorted in descending order of occurrence count. There are several ways in which this could be accomplished, with varying levels of efficiency. The possible approaches include, but are not limited to:

 Re-ordering the linked list using an insertion sort: This approach builds a completely new list by removing each item, one at a time, from the original list and inserting it into a new list sorted by the number of occurrences rather than by the words themselves. The time complexity of this approach is O(N²), but it requires no additional storage. However, if the list were later needed in alphabetical order, or if any more words were to be added, then it would need to be re-sorted into the original order.
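The insertion-sort approach can be sketched in C. This is a minimal illustration, not the book's code: the wnode type and its fields are hypothetical stand-ins for the wordlist nodes, which would also carry the word itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node layout; the book's wordlistnode also holds the word. */
struct wnode { int count; struct wnode *next; };

/* Remove each node from the original list and insert it into a new list
   kept in descending order of count: O(N^2) time, no extra storage. */
struct wnode *sort_by_count(struct wnode *list) {
    struct wnode *sorted = NULL;
    while (list != NULL) {
        struct wnode *item = list;       /* detach the head node     */
        list = list->next;
        struct wnode **p = &sorted;      /* find its insertion point */
        while (*p != NULL && (*p)->count > item->count)
            p = &(*p)->next;
        item->next = *p;                 /* splice it in             */
        *p = item;
    }
    return sorted;
}
```

Because every node is re-linked into the new list, the original alphabetical ordering is destroyed, exactly as the text warns.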

 Sorting the linked list using a merge sort algorithm: Merge sort is one of the most efficient sorting algorithms known, and it applies well to data stored in files and linked lists. The merge sort works as follows:

1. The sub-list size, i, is set to 1.

2. The list is divided into sub-lists, each containing i elements. Each sub-list is assumed to be sorted. (A sub-list of length one is sorted by definition.)

3. The sub-lists are merged together to create a list of sub-lists of size 2i, where each sub-list is sorted.

4. The sub-list size, i, is set to 2i.

5. The process is repeated from step 2 until i ≥ N, where N is the number of items to be sorted.

The time complexity for the merge sort algorithm is O(N log N), which is far more efficient than the insertion sort. This approach would also require no additional storage. However, if the list were later needed in alphabetical order, or any more words were to be added, then it would need to be re-sorted into the original alphabetical order.
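The steps above can be sketched in C as a bottom-up merge sort over a singly linked list. This is an illustrative sketch, not the book's implementation; the node type with a bare integer key is a hypothetical stand-in for the wordlist nodes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type with a bare integer key for illustration. */
struct node { int key; struct node *next; };

/* Merge two sorted lists into one sorted list. */
static struct node *merge(struct node *a, struct node *b) {
    struct node head, *tail = &head;
    while (a && b) {
        if (a->key <= b->key) { tail->next = a; a = a->next; }
        else                  { tail->next = b; b = b->next; }
        tail = tail->next;
    }
    tail->next = a ? a : b;
    return head.next;
}

/* Detach a run of up to n nodes from *list, advancing *list past it. */
static struct node *split(struct node **list, int n) {
    struct node *run = *list, *p = run;
    while (p && --n > 0) p = p->next;     /* walk at most n-1 links */
    if (p) { *list = p->next; p->next = NULL; }
    else   { *list = NULL; }
    return run;
}

/* Bottom-up merge sort: double the sub-list size i until i >= N. */
struct node *merge_sort(struct node *list) {
    for (int i = 1; ; i *= 2) {
        struct node *rest = list, *sorted = NULL, **tail = &sorted;
        int nmerges = 0;
        while (rest) {
            struct node *a = split(&rest, i);   /* sub-list of size i */
            struct node *b = split(&rest, i);   /* next sub-list      */
            nmerges++;
            *tail = merge(a, b);                /* merged size-2i run */
            while (*tail) tail = &(*tail)->next;
        }
        list = sorted;
        if (nmerges <= 1) return list;  /* one sub-list left: i >= N */
    }
}
```

Each pass over the list merges pairs of sorted sub-lists, so after ⌈log₂ N⌉ passes the whole list is sorted.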

 Create an index, and sort the index rather than rebuilding the list. Since the number of elements in the list is known, we can allocate an array of pointers. Each pointer in the array is then initialized to point to one element in the linked list. The array forms an index, and the pointers in the array can be re-sorted into any desired order, using any common sorting method such as bubble sort (O(N²)), in-place insertion sort (O(N²)), quick sort (O(N log N)), or others. This approach requires additional storage, but has the advantage that it does not need to modify the original linked list.
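The index approach can be sketched in C using the standard library's qsort(). The lnode type is again a hypothetical stand-in; the book's assembly version would perform the sort itself rather than calling a library routine.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical node layout for illustration. */
struct lnode { int count; struct lnode *next; };

/* qsort comparator: descending by count. */
static int by_count_desc(const void *a, const void *b) {
    const struct lnode *x = *(const struct lnode *const *)a;
    const struct lnode *y = *(const struct lnode *const *)b;
    return (y->count > x->count) - (y->count < x->count);
}

/* Build an array of pointers into the list and sort the array; the
   list itself is never modified. The caller frees the index. */
struct lnode **build_index(struct lnode *list, size_t n) {
    struct lnode **idx = malloc(n * sizeof *idx);
    if (idx == NULL) return NULL;
    for (size_t i = 0; i < n; i++, list = list->next)
        idx[i] = list;                 /* one pointer per list node */
    qsort(idx, n, sizeof *idx, by_count_desc);
    return idx;
}
```

Only the pointers move; the list nodes stay linked in alphabetical order, which is exactly the advantage the text describes.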

There are many other possibilities for re-ordering the list. Regardless of which method is chosen, the main program and the interface (header file) need not be changed. Different implementations of the sorting function can be substituted without affecting any other code.

The wl_print_numerical() function can be implemented in assembly as shown in Listing 6.8. The function operates by re-ordering the linked list using an insertion sort as described above. Listing 6.9 shows the change that must be made to the makefile. Now, when make is run, it compiles the two C files and the assembly file into object files, then links them all together. The C implementation of wl_print_numerical() in list.c must be deleted or commented out; otherwise the linker will emit an error indicating that it found two definitions of wl_print_numerical().

Listing 6.8 ARM assembly implementation of wl_print_numerical().
Listing 6.9 Revised makefile for the wordfreq program.

6.2.2 Better Performance

The word frequency counter, as previously implemented, takes several minutes on a Raspberry Pi to count the frequency of words in the author's manuscript for this textbook. Most of that time is spent building the list of words and re-sorting the list in order of word frequency, and most of the time for both of these operations goes to searching the list for a word before incrementing its count or inserting it. There are more efficient ways to build ordered lists of data.

Since the code is well modularized using an ADT, the internal mechanism of the list can be modified without affecting the main program. A major improvement can be made by changing the data structure from a linked list to a binary tree. Fig. 6.1 shows an example binary tree storing word frequency counts. The time required to insert into a linked list is O(N), but the time required to insert into a binary tree is O(log₂ N). To give some perspective, the author's manuscript for this textbook contains about 125,000 words. Since log₂(125,000) < 17, we would expect the linked list implementation to require about 125,000/17 ≈ 7353 times as long as a binary tree implementation to process the manuscript. In reality, there is some overhead to the binary tree implementation. Even with the extra overhead, we should see a significant speedup. Listing 6.10 shows the C implementation using a balanced binary tree instead of a linked list.

Figure 6.1 Binary tree of word frequencies.
Listing 6.10 C implementation of the wordlist ADT using a tree.
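The insertion path for a tree of word counts can be sketched in C. This is a plain (unbalanced) binary search tree, so it achieves O(log₂ N) insertion only while the input keeps the tree reasonably balanced; the book's Listing 6.10 uses a balanced tree. The tnode type and its fixed-size word field are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tree node; a fixed-size word buffer keeps this short. */
struct tnode { char word[32]; int count; struct tnode *left, *right; };

/* Insert a word, or bump its count if already present: one string
   comparison per level of the tree. */
struct tnode *tree_add(struct tnode *root, const char *word) {
    if (root == NULL) {                      /* empty spot: new node */
        root = calloc(1, sizeof *root);
        strncpy(root->word, word, sizeof root->word - 1);
        root->count = 1;
        return root;
    }
    int cmp = strcmp(word, root->word);
    if (cmp == 0)      root->count++;                     /* found   */
    else if (cmp < 0)  root->left  = tree_add(root->left, word);
    else               root->right = tree_add(root->right, word);
    return root;
}
```

Compared with a list search, each comparison here discards half of the remaining candidates instead of one, which is the source of the speedup discussed above.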

With the tree implementation, wl_print_numerical() could build a new tree, sorted on the word frequency counts. However, it may be more efficient to build a separate index, and sort the index by word frequency counts. The assembly code will allocate an array of pointers, and set each pointer to one of the nodes in the tree, as shown in Fig. 6.2. Then, it will use a quick sort to sort the pointers into descending order by word frequency count, as shown in Fig. 6.3. This implementation is shown in Listing 6.11.

Figure 6.2 Binary tree of word frequencies with index added.
Figure 6.3 Binary tree of word frequencies with sorted index.
Listing 6.11 ARM assembly implementation of wl_print_numerical() with a tree.

The tree-based implementation gets most of its speed improvement by using two O(N log N) algorithms to replace O(N²) algorithms. These examples show how a small part of a program can be implemented in assembly language, and how to access C data structures from assembly language. The functions could just as easily have been written in C rather than assembly, without greatly affecting performance. Later chapters will show examples where the assembly implementation does have significantly better performance than the C implementation.

6.3 Ethics Case Study: Therac-25

The Therac-25 was a device designed for radiation treatment of cancer. It was produced by Atomic Energy of Canada Limited (AECL), which had previously produced the Therac-6 and Therac-20 units in partnership with CGR of France. It was capable of treating tumors close to the skin surface using electron beam therapy, but could also be configured for Megavolt X-ray therapy to treat deeper tumors. The X-ray therapy required the use of a tungsten radiation shield to limit the area of the body that was exposed to the potentially lethal radiation produced by the device.

The Therac-25 used a double pass accelerator, which provided more power, in a smaller space, at less cost, compared to its predecessors. The second major innovation was that computer control was a central part of the design, rather than an add-on component as in its predecessors. Most of the hardware safety interlocks that were integral to the designs of the Therac-6 and Therac-20 were seen as unnecessary, because the software would perform those functions. Computer control was intended to allow operators to set up the machine more quickly, allowing them to spend more time communicating with patients and to treat more patients per day. It was also seen as a way to reduce production costs by relying on software, rather than hardware, safety interlocks.

There were design issues with both the software and the hardware. Although this machine was built with the goal of saving lives, between 1985 and 1986, three deaths and other injuries were attributed to the hardware and software design of this machine. Death due to radiation exposure is usually slow and painful, and the problem was not identified until the damage had been done.

6.3.1 History of the Therac-25

AECL was required to obtain US Food and Drug Administration (FDA) approval before releasing the Therac-25 to the US market. They obtained approval quickly by declaring “pre-market equivalence,” effectively claiming that the new machine was not significantly different from its predecessors. This practice was common in 1984, but was overly optimistic, considering that most of the safety features had been changed from hardware to software implementations. With FDA approval, AECL made the Therac-25 commercially available and performed a Fault Tree Analysis to evaluate the safety of the device.

Fault Tree Analysis, as its name implies, requires building a tree to describe every possible fault and assigning probabilities to those faults. After building the tree, the probabilities of hazards, such as overdose, can be calculated. Unfortunately, the engineers assumed that the software (much of which was re-used from the previous Therac models) would operate correctly. This turned out not to be the case, because the hardware interlocks present in the previous models had hidden some of the software faults. The analysts did consider some possible computer faults, such as an error being caused by cosmic rays, but assigned extremely low probabilities to those faults. As a result, the assessment was very inaccurate.

When the first overdose was reported to AECL in 1985, they sent an engineer to the site to investigate. They also filed a report with the FDA and the Canadian Radiation Protection Board (CRPB). AECL also notified all users of the fact that there had been a report and recommended that operators should visually confirm hardware settings before each treatment. The AECL engineers were unable to reproduce the fault, but suspected that it was due to the design and placement of a microswitch. They redesigned the microswitch and modified all of the machines that had been deployed. They also retracted their recommendation that operators should visually confirm hardware settings before each treatment.

Later that year, a second incident occurred. In this case, there is no evidence that AECL took any action. In January of 1986, AECL received another incident report. An employee at AECL responded by denying that the Therac-25 was at fault, and stated that no other similar incidents had been reported. Another incident occurred in March of that year. AECL sent an engineer to investigate. The engineer was unable to determine the cause, and suggested that it was due to an electrical problem, which may have caused an electrical shock. An independent engineering firm was called to examine the machine and reported that it was very unlikely that the machine could have delivered an electrical shock to the patient. In April of 1986, another incident was reported. In this case, the AECL engineers, working with the medical physicist at the hospital, were able to reproduce the sequence of events that led to the overdose.

As required by law, AECL filed a report with the FDA. The FDA responded by declaring the Therac-25 defective. AECL was ordered to notify all of the sites where the Therac-25 was in use, investigate the problem, and file a corrective action plan. AECL notified all sites, and recommended removing certain keys from the keyboard on the machines. The FDA responded by requiring them to send another notification with more information about the defect and the consequent hazards. Later in 1986, AECL filed a revised corrective action plan.

Another overdose occurred in January 1987, and was attributed to a different software fault. In February, the FDA and CRPB both ordered that all Therac-25 units be shut down, pending effective and permanent modifications. AECL spent six months developing a new corrective action plan, which included a major overhaul of the software, the addition of mechanical safety interlocks, and other safety-related modifications.

6.3.2 Overview of Design Flaws

The Therac-25 was controlled by a DEC PDP-11, the most popular minicomputer ever produced. Around 600,000 were built between 1970 and 1990 and used for a variety of purposes, including embedded systems, education, and general data processing. It was a 16-bit computer and was far less powerful than a Raspberry Pi. The Therac-25 computer was programmed in assembly language by one programmer, and the source code was not documented. Documentation for the hardware components was written in French. After the faults were discovered, a commission concluded that the primary problems with the Therac-25 were attributable to poor software design practices, and not to any one of several specific coding errors. This is probably the best-known case where a poor overall software design, and insufficient testing, led to loss of life.

The worst problems in the design and engineering of the machine were:

 The code was not subjected to independent review.

 The software design was not considered during the assessment of how the machine could fail or malfunction.

 The operator could ignore malfunctions and cause the machine to proceed with treatment.

 The hardware and software were designed separately and not tested as a complete system until the unit was assembled at the hospitals where it was to be used.

 The design of the earlier Therac-6 and Therac-20 machines included hardware interlocks which would ensure that the X-ray mode could not be activated unless the tungsten radiation shield was in place. The hardware interlock was replaced with a software interlock in the Therac-25.

 Errors were displayed as numeric codes, and there was no indication of the severity of the error condition.

The operator interface consisted of a keyboard and text-mode monitor, which was common in the early 1980s. The interface had a data entry area in the middle of the screen and a command line at the bottom. The operator was required to enter parameters in the data entry area, then move the cursor to the command line to initiate treatment. When the operator moved the cursor to the command line, internal variables were updated and a flag variable was set to indicate that data entry was complete. That flag was cleared when a command was entered on the command line. If the operator moved the cursor back to the data entry area without entering a command, then the flag was not cleared, and any subsequent changes to the data entry area did not affect the internal variables.

A global variable was used to indicate that the magnets were currently being adjusted. This variable was modified by two functions, and did not always contain the correct value. Adjusting the magnets required about eight seconds, and the flag was correct for only a small period at the beginning of this time period.

Due to the errors in the design and implementation of the software, the following sequence of events could result in the machine causing injury to, or even the death of, the patient:

1. The operator mistakenly specified high-power mode during data entry.

2. The operator moved the cursor to the command line area.

3. The operator noticed the mistake, and moved the cursor back to the data entry area without entering a command.

4. The operator corrected the mistake and moved the cursor back to the command line.

5. The operator entered the command line area, left it, made the correction, and returned within the eight-second window required for adjusting the magnets.

If the above sequence occurred, then the operator screen could indicate that the machine was in low power mode, although it was actually set in high-power mode. During a final check before initiating the beam, the software would find that the magnets were set for high-power mode but the operator setting was for low power mode. It displayed a numeric error code and prevented the machine from starting. The operator could clear the error code by resetting the computer (which only required one key to be pressed on the keyboard). This caused the tungsten shield to withdraw but left the machine in X-ray mode. When the operator entered the command to start the beam, the machine could be in high-power mode without having the tungsten shield in place. X-rays were applied to the unprotected patient.

It took some time for this critical flaw to appear. The failure only occurred when the operator initially made a one-keystroke mistake in entering the prescription data, moved to the command area, and then corrected the mistake within eight seconds. Initially, operators were slow to enter data, and spent a lot of time making sure that the prescription was correct before initiating treatment. As they became more familiar with the machine, they were able to enter data and correct mistakes more quickly. Eventually, operators became familiar enough with the machine that they could enter data, make a correction, and return to the command area within the critical eight-second window. Also, the operators became familiar with the machine reporting numeric error codes without any indication of the severity of the code. The operators were given a table of codes and their meanings. The code reported was “no dose” and indicated “treatment pause.” There is no reason why the operator should consider that to be a serious problem; they had become accustomed to frequent malfunctions that did not have any consequences to the patient.

Although the code was written in assembly language, that fact was not cited as an important factor. The fundamental problems were poor software design and overconfidence. The reuse of code in an application for which it was not initially designed also may have contributed to the system flaws. A proper design using established software design principles, including structured programming and abstract data types, would almost certainly have avoided these fatalities.

6.4 Chapter Summary

The abstract data type is a structured programming concept which contributes to software reliability, eases maintenance, and allows for major revisions to be performed in a safe way. Many high-level languages enforce, or at least facilitate, the use of ADTs. Assembly language does not. However, the ethical assembly language programmer will make the extra effort to write code that conforms to the standards of structured programming and use abstract data types to help ensure safety, reliability, and maintainability.

ADTs also facilitate the implementation of software modules in more than one language. The interface specifies the components of the ADT, but not the implementation. The implementation can be in any language. As long as assembly programmers and compiler authors generate code that conforms to a well-known standard, their code can be linked with code written in other languages.

Poor coding practices and poor design can lead to dire consequences, including loss of life. It is the responsibility of the programmer, regardless of the language used, to make ethical decisions in the design and implementation of software. Above all, the programmer must be aware of the possible consequences of the decisions they make.

Exercises

6.1 What are the advantages of designing software using abstract data types?

6.2 Why is the internal structure of the Pixel data type hidden from client code in Listing 6.2?

6.3 High-level languages provide mechanisms for information hiding, but assembly does not. Why should the assembly programmer not simply bypass all information hiding and access the internal data structures of any ADT directly?

6.4 The assembly code in wl_print_numerical() accesses the internal structure of the wordlistnode data type. Why is it allowed to do so? Should it be allowed to do so?

6.5 Given the following definitions for a stack ADT:

[Listings: C header and data definitions for the stack ADT.]

Write the InitStack() function in ARM assembly language.

6.6 Referring to the previous question, write the Push() function in ARM assembly language.

6.7 Referring to the previous two questions, write the Pop() function in ARM assembly language.

6.8 Referring to the previous three questions, write the Top() function in ARM assembly language.

6.9 Referring to the previous three questions, write the PrintStack() function in ARM assembly language.

6.10 Re-implement all of the previous stack functions using a linked list rather than a static array.

6.11 The “Software Engineering Code of Ethics And Professional Practice” states that a responsible software engineer should “Approve software only if they have well-founded belief that it is safe, meets specifications, passes appropriate tests…” (sub-principle 1.03) and “Ensure adequate testing, debugging, and review of software…on which they work.” (sub-principle 3.10). Unfortunately, defects did make their way into the system.
The software engineering code of ethics also states that a responsible software engineer should “Treat all forms of software maintenance with the same professionalism as new development.”

(a) Explain how the Software Engineering Code of Ethics And Professional Practice was violated by the Therac-25 developers.

(b) How should the engineers and managers at AECL have responded when problems were reported?

(c) What other ethical and non-ethical considerations may have contributed to the deaths and injuries?

Part II

Performance Mathematics

Chapter 7

Integer Mathematics

Abstract

This chapter introduces the concept of high performance mathematics. The chapter starts by explaining basic math in bases other than 10. It explains subtraction using complement mathematics. Next it gives efficient algorithms for performing signed and unsigned multiplication in binary. It explains how multiplication by a constant can often be converted into a much more efficient sequence of shift and add or subtract operations, and gives a method for multiplying two arbitrarily large numbers. Next, an efficient algorithm is given for binary division, followed by a technique for converting division by a constant into multiplication by a related constant. The next section introduces an ADT, written in C, which can be used to perform basic mathematical operations on integers of any size. The chapter concludes by showing that the ADT can be made much more efficient by replacing some of the functions with assembly language implementations.

Keywords

Addition; Subtraction; Complement; Multiplication; Division; Big integer; High performance; Abstract data type

There are some differences between the way calculations are performed in a computer versus the way most of us were taught as children. The first difference is that calculations are performed in binary instead of base ten. Another difference is that the computer is limited to a fixed number of binary digits, which raises the possibility of having a result that is too large to fit in the number of bits available. This occurrence is referred to as overflow. The third difference is that subtraction is performed using complement addition.

Addition in base b is very similar to base ten addition, except that the result of each column is limited to b − 1. For example, binary addition works exactly the same as decimal addition, except that the result of each column is limited to 0 or 1. The following figure shows an addition in base ten and the equivalent addition in base two.

[Figure: the same addition performed in base ten and in base two, with carries shown above the columns.]

The carry from one column to the next is shown as a small number above the column that it is being carried into. Note that carries from one column to the next are done the same way in both bases. The only difference is that there are more columns in the base two addition because it takes more digits to represent a number in binary than it does in decimal.
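The column-by-column procedure is identical in any base, which a short C sketch can illustrate. The digit-array representation and the NDIGITS limit are assumptions made for this sketch, not anything from the text; digits are stored least significant first.

```c
#include <assert.h>

#define NDIGITS 8   /* fixed width, for illustration only */

/* Add two numbers represented as digit arrays (least significant digit
   first) in base b, propagating carries exactly as done by hand. */
void add_digits(const int *x, const int *y, int *sum, int b) {
    int carry = 0;
    for (int i = 0; i < NDIGITS; i++) {
        int s = x[i] + y[i] + carry;   /* column total              */
        sum[i] = s % b;                /* digit written down        */
        carry  = s / b;                /* digit carried to the left */
    }
}
```

Running the same routine with b = 10 or b = 2 shows that only the point at which a column overflows changes, not the mechanism.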

7.1 Subtraction by Addition

Finding the complement was explained in Section 1.3.3. Subtraction can be computed by adding the radix complement of the subtrahend to the minuend. Example 7.1 shows a complement subtraction with a positive result. When x < y, the result will be negative. In the complement method, this means that there will be a ‘1’ in the most significant bit, and in order to convert the result to base ten, we must take the radix complement. Example 7.2 shows complement subtraction with a negative result. Example 7.3 shows several more signed addition and subtraction operations in base ten and binary.

Example 7.1

Ten’s Complement Subtraction

Suppose we wish to calculate 384₁₀ − 56₁₀ using the complements method. After extending both numbers to the same number of digits, we have 384₁₀ − 056₁₀. From Eq. (1.1), the ten’s complement of 056₁₀ is 10³ − 056₁₀ = 944₁₀. Adding gives us 384₁₀ + 944₁₀ = 1328₁₀. After discarding the leading “1”, we have 328, which is the correct result. Both methods of subtraction are shown below:

  384        384
- 056      + 944
-----      -----
  328       1328   (discard the leading 1 to get 328)

Example 7.2

Ten’s Complement Subtraction With a Negative Result

Suppose we want to calculate 284 − 481. Both numbers have three digits, so it is not necessary to pad with leading zeros. Adding the ten’s complement of y to x gives 284 + 519 = 803. This is obviously the wrong answer, since the expected answer is − 197. But all is not lost, because 803 happens to be the ten’s complement of 197. The fact that the first digit of the result is greater than four indicates that we must take the ten’s complement of the result and add a negative sign.
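The rule used in Examples 7.1 and 7.2 can be sketched in C for three-digit base-ten operands. The function name and the fixed three-digit width are assumptions made for this sketch.

```c
#include <assert.h>

/* Radix-complement subtraction for 3-digit base-ten operands:
   compute x - y by adding the ten's complement of y. */
int tens_complement_sub(int x, int y) {
    int sum = x + (1000 - y);     /* add the ten's complement of y   */
    if (sum >= 1000)              /* leading 1: result is positive   */
        return sum - 1000;        /* discard the carry out           */
    return -(1000 - sum);         /* else complement again, negate   */
}
```

The same two cases appear here as in the worked examples: 384 − 56 produces a carry out that is discarded, while 284 − 481 produces no carry, so the result must be complemented and negated.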

Example 7.3

Signed Addition and Subtraction in Decimal and Binary

[Figure: several signed addition and subtraction operations worked side by side in base ten and base two.]

7.2 Binary Multiplication

Many processors have hardware multiply instructions. However hardware multipliers require a large number of transistors, and consume significant power. Processors designed for extremely low power consumption or very small size usually do not implement a multiply instruction, or only provide multiply instructions that are limited to a small number of bits. On these systems, the programmer must implement multiplication using basic data processing instructions.

7.2.1 Multiplication by a Power of Two

If the multiplier is a power of two, then multiplication can be accomplished with a shift to the left. Consider the 4-bit binary number x = x₃ × 2³ + x₂ × 2² + x₁ × 2¹ + x₀ × 2⁰, where xₙ denotes bit n of x. If x is shifted left by one bit, introducing a zero into the least significant bit, then it becomes x₃ × 2⁴ + x₂ × 2³ + x₁ × 2² + x₀ × 2¹ + 0 × 2⁰ = 2(x₃ × 2³ + x₂ × 2² + x₁ × 2¹ + x₀ × 2⁰). Therefore, a shift of one bit to the left is equivalent to multiplication by two. This argument can be extended to prove that a shift left by n bits is equivalent to multiplication by 2ⁿ.
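As a quick C illustration of this equivalence (the helper name is an invention for this sketch, and the result holds only while no significant bits are shifted out of the top of the register):

```c
#include <assert.h>
#include <stdint.h>

/* A left shift by n multiplies by 2^n, provided no significant bits
   are shifted out of the register. */
uint32_t mul_pow2(uint32_t x, unsigned n) {
    return x << n;   /* equivalent to x * 2^n */
}
```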

7.2.2 Multiplication of Two Variables

Most techniques for binary multiplication involve computing a set of partial products and then summing them. This process is similar to the long multiplication method taught to primary schoolchildren for base ten integers, modified here for binary. The schoolbook method computes partial products, shifts them to the left, and then adds them together. The most difficult part is obtaining the partial products, since that involves multiplying a long number by one base ten digit. The following example shows how the partial products are formed when multiplying 123 by 456.

      123
    × 456
    -----
      738
     615
    492
    -----
    56088

The first partial product can be written as 123 × 6 × 10⁰ = 738. The second is 123 × 5 × 10¹ = 6150, and the third is 123 × 4 × 10² = 49200. In practice, we usually leave out the trailing zeros. The procedure is the same in binary, but is simpler because the partial product involves multiplying a long number by a single base 2 digit. Since the multiplier is always either zero or one, the partial product is very easy to compute. The product of multiplying any binary number x by a single binary digit is always either 0 or x. Therefore, the multiplication of two binary numbers comes down to shifting the multiplicand left appropriately for each non-zero bit in the multiplier, and then adding the shifted numbers together.

Suppose we wish to multiply two four-bit numbers, 1011 and 1010:

       1011
     × 1010
     ------
       0000
      1011
     0000
    1011
    -------
    1101110

Notice in the previous example that each partial product is either zero or x shifted by some amount. A slightly quicker way to perform the multiplication is to leave out any partial product which is zero. Example 7.4 shows the results of multiplying 101₁₀ by 89₁₀ in decimal and binary using this shorter method. For implementation in hardware and software, it is easier to accumulate the partial products, adding each one to a running sum, rather than building a circuit to add multiple binary numbers at once.

Example 7.4

Equivalent Multiplication in Decimal and Binary

      101               1100101
     × 89             × 1011001
     ----             ---------
      909               1100101
     808             1100101
     ----           1100101
     8989         1100101
                 --------------
                 10001100011101

Binary multiplication can be implemented as a sequence of shift and add instructions. Given two registers, x and y, and an accumulator register a, the product of x and y can be computed using Algorithm 1. When applying the algorithm, it is important to remember that, in the general case, the result of multiplying an n-bit number by an m-bit number is (at most) an (n + m)-bit number. For instance, 11₂ × 11₂ = 1001₂. Therefore, when applying Algorithm 1, it is necessary to know the number of bits in x and y. Since x is shifted left on each iteration of the loop, the registers used to store x and a must both be at least as large as the number of bits in x plus the number of bits in y.

Algorithm 1 Algorithm for binary multiplication.
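The loop that Algorithm 1 describes can be sketched in C as follows. This is an illustrative rendering, not the book's algorithm listing; the 64-bit accumulator stands in for the double-width register pair, and the names are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-add multiplication: for each set bit of the multiplier y,
   add the correspondingly shifted multiplicand x to the accumulator. */
uint64_t mul_shift_add(uint64_t x, uint32_t y) {
    uint64_t a = 0;                /* accumulator                   */
    while (y != 0) {
        if (y & 1)                 /* low bit of multiplier set?    */
            a += x;                /* add the shifted multiplicand  */
        x <<= 1;                   /* shift multiplicand left       */
        y >>= 1;                   /* consume one multiplier bit    */
    }
    return a;
}
```

Examining the multiplier one bit at a time from the right reproduces the trace shown below: an "add, then shift" step when the low bit is one, and a "shift only" step when it is zero.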

Assume we wish to multiply two numbers, x = 01101001 and y = 01011010. Applying Algorithm 1 results in the following sequence:

a                 x                 y         Next operation
0000000000000000  0000000001101001  01011010  shift only
0000000000000000  0000000011010010  00101101  add, then shift
0000000011010010  0000000110100100  00010110  shift only
0000000011010010  0000001101001000  00001011  add, then shift
0000010000011010  0000011010010000  00000101  add, then shift
0000101010101010  0000110100100000  00000010  shift only
0000101010101010  0001101001000000  00000001  add, then shift
0010010011101010  0011010010000000  00000000  shift only

105 × 90 = 9450

To multiply two n-bit numbers, you must be able to add two 2n-bit numbers. On the ARM processor, n is usually assumed to be 32, because that is the natural word size for the ARM processor. Adding 64-bit numbers requires two add instructions, and the carry from the least significant 32 bits must be added to the sum of the most significant 32 bits. The ARM processor provides a convenient way to perform the add with carry. Assume we have two 64-bit numbers, x and y. We have x in r0, r1 and y in r2, r3, where the high-order words of each number are in the higher-numbered registers, and we want to calculate x = x + y. Listing 7.1 shows a two-instruction sequence for the ARM processor. The first instruction adds the two least significant words together and sets (or clears) the carry bit and other flags in the CPSR. The second instruction adds the two most significant words along with the carry bit.

Listing 7.1 ARM assembly code for adding two 64 bit numbers.
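The effect of the add-with-carry pair can be mirrored in C by detecting the carry out of the low-word addition. This is a sketch with invented names, not the book's listing; it models the flag-setting add followed by the add-with-carry.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit addition built from 32-bit words: add the low words, detect
   the carry out (unsigned wrap-around), and fold it into the sum of
   the high words, like a flag-setting add followed by add-with-carry. */
void add64(uint32_t xlo, uint32_t xhi, uint32_t ylo, uint32_t yhi,
           uint32_t *slo, uint32_t *shi) {
    uint32_t lo = xlo + ylo;         /* may wrap around             */
    uint32_t carry = (lo < xlo);     /* wrapped => carry out was 1  */
    *slo = lo;
    *shi = xhi + yhi + carry;        /* high words plus the carry   */
}
```

The wrap-around test works because unsigned overflow is well defined in C: the low-word sum is smaller than either operand exactly when a carry was produced.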

On the ARM processor, the algorithm to multiply two 32-bit unsigned integers is very efficient. Listing 7.2 shows one possible algorithm for multiplying two 32-bit numbers to obtain a 64-bit result. The code is a straightforward implementation of the algorithm, and some modifications can be made to improve efficiency. For example, if we only want a 32-bit result, we do not need to perform 64-bit addition. This significantly simplifies the code, as shown in Listing 7.3.

Listing 7.2 ARM assembly code for multiplication with a 64 bit result.
f07-08-9780128036983
Listing 7.3 ARM assembly code for multiplication with a 32 bit result.
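The shift-and-add loop that the listings implement in assembly can be expressed in C as follows (a sketch for clarity, not the book's code):

```c
#include <stdint.h>

/* Shift-and-add multiplication of two 32-bit values giving a 64-bit
 * product.  At each step the low bit of the multiplier decides
 * between "add, then shift" and "shift only". */
uint64_t mul32x32(uint32_t x, uint32_t y)
{
    uint64_t product = 0;
    uint64_t mcand = x;          /* multiplicand, shifted left each step */
    while (y != 0) {
        if (y & 1u)              /* low bit set: add, then shift */
            product += mcand;
        mcand <<= 1;             /* otherwise: shift only */
        y >>= 1;
    }
    return product;
}
```

Running it on the worked example above gives mul32x32(105, 90) = 9450.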

7.2.3 Multiplication of a Variable by a Constant

If x or y is a constant, then a loop is not necessary. The multiplication can be directly translated into a sequence of shift and add operations, resulting in much more efficient code than the general algorithm. If we inspect the constant multiplier, we can usually find a pattern to exploit that will save a few instructions. For example, suppose we want to multiply a variable x by 10₁₀. The multiplier 10₁₀ = 1010₂, so we only need to add x shifted left 1 bit to x shifted left 3 bits, as shown below:

u07-32-9780128036983

Now suppose we want to multiply a number x by 11₁₀. The multiplier 11₁₀ = 1011₂, so we will add x to x shifted left one bit plus x shifted left 3 bits, as in the following:

u07-33-9780128036983

If we wish to multiply a number x by 1000₁₀, we note that 1000₁₀ = 1111101000₂. It looks like we need one shift plus five add/shift operations, or six add/shift operations. With a little thought, the number of operations can be reduced from six to five, as shown below:

u07-34-9780128036983

Applying the basic multiplication algorithm to multiply a number x by 255₁₀ would result in seven add/shift operations, but we can do it with only three operations and use only one register, as shown below:

u07-35-9780128036983
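The constant-multiply patterns worked out above translate directly into C shift/add expressions (illustrative helper names; each body corresponds to one or two ARM data-processing instructions with shifted operands):

```c
#include <stdint.h>

/* Constant multiplies built from shifts, adds, and a subtract. */
uint32_t mul_by_10(uint32_t x)  { return (x << 3) + (x << 1); }     /* 10 = 1010 in binary */
uint32_t mul_by_11(uint32_t x)  { return x + (x << 1) + (x << 3); } /* 11 = 1011 in binary */
uint32_t mul_by_255(uint32_t x) { return (x << 8) - x; }            /* 255*x = 256*x - x   */
```

The last case shows the value of spotting patterns: treating 255 as 256 − 1 replaces a long chain of adds with a single shift and subtract.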

Most modern systems have assembly language instructions for multiplication, but hardware multiply units require a relatively large number of transistors. For that reason, processors intended for small embedded applications often do not have a multiply instruction. Even when a hardware multiplier is available, it can be more efficient on some processors to use shift, add, and subtract operations when multiplying by a constant. The hardware multiplier units that are available on most ARM processors are very powerful. They can typically perform multiplication with a 32-bit result in as little as one clock cycle. The long multiply instructions take between three and five clock cycles, depending on the size of the operands. Using the multiply instruction on an ARM processor to multiply by a constant usually requires loading the constant into a register before performing the multiply. Therefore, if the multiplication can be performed using three or fewer shift, add, and subtract instructions, then it will be equal to or better than using the multiply instruction.

7.2.4 Signed Multiplication

Consider the two multiplication problems shown in Figs. 7.1 and 7.2. Note that the result of a multiply depends on whether the numbers are interpreted as unsigned numbers or signed numbers. For this reason, most computer CPUs have two different multiply operations for signed and unsigned numbers.

f07-01-9780128036983
Figure 7.1 In signed 8-bit math, 11011001₂ is −39₁₀.
f07-02-9780128036983
Figure 7.2 In unsigned 8-bit math, 11011001₂ is 217₁₀.

If the CPU provides only an unsigned multiply, then a signed multiply can be accomplished by using the unsigned multiply operation along with a conditional complement. The following procedure can be used to implement signed multiplication.

1. if the multiplier is negative, take the two’s complement,

2. if the multiplicand is negative, take the two’s complement,

3. perform unsigned multiply, and

4. if the multiplier or multiplicand was negative (but not both), then take two’s complement of result.

Example 7.5 demonstrates this method using one negative number.

Example 7.5

Signed Multiplication Using Unsigned Math

u07-22-9780128036983
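The four-step procedure of Example 7.5 can be sketched in C (hypothetical function name; the C multiply operator stands in for the CPU's unsigned multiply instruction):

```c
#include <stdint.h>

/* Signed multiplication built from an unsigned multiply plus a
 * conditional two's complement, following the four steps above. */
int32_t smul16(int16_t a, int16_t b)
{
    /* steps 1 and 2: take the two's complement of negative operands */
    uint32_t ua = (a < 0) ? (uint32_t)(-(int32_t)a) : (uint32_t)a;
    uint32_t ub = (b < 0) ? (uint32_t)(-(int32_t)b) : (uint32_t)b;
    uint32_t up = ua * ub;              /* step 3: unsigned multiply */
    if ((a < 0) != (b < 0))             /* exactly one operand negative */
        return -(int32_t)up;            /* step 4: complement the result */
    return (int32_t)up;
}
```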

7.2.5 Multiplying Large Numbers

Consider the method used for multiplying two-digit numbers in base ten, using only the one-digit multiplication tables. Fig. 7.3 shows how a two-digit number a = a₁ × 10¹ + a₀ × 10⁰ is multiplied by another two-digit number b = b₁ × 10¹ + b₀ × 10⁰ to produce a four-digit result using basic multiplication operations which take only one digit from a and one digit from b at each step.

f07-03-9780128036983
Figure 7.3 Multiplication of large numbers.

This technique can be used for numbers in any base and for any number of digits. Recall that one hexadecimal digit is equivalent to exactly four binary digits. If a and b are both 8-bit numbers, then they are also 2-digit hexadecimal numbers. In other words, 8-bit numbers can be divided into groups of four bits, each representing one digit in base sixteen. Given a multiply operation that is capable of producing an 8-bit result from two 4-bit inputs, the technique shown above can then be used to multiply two 8-bit numbers using only 4-bit multiplication operations.

Carrying this one step further, suppose we are given two 16-bit numbers, but our computer only supports multiplying eight bits at a time and producing a 16-bit result. We can consider each 16-bit number to be a two digit number in base 256, and use the above technique to perform four eight bit multiplies with 16-bit results, then shift and add the 16-bit results to obtain the final 32-bit result. This approach can be extended to implement efficient multiplication of arbitrarily large numbers, using a fixed-sized multiplication operation.
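The digit-by-digit method of Fig. 7.3 can be demonstrated in C using base-256 digits: a 16-by-16-bit multiply built from four 8-by-8-bit products (a sketch, with an illustrative function name):

```c
#include <stdint.h>

/* Multiply two 16-bit numbers using only 8x8->16 multiplies, treating
 * each operand as a two-digit base-256 number. */
uint32_t mul16_via8(uint16_t a, uint16_t b)
{
    uint16_t a0 = a & 0xFF, a1 = a >> 8;   /* digits of a in base 256 */
    uint16_t b0 = b & 0xFF, b1 = b >> 8;   /* digits of b in base 256 */
    uint32_t p00 = (uint32_t)(a0 * b0);    /* weight 256^0 */
    uint32_t p01 = (uint32_t)(a0 * b1);    /* weight 256^1 */
    uint32_t p10 = (uint32_t)(a1 * b0);    /* weight 256^1 */
    uint32_t p11 = (uint32_t)(a1 * b1);    /* weight 256^2 */
    /* shift each partial product to its weight and add them up */
    return p00 + ((p01 + p10) << 8) + (p11 << 16);
}
```

The same shape extends to 32-by-32 from 16-bit multiplies, and so on, which is exactly how arbitrarily large multiplications are synthesized.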

7.3 Binary Division

Binary division can be implemented as a sequence of shift and subtract operations. When performing binary division by hand, it is convenient to perform the operation in a manner very similar to the way that decimal division is performed. As shown in Fig. 7.4, the operation is identical, but takes more steps in binary.

f07-04-9780128036983
Figure 7.4 Longhand division in decimal and binary.

7.3.1 Division by a Power of Two

If the divisor is a power of two, then division can be accomplished with a shift to the right. Using the same approach as was used in Section 7.2.1, it can be shown that a shift right by n bits is equivalent to division by 2^n. However, care must be taken to ensure that an arithmetic shift is used if the numerator is a signed two's complement number, and a logical shift is used if the numerator is unsigned.
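In C, the two cases can be sketched as follows (this assumes the compiler implements `>>` on signed values as an arithmetic shift, as GCC does on ARM). One subtlety worth noting: an arithmetic shift of a negative value rounds toward minus infinity, so a small bias is added first if the result should truncate toward zero like a division instruction:

```c
#include <stdint.h>

/* Unsigned division by 2^n: a logical shift right. */
uint32_t udiv_pow2(uint32_t x, unsigned n) { return x >> n; }

/* Signed division by 2^n (0 < n < 31): an arithmetic shift right,
 * with a bias added to negative values so the result truncates
 * toward zero the way a divide instruction would. */
int32_t sdiv_pow2(int32_t x, unsigned n)
{
    if (x < 0)
        x += (int32_t)(1 << n) - 1;  /* bias before shifting */
    return x >> n;                   /* arithmetic shift on ARM/GCC */
}
```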

7.3.2 Division by a Variable

The algorithm for dividing binary numbers is somewhat more complicated than the algorithm for multiplication. The algorithm consists of two main phases:

1. shift the divisor left until it is greater than the dividend, counting the number of shifts, then

2. repeatedly shift the divisor back to the right and subtract whenever possible.

Fig. 7.5 shows the algorithm in more detail. Because of the complexity of the algorithm, division in hardware requires a significant number of transistors. The ARM architecture did not introduce a divide instruction until ARMv7, and even then it was not implemented on all processors. Many ARM systems (including the Raspberry Pi) do not have hardware division. However, the ARM processor instruction set makes it possible to write very efficient code for division.

f07-05-9780128036983
Figure 7.5 Flowchart for binary division.

Before we introduce the ARM code, we will take some time to step through the algorithm using an example. Let us begin by dividing 94 by 7. The result is shown below:

u07-29-9780128036983

To implement the algorithm, we need three registers, one for the dividend, one for the divisor, and one for a counter. The dividend and divisor are loaded into their registers and the counter is initialized to zero as shown below:

Dividend  01011110
Divisor   00000111
Counter   00000000

Next, the divisor is shifted left and the counter incremented repeatedly until the divisor is greater than the dividend. This is shown in the following sequence:

Dividend  01011110
Divisor   00001110
Counter   00000001

Dividend  01011110
Divisor   00011100
Counter   00000010

Dividend  01011110
Divisor   00111000
Counter   00000011

Dividend  01011110
Divisor   01110000
Counter   00000100

Next, we allocate a register for the quotient and initialize it to zero. Then, according to the algorithm, we repeatedly subtract if possible, shift to the right, and decrement the counter. This sequence continues until the counter becomes negative. For our example this results in the following sequence:

u07-10-9780128036983
u07-11-9780128036983
u07-12-9780128036983
u07-13-9780128036983

u07-14-9780128036983
u07-15-9780128036983

When the algorithm terminates, the quotient register contains the result of the division, and the modulus (remainder) is in the dividend register. Thus, one algorithm is used to compute both the quotient and the modulus at the same time. There are variations on this algorithm. For example, one variation is to shift a single bit left in a register, rather than incrementing a count. This variation has the same two phases as the previous algorithm, but counts in powers of two rather than by ones. The following sequence shows what occurs after each iteration of the first loop in the algorithm.

Dividend  01011110
Divisor   00000111
Power     00000001

Dividend  01011110
Divisor   00001110
Power     00000010

Dividend  01011110
Divisor   00011100
Power     00000100

Dividend  01011110
Divisor   00111000
Power     00001000

Dividend  01011110
Divisor   01110000
Power     00010000

The divisor is now greater than the dividend, so the algorithm proceeds to the second phase. In this phase, if the divisor is less than or equal to the dividend, then the power register is added to the quotient and the divisor is subtracted from the dividend. Then, the power and divisor registers are shifted to the right. The process is repeated until the power register is zero. The following sequence shows what the registers will contain at the end of each iteration of the second loop.

u07-16-9780128036983
u07-17-9780128036983
u07-18-9780128036983
u07-19-9780128036983
u07-20-9780128036983
u07-21-9780128036983

As with the previous version, when the algorithm terminates, the quotient register contains the result of the division, and the modulus (remainder) is in the dividend register. Listing 7.4 shows the ARM assembly code to implement this version of the division algorithm for 32-bit numbers, and the counting method for 64-bit numbers.

f07-13a-9780128036983f07-13b-9780128036983f07-13c-9780128036983f07-13d-9780128036983
Listing 7.4 ARM assembly implementation of signed and unsigned 32-bit and 64-bit division functions.
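For comparison with the assembly, the power-of-two variant of the algorithm can be sketched for unsigned 32-bit operands in C (illustrative function name; the divisor must be nonzero):

```c
#include <stdint.h>

/* Shift-and-subtract division, power-of-two variant: phase 1 shifts
 * the divisor (and a power-of-two marker) left until the divisor
 * exceeds the dividend; phase 2 shifts back right, subtracting
 * whenever possible.  Requires divisor != 0.  Returns the quotient
 * and leaves the remainder (modulus) in *rem. */
uint32_t udiv32(uint32_t dividend, uint32_t divisor, uint32_t *rem)
{
    uint32_t quotient = 0;
    uint32_t power = 1;
    /* phase 1: shift left, stopping before the divisor can overflow */
    while (divisor <= dividend && (divisor & 0x80000000u) == 0) {
        divisor <<= 1;
        power <<= 1;
    }
    /* phase 2: subtract when possible, then shift both right */
    while (power != 0) {
        if (divisor <= dividend) {
            dividend -= divisor;
            quotient += power;
        }
        divisor >>= 1;
        power >>= 1;
    }
    *rem = dividend;   /* what is left of the dividend is the modulus */
    return quotient;
}
```

Tracing it on the worked example (94 ÷ 7) reproduces the register sequences shown above, ending with quotient 13 and remainder 3.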

7.3.3 Division by a Constant

In general, division is slow. Newer ARM processors provide a hardware divide instruction which requires between two and twelve clock cycles to produce a result, depending on the size of the operands. Older processors must perform division using software, as previously described. In either case, division is by far the slowest of the basic mathematical operations. However, division by a constant c can be converted to a multiply by the reciprocal of c. It is obviously much more efficient to use a multiply instead of a divide wherever possible. Efficient division of a variable by a constant is achieved by applying the following equality:

x ÷ c = x × (1/c).    (7.1)

The only difficulty is that we have to do it in binary, using only integers. If we modify the right-hand side by multiplying and dividing by some power of two (2^n), we can rewrite Eq. (7.1) as follows:

x ÷ c = x × (2^n ÷ c) × 2^(−n).    (7.2)

Recall that, in binary, multiplying by 2^n is the same as shifting left by n bits, while multiplying by 2^(−n) is done by shifting right by n bits. Therefore, Eq. (7.2) is just Eq. (7.1) with two shift operations added. The two shift operations cancel each other out. Now, let

m = 2^n ÷ c.    (7.3)

We can rewrite Eq. (7.2) as:

x ÷ c = x × m × 2^(−n).    (7.4)

We now have a method for dividing by a constant c which involves multiplying by a different constant, m, and shifting the result. In order to achieve the best precision, we want to choose n such that m is as large as possible within the number of bits we have available.

Suppose we want efficient code to calculate x ÷ 23 using 8-bit signed integer multiplication. Our first task is to find m = 2^n ÷ 23 such that 01111111₂ ≥ m ≥ 01000000₂. In other words, we want to find the value of n where the most significant bit of m is zero, and the next most significant bit of m is one. If we choose n = 11, then

m = 2^11 ÷ 23 ≈ 89.0434782609.

Rounding to the nearest integer gives m = 89. In 8 bits, m is 01011001₂ or 59₁₆. We now have values for m and n, and therefore we can apply Eq. (7.4) to divide any number x by 23. The procedure is simple: calculate y = x × m, then shift y right by 11 bits.

However, there are two more considerations. First, when the divisor is positive, the result for some values of x may be incorrect due to rounding error. It is usually sufficient to increment the reciprocal value by one in order to avoid these errors. In the previous example, the value would be changed from 59₁₆ to 5A₁₆. When implementing this technique for finding the reciprocal, the programmer should always verify that the results are correct for all input values. The second consideration is when the dividend is negative. In that case it is necessary to subtract one from the final result.

For example, to calculate 101₁₀ ÷ 23₁₀ in binary, with eight bits of precision, we first perform the multiplication as follows:

u07-23-9780128036983

Then shift the result right by 11 bits: 10001100011101₂ shifted right 11₁₀ bits is 100₂ = 4₁₀. If the modulus is required, it can be calculated as 101 mod 23 = 101 − (4 × 23) = 9, which once again requires multiplication by a constant.

In the previous example the shift amount of 11 bits provided the best precision possible. But how was that number chosen? The shift amount, n, can be directly computed as

n = p + ⌊log₂ c⌋ − 1,    (7.5)

where p is the desired number of bits of precision. The value of m can then be computed as

m = ⌊2^n ÷ c + 1⌋ if c > 0, or m = ⌊2^n ÷ c⌋ otherwise.    (7.6)

For example, to divide by the constant 33, with 16 bits of precision, we compute n as

n = 16 + ⌊log₂ 33⌋ − 1 = 16 + ⌊5.0443941⌋ − 1 = 16 + 5 − 1 = 20,

and then we compute m as

m = ⌊2^20 ÷ 33 + 1⌋ = ⌊31776.030303⌋ = 31776 = 7C20₁₆.

Therefore, multiplying a 16-bit number by 7C20₁₆ and then shifting right 20 bits is equivalent to dividing by 33.

Example 7.6 shows how to calculate m and n for division by 193. On the ARM processor, division by a constant can be performed very efficiently. Listing 7.5 shows how division by 193 can be implemented using only a few lines of code. In the listing, the numbers are 32 bits in length, so the constant m is much larger than in the example that was multiplied by hand, but otherwise the method is the same.

Example 7.6

Division by Constant 193

To divide by the constant 193, with 32 bits of precision, the multiplier is computed using Eqs. (7.5) and (7.6) with p = 32 as follows:

m = ⌊2^(32+7−1) ÷ 193 + 1⌋ = ⌊2^38 ÷ 193 + 1⌋ = ⌊1424237860.81⌋ = 1424237860 = 54E42524₁₆.

The shift amount, n, is 38 bits.

f07-14-9780128036983
Listing 7.5 ARM assembly code for division by constant 193.
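The same computation can be sketched in C; the 64-bit product corresponds to the long multiply (UMULL) used on the ARM, and the function name is illustrative:

```c
#include <stdint.h>

/* Unsigned division by the constant 193: multiply by
 * m = 54E42524 (hex) to form a 64-bit product, then shift the
 * product right by n = 38 bits. */
uint32_t div193(uint32_t x)
{
    uint64_t product = (uint64_t)x * 0x54E42524u;  /* long multiply */
    return (uint32_t)(product >> 38);              /* divide by 2^38 */
}
```

With this choice of m and n the result matches true integer division by 193 for every 32-bit unsigned input.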

On processors without the multiply instruction, we can use the technique of shifting and adding shown previously. If we wish to divide by 23 using 32 bits of precision, we compute the multiplier as

m = ⌊2^(32+4−1) ÷ 23 + 1⌋ = ⌊2^35 ÷ 23 + 1⌋ = ⌊1493901669.17⌋ = 1493901669 = 590B2165₁₆.

That is 01011001000010110010000101100101₂. Note that there are only 13 non-zero bits, and the pattern 1011001 appears three times in the 32-bit multiplier. The multiply can be implemented as 2^24(2^6 x + 2^4 x + 2^3 x + 2^0 x) + 2^13(2^6 x + 2^4 x + 2^3 x + 2^0 x) + 2^2(2^6 x + 2^4 x + 2^3 x + 2^0 x) + 2^0 x. So the following code sequence can be used on processors that do not have the multiply instruction:

f07-15a-9780128036983f07-15b-9780128036983
Listing 7.6 ARM assembly code for division of a variable by a constant without using a multiply instruction.

7.3.4 Dividing Large Numbers

Section 7.2.5 showed how large numbers can be multiplied by breaking them into smaller numbers and using a series of multiplication operations. There is no similar method for synthesizing a large division operation with an arbitrary number of digits in both the dividend and the divisor. However, there is a method for dividing an arbitrarily large dividend by a divisor, provided the available division operation can handle a dividend with twice as many digits as the divisor.

Suppose we wish to perform division of an arbitrarily large dividend by a one digit divisor using a basic division operation that can divide a two digit dividend by a one digit divisor. The operation can be performed in multiple steps as follows:

1. Divide the most significant digit of the dividend by the divisor. The result is the most significant digit of the quotient.

2. Prepend the remainder from the previous division step to the next digit of the dividend, forming a two-digit number, and divide that by the divisor. This produces the next digit of the result.

3. Repeat from step 2 until all digits of the dividend have been processed.

4. Take the final remainder as the modulus.

The following example shows how to divide 6189 by 7 using only 2-digits at a time:

eq07-04-9780128036983

This method can be applied in any base and with any number of digits. The only restriction is that the basic division operation must be capable of dividing a 2n-digit number by an n-digit number and producing a 2n-digit quotient and an n-digit remainder. For example, the div instruction available on Cortex-M3 and newer processors is capable of dividing a 32-bit dividend by a 32-bit divisor, producing a 32-bit quotient. The remainder can be calculated by multiplying the quotient by the divisor and subtracting the product from the dividend. Using this division operation it is possible to divide an arbitrarily large number by a 16-bit divisor.
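The steps above can be sketched in C for an arbitrarily long dividend stored as base-256 digits, most significant first (the function name and digit size are illustrative assumptions):

```c
#include <stdint.h>

/* Divide an arbitrarily long dividend by a one-"digit" divisor.  The
 * dividend is an array of base-256 digits, most significant first; it
 * is replaced in place by the quotient, and the final remainder (the
 * modulus) is returned.  Each step prepends the previous remainder to
 * the next digit, exactly as in the longhand method above. */
uint8_t divmod_big(uint8_t digits[], int n, uint8_t divisor)
{
    uint16_t rem = 0;
    for (int i = 0; i < n; i++) {
        uint16_t cur = (uint16_t)((rem << 8) | digits[i]);
        digits[i] = (uint8_t)(cur / divisor);  /* next quotient digit  */
        rem = cur % divisor;                   /* carried to next step */
    }
    return (uint8_t)rem;
}
```

Running it on the 6189 ÷ 7 example, with 6189 stored as the base-256 digits {24, 45}, yields the quotient 884 (digits {3, 116}) and remainder 1.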

We have seen that, given a divide operation capable of dividing an n-digit number by an n-digit number, it is possible to divide a dividend with any number of digits by a divisor with n/2 digits. Unfortunately, there is no similar method to deal with an arbitrarily large divisor, or to divide an arbitrarily large dividend by a divisor with more than n/2 digits. In those cases the division must be performed using a general division algorithm as shown previously.

7.4 Big Integer ADT

For some programming tasks, it may be helpful to deal with arbitrarily large integers. For example, the factorial function and Ackermann's function grow very quickly and will overflow a 32-bit integer for small input values. In this section, we will outline an abstract data type which provides basic operations for arbitrarily large integer values. Listing 7.7 shows the C header for this ADT, and Listing 7.8 shows the C implementation. Listing 7.9 shows a small program that uses the bigint ADT to create a table of x! for all x between 0 and 100.

f07-16a-9780128036983f07-16b-9780128036983
Listing 7.7 Header file for a big integer abstract data type.
f07-17a-9780128036983f07-17b-9780128036983f07-17c-9780128036983f07-17d-9780128036983f07-17e-9780128036983f07-17f-9780128036983f07-17g-9780128036983f07-17h-9780128036983f07-17i-9780128036983f07-17j-9780128036983f07-17k-9780128036983f07-17l-9780128036983f07-17m-9780128036983f07-17n-9780128036983f07-17o-9780128036983f07-17p-9780128036983
Listing 7.8 C source code file for a big integer abstract data type.
f07-18a-9780128036983f07-18b-9780128036983
Listing 7.9 Program using the bigint ADT to calculate the factorial function.

The implementation could be made more efficient by writing some of the functions in assembly language. One opportunity for improvement is in the add function, which must calculate the carry from one chunk of bits to the next. In assembly, the programmer has direct access to the carry bit, so carry propagation should be much faster.

When attempting to speed up a C program by converting selected parts of it to assembly language, it is important to first determine where the most significant gains can be made. A profiler, such as gprof, can be used to help identify the sections of code that will matter most. It is also important to make sure that the result is not just highly optimized C code. If the code cannot benefit from some features offered by assembly, then it may not be worth the effort of re-writing in assembly. The code should be re-written from a pure assembly language viewpoint.

It is also important to avoid premature assembly programming. Make sure that the C algorithms and data structures are efficient before moving to assembly. If a better algorithm can give better performance, then assembly may not be required at all. Once the assembly is written, it is more difficult to make major changes to the data structures and algorithms. Assembly language optimization is the final step in optimization, not the first one.

Well-written C code is modularized, with many small functions. This helps readability, promotes code reuse, and may allow the compiler to achieve better optimization. However, each function call has some associated overhead. If optimal performance is the goal, then calling many small functions should be avoided. For instance, if the piece of code to be optimized is in a loop body, then it may be best to write the entire loop in assembly, rather than writing a function and calling it each time through the loop. Writing in assembly is not a guarantee of performance. Spaghetti code is slow. Load/store instructions are slow. Multiplication and division are slow. The secret to good performance is avoiding things that are slow. Good optimization requires rethinking the code to take advantage of assembly language.

The bigint_adc function was re-written in assembly, as shown in Listing 7.10. This function is used internally by several other functions in the bigint ADT to perform addition and subtraction. The profiler indicated that it is used more than any other function. If assembly language can make this function run faster, then it should have a profound effect on the program.

f07-19a-9780128036983f07-19b-9780128036983f07-19c-9780128036983f07-19d-9780128036983
Listing 7.10 ARM assembly implementation of the bigint_adc function.

The bigfact main function was executed 50 times on a Raspberry Pi, using the C version of bigint_adc and then with the assembly version. The total time required using the C version was 27.65 seconds, and the program spent 54.0% of its time (14.931 seconds) in the bigint_adc function. The assembly version ran in 15.07 seconds, and the program spent 15.3% of its time (2.306 seconds) in the bigint_adc function. Therefore the assembly version of the function achieved a speedup of 6.47 over the C implementation. Overall, the program achieved a speedup of 1.83 by writing one function in assembly.

Running gprof on the improved program reveals that most of the time is now spent in the bigint_mul function (63.2%) and two functions that it calls: bigint_mul_uint (39.1%) and bigint_shift_left_chunk (21.6%). It seems clear that optimizing those two functions would further improve performance.

7.5 Chapter Summary

Complement mathematics provides a method for performing all basic operations using only the complement, add, and shift operations. Addition and subtraction are fast, but multiplication and division are relatively slow. In particular, division should be avoided whenever possible. The exception to this rule is division by a power of the radix, which can be implemented as a shift. Good assembly programmers replace division by a constant c with multiplication by the reciprocal of c. They also replace the multiply instruction with a series of shifts and add or subtract operations when it makes sense to do so. These optimizations can make a big difference in performance.

Writing sections of a program in assembly can result in better performance, but it is not guaranteed. The chance of achieving significant performance improvement is increased if the following rules are used:

1. Only optimize the parts that really matter.

2. Design data structures with assembly in mind.

3. Use efficient algorithms and data structures.

4. Write the assembly code last.

5. Ignore the C version and write good, clean, assembly.

6. Reduce function calls wherever it makes sense.

7. Avoid unnecessary memory accesses.

8. Write good code. The compiler will beat poor assembly every time, but good assembly will beat the compiler every time.

Understanding the basic mathematical operations can enable the assembly programmer to work with integers of any arbitrary size with efficiency that cannot be matched by a C compiler. However, it is best to focus the assembly programming on areas where the greatest gains can be made.

Exercises

7.1 Multiply − 90 by 105 using signed 8-bit binary multiplication to form a signed 16-bit result. Show all of your work.

7.2 Multiply 166 by 105 using unsigned 8-bit binary multiplication to form an unsigned 16-bit result. Show all of your work.

7.3 Write a section of ARM assembly code to multiply the value in r1 by 13₁₀ using only shift and add operations.

7.4 The following code will multiply the value in r0 by a constant C. What is C?

u07-36-9780128036983

7.5 Show the optimally efficient instruction(s) necessary to multiply a number in register r0 by the constant 67₁₀.

7.6 Show how to divide 78₁₀ by 6₁₀ using binary long division.

7.7 Demonstrate the division algorithm using a sequence of tables as shown in Section 7.3.2 to divide 155₁₀ by 11₁₀.

7.8 When dividing by a constant value, why is it desirable to have m as large as possible?

7.9 Modify your program from Exercise 5.13 in Chapter 5 to produce a 64-bit result, rather than a 32-bit result.

7.10 Modify your program from Exercise 5.13 in Chapter 5 to produce a 128-bit result, rather than a 32-bit result. How would you do this in C?

7.11 Write the bigint_shift_left_chunk function from Listing 7.8 in ARM assembly, and measure the performance improvement.

7.12 Write the bigint_mul_uint function in ARM assembly, and measure the performance improvement.

7.13 Write the bigint_mul function in ARM assembly, and measure the performance improvement.

Chapter 8

Non-Integral Mathematics

Abstract

This chapter starts by demonstrating how to convert fractional numbers to radix notation in any base. It then presents a theorem that can be used to determine in which bases a given fraction will terminate rather than repeat. That theorem is then used to explain why some base ten fractional numbers cannot be represented in binary with a finite number of bits. Next, fixed-point numbers are introduced. The rules for addition, subtraction, multiplication, and division are given. Division by a constant is explained in terms of fixed-point mathematics. Next, the IEEE floating point formats are explained. The chapter ends with an example showing how fixed-point mathematics can be used to write functions for sine and cosine which give better precision and higher performance than the functions provided by GCC.

Keywords

Fixed point; Radix point; Non-terminating repeating fraction; S/U notation; Q notation; Floating point; Performance

Chapter 7 introduced methods for performing computation using integers. Although many problems can be solved using only integers, it is often necessary (or at least more convenient) to perform computation using real numbers or even complex numbers. For our purposes, a non-integral number is any number that is not an integer. Many systems are only capable of performing computation using binary integers, and have no hardware support for non-integral calculations. In this chapter, we will examine methods for performing non-integral calculations using only integer operations.

8.1 Base Conversion of Fractional Numbers

Section 1.3.2 explained how to convert integers in a given base into any other base. We will now extend the methods to convert fractional values. A fractional number can be viewed as consisting of an integer part, a radix point, and a fractional part. In base 10, the radix point is also known as the decimal point. In base 2, it is called the binimal point. For base 16, it is the heximal point, and in base 8 it is an octimal point. The term radix point is used as a general term for a location that divides a number into integer and fractional parts, without specifying the base.

8.1.1 Arbitrary Base to Decimal

The procedure for converting fractions from a given base b into base ten is very similar to the procedure used for integers. The only difference is that the digit to the left of the radix point is weighted by b^0, and the exponents become increasingly negative for each digit to the right of the radix point. The basic procedure is the same for any base b. For example, the value 101.0101₂ can be converted to base ten by expanding it as follows:

1×2^2 + 0×2^1 + 1×2^0 + 0×2^(−1) + 1×2^(−2) + 0×2^(−3) + 1×2^(−4) = 4 + 0 + 1 + 0 + 1/4 + 0 + 1/16 = 5.3125₁₀

Likewise, the hexadecimal fraction 4F2.9A0₁₆ can be converted to base ten by expanding it as follows:

4×16^2 + 15×16^1 + 2×16^0 + 9×16^(−1) + 10×16^(−2) + 0×16^(−3) = 1024 + 240 + 2 + 9/16 + 10/256 + 0/4096 = 1266.6015625₁₀

8.1.2 Decimal to Arbitrary Base

When converting from base ten into another base, the integer and fractional parts are treated separately. The base conversion for the integer part is performed in exactly the same way as in Section 1.3.2, using repeated division by the base b. The fractional part is converted using repeated multiplication. For example, to convert the decimal value 5.6875₁₀ to a binary representation:

1. Convert the integer portion, 5₁₀, into its binary equivalent, 101₂.

2. Multiply the decimal fraction by two. The integer part of the result is the first binary digit to the right of the radix point.
Because x = 0.6875 × 2 = 1.375, the first binary digit to the right of the point is a 1. So far, we have 5.6875₁₀ = 101.1₂

3. Multiply the fractional part of x by 2 once again.
Because x = 0.375 × 2 = 0.75, the second binary digit to the right of the point is a 0. So far, we have 5.6875₁₀ = 101.10₂

4. Multiply the fractional part of x by 2 once again.
Because x = 0.75 × 2 = 1.50, the third binary digit to the right of the point is a 1. So now we have 5.6875₁₀ = 101.101₂

5. Multiply the fractional part of x by 2 once again.
Because x = 0.5 × 2 = 1.00, the fourth binary digit to the right of the point is a 1. So now we have 5.6875₁₀ = 101.1011₂

6. Since the fractional part is now zero, we know that all remaining digits will be zero.

The procedure for obtaining the fractional part can be accomplished easily using a tabular method, as shown below:

Operation               Integer  Fraction
0.6875 × 2 = 1.375      1        0.375
0.375 × 2 = 0.75        0        0.75
0.75 × 2 = 1.5          1        0.5
0.5 × 2 = 1.0           1        0.0

Putting it all together, 5.6875₁₀ = 101.1011₂. After converting a fraction from base 10 into another base, the result should be verified by converting back into base 10. The results from the previous example can be expanded as follows:

1×2^2 + 0×2^1 + 1×2^0 + 1×2^(−1) + 0×2^(−2) + 1×2^(−3) + 1×2^(−4) = 4 + 0 + 1 + 1/2 + 0 + 1/8 + 1/16 = 5.6875₁₀
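The repeated-multiplication procedure can be sketched in C; the function name and the packed-bit return value are illustrative assumptions:

```c
#include <stdint.h>

/* Convert a decimal fraction (0 <= frac < 1) to its first `bits`
 * binary digits by repeated multiplication by two.  The digits are
 * returned packed into an integer, most significant digit first. */
uint32_t frac_to_binary(double frac, int bits)
{
    uint32_t result = 0;
    for (int i = 0; i < bits; i++) {
        frac *= 2.0;             /* the integer part is the next digit */
        result <<= 1;
        if (frac >= 1.0) {
            result |= 1u;
            frac -= 1.0;
        }
    }
    return result;
}
```

For 0.6875 and four digits it produces 1011₂, matching the table above. (The double type represents 0.6875 exactly, so no rounding occurs in this example.)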

Converting decimal fractions to base sixteen is accomplished in a very similar manner. To convert 842.234375₁₀ into base 16, we first convert the integer portion by repeatedly dividing by 16 to yield 34A₁₆. We then repeatedly multiply the fractional part, extracting the integer portion of the result each time, as shown in the table below:

Operation               Integer  Fraction
0.234375 × 16 = 3.75    3        0.75
0.75 × 16 = 12.0        12       0.0

In the second line, the integer part is 12, which must be replaced with a hexadecimal digit. The hexadecimal digit for 12₁₀ is C, so the fractional part is 3C. Therefore, 842.234375₁₀ = 34A.3C₁₆. The result is verified by converting it back into base 10 as follows:

3×16^2 + 4×16^1 + 10×16^0 + 3×16^(−1) + 12×16^(−2) = 768 + 64 + 10 + 3/16 + 12/256 = 842.234375₁₀

Bases that are powers-of-two

Converting fractional values between binary, hexadecimal, and octal can be accomplished in the same manner as with integer values. However, care must be taken to align the radix point properly. As with integers, converting from hexadecimal or octal to binary is accomplished by replacing each hex or octal digit with the corresponding binary digits from the appropriate table shown in Fig. 1.3.

For example, to convert 5AC.43B₁₆ to binary, we just replace "5" with "0101," "A" with "1010," "C" with "1100," "4" with "0100," "3" with "0011," and "B" with "1011." So, using the table, we can immediately see that 5AC.43B₁₆ = 010110101100.010000111011₂. This method works exactly the same way for converting from octal to binary, except that it uses the table on the right side of Fig. 1.3.

Converting fractional numbers from binary to hexadecimal or octal is also very easy when using the tables. The procedure is to split the binary string into groups of bits, working outwards from the radix point, then replace each group with its hexadecimal or octal equivalent. For example, to convert 01110010.1010111₂ to hexadecimal, just divide the number into groups of four bits, starting at the radix point and working outwards in both directions. It may be necessary to pad with zeroes to make a complete group on the left or right, or both. Our example is grouped as follows: |0000|0111|0010.1010|1110|₂. Now each group of four bits is converted to hexadecimal by looking up the corresponding hex digit in the table on the left side of Fig. 1.3. This yields 072.AE₁₆. For octal, the binary number would be grouped as follows: |001|110|010.101|011|100|₂. Now each group of three bits is converted to octal by looking up the corresponding digit in the table on the right side of Fig. 1.3. This yields 162.534₈.

8.2 Fractions and Bases

One interesting phenomenon that is often encountered is that fractions which terminate in one base may become non-terminating, repeating fractions in another base. For example, the binary representation of the decimal fraction 1/10 is a repeating fraction, as shown in Example 8.1. The resulting fractional part from the last step performed is exactly the same as in the second step. Therefore, the sequence will repeat. If we continue, we will repeat the sequence of steps 2–5 forever. Hence, the final binary representation will be:

0.1₁₀ = 0.00011001100110011…₂ (the group 0011 repeats forever)

Because of this phenomenon, it is impossible to exactly represent 1/10 (and many other fractional quantities) as a binary fraction in a finite number of bits.

Example 8.1

A Non-Terminating, Repeating Binimal

.1 × 2 = 0.2
.2 × 2 = 0.4
.4 × 2 = 0.8
.8 × 2 = 1.6
.6 × 2 = 1.2
.2 × 2 = 0.4

The fact that some base 10 fractions cannot be exactly represented in binary has led to many subtle software bugs and round-off errors when programmers attempt to work with currency (and other quantities) as real-valued numbers. In this section, we explore the idea that the representation problem might be avoided by working in some base other than base 2. If that were the case, then we could simply build hardware (or software) to work in that base, and would be able to represent any fractional value precisely using a finite number of digits. For brevity, we will refer to a binary fractional quantity as a binimal and a decimal fractional quantity as a decimal. We would like to know whether there are more non-terminating decimals than binimals, more non-terminating binimals than decimals, or neither. Since there are an infinite number of non-terminating decimals and an infinite number of non-terminating binimals, we could be tempted to conclude that they are equal. However, that is an oversimplification. If we ask the question differently, we can discover some important information. A better way to ask the question is as follows:

Question: Is the set of terminating decimals a subset of the set of terminating binimals, or vice versa, or neither?

We start by introducing a lemma which can be used to predict whether or not a terminating fraction in one base will terminate in another base. We introduce the notation x|y (read as “x divides y”) to indicate that y can be evenly divided by x.

Lemma 8.2.1

A fraction x, 0 < x < 1, terminates in some base B if and only if x = N_x/D_x, where D_x = p_1^k_1 p_2^k_2 ⋯ p_n^k_n and the p_i are the prime factors of B.

Proof

Let x = N_x/D_x, and D_x = p_1^k_1 p_2^k_2 ⋯ p_n^k_n, where the p_i are the prime factors of B. Then D_x | N_x × B^kmax, where kmax = max(k_1, k_2, …, k_n), so x = N_x/D_x terminates after kmax or fewer divisions.

Let x = N_x/D_x terminate after k divisions. Then D_x | N_x × B^k. Since D_x does not evenly divide N_x, D_x must be composed of some combination of the prime factors of B. Thus, D_x can be expressed as p_1^k_1 p_2^k_2 ⋯ p_n^k_n.

Theorem 8.2.1

The set of terminating binimals is a subset of the set of terminating decimals.

Proof

Let b be a terminating binimal. Then, by Lemma 8.2.1, b = N_b/D_b, such that D_b = 2^k, for some k ≥ 0. Therefore, D_b = 2^k 5^m with m = 0, and again by the Lemma, b is also a terminating decimal.

Theorem 8.2.2

The set of terminating decimals is not a subset of the set of terminating binimals.

Proof

Let d be a terminating decimal such that d = N_d/D_d, where D_d = 2^k 5^m. If m > 0, then by the Lemma, d is a non-terminating binimal.

Answer: The set of terminating binimals is a subset of the set of terminating decimals, but the set of terminating decimals is not a subset of the set of terminating binimals.

Implications

Theorem 8.2.1 implies that any binary fraction can be expressed exactly as a decimal fraction, but Theorem 8.2.2 implies that there are decimal fractions which cannot be expressed exactly in binary. Every fraction (when expressed in lowest terms) which has a non-zero power of five in its denominator cannot be represented in binary with a finite number of bits. Another implication is that some fractions cannot be expressed exactly in either binary or decimal. For example, let B = 30 = 2 × 3 × 5. Then any number with denominator 2^k1 3^k2 5^k3 terminates in base 30. However, if k2 ≠ 0, then the fraction will terminate in neither base two nor base ten, because three is not a prime factor of ten or two.

Another implication of the theorem is that the more prime factors we have in our base, the more fractions we can express exactly. For instance, the smallest base that has two, three, and five as prime factors is base 30. Using that base, we can exactly express fractions in radix notation that cannot be expressed in base ten or in base two with a finite number of digits. For example, in base 30, the fraction 11/15 will terminate after one division since 15 = 3¹5¹. To see what the number will look like, let us extend the hexadecimal system of using letters to represent digits beyond 9. So we get this chart for base 30:

0₁₀ = 0₃₀    6₁₀ = 6₃₀    12₁₀ = C₃₀   18₁₀ = I₃₀   24₁₀ = O₃₀
1₁₀ = 1₃₀    7₁₀ = 7₃₀    13₁₀ = D₃₀   19₁₀ = J₃₀   25₁₀ = P₃₀
2₁₀ = 2₃₀    8₁₀ = 8₃₀    14₁₀ = E₃₀   20₁₀ = K₃₀   26₁₀ = Q₃₀
3₁₀ = 3₃₀    9₁₀ = 9₃₀    15₁₀ = F₃₀   21₁₀ = L₃₀   27₁₀ = R₃₀
4₁₀ = 4₃₀   10₁₀ = A₃₀    16₁₀ = G₃₀   22₁₀ = M₃₀   28₁₀ = S₃₀
5₁₀ = 5₃₀   11₁₀ = B₃₀    17₁₀ = H₃₀   23₁₀ = N₃₀   29₁₀ = T₃₀

Since 11/15 = 22/30, the fraction can be expressed precisely as 0.M₃₀. Likewise, the fraction 13/45 is 0.2888…₁₀ but terminates in base 30. Since 45 = 3²5¹, this number will have two or fewer digits following the radix point. To compute the value, we will have to raise it to higher terms. Using 30² as the denominator gives us:

13/45 = 260/900

Now we can convert it to base 30 by repeated division. 260 ÷ 30 = 8 with remainder 20. Since 20 < 30, we cannot divide again. Therefore, 13/45 in base 30 is 0.8K₃₀.

Although base 30 can represent all fractions that can be expressed in bases two and ten, there are still fractions that cannot be represented in base 30. For example, 1/7 has the prime factor seven in its denominator, and therefore will only terminate in bases that have seven as a prime factor. The fraction 1/7 will terminate in base 7, base 14, base 21, base 42, and many others, but not in base 30. Since there are an infinite number of primes, no number system is immune from this problem. No matter what base the computer works in, there are fractions that cannot be expressed exactly with a finite number of digits. Therefore, it is incumbent upon programmers and hardware designers to be aware of round-off errors and take appropriate steps to minimize their effects.

For example, there is no reason why the hardware clocks in a computer should work in base ten. They can be manufactured to measure time in base two. Instead of counting seconds in tenths, hundredths or thousandths, they could be calibrated to measure in fourths, eighths, sixteenths, 1024ths, etc. This would eliminate the round-off error problem in keeping track of time.

8.3 Fixed-Point Numbers

As shown in the previous section, given a finite number of bits, a computer can only approximately represent non-integral numbers. It is often necessary to accept that limitation and perform computations involving approximate values. With due care and diligence, the results will be accurate within some acceptable error tolerance. One way to deal with real-valued numbers is to simply treat the data as fixed-point numbers. Fixed-point numbers are treated as integers, but the programmer must keep track of the radix point during each operation. We will present a systematic approach to designing fixed-point calculations.

When using fixed-point arithmetic, the programmer needs a convenient way to describe the numbers that are being used. Most languages have standard data types for integers and floating point numbers, but very few have support for fixed-point numbers. Notable exceptions include PL/1 and Ada, which provide support for fixed-point binary and fixed-point decimal numbers. We will focus on fixed-point binary, but the techniques presented can also be applied to fixed-point numbers in any base.

8.3.1 Interpreting Fixed-Point Numbers

Each fixed-point binary number has three important parameters that describe it:

1. whether the number is signed or unsigned,

2. the position of the radix point in relation to the right side of the sign bit (for signed numbers) or the position of the radix point in relation to the most significant bit (for unsigned numbers), and

3. the number of fractional bits stored.

Unsigned fixed-point numbers will be specified as U(i,f), where i is the position of the radix point in relation to the left side of the most significant bit, and f is the number of bits stored in the fractional part.

For example, U(10,6) indicates that there are six bits of precision in the fractional part of the number, and the radix point is ten bits to the right of the most significant bit stored. The layout for this number is shown graphically as:

[layout: iiiiiiiiii.ffffff (ten integer bits, six fraction bits; 16 bits stored)]

where i is an integer bit and f is a fractional bit. Very small numbers with no integer part may have a negative i. For example, U(−8,16) specifies an unsigned number with no integer part, eight leading zero bits which are not actually stored, and 16 bits of fractional precision. The layout for this number is shown graphically as:

[layout: .(00000000)ffffffffffffffff (the eight zero bits after the radix point are not stored; 16 fraction bits are stored)]

Likewise, signed fixed-point numbers will be specified using the following notation: S(i,f), where i is the position of the radix point in relation to the right side of the sign bit, and f is the number of fractional bits stored. As with integer two’s-complement notation, the sign bit is always the leftmost bit stored. For example, S(9,6) indicates that there are six bits in the fractional part of the number, and the radix point is nine bits to the right of the sign bit. The layout for this number is shown graphically as:

[layout: siiiiiiiii.ffffff (sign bit, nine integer bits, six fraction bits)]

where i is an integer bit and f is a fractional bit. Very small numbers with no integer part may have a negative i. For example, S(−7,16) specifies a signed number with no integer part, six leading sign bits which are not actually stored, a sign bit that is stored and 15 bits of fraction. The layout for this number is shown graphically as:

[layout: .(ssssss)sfffffffffffffff (six copies of the sign bit after the radix point are not stored; the stored bits are the sign bit and 15 fraction bits)]

Note that the “hidden” bits in a signed number are assumed to be copies of the sign bit, while the “hidden” bits in an unsigned number are assumed to be zero.

The following figure shows an unsigned fixed-point number with seven bits in the integer part and nine bits in the fractional part. It is a U(7,9) number. Note that the total number of bits is 7 + 9 = 16.

[bit pattern: 0010111.000101011 interpreted as a U(7,9)]

The value of this number in base 10 can be computed by summing the values of each non-zero bit as follows:

2^(13−9) + 2^(11−9) + 2^(10−9) + 2^(9−9) + 2^(5−9) + 2^(3−9) + 2^(1−9) + 2^(0−9) = 2⁴ + 2² + 2¹ + 2⁰ + 2⁻⁴ + 2⁻⁶ + 2⁻⁸ + 2⁻⁹ = 16 + 4 + 2 + 1 + 1/16 + 1/64 + 1/256 + 1/512 = 23.083984375₁₀

Likewise, the following figure shows a signed fixed-point number with nine bits in the integer part and six bits in the fractional part. It is an S(9,6) number. Note that the total number of bits is 9 + 6 + 1 = 16.

[bit pattern: 0010111000.101011 interpreted as an S(9,6)]

The value of this number in base 10 can be computed by summing the values of each non-zero bit as follows:

2^(13−6) + 2^(11−6) + 2^(10−6) + 2^(9−6) + 2^(5−6) + 2^(3−6) + 2^(1−6) + 2^(0−6) = 2⁷ + 2⁵ + 2⁴ + 2³ + 2⁻¹ + 2⁻³ + 2⁻⁵ + 2⁻⁶ = 128 + 32 + 16 + 8 + 1/2 + 1/8 + 1/32 + 1/64 = 184.671875₁₀

Note that in the above two examples, the pattern of bits is identical. The value of a number depends upon how it is interpreted. The notation that we have introduced allows us to easily specify exactly how a number is to be interpreted. For signed values, if the first bit is non-zero, then the two’s complement should be taken before the number is evaluated. For example, the following figure shows an S(8,7) number that has a negative value.

[bit pattern: 10110101.01111010 interpreted as an S(8,7)]

The value of this number in base 10 can be computed by taking the two’s complement, summing the values of the non-zero bits, and adding a negative sign to the result. The two’s complement of 1011010101111010 is 0100101010000101 + 1 = 0100101010000110. The value of this number is:

2^(14−7) + 2^(11−7) + 2^(9−7) + 2^(7−7) + 2^(2−7) + 2^(1−7) = 2⁷ + 2⁴ + 2² + 2⁰ + 2⁻⁵ + 2⁻⁶ = 128 + 16 + 4 + 1 + 1/32 + 1/64 = 149.046875₁₀

For a final example we will interpret this bit pattern as an S(−5,16). In that format, the layout is:

[bit-layout diagram: the same 16 bits interpreted as an S(−5,16)]

The value of this number in base ten can be computed by taking the two’s complement, summing the values of the non-zero bits, and adding a negative sign to the result. The two’s complement is:

[two’s complement: 0100101010000110]

The value of this number interpreted as an S(−5,16) is:

2⁻⁶ + 2⁻⁹ + 2⁻¹¹ + 2⁻¹³ + 2⁻¹⁸ + 2⁻¹⁹ = 0.0181941986083984375

8.3.2 Q Notation

Fixed-point number formats can also be represented using Q notation, which was developed by Texas Instruments. Q notation is equivalent to the S/U format used in this book, except that the integer portion is not always fully specified. In general, Q formats are specified as Qm,n where m is the number of integer bits, and n is the number of fractional bits. If a fixed word size w is being used, then m may be omitted, and is assumed to be w − n. For example, a Q10 number has 10 fractional bits, and the number of integer bits is not specified, but is assumed to be the number of bits required to complete a word of data. A Q2,4 number has two integer bits and four fractional bits in a 6-bit word. There are two conflicting conventions for dealing with the sign bit. In one convention, the sign bit is included as part of m, and in the other convention, it is not. When using Q notation, it is important to state which convention is being used. Additionally, a U may be prefixed to indicate an unsigned value. For example, UQ8.8 is equivalent to U(8,8), and Q7,9 is equivalent to S(7,9).

8.3.3 Properties of Fixed-Point Numbers

Once the decision has been made to use fixed-point calculations, the programmer must make some decisions about the specific representation of each fixed-point variable. The combination of size and radix will affect several properties of the numbers, including:

Precision: the maximum number of non-zero bits representable,

Resolution: the smallest non-zero magnitude representable,

Accuracy: the magnitude of the maximum difference between a true real value and its approximate representation,

Range: the difference between the largest and smallest number that can be represented, and

Dynamic range: the ratio of the maximum absolute value to the minimum positive absolute value representable.

Given a number specified using the notation introduced previously, we can determine its properties. For example, an S(9,6) number has the following properties:

Precision: P = 16 bits

Resolution: R = 2⁻⁶ = 0.015625

Accuracy: A = R/2 = 0.0078125

Range: The minimum value is 1000000000.000000₂ = −512, and the maximum value is 0111111111.111111₂ = 511.984375. The range is G = 511.984375 + 512 = 1023.984375.

Dynamic range: For a signed fixed-point rational representation, S(i,f), the dynamic range is

D = (2 × 2^i)/2^(−f) = 2^(i+f+1) = 2^P.

Therefore, the dynamic range of an S(9,6) is 2¹⁶ = 65536.

Being aware of these properties, the programmer can select fixed-point representations that fit the task that they are trying to solve. This allows the programmer to strive for very efficient code by using the smallest fixed-point representation possible, while still guaranteeing that the results of computations will be within some limits for error tolerance.

8.4 Fixed-Point Operations

Fixed-point numbers are actually stored as integers, and all of the integer mathematical operations can be used. However, some care must be taken to track the radix point at each stage of the computation. The advantages of fixed-point calculations are that the operations are very fast and can be performed on any computer, even if it does not have special hardware support for non-integral numbers.

8.4.1 Fixed-Point Addition and Subtraction

Fixed-point addition and subtraction work exactly like their integer counterparts. Fig. 8.1 gives some examples of fixed-point addition with signed numbers. Note that in each case, the numbers are aligned so that they have the same number of bits in their fractional part. This requirement is the only difference between integer and fixed-point addition. In fact, integer arithmetic is just fixed-point arithmetic with no bits in the fractional part. The arithmetic that was covered in Chapter 7 was fixed-point arithmetic using only S(i,0) and U(i,0) numbers. Now we are simply extending our knowledge to deal with numbers where f≠0. There are some rules which must be followed to ensure that the results are correct. The rules for subtraction are the same as the rules for addition. Since we are using two’s complement math, subtraction is performed using addition.

Figure 8.1 Examples of fixed-point signed arithmetic.

Suppose we want to add an S(7,8) number to an S(7,4) number. The radix points are at different locations, so we cannot simply add them. Instead, we must shift one of the numbers, changing its format, until the radix points are aligned. The choice of which one to shift depends on what format we desire for the result. If we desire eight bits of fraction in our result, then we would shift the S(7,4) left by four bits, converting it into an S(7,8). With the radix points aligned, we simply use an integer addition operation to add the two numbers. The result will have its radix point in the same location as the two numbers being added.

8.4.2 Fixed Point Multiplication

Recall that the result of multiplying an n bit number by an m bit number is an n + m bit number. In the case of fixed-point numbers, the size of the fractional part of the result is the sum of the number of fractional bits of each number, and the total size of the result is the sum of the total number of bits in each number. Consider the following example where two U(5,3) numbers are multiplied together:

[worked example: two U(5,3) numbers multiplied with integer arithmetic; the 16-bit product is a U(10,6)]

The result is a U(10,6) number. The number of bits in the result is the sum of all of the bits of the multiplicand and the multiplier. The number of fractional bits in the result is the sum of the number of fractional bits in the multiplicand and the multiplier. There are three simple rules to predict the resulting format when multiplying any two fixed-point numbers.

Unsigned Multiplication The result of multiplying two unsigned numbers U(i1,f1) and U(i2,f2) is a U(i1 + i2,f1 + f2) number.

Mixed Multiplication The result of multiplying a signed number S(i1,f1) and an unsigned number U(i2,f2) is an S(i1 + i2,f1 + f2) number.

Signed Multiplication The result of multiplying two signed numbers S(i1,f1) and S(i2,f2) is an S(i1 + i2 + 1,f1 + f2) number.

Note that this rule works for integers as well as fixed-point numbers, since integers are really fixed-point numbers with f = 0. If the programmer desires a particular format for the result, then the multiply is followed by an appropriate shift.

Listing 8.1 gives some examples of fixed-point multiplication using the ARM multiply instructions. In each case, the result is shifted to produce the desired format. It is the responsibility of the programmer to know what type of fixed-point number is produced after each multiplication and to adjust the result by shifting if necessary.

Listing 8.1 Examples of fixed-point multiplication in ARM assembly.

8.4.3 Fixed Point Division

Derivation of the rule for determining the format of the result of division is more complicated than the one for multiplication. We will first consider only unsigned division of a dividend with format U(i1,f1) by a divisor with format U(i2,f2).

Results of fixed point division

Consider the results of dividing two fixed-point numbers, using integer operations with limited precision. The value of the least significant bit of the dividend N is 2^(−f1) and the value of the least significant bit of the divisor D is 2^(−f2). In order to perform the division using integer operations, it is necessary to multiply N by 2^f1 and multiply D by 2^f2 so that both numbers are integers. Therefore, the division operation can be written as:

Q = (N × 2^f1)/(D × 2^f2) = (N/D) × 2^(f1−f2).

Note that no multiplication is actually performed. Instead, the programmer mentally shifts the radix point of the divisor and dividend, then computes the radix point of the result. For example, given two U(5,3) numbers, the division operation is accomplished by converting them both to integers, performing the division, then computing the location of the radix point:

Q = (N × 2³)/(D × 2³) = (N/D) × 2⁰.

Note that the result is an integer. If the programmer wants to have some fractional bits in the result, then the dividend must be shifted to the left before the division is performed.

If the programmer wants to have fq fractional bits in the quotient, then the amount that the dividend must be shifted can easily be computed as

s = fq + f2 - f1.

For example, suppose the programmer wants to divide 01001.011 stored as a U(28,3) by 00011.110 which is also stored as a U(28,3), and wishes to have six fractional bits in the result. The programmer would first shift 01001.011 to the left by six bits, then perform the division and compute the position of the radix in the result as shown:

[worked example: 1001011000000₂ ÷ 11110₂ = 10100000₂; placing the radix point six bits from the right gives 10.100000₂ = 2.5₁₀]

Since the divisor may be between zero and one, the quotient may actually require more integer bits than there are in the dividend. Consider that the largest possible value of the dividend is Nmax = 2^i1 − 2^(−f1), and the smallest positive value for the divisor is Dmin = 2^(−f2). Therefore, the maximum quotient is given by:

Qmax = (2^i1 − 2^(−f1))/2^(−f2) = 2^(i1+f2) − 2^(f2−f1).

Taking the limit of the previous equation,

lim(f1→∞) Qmax = 2^(i1+f2),

provides the following bound on how many bits are required in the integer part of the quotient:

Qmax < 2^(i1+f2).

Therefore, in the worst case, the quotient will require i1 + f2 integer bits. For example, if we divide a U(3,5), a = 111.11111₂ = 7.96875₁₀, by a U(5,3), b = 00000.001₂ = 0.125₁₀, we end up with a U(6,2), q = 111111.11₂ = 63.75₁₀.

The same thought process can be used to determine the results for signed division as well as mixed division between signed and unsigned numbers. The results can be reduced to the following three rules:

Unsigned Division The result of dividing an unsigned fixed-point number U(i1,f1) by an unsigned number U(i2,f2) is a U(i1 + f2, f1 − f2) number.

Mixed Division The result of dividing two fixed-point numbers where one of them is signed and the other is unsigned is an S(i1 + f2, f1 − f2) number.

Signed Division The result of dividing two signed fixed-point numbers is an S(i1 + f2 + 1, f1 − f2) number.

Consider the results when a U(3,3), a = 000.001₂ = 0.125₁₀, is divided by a U(4,5), b = 1000.00000₂ = 8.0₁₀. The quotient is q = 0.000001₂, which requires six bits in the fractional part. However, if we simply perform the division, then according to the rules shown above, the result will be a U(8,−2). There is no such thing as a U(8,−2), so the result is meaningless.

When f2 > f1, blindly applying the rules will result in a negative fractional part. To avoid this, the dividend can be shifted left so that it has at least as many fractional bits as the divisor. This leads to the following rule: if f2 > f1, then convert the dividend to an S(i1,x), where x ≥ f2, then apply the appropriate rule. For example, dividing an S(5,2) by a U(3,12) would result in an S(17,−10). But shifting the S(5,2) 16 bits to the left will result in an S(5,18), and dividing that by a U(3,12) will result in an S(17,6).

Maintaining precision

Recall that integer division produces a result and a remainder. In order to maintain precision, it is necessary to perform the integer division operation in such a way that all of the significant bits are in the result and only insignificant bits are left in the remainder. The easiest way to accomplish this is by shifting the dividend to the left before the division is performed.

To find a rule for determining the shift necessary to maintain full precision in the quotient, consider the worst case. The minimum positive value of the dividend is Nmin = 2^(−f1) and the largest positive value for the divisor is Dmax = 2^i2 − 2^(−f2). Therefore, the minimum positive quotient is given by:

Qmin = 2^(−f1)/(2^i2 − 2^(−f2)) = 2^f2/(2^(f1+i2+f2) − 2^f1) ≈ 1/2^(f1+i2) = 2^(−(i2+f1)).

Therefore, in the worst case, the quotient will require i2 + f1 fractional bits to maintain precision. However, fewer bits can be reserved if full precision is not required.

Recall that the least significant bit of the quotient will be 2^(−(i2+f1)). Shifting the dividend left by i2 + f2 bits will convert it into a U(i1, i2 + f1 + f2). Using the rule above, when it is divided by a U(i2,f2), the result is a U(i1 + f2, i2 + f1). This is the minimum size which is guaranteed to preserve all bits of precision. The general method for performing fixed-point division while maintaining maximum precision is as follows:

1. shift the dividend left by i2 + f2, then

2. perform integer division.

The result will be a U(i1 + f2,i2 + f1) for unsigned division, or an S(i1 + f2 + 1,i2 + f1) for signed division. The result for mixed division is left as an exercise for the student.

8.4.4 Division by a Constant

Section 7.3.3 introduced the idea of converting division by a constant into multiplication by the reciprocal of that constant. In that section it was shown that by pre-multiplying the reciprocal by a power of two (a shift operation), then dividing the final result by the same power of two (a shift operation), division by a constant could be performed using only integer operations with a more efficient multiply replacing the (usually) very slow divide.

This section presents an alternate way to achieve the same results, by treating division by an integer constant as an application of fixed-point multiplication. Again, the integer constant divisor is converted into its reciprocal, but this time the process is considered from the viewpoint of fixed-point mathematics. Both methods will achieve exactly the same results, but some people tend to grasp the fixed-point approach better than the purely integer approach.

When writing code to divide by a constant, the programmer must strive to achieve the largest number of significant bits possible, while using the shortest (and most efficient) representation possible. On modern computers, this usually means using 32-bit integers and integer multiply operations which produce 64-bit results. That would be extremely tedious to show in a textbook, so the principles will be demonstrated here using 8-bit integers and an integer multiply which produces a 16-bit result.

Division by constant 23

Suppose we want efficient code to calculate x ÷ 23 using only 8-bit signed integer multiplication. The reciprocal of 23, in binary, is

R = 1/23 = 0.0000101100100001011₂.

If we store R as an S(1,11), it would look like this:

[bit layout: R stored as an S(1,11)]

Note that in this format, the reciprocal of 23 has five leading zeros. We can store R in eight bits by shifting it left to remove some of the leading zeros. Each shift to the left changes the format of R. After removing the first leading zero bit, we have:

[bit layout after removing one leading zero]

After removing the second leading zero bit, we have:

[bit layout after removing two leading zeros]

After removing the third leading zero bit, we have:

[bit layout after removing three leading zeros]

Note that the number in the previous format has a “hidden” bit between the radix point and the sign bit. That bit is not actually stored, but is assumed to be identical to the sign bit. Removing the fourth leading zero produces:

[bit layout after removing four leading zeros]

The number in the previous format has two “hidden” bits between the radix point and the sign bit. Those bits are not actually stored, but are assumed to be identical to the sign bit. Removing the fifth leading zero produces:

[bit layout after removing five leading zeros: R as an S(−4,8)]

We can only remove five leading zero bits, because removing one more would change the sign bit from 0 to 1, resulting in a completely different number. Note that the final format has three “hidden” bits between the radix point and the sign bit. These bits are all copies of the sign bit. It is an S(−4,8) number because the sign is four bits to the right of the radix point (resulting in the three “hidden” bits). According to the rules of fixed-point multiplication given earlier, an S(7,0) number x multiplied by an S(−4,8) number R will yield an S(4,8) number y. The value y will be 2³ × x/23 because we have three “hidden” bits to the right of the radix point. Therefore,

x/23 = (R × x)/2³,

indicating that after the multiplication, we must shift the result right by three bits to restore the radix. Since 1/23 is positive, the number R must be increased by one to avoid round-off error. Therefore, we will use R + 1 = 01011010₂ = 90₁₀ in our multiply operation. To calculate y = 101₁₀ ÷ 23₁₀, we can multiply and perform a shift as follows:

[worked example: 01100101₂ × 01011010₂ = 0010001110000010₂]

Because our task is to implement integer division, everything to the right of the radix point can be immediately discarded, keeping only the upper eight bits as the integer portion of the result. The integer portion, 100011₂, shifted right three bits, is 100₂ = 4₁₀. If the modulus is required, it can be calculated as: 101 − (4 × 23) = 9. Some processors, such as the Motorola HC11, have a special multiply instruction which keeps only the upper half of the result. This method would be especially efficient on that processor. Listing 8.2 shows how the 8-bit division code would be implemented in ARM assembly. Listing 8.3 shows an alternate implementation which uses shift and add operations rather than a multiply.

Listing 8.2 Dividing x by 23
Listing 8.3 Dividing x by 23 Using Only Shift and Add

Division by constant −50

The procedure is exactly the same for dividing by a negative constant. Suppose we want efficient code to calculate x ÷ (−50) using 16-bit signed integers. We first convert 1/50 into binary:

1/50 = 0.0000010100011110…₂

The two’s complement of 1/50 is

−1/50 = 1.1111101011100001…₂

We can represent −1/50 as the following S(1,21) fixed-point number:

[bit layout: −1/50 stored as an S(1,21)]

Note that the upper seven bits are all one. We can remove six of those bits and adjust the format as follows. After removing the first leading one, the reciprocal is:

[bit layout after removing one leading one]

Removing another leading one changes the format to:

[bit layout after removing two leading ones]

On the next step, the format is:

[bit layout after removing three leading ones]

Note that we now have a “hidden” bit between the radix point and the sign bit. The hidden bit is not actually part of the number that we store and use in the computation, but it is assumed to be the same as the sign bit.

After three more leading ones are removed, the format is:

[bit layout after removing six leading ones: R = 1010111000010101 as an S(−5,16)]

Note that there are four “hidden” bits between the radix point and the sign. Since the reciprocal −1/50 is negative, we do not need to round by adding one to the number R. Therefore, we will use R = 1010111000010101₂ = AE15₁₆ in our multiply operation.

Since we are using 16-bit integer operations, the dividend, x, will be an S(15,0). The product of an S(15,0) and an S(−5,16) will be an S(11,16). We will remove the 16 fractional bits by shifting right. The four “hidden” bits indicate that the result must be shifted an additional four bits to the right, resulting in a total shift of 20 bits. Listing 8.4 shows how the 16-bit division code would be implemented in ARM assembly.

f08-05-9780128036983
Listing 8.4 Dividing x by −50
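The same idea can be sketched in C for the signed case. This sketch does not use Listing 8.4's constants (which fold the sign and the hidden bits into R); instead it multiplies the magnitude by an unsigned reciprocal, R = ⌈2^21/50⌉ = 41944, and applies the sign explicitly, which keeps the truncation behavior easy to verify.

```c
#include <stdint.h>

/* Sketch of 16-bit signed division by the constant -50: multiply the
 * magnitude by R = ceil(2^21 / 50) = 41944, shift away the fraction
 * bits, then flip the sign (dividing by a NEGATIVE constant negates
 * the quotient of the magnitudes). */
static inline int16_t div_m50(int16_t x) {
    uint32_t m = (x < 0) ? (uint32_t)(-(int32_t)x) : (uint32_t)x;
    uint32_t q = (m * 41944u) >> 21;        /* |x| / 50, exact for |x| <= 32768 */
    return (int16_t)((x < 0) ? (int32_t)q : -(int32_t)q);
}
```

Working on the magnitude and applying the sign afterward gives truncation toward zero, matching C's `/` operator for signed operands.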

8.5 Floating Point Numbers

Sometimes we need more range than we can easily get from a fixed-point representation. One approach to solving this problem is to create an aggregate data type that can represent a fractional number by having fields for an exponent, a sign bit, and an integer mantissa. For example, in C, we could represent a fractional number using the data structure shown in Listing 8.5. That data structure, along with some subroutines for addition, subtraction, multiplication, and division, would provide the capability to perform arithmetic without explicitly tracking the radix point. The subroutines for the basic arithmetical operations could do that, thereby freeing the programmer to work at a higher level.

f08-06-9780128036983
Listing 8.5 Inefficient representation of a binimal.

The structure shown in Listing 8.5 is a rather inefficient way to represent a fractional number, and may result in different data structures on different machines. The sign only requires one bit, yet it occupies an entire int, and the sizes of the exponent and mantissa fields depend upon the machine on which the code is compiled. Only one bit is really needed for the sign, eight bits for the exponent, and 23 bits for the mantissa.

The C language includes the notion of bit fields, which allow the programmer to specify exactly how many bits are to be used for each field within a struct. Listing 8.6 shows a C data structure that consumes 32 bits on all machines and architectures. It provides the same fields as the structure in Listing 8.5, but specifies exactly how many bits each field consumes.

f08-07-9780128036983
Listing 8.6 Efficient representation of a binimal.

The compiler will compress this data structure into 32 bits, regardless of the natural word size of the machine.
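Since Listing 8.6 appears only as a figure here, a structure along the following lines illustrates the idea. The field names and their order are assumptions (C leaves bit-field layout to the implementation), but mainstream compilers pack these three fields into one 32-bit word.

```c
/* A 23-bit mantissa, 8-bit exponent, and 1-bit sign packed into a
 * single 32-bit word using C bit fields.  Field names and ordering
 * are assumptions; exact layout is compiler-dependent. */
struct binimal {
    unsigned int mantissa : 23;   /* integer significand           */
    unsigned int exponent :  8;   /* power-of-two scale factor     */
    unsigned int sign     :  1;   /* 0 = positive, 1 = negative    */
};
```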

The method of representing fractional numbers as a sign, exponent, and mantissa is very powerful, and the IEEE has set standards for various floating point formats. These formats can be described using bit fields in C, as described above. Many processors have hardware that is specifically designed to perform arithmetic on data in the standard IEEE formats.

The IEEE standard specifies the bitwise representation for numbers, and specifies parameters for how arithmetic is to be performed. The IEEE standard for numbers includes the possibility of having numbers that cannot be easily represented. For example, any quantity that is greater than the most positive representable value is positive infinity, and any quantity that is less than the most negative representable value is negative infinity. There are special bit patterns to encode these quantities. The programmer or hardware designer is responsible for ensuring that their implementation conforms to the IEEE standards. The following sections describe some of the IEEE standard data formats.

8.5.1 IEEE 754 Half-Precision

The half-precision format gives a 16-bit encoding for fractional numbers with a small range and low precision. There are situations where this format is adequate. If the computation is being performed on a very small machine, then using this format may result in significantly better performance than could be attained using one of the larger IEEE formats. However, in most situations, the programmer can achieve better performance and/or precision by using a fixed-point representation. The format is as follows:

u08-21-9780128036983

• The Significand (a.k.a. “Mantissa”) is stored using a sign-magnitude coding, with bit 15 being the sign bit.

• The exponent is an excess-15 number. That is, the number stored is 15 greater than the actual exponent.

• There are 10 bits of significand, but there are 11 bits of significand precision. There is a “hidden” bit, m₁₀, between m₉ and e₀. When a number is stored in this format, it is shifted until its leftmost non-zero bit is in the hidden bit position, and the hidden bit is not actually stored. The exception to this rule is when the number is zero or very close to zero. The radix point is assumed to be between the hidden bit and the first bit stored. The radix point is then shifted by the exponent.

Table 8.1 shows how to interpret IEEE 754 Half-Precision numbers. The exponents 00000 and 11111 have special meaning. The value 00000 is used to represent zero and numbers very close to zero, and the exponent value 11111 is used to represent infinity and NaN. NaN, which is the abbreviation for not a number, is a value representing an undefined or unrepresentable value. One way to get NaN as a result is to divide infinity by infinity. Another is to divide zero by zero. The NaN value can indicate that there is a bug in the program, or that a calculation must be performed using a different method.

Table 8.1

Format for IEEE 754 half-precision

Exponent | Significand = 0 | Significand ≠ 0 | Equation
00000 | ±0 | subnormal | (−1)^sign × 2^−14 × 0.significand
00001 … 11110 | normalized value | normalized value | (−1)^sign × 2^(exp−15) × 1.significand
11111 | ±∞ | NaN |


Subnormal means that the value is too close to zero to be completely normalized. The minimum strictly positive (subnormal) value is 2^−24 ≈ 5.96 × 10^−8. The minimum positive normal value is 2^−14 ≈ 6.10 × 10^−5. The maximum exactly representable value is (2 − 2^−10) × 2^15 = 65504.

Examples

The following bit value:

u08-22-9780128036983

represents

+1.1000101011₂ × 2^(01011₂ − 01111₂) = 1.1000101011₂ × 2^−4 = 0.00011000101011₂ ≈ 0.09637₁₀.

The following bit value:

u08-23-9780128036983

represents

−1.0000100101₂ × 2^(11001₂ − 01111₂) = −1.0000100101₂ × 2^10 = −10000100101.0₂ = −1061₁₀.
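The interpretation rules of Table 8.1 can be collected into a small decoder. This is a helper of our own (not from the book), and the test bit patterns are reconstructed from the decoded values in the nearby worked examples.

```c
#include <stdint.h>
#include <math.h>

/* Decode an IEEE 754 half-precision bit pattern into a double,
 * following the three cases of Table 8.1. */
double half_to_double(uint16_t h) {
    int sign = (h >> 15) & 1;
    int exp  = (h >> 10) & 0x1F;
    int frac = h & 0x3FF;
    double v;
    if (exp == 0)                /* zero and subnormals: 2^-14 * 0.frac */
        v = ldexp((double)frac / 1024.0, -14);
    else if (exp == 31)          /* all-ones exponent: infinity or NaN  */
        v = frac ? NAN : INFINITY;
    else                         /* normalized: 2^(exp-15) * 1.frac     */
        v = ldexp(1.0 + (double)frac / 1024.0, exp - 15);
    return sign ? -v : v;
}
```

The pattern 0x7BFF (exponent 11110, significand all ones) decodes to 65504, the maximum exactly representable value quoted above.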

8.5.2 IEEE 754 Single-Precision

The single precision format provides a 23-bit mantissa and an 8-bit exponent, which is enough to represent a reasonably large range with reasonable precision. This type can be stored in 32 bits, so it is relatively compact. At the time that the IEEE standards were defined, most machines used a 32-bit word, and were optimized for moving and processing data in 32-bit quantities. For many applications this format represents a good trade-off between performance and precision.

u08-24-9780128036983

8.5.3 IEEE 754 Double-Precision

The double-precision format was designed to provide enough range and precision for most scientific computing requirements. It provides an 11-bit exponent and a 52-bit mantissa (53 bits of significand precision, counting the hidden bit). When the IEEE 754 standard was introduced, this format was not supported by most hardware. That has changed. Most modern floating point hardware is optimized for the IEEE 754 double-precision standard, and most modern processors are designed to move 64-bit or larger quantities. On modern floating-point hardware, this is the most efficient representation.

However, processing large arrays of double-precision data requires twice as much memory, and twice as much memory bandwidth, as single-precision.

u08-25-9780128036983

8.5.4 IEEE 754 Quad-Precision

The IEEE 754 Quad-Precision format was designed to provide enough range and precision for very demanding applications. It provides a 15-bit exponent and a 112-bit mantissa (113 bits of significand precision, counting the hidden bit). This format is still not supported by most hardware. The first hardware floating point unit to support this format was the SPARC V8 architecture. As of this writing, the popular Intel x86 family, including the 64-bit versions of the processor, does not have hardware support for the IEEE 754 quad-precision format. On modern high-end processors such as the SPARC, this may be an efficient representation. However, for mid-range processors such as the Intel x86 family and the ARM, this format is definitely out of their league.

u08-26-9780128036983

8.6 Floating Point Operations

Many processors do not have hardware support for floating point. On those processors, all floating point operations must be performed in software. Processors that do support floating point in hardware must have quite sophisticated circuitry to manage the basic operations on data in the IEEE 754 standard formats. Regardless of whether the operations are carried out in software or hardware, the basic arithmetic operations require multiple steps.

8.6.1 Floating Point Addition and Subtraction

The steps required for addition and subtraction of floating point numbers are the same, regardless of the specific format. The steps for adding or subtracting two floating point numbers a and b are as follows:

1. Extract the exponents Ea and Eb.

2. Extract the significands Ma and Mb, and convert them into 2's complement numbers, using the signs Sa and Sb.

3. Shift the significand with the smaller exponent right by |Ea − Eb|.

4. Perform addition (or subtraction) on the significands to get the significand of the result, Mr. Remember that the result may require one more significant bit to avoid overflow.

5. If Mr is negative, then take the 2’s complement and set Sr to 1. Otherwise set Sr to 0.

6. Shift Mr until the leftmost 1 is in the “hidden” bit position, and adjust the larger of the two exponents by the amount of the shift to form the new exponent Er.

7. Combine the sign Sr, the exponent Er, and significand Mr to form the result.

The complete algorithm must also provide for correct handling of infinity and NaN.
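The steps above can be sketched for the half-precision format. This is a simplified illustration, not a complete implementation: it handles only normalized, finite inputs and truncates rather than rounds, and it aligns the significands before applying the signs so that right shifts are performed only on non-negative values.

```c
#include <stdint.h>

/* Add two half-precision numbers by following the seven steps.
 * Sketch only: normalized, finite inputs; truncating rounding. */
uint16_t half_add(uint16_t a, uint16_t b) {
    int ea = (a >> 10) & 0x1F;                 /* step 1: exponents        */
    int eb = (b >> 10) & 0x1F;
    int32_t ma = (a & 0x3FF) | 0x400;          /* step 2: significands,    */
    int32_t mb = (b & 0x3FF) | 0x400;          /*   restoring hidden bits  */
    int er = ea > eb ? ea : eb;
    if (ea < eb) ma >>= (eb - ea);             /* step 3: align the radix  */
    else         mb >>= (ea - eb);             /*   points                 */
    if (a & 0x8000) ma = -ma;                  /* apply signs (2's         */
    if (b & 0x8000) mb = -mb;                  /*   complement)            */
    int32_t mr = ma + mb;                      /* step 4: add              */
    int sr = 0;
    if (mr < 0) { sr = 1; mr = -mr; }          /* step 5: sign of result   */
    if (mr == 0) return 0;
    while (mr >= 0x800) { mr >>= 1; er++; }    /* step 6: renormalize so   */
    while (mr <  0x400) { mr <<= 1; er--; }    /*   bit 10 is hidden bit   */
    return (uint16_t)((sr << 15) | ((er & 0x1F) << 10) | (mr & 0x3FF));
                                               /* step 7: combine fields   */
}
```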

8.6.2 Floating Point Multiplication and Division

Multiplication and division of floating point numbers also require several steps. The steps for multiplication and division of two floating point numbers a and b are as follows:

1. Calculate the sign of the result Sr.

2. Extract the exponents Ea and Eb.

3. Extract the significands Ma and Mb.

4. Multiply (or divide) the significands to form Mr.

5. Add (or subtract) the exponents to get Er, remembering to correct for the excess-N bias.

6. Shift Mr until the leftmost 1 is in the “hidden” bit position, and add the shift amount to Er.

7. Combine the sign Sr, the exponent Er, and significand Mr to form the result.

The complete algorithm must also provide for correct handling of infinity and NaN.
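These steps can likewise be sketched for the half-precision format (again a simplified illustration: normalized, finite inputs, truncating rounding).

```c
#include <stdint.h>

/* Multiply two half-precision numbers by following the steps.
 * Sketch only: normalized, finite inputs; truncating rounding. */
uint16_t half_mul(uint16_t a, uint16_t b) {
    int sr = ((a ^ b) >> 15) & 1;              /* step 1: sign of result   */
    int ea = (a >> 10) & 0x1F;                 /* step 2: exponents        */
    int eb = (b >> 10) & 0x1F;
    uint32_t ma = (a & 0x3FF) | 0x400;         /* step 3: significands,    */
    uint32_t mb = (b & 0x3FF) | 0x400;         /*   restoring hidden bits  */
    uint32_t mr = (ma * mb) >> 10;             /* step 4: 22-bit product   */
                                               /*   truncated to 11 bits   */
    int er = ea + eb - 15;                     /* step 5: add exponents,   */
                                               /*   removing one bias      */
    if (mr >= 0x800) { mr >>= 1; er++; }       /* step 6: [1,2)x[1,2) lies */
                                               /*   in [1,4): renormalize  */
    return (uint16_t)((sr << 15) | ((er & 0x1F) << 10) | (mr & 0x3FF));
                                               /* step 7: combine fields   */
}
```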

8.7 Computing Sine and Cosine

It has been said, and is commonly accepted, that “you can’t beat the compiler.” The meaning of this statement is that using hand-coded assembly language is futile and/or worthless because the compiler is “smarter” than a human. This statement is a myth, as will now be demonstrated.

There are many mathematical functions that are useful in programming. Two of the most useful functions are sin(x) and cos(x). However, these functions are not always implemented in hardware, particularly for fixed-point representations. If these functions are required for fixed-point computation, then they must be written in software. These two functions have some nice properties that can be exploited. In particular:

• If we have the sin(x) function, then we can calculate cos(x) using the relationship

cos(x) = sin(π/2 − x).  (8.1)

Therefore, we only need to get the sine function working, and then we can implement cosine with only a little extra effort.

• sin(x) is cyclical, so sin(−2π) = sin(0) = sin(2π). This means that we can limit the domain of our function to the range [−π, π].

• sin(x) is symmetric, so that sin(−x) = −sin(x). This means that we can further restrict the domain to [0, π].

• After we restrict the domain to [0, π], we notice another symmetry, sin(x) = sin(π − x) for π/2 ≤ x ≤ π, and we can further restrict the domain to [0, π/2].

• The range of both functions, sin(x) and cos(x), is [−1, 1].

If we exploit all of these properties, then we can write a single shared function to be used by both sine and cosine. We will name this function sinq, and choose the following fixed-point formats:

• sinq will accept x as an S(1,30), and

• sinq will return an S(1,30).

These formats were chosen because S(1,30) is a good format for storing a signed number between zero and π/2, and is also the optimal format for storing a signed number between negative one and one.

The sine function will map x into the domain accepted by sinq and then call sinq to do the actual work. If the result should be negative, then the sine function will negate it before returning. The cosine function will use the relationship previously mentioned, and call the sine function.

We have now reduced the problem to one of approximating sin(x) within the range [0, π/2]. An approximation to the function sin(x) can be calculated using the Taylor series:

sin(x) = Σ_{n=0}^{∞} (−1)^n x^(2n+1) / (2n+1)!.  (8.2)

The first few terms of the series should be sufficient to achieve a good approximation. The maximum value possible for the seventh term is (0.5π)^13/13! ≈ 0.000000057, which indicates that our function should be accurate to about 24 bits using seven terms. If more accuracy is desired, then additional terms can be added.

8.7.1 Formats for the Powers of x

The numerators in the first nine terms of the Taylor series approximation are: x, x³, x⁵, x⁷, x⁹, x¹¹, x¹³, x¹⁵, and x¹⁷. Given an S(1,30) format for x, we can predict the format for the numerator of each successive term in the Taylor series. If we simply perform successive multiplies, then we would get the following formats for the powers of x:

Term | Format | 32-bit
x | S(1,30) | S(1,30)
x³ | S(3,90) | S(3,28)
x⁵ | S(5,150) | S(5,26)
x⁷ | S(7,210) | S(7,24)
x⁹ | S(9,270) | S(9,22)
x¹¹ | S(11,330) | S(11,20)
x¹³ | S(13,390) | S(13,18)

The middle column in the table shows that the format for x¹⁷ would require 528 bits if all of the fractional bits were retained. Dealing with a number at that level of precision would be slow and impractical. We will, of necessity, need to limit the number of bits used. Since the ARM processor provides a multiply instruction involving two 32-bit numbers, we choose to truncate the numerators to 32 bits. The third column in the table indicates the resulting format for each term if precision is limited to 32 bits.

On further consideration of the Taylor series, we notice that each of the above terms will be divided by a constant. Instead of dividing, we can multiply by the reciprocal of the constant. We will create a similar table holding the formats and constants for the factorial terms. With a bit of luck, the division (implemented as multiplication) in each term will result in a reasonable format for each resulting term.

8.7.2 Formats and Constants for the Factorial Terms

The first term of the Taylor series is x/1!, so we can simply skip the division. The second term is x³/3! = x³ × (1/3!) and the third term is x⁵/5! = x⁵ × (1/5!). We can convert 1/3! to binary as follows:

Multiplication | Integer | Fraction
1/6 × 2 = 2/6 | 0 | 2/6
2/6 × 2 = 4/6 | 0 | 4/6
4/6 × 2 = 8/6 | 1 | 2/6
2/6 × 2 = 4/6 | 0 | 4/6
4/6 × 2 = 8/6 | 1 | 2/6


Since the pattern repeats, we can conclude that 1/3! = 0.0010101…₂, with the pair of bits 01 repeating. Since we need a negative number (this term is subtracted in the series), we take the two's complement, resulting in −1/3! = 111.1101010…₂. Represented as an S(1,30), this would be

u08-27-9780128036983

Since the first four bits are one, we can remove three bits and store it as:

u08-28-9780128036983

In hexadecimal, this is AAAAAAAA₁₆.
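The repeated-doubling conversion used in the table above can be captured in a small helper (our own, for illustration), which packs the generated bits into an integer:

```c
/* Return the first n binary-fraction bits of p/q (with p < q), packed
 * into the low n bits of the result, most significant bit first.  Each
 * doubling yields one output bit (the integer part) and a new
 * remainder, exactly as in the conversion tables. */
unsigned frac_bits(unsigned p, unsigned q, int n) {
    unsigned bits = 0;
    for (int i = 0; i < n; i++) {
        p *= 2;                 /* one doubling step           */
        bits <<= 1;
        if (p >= q) {           /* integer part of p/q is 1    */
            bits |= 1;
            p -= q;             /* keep only the fraction      */
        }
    }
    return bits;
}
```

For 1/6 the result 170 is 0010101010₂, showing the repeating 01 pair; for 1/120 the result 17 is 00000010001₂.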

Performing the same operations, we find that 1/5! can be converted to binary as follows:

Multiplication | Integer | Fraction
1/120 × 2 = 2/120 | 0 | 2/120
2/120 × 2 = 4/120 | 0 | 4/120
4/120 × 2 = 8/120 | 0 | 8/120
8/120 × 2 = 16/120 | 0 | 16/120
16/120 × 2 = 32/120 | 0 | 32/120
32/120 × 2 = 64/120 | 0 | 64/120
64/120 × 2 = 128/120 | 1 | 8/120


Since the fraction in the seventh row is the same as the fraction in the third row, we know that the table will repeat forever. Therefore, 1/5! = 0.000000100010001…₂, with the group of bits 0001 repeating. Since the first six bits to the right of the radix point are all zero, we can remove the first five bits. Also adding one to the least significant bit to account for rounding error yields the following S(−6,32):

u08-29-9780128036983

In hexadecimal, the number to be multiplied is 44444445₁₆. Note that since 1/5! is a positive number, the reciprocal was incremented by one to avoid round-off errors. We can apply the same procedure to the remaining terms, resulting in the following table:

Term | Reciprocal Format | Reciprocal Value (Hex)
1/3! | S(−2,32) | AAAAAAAA
1/5! | S(−6,32) | 44444445
1/7! | S(−12,32) | 97F97F97
1/9! | S(−18,32) | 5C778E96
1/11! | S(−25,32) | 9466EA60
1/13! | S(−32,32) | 5849184F

8.7.3 Putting it All Together

We want to keep as much precision as is reasonably possible for our intermediate calculations. Using 64 bits of precision for all intermediate calculations will give a good trade-off between performance and precision. The integer portion should never require more than two bits, so we choose an S(2,61) as our intermediate representation. If we combine the previous two tables, we can determine what the format of each complete term will be. This is shown in Table 8.2.

Table 8.2

Result formats for each term

Term | Numerator (Value) | Numerator (Format) | Reciprocal (Value) | Reciprocal (Format) | Reciprocal (Hex) | Result Format
1 | x | S(1,30) | extend to 64 bits and shift right | | | S(2,61)
2 | x³ | S(3,28) | 1/3! | S(−2,32) | AAAAAAAA | S(2,61)
3 | x⁵ | S(5,26) | 1/5! | S(−6,32) | 44444445 | S(0,63)
4 | x⁷ | S(7,24) | 1/7! | S(−12,32) | 97F97F97 | S(−4,64)
5 | x⁹ | S(9,22) | 1/9! | S(−18,32) | 5C778E96 | S(−8,64)
6 | x¹¹ | S(11,20) | 1/11! | S(−25,32) | 9466EA60 | S(−13,64)
7 | x¹³ | S(13,18) | 1/13! | S(−32,32) | 5849184F | S(−18,64)


Note that the formats were truncated to fit in a 64-bit result. We can now see that the formats for the first seven terms of the Taylor series are reasonably similar. They all require exactly 64 bits, and the radix points can be shifted so that they are aligned for addition. In order to make the shifting and adding process easier, we will pre-compute the shift amounts and store them in a look-up table.

Table 8.3 shows the shifts that are necessary to convert each term to an S(2,61) so that it can be added to the running total.

Table 8.3

Shifts required for each term

Term Number | Original Format | Shift Amount | Resulting Format
1 | S(1,30) | 1 | S(2,61)
2 | S(2,61) | 0 | S(2,61)
3 | S(0,63) | 2 | S(2,61)
4 | S(−4,64) | 6 | S(2,61)
5 | S(−8,64) | 10 | S(2,61)
6 | S(−13,64) | 15 | S(2,61)
7 | S(−18,64) | 20 | S(2,61)


Note that the seventh term contributes very little to the final 32-bit sum which is stored in the upper 32 bits of the running total. We now have all of the information that we need in order to implement the function. Listing 8.7 shows how the sine and cosine functions can be implemented in ARM assembly using fixed-point computation, and Listing 8.8 shows a main program which prints a table of values and their sines and cosines.

f08-08a-9780128036983f08-08b-9780128036983f08-08c-9780128036983f08-08d-9780128036983f08-08e-9780128036983f08-08f-9780128036983
Listing 8.7 ARM assembly implementation of sin(x) and cos(x) using fixed-point calculations.
f08-09a-9780128036983f08-09b-9780128036983
Listing 8.8 Example showing how the sin(x) and cos(x) functions can be used to print a table.

8.7.4 Performance Comparison

In some situations it can be very advantageous to use fixed-point math. For example, when using an ARMv6 or older processor, there may not be a hardware floating point unit available. Table 8.4 shows the CPU time required for running a program to compute the sine function on 10,000,000 random values, using various implementations of the sine function. In each case, the program's main() function was written in C. The only difference in the six implementations was the data type (which could be fixed-point, IEEE single precision, or IEEE double precision) and the sine function that was used. The times shown in the table include only the amount of CPU time actually used in the sine function, and do not include the time required for program startup, storage allocation, random number generation, printing results, or program exit. The six implementations are as follows:

Table 8.4

Performance of sine function with various implementations

Optimization | Implementation | CPU seconds
None | 32-bit Fixed Point Assembly | 3.85
None | 32-bit Fixed Point C | 18.99
None | Single Precision Software Float C | 56.69
None | Double Precision Software Float C | 55.95
None | Single Precision VFP C | 11.60
None | Double Precision VFP C | 11.48
Full | 32-bit Fixed Point Assembly | 3.22
Full | 32-bit Fixed Point C | 5.02
Full | Single Precision Software Float C | 20.53
Full | Double Precision Software Float C | 54.51
Full | Single Precision VFP C | 3.70
Full | Double Precision VFP C | 11.08

32-bit Fixed Point Assembly The sine function is computed using the code shown in Listing 8.7.

32-bit Fixed Point C The sine function is computed using exactly the same algorithm as in Listing 8.7, but it is implemented in C rather than Assembly.

Single Precision Software Float C Sine is computed using the floating point sine function which is provided by the GCC C compiler. The code is compiled for an ARMv6 or earlier processor without hardware floating point support. The C code is written to use IEEE single precision floating point numbers.

Double Precision Software Float C Exactly the same as the previous method, but using IEEE double precision instead of single precision.

Single Precision VFP C Sine is computed using the floating point sine function which is provided by the GCC C compiler. The code is compiled for the ARMv6 or later processor using hardware floating point support. The C code is written to use IEEE single precision floating point numbers.

Double Precision VFP C Same as the previous method, but using IEEE double precision instead of single precision.

Each of the six implementations was compiled both with and without compiler optimizations, resulting in a total of 12 test cases. All cases were run on a standard Raspberry Pi model B with the default CPU clock rate.

From Table 8.4, it is clear that the fixed-point implementation written in assembly beats the code generated by the compiler in every case. The closest that the compiler can get is when it can use the VFP hardware floating point unit and the compiler is run with full optimization. Even in that case the fixed-point assembly implementation is almost 15% faster than the single precision floating point implementation, and has 33% more precision (32 bits versus 24 bits). In the worst case, when a VFP hardware unit is not available, the assembly code beats the compiler by a whopping 638% in speed and 33% in precision for single precision floats, and is 1692% faster than double precision floating point at a cost of 41% in precision. Note that even with floating point hardware support, fixed point in assembly is still 3.44 times as fast as the C compiler code.

Similar results could be obtained on any processor architecture, and any reasonably complex mathematical problem. When developing software for small systems, the developer must weigh the costs and benefits of alternative implementations. For battery powered systems, it is important to realize that choices of hardware and software can affect power consumption even more strongly than computing performance. First, the power used by a system which includes a hardware floating point processor will be consistently higher than that of a system without one. Second, the reduction in processing time required for the job is closely related to the reduction in power required. Therefore, for battery operated systems, a fixed-point implementation could greatly extend battery life. The following statements summarize the results from the experiment in this section:

1. A competent assembly programmer can beat the compiler, in some cases by a very large margin.

2. If computational performance is critical, then a well-designed fixed-point implementation will usually outperform even a hardware-accelerated floating point implementation.

3. If there is no hardware support for floating point, then floating point performance is extremely poor, and fixed point will always provide the best performance.

4. If battery life is a consideration, then a fixed-point implementation can have an enormous advantage.

Note also from the table that the assembly language version of the fixed-point sine function beats the identical C version by a wide margin. Section 9.8.2 will demonstrate that a good assembly language programmer who is familiar with the floating point hardware can beat the compiler by an even wider performance margin.

8.8 Ethics Case Study: Patriot Missile Failure

Fixed-point arithmetic is very efficient on modern computers. However it is incumbent upon the programmer to track the radix point at all stages of the computation, and to ensure that a sufficient number of bits are provided on both sides of the radix point. The programmer must ensure that all computations are carried out with the desired level of precision, resolution, accuracy, range, and dynamic range. Failure to do so can have serious consequences.

On February 25, 1991, during the Gulf War, an American Patriot Missile battery in Dhahran, Saudi Arabia, failed to intercept an incoming Iraqi SCUD missile. The SCUD struck an American army barracks, killing 28 soldiers and injuring around 98 other people. The cause was an inaccurate calculation of the time elapsed since the system was last booted.

The hardware clock on the system counted the time in tenths of a second since the last reboot. Current time, in seconds, was calculated by multiplying that number by 1/10. For this calculation, 1/10 was represented as a U(1,23) fixed-point number. Since 1/10 cannot be represented precisely in a fixed number of bits, there was round-off error in the calculations. The small imprecision, when multiplied by a large number, resulted in significant error. The longer the system ran after boot, the larger the error became.

The system determined whether or not it should fire by predicting where the incoming missile would be at a specific time in the future. The time and predicted location were then fed to a second system which was responsible for locking onto the target and firing the Patriot missile. The system would only fire when the missile was at the proper location at the specified time. If the radar did not detect the incoming missile at the correct time and location, then the system would not fire.

At the time of the failure, the Patriot battery had been up for around 100 h. We can estimate the error in the timing calculations by considering how the binary number was stored. The binary representation of 1/10 is 0.000110011001100…₂, with the group of bits 0011 repeating. Note that it is a non-terminating, repeating binimal. The 24-bit register in the Patriot could only hold the following set of bits:

u08-30-9780128036983

This resulted in an error of 0.00000000000000000000000110011001100…₂ (twenty-three zeros, then the group 1100 repeating). The error can be computed in base 10 as:

e = 2^−24 + 2^−25 + 2^−28 + 2^−29 + 2^−32 + 2^−33 + ⋯  (8.3)

  = Σ_{i=0}^{∞} [2^−(4i+24) + 2^−(4i+25)]  (8.4)

  ≈ 9.5 × 10^−8.  (8.5)

To find out how much error was in the total time calculation, we multiply e by the number of tenths of a second in 100 h. This gives 9.5 × 10^−8 × 100 × 60 × 60 × 10 ≈ 0.34 s. A SCUD missile travels at about 1,676 m/s. Therefore it travels about 570 m in 0.34 s. Because of this, the targeting and firing system was expecting to find the SCUD at a location that was over half a kilometer from where it really was. This was far enough that the incoming SCUD was outside the “range gate” that the Patriot tracked. It did not detect the SCUD at its predicted location, so it could not lock on and fire the Patriot.
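The arithmetic above can be reproduced with a short back-of-envelope calculation in C (assuming, per the U(1,23) format, that 1/10 was truncated to 23 fractional bits):

```c
#include <math.h>

/* Per-tick truncation error when 1/10 is stored with 23 fraction bits,
 * and the resulting clock drift after the given number of hours. */
double patriot_drift_seconds(double hours) {
    double stored = floor(0.1 * 8388608.0) / 8388608.0; /* 0.1 truncated; 2^23 = 8388608 */
    double e = 0.1 - stored;                /* ~9.5e-8 s of error per 0.1 s tick */
    double ticks = hours * 3600.0 * 10.0;   /* the clock counts tenths of a second */
    return e * ticks;
}
```

For 100 hours this yields a drift of about 0.34 s; multiplying by 1,676 m/s gives a position error of roughly 570 m.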

This is an example of how a seemingly insignificant error can lead to a major failure. In this case, it led to loss of life and serious injury. Ironically, one factor that contributed to the problem was that part of the code had been modified to provide more accurate timing calculations, while another part had not. This meant that the inaccuracies did not cancel each other. Had both sections of code been re-written, or neither section changed, then the issue probably would not have surfaced.

The Patriot system was originally designed in 1974 to be mobile and to defend against aircraft that move much more slowly than ballistic missiles. It was expected that the system would be moved often, and therefore the computer would be rebooted frequently. Also, a slow-moving aircraft would be much easier to track, and the error in predicting its expected position would not be significant. The system was modified in 1986 to be capable of shooting down Soviet ballistic missiles. A SCUD missile travels at about twice the speed of the Soviet missiles that the system was re-designed for.

The system was deployed to Iraq in 1990, and successfully shot down a SCUD missile in January of 1991. In mid-February of 1991, Israeli troops discovered that the system became inaccurate if it was allowed to run for long periods of time. They claimed that the system would become unreliable after 20 hours of operation. The U.S. military did not think the discovery was significant, but on February 16th, a software update was released. Unfortunately, the update could not immediately reach all units because of wartime difficulties in transportation. The Army released a memo on February 21st, stating that the system was not to be run for “very long times,” but did not specify how long a “very long time” would be. The software update reached Dhahran one day after the Patriot Missile system failed to intercept a SCUD missile, resulting in the death of 28 Americans and many more injuries.

Part of the reason this error was not found sooner was that the program was written in assembly language, and had been patched several times in its 15-year life. The code was difficult to understand and maintain, and did not conform to good programming practices. The people who worked to modify the code to handle the SCUD missiles were not as familiar with the code as they would have been if it had been written more recently, and time was a critical factor. Prolonged testing could have caused a disaster by keeping the system out of the hands of soldiers in a time of war. The people at Raytheon Labs had some tough decisions to make. It cannot be said that Raytheon was guilty of negligence or malpractice. The problem was not necessarily with the developers, but with the fact that the system was modified often and in inconsistent ways, without complete understanding.

8.9 Chapter Summary

Sometimes it is desirable to perform calculations involving non-integral numbers. The two common ways to represent non-integral numbers in a computer are fixed point and floating point. A fixed point representation allows the programmer to perform calculations with non-integral numbers using only integer operations. With fixed point, the programmer must track the radix point throughout the computation. Floating point representations allow the radix point to be tracked automatically, but require much more complex software and/or hardware. Fixed point will usually provide better performance than floating point, but requires more programming skill.

Fractional numbers in radix notation may not terminate in all bases. Numbers which terminate in base two will also terminate in base ten, but the converse is not true. Programmers should avoid counting using fractions which do not terminate in base two, because it leads to the accumulation of round-off errors.

Exercises

8.1 Perform the following base conversions:

(a) Convert 10110.001₂ to base ten.

(b) Convert 11000.0101₂ to base ten.

(c) Convert 10.125₁₀ to binary.

8.2 Complete the following table (assume all values represent positive fixed-point numbers):

Base 10 | Base 2 | Base 16 | Base 13
49.125 | | |
 | 101011.011 | |
 | | AF.3 |
 | | | 12


8.3 You are working on a problem involving real numbers between −2 and 2 on a computer that has 16-bit integer registers and no hardware floating point support. You decide to use 16-bit fixed-point arithmetic.

(a) What fixed-point format should you use?

(b) Draw a diagram showing the sign, if any, radix point, integer part, and fractional part.

(c) What are the precision, resolution, accuracy, and range of your format?

8.4 What is the resulting type of each of the following fixed-point operations?

(a) S(24,7)×S(27,15)

(b) S(3,4)÷U(4,20)

8.5 Convert 26.640625₁₀ to a binary U(18,14) representation. Show the ARM assembly code necessary to load that value into register r4.

8.6 For each of the following fractions, indicate whether or not it will terminate in bases 2, 5, 7, and 10.

(a) 13/64

(b) 37/60

(c) 25/74

(d) 39/1250

(e) 17/343

8.7 What is the exact value of the binary number 0011011100011010 when interpreted as an IEEE half-precision number? Give your answer in base ten.

8.8 The “Software Engineering Code of Ethics And Professional Practice” states that a responsible software engineer should “Approve software only if they have well-founded belief that it is safe, meets specifications, passes appropriate tests…” (sub-principle 1.03) and “Ensure adequate testing, debugging, and review of software…on which they work” (sub-principle 3.10).
The software engineering code of ethics also states that a responsible software engineer should “Treat all forms of software maintenance with the same professionalism as new development.”

(a) Explain how the Software Engineering Code of Ethics and Professional Practice was violated by the Patriot Missile system developers.

(b) How should the engineers and managers at Raytheon have responded when they were asked to modify the Patriot Missile System to work outside of its original design parameters?

(c) What other ethical and non-ethical considerations may have contributed to the disaster?

Chapter 9

The ARM Vector Floating Point Coprocessor

Abstract

This chapter begins by giving an overview of the ARM Vector Floating Point (VFP) coprocessor and the ARM VFP register set. Next, it gives an overview of the Floating Point Status and Control Register (FPSCR). It then explains RunFast mode, which gives higher performance but is not fully compliant with the IEEE floating point standards. That is followed by an explanation of vector mode, which can give an additional performance boost in some situations. Then, after a short discussion of the register usage rules, it describes each of the VFP instructions, providing a short description of each one. Next, it presents four implementations of a function to calculate sine using the ARM VFP coprocessor, and shows that they are all significantly faster than the implementation provided by GCC.

Keywords

Floating point; Vector; IEEE Compliance; Performance

Some ARM processors have dedicated hardware to support floating point operations. For ARMv7 and previous architectures, floating point is provided by an optional Vector Floating Point (VFP) coprocessor. Many newer processors also support the NEON extensions, which are covered in Chapter 10. The remainder of this chapter will explain the VFP coprocessor.

9.1 Vector Floating Point Overview

There are four major revisions of the VFP coprocessor:

VFPv1: Obsolete

VFPv2: An optional extension to the ARMv5 and ARMv6 processors. VFPv2 has 16 64-bit FPU registers.

VFPv3: An optional extension to the ARMv7 processors. It is backwards compatible with VFPv2, except that it cannot trap floating-point exceptions. VFPv3-D32 has 32 64-bit FPU registers. Some processors have VFPv3-D16, which supports only 16 64-bit FPU registers. VFPv3 adds several new instructions to the VFP instruction set.

VFPv4: Implemented on some Cortex ARMv7 processors. VFPv4 has 32 64-bit FPU registers. It adds both half-precision extensions and multiply-accumulate instructions to the features of VFPv3. Some processors have VFPv4-D16, which supports only 16 64-bit FPU registers.

Fig. 9.1 shows the 16 ARM integer registers, and the additional registers provided by the VFP coprocessor. Banks four through seven are only present on the VFPv3-D32 and VFPv4-D32 versions of the coprocessor. Note that each register in Banks zero through three can be used to store either one 64-bit number or two 32-bit numbers. For example, double precision register d0 may also be referred to as single precision registers s0 and s1. Each 32-bit VFP register can hold an integer or a single precision floating point number. Registers in Banks four through seven cannot be used as single precision registers.

f09-01-9780128036983
Figure 9.1 ARM integer and vector floating point user program registers.

The VFP adds about 23 new instructions to the ARM instruction set. The exact number of VFP instructions depends on the specific version of the VFP coprocessor. Instructions are provided to:

 transfer floating point values between VFP registers,

 transfer floating-point values between the VFP coprocessor registers and main memory,

 transfer 32-bit values between the VFP coprocessor registers and the ARM integer registers,

 perform addition, subtraction, multiplication, and division, involving two source registers and a destination register,

 compute the square root of a value,

 perform combined multiply-accumulate operations,

 perform conversions between various integer, fixed point, and floating point representations, and

 compare floating-point values.

In addition to performing basic operations involving two source registers and one destination register, VFP instructions can also perform operations involving registers arranged as short vectors (arrays) of up to eight single-precision values or four double-precision values. A single instruction can be used to perform operations on all of the elements of such vectors. This feature can substantially accelerate computation on arrays and matrices of floating point data. This type of data is common in graphics and signal processing applications. Vector mode can reduce code size and increase speed of execution by supporting parallel operations and multiple transfers.

9.2 Floating Point Status and Control Register

The Floating Point Status and Control Register (FPSCR) is similar to the CPSR register. The FPSCR stores status bits from floating point operations in much the same way as the CPSR stores status bits from integer operations. The programmer can also write to certain bits in the FPSCR to control the behavior of the VFP coprocessor. The layout of the FPSCR is shown in Fig. 9.2. The meaning of each field is as follows:

f09-02-9780128036983
Figure 9.2 Bits in the FPSCR.

N The Negative flag is set to one by vcmp if Fd < Fm.

Z The Zero flag is set to one by vcmp if Fd = Fm.

C The Carry flag is set to one by vcmp if Fd = Fm, or Fd > Fm, or Fd and Fm are unordered.

V The oVerflow flag is set to one by vcmp if Fd and Fm are unordered.

QC NEON only. The saturation cumulative flag is set to one by saturating instructions if saturation has occurred.

DN Default NaN enable:

0: Disable Default NaN mode. NaN operands propagate through to the output of a floating-point operation.

1: Enable Default NaN mode. Any operation involving one or more NaNs returns the default NaN.

The default single precision NaN is 7FC00000₁₆ and the default double precision NaN is 7FF8000000000000₁₆. Default NaN mode does not comply with the IEEE 754 standard, but may increase performance. NEON instructions ignore this bit and always use Default NaN mode.

FZ Flush-to-Zero enable:

0: Disable Flush-to-Zero mode.

1: Enable Flush-to-Zero mode.

Flush-to-Zero mode replaces subnormal numbers with 0. This does not comply with the IEEE 754 standard, but may increase performance. NEON instructions ignore this bit and always use Flush-to-Zero mode.

RMODE Rounding mode:

00 Round to Nearest (RN).

01 Round towards Plus infinity (RP).

10 Round towards Minus infinity (RM).

11 Round towards Zero (RZ).

NEON instructions ignore these bits and always use Round to Nearest mode.

STRIDE Sets the stride (distance between items) for vector operations:

00 Stride is 1.

01 Reserved.

10 Reserved.

11 Stride is 2.

LEN Sets the vector length for vector operations:

000 Vector length is 1 (scalar mode).

001 Vector length is 2.

010 Vector length is 3.

011 Vector length is 4.

100 Vector length is 5.

101 Vector length is 6.

110 Vector length is 7.

111 Vector length is 8.

IDE Input Denormal (subnormal) exception Enable:

0: Exception disabled.

1: An exception is generated when one or more operand is subnormal.

IXE IneXact exception Enable:

0: Exception disabled.

1: An exception is generated when the result contains more significand bits than the destination format can contain, and must be rounded.

UFE UnderFlow exception Enable:

0: Exception disabled.

1: An exception is generated when the result is closer to zero than can be represented by the destination format.

OFE OverFlow exception Enable:

0: Exception disabled.

1: An exception is generated when the result is farther from zero than can be represented by the destination format.

DZE Division by Zero exception Enable:

0: Exception disabled.

1: An exception is generated by divide instructions when the divisor is zero or subnormal.

IOE Invalid Operation exception Enable:

0: Exception disabled.

1: An exception is generated when the result is not defined, or cannot be represented. For example, adding positive and negative infinity gives an invalid result.

IDC The Input Subnormal Cumulative flag is set to one when an IDE condition has occurred.

IXC The IneXact Cumulative flag is set to one when an IXE condition has occurred.

UFC The UnderFlow Cumulative flag is set to one when a UFE condition has occurred.

OFC The OverFlow Cumulative flag is set to one when an OFE condition has occurred.

DZC The Division by Zero Cumulative flag is set to one when a DZE condition has occurred.

IOC The Invalid Operation Cumulative flag is set to one when an IOE condition has occurred.

The only VFP instruction that can be used to update the status flags in the FPSCR is vcmp, which is similar to the integer cmp instruction. To use the FPSCR flags to control conditional instructions, including conditional VFP instructions, they must first be moved into the CPSR register. Table 9.1 shows the meanings of the FPSCR flags when they are transferred to the CPSR and used for conditional execution on following instructions. The following rules govern how the bits in the FPSCR may be changed by subroutines:

Table 9.1

Condition code meanings for ARM and VFP

<cond>   ARM Data Processing Instruction      VFP vcmp Instruction
AL       Always                               Always
EQ       Equal                                Equal
NE       Not equal                            Not equal, or unordered
GE       Signed greater than or equal         Greater than or equal
LT       Signed less than                     Less than, or unordered
GT       Signed greater than                  Greater than
LE       Signed less than or equal            Less than or equal, or unordered
HI       Unsigned higher                      Greater than, or unordered
LS       Unsigned lower or same               Less than or equal
HS       Carry set/unsigned higher or same    Greater than or equal, or unordered
CS       Same as HS                           Same as HS
LO       Carry clear/unsigned lower           Less than
CC       Same as LO                           Same as LO
MI       Negative                             Less than
PL       Positive or zero                     Greater than or equal, or unordered
VS       Overflow                             Unordered (at least one NaN operand)
VC       No overflow                          Not unordered

1. Bits 27–31, 0–4, and 7 do not need to be preserved.

2. Subroutines may modify bits 8–12, 15, and 22–25, but the practice is discouraged. These bits should only be changed by specific support subroutines which change the global state of the program. If they are modified within a subroutine, then their original value must be restored before the function returns or calls another function.

3. Bits 16–18 and bits 20–21 may be changed by a subroutine, but must be set to zero before the function returns or calls another function.

4. All other bits are reserved for future use and must not be modified.

9.2.1 Performance Versus Compliance

Floating point operations are complex, and there are many special cases, such as dealing with NaNs, infinities, and subnormals. These special cases are a normal part of performing floating point math, but they are relatively infrequent. In order to simplify the hardware, many special situations which occur infrequently are handled by software. When one of these exceptional situations occurs, the VFP hardware sets the appropriate flags in the FPSCR and generates an interrupt. The ARM CPU then executes an interrupt handler to deal with the exceptional situation. When the routine finishes, it returns to the point where the exception occurred and execution resumes just as if the situation had been dealt with by the hardware. This approach is taken by many processor architectures to reduce the complexity, cost, and/or power consumption of the floating point hardware. This approach also allows the programmer to make a trade-off between performance and strict IEEE 754 compliance.

Full-compliance mode

The support code for dealing with VFP exceptions is included in most ARM-based operating systems. Even bare-metal embedded systems can include the VFP support service routines. With the support code enabled, the VFP coprocessor is fully compliant with the IEEE 754 standard. However, using the fully compliant mode does increase the average run-time for floating point code, and increases the size of the operating system kernel or embedded system code.

RunFast mode

When all of the VFP exceptions are disabled, Default NaN mode is enabled, and Flush-to-Zero is enabled, the VFP is not fully compliant with the IEEE 754 standard. However, floating point code runs significantly faster. For that reason, the state when bits 8–12 and bit 15 are set to zero while bits 24 and 25 are set to one is referred to as RunFast mode. There is some loss of accuracy for very small values, but the hardware no longer has to check for many of the conditions that may stall the floating point pipeline. This results in fewer stalls and much higher throughput in the hardware, as well as eliminating the necessity to handle exceptions in software. Many other floating point architectures have similar modes, so the GCC developers have found it worthwhile to provide programmers with the option of using them. User applications can be compiled to use this mode with GCC by using the -ffast-math and/or -Ofast options during compilation and linking. The startup code in the C standard library will then set the VFP to RunFast mode before calling the main function.

9.2.2 Vector Mode

A VFP vector consists of up to eight single-precision registers, or up to four double-precision registers. All of the registers in a vector must be in the same bank. Also, vectors cannot be stored in Bank 0 or Bank 4. For example, registers s8 through s10 could be treated as a vector of three single-precision values. Registers s14 through s17 cannot be treated as a vector because some of those registers are in Bank 1 and others are in Bank 2. Registers d0 through d3 cannot be treated as a vector because they are in Bank 0.

The LEN field in the FPSCR controls the length of vectors that are used for vector operations. In vector operations, the first register in the vector is given as the operand, and the remaining registers are inferred from the settings of LEN and STRIDE. The STRIDE field allows data to be interleaved. For example, if the stride is set to two, and length is set to four, then the vector starting at s8 would consist of registers s8, s10, s12, and s14, while the vector starting at s9 would consist of registers s9, s11, s13, and s15. If a vector runs off the end of a bank, then the address wraps around to the first register in the bank. For example, if length is set to six and stride is set to one, then the vector starting at s13 would consist of s13, s14, s15, s8, s9, and s10, in that order.

The vector-capable data-processing instructions have one of the following two forms:

f09-03-9780128036983

where Op is the VFP instruction, Fd is the destination register (or the first register in a vector), Fn is an operand register (or the first register in a vector), and Fm is an operand register (or the first register in a vector). Most data-processing instructions can operate in scalar mode, mixed mode, or vector mode. The mode depends on the LEN bits in the FPSCR, as well as on which register banks contain the destination and operand(s).

 The operation is scalar if the LEN field is set to zero (scalar mode) or the destination operand, Fd, is in Bank 0 or Bank 4. The operation acts on Fm (and Fn if the operation uses two operands) and places the result in Fd.

 The operation is mixed if the LEN field is not set to zero and Fm is in Bank 0 or Bank 4 but Fd is not. If the operation has only one operand, then the operation is applied to Fm and copies of the result are stored into each register in the destination vector. If the operation has two operands, then it is applied with the scalar Fm and each element in the vector starting at Fn, and the result is stored in the vector beginning at Fd.

 The operation is vector if the LEN field is not set to zero and neither Fd nor Fm is in Bank 0 or Bank 4. If the operation has only one operand, then the operation is applied to the vector starting at Fm and the results are placed in the vector starting at Fd. If the operation has two operands, then it is applied with corresponding elements from the vectors starting at Fm and Fn, and the result is stored in the vector beginning at Fd.

9.3 Register Usage Rules

As with the integer registers, there are rules for using the VFP registers. These rules are a convention, and following the convention ensures interoperability between code written by different programmers and compilers. Registers s16 through s31 are non-volatile. This implies that d8 through d15 are also non-volatile, since they are really the same registers. The contents of these registers must be preserved across subroutine calls. The remaining registers (s0 through s15, also known as d0 through d7) are volatile. They are used for passing arguments, returning results, and for holding local variables. They do not need to be preserved by subroutines. If registers d16 through d31 are present, then they are also considered volatile.

In addition to the FPSCR, all VFP implementations contain at least two additional system registers. The Floating-point System ID register (FPSID) is a read-only register whose value indicates which VFP implementation is being provided. The contents of the FPSID can be transferred to an ARM integer register, then examined to determine which VFP version is available. There is also a Floating-point Exception register (FPEXC). Two bits of the FPEXC register provide system-level status and control. The remaining bits of this register are defined by the sub-architecture. These additional system registers should not be accessed by user applications.

9.4 Load/Store Instructions

The VFP provides several instructions for moving data between memory and the VFP registers. There are instructions for loading and storing single and double precision registers, and for moving multiple registers to or from memory. All of the load and store instructions require a memory address to be in one of the ARM integer registers.

9.4.1 Load/Store Single Register

The following instructions are used to load or store a single VFP register:

vldr Load VFP Register, and

vstr Store VFP Register.

Syntax

v<op>r{<cond>}{.<prec>} Fd, [Rn{,#offset}]

vldr{<cond>}{.<prec>} Fd, =label (the =label form is a pseudo-instruction, valid only for loads)

 <op> may be either ld or st.

 Fd may be any single or double precision register.

 Rn may be any ARM integer register.

 <cond> is an optional condition code.

 <prec> may be either f32 or f64.

Operations

Name   Effect                Description
vldr   Fd ← Mem[Rn+offset]   Load Fd using Rn as a pointer
vstr   Mem[Rn+offset] ← Fd   Store Fd using Rn as a pointer

Examples

f09-04-9780128036983

9.4.2 Load/Store Multiple Registers

These instructions load or store multiple floating-point registers:

vldm Load Multiple VFP Registers, and

vstm Store Multiple VFP Registers.

As with the integer ldm and stm instructions, there are multiple versions for use in moving data and accessing stacks.

Syntax

v<op>m<mode>{<cond>}{.<prec>} Rn{!},<list>

vpush{<cond>}{.<prec>} <list>

vpop{<cond>}{.<prec>} <list>

 <op> may be either ld or st.

 <mode> is one of

ia Increment address after each transfer.

db Decrement address before each transfer.

 Rn may be any ARM integer register.

 <cond> is an optional condition code.

 <prec> may be either f32 or f64.

 <list> may be any set of contiguous single precision registers, or any set of contiguous double precision registers.

 If mode is db then the ! is required.

 vpop <list> is equivalent to vldmia sp!,<list>.

 vpush <list> is equivalent to vstmdb sp!,<list>.

Operations

Name: vldmia
Effect:
    addr ← Rd
    for i ∈ register_list do
        i ← Mem[addr]
        if single then
            addr ← addr + 4
        else
            addr ← addr + 8
        end if
    end for
    if ! is present then
        Rd ← addr
    end if
Description: Load multiple registers from memory starting at the address in Rd. Increment address after each load.

Name: vstmia
Effect:
    addr ← Rd
    for i ∈ register_list do
        Mem[addr] ← i
        if single then
            addr ← addr + 4
        else
            addr ← addr + 8
        end if
    end for
    if ! is present then
        Rd ← addr
    end if
Description: Store multiple registers in memory starting at the address in Rd. Increment address after each store.

Name: vldmdb
Effect:
    addr ← Rd
    for i ∈ register_list do
        if single then
            addr ← addr − 4
        else
            addr ← addr − 8
        end if
        i ← Mem[addr]
    end for
    Rd ← addr
Description: Load multiple registers from memory starting at the address in Rd. Decrement address before each load.

Name: vstmdb
Effect:
    addr ← Rd
    for i ∈ register_list do
        if single then
            addr ← addr − 4
        else
            addr ← addr − 8
        end if
        Mem[addr] ← i
    end for
    Rd ← addr
Description: Store multiple registers in memory starting at the address in Rd. Decrement address before each store.

Examples

f09-05-9780128036983

9.5 Data Processing Instructions

These operations are vector-capable. For details on how to use vector mode, refer to Section 9.2.2. Instructions are provided to perform the four basic arithmetic functions, plus absolute value, negation, and square root. There are also special forms of the multiply instructions that perform multiply-accumulate.

9.5.1 Copy, Absolute Value, Negate, and Square Root

The unary operations require one source operand and a destination register. The source and destination can be the same register. There are four unary operations:

vcpy Copy VFP Register (equivalent to move),

vabs Absolute Value,

vneg Negate, and

vsqrt Square Root.

Syntax

v<op>{<cond>}.<prec> Fd, Fm

 <op> is one of cpy, abs, neg, or sqrt.

 <cond> is an optional condition code.

 <prec> may be either f32 or f64.

Operations

Name    Effect      Description
vcpy    Fd ← Fm     Copy
vabs    Fd ← |Fm|   Absolute Value
vneg    Fd ← −Fm    Negate
vsqrt   Fd ← √Fm    Square Root

Examples

f09-06-9780128036983

9.5.2 Add, Subtract, Multiply, and Divide

The basic mathematical operations require two source operands and one destination. There are five basic mathematical operations:

vadd Add,

vsub Subtract,

vmul Multiply,

vnmul Negate and Multiply, and

vdiv Divide.

Syntax

v<op>{<cond>}.<prec> Fd, Fn, Fm

 <op> is one of add, sub, mul, nmul, or div.

 <cond> is an optional condition code.

 <prec> may be either f32 or f64.

Operations

Name    Effect            Description
vadd    Fd ← Fn + Fm      Add
vsub    Fd ← Fn − Fm      Subtract
vmul    Fd ← Fn × Fm      Multiply
vnmul   Fd ← −(Fn × Fm)   Negate and multiply
vdiv    Fd ← Fn ÷ Fm      Divide

Examples

f09-07-9780128036983

9.5.3 Compare

The compare instruction subtracts the value in Fm from the value in Fd and sets the flags in the FPSCR based on the result. The comparison will raise an exception if one of the operands is a signaling NaN. There is also a version of the instruction that will raise an exception if either operand is any type of NaN. The two comparison instructions are:

vcmp Compare, and

vcmpe Compare with Exception.

Syntax

vcmp{e}{<cond>}.<prec> Fd, Fm

 If e is present, an exception is raised if either operand is any kind of NaN. Otherwise, an exception is raised only if either operand is a signaling NaN.

 <cond> is an optional condition code.

 <prec> may be either f32 or f64.

Operations

Name   Effect                         Description
vcmp   FPSCR flags ← flags(Fd − Fm)   Compare two registers

Examples

f09-08-9780128036983

9.6 Data Movement Instructions

With the addition of the VFP registers, there are many more possibilities for how data can be moved. VFP registers may be 32 or 64 bits wide, which results in several possible combinations for moving data among the registers. The VFP instruction set includes instructions for moving data between two VFP registers, between VFP and integer registers, and between the various system registers.

9.6.1 Moving Between Two VFP Registers

The most basic move instruction involving VFP registers simply moves data between two floating point registers. The instruction is:

vmov Move Between VFP Registers.

Syntax

vmov{<cond>}{.<prec>} Fd, Fm

 F can be s or d.

 Fd and Fm must be the same size.

 <cond> is an optional condition code.

 <prec> is either f32 or f64.

Operations

Name   Effect    Description
vmov   Fd ← Fm   Move Fm to Fd

Examples

f09-09-9780128036983

9.6.2 Moving Between VFP Register and One Integer Register

This version of the move instruction allows 32 bits of data to be moved between an ARM integer register and a floating point register. The instruction is:

vmov Move Between VFP and One ARM Integer Register.

Syntax

vmov{<cond>} Rd, Sn

vmov{<cond>} Sn, Rd

 Rd is an ARM integer register.

 Sn is a VFP single precision register.

 <cond> is an optional condition code.

Operations

Name          Effect    Description
vmov Rd, Sn   Rd ← Sn   Move Sn to Rd
vmov Sn, Rd   Sn ← Rd   Move Rd to Sn

Examples

f09-10-9780128036983

9.6.3 Moving Between VFP Register and Two Integer Registers

This version of the move instruction is used to transfer 64 bits of data between ARM integer registers and floating point registers:

vmov Move Between VFP and Two ARM Integer Registers.

Syntax

vmov{<cond>} destination(s), source(s)

 Source and destination must be VFP or integer registers. One of them must be a set of ARM integer registers, and the other must be VFP coprocessor registers. The following table shows the possible choices for sources and destinations.

ARM Integer   Floating Point
Rl, Rh        Dd
Rl, Rh        Sd, Sd'

 Sd and Sd’ must be adjacent, and Sd’ must be the higher-numbered register.

 <cond> is an optional condition code.

Operations

Name                   Effect               Description
vmov Dd, Rl, Rh        Dd ← Rh:Rl           Move Rh and Rl to Dd
vmov Rl, Rh, Dm        Rh:Rl ← Dm           Move Dm to Rh and Rl
vmov Sd, Sd', Rl, Rh   Sd ← Rl, Sd' ← Rh    Move Rl and Rh to Sd and Sd'
vmov Rl, Rh, Sd, Sd'   Rl ← Sd, Rh ← Sd'    Move Sd and Sd' to Rl and Rh

Examples

f09-11-9780128036983

9.6.4 Move Between ARM Register and VFP System Register

There are two instructions which allow the programmer to examine and change bits in the VFP system register(s):

vmrs Move From VFP System Register to ARM Register, and

vmsr Move From ARM Register to VFP System Register.

User programs should only access the FPSCR to check the flags and control vector mode.

Syntax

vmrs{<cond>} Rd, VFPsysreg

vmsr{<cond>} VFPsysreg, Rd

 VFPsysreg can be any of the VFP system registers.

 Rd can be any ARM integer register; for vmrs, it may also be APSR_nzcv, which copies the flags directly into the CPSR.

 <cond> is an optional condition code.

Operations

Name   Effect            Description
vmrs   Rd ← VFPsysreg    Move data from VFP system register to integer register
vmsr   VFPsysreg ← Rd    Move data from integer register to VFP system register

Examples

f09-12-9780128036983

9.7 Data Conversion Instructions

The ARM VFP provides several instructions for converting between various floating point and integer formats. Some VFP versions also have instructions for converting between fixed point and floating point formats.

9.7.1 Convert Between Floating Point and Integer

These instructions are used to convert integers to single or double precision floating point, or for converting single or double precision to integer:

vcvt Convert Between Floating Point and Integer

vcvtr Convert Floating Point to Integer with Rounding

These instructions always use a single precision register for the integer, but the floating point argument can be single precision or double precision. Some versions of the VFP do not support the double precision versions.

Syntax

vcvt{r}{<cond>}.<type>.f64 Sd, Dm

vcvt{r}{<cond>}.<type>.f32 Sd, Sm

vcvt{<cond>}.f64.<type> Dd, Sm

vcvt{<cond>}.f32.<type> Sd, Sm

 The optional r makes the operation use the rounding mode specified in the FPSCR. The default is to round toward zero.

 <cond> is an optional condition code.

 The <type> can be either u32 or s32 to specify unsigned or signed integer.

 These instructions can also convert from fixed point to floating point if followed by an appropriate vmul.

Operation

Opcode         Effect             Description
vcvt.f64.s32   Dd ← double(Sm)    Convert signed integer to double
vcvt.f32.s32   Sd ← single(Sm)    Convert signed integer to single
vcvt.f64.u32   Dd ← double(Sm)    Convert unsigned integer to double
vcvt.f32.u32   Sd ← single(Sm)    Convert unsigned integer to single
vcvt.s32.f32   Sd ← int(Sm)       Convert single to signed integer
vcvt.u32.f32   Sd ← unsigned(Sm)  Convert single to unsigned integer
vcvt.s32.f64   Sd ← int(Dm)       Convert double to signed integer
vcvt.u32.f64   Sd ← unsigned(Dm)  Convert double to unsigned integer

Examples

f09-13-9780128036983

9.7.2 Convert Between Fixed Point and Single Precision

VFPv3 and higher coprocessors have additional instructions used for converting between fixed point and single precision floating point:

vcvt Convert To or From Fixed Point.

Syntax

vcvt{<cond>}.<td>.f32 Sd, Sm, #fbits

vcvt{<cond>}.f32.<td> Sd, Sm, #fbits

 <cond> is an optional condition code.

 <td> specifies the type and size of the fixed point number, and must be one of the following:

s32 signed 32 bit value,

u32 unsigned 32 bit value,

s16 signed 16 bit value, or

u16 unsigned 16 bit value.

 The #fbits operand specifies the number of fraction bits in the fixed point number, and must be less than or equal to the size of the fixed point number indicated by <td>.

Operations

Name           Effect              Description
vcvt.s32.f32   Sd ← fixed32(Sm)    Convert single precision to 32-bit signed fixed point
vcvt.u32.f32   Sd ← ufixed32(Sm)   Convert single precision to 32-bit unsigned fixed point
vcvt.s16.f32   Sd ← fixed16(Sm)    Convert single precision to 16-bit signed fixed point
vcvt.u16.f32   Sd ← ufixed16(Sm)   Convert single precision to 16-bit unsigned fixed point
vcvt.f32.s32   Sd ← single(Sm)     Convert signed 32-bit fixed point to single precision
vcvt.f32.u32   Sd ← single(Sm)     Convert unsigned 32-bit fixed point to single precision
vcvt.f32.s16   Sd ← single(Sm)     Convert signed 16-bit fixed point to single precision
vcvt.f32.u16   Sd ← single(Sm)     Convert unsigned 16-bit fixed point to single precision

Examples

f09-14-9780128036983

9.8 Floating Point Sine Function

A fixed point implementation of the sine function was discussed in Section 8.7, and shown to be superior to the floating point sine function provided by GCC. Now that we have covered the VFP instructions, we can write an assembly version using floating point which also performs better than the routines provided by GCC.

9.8.1 Sine Function Using Scalar Mode

Listing 9.1 shows a single precision floating point implementation of the sine function, using the ARM VFPv3 instruction set. It works in a similar way to the previous fixed point code. There is a table of constants, each of which is the reciprocal of one of the factorial divisors in the Taylor series for sine. The subroutine calculates the powers of x one-by-one, and multiplies each power by the next constant in the table, summing the results as it goes. Note that the table of constants is shorter than the fixed point version of the code, because there are fewer bits of precision in a single precision floating point number than there are in the fixed point representation that was used previously.

f09-15-9780128036983
Listing 9.1 Simple scalar implementation of the sin x function using IEEE single precision.

Listing 9.2 shows a double precision floating point implementation of the sine function, using the ARM VFPv3 instruction set. Again, there is a table of constants, each of which is the reciprocal of one of the factorial divisors in the Taylor series for sine. The subroutine calculates the powers of x one-by-one, and multiplies each power by the next constant in the table, summing the results as it goes. Note that the table of constants is longer than the fixed point version of the code, because there are more bits of precision in a double precision floating point number than there are in the fixed point representation that was used previously.

Listing 9.2 Simple scalar implementation of the sin x function using IEEE double precision.

9.8.2 Sine Function Using Vector Mode

The previous implementations are already faster than the implementations provided by GCC. However, it may be possible to gain a little more performance by using VFP vector mode. In the single precision code, there are five terms to be added. Since single precision vectors can have up to eight elements, the code should not require a loop at all.

Listing 9.3 shows a single precision floating point implementation of the sine function, using the ARM VFPv3 instruction set in vector mode. It performs the same operations as the previous implementation, but instead of using a loop, all of the data is pre-loaded into vector banks and then a vector multiply operation is performed. The processor is then returned to scalar mode, and the summation is performed. This implementation is slightly faster than the previous version.

Listing 9.3 Vector implementation of the sin x function using IEEE single precision.

Listing 9.4 shows a double precision floating point implementation of the sine function, using the ARM VFPv3 instruction set in vector mode. It performs the same operations as the previous implementation, but performs the nine multiplications in three groups of three, using vector operations. Also, computing the powers of x is done within the loop, using a vector multiply. In this case, the vector code is significantly faster than the scalar version.

Listing 9.4 Vector implementation of the sin x function using IEEE double precision.

9.8.3 Performance Comparison

Table 9.2 shows the performance of various implementations of the sine function, with and without compiler optimization. The Single Precision C and Double Precision C implementations are the standard implementations provided by GCC.

Table 9.2

Performance of sine function with various implementations

Optimization  Implementation                      CPU seconds
None          Single Precision Scalar Assembly    2.96
              Single Precision Vector Assembly    2.63
              Single Precision C                  8.75
              Double Precision Scalar Assembly    4.59
              Double Precision Vector Assembly    3.75
              Double Precision C                  9.21
Full          Single Precision Scalar Assembly    2.16
              Single Precision Vector Assembly    2.06
              Single Precision C                  2.59
              Double Precision Scalar Assembly    3.88
              Double Precision Vector Assembly    3.16
              Double Precision C                  8.49

When compiler optimization is not used, the single precision scalar VFP implementation achieves a speedup of about 2.96, and the vector implementation achieves a speedup of about 3.33 compared to the GCC implementation. The double precision scalar VFP implementation achieves a speedup of about 2.01, and the vector implementation achieves a speedup of about 2.46 compared to the GCC implementation.

When the best possible compiler optimization is used (-Ofast), the single precision scalar VFP implementation achieves a speedup of about 1.20, and the vector implementation achieves a speedup of about 1.26 compared to the GCC implementation. The double precision scalar VFP implementation achieves a speedup of about 2.19, and the vector implementation achieves a speedup of about 2.69 compared to the GCC implementation.

In most cases, the assembly versions were significantly faster than the functions provided by GCC. GCC with full optimization using single-precision numbers was competitive, but the assembly language vector implementation still beat it by over 25%. It is clear that writing some functions in assembly can result in large performance gains.

9.9 Alphabetized List of VFP Instructions

Name     Page   Operation
vabs     277    Absolute Value
vadd     278    Add
vcmp     279    Compare
vcmpe    279    Compare with Exception
vcpy     277    Copy VFP Register
vcvt     283    Convert Between Floating Point and Integer
vcvt     284    Convert To or From Fixed Point
vcvtr    283    Convert Floating Point to Integer with Rounding
vdiv     278    Divide
vldm     275    Load Multiple VFP Registers
vldr     274    Load VFP Register
vmov     280    Move Between VFP and One ARM Integer Register
vmov     281    Move Between VFP and Two ARM Integer Registers
vmov     279    Move Between VFP Registers
vmrs     282    Move From VFP System Register to ARM Register
vmsr     282    Move From ARM Register to VFP System Register
vmul     278    Multiply
vneg     277    Negate
vnmul    278    Negate and Multiply
vsqrt    277    Square Root
vstm     275    Store Multiple VFP Registers
vstr     274    Store VFP Register
vsub     278    Subtract

9.10 Chapter Summary

The ARM VFP coprocessor adds a great deal of power to the ARM architecture. The register set is expanded to hold up to four times the amount of data that can be held in the ARM integer registers. The additional instructions allow the programmer to deal directly with the most common IEEE 754 formats for floating point numbers. The ability to treat groups of registers as vectors adds a significant performance improvement. Access to the vector features is only possible through assembly language. The GCC compiler is not capable of using these advanced features, which gives the assembly programmer a big advantage when high-performance code is needed.

Exercises

9.1 How many registers does the VFP coprocessor add to the ARM architecture?

9.2 What is the purpose of the FZ, DN, IDE, IXE, UFE, OFE, DZE, and IOE bits in the FPSCR? What is it called when FZ and DN are set to one and all of the others are set to zero?

9.3 If a VFP coprocessor is present, how are floating point parameters passed to subroutines? How is a pointer to a floating point value (or array of values) passed to a subroutine?

9.4 Write the following C code in ARM assembly:

(C code omitted.)

9.5 In the previous exercise, the C code contains a subtle bug.

a. What is the bug?

b. Show two ways to fix the code in ARM assembly. Hint: One way is to change the amount of the increment, which will change the number of times that the loop executes.

9.6 The fixed point sine function from the previous chapter was not compared directly to the hand-coded VFP implementation. Based on the information in Tables 9.2 and 8.4, would you expect the fixed point sine function from the previous chapter to beat the hand-coded assembly VFP sine function in this chapter? Why or why not?

9.7 3-D objects are often stored as an array of points, where each point is a vector (array) consisting of four values, x, y, z, and the constant 1.0. Rotation, translation, scaling and other operations are accomplished by multiplying each point by a 4 × 4 transformation matrix. The following C code shows the data types and the transform operation:

(C code omitted.)

Write the equivalent ARM assembly code.

9.8 Optimize the ARM assembly code you wrote in the previous exercise. Use vector mode if possible.

9.9 Since the fourth element of the point is always 1.0, there is no need to actually store it. This will reduce memory requirements by about 25%, and require one fewer multiply. The C code would look something like this:

(C code omitted.)

Write optimal ARM VFP code to implement this function.

9.10 The function in the previous problem would typically be called multiple times to process an array of points, as in the following function:

(C code omitted.)

This could be somewhat inefficient. Re-write this function in assembly so that the transformation of each point is done without resorting to a function call. Make your code as efficient as possible.

Chapter 10

The ARM NEON Extensions

Abstract

This chapter begins with an overview of the NEON extensions and explains the relationship between VFP and NEON. The NEON registers are described, and the syntax for NEON instructions is introduced. Next, each of the NEON instructions is explained, with short examples. In some cases, extended examples and figures are provided to help explain the operation of complex instructions. After all of the instructions are covered, another implementation of sine is presented and compared to the previous implementations and to the GCC sine function. It is shown that NEON gives a significant performance advantage over VFP, and that hand-coded assembly is much faster than the sin function provided by the compiler.

Keywords

Single instruction multiple data (SIMD); Vector; Vector element; Instruction level parallelism; Lane

The ARM VFP coprocessor has been replaced or augmented by the NEON architecture on ARMv7 and higher systems. NEON extends the VFP instruction set with about 125 instructions and pseudo-instructions to support not only floating point, but also integer and fixed point. NEON also supports Single Instruction, Multiple Data (SIMD) operations. All NEON processors have the full set of 32 double precision VFP registers, but NEON adds the ability to view the register set as 16 128-bit (quadruple-word) registers, named q0 through q15.

A single NEON instruction can operate on up to 128 bits, which may represent multiple integer, fixed point, or floating point numbers. For example, if two of the 128-bit registers each contain eight 16-bit integers, then a single NEON instruction can add all eight integers in one register to the corresponding integers in the other register, resulting in eight simultaneous additions. For certain applications, this SIMD architecture can result in extremely fast and efficient implementations. NEON is particularly useful for handling streaming video and audio, but can also give very good performance on floating point intensive tasks. NEON instructions perform parallel operations on vectors. NEON deprecates the VFP vector mode covered in Section 9.2.2. On most NEON systems, using VFP vector mode will result in an exception, which transfers control to support code that emulates vector mode in software. This causes a severe performance penalty, so VFP vector mode should not be used on NEON systems.
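The eight simultaneous 16-bit additions can be modeled with a plain loop in C (the function name is ours); a single vadd.i16 q0,q1,q2 would perform all eight at once:

```c
#include <stdint.h>

/* One NEON quadword add: eight independent 16-bit lane additions.
   Each lane is independent; there is no carry between lanes. */
void add8x16(int16_t d[8], const int16_t a[8], const int16_t b[8]) {
    for (int i = 0; i < 8; i++)
        d[i] = (int16_t)(a[i] + b[i]);
}
```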

Fig. 10.1 shows the ARM integer, VFP, and NEON register set. NEON views each register as containing a vector of 1, 2, 4, 8, or 16 elements, all of the same size and type. Individual elements of each vector can also be accessed as scalars. A scalar can be 8 bits, 16 bits, 32 bits, or 64 bits. The instruction syntax is extended to refer to scalars using an index, x, in a doubleword register. Dm[x] is element x in register Dm. The size of the elements is given as part of the instruction. Instructions that access scalars can access any element in the register bank.

Figure 10.1 ARM integer and NEON user program registers.

10.1 NEON Intrinsics

The GCC compiler gives C (and C++) programs direct access to the NEON instructions through the NEON intrinsics. The intrinsics are a large set of functions that are built into the compiler. Most of the intrinsic functions map to a single NEON instruction. Additional functions are provided for typecasting (reinterpreting) NEON vectors, so that the C compiler does not complain about mismatched types. It is usually shorter and more efficient to write the NEON code directly as assembly language functions and link them to the C code. However, only those who know assembly language are capable of doing that.

10.2 Instruction Syntax

Some instructions require specific register types. Other instructions allow the programmer to choose single word, double word, or quad word registers. If the instruction requires single precision registers, then the registers are specified as Sd for the destination register, Sn for the first operand register, and Sm for the second operand register. If the instruction requires only two registers, then Sn is not used. The lower-case letter is replaced with a valid register number. The register name is not case sensitive, so S10 and s10 are both valid names for single precision register 10.

The syntax of the NEON instructions can be described using a relatively simple notation. The notation consists of the following elements:

{item} Braces around an item indicate that the item is optional. For example, many operations have an optional condition, which is written as {<cond>}.

Ry An ARM integer register. y can be any number in the range 0-15.

Sy A 32-bit or single precision register. y can be any number in the range 0-31.

Dy A 64-bit or double precision register. y can be any number in the range 0-31.

Qy A quad word register. y can be any number in the range 0-15.

Fy A VFP register. F must be either s for a single word register, or d for a double word register. y can be any valid register number.

Ny A NEON or VFP register. N must be either s for a single word register, d for a double word register, or q for a quad word register. y can be any valid register number.

Vy A NEON vector register. V must be replaced with d for a double word register, or q for a quad word register. y can be any valid register number.

Vy[x] A NEON scalar (vector element). The size of the scalar is defined as part of the instruction. V must be replaced with d for a double word register, or q for a quad word register. y can be any valid register number. x specifies which scalar element of Vy is to be used. Valid values for x can be deduced by the size of Vy and the size of the scalars that the instruction uses.

<op> Operation specific part of a general instruction format

<n> An integer usually indicating a specific instruction version

<size> An integer indicating the number of bits used

<cond> ARM condition code from Table 3.2

<type> Many instructions operate on one or more of the following specific data types:

i8 Untyped 8 bits

i16 Untyped 16 bits

i32 Untyped 32 bits

i64 Untyped 64 bits

s8 Signed 8-bit integer

s16 Signed 16-bit integer

s32 Signed 32-bit integer

s64 Signed 64-bit integer

u8 Unsigned 8-bit integer

u16 Unsigned 16-bit integer

u32 Unsigned 32-bit integer

u64 Unsigned 64-bit integer

f16 IEEE 754 half precision floating point

f32 IEEE 754 single precision floating point

f64 IEEE 754 double precision floating point

<list> A brace-delimited list of up to four NEON registers, vectors, or scalars. The general form is {Dn,D(n+a),D(n+2a),D(n+3a)} where a is either 1 or 2.

<align> Specifies the memory alignment of structured data for certain load and store operations.

<imm> An immediate value. The required format for immediate values depends on the instruction.

<fbits> Specifies the number of fraction bits in fixed point numbers.

The following function definitions are used in describing the effects of many of the instructions:

⌊x⌋ The floor function maps a real number, x, to the largest integer less than or equal to x.

sat(x) The saturate function limits the value of x to the highest or lowest value that can be stored in the destination register.

round(x) The round function maps a real number, x, to the nearest integer.

narrow(x) The narrow function reduces a 2n bit number to an n bit number, by taking the n least significant bits.

extend(x) The extend function converts an n bit number to a 2n bit number, performing zero extension if the number is unsigned, or sign extension if the number is signed.

10.3 Load and Store Instructions

These instructions can be used to perform interleaving of data when structured data is loaded or stored. The data should be properly aligned for best performance. These instructions are very useful for common multimedia data types.

For example, image data is typically stored in arrays of pixels, where each pixel is a small data structure such as the pixel struct shown in Listing 5.37. Since each pixel is three bytes, and a d register is 8 bytes, loading a single pixel into one register would be inefficient. It would be much better to load multiple pixels at once, but an even number of pixels will not fit in a register. It will take three doubleword or quadword registers to hold an even number of pixels without wasting space, as shown in Fig. 10.2. This is the way data would be loaded using a VFP vldr or vldm instruction. Many image processing operations work best if each color “channel” is processed separately. The NEON load and store vector instructions can be used to split the image data into color channels, where each channel is stored in a different register, as shown in Fig. 10.3.

Figure 10.2 Pixel data interleaved in three doubleword registers.
Figure 10.3 Pixel data de-interleaved in three doubleword registers.

Other examples of interleaved data include stereo audio, which is two interleaved channels, and surround sound, which may have up to nine interleaved channels. In all of these cases, most processing operations are simplified when the data is separated into non-interleaved channels.
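The de-interleaving shown in Fig. 10.3 can be modeled in plain C (the function name is ours). A vld3.8 {d0,d1,d2},[r0] would produce the same result for eight pixels in a single instruction:

```c
#include <stdint.h>

/* Split 8 interleaved RGB pixels (R0 G0 B0 R1 G1 B1 ...) into one
   array per color channel -- one channel per d register on NEON. */
void deinterleave_rgb(const uint8_t *src,
                      uint8_t r[8], uint8_t g[8], uint8_t b[8]) {
    for (int i = 0; i < 8; i++) {
        r[i] = src[3 * i + 0];
        g[i] = src[3 * i + 1];
        b[i] = src[3 * i + 2];
    }
}
```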

10.3.1 Load or Store Single Structure Using One Lane

These instructions are used to load and store structured data across multiple registers:

vld<n> Load Structured Data, and

vst<n> Store Structured Data.

They can be used for interleaving or deinterleaving the data as it is loaded or stored, as shown in Fig. 10.3.

Syntax

 v<op><n>.<size> <list>,[Rn{:<align>}]{!}

 v<op><n>.<size> <list>,[Rn{:<align>}],Rm

 <op> must be either ld or st.

 <n> must be one of 1, 2, 3, or 4.

 <size> must be one of 8, 16, or 32.

 <list> specifies the list of registers. There are four list formats:

1. {Dd[x]}

2. {Dd[x], D(d+a)[x]}

3. {Dd[x], D(d+a)[x], D(d+2a)[x]}

4. {Dd[x], D(d+a)[x], D(d+2a)[x], D(d+3a)[x]}

where a can be either 1 or 2. Every register in the list must be in the range d0-d31.

 Rn is the ARM register containing the base address. Rn cannot be pc.

 <align> specifies an optional alignment. If <align> is not specified, then standard alignment rules apply.

 The optional ! indicates that Rn is updated after the data is transferred. This is similar to the ldm and stm instructions.

 Rm is an ARM register containing an offset from the base address. If Rm is present, Rn is updated to Rn + Rm after the address is used to access memory. Rm cannot be sp or pc.

Table 10.1 shows all valid combinations of parameters for these instructions. Note that the same vector element (scalar) x must be used in each register. Up to four registers can be specified. If the structure has more than four fields, then these instructions can be used repeatedly to load or store all of the fields.

Table 10.1

Parameter combinations for loading and storing a single structure

<n>  <size>    <list>                                    <align>    Alignment
1    8         Dd[x]                                     —          Standard only
     16        Dd[x]                                     16         2 byte
     32        Dd[x]                                     32         4 byte
2    8         Dd[x], D(d+1)[x]                          16         2 byte
     16        Dd[x], D(d+1)[x]                          32         4 byte
               Dd[x], D(d+2)[x]                          32         4 byte
     32        Dd[x], D(d+1)[x]                          64         8 byte
               Dd[x], D(d+2)[x]                          64         8 byte
3    8         Dd[x], D(d+1)[x], D(d+2)[x]               —          Standard only
     16 or 32  Dd[x], D(d+1)[x], D(d+2)[x]               —          Standard only
               Dd[x], D(d+2)[x], D(d+4)[x]               —          Standard only
4    8         Dd[x], D(d+1)[x], D(d+2)[x], D(d+3)[x]    32         4 byte
     16        Dd[x], D(d+1)[x], D(d+2)[x], D(d+3)[x]    64         8 byte
               Dd[x], D(d+2)[x], D(d+4)[x], D(d+6)[x]    64         8 byte
     32        Dd[x], D(d+1)[x], D(d+2)[x], D(d+3)[x]    64 or 128  (<align> ÷ 8) bytes
               Dd[x], D(d+2)[x], D(d+4)[x], D(d+6)[x]    64 or 128  (<align> ÷ 8) bytes

Operations

vld<n>:

    tmp ← Rn
    incr ← <size> ÷ 8
    for each D in <list> do
        D[x] ← Mem[tmp]
        tmp ← tmp + incr
    end for
    if ! is present then
        Rn ← tmp
    else if Rm is specified then
        Rn ← Rn + Rm
    end if

Load one or more data items into a single lane of one or more registers.

vst<n>:

    tmp ← Rn
    incr ← <size> ÷ 8
    for each D in <list> do
        Mem[tmp] ← D[x]
        tmp ← tmp + incr
    end for
    if ! is present then
        Rn ← tmp
    else if Rm is specified then
        Rn ← Rn + Rm
    end if

Store one or more data items from a single lane of one or more registers.

Examples

(Examples omitted.)
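The single-lane form with writeback can be sketched in plain C (names are ours). This models what vld2.16 {d0[1],d1[1]},[r0]! does: two consecutive 16-bit items go into lane 1 of two registers, and the base address advances past them:

```c
#include <stdint.h>

/* Load one two-field structure into lane x of d0 and d1,
   returning the post-incremented base address (the ! writeback). */
const uint16_t *ld2_lane_16(uint16_t d0[4], uint16_t d1[4],
                            int x, const uint16_t *base) {
    d0[x] = *base++;
    d1[x] = *base++;
    return base;
}
```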

10.3.2 Load Copies of a Structure to All Lanes

This instruction is used to load multiple copies of structured data across multiple registers:

vld<n> Load Copies of Structured Data.

The data is copied to all lanes. This instruction is useful for initializing vectors for use in later instructions.

Syntax

 vld<n>.<size> <list>,[Rn{:<align>}]{!}

 vld<n>.<size> <list>,[Rn{:<align>}],Rm

 <n> must be one of 1, 2, 3, or 4.

 <size> must be one of 8, 16, or 32.

 <list> specifies the list of registers. There are four list formats:

1. {Dd[]}

2. {Dd[], D(d+a)[]}

3. {Dd[], D(d+a)[], D(d+2a)[]}

4. {Dd[], D(d+a)[], D(d+2a)[], D(d+3a)[]}

where a can be either 1 or 2. Every register in the list must be in the range d0-d31.

 Rn is the ARM register containing the base address. Rn cannot be pc.

 <align> specifies an optional alignment. If <align> is not specified, then standard alignment rules apply.

 The optional ! indicates that Rn is updated after the data is transferred. This is similar to the ldm and stm instructions.

 Rm is an ARM register containing an offset from the base address. If Rm is present, Rn is updated to Rn + Rm after the address is used to access memory. Rm cannot be sp or pc.

Table 10.2 shows all valid combinations of parameters for this instruction. Note that the vector element number is not specified, but the brackets [] must be present. Up to four registers can be specified. If the structure has more than four fields, then this instruction can be repeated to load or store all of the fields.

Table 10.2

Parameter combinations for loading copies of a structure to all lanes

<n>  <size>        <list>                                <align>    Alignment
1    8             Dd[]                                  —          Standard only
                   Dd[], D(d+1)[]                        —          Standard only
     16            Dd[]                                  16         2 byte
                   Dd[], D(d+1)[]                        16         2 byte
     32            Dd[]                                  32         4 byte
                   Dd[], D(d+1)[]                        32         4 byte
2    8             Dd[], D(d+1)[]                        8          1 byte
                   Dd[], D(d+2)[]                        8          1 byte
     16            Dd[], D(d+1)[]                        16         2 byte
                   Dd[], D(d+2)[]                        16         2 byte
     32            Dd[], D(d+1)[]                        32         4 byte
                   Dd[], D(d+2)[]                        32         4 byte
3    8, 16, or 32  Dd[], D(d+1)[], D(d+2)[]              —          Standard only
                   Dd[], D(d+2)[], D(d+4)[]              —          Standard only
4    8             Dd[], D(d+1)[], D(d+2)[], D(d+3)[]    32         4 byte
                   Dd[], D(d+2)[], D(d+4)[], D(d+6)[]    32         4 byte
     16            Dd[], D(d+1)[], D(d+2)[], D(d+3)[]    64         8 byte
                   Dd[], D(d+2)[], D(d+4)[], D(d+6)[]    64         8 byte
     32            Dd[], D(d+1)[], D(d+2)[], D(d+3)[]    64 or 128  (<align> ÷ 8) bytes
                   Dd[], D(d+2)[], D(d+4)[], D(d+6)[]    64 or 128  (<align> ÷ 8) bytes

Operations

vld<n>:

    tmp ← Rn
    incr ← <size> ÷ 8
    nlanes ← 64 ÷ <size>
    for each D in <list> do
        for 0 ≤ x < nlanes do
            D[x] ← Mem[tmp]
        end for
        tmp ← tmp + incr
    end for
    if ! is present then
        Rn ← tmp
    else if Rm is specified then
        Rn ← Rn + Rm
    end if

Load one data item for each register in the list, copying it into all lanes of that register.

Examples

(Examples omitted.)
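A plain-C sketch (names are ours) of what an all-lanes load such as vld2.16 {d0[],d1[]},[r0] accomplishes: one structure is read from memory, and each field is broadcast to every lane of its register:

```c
#include <stdint.h>

/* Read one two-field structure and copy field 0 to all lanes
   of d0 and field 1 to all lanes of d1. */
void ld2_dup_16(uint16_t d0[4], uint16_t d1[4], const uint16_t *base) {
    for (int i = 0; i < 4; i++) {
        d0[i] = base[0];
        d1[i] = base[1];
    }
}
```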

10.3.3 Load or Store Multiple Structures

These instructions are used to load and store multiple data structures across multiple registers with interleaving or deinterleaving:

vld<n> Load Multiple Structured Data, and

vst<n> Store Multiple Structured Data.

Syntax

 v<op><n>.<size> <list>,[Rn{:<align>}]{!}

 v<op><n>.<size> <list>,[Rn{:<align>}],Rm

 <op> must be either ld or st.

 <n> must be one of 1, 2, 3, or 4.

 <size> must be one of 8, 16, or 32.

 <list> specifies the list of registers. There are four list formats:

1. {Dd}

2. {Dd, D(d+a)}

3. {Dd, D(d+a), D(d+2a)}

4. {Dd, D(d+a), D(d+2a), D(d+3a)}

where a can be either 1 or 2. Every register in the list must be in the range d0-d31.

 Rn is the ARM register containing the base address. Rn cannot be pc.

 <align> specifies an optional alignment. If <align> is not specified, then standard alignment rules apply.

 The optional ! indicates that Rn is updated after the data is transferred. This is similar to the ldm and stm instructions.

 Rm is an ARM register containing an offset from the base address. If Rm is present, Rn is updated to Rn + Rm after the address is used to access memory. Rm cannot be sp or pc.

Table 10.3 shows all valid combinations of parameters for these instructions. Note that no scalar is specified; the instructions operate on all vector elements. Up to four registers can be specified. If the structure has more than four fields, then these instructions can be repeated to load or store all of the fields.

Table 10.3

Parameter combinations for loading and storing multiple structures

<n>  <size>            <list>                        <align>          Alignment
1    8, 16, 32, or 64  Dd                            64               8 bytes
                       Dd, D(d+1)                    64 or 128        (<align> ÷ 8) bytes
                       Dd, D(d+1), D(d+2)            64               8 bytes
                       Dd, D(d+1), D(d+2), D(d+3)    64, 128, or 256  (<align> ÷ 8) bytes
2    8, 16, or 32      Dd, D(d+1)                    64 or 128        (<align> ÷ 8) bytes
                       Dd, D(d+2)                    64 or 128        (<align> ÷ 8) bytes
                       Dd, D(d+1), D(d+2), D(d+3)    64, 128, or 256  (<align> ÷ 8) bytes
3    8, 16, or 32      Dd, D(d+1), D(d+2)            64               8 bytes
                       Dd, D(d+2), D(d+4)            64               8 bytes
4    8, 16, or 32      Dd, D(d+1), D(d+2), D(d+3)    64, 128, or 256  (<align> ÷ 8) bytes
                       Dd, D(d+2), D(d+4), D(d+6)    64, 128, or 256  (<align> ÷ 8) bytes

Operations

vld<n>:

    tmp ← Rn
    incr ← <size> ÷ 8
    nlanes ← 64 ÷ <size>
    for 0 ≤ x < nlanes do
        for each D in <list> do
            D[x] ← Mem[tmp]
            tmp ← tmp + incr
        end for
    end for
    if ! is present then
        Rn ← tmp
    else if Rm is specified then
        Rn ← Rn + Rm
    end if

Load multiple structures, de-interleaving the fields so that each register receives one field.

vst<n>:

    tmp ← Rn
    incr ← <size> ÷ 8
    nlanes ← 64 ÷ <size>
    for 0 ≤ x < nlanes do
        for each D in <list> do
            Mem[tmp] ← D[x]
            tmp ← tmp + incr
        end for
    end for
    if ! is present then
        Rn ← tmp
    else if Rm is specified then
        Rn ← Rn + Rm
    end if

Store multiple structures, interleaving the fields from the registers into memory.

Examples

(Examples omitted.)

10.4 Data Movement Instructions

Because they use the same set of registers, VFP and NEON share some instructions for loading, storing, and moving registers. The shared instructions are vldr, vstr, vldm, vstm, vpop, vpush, vmov, vmrs, and vmsr. These were explained in Chapter 9. NEON extends the vmov instructions to allow specification of NEON scalars and quadwords, and adds the ability to perform one’s complement during a move.

10.4.1 Moving Between NEON Scalar and Integer Register

This version of the move instruction allows data to be moved between the NEON registers and the ARM integer registers as 8-bit, 16-bit, or 32-bit NEON scalars:

vmov Move Between NEON and ARM.

Syntax

 vmov{<cond>}.<size> Dn[x],Rd

 vmov{<cond>}.<type> Rd,Dn[x]

 <cond> is an optional condition code.

 <size> must be 8, 16, or 32, and specifies the number of bits that are to be moved.

 The <type> must be u8, u16, u32, s8, s16, s32, or f32, and specifies the number of bits that are to be moved and whether or not the result should be sign-extended in the ARM integer destination register.

Operations

Name             Effect          Description
vmov Dn[x],Rd    Dn[x] ← Rd      Move least significant <size> bits of Rd to NEON scalar Dn[x].
vmov Rd,Dn[x]    Rd ← Dn[x]      Move NEON scalar Dn[x] to Rd, storing as the specified type.

Examples

(Examples omitted.)
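The sign-extending move can be sketched in plain C for one lane (the function name is ours). This models vmov.s16 r0,d3[1]: the 16-bit lane is read and sign-extended into the 32-bit ARM register:

```c
#include <stdint.h>

/* Read lane x of a vector of signed 16-bit elements and
   sign-extend it to 32 bits, as vmov.s16 Rd,Dn[x] would. */
int32_t mov_lane_s16(const int16_t lanes[4], int x) {
    return (int32_t)lanes[x];  /* the cast sign-extends */
}
```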

10.4.2 Move Immediate Data

NEON extends the VFP vmov instruction to include the ability to move an immediate value, or the one’s complement of an immediate value, to every element of a register. The instructions are:

vmov Move Immediate, and

vmvn Move Immediate NOT.

Syntax

 v<op>.<type> Vd, #<imm>

 <op> must be either mov or mvn.

 <type> must be i8, i16, i32, f32, or i64, and specifies the size of items in the vector.

 V can be s, d, or q.

 <imm> is an immediate value that matches <type>, and is copied to every element in the vector. The following table shows valid formats for <imm>:

<type>  vmov          vmvn
i8      0xXY          0xXY
i16     0x00XY        0xFFXY
        0xXY00        0xXYFF
i32     0x000000XY    0xFFFFFFXY
        0x0000XY00    0xFFFFXYFF
        0x00XY0000    0xFFXYFFFF
        0xXY000000    0xXYFFFFFF
i64     0xABCDEFGH    0xABCDEFGH   (each letter represents a byte, and must be either FF or 00)
f32     Any number that can be written as ± n × 2^(−r), where n and r are integers, 16 ≤ n ≤ 31 and 0 ≤ r ≤ 7

Operations

Name    Effect           Description
vmov    Vd[] ← <imm>     Copy immediate value to all elements of Vd.
vmvn    Vd[] ← ¬<imm>    Copy one's complement of immediate value to all elements of Vd.

Examples

(Examples omitted.)
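The one's complement broadcast can be sketched in plain C (the function name is ours). This models vmvn.i16 d0,#0x00FF, which writes 0xFF00 to every 16-bit lane:

```c
#include <stdint.h>

/* Broadcast the one's complement of an immediate to all four
   16-bit lanes of a doubleword register. */
void mvn_i16(uint16_t lanes[4], uint16_t imm) {
    for (int i = 0; i < 4; i++)
        lanes[i] = (uint16_t)~imm;
}
```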

10.4.3 Change Size of Elements in a Vector

It is sometimes useful to increase or decrease the number of bits per element in a vector. NEON provides these instructions to convert a doubleword vector with elements of size y to a quadword vector with size 2y, or to perform the inverse operation:

vmovl Move and Lengthen,

vmovn Move and Narrow,

vqmovn Saturating Move and Narrow, and

vqmovun Saturating Move and Narrow Unsigned.

Syntax

 vmovl.<type> Qd, Dm

 v{q}movn.<type> Dd, Qm

 vqmovun.<type> Dd, Qm

 The valid choices for <type> are given in the following table:

Opcode     Valid Types
vmovl      s8, s16, s32, u8, u16, or u32
vmovn      i8, i16, or i32
vqmovn     s8, s16, s32, u8, u16, or u32
vqmovun    s8, s16, or s32

 q indicates that the results are saturated.

Operations

vmovl:

    for 0 ≤ i < (64 ÷ size) do
        Qd[i] ← extend(Dm[i])
    end for

Sign or zero extends (depending on <type>) each element of a doubleword vector to twice its length.

v{q}movn:

    for 0 ≤ i < (64 ÷ size) do
        if q is present then
            Dd[i] ← sat(narrow(Qm[i]))
        else
            Dd[i] ← narrow(Qm[i])
        end if
    end for

Copy the least significant half of each element of a quadword vector to the corresponding element of a doubleword vector. If q is present, then the value is saturated.

vqmovun:

    for 0 ≤ i < (64 ÷ size) do
        Dd[i] ← sat(narrow(Qm[i]))
    end for

Copy each element of the operand vector to the corresponding element of the destination vector. The destination element is unsigned, and the value is saturated.

Examples

(Examples omitted.)
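The saturating narrow can be sketched for one lane in plain C (the function name is ours). Unlike vmovn, which simply truncates, vqmovn.s32 clamps each value to the range of the narrower type:

```c
#include <stdint.h>

/* One lane of vqmovn.s32: narrow 32 bits to 16 bits,
   saturating instead of truncating. */
int16_t qmovn_s32(int32_t x) {
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}
```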

10.4.4 Duplicate Scalar

The duplicate instruction copies a scalar into every element of the destination vector. The scalar can be in a NEON register or an ARM integer register. The instruction is:

vdup Duplicate Scalar.

Syntax

 vdup.<size> Vd, Rm

 vdup.<size> Vd, Dm[x]

 <size> must be one of 8, 16 or 32.

 V can be d or q.

 Rm cannot be r15.

Operations

Name           Effect          Description
vdup.<size>    Vd[] ← Rm       Copy <size> least significant bits of Rm to all elements of Vd.
vdup.<size>    Vd[] ← Dm[x]    Copy element x of Dm to all elements of Vd.

Examples

(Examples omitted.)
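The duplicate operation can be sketched in plain C (the function name is ours). This models vdup.16 d0,r1, which broadcasts the low 16 bits of an ARM register to every lane:

```c
#include <stdint.h>

/* Copy the 16 least significant bits of an ARM register
   to all four lanes of a doubleword register. */
void dup16(uint16_t d[4], uint32_t rm) {
    for (int i = 0; i < 4; i++)
        d[i] = (uint16_t)rm;
}
```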

10.4.5 Extract Elements

This instruction extracts 8-bit elements from two vectors and concatenates them. Fig. 10.4 gives an example of what this instruction does. The instruction is:

Figure 10.4 Example of vext.8 d12,d4,d9,#5.

vext Extract Elements.

Syntax

 vext.<size> Vd, Vn, Vm, #<imm>

 <size> must be one of 8, 16, 32, or 64.

 V can be d or q.

 <imm> is the number of elements to extract from the bottom of Vm. The remaining elements required to fill Vd are taken from the top of Vn.

Operation

vext:

    if V is d then
        size ← 8
    else
        size ← 16
    end if
    for 0 ≤ i < imm do
        Vd[i + size − imm] ← Vm[i]
    end for
    for imm ≤ i < size do
        Vd[i − imm] ← Vn[i]
    end for

Concatenate the top of the first operand to the bottom of the second operand.

Examples

(Examples omitted.)
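The extract operation in Fig. 10.4 can be sketched in plain C for 8-bit elements (the function name is ours): the top 8 − imm bytes of the first operand are followed by the bottom imm bytes of the second:

```c
#include <stdint.h>

/* vext.8 dd,dn,dm,#imm: discard the bottom imm bytes of n,
   then append the bottom imm bytes of m on top. */
void vext8(uint8_t d[8], const uint8_t n[8], const uint8_t m[8], int imm) {
    for (int i = 0; i < 8 - imm; i++)
        d[i] = n[i + imm];
    for (int i = 0; i < imm; i++)
        d[8 - imm + i] = m[i];
}
```

For vext.8 d12,d4,d9,#5 this yields elements 5-7 of d4 followed by elements 0-4 of d9, as in Fig. 10.4.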

10.4.6 Reverse Elements

This instruction reverses the order of data in a register:

vrev Reverse Elements.

One use of this instruction is for converting data from big-endian to little-endian order, or from little-endian to big-endian order. It could also be useful for swapping data and transforming matrices. Fig. 10.5 shows three examples.

Figure 10.5 Examples of the vrev instruction. (A) vrev16.8 d3,d4; (B) vrev32.16 d8,d9; (C) vrev32.8 d5,d7.

Syntax

 vrev<n>.<size> Vd, Vm

 <n> can be 16, 32, or 64.

 <size> is either 8, 16, or 32 and indicates the size of the elements to be reversed. <size> must be less than <n>.

 V can be q or d.

Operation

vrev:

    n ← number of <n>-bit groups
    g ← number of <size>-bit elements per group
    for 0 ≤ i < n do
        for 0 ≤ j < g do
            Vd[i × g + j] ← Vm[i × g + (g − j − 1)]
        end for
    end for

Reverse the order of the <size>-bit elements within every <n>-bit group.

Examples

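A Python model of the vrev semantics (element values stand in for lanes; this is a sketch, not ARM code):

```python
def vrev(vm, n, size):
    g = n // size                     # elements per reversal group
    out = []
    for i in range(0, len(vm), g):
        out.extend(reversed(vm[i:i + g]))
    return out

# vrev16.8: swap the bytes within each halfword.
print(vrev([0, 1, 2, 3, 4, 5, 6, 7], 16, 8))
# vrev32.8: reverse the bytes within each word.
print(vrev([0, 1, 2, 3, 4, 5, 6, 7], 32, 8))
```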

10.4.7 Swap Vectors

This instruction simply swaps two NEON registers:

vswp Swap Vectors.

Syntax

 vswp{.<type>} Vd, Vm

 <type> can be any NEON data type. The assembler ignores the type, but it can be useful to the programmer as extra documentation.

 V can be q or d.

Operation

Name   Effect              Description
vswp   Vd ← Vm; Vm ← Vd    Swap registers (the two assignments occur simultaneously)

Examples


10.4.8 Transpose Matrix

This instruction transposes 2 × 2 matrices:

vtrn Transpose Matrix.

Fig. 10.6 shows two examples of this instruction. Larger matrices can be transposed using a divide-and-conquer approach.

Figure 10.6 Examples of the vtrn instruction. (A) vtrn.8 d14,d15; (B) vtrn.32 d31,d15.

Syntax

 vtrn.<size> Vd, Vm

 <size> is either 8, 16, or 32 and indicates the size of the elements in the matrix (or matrices).

 V can be q or d.

Operation

vtrn:

    n ← number of elements
    for 0 ≤ i < n by 2 do
        tmp ← Vm[i]
        Vm[i] ← Vd[i+1]
        Vd[i+1] ← tmp
    end for

Treat the two vectors as an array of 2 × 2 matrices and transpose them.

Examples

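The 2 × 2 transpose performed by vtrn can be modeled in Python (a sketch of the semantics, not NEON code):

```python
def vtrn(vd, vm):
    vd, vm = vd[:], vm[:]
    # Each even index i starts a 2x2 matrix whose rows are
    # (vd[i], vd[i+1]) and (vm[i], vm[i+1]); swap the off-diagonal pair.
    for i in range(0, len(vd), 2):
        vd[i + 1], vm[i] = vm[i], vd[i + 1]
    return vd, vm

print(vtrn([0, 1, 2, 3], [4, 5, 6, 7]))
```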

Fig. 10.7 shows how the vtrn instruction can be used to transpose a 3 × 3 matrix. Transposing a 4 × 4 matrix this way requires the transposition of thirteen 2 × 2 matrices. However, this instruction can operate on multiple 2 × 2 sub-matrices in parallel, and can group elements into different sized sub-matrices. There is also a very useful swap instruction that can exchange the rows of a matrix. Using the swap and transpose instructions, transposing a 4 × 4 matrix of 16-bit elements can be done with only four instructions, as shown in Fig. 10.8.

Figure 10.7 Transpose of a 3 × 3 matrix.
Figure 10.8 Transpose of a 4 × 4 matrix of 32-bit numbers.

10.4.9 Table Lookup

The table lookup instructions use indices held in one vector to look up values from a table held in one or more other vectors. The resulting values are stored in the destination vector. The table lookup instructions are:

vtbl Table Lookup, and

vtbx Table Lookup with Extend.

Syntax

 v<op>.8 Dd, <list>, Dm

 <op> is one of tbl or tbx

 <list> specifies the list of registers. There are five list formats:

1. {Dn},

2. {Dn, D(n+1)},

3. {Dn, D(n+1), D(n+2)},

4. {Dn, D(n+1), D(n+2), D(n+3)}, or

5. {Qn, Q(n+1)}.

 Dm is the register holding the indices.

 The table can contain up to 32 bytes.

Operations

vtbl:

    Minr ← first register in <list>
    Maxr ← last register in <list>
    for 0 ≤ i < 8 do
        r ← Minr + (Dm[i] ÷ 8)
        if r > Maxr then
            Dd[i] ← 0
        else
            e ← Dm[i] mod 8
            Dd[i] ← Dr[e]
        end if
    end for

Use the indices in Dm to look up values in the table and store them in Dd. If an index is out of range, zero is stored in the corresponding destination element.

vtbx:

    Minr ← first register in <list>
    Maxr ← last register in <list>
    for 0 ≤ i < 8 do
        r ← Minr + (Dm[i] ÷ 8)
        if r ≤ Maxr then
            e ← Dm[i] mod 8
            Dd[i] ← Dr[e]
        end if
    end for

Use the indices in Dm to look up values in the table and store them in Dd. If an index is out of range, the corresponding destination element is unchanged.

Examples

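Treating the register list as one flat table of bytes, vtbl can be modeled in Python (a sketch of the semantics, not NEON code):

```python
def vtbl(table, indices):
    # Out-of-range indices produce 0; vtbx would instead leave the
    # corresponding destination element unchanged.
    return [table[i] if 0 <= i < len(table) else 0 for i in indices]

print(vtbl([10, 20, 30, 40, 50, 60, 70, 80], [0, 3, 9, 7]))
```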

10.4.10 Zip or Unzip Vectors

These instructions are used to interleave or deinterleave the data from two vectors:

vzip Zip Vectors, and

vuzp Unzip Vectors.

Fig. 10.9 gives an example of the vzip instruction. The vuzp instruction performs the inverse operation.

Figure 10.9 Example of vzip.8 d9,d4.

Syntax

 v<op>.<size> Vd, Vm

 <op> is either zip or uzp.

 <size> is either 8, 16, or 32 and indicates the size of the elements in the matrix (or matrices).

 V can be q or d.

Operations

vzip:

    n ← number of elements
    for 0 ≤ i < (n ÷ 2) do
        tmp1[2×i] ← Vm[i]
        tmp1[2×i + 1] ← Vd[i]
    end for
    for (n ÷ 2) ≤ i < n do
        j ← i − (n ÷ 2)
        tmp2[2×j] ← Vm[i]
        tmp2[2×j + 1] ← Vd[i]
    end for
    Vm ← tmp1
    Vd ← tmp2

Interleave the data from two vectors. tmp1 and tmp2 are vectors of suitable size.

vuzp:

    n ← number of elements
    for 0 ≤ i < (n ÷ 2) do
        tmp1[i] ← Vm[2×i]
        tmp2[i] ← Vm[2×i + 1]
    end for
    for (n ÷ 2) ≤ i < n do
        j ← i − (n ÷ 2)
        tmp1[i] ← Vd[2×j]
        tmp2[i] ← Vd[2×j + 1]
    end for
    Vm ← tmp1
    Vd ← tmp2

Deinterleave the data from two vectors. tmp1 and tmp2 are vectors of suitable size.

Examples

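The interleave/deinterleave pattern can be sketched in Python, abstracting away which register receives which half (a model of the data movement, not NEON code):

```python
def zip_vectors(a, b):
    # Interleave: a0 b0 a1 b1 ..., then split back into two halves.
    inter = [x for pair in zip(a, b) for x in pair]
    return inter[:len(a)], inter[len(a):]

def unzip_vectors(lo, hi):
    # Inverse of zip_vectors: even positions to one vector, odd to the other.
    joined = lo + hi
    return joined[0::2], joined[1::2]

lo, hi = zip_vectors([1, 2, 3, 4], [5, 6, 7, 8])
print(lo, hi)                     # interleaved halves
print(unzip_vectors(lo, hi))      # recovers the original vectors
```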

10.5 Data Conversion

When high precision is not required, the IEEE half-precision format can be used to store floating point numbers in memory. This can reduce memory requirements by up to 50%, and can also yield a significant performance improvement, since only half as much data needs to be moved between the CPU and main memory. However, on most processors half-precision data must be converted to single precision before it is used in calculations. NEON provides enhanced versions of the vcvt instruction which support conversion to and from IEEE half precision. There are also versions of vcvt which operate on vectors and perform integer or fixed-point to floating-point conversions.

10.5.1 Convert Between Fixed Point and Single-Precision

This instruction can be used to perform a data conversion between single precision and fixed point on each element in a vector:

vcvt Convert Data Format.

The elements in the vector must be 32-bit single precision floating point values or 32-bit integers. Fixed point (or integer) arithmetic operations are up to twice as fast as floating point operations. In some cases it is much more efficient to make this conversion, perform the calculations, then convert the results back to floating point.

Syntax

 vcvt{<cond>}.<type>.f32 Sd, Sm{, #<fbits>}

 vcvt{<cond>}.f32.<type> Sd, Sm{, #<fbits>}

 <cond> is an optional condition code.

 <type> must be either s32 or u32.

 The optional <fbits> operand specifies the number of fraction bits for a fixed point number, and must be between 0 and 32. If it is omitted, then it is assumed to be zero.

Operations

Name           Effect                  Description
vcvt.s32.f32   Fd[] ← fixed(Fm[])      Convert single precision to 32-bit signed fixed point or integer.
vcvt.u32.f32   Fd[] ← ufixed(Fm[])     Convert single precision to 32-bit unsigned fixed point or integer.
vcvt.f32.s32   Fd[] ← single(Fm[])     Convert signed 32-bit fixed point or integer to single precision.
vcvt.f32.u32   Fd[] ← single(Fm[])     Convert unsigned 32-bit fixed point or integer to single precision.

Examples

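The float/fixed conversion can be sketched in Python. This models the scaling by the number of fraction bits; like vcvt, the float-to-fixed direction here rounds toward zero:

```python
def vcvt_s32_f32(x, fbits=0):
    # Float to signed fixed point: scale by 2**fbits, truncate toward zero.
    return int(x * (1 << fbits))

def vcvt_f32_s32(n, fbits=0):
    # Signed fixed point to float: divide by 2**fbits.
    return n / (1 << fbits)

print(vcvt_s32_f32(1.75, 8))   # 1.75 in Q-format with 8 fraction bits
print(vcvt_f32_s32(448, 8))    # and back again
```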

10.5.2 Convert Between Half-Precision and Single-Precision

NEON systems with the half-precision extension provide the following instruction to perform conversion between single precision and half precision floating point formats:

vcvt Convert Between Half and Single.

Syntax

 vcvt<op>{<cond>}.f16.f32 Sd, Sm

 vcvt<op>{<cond>}.f32.f16 Sd, Sm

 The <op> must be either b or t and specifies whether the top or bottom half of the register should be used for the half-precision number.

 <cond> is an optional condition code.

Operations

Name            Effect            Description
vcvtb.f16.f32   Sd ← half(Sm)     Convert single precision to half precision and store in the bottom half of the destination.
vcvtt.f16.f32   Sd ← half(Sm)     Convert single precision to half precision and store in the top half of the destination.
vcvtb.f32.f16   Sd ← single(Sm)   Convert the half precision number in the bottom half of the source to single precision.
vcvtt.f32.f16   Sd ← single(Sm)   Convert the half precision number in the top half of the source to single precision.

Examples

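Python's struct module supports the IEEE binary16 format directly, which makes it easy to inspect what these conversions produce (a host-side sketch, not ARM code):

```python
import struct

def f32_to_f16_bits(x):
    # Convert to IEEE half precision and return the 16 raw bits.
    return struct.unpack('<H', struct.pack('<e', x))[0]

def f16_bits_to_f32(bits):
    # Widen a raw 16-bit half precision pattern back to a Python float.
    return struct.unpack('<e', struct.pack('<H', bits))[0]

print(hex(f32_to_f16_bits(1.0)))   # 1.0 in half precision is 0x3c00
```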

10.6 Comparison Operations

NEON adds the ability to perform integer comparisons between vectors. Since there are multiple pairs of items to be compared, the comparison instructions set one element in a result vector for each pair of items. After the comparison operation, each element of the result vector will have every bit set to zero (for false) or one (for true). Note that if the elements of the result vector are interpreted as signed two’s-complement numbers, then the value 0 represents false and the value − 1 represents true.

10.6.1 Vector Compare

The following instructions perform comparisons of all of the corresponding elements of two vectors in parallel:

vceq Compare Equal,

vcge Compare Greater Than or Equal,

vcgt Compare Greater Than,

vcle Compare Less Than or Equal, and

vclt Compare Less Than.

The vector compare instructions compare each element of a vector with the corresponding element in a second vector, and set an element in the destination vector for each comparison. If the comparison is true, then all bits in the result element are set to one. Otherwise, all bits in the result element are set to zero. Note that summing the elements of the result vector (as signed integers) gives the two's complement of the number of comparisons that were true.

Note: vcle and vclt are actually pseudo-instructions. They are equivalent to vcge and vcgt, respectively, with the operands reversed.

Syntax

 vc<op>.<type> Vd, Vn, Vm

 vc<op>.<type> Vd, Vn, #0

 <op> must be one of eq, ge, gt, le, or lt.

 If <op> is eq, then <type> must be i8, i16, i32, or f32.

 If <op> is not eq and the third operand is #0, then <type> must be s8, s16, s32, or f32.

 If <op> is not eq and the third operand is a register, then <type> must be s8, s16, s32, u8, u16, u32, or f32.

 The result data type is determined from the following table:

Operand Type             Result Type
i32, s32, u32, or f32    i32
i16, s16, or u16         i16
i8, s8, or u8            i8

 If the third operand is #0, then it is taken to be a vector of the correct size in which every element is zero.

 V can be d or q.

Operations

vc<op>:

    for 0 ≤ i < vector_length do
        if Vn[i] <op> Vm[i] then
            Vd[i] ← 11⋯1
        else
            Vd[i] ← 00⋯0
        end if
    end for

Compare each element of Vn to the corresponding element of Vm (or to zero). Set the corresponding element of Vd to all ones if <op> is true, and to all zeros otherwise.

Examples

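A Python model of the lane-wise compare, using -1 for an all-ones lane as the text suggests (a sketch of the semantics, not NEON code):

```python
def vceq(vn, vm):
    # True lanes become all ones (-1 as a signed integer), false lanes 0.
    return [-1 if a == b else 0 for a, b in zip(vn, vm)]

result = vceq([1, 2, 3, 4], [1, 0, 3, 0])
print(result)
print(-sum(result))   # negated sum counts how many comparisons were true
```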

10.6.2 Vector Absolute Compare

The following instructions perform comparisons between the absolute values of all of the corresponding elements of two vectors in parallel:

vacgt Absolute Compare Greater Than, and

vacge Absolute Compare Greater Than or Equal.

The vector absolute compare instruction compares the absolute value of each element of a vector with the absolute value of the corresponding element in a second vector, and sets an element in the destination vector for each comparison. If the comparison is true, then all bits in the result element are set to one. Otherwise, all bits in the result element are set to zero. Note that summing the elements of the result vector (as signed integers) will give the two’s complement of the number of comparisons which were true.

Syntax

 vac<op>.f32 Vd, Vn, Vm

 <op> must be either ge or gt.

 V can be d or q.

 The operand element type must be f32.

 The result element type is i32.

Operations

NameEffectDescription
vac<op>

for ivector_length do

 if |Fm[i]|<op> |Fn[i]|

then

 Fd[i]111si89_e

 else

 Fd[i]000si90_e

 end if

end for

Compare each scalar in Fn to the corresponding scalar in Fm. If the comparison is true, then set all bits in the corresponding scalar in Fd to one. Otherwise set all bits in the corresponding scalar in Fd to zero.

t0125

Examples


10.6.3 Vector Test Bits

NEON provides the following vector version of the ARM tst instruction:

vtst Test Bits.

The vector test bits instruction performs a logical AND operation between each element of a vector and the corresponding element in a second vector. If the result is not zero, then every bit in the corresponding element of the result vector is set to one. Otherwise, every bit in the corresponding element of the result vector is set to zero.

Syntax

 vtst.<size> Vd, Vn, Vm

 V can be d or q.

 <size> must be one of 8, 16 or 32

 The result element type is defined by the following table:

<size>    Result Type
32        i32
16        i16
8         i8

Operations

NameEffectDescription
vtst

for ivector_length do

 if (Fm[i] ∧ Fn[i])≠0 then

 Fd[i]111si89_e

 else

 Fd[i]000si90_e

 end if

end for

Perform logical AND between each scalar in Fn and the corresponding scalar in Fm. Set the corresponding scalar in Fd to all ones if the result is not zero, and all zeros otherwise

t0135

Examples


10.7 Bitwise Logical Operations

NEON adds the ability to perform integer and bitwise logical operations on the VFP register set. Recall that integer operations can also be used on fixed-point data. These operations add a great deal of power to the ARM processor.

10.7.1 Bitwise Logical Operations

NEON includes vector versions of the following five basic logical operations:

vand Bitwise AND,

veor Bitwise Exclusive-OR,

vorr Bitwise OR,

vorn Bitwise Complement and OR, and

vbic Bit Clear.

All of them involve two source operands and a destination register.

Syntax

 v<op>{.<type>} Vd, Vn, Vm

 <op> must be one of and, eor, orr, orn, or bic.

 V must be either q or d.

 <type> must be i8, i16, i32, or i64. For these bitwise logical operations, the type does not affect the result.

Operations

Name   Effect            Description
vand   Vd ← Vn ∧ Vm      Bitwise AND
veor   Vd ← Vn ⊕ Vm      Bitwise exclusive OR
vorr   Vd ← Vn ∨ Vm      Bitwise OR
vorn   Vd ← Vn ∨ ¬Vm     Bitwise OR with the complement of the second operand
vbic   Vd ← Vn ∧ ¬Vm     Bit clear

Examples


10.7.2 Bitwise Logical Operations with Immediate Data

It is often useful to clear and/or set specific bits in a register. The NEON instruction set provides the following vector versions of the logical OR and bit clear instructions:

vorr Bitwise OR Immediate, and

vbic Bit Clear Immediate.

Syntax

 v<op>.<type> Vd, #<imm>

 <op> must be either orr, or bic.

 V must be either q or d to specify whether the operation involves quadwords or doublewords.

 <type> must be i16 or i32.

 <imm> is a 16-bit or 32-bit immediate value, which is interpreted as a pattern for filling the immediate operand. The following table shows acceptable patterns for <imm>, based on what was chosen for <type>:

<type>    Acceptable Patterns for <imm>
i16       0x00XY or 0xXY00
i32       0x000000XY, 0x0000XY00, 0x00XY0000, or 0xXY000000

Operations

Name   Effect                      Description
vorr   Vd ← Vd ∨ (imm:⋯:imm)       Logical OR with the replicated immediate
vbic   Vd ← Vd ∧ ¬(imm:⋯:imm)      Bit clear with the replicated immediate

Examples


10.7.3 Bitwise Insertion and Selection

NEON provides three instructions which can be used to combine the bits in two registers or to extract specific bits from a register, according to a pattern:

vbit Bitwise Insert,

vbif Bitwise Insert if False, and

vbsl Bitwise Select.

Syntax


 v<op>{.<type>} Vd, Vn, Vm

 <op> can be bif, bit, or bsl.

 V can be d or q.

 The <type> must be i8, i16, i32, or i64, and specifies the size of items in the vectors. Note that for these bitwise logical operations, the type does not matter, so the assembler ignores it. However, it can be useful to the programmer as extra documentation.

Operations

Name   Effect                          Description
vbit   Vd ← (Vd ∧ ¬Vm) ∨ (Vn ∧ Vm)     Insert each bit from the first operand into the destination if the corresponding bit of the second operand is 1.
vbif   Vd ← (Vd ∧ Vm) ∨ (Vn ∧ ¬Vm)     Insert each bit from the first operand into the destination if the corresponding bit of the second operand is 0.
vbsl   Vd ← (Vd ∧ Vn) ∨ (¬Vd ∧ Vm)     Select each destination bit from the first operand if the corresponding bit of the destination is 1, or from the second operand if it is 0.

Examples

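The bit selection performed by vbsl can be modeled in Python on 8-bit lane values (a sketch of the semantics, not NEON code; the mask keeps Python's unbounded integers to 8 bits):

```python
def vbsl(vd, vn, vm, bits=8):
    mask = (1 << bits) - 1
    # Each destination bit comes from vn where vd holds a 1,
    # and from vm where vd holds a 0.
    return [(d & n) | ((~d & mask) & m) for d, n, m in zip(vd, vn, vm)]

# Select the high nibble from 0xAA and the low nibble from 0x55.
print([hex(x) for x in vbsl([0xF0], [0xAA], [0x55])])
```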

10.8 Shift Instructions

The NEON shift instructions operate on vectors. Shifts are often used for multiplication and division by powers of two. A left shift multiplies by a power of two, but the result may be too large for the destination element, resulting in overflow. A right shift is equivalent to division by a power of two, and in some cases it may be useful to round the result rather than truncating. NEON provides versions of the shift instructions which perform saturation and/or rounding of the result.

10.8.1 Shift Left by Immediate

These instructions shift each element in a vector left by an immediate value:

vshl Shift Left Immediate,

vqshl Saturating Shift Left Immediate,

vqshlu Saturating Shift Left Immediate Unsigned, and

vshll Shift Left Immediate Long.

Overflow conditions can be avoided by using the saturating version, or by using the long version, in which case the destination is twice the size of the source.

Syntax

 vshl.<type> Vd, Vm, #<imm>

 vqshl{u}.<type> Vd, Vm, #<imm>

 vshll.<type> Qd, Dm, #<imm>

 If u is present, then the results are unsigned.

 The valid choices for <type> are given in the following table:

Opcode    Valid Types
vshl      i8, i16, i32, i64, s8, s16, or s32
vqshl     s8, s16, s32, s64, u8, u16, u32, or u64
vqshlu    s8, s16, s32, or s64
vshll     u8, u16, u32, u64, s8, s16, or s32

Operations

vshl:

    Vd[] ← Vm[] ≪ imm

Each element of Vm is shifted left by the immediate value and stored in the corresponding element of Vd. Bits shifted past the end of an element are lost.

vshll:

    Qd[] ← extend(Dm[]) ≪ imm

Each element of Dm is shifted left by the immediate value and stored in the corresponding element of Qd. The values are sign or zero extended, depending on <type>.

vqshl{u}:

    Vd[] ← saturate(Vm[] ≪ imm)

Each element of Vm is shifted left by the immediate value and stored in the corresponding element of Vd. If the result of the shift is outside the range of the destination element, then the value is saturated. If u was specified, then the destination is unsigned. Otherwise, it is signed.

Examples


10.8.2 Shift Left or Right by Variable

These instructions shift each element in a vector, using the least significant byte of the corresponding element of a second vector as the shift amount:

vshl Shift Left or Right by Variable,

vrshl Shift Left or Right by Variable and Round,

vqshl Saturating Shift Left or Right by Variable, and

vqrshl Saturating Shift Left or Right by Variable and Round.

If the shift value is positive, the operation is a left shift. If the shift value is negative, then it is a right shift. A shift value of zero is equivalent to a move. If the operation is a right shift, and r is specified, then the result is rounded rather than truncated. Results are saturated if q is specified.

Syntax

 v{q}{r}shl.<type> Vd, Vn, Vm

 If q is present, then the results are saturated.

 If r is present, then right shifted values are rounded rather than truncated.

 V can be d or q.

 <type> must be one of s8, s16, s32, s64, u8, u16, u32, or u64.

Operations

v{q}{r}shl:

    if q is present then
        if r is present then
            Vd[] ← saturate(round(Vn[] ≪ Vm[]))
        else
            Vd[] ← saturate(Vn[] ≪ Vm[])
        end if
    else
        if r is present then
            Vd[] ← round(Vn[] ≪ Vm[])
        else
            Vd[] ← Vn[] ≪ Vm[]
        end if
    end if

Each element of Vn is shifted by the value in the least significant byte of the corresponding element of Vm and stored in the corresponding element of Vd. Negative shift values shift right. Bits shifted past the end of an element are lost unless saturation is specified.

Examples


10.8.3 Shift Right by Immediate

These instructions shift each element in a vector right by an immediate value:

vshr Shift Right Immediate,

vrshr Shift Right Immediate and Round,

vshrn Shift Right Immediate and Narrow,

vrshrn Shift Right Immediate Round and Narrow,

vsra Shift Right and Accumulate Immediate, and

vrsra Shift Right Round and Accumulate Immediate.

Syntax

 v{r}shr{<cond>}.<type> Vd, Vm, #<imm>

 v{r}shrn{<cond>}.<type> Vd, Vm, #<imm>

 v{r}sra{<cond>}.<type> Vd, Vm, #<imm>

 V can be d or q.

 If r is present, then right shifted values are rounded rather than truncated.

 <cond> is an optional condition code.

 The valid choices for <type> are given in the following table:

Opcode      Valid Types
v{r}shr     u8, u16, u32, u64, s8, s16, s32, or s64
v{r}shrn    i16, i32, or i64
v{r}sra     u8, u16, u32, u64, s8, s16, s32, or s64

Operations

v{r}shr:

    if r is present then
        Vd[] ← round(Vm[] ≫ imm)
    else
        Vd[] ← Vm[] ≫ imm
    end if

Each element of Vm is shifted right with sign or zero extension (depending on <type>) by the immediate value and stored in the corresponding element of Vd. Results can optionally be rounded.

v{r}shrn:

    if r is present then
        Vd[] ← narrow(round(Vm[] ≫ imm))
    else
        Vd[] ← narrow(Vm[] ≫ imm)
    end if

Each element of Vm is shifted right by the immediate value, optionally rounded, then narrowed and stored in the corresponding element of Vd.

v{r}sra:

    if r is present then
        Vd[] ← Vd[] + round(Vm[] ≫ imm)
    else
        Vd[] ← Vd[] + (Vm[] ≫ imm)
    end if

Each element of Vm is shifted right with sign or zero extension by the immediate value and accumulated into the corresponding element of Vd. Results can optionally be rounded.

Examples


10.8.4 Saturating Shift Right by Immediate

These instructions shift each element in a quad word vector right by an immediate value:

vqshrn Saturating Shift Right Immediate,

vqrshrn Saturating Shift Right Immediate Round,

vqshrun Saturating Shift Right Immediate Unsigned, and

vqrshrun Saturating Shift Right Immediate Round Unsigned.

The result is optionally rounded, then saturated, narrowed, and stored in a double word vector.

Syntax

 vq{r}shr{u}n.<type> Dd, Qm, #<imm>

 If r is present, then right shifted values are rounded rather than truncated.

 If u is present, then the results are unsigned, regardless of the type of elements in Qm.

 The valid choices for <type> are given in the following table:

Opcode        Valid Types
vq{r}shrn     u16, u32, u64, s16, s32, or s64
vq{r}shrun    s16, s32, or s64

 <imm> is the amount that elements are to be shifted, and must be between zero and one less than the number of bits in <type>.

Operations

NameEffectDescription
vq{r}shrn

if r is present then

 eq10-07-9780128036983

else

 eq10-08-9780128036983

end if

Each element of Vm is shifted right with sign extension by the immediate value, optionally rounded, then saturated and narrowed, and stored in the corresponding element of Vd.
vq{r}shrun

if r is present then

 eq10-09-9780128036983

else

 eq10-10-9780128036983

end if

Each element of Vm is shifted right with zero extension by the immediate value, optionally rounded, then saturated and narrowed, and stored in the corresponding element of Vd.

t0185

Examples


10.8.5 Shift and Insert

These instructions perform bitwise shifting of each element in a vector, then combine the results with the contents of the destination register:

vsli Shift Left and Insert,

vsri Shift Right and Insert.

Fig. 10.10 provides an example.

Figure 10.10 Effects of vsli.32 d4,d9,#6.

Syntax

 vs<dir>i.<size> Vd, Vm, #<imm>

 <dir> must be l for a left shift, or r for a right shift.

 <size> must be 8, 16, 32, or 64.

 <imm> is the amount that elements are to be shifted, and must be between zero and <size>− 1 for vsli, or between one and <size> for vsri.

Operations

vsli:

    mask ← (1 ≪ imm) − 1
    Vd[] ← (mask ∧ Vd[]) ∨ (Vm[] ≪ imm)

Each element of Vm is shifted left and combined with the lower <imm> bits of the corresponding element of Vd.

vsri:

    mask ← ¬((1 ≪ (size − imm)) − 1)
    Vd[] ← (mask ∧ Vd[]) ∨ (Vm[] ≫ imm)

Each element of Vm is shifted right and combined with the upper <imm> bits of the corresponding element of Vd.

Examples


10.9 Arithmetic Instructions

NEON provides several instructions for addition, subtraction, and multiplication, but does not provide a divide instruction. Whenever possible, division should be performed by multiplying by the reciprocal. When dividing by constants, the reciprocal can be calculated in advance, as shown in Chapter 8. For dividing by variables, NEON provides instructions for quickly calculating the reciprocals of all elements in a vector. In most cases, this is faster than using a divide instruction. When division is absolutely unavoidable, the VFP divide instructions can be used.

10.9.1 Vector Add and Subtract

The following eight instructions perform vector addition and subtraction:

vadd Add

vqadd Saturating Add

vaddl Add Long

vaddw Add Wide

vsub Subtract

vqsub Saturating Subtract

vsubl Subtract Long

vsubw Subtract Wide

The Vector Add (vadd) instruction adds corresponding elements in two vectors and stores the results in the corresponding elements of the destination register. The Vector Subtract (vsub) instruction subtracts elements in one vector from corresponding elements in another vector and stores the results in the corresponding elements of the destination register. Other versions allow mismatched operand and destination sizes, and the saturating versions prevent overflow by limiting the range of the results.

Syntax

 v{q}<op>.<type> Vd, Vn, Vm

 v<op>l.<type> Qd, Dn, Dm

 v<op>w.<type> Qd, Qn, Dm

 <op> is either add or sub.

 The valid choices for <type> are given in the following table:

Opcode    Valid Types
v<op>     i8, i16, i32, i64, or f32
vq<op>    s8, s16, s32, s64, u8, u16, u32, or u64
v<op>l    s8, s16, s32, u8, u16, or u32
v<op>w    s8, s16, s32, u8, u16, or u32

Operations

v<op>:

    Vd[] ← Vn[] <op> Vm[]

The operation is applied to corresponding elements of Vn and Vm. The results are stored in the corresponding elements of Vd.

vq<op>:

    Vd[] ← saturate(Vn[] <op> Vm[])

The operation is applied to corresponding elements of Vn and Vm. The results are saturated, then stored in the corresponding elements of Vd.

v<op>l:

    Qd[] ← extend(Dn[]) <op> extend(Dm[])

The elements of Dn and Dm are sign or zero extended, then the operation is applied. The results are stored in the corresponding elements of Qd.

v<op>w:

    Qd[] ← Qn[] <op> extend(Dm[])

The elements of Dm are sign or zero extended, then the operation is applied with the corresponding elements of Qn. The results are stored in the corresponding elements of Qd.

Examples

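The saturating variant is the interesting one: rather than wrapping around on overflow, results are clamped to the range of the element type. A Python sketch for signed 8-bit lanes (a model of the semantics, not NEON code):

```python
def vqadd_s8(vn, vm):
    # Saturating signed 8-bit add: results are clamped to [-128, 127]
    # instead of wrapping around.
    return [max(-128, min(127, a + b)) for a, b in zip(vn, vm)]

print(vqadd_s8([120, -100, 1], [100, -100, 1]))
```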

10.9.2 Vector Add and Subtract with Narrowing

These instructions add or subtract the corresponding elements of two vectors, and narrow by taking the most significant half of the result:

vaddhn Add and Narrow

vraddhn Add, Round, and Narrow

vsubhn Subtract and Narrow

vrsubhn Subtract, Round, and Narrow

The results are stored in the corresponding elements of the destination register. Results can be optionally rounded instead of truncated.

Syntax

 v{r}<op>hn.<type> Dd, Qn, Qm

 <op> is either add or sub.

 If r is specified, then the result is rounded instead of truncated.

 <type> must be either i16, i32, or i64.

Operations

v{r}<op>hn:

    shift ← size ÷ 2
    x ← Qn[] <op> Qm[]
    if r is present then
        Dd[] ← round(x ≫ shift)
    else
        Dd[] ← x ≫ shift
    end if

The operation is applied to corresponding elements of Qn and Qm. The results are optionally rounded, then narrowed by taking the most significant half, and stored in the corresponding elements of Dd.

Examples

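Taking the most significant half of each sum can be modeled in Python (a sketch, not NEON code; size is the width of the source elements in bits):

```python
def vaddhn(qn, qm, size=16):
    # Add 'size'-bit lanes, then keep only the most significant half
    # of each sum (truncating narrow, as without the 'r' prefix).
    half = size // 2
    return [((a + b) >> half) & ((1 << half) - 1) for a, b in zip(qn, qm)]

print([hex(x) for x in vaddhn([0x1234, 0xFF00], [0x0101, 0x0100])])
```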

10.9.3 Add or Subtract and Divide by Two

These instructions add or subtract corresponding elements from two vectors then shift the result right by one bit:

vhadd Halving Add

vrhadd Halving Add and Round

vhsub Halving Subtract

The results are stored in corresponding elements of the destination vector. If the operation is addition, then the results can be optionally rounded.

Syntax

 v{r}hadd.<type> Vd, Vn, Vm

 vhsub.<type> Vd, Vn, Vm

 If r is specified, then the result is rounded instead of truncated.

 <type> must be either s8, s16, s32, u8, u16, or u32.

Operations

v{r}hadd:

    if r is present then
        Vd[] ← (Vn[] + Vm[] + 1) ≫ 1
    else
        Vd[] ← (Vn[] + Vm[]) ≫ 1
    end if

The corresponding elements of Vn and Vm are added together, then shifted right one bit, optionally with rounding. Results are stored in the corresponding elements of Vd.

vhsub:

    Vd[] ← (Vn[] − Vm[]) ≫ 1

The elements of Vm are subtracted from the corresponding elements of Vn. Results are shifted right one bit and stored in the corresponding elements of Vd.

Examples

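The difference between the truncating and rounding forms is just the +1 before the shift, which is easy to see in a Python sketch (a model of the semantics, not NEON code):

```python
def vhadd(vn, vm):
    # Truncating halving add: (a + b) >> 1.
    return [(a + b) >> 1 for a, b in zip(vn, vm)]

def vrhadd(vn, vm):
    # Rounding halving add: the +1 rounds .5 cases upward.
    return [(a + b + 1) >> 1 for a, b in zip(vn, vm)]

print(vhadd([1, 2], [2, 2]))    # truncates 1.5 down to 1
print(vrhadd([1, 2], [2, 2]))   # rounds 1.5 up to 2
```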

10.9.4 Add Elements Pairwise

These instructions add vector elements pairwise:

vpadd Add Pairwise

vpaddl Add Pairwise Long

vpadal Add Pairwise and Accumulate Long

The long versions can be used to prevent overflow.

Syntax

 vpadd.<type> Dd, Dn, Dm

 vp<op>l.<type> Vd, Vm

 <op> must be either add or ada.

 The valid choices for <type> are given in the following table:

Opcode     Valid Types
vpadd      i8, i16, i32, or f32
vp<op>l    s8, s16, s32, u8, u16, or u32

Operations

vpadd:

    n ← number of elements
    for 0 ≤ i < (n ÷ 2) do
        Dd[i] ← Dn[2×i] + Dn[2×i+1]
    end for
    for (n ÷ 2) ≤ i < n do
        j ← i − (n ÷ 2)
        Dd[i] ← Dm[2×j] + Dm[2×j+1]
    end for

Add the elements of two vectors pairwise and store the results in another vector.

vpaddl:

    n ← number of elements in Vm
    for 0 ≤ i < (n ÷ 2) do
        Vd[i] ← Vm[2×i] + Vm[2×i+1]
    end for

Add the elements of a vector pairwise, widening the results, and store them in another vector.

vpadal:

    n ← number of elements in Vm
    for 0 ≤ i < (n ÷ 2) do
        Vd[i] ← Vd[i] + Vm[2×i] + Vm[2×i+1]
    end for

Add the elements of a vector pairwise and accumulate the widened results into another vector.

Examples

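A Python sketch of the pairwise add (a model of the semantics, not NEON code; it assumes the first operand's pair sums fill the lower half of the destination):

```python
def vpadd(dn, dm):
    # Sum adjacent pairs across the two operands; dn's pairs fill the
    # lower half of the result, dm's pairs the upper half.
    both = dn + dm
    return [both[2 * i] + both[2 * i + 1] for i in range(len(dn))]

print(vpadd([1, 2, 3, 4], [5, 6, 7, 8]))
```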

10.9.5 Absolute Difference

These instructions subtract the elements of one vector from another and store or accumulate the absolute value of the results:

vaba Absolute Difference and Accumulate

vabal Absolute Difference and Accumulate Long

vabd Absolute Difference

vabdl Absolute Difference Long

The long versions can be used to prevent overflow.

Syntax

v<op>.<type> Vd, Vn, Vm

v<op>l.<type> Qd, Dn, Dm

 <op> is either aba or abd.

 The valid choices for <type> are given in the following table:

Opcode    Valid Types
vabd      s8, s16, s32, u8, u16, u32, or f32
vaba      s8, s16, s32, u8, u16, or u32
vabdl     s8, s16, s32, u8, u16, or u32
vabal     s8, s16, s32, u8, u16, or u32

Operations

vabd:

    Vd[] ← |Vn[] − Vm[]|

Subtract corresponding elements and take the absolute value of each result.

vaba:

    Vd[] ← Vd[] + |Vn[] − Vm[]|

Subtract corresponding elements, take the absolute value, and accumulate the results.

vabdl:

    Qd[] ← |extend(Dn[]) − extend(Dm[])|

Extend and subtract corresponding elements, then take the absolute value.

vabal:

    Qd[] ← Qd[] + |extend(Dn[]) − extend(Dm[])|

Extend and subtract corresponding elements, take the absolute value, and accumulate the results.

Examples

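The absolute-difference family is easy to model in Python (a sketch of the semantics, not NEON code):

```python
def vabd(vn, vm):
    # Lane-wise absolute difference.
    return [abs(a - b) for a, b in zip(vn, vm)]

def vaba(vd, vn, vm):
    # Absolute difference accumulated into the destination lanes.
    return [d + abs(a - b) for d, a, b in zip(vd, vn, vm)]

print(vabd([5, 1], [2, 9]))
print(vaba([10, 10], [5, 1], [2, 9]))
```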

10.9.6 Absolute Value and Negate

These operations compute the absolute value or negate each element in a vector:

vabs Absolute Value

vneg Negate

vqabs Saturating Absolute Value

vqneg Saturating Negate

The saturating versions can be used to prevent overflow.

Syntax

 v{q}<op>.<type> Vd, Vm

 If q is present then results are saturated.

 <op> is either abs or neg.

 The valid choices for <type> are given in the following table:

Opcode    Valid Types
vabs      s8, s16, s32, or f32
vneg      s8, s16, s32, or f32
vqabs     s8, s16, or s32
vqneg     s8, s16, or s32

Operations

v{q}abs:

    if q is present then
        Vd[] ← saturate(|Vm[]|)
    else
        Vd[] ← |Vm[]|
    end if

Copy the absolute value of each element of Vm to the corresponding element of Vd, optionally saturating the result.

v{q}neg:

    if q is present then
        Vd[] ← saturate(−Vm[])
    else
        Vd[] ← −Vm[]
    end if

Copy the negation of each element of Vm to the corresponding element of Vd, optionally saturating the result.

Examples

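Saturation matters here because negating the most negative two's-complement value overflows. A Python sketch for signed 8-bit lanes (a model of the semantics, not NEON code):

```python
def vqneg_s8(vm):
    # Saturating negate: -(-128) does not fit in 8 bits, so it clamps to 127.
    return [max(-128, min(127, -x)) for x in vm]

print(vqneg_s8([-128, 5, -5]))
```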

10.9.7 Get Maximum or Minimum Elements

The following four instructions select the maximum or minimum elements and store the results in the destination vector:

vmax Maximum

vmin Minimum

vpmax Pairwise Maximum

vpmin Pairwise Minimum

Syntax

 v<op>.<type> Vd, Vn, Vm

 vp<op>.<type> Dd, Dn, Dm

 <op> is either max or min.

 <type> must be one of s8, s16, s32, u8, u16, u32, or f32.

Operations

Name | Effect | Description

vmax

n ← # of elements
for 0 ≤ i < n do
 if Vn[i] > Vm[i] then
  Vd[i] ← Vn[i]
 else
  Vd[i] ← Vm[i]
 end if
end for

Compare corresponding elements and copy the greater of each pair into the corresponding element in the destination vector

vpmax

n ← # of elements
for 0 ≤ i < (n ÷ 2) do
 if Dn[2i] > Dn[2i + 1] then
  Dd[i] ← Dn[2i]
 else
  Dd[i] ← Dn[2i + 1]
 end if
end for
for 0 ≤ i < (n ÷ 2) do
 if Dm[2i] > Dm[2i + 1] then
  Dd[i + (n ÷ 2)] ← Dm[2i]
 else
  Dd[i + (n ÷ 2)] ← Dm[2i + 1]
 end if
end for

Compare adjacent pairs of elements within each source vector and copy the greater of each pair into an element of the destination vector; the results from Dn fill the lower half and the results from Dm fill the upper half

vmin

n ← # of elements
for 0 ≤ i < n do
 if Vn[i] < Vm[i] then
  Vd[i] ← Vn[i]
 else
  Vd[i] ← Vm[i]
 end if
end for

Compare corresponding elements and copy the lesser of each pair into the corresponding element in the destination vector

vpmin

n ← # of elements
for 0 ≤ i < (n ÷ 2) do
 if Dn[2i] < Dn[2i + 1] then
  Dd[i] ← Dn[2i]
 else
  Dd[i] ← Dn[2i + 1]
 end if
end for
for 0 ≤ i < (n ÷ 2) do
 if Dm[2i] < Dm[2i + 1] then
  Dd[i + (n ÷ 2)] ← Dm[2i]
 else
  Dd[i + (n ÷ 2)] ← Dm[2i + 1]
 end if
end for

Compare adjacent pairs of elements within each source vector and copy the lesser of each pair into an element of the destination vector; the results from Dn fill the lower half and the results from Dm fill the upper half


Examples


10.9.8 Count Bits

These instructions can be used to count leading sign bits or zeros, or to count the number of bits that are set for each element in a vector:

vcls Count Leading Sign Bits

vclz Count Leading Zero Bits

vcnt Count Set Bits

Syntax

 v<op>.<type> Vd, Vm

 <op> is either cls, clz or cnt.

 The valid choices for <type> are given in the following table:

Opcode | Valid Types
vcls   | s8, s16, or s32
vclz   | u8, u16, or u32
vcnt   | i8

Operations

Name | Effect | Description

vcls

n ← # of elements
for 0 ≤ i < n do
 Vd[i] ← leading_sign_bits(Vm[i])
end for

Count the number of consecutive bits that are the same as the sign bit for each element in Vm, and store the counts in the corresponding elements of Vd

vclz

n ← # of elements
for 0 ≤ i < n do
 Vd[i] ← leading_zero_bits(Vm[i])
end for

Count the number of leading zero bits for each element in Vm, and store the counts in the corresponding elements of Vd

vcnt

n ← # of elements
for 0 ≤ i < n do
 Vd[i] ← count_one_bits(Vm[i])
end for

Count the number of bits in each element of Vm that are set to one, and store the counts in the corresponding elements of Vd


Examples


10.10 Multiplication and Division

There is no vector divide instruction in NEON. Division is accomplished with multiplication by the reciprocals of the divisors. The reciprocals are found by making an initial estimate, then using the Newton-Raphson method to improve the approximation. This can actually be faster than using a hardware divider. NEON supports single precision floating point and unsigned fixed point reciprocal calculation. Fixed point reciprocals provide higher precision. Division using the NEON reciprocal method may not provide the best precision possible. If the best possible precision is required, then the VFP divide instruction should be used.

10.10.1 Multiply

These instructions are used to multiply the corresponding elements from two vectors:

vmul Multiply

vmla Multiply Accumulate

vmls Multiply Subtract

vmull Multiply Long

vmlal Multiply Accumulate Long

vmlsl Multiply Subtract Long

The long versions can be used to avoid overflow.

Syntax

 v<op>.<type> Vd, Vn, Vm

 v<op>l.<type> Qd, Dn, Dm

 <op> is either mul, mla, or mls.

 The valid choices for <type> are given in the following table:

Opcode | Valid Types
vmul   | p8, i8, i16, or i32
vmla   | i8, i16, or i32
vmls   | i8, i16, or i32
vmull  | p8, s8, s16, s32, u8, u16, or u32
vmlal  | s8, s16, s32, u8, u16, or u32
vmlsl  | s8, s16, s32, u8, u16, or u32

Operations

Name | Effect | Description

vmul

Vd[] ← Vn[] × Vm[]

Multiply corresponding elements from two vectors and store the results in a third vector

vmla

Vd[] ← Vd[] + (Vn[] × Vm[])

Multiply corresponding elements from two vectors and add the results to a third vector

vmls

Vd[] ← Vd[] − (Vn[] × Vm[])

Multiply corresponding elements from two vectors and subtract the results from a third vector

vmull

Qd[] ← Dn[] × Dm[]

Multiply corresponding elements from two vectors and store the widened results in a third vector

vmlal

Qd[] ← Qd[] + (Dn[] × Dm[])

Multiply corresponding elements from two vectors and add the widened results to a third vector

vmlsl

Qd[] ← Qd[] − (Dn[] × Dm[])

Multiply corresponding elements from two vectors and subtract the widened results from a third vector


Examples


10.10.2 Multiply by Scalar

These instructions are used to multiply each element in a vector by a scalar:

vmul Multiply by Scalar

vmla Multiply Accumulate by Scalar

vmls Multiply Subtract by Scalar

vmull Multiply Long by Scalar

vmlal Multiply Accumulate Long by Scalar

vmlsl Multiply Subtract Long by Scalar

The long versions can be used to avoid overflow.

Syntax

v<op>.<type> Vd, Vn, Dm[x]

v<op>l.<type> Qd, Dn, Dm[x]

 <op> is either mul, mla, or mls.

 The valid choices for <type> are given in the following table:

Opcode | Valid Types
vmul   | i16, i32, or f32
vmla   | i16, i32, or f32
vmls   | i16, i32, or f32
vmull  | s16, s32, u16, or u32
vmlal  | s16, s32, u16, or u32
vmlsl  | s16, s32, u16, or u32

 x must be valid for the chosen <type>.

Operations

Name | Effect | Description

vmul

Vd[] ← Vn[] × Dm[x]

Multiply each element of a vector by a scalar and store the results in the destination vector

vmla

Vd[] ← Vd[] + (Vn[] × Dm[x])

Multiply each element of a vector by a scalar and add the results to the destination vector

vmls

Vd[] ← Vd[] − (Vn[] × Dm[x])

Multiply each element of a vector by a scalar and subtract the results from the destination vector

vmull

Qd[] ← Dn[] × Dm[x]

Multiply each element of a vector by a scalar and store the widened results in the destination vector

vmlal

Qd[] ← Qd[] + (Dn[] × Dm[x])

Multiply each element of a vector by a scalar and add the widened results to the destination vector

vmlsl

Qd[] ← Qd[] − (Dn[] × Dm[x])

Multiply each element of a vector by a scalar and subtract the widened results from the destination vector


Examples


10.10.3 Fused Multiply Accumulate

A fused multiply accumulate operation does not perform rounding between the multiply and add operations. The two operations are fused into one. NEON provides the following fused multiply accumulate instructions:

vfma Fused Multiply Accumulate

vfnma Fused Negate Multiply Accumulate

vfms Fused Multiply Subtract

vfnms Fused Negate Multiply Subtract

Using the fused multiply accumulate can result in improved speed and accuracy for many computations that involve the accumulation of products.

Syntax

 <op>{<cond>}.<prec> Fd, Fn, Fm

<op> is one of vfma, vfnma, vfms, or vfnms.

<cond> is an optional condition code.

<prec> may be either f32 or f64.

Operations

Name | Effect | Description
vfma  | Fd ← Fd + Fn × Fm  | Multiply and accumulate
vfnma | Fd ← −Fd − Fn × Fm | Negate, multiply, and accumulate
vfms  | Fd ← Fd − Fn × Fm  | Multiply and subtract
vfnms | Fd ← −Fd + Fn × Fm | Negate, multiply, and subtract

Examples


10.10.4 Saturating Multiply and Double (Low)

These instructions perform multiplication, double the results, and perform saturation:

vqdmull Saturating Multiply Double (Low)

vqdmlal Saturating Multiply Double Accumulate (Low)

vqdmlsl Saturating Multiply Double Subtract (Low)

Syntax

 vqd<op>l.<type> Qd, Dn, Dm

 vqd<op>l.<type> Qd, Dn, Dm[x]

 <op> is either mul, mla, or mls.

 <type> must be either s16 or s32.

Operations

Name | Effect | Description

vqdmull

if second operand is scalar then

 Qd[] ← sat(2 × (Dn[] × Dm[x]))

else

 Qd[] ← sat(2 × (Dn[] × Dm[]))

end if

Multiply elements, double the results, and store in the destination vector with saturation

vqdmlal

if second operand is scalar then

 Qd[] ← sat(Qd[] + 2 × (Dn[] × Dm[x]))

else

 Qd[] ← sat(Qd[] + 2 × (Dn[] × Dm[]))

end if

Multiply elements, double the results, and add to the destination vector with saturation

vqdmlsl

if second operand is scalar then

 Qd[] ← sat(Qd[] − 2 × (Dn[] × Dm[x]))

else

 Qd[] ← sat(Qd[] − 2 × (Dn[] × Dm[]))

end if

Multiply elements, double the results, and subtract from the destination vector with saturation


Examples


10.10.5 Saturating Multiply and Double (High)

These instructions perform multiplication, double the results, perform saturation, and store the high half of the results:

vqdmulh Saturating Multiply Double (High)

vqrdmulh Saturating Multiply Double (High) and Round

Syntax

 vq{r}dmulh.<type> Vd, Vn, Vm

 vq{r}dmulh.<type> Vd, Vn, Dm[x]

 <type> must be either s16 or s32.

Operations

Name | Effect | Description

vqdmulh

n ← size of <type>
if second operand is scalar then

 Vd[] ← sat((2 × Vn[] × Dm[x]) ÷ 2^n)

else

 Vd[] ← sat((2 × Vn[] × Vm[]) ÷ 2^n)

end if

Multiply elements, double the results, and store the high half in the destination vector with saturation

vqrdmulh

n ← size of <type>
if second operand is scalar then

 Vd[] ← sat(round((2 × Vn[] × Dm[x]) ÷ 2^n))

else

 Vd[] ← sat(round((2 × Vn[] × Vm[]) ÷ 2^n))

end if

Multiply elements, double the results, round, and store the high half in the destination vector with saturation


Examples


10.10.6 Estimate Reciprocals

These instructions perform the initial estimates of the reciprocal values:

vrecpe Reciprocal Estimate

vrsqrte Reciprocal Square Root Estimate

These work on floating point and unsigned fixed point vectors. The estimates from these instructions are accurate to within about eight bits. If higher accuracy is desired, then the Newton-Raphson method can be used to improve the initial estimates. For more information, see the Reciprocal Step instruction.

Syntax

 v<op>.<type> Vd, Vm

 <op> is either recpe or rsqrte.

 <type> must be either u32 or f32.

 If <type> is u32, then the elements are assumed to be U(1,31) fixed point numbers, and the most significant fraction bit (bit 30) must be 1, and the integer part must be zero. The vclz and shift by variable instructions can be used to put the data in the correct format.

 The result elements are always f32.

Operations

Name | Effect | Description

vrecpe

n ← # of elements
for 0 ≤ i < n do
 Vd[i] ← estimate(1 ÷ Vm[i])
end for

Find an approximate reciprocal of each element in a vector

vrsqrte

n ← # of elements
for 0 ≤ i < n do
 Vd[i] ← estimate(1 ÷ √Vm[i])
end for

Find an approximate reciprocal square root of each element in a vector


Examples


10.10.7 Reciprocal Step

These instructions are used to perform one Newton-Raphson step for improving the reciprocal estimates:

vrecps Reciprocal Step

vrsqrts Reciprocal Square Root Step

For each element in the vector, the following equation can be used to improve the estimates of the reciprocals:

xn+1 = xn(2 − dxn),

where xn is the estimated reciprocal from the previous step, and d is the number for which the reciprocal is desired. This equation converges to 1/d if x0 is obtained using vrecpe on d. The vrecps instruction computes

xn+1 = 2 − dxn,

so one additional multiplication is required to complete the update step. The initial estimate x0 must be obtained using the vrecpe instruction.

For each element in the vector, the following equation can be used to improve the estimates of the reciprocals of the square roots:

xn+1 = xn(3 − dxn²) ÷ 2,

where xn is the estimated reciprocal square root from the previous step, and d is the number for which the reciprocal square root is desired. This equation converges to 1/√d if x0 is obtained using vrsqrte on d. The vrsqrts instruction computes

xn+1 = (3 − dxn²) ÷ 2,

so two additional multiplications are required to complete the update step. The initial estimate x0 must be obtained using the vrsqrte instruction.

Syntax

 v<op>.<type> Vd, Vn, Vm

 <op> is either recps or rsqrts.

 <type> must be either u32 or f32.

Operations

Name | Effect | Description

vrecps

n ← # of elements
for 0 ≤ i < n do
 Vd[i] ← 2 − Vn[i] × Vm[i]
end for

Perform most of the Newton-Raphson reciprocal improvement step

vrsqrts

n ← # of elements
for 0 ≤ i < n do
 Vd[i] ← (3 − Vn[i] × Vm[i]) ÷ 2
end for

Perform most of the Newton-Raphson reciprocal square root improvement step


Examples


10.11 Pseudo-Instructions

The GNU assembler supports several pseudo-instructions for NEON. Two of them are vcle and vclt, which were covered in Section 10.6.1. The others are explained in the following sections.

10.11.1 Load Constant

This pseudo-instruction loads a constant value into every element of a NEON vector, or into a VFP single-precision or double-precision register:

vldr Load Constant.

This pseudo-instruction will use vmov if possible. Otherwise, it will create an entry in the literal pool and use vldr.

Syntax

 vldr{<cond>}.<type> Vd, =<imm>

 <cond> is an optional condition code.

 <type> must be one of i8, i16, i32, i64, s8, s16, s32, s64, u8, u16, u32, u64, f32, or f64.

 <imm> is a value appropriate for the specified <type>.

Operations

Name | Effect | Description
vldr | Vd ← <imm> | Load a constant


Examples


10.11.2 Bitwise Logical Operations with Immediate Data

It is often useful to clear and/or set specific bits in a register. The following pseudo-instructions can provide bitwise logical operations:

vand Bitwise AND Immediate

vorn Bitwise Complement and OR Immediate

Syntax

 v<op>.<type> Vd, #<imm>

 <op> must be either and or orn.

 V must be either q or d to specify whether the operation involves quadwords or doublewords.

 <type> must be i8, i16, i32, or i64.

 <imm> is a 16-bit or 32-bit immediate value, which is interpreted as a pattern for filling the immediate operand. The following table shows acceptable patterns for <imm>, based on what was chosen for <type>:

<type>   | Acceptable patterns for <imm>
i8, i16  | 0xFFXY, 0xXYFF
i32, i64 | 0xFFFFFFXY, 0xFFFFXYFF, 0xFFXYFFFF, 0xXYFFFFFF


Operations

Name | Effect | Description
vand | Vd[] ← Vd[] ∧ imm  | Bitwise AND immediate
vorn | Vd[] ← Vd[] ∨ ¬imm | Bitwise complement and OR immediate

Examples


10.11.3 Vector Absolute Compare

The following pseudo-instructions perform comparisons between the absolute values of all of the corresponding elements of two vectors in parallel:

vacle Absolute Compare Less Than or Equal

vaclt Absolute Compare Less Than

The vector absolute compare instruction compares the absolute value of each element of a vector with the absolute value of the corresponding element in a second vector, and sets an element in the destination vector for each comparison. If the comparison is true, then all bits in the result element are set to one. Otherwise, all bits in the result element are set to zero. Note that summing the elements of the result vector (as signed integers) will give the two’s complement of the number of comparisons which were true.

Syntax

 vac<op>.f32 Vd, Vn, Vm

 <op> must be either le or lt.

 V can be d or q.

 The operand element type must be f32.

 The result element type is i32.

Operations

Name | Effect | Description

vac<op>

for 0 ≤ i < vector_length do
 if |Fn[i]| <op> |Fm[i]| then
  Fd[i] ← 11…1
 else
  Fd[i] ← 00…0
 end if
end for

Compare the absolute value of each scalar in Fn to the absolute value of the corresponding scalar in Fm. If the comparison is true, then set all bits in the corresponding scalar in Fd to one. Otherwise set all bits in the corresponding scalar in Fd to zero.


Examples


10.12 Performance Mathematics: A Final Look at Sine

In Chapter 9, four versions of the sine function were given. Those implementations used scalar and VFP vector modes for single-precision and double-precision, and are already faster than the implementations provided by GCC. However, it may be possible to gain a little more performance by taking advantage of the NEON architecture. All versions of NEON are guaranteed to have a very large register set, and that fact can be used to attain better performance.

10.12.1 Single Precision

Listing 10.1 shows a single precision floating point implementation of the sine function, using the ARM NEON instruction set. It performs the same operations as the previous implementations of the sine function, but performs many of the calculations in parallel. This implementation is slightly faster than the previous version.

Listing 10.1 NEON implementation of the sin x function using single precision.

10.12.2 Double Precision

Listing 10.2 shows a double precision floating point implementation of the sine function. This code is intended to run on ARMv7 and earlier NEON/VFP systems with the full set of 32 double-precision registers. NEON systems prior to ARMv8 do not have NEON SIMD instructions for double precision operations. This implementation is faster than Listing 9.4 because it uses a large number of registers, does not contain a loop, and is written carefully so that multiple instructions can be at different stages in the pipeline at the same time. This technique of gaining performance is known as loop unrolling.

Listing 10.2 NEON implementation of the sin x function using double precision.

10.12.3 Performance Comparison

Table 10.4 compares the implementations from Listings 10.1 and 10.2 with the VFP vector implementations from Chapter 9 and the sine function provided by GCC. Notice that in every case, using vector mode VFP instructions is slower than the scalar VFP version. As mentioned previously, vector mode is deprecated on NEON processors. On NEON systems, vector mode is emulated in software. Although vector mode is supported, using it will result in reduced performance, because each vector instruction causes the operating system to take over and substitute a series of scalar floating point operations on-the-fly. A great deal of time was spent by the operating system software in emulating the VFP hardware vector mode.

Table 10.4

Performance of sine function with various implementations

Optimization | Implementation | CPU seconds
None | Single Precision VFP scalar Assembly | 1.74
     | Single Precision VFP vector Assembly | 27.09
     | Single Precision NEON Assembly | 1.32
     | Single Precision C | 4.36
     | Double Precision VFP scalar Assembly | 2.83
     | Double Precision VFP vector Assembly | 106.46
     | Double Precision NEON Assembly | 2.24
     | Double Precision C | 4.59
Full | Single Precision VFP scalar Assembly | 1.11
     | Single Precision VFP vector Assembly | 27.15
     | Single Precision NEON Assembly | 0.96
     | Single Precision C | 1.69
     | Double Precision VFP scalar Assembly | 2.56
     | Double Precision VFP vector Assembly | 107.53
     | Double Precision NEON Assembly | 2.05
     | Double Precision C | 4.27

When compiler optimization is not used, the single precision scalar VFP implementation achieves a speedup of about 2.51, and the NEON implementation achieves a speedup of about 3.30 compared to the GCC implementation. The double precision scalar VFP implementation achieves a speedup of about 1.62, and the loop-unrolled NEON implementation achieves a speedup of about 2.05 compared to the GCC implementation.

When the best possible compiler optimization is used (-Ofast), the single precision scalar VFP implementation achieves a speedup of about 1.52, and the NEON implementation achieves a speedup of about 1.76 compared to the GCC implementation. The double precision scalar VFP implementation achieves a speedup of about 1.67, and the loop-unrolled NEON implementation achieves a speedup of about 2.08 compared to the GCC implementation. The single precision NEON version was 1.16 times as fast as the VFP scalar version and the double precision NEON implementation was 1.25 times as fast as the VFP scalar implementation.

Although the VFP versions of the sine function ran without modification on the NEON processor, re-writing them for NEON resulted in significant performance improvement. Performance of the vectorized VFP code running on a NEON processor was abysmal. The take-away lesson is that a programmer can improve performance by writing some functions in assembly that are specifically targeted to run on a specific platform. However, assembly code which improves performance on one platform may actually result in very poor performance on a different platform. To achieve optimal or near-optimal performance, it is important for the programmer to be aware of exactly which hardware platform is being used.

10.13 Alphabetized List of NEON Instructions

Name     | Page | Operation
vaba     | 339  | Absolute Difference and Accumulate
vabal    | 339  | Absolute Difference and Accumulate Long
vabd     | 339  | Absolute Difference
vabdl    | 339  | Absolute Difference Long
vabs     | 340  | Absolute Value
vacge    | 324  | Absolute Compare Greater Than or Equal
vacgt    | 324  | Absolute Compare Greater Than
vacle    | 353  | Absolute Compare Less Than or Equal
vaclt    | 353  | Absolute Compare Less Than
vadd     | 335  | Add
vaddhn   | 336  | Add and Narrow
vaddl    | 335  | Add Long
vaddw    | 335  | Add Wide
vand     | 326  | Bitwise AND
vand     | 352  | Bitwise AND Immediate
vbic     | 326  | Bit Clear
vbic     | 327  | Bit Clear Immediate
vbif     | 328  | Bitwise Insert if False
vbit     | 328  | Bitwise Insert
vbsl     | 328  | Bitwise Select
vceq     | 323  | Compare Equal
vcge     | 323  | Compare Greater Than or Equal
vcgt     | 323  | Compare Greater Than
vcle     | 323  | Compare Less Than or Equal
vcls     | 342  | Count Leading Sign Bits
vclt     | 323  | Compare Less Than
vclz     | 342  | Count Leading Zero Bits
vcnt     | 342  | Count Set Bits
vcvt     | 322  | Convert Between Half and Single
vcvt     | 321  | Convert Data Format
vdup     | 312  | Duplicate Scalar
veor     | 326  | Bitwise Exclusive-OR
vext     | 313  | Extract Elements
vfma     | 346  | Fused Multiply Accumulate
vfms     | 346  | Fused Multiply Subtract
vfnma    | 346  | Fused Negate Multiply Accumulate
vfnms    | 346  | Fused Negate Multiply Subtract
vhadd    | 337  | Halving Add
vhsub    | 337  | Halving Subtract
vld<n>   | 305  | Load Copies of Structured Data
vld<n>   | 307  | Load Multiple Structured Data
vld<n>   | 303  | Load Structured Data
vldr     | 351  | Load Constant
vmax     | 341  | Maximum
vmin     | 341  | Minimum
vmla     | 343  | Multiply Accumulate
vmla     | 345  | Multiply Accumulate by Scalar
vmlal    | 344  | Multiply Accumulate Long
vmlal    | 345  | Multiply Accumulate Long by Scalar
vmls     | 343  | Multiply Subtract
vmls     | 345  | Multiply Subtract by Scalar
vmlsl    | 344  | Multiply Subtract Long
vmlsl    | 345  | Multiply Subtract Long by Scalar
vmov     | 310  | Move Immediate
vmov     | 309  | Move Between NEON and ARM
vmovl    | 311  | Move and Lengthen
vmovn    | 311  | Move and Narrow
vmul     | 343  | Multiply
vmul     | 345  | Multiply by Scalar
vmull    | 343  | Multiply Long
vmull    | 345  | Multiply Long by Scalar
vmvn     | 310  | Move Immediate Negative
vneg     | 340  | Negate
vorn     | 326  | Bitwise Complement and OR
vorn     | 352  | Bitwise Complement and OR Immediate
vorr     | 326  | Bitwise OR
vorr     | 327  | Bitwise OR Immediate
vpadal   | 338  | Add Pairwise and Accumulate Long
vpadd    | 338  | Add Pairwise
vpaddl   | 338  | Add Pairwise Long
vpmax    | 341  | Pairwise Maximum
vpmin    | 341  | Pairwise Minimum
vqabs    | 340  | Saturating Absolute Value
vqadd    | 335  | Saturating Add
vqdmlal  | 347  | Saturating Multiply Double Accumulate (Low)
vqdmlsl  | 347  | Saturating Multiply Double Subtract (Low)
vqdmulh  | 348  | Saturating Multiply Double (High)
vqdmull  | 347  | Saturating Multiply Double (Low)
vqmovn   | 311  | Saturating Move and Narrow
vqmovun  | 311  | Saturating Move and Narrow Unsigned
vqneg    | 340  | Saturating Negate
vqrdmulh | 348  | Saturating Multiply Double (High) and Round
vqrshl   | 330  | Saturating Shift Left or Right by Variable and Round
vqrshrn  | 332  | Saturating Shift Right Immediate Round
vqrshrun | 333  | Saturating Shift Right Immediate Round Unsigned
vqshl    | 329  | Saturating Shift Left Immediate
vqshl    | 330  | Saturating Shift Left or Right by Variable
vqshlu   | 329  | Saturating Shift Left Immediate Unsigned
vqshrn   | 332  | Saturating Shift Right Immediate
vqshrun  | 333  | Saturating Shift Right Immediate Unsigned
vqsub    | 335  | Saturating Subtract
vraddhn  | 336  | Add, Round, and Narrow
vrecpe   | 348  | Reciprocal Estimate
vrecps   | 349  | Reciprocal Step
vrev     | 314  | Reverse Elements
vrhadd   | 337  | Halving Add and Round
vrshl    | 330  | Shift Left or Right by Variable and Round
vrshr    | 331  | Shift Right Immediate and Round
vrshrn   | 331  | Shift Right Immediate Round and Narrow
vrsqrte  | 348  | Reciprocal Square Root Estimate
vrsqrts  | 349  | Reciprocal Square Root Step
vrsra    | 331  | Shift Right Round and Accumulate Immediate
vrsubhn  | 336  | Subtract, Round, and Narrow
vshl     | 329  | Shift Left Immediate
vshl     | 330  | Shift Left or Right by Variable
vshll    | 329  | Shift Left Immediate Long
vshr     | 331  | Shift Right Immediate
vshrn    | 331  | Shift Right Immediate and Narrow
vsli     | 334  | Shift Left and Insert
vsra     | 331  | Shift Right and Accumulate Immediate
vsri     | 334  | Shift Right and Insert
vst<n>   | 307  | Store Multiple Structured Data
vst<n>   | 303  | Store Structured Data
vsub     | 335  | Subtract
vsubhn   | 336  | Subtract and Narrow
vsubl    | 335  | Subtract Long
vsubw    | 335  | Subtract Wide
vswp     | 315  | Swap Vectors
vtbl     | 318  | Table Lookup
vtbx     | 318  | Table Lookup with Extend
vtrn     | 316  | Transpose Matrix
vtst     | 325  | Test Bits
vuzp     | 319  | Unzip Vectors
vzip     | 319  | Zip Vectors


10.14 Chapter Summary

NEON can dramatically improve performance of algorithms that can take advantage of data parallelism. However, compiler support for automatically vectorizing and using NEON instructions is still immature. NEON intrinsics allow C and C++ programmers to access NEON instructions, by making them look like C functions. It is usually just as easy and more concise to write NEON assembly code as it is to use the intrinsics functions. A careful assembly language programmer can usually beat the compiler, sometimes by a wide margin. The greatest gains usually come from converting an algorithm to avoid floating point, and taking advantage of data parallelism.

Exercises

10.1 What is the advantage of using IEEE half-precision? What is the disadvantage?

10.2 NEON achieved relatively modest performance gains on the sine function, when compared to VFP.

(a) Why?

(b) List some tasks for which NEON could significantly outperform VFP.

10.3 There are some limitations on the size of the structure that can be loaded or stored using the vld<n> and vst<n> instructions. What are the limitations?

10.4 The sine function in Listing 10.2 uses a technique known as “loop unrolling” to achieve higher performance. Name at least three reasons why this code is more efficient than using a loop.

10.5 Reimplement the fixed-point sine function from Listing 8.7 using NEON instructions. Hint: you should not need to use a loop. Compare the performance of your NEON implementation with the performance of the original implementation.

10.6 Reimplement Exercise 9.10 using NEON instructions.

10.7 Fixed point operations may be faster than floating point operations. Modify your code from the previous example so that it uses the following definitions for points and transformation matrices:

[Figure 10.61: fixed-point definitions for points and transformation matrices]

Use saturating instructions and/or any other techniques necessary to prevent overflow. Compare the performance of the two implementations.

Part III

Accessing Devices

Chapter 11

Devices

Abstract

This chapter starts with a high-level explanation of how devices may be accessed in a modern computer system, and then explains that most devices on modern architectures are memory-mapped. Next, it explains how memory mapped devices can be accessed by user processes under Linux, by making use of the mmap system call. Code examples are given, showing how several devices can be mapped into the memory of a user-level program on the Raspberry Pi and pcDuino. Next the General Purpose I/O devices on both systems are explained, providing the reader with the opportunity to do a comparison between two different devices which perform almost precisely the same functions.

Keywords

Device; Memory map; General purpose I/O (GPIO); I/O Pin; Header; Pull-up and pull-down resistor; LED; Switch

As mentioned in Chapter 1, a computer system consists of three main parts: the CPU, memory, and devices. The typical computing system has many devices of various types for performing specific functions. Some devices, such as data caches, are closely coupled to the CPU, and are typically controlled by executing special CPU instructions that can only be accessed in assembly language. However, most of the devices on a typical system are accessed and controlled through the system data bus. These devices appear to the programmer to be ordinary memory locations. The hardware in the system bus decodes the addresses coming from the CPU, and some addresses correspond to devices rather than memory. Fig. 11.1 shows the memory layout for a typical system. The exact locations of the devices and memory are chosen by the system hardware designers. From the programmer’s standpoint, writing data to certain memory addresses results in the data being transferred to a device rather than stored in memory. The programmer must read documentation on the hardware design to determine exactly where the devices are in memory.

Figure 11.1 Typical hardware address mapping for memory and devices.

11.1 Accessing Devices Directly Under Linux

There are devices that allow data to be read from or written to external sources, devices that can measure time, devices for moving data from one location in memory to another, devices for modifying the addresses of memory regions, and devices for even more esoteric purposes. Some devices are capable of sending signals to the CPU to indicate that they need attention, while others simply wait for the CPU to check on their status.

A modern computer system, such as the Raspberry Pi, has dozens or even hundreds of devices. Programmers write device driver software for each device. A device driver provides a few standard function calls for each device, so that it can be used easily. The specific set of functions depends on the type of device and the design of the operating system. Operating system designers strive to define a small set of device types, and to define a standard software interface for each type in order to make devices interchangeable.

Devices are typically controlled by writing specific values to the device’s internal device registers. For the ARM processor, access to most device registers is accomplished using the load and store instructions. Each device is assigned a base address in memory. This address corresponds with the first register inside the device. The device may also have other registers that are accessible at some pre-defined offset address from the base address. Some registers are read-only, some are write-only, and some are read-write. To use the device, the programmer must read from, and write appropriate data to, the correct device registers. For every device, there is a programmer’s model and documentation explaining what each register in the device does. Some devices are well designed, easy to use, and well documented. Some devices are not, and the programmer must work harder to write software to use them.

Linux is a powerful, multiuser, multitasking operating system. The Linux kernel manages all of the devices and protects them from direct access by user programs. User programs are intended to access devices by making system calls. The kernel accesses the devices on behalf of the user programs, ensuring that an errant user program cannot misuse the devices and other resources on the system. Attempting to directly access the registers in any device will result in an exception. The kernel will take over and kill the offending process.

However, our programs will need direct access to the device registers. Linux allows user programs to gain direct access through the mmap() system call. Listing 11.1 shows how four devices can be mapped into the memory space of a user program on a Raspberry Pi. In most cases, the user program will need administrator privileges in order to perform the mapping. The operating system does not usually give permission for ordinary users to access devices directly. However, Linux does provide the ability to change permissions on /dev/mem, or for user programs to run with elevated privileges.

Listing 11.1 Function to map devices into the user program memory on a Raspberry Pi.

Listing 11.2 shows how four devices can be mapped into the memory space of a user program on a pcDuino. The devices are equivalent to the devices mapped in Listing 11.1. Some of the devices are described in the following sections of this chapter. The pcDuino devices and Raspberry Pi devices operate differently, but provide similar functionality. Note that most of the code is the same for both listings. The only real differences between Listings 11.1 and 11.2 are the names of the devices and their hardware addresses.

Listing 11.2 Function to map devices into the user program memory space on a pcDuino.

11.2 General Purpose Digital Input/Output

One type of device, commonly found on embedded systems, is the General Purpose I/O (GPIO) device. Although there are many variations on this device provided by different manufacturers, they all provide similar capabilities. The device provides a set of input and/or output bits, which allow signals to be transferred to or from the outside world. Each bit of input or output in a GPIO device is generally referred to as a pin, and a group of pins is referred to as a GPIO port. Ports commonly support 8 bits of input or output, but some devices have 16 or 32 bit ports. Some GPIO devices support multiple ports, and some systems have multiple GPIO devices in them.

A system with a GPIO device usually has some type of connector or wires that allow external inputs or outputs to be connected to the system. For example, the IBM PC has a type of GPIO device that was originally intended for communications with a parallel printer. On that platform, the GPIO device is commonly referred to as the parallel printer port.

Some GPIO devices, such as the one on the IBM PC, are arranged as sets of pins that can be switched as a group to either input or output. In many modern GPIO devices, each pin can be individually configured to accept or source different input and output voltages. On some devices, the amount of drive current available can be configured. Some include the ability to configure built-in pull-up and/or pull-down resistors. On most older GPIO devices, the input and output voltages are limited to the supply voltage of the GPIO device, and the device may be damaged by greater voltages. Newer GPIO devices can generally tolerate 5 V on inputs, regardless of the supply voltage of the device.

GPIO devices are very common in systems that are intended to be used for embedded applications. For most GPIO devices:

 individual pins or groups of pins can be configured,

 pins can be configured to be input or output,

 pins can be disabled so that they are neither input nor output,

 input values can be read by the CPU (typically high=1, low=0),

 output values can be read or written by the CPU, and

 input pins can be configured to generate interrupt requests.

Some GPIO devices may also have more advanced features, such as the ability to use Direct Memory Access (DMA) to send data without requiring the CPU to move each byte or word.

Fig. 11.2 shows two common ways to use GPIO pins. Fig. 11.2A shows a GPIO pin that has been configured for input and connected to a push-button switch. When the switch is open, the pull-up resistor pulls the voltage on the pin to a high state. When the switch is closed, the pin is pulled to a low state and some current flows through the pull-up resistor to ground. Typically, the pull-up resistor would be around 10 kΩ. The specific value is not critical, but it must be high enough to limit the current to a small amount when the switch is closed. Fig. 11.2B shows a GPIO pin that is configured as an output and is being used to drive an LED. When a 1 is output on the pin, the pin is at the same voltage as Vcc (the power supply voltage), no current flows, and the LED is off. When a 0 is output on the pin, current is drawn through the resistor and the LED, and through the pin to ground, illuminating the LED. Selection of the resistor is not critical, but it must be small enough to light the LED while limiting the current to a level that will not destroy either the LED or the GPIO circuitry. A value of around 1 kΩ is typical. Note that, in general, GPIO pins can sink more current than they can source, so it is most common to connect LEDs and other devices in the way shown.

Figure 11.2 GPIO pins being used for input and output. (A) GPIO pin being used as input to read the state of a push-button switch. (B) GPIO pin being used as output to drive an LED.

11.2.1 Raspberry Pi GPIO

The Broadcom BCM2835 system-on-chip contains 54 GPIO pins, which are split into two banks. The GPIO pins are named using the format GPIOx, where x is a number between 0 and 53. The GPIO pins are highly configurable. Each pin can be used for general purpose I/O, or can be configured to serve one of up to six pre-defined alternate functions. Configuring a GPIO pin for an alternate function usually allows some other device within the BCM2835 to use the pin. For example, GPIO4 can be used

 for general purpose I/O,

 to send the signal generated by General Purpose Clock 0 to external devices,

 to send bit one of the Secondary Address Bus to external devices, or

 to receive JTAG data for programming the firmware of the device.

The last eight GPIO pins, GPIO46–GPIO53, have no alternate functions and are used only for GPIO.

In addition to the alternate functions, all GPIO pins can be configured individually as input or output. When configured as input, a pin can also be configured to detect when the signal changes and to send an interrupt to the ARM CPU. Each input pin also has internal pull-up and pull-down resistors, which can be enabled or disabled by the programmer.

The GPIO pins on the BCM2835 SOC are flexible and quite complex, but they are well designed and not difficult to program once the programmer understands how the pins operate and what the various registers do. There are 41 registers that control the GPIO pins. The base address for the GPIO device is 20200000₁₆. The registers and their offsets from the base address are shown in Table 11.1.

Table 11.1

Raspberry Pi GPIO register map

Offset  Name       Description                             Size  R/W
00₁₆    GPFSEL0    GPIO Function Select 0                  32    R/W
04₁₆    GPFSEL1    GPIO Function Select 1                  32    R/W
08₁₆    GPFSEL2    GPIO Function Select 2                  32    R/W
0C₁₆    GPFSEL3    GPIO Function Select 3                  32    R/W
10₁₆    GPFSEL4    GPIO Function Select 4                  32    R/W
14₁₆    GPFSEL5    GPIO Function Select 5                  32    R/W
1C₁₆    GPSET0     GPIO Pin Output Set 0                   32    W
20₁₆    GPSET1     GPIO Pin Output Set 1                   32    W
28₁₆    GPCLR0     GPIO Pin Output Clear 0                 32    W
2C₁₆    GPCLR1     GPIO Pin Output Clear 1                 32    W
34₁₆    GPLEV0     GPIO Pin Level 0                        32    R
38₁₆    GPLEV1     GPIO Pin Level 1                        32    R
40₁₆    GPEDS0     GPIO Pin Event Detect Status 0          32    R/W
44₁₆    GPEDS1     GPIO Pin Event Detect Status 1          32    R/W
4C₁₆    GPREN0     GPIO Pin Rising Edge Detect Enable 0    32    R/W
50₁₆    GPREN1     GPIO Pin Rising Edge Detect Enable 1    32    R/W
58₁₆    GPFEN0     GPIO Pin Falling Edge Detect Enable 0   32    R/W
5C₁₆    GPFEN1     GPIO Pin Falling Edge Detect Enable 1   32    R/W
64₁₆    GPHEN0     GPIO Pin High Detect Enable 0           32    R/W
68₁₆    GPHEN1     GPIO Pin High Detect Enable 1           32    R/W
70₁₆    GPLEN0     GPIO Pin Low Detect Enable 0            32    R/W
74₁₆    GPLEN1     GPIO Pin Low Detect Enable 1            32    R/W
7C₁₆    GPAREN0    GPIO Pin Async. Rising Edge Detect 0    32    R/W
80₁₆    GPAREN1    GPIO Pin Async. Rising Edge Detect 1    32    R/W
88₁₆    GPAFEN0    GPIO Pin Async. Falling Edge Detect 0   32    R/W
8C₁₆    GPAFEN1    GPIO Pin Async. Falling Edge Detect 1   32    R/W
94₁₆    GPPUD      GPIO Pin Pull-up/down Enable            32    R/W
98₁₆    GPPUDCLK0  GPIO Pin Pull-up/down Enable Clock 0    32    R/W
9C₁₆    GPPUDCLK1  GPIO Pin Pull-up/down Enable Clock 1    32    R/W


Setting the GPIO pin function

The first six 32-bit registers in the device are used to select the function for each of the 54 GPIO pins. The function of each pin is controlled by a group of three bits in one of these registers. The mapping is very regular. Bits 0–2 of GPFSEL0 control the function of GPIO pin 0, bits 3–5 of GPFSEL0 control the function of GPIO pin 1, and so on, up to bits 27–29 of GPFSEL0, which control the function of GPIO pin 9. The next pin, pin 10, is controlled by bits 0–2 of GPFSEL1. The pins are assigned in sequence through the remaining bits, until bits 27–29 of GPFSEL1, which control GPIO pin 19. The remaining four GPFSEL registers control the remaining GPIO pins. Note that bits 30 and 31 of all of the GPFSEL registers are unused, and most of the bits in GPFSEL5 are not assigned to any pin. The meaning of each combination of the three bits is shown in Table 11.2. Note that the encoding is not as simple as one might expect.

Table 11.2

GPIO pin function select bits

Bits (MSB–LSB)  Function
000             Pin is an input
001             Pin is an output
100             Pin performs alternate function 0
101             Pin performs alternate function 1
110             Pin performs alternate function 2
111             Pin performs alternate function 3
011             Pin performs alternate function 4
010             Pin performs alternate function 5

The procedure for setting the function of a GPIO pin is as follows:

 Determine which GPIOFSEL register controls the desired pin.

 Determine which bits of the GPIOFSEL register are used.

 Determine what the bit pattern should be.

 Read the GPIOFSEL register.

 Clear the correct bits using the bic instruction.

 Set them to the correct pattern using the orr instruction.

For example, Listing 11.3 shows the sequence of code which would be used to set GPIO pin 26 to alternate function 1.

Listing 11.3 ARM assembly code to set GPIO pin 26 to alternate function 1.

Setting GPIO output pins

To use a GPIO pin for output, the function select bits for that pin must be set to 001. Once that is done, the output can be driven high or low by using the GPSET and GPCLR registers. GPIO pin 0 is set to a high output by writing a 1 to bit 0 of GPSET0, and it is set to low output by writing a 1 to bit 0 of GPCLR0. GPIO pin 1 is similarly controlled by bit 1 in GPSET0 and GPCLR0. Each of the GPIO pins numbered 0 through 31 is assigned one bit in GPSET0 and one bit in GPCLR0. GPIO pin 32 is assigned to bit 0 of GPSET1 and GPCLR1, GPIO pin 33 is assigned to bit 1 of GPSET1 and GPCLR1, and so on. Since there are only 54 GPIO pins, bits 22–31 of GPSET1 and GPCLR1 are not used. The programmer can set or clear several outputs simultaneously by writing the appropriate bits in the GPSET and GPCLR registers.

Reading GPIO input pins

To use a GPIO pin for input, the function select bits for that pin must be set to 000. Once that is done, the input can be read at any time by reading the appropriate GPLEV register and examining the bit that corresponds with the input pin. GPIO pin 0 is read as bit 0 of GPLEV0, and GPIO pin 1 is similarly read as bit 1 of GPLEV0. Each of the GPIO pins numbered 0 through 31 is assigned one bit in GPLEV0. GPIO pin 32 is assigned to bit 0 of GPLEV1, GPIO pin 33 is assigned to bit 1 of GPLEV1, and so on. Since there are only 54 GPIO pins, bits 22–31 of GPLEV1 are not used. The programmer can read the status of several inputs simultaneously by reading one of the GPLEV registers and examining the bits corresponding to the appropriate pins.

Enabling internal pull-up or pull-down

Input pins can be configured with internal pull-up or pull-down resistors. This can simplify the design of the system. For instance, Fig. 11.2A shows a push-button switch connected to an input, with an external pull-up resistor. That resistor is unnecessary if the internal pull-up for that pin is enabled.

Enabling the pull-up or pull-down is a two-step process. The first step is to configure the type of change to be made, and the second step is to perform that change on the selected pin(s). The first step is accomplished by writing to the GPPUD register. The valid binary control codes are shown in Table 11.3.

Table 11.3

GPPUD control codes

Code  Function
00    Disable pull-up and pull-down
01    Enable pull-down
10    Enable pull-up

Once the GPPUD register is configured, the selected operation can be performed on multiple pins by writing to one or both of the GPPUDCLK registers. GPIO pins are assigned to bits in these two registers in the same way as the pins are assigned in the GPLEV, GPSET, and GPCLR registers. Writing 1 to bit 0 of GPPUDCLK0 will configure the pull-up or pull-down for GPIO pin 0, according to the control code that is currently in the GPPUD register.

Detecting GPIO events

The GPEDS registers are used for detecting events that have occurred on the GPIO pins. For instance, a pin may have transitioned from low to high and back to low. If the CPU does not read the GPLEV register often enough, such an event could be missed. The GPEDS registers can be configured to capture such events so that the CPU can detect that they occurred.

GPIO pins are assigned to bits in these two registers in the same way as the pins are assigned in the GPLEV, GPSET, and GPCLR registers. If bit 0 of GPEDS0 is set, then an event has occurred on GPIO pin 0. Writing a 1 to that bit will clear it and allow the event detector to detect another event. Each pin can be configured to detect specific types of events by writing to the GPREN, GPHEN, GPLEN, GPAREN, and GPAFEN registers. For more information, refer to the BCM2835 ARM Peripherals manual.

GPIO pins available on the Raspberry Pi

The Raspberry Pi provides access to several of the 54 GPIO pins through the expansion header. The expansion header is a group of physical pins located in the corner of the Raspberry Pi board. Fig. 11.3 shows where the header is located on the Raspberry Pi. Wires can be connected to these pins and then the GPIO device can be programmed to send and/or receive digital information. Fig. 11.4 shows which signals are attached to the various pins. Some of the pins are used to provide power and ground to the external devices.

Figure 11.3 The Raspberry Pi expansion header location.
Figure 11.4 The Raspberry Pi expansion header pin assignments.

Table 11.4 shows some useful alternate functions available on the pins of the Raspberry Pi expansion header. Many of the alternate functions available on these pins are of little practical use and have been omitted from the table. The most useful alternate functions are probably those on GPIO 14 and 15, which can be used for serial communication, and on GPIO 18, which can be used for pulse width modulation. Pulse width modulation is covered in Section 12.2, and serial communication is covered in Section 13.2. The Serial Peripheral Interface (SPI) functions could also be useful for connecting the Raspberry Pi to other devices that support SPI. Also, the SDA and SCL functions could be used to communicate with I2C devices.

Table 11.4

Raspberry Pi expansion header useful alternate functions

Pin      Alt. Function 0  Alt. Function 5
GPIO 2   SDA1             —
GPIO 3   SCL1             —
GPIO 4   GPCLK0           —
GPIO 7   SPI0_CE1_N       —
GPIO 8   SPI0_CE0_N       —
GPIO 9   SPI0_MISO        —
GPIO 10  SPI0_MOSI        —
GPIO 11  SPI0_SCLK        —
GPIO 14  TXD0             TXD1
GPIO 15  RXD0             RXD1
GPIO 18  PCM_CLK          PWM0


11.2.2 pcDuino GPIO

The AllWinner A10/A20 system-on-chip contains 175 GPIO pins, which are arranged in nine ports. Each port is identified by a letter between “A” and “I.” The ports are part of the PIO device, which is mapped at address 01C20800₁₆. The GPIO pins are named using the format PNx, where N is a letter between “A” and “I” indicating the port, and x is a number indicating a pin on the given port. The assignment of pins to ports is somewhat irregular, as shown in Table 11.5. Some ports have as many as 28 physical pins, while others have as few as six. However, the layout of the registers in the device is very regular. Given any port and pin combination, finding the correct registers and sets of bits within the registers is very straightforward.

Table 11.5

Number of pins available on each of the AllWinner A10/A20 PIO ports

Port  Pins
A     18
B     24
C     25
D     28
E     12
F     6
G     12
H     28
I     22

Each of the nine ports is controlled by a set of nine registers, for a total of 81 registers. There are seven additional registers that can be used to configure pins as interrupt sources. Interrupt processing is explained in Section 14.2. All of the port and interrupt registers together make a total of 88 registers for the GPIO device. The complete register map, with the offset of each register from the device base address, is shown in Table 11.6.

Table 11.6

Registers in the AllWinner GPIO device

Offset  Name            Description
000₁₆   PA_CFG0         Function select for Port A, Pins 0–7
004₁₆   PA_CFG1         Function select for Port A, Pins 8–15
008₁₆   PA_CFG2         Function select for Port A, Pins 16–17
00C₁₆   PA_CFG3         Not used
010₁₆   PA_DAT          Port A Data Register
014₁₆   PA_DRV0         Port A Multi-driving, Pins 0–15
018₁₆   PA_DRV1         Port A Multi-driving, Pins 16–17
01C₁₆   PA_PULL0        Port A Pull-Up/-Down, Pins 0–15
020₁₆   PA_PULL1        Port A Pull-Up/-Down, Pins 16–17
024₁₆   PB_CFG0         Function select for Port B, Pins 0–7
028₁₆   PB_CFG1         Function select for Port B, Pins 8–15
02C₁₆   PB_CFG2         Function select for Port B, Pins 16–23
030₁₆   PB_CFG3         Not used
034₁₆   PB_DAT          Port B Data Register
038₁₆   PB_DRV0         Port B Multi-driving, Pins 0–15
03C₁₆   PB_DRV1         Port B Multi-driving, Pins 16–23
040₁₆   PB_PULL0        Port B Pull-Up/-Down, Pins 0–15
044₁₆   PB_PULL1        Port B Pull-Up/-Down, Pins 16–23
048₁₆   PC_CFG0         Function select for Port C, Pins 0–7
04C₁₆   PC_CFG1         Function select for Port C, Pins 8–15
050₁₆   PC_CFG2         Function select for Port C, Pins 16–23
054₁₆   PC_CFG3         Function select for Port C, Pin 24
058₁₆   PC_DAT          Port C Data Register
05C₁₆   PC_DRV0         Port C Multi-driving, Pins 0–15
060₁₆   PC_DRV1         Port C Multi-driving, Pins 16–23
064₁₆   PC_PULL0        Port C Pull-Up/-Down, Pins 0–15
068₁₆   PC_PULL1        Port C Pull-Up/-Down, Pins 16–23
06C₁₆   PD_CFG0         Function select for Port D, Pins 0–7
070₁₆   PD_CFG1         Function select for Port D, Pins 8–15
074₁₆   PD_CFG2         Function select for Port D, Pins 16–23
078₁₆   PD_CFG3         Function select for Port D, Pins 24–27
07C₁₆   PD_DAT          Port D Data Register
080₁₆   PD_DRV0         Port D Multi-driving, Pins 0–15
084₁₆   PD_DRV1         Port D Multi-driving, Pins 16–27
088₁₆   PD_PULL0        Port D Pull-Up/-Down, Pins 0–15
08C₁₆   PD_PULL1        Port D Pull-Up/-Down, Pins 16–27
090₁₆   PE_CFG0         Function select for Port E, Pins 0–7
094₁₆   PE_CFG1         Function select for Port E, Pins 8–11
098₁₆   PE_CFG2         Not used
09C₁₆   PE_CFG3         Not used
0A0₁₆   PE_DAT          Port E Data Register
0A4₁₆   PE_DRV0         Port E Multi-driving, Pins 0–11
0A8₁₆   PE_DRV1         Not used
0AC₁₆   PE_PULL0        Port E Pull-Up/-Down, Pins 0–11
0B0₁₆   PE_PULL1        Not used
0B4₁₆   PF_CFG0         Function select for Port F, Pins 0–5
0B8₁₆   PF_CFG1         Not used
0BC₁₆   PF_CFG2         Not used
0C0₁₆   PF_CFG3         Not used
0C4₁₆   PF_DAT          Port F Data Register
0C8₁₆   PF_DRV0         Port F Multi-driving, Pins 0–5
0CC₁₆   PF_DRV1         Not used
0D0₁₆   PF_PULL0        Port F Pull-Up/-Down, Pins 0–5
0D4₁₆   PF_PULL1        Not used
0D8₁₆   PG_CFG0         Function select for Port G, Pins 0–7
0DC₁₆   PG_CFG1         Function select for Port G, Pins 8–11
0E0₁₆   PG_CFG2         Not used
0E4₁₆   PG_CFG3         Not used
0E8₁₆   PG_DAT          Port G Data Register
0EC₁₆   PG_DRV0         Port G Multi-driving, Pins 0–11
0F0₁₆   PG_DRV1         Not used
0F4₁₆   PG_PULL0        Port G Pull-Up/-Down, Pins 0–11
0F8₁₆   PG_PULL1        Not used
0FC₁₆   PH_CFG0         Function select for Port H, Pins 0–7
100₁₆   PH_CFG1         Function select for Port H, Pins 8–15
104₁₆   PH_CFG2         Function select for Port H, Pins 16–23
108₁₆   PH_CFG3         Function select for Port H, Pins 24–27
10C₁₆   PH_DAT          Port H Data Register
110₁₆   PH_DRV0         Port H Multi-driving, Pins 0–15
114₁₆   PH_DRV1         Port H Multi-driving, Pins 16–27
118₁₆   PH_PULL0        Port H Pull-Up/-Down, Pins 0–15
11C₁₆   PH_PULL1        Port H Pull-Up/-Down, Pins 16–27
120₁₆   PI_CFG0         Function select for Port I, Pins 0–7
124₁₆   PI_CFG1         Function select for Port I, Pins 8–15
128₁₆   PI_CFG2         Function select for Port I, Pins 16–21
12C₁₆   PI_CFG3         Not used
130₁₆   PI_DAT          Port I Data Register
134₁₆   PI_DRV0         Port I Multi-driving, Pins 0–15
138₁₆   PI_DRV1         Port I Multi-driving, Pins 16–21
13C₁₆   PI_PULL0        Port I Pull-Up/-Down, Pins 0–15
140₁₆   PI_PULL1        Port I Pull-Up/-Down, Pins 16–21
200₁₆   PIO_INT_CFG0    PIO Interrupt Configure Register 0
204₁₆   PIO_INT_CFG1    PIO Interrupt Configure Register 1
208₁₆   PIO_INT_CFG2    PIO Interrupt Configure Register 2
20C₁₆   PIO_INT_CFG3    PIO Interrupt Configure Register 3
210₁₆   PIO_INT_CTL     PIO Interrupt Control Register
214₁₆   PIO_INT_STATUS  PIO Interrupt Status Register
218₁₆   PIO_INT_DEB     PIO Interrupt Debounce Register

The GPIO pins are highly configurable. Each pin can be used for general purpose I/O, or can be configured to serve one of up to six pre-defined alternate functions. Configuring a GPIO pin for an alternate function usually allows some other device within the A10/A20 SOC to use the pin. For example, PB2 (pin 2 of port B) can be used for general purpose I/O, or can be used to output the signal from a Pulse Width Modulator (PWM) device (explained in Section 12.2). Each input pin also has internal pull-up and pull-down resistors, which can be enabled or disabled by the programmer.

Setting the GPIO pin function

The first four registers for each port are used to configure the functions for each of the pins. The function of each pin is controlled by three bits in one of the four configuration registers. Pins 0–7 are controlled using configuration register 0. Pins 8–15 are controlled by configuration register 1, and so on. The assignment of pins to control bits is shown in Fig. 11.5. Note that eight pins are controlled by each register, and there is an unused bit between each group of three bits.

Figure 11.5 Bit-to-pin assignments for PIO control registers.

Each GPIO pin can be configured by writing a 3-bit code to the appropriate location in the correct port configuration register. The meaning of each possible code is shown in Table 11.7. For example, to configure port A, pin 10 (PA10) for output, the 3-bit code 001 must be written to bits 8–10 of the PA_CFG1 register, without changing any other bit in the register. Listing 11.4 shows how this operation can be accomplished.

Table 11.7

Allwinner A10/A20 GPIO pin function select bits

Bits (MSB–LSB)  Function
000             Pin is an input
001             Pin is an output
010             Pin performs alternate function 0
011             Pin performs alternate function 1
100             Pin performs alternate function 2
101             Pin performs alternate function 3
110             Pin performs alternate function 4
111             Pin performs alternate function 5
Listing 11.4 ARM assembly code to configure PA10 for output.

Reading and setting GPIO pins

An output pin can be set to a high state by setting the corresponding bit in the correct port data register. Likewise, the pin can be set to a low state by clearing its corresponding bit. Care must be taken to avoid changing any other bits in the port data register. Listing 11.5 shows how this operation can be accomplished for setting a pin to output a high state. To set the pin's output to a low state, the orr instruction would be replaced with a bic instruction.

Listing 11.5 ARM assembly code to set PA10 to output a high state.

To determine the current state of an output pin or read an input pin, the programmer can read the contents of the correct port data register and use bitwise logical operations to isolate the appropriate bit. For example, to read the state of pin 14 of port I (PI14), the programmer would read the PI_DAT register and mask all bits except bit 14. Listing 11.6 shows how this operation can be accomplished. Another method would be to use the tst instruction, rather than the ands instruction, to set the CPSR flags.

Listing 11.6 ARM assembly code to read the state of PI14.

Enabling internal pull-up or pull-down

Input pins can be configured with internal pull-up or pull-down resistors. This can simplify the design of the system. For instance, Fig. 11.2A shows a push-button switch connected to an input with an external pull-up resistor. That resistor is unnecessary if the internal pull-up for that pin is enabled. Each pin is assigned two bits in one of the port pull-up/-down registers. The pull-up and pull-down resistors for pin 0 on port B are controlled using bits 0 and 1 of the PB_PULL0 register. Likewise, the pull-up and pull-down resistors for pin 19 of port C are controlled using bits 6 and 7 of the PC_PULL1 register. Table 11.8 shows the bit patterns used to configure the pull-up and pull-down resistors for a pin.

Table 11.8

Pull-up and pull-down resistor control codes

Code  Function
00    Disable pull-up and pull-down
01    Enable pull-up
10    Enable pull-down
11    Reserved

Detecting GPIO events

When configured as an input, most of the pins on the pcDuino can be configured to generate an interrupt, which notifies the CPU that an event has occurred. Configuration of interrupts is beyond the scope of this chapter. It is accomplished using the PIO_INT registers.

GPIO pins available on the pcDuino

The pcDuino provides access to several of the 175 GPIO pins through the expansion headers. Fig. 11.6 shows where the headers are located on the pcDuino. Wires can be plugged into the holes in these headers and then the GPIO device can be programmed to send and/or receive digital and/or analog signals. The physical layout of the pcDuino header makes it compatible with a wide range of expansion modules designed for the Arduino family of microcontroller boards.

Figure 11.6 The pcDuino header locations.

Some of the header holes can provide power and ground to the external devices. Analog signals can be read into the pcDuino using the ADC header connections. Fig. 11.7 shows the pcDuino names for the signals that are available on the headers. Table 11.9 shows how the pcDuino header signal names are mapped to the actual port pins on the AllWinner A10/A20 chip. It also shows the most useful alternate functions available on each of the pins. Many alternate functions have been left out of the table because they are of little practical use. Note that the pcDuino and the Raspberry Pi both provide pins to perform PWM, UART communications, and SPI.

Figure 11.7 The pcDuino header pin assignments.

Table 11.9

pcDuino GPIO pins and function select code assignments.

pcDuino Pin Name   Port  Pin  Alternate Functions
UART-Rx (GPIO0)    I     19   UART2_RX, EINT31
UART-Tx (GPIO1)    I     18   UART2_TX, EINT30
GPIO3 (GPIO2)      H     7    UART5_RX, EINT7
PWM0 (GPIO3)       H     6    UART5_TX, EINT6
GPIO4              H     8    EINT8
PWM1 (GPIO5)       B     2    PWM0
PWM2 (GPIO6)       I     3    PWM1
GPIO7              H     9    EINT9
GPIO8              H     10   EINT10
PWM3 (GPIO9)       H     5    EINT5
SPI_CS (GPIO10)    I     10   SPI0_CS0, UART5_TX, EINT22
SPI_MOSI (GPIO11)  I     12   SPI0_MOSI, UART6_TX, CLK_OUT_A, EINT24
SPI_MISO (GPIO12)  I     13   SPI0_MISO, UART6_RX, CLK_OUT_B, EINT25
SPI_CLK (GPIO13)   I     11   SPI0_CLK, UART5_RX, EINT23


11.3 Chapter Summary

All input and output are accomplished by using devices. There are many types of devices, and each device has its own set of registers which are used to control it. The programmer must understand the operation of the device and the use of each register in order to use the device at a low level. Computer system manufacturers can usually provide documentation containing the necessary information for low-level programming. The quality of the documentation varies greatly, and a general understanding of various types of devices can help in deciphering poor or incomplete documentation.

There are two major situations in which programming devices at the register level is required: writing operating system drivers and programming very small embedded systems. Operating systems provide an abstract view of each device, which allows programmers to use devices more easily. However, someone must write each driver, and that person must have intimate knowledge of the device. On very small systems, there may be no driver available. In that case, the device must be accessed directly. Even when an operating system provides a driver, it is sometimes necessary or desirable for the programmer to access the device directly. For example, some devices may provide modes of operation or capabilities that are not supported by the operating system driver. Linux provides a mechanism which allows the programmer to map a physical device into the program’s memory space, thereby gaining access to the raw device registers.

Exercises

11.1 Explain the relationships and differences between device registers, memory locations, and CPU registers.

11.2 Why is it necessary to map the device into user program memory before accessing it under Linux? Would this step be necessary under all operating systems or in the case where there is no operating system and our code is running on the “bare metal?”

11.3 What is the purpose of a GPIO device?

11.4 The Raspberry Pi and the pcDuino have very different GPIO devices.

(a) Are they functionally equivalent?

(b) Are they equally programmer-friendly?

(c) If you have answered no to either of the previous questions, then what are the differences?

11.5 Draw a circuit diagram showing how to connect:

(a) a pushbutton switch to GPIO 23 and an LED to GPIO 27 on the Raspberry Pi, and

(b) a pushbutton switch to GPIO 12 and an LED to GPIO 13 on the pcDuino.

11.6 Assuming the systems are wired according to the previous exercise, write two functions. One function must initialize the GPIO pins, and the other function must read the state of the switch and turn the LED on if the button is pressed, and off if the button is not pressed. Write the two functions for

(a) a Raspberry Pi, and

(b) a pcDuino.

11.7 Write the code necessary to route the output from PWM0 to GPIO 18 on a Raspberry Pi.

11.8 Write the code necessary to route the output from PWM0 to GPIO 5 on a pcDuino.

Chapter 12

Pulse Modulation

Abstract

This chapter begins by explaining pulse density and pulse width modulation in general terms. It then introduces and describes the PWM device on the Raspberry Pi. Following that, it covers the pcDuino PWM device. This gives the reader another opportunity to see two different devices which both perform essentially the same functions.

Keywords

Pulse width modulation; Pulse density modulation; Digital to analog; Low pass filter

The GPIO device provides a method for sending digital signals to external devices. This can be useful to control devices that have basically two states: on and off. In some situations, it is useful to have the ability to turn a device on at varying levels. For instance, it could be useful to control a motor at any required speed, or control the brightness of a light source. One way that this can be accomplished is through pulse modulation.

The basic idea is that the computer sends a stream of pulses to the device. The device acts as a low-pass filter, which averages the digital pulses into an analog voltage. By varying the percentage of time that the pulses are high versus low, the computer can control how much average energy is sent to the device. That percentage is known as the duty cycle, and varying it is referred to as modulation. There are two major types of pulse modulation: pulse density modulation (PDM) and pulse width modulation (PWM). Most pulse modulation devices are configured in three steps as follows:

1. The base frequency of the clock that drives the PWM device is configured. This step is usually optional.

2. The mode of operation for the pulse modulation device is configured by writing to one or more configuration registers in the pulse modulation device.

3. The cycle time is set by writing a “range” value into a register in the pulse modulation device. This value is usually set as a multiple of the base clock cycle time.

Once the device is configured, the duty cycle can be changed easily by writing to one or more registers in the pulse modulation device.

12.1 Pulse Density Modulation

With PDM, also known as pulse frequency modulation (PFM), the duration of the positive pulses does not change, but the time between them (the pulse density) is modulated. When using PDM devices, the programmer typically sets the device cycle time tc in a register, then uses another register to specify the number of pulses d that are to be sent during a device cycle. The number of pulses is typically referred to as the duty cycle and must be chosen such that 0 ≤ d ≤ tc. For instance, if tc = 1024, then the device cycle time is 1024 times the cycle time of the clock that drives the device. If d = 512, then the device will send 512 pulses, evenly spaced, during the device cycle. Each pulse will have the same duration as one base clock cycle. The device will continue to output this pulse pattern until d is changed.

Fig. 12.1 shows a signal that is being sent using PDM, and the resulting set of pulses. Each pulse transfers a fixed amount of energy to the device. When the pulses arrive at the device, they are effectively filtered using a low pass filter. The resulting received signal is also shown. Notice that the received signal has a delay, or phase shift, caused by the low-pass filtering. This approach is suitable for controlling certain types of devices, such as lights and speakers.

Figure 12.1 Pulse density modulation.

However, when driving such devices directly with the digital pulses, care must be taken that the minimum frequency of pulses remains above the threshold that can be detected by human senses. For instance, when driving a speaker, the minimum pulse frequency must be high enough that the individual pulses cannot be distinguished by the human ear. This minimum frequency is around 40 kHz. Likewise, when driving an LED directly, the minimum frequency must be high enough that the eye cannot detect the individual pulses; otherwise they will be seen as a flickering effect. That minimum frequency is around 70 Hz. To reduce or alleviate this problem, designers may add a low-pass filter between the pulse modulation device and the device that is being driven.

12.2 Pulse Width Modulation

In PWM, the frequency of the pulses remains fixed, but the duration of the positive pulse (the pulse width) is modulated. When using PWM devices, the programmer typically sets the device cycle time tc in a register, then uses another register to specify the number of base clock cycles, d, for which the output should be high. The percentage (d/tc) × 100 is typically referred to as the duty cycle, and d must be chosen such that 0 ≤ d ≤ tc. For instance, if tc = 1024, then the device cycle time is 1024 times the cycle time of the clock that drives the device. If d = 512, then the device will output a high signal for 512 clock cycles, then output a low signal for 512 clock cycles. It will continue to repeat this pattern of pulses until d is changed.
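
The duty-cycle arithmetic can be checked with a quick sketch (`pwm_data` is a hypothetical helper, not part of any device API):

```python
def pwm_data(t_c, duty_percent):
    """Value d to place in the data register so that the output is
    high for d of every t_c base-clock cycles (duty = d/t_c * 100)."""
    d = round(t_c * duty_percent / 100)
    if not 0 <= d <= t_c:
        raise ValueError("duty cycle must be between 0% and 100%")
    return d

print(pwm_data(1024, 50))   # 512, matching the example in the text
```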

Fig. 12.2 shows a signal that is being sent using PWM. The pulses are also shown. Each pulse transfers some energy to the device. The width of each pulse determines how much energy is transferred. When the pulses arrive at the device, they are effectively filtered using a low-pass filter. The resulting received signal is shown by the dashed line. As with PDM, the received signal has a delay, or phase shift, caused by the low-pass filtering.

Figure 12.2 Pulse width modulation.

One advantage of PWM over PDM is that the digital circuit is not as complex. Another advantage of PWM over PDM is that the frequency of the pulses does not vary, so it is easier for the programmer to set the base frequency high enough that the individual pulses cannot be detected by human senses. Also, when driving motors it is usually necessary to match the pulse frequency to the size and type of motor. Mismatching the frequency can cause loss of efficiency as well as overheating of the motor and drive electronics. In severe cases, this can cause premature failure of the motor and/or drive electronics. With PWM, it is easier for the programmer to control the base frequency, and thereby avoid those problems.

12.3 Raspberry Pi PWM Device

The Broadcom BCM2835 system-on-chip includes a device that can create two PWM signals. One of the signals (PWM0) can be routed through GPIO pin 18 (alternate function 5), where it is available on the Raspberry Pi expansion header at pin 12. PWM0 can also be routed through GPIO pin 40. On the Raspberry Pi, the signal on pin 40 is sent through a low-pass filter, and then to the Raspberry Pi audio output port as the right stereo channel. The other signal (PWM1) can be routed through GPIO pin 45. From there, it is sent through a low-pass filter, and then to the Raspberry Pi audio output port as the left stereo channel. So, both PWM channels are accessible, but PWM1 is only accessible through the audio output port after it has been low-pass filtered. The raw PWM0 signal is available through the Raspberry Pi expansion header at pin 12.

There are three modes of operation for the BCM2835 PWM device:

1. PDM mode,

2. PWM mode, and

3. serial transmission mode.

The following paragraphs explain how the device can be used in basic PWM mode, which is the simplest and most straightforward mode for this device. Information on how to use the PDM and serial transmission modes, the FIFO, and DMA is available in the BCM2835 ARM Peripherals manual.

The base address of the PWM device is 2020C000₁₆ and it contains eight registers. Table 12.1 shows the offset, name, and a short description for each of the registers. The mode of operation is selected for each channel independently by writing the appropriate bits in the PWMCTL register. The base clock frequency is controlled by the clock manager device, which is explained in Section 13.1. By default, the system startup code sets the base clock for the PWM device to 100 MHz.

Table 12.1

Raspberry Pi PWM register map

Offset  Name     Description            Size  R/W
00₁₆    PWMCTL   PWM Control            32    R/W
04₁₆    PWMSTA   PWM FIFO Status        32    R/W
08₁₆    PWMDMAC  PWM DMA Configuration  32    R/W
10₁₆    PWMRNG1  PWM Channel 1 Range    32    R/W
14₁₆    PWMDAT1  PWM Channel 1 Data     32    R/W
18₁₆    PWMFIF1  PWM FIFO Input         32    R/W
20₁₆    PWMRNG2  PWM Channel 2 Range    32    R/W
24₁₆    PWMDAT2  PWM Channel 2 Data     32    R/W


Table 12.2 shows the names and short descriptions of the bits in the PWMCTL register. There are 8 bits used for controlling channel 1 and 8 bits for controlling channel 2. PWENn is the master enable bit for channel n. Setting that bit to 0 disables the PWM channel, while setting it to 1 enables the channel. MODEn is used to select whether the channel is in serial transmission mode or in the PDM/PWM mode. If MODEn is set to 0, then MSENn is used to choose whether channel n is in PDM mode or PWM mode. If MODEn is set to 1, then RPTLn, SBITn, USEFn, and CLRFn are used to manage the operation of the FIFO for channel n. POLAn is used to enable or disable inversion of the output signal for channel n.

Table 12.2

Raspberry Pi PWM control register bits

Bit    Name    Description            Values
0      PWEN1   Channel 1 Enable       0: Channel is disabled; 1: Channel is enabled
1      MODE1   Channel 1 Mode         0: PDM or PWM mode; 1: Serial mode
2      RPTL1   Channel 1 Repeat Last  0: Transmission stops when FIFO empty; 1: Last data are sent repeatedly
3      SBIT1   Channel 1 Silence Bit  0: Output goes low when not transmitting; 1: Output goes high when not transmitting
4      POLA1   Channel 1 Polarity     0: 0 is low voltage and 1 is high voltage; 1: 1 is low voltage and 0 is high voltage
5      USEF1   Channel 1 Use FIFO     0: Data register is used; 1: FIFO is used
6      CLRF1   Channel 1 Clear FIFO   Write 0: No effect; Write 1: Causes FIFO to be emptied
7      MSEN1   Channel 1 PWM Enable   0: PDM mode; 1: PWM mode
8      PWEN2   Channel 2 Enable       0: Channel is disabled; 1: Channel is enabled
9      MODE2   Channel 2 Mode         0: PDM or PWM mode; 1: Serial mode
10     RPTL2   Channel 2 Repeat Last  0: Transmission stops when FIFO empty; 1: Last data are sent repeatedly
11     SBIT2   Channel 2 Silence Bit  0: Output goes low when not transmitting; 1: Output goes high when not transmitting
12     POLA2   Channel 2 Polarity     0: 0 is low voltage and 1 is high voltage; 1: 1 is low voltage and 0 is high voltage
13     USEF2   Channel 2 Use FIFO     0: Data register is used; 1: FIFO is used
14     Unused  Reserved
15     MSEN2   Channel 2 PWM Enable   0: PDM mode; 1: PWM mode
16–31  Unused  Reserved


The PWMRNGn registers are used to define the base period for the corresponding channel. In PDM mode, evenly distributed pulses are sent within a period of length defined by this register, and the number of pulses sent during the base period is controlled by writing to the corresponding PWMDATn register. In PWM mode, the PWMRNGn register defines the base frequency for the pulses, and the duty cycle is controlled by writing to the corresponding PWMDATn register. Example 12.1 gives an overview of the steps needed to configure PWM0 for use in PWM mode.

Example 12.1

Example of Determining Clock Values on the Raspberry Pi

Suppose we wish to use PWM0 to perform PWM with a base frequency of 100 kHz and the ability to control the duty cycle with a resolution of 0.1%. The steps would be as follows:

1. Verify that the clock manager device is configured to send a 100 MHz clock to the pulse modulator device through PWM_CLK.

2. To obtain a frequency of 100 kHz from a 100-MHz clock, it is necessary to divide by 1000. Therefore the second step is to store 1000 in the PWMRNG1 register.

3. Before enabling the PWM channel, it is prudent to initialize the duty cycle. The safest initial value is 0%, or completely off. This is accomplished by writing zero to the PWMDAT1 register.

4. Enable PWM channel 1 to operate in PWM mode by setting bit zero of PWMCTL to 1, bit one of PWMCTL to 0, bit five of PWMCTL to 0, and bit seven of PWMCTL to 1.

Once this initialization is performed, we can set or change the duty cycle at any time by writing a value between 0 and 1000 to the PWMDAT1 register.
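
The steps above can be sketched numerically. This Python fragment only computes the values that would be stored; on real hardware, each value would be written with a store instruction to the corresponding memory-mapped register (PWMCTL, PWMRNG1, PWMDAT1 from Table 12.1).

```python
# Bit positions in the PWMCTL register (Table 12.2)
PWEN1 = 1 << 0   # channel 1 enable
MODE1 = 1 << 1   # 0 selects PDM/PWM mode (left clear below)
USEF1 = 1 << 5   # 0 selects the data register (left clear below)
MSEN1 = 1 << 7   # 1 selects PWM mode

# Step 2: 100 MHz / 1000 = 100 kHz, with 0.1% duty-cycle resolution
pwmrng1 = 1000

# Step 3: start with the output off (0% duty cycle)
pwmdat1 = 0

# Step 4: enable channel 1 in PWM mode (bits 1 and 5 stay 0)
pwmctl = PWEN1 | MSEN1
print(hex(pwmctl))   # 0x81

# Later, a 25% duty cycle:
pwmdat1 = 250
```

Because the range is 1000, each unit written to PWMDAT1 corresponds to 0.1% of the duty cycle.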

12.4 pcDuino PWM Device

The AllWinner A10/A20 SOCs have a hardware PWM device which is capable of generating two PWM signals. The PWM device is driven by the OSC24M signal, which is generated by the Clock Control Unit (CCU) in the AllWinner SOC. This base clock runs at 24 MHz by default, and changing the base frequency could affect many other devices in the system. The base clock can be divided by one of 11 predefined values using a prescaler built into the PWM device. Each of the two channels has its own prescaler. Table 12.3 shows the possible settings for the prescalers.

Table 12.3

Prescaler bits in the pcDuino PWM device

Value             Effect
0000              Base clock is divided by 120
0001              Base clock is divided by 180
0010              Base clock is divided by 240
0011              Base clock is divided by 360
0100              Base clock is divided by 480
0101, 0110, 0111  Not used
1000              Base clock is divided by 1200
1001              Base clock is divided by 2400
1010              Base clock is divided by 3600
1011              Base clock is divided by 4800
1100              Base clock is divided by 7200
1101, 1110        Not used
1111              Base clock is divided by 1

There are two modes of operation for the PWM device. In the first mode, the device operates like a standard PWM device as described in Section 12.2. In the second mode, it sends a single pulse and then waits until it is triggered again by the CPU. In this mode, it is a monostable multivibrator, also known as a one-shot multivibrator, or just one-shot. The duration of the pulse is controlled using the pre-scaler and the period register.

The PWM device is mapped at address 01C20C00₁₆. Table 12.4 shows the registers and their offsets from the base address. All of the device configuration is done through a single control register, which can also be read in order to determine the status of the device. The bits in the control register are shown in Table 12.5.

Table 12.4

pcDuino PWM register map

Offset  Name            Description
200₁₆   PWMCTL          PWM Control
204₁₆   PWM_CH0_PERIOD  PWM Channel 0 Period
208₁₆   PWM_CH1_PERIOD  PWM Channel 1 Period

Table 12.5

pcDuino PWM control register bits

Bit    Name             Description         Values
3–0    CH0_PRESCAL      Channel 0 Prescale  These bits must be set before the PWM Channel 0 clock is enabled. See Table 12.3.
4      CH0_EN           Channel 0 Enable    0: Channel disabled; 1: Channel enabled
5      CH0_ACT_STA      Channel 0 Polarity  0: Channel is active low; 1: Channel is active high
6      SCLK_CH0_GATING  Channel 0 Clock     0: Clock disabled; 1: Clock enabled
7      CH0_PUL_START    Start Pulse         If configured for pulse mode, writing a 1 causes the PWM device to emit a single pulse.
8      PWM0_BYPASS      Bypass PWM          0: Output PWM device signal; 1: Output base clock
9      SCLK_CH0_MODE    Select Mode         0: PWM mode; 1: Pulse mode
14–10  Unused
18–15  CH1_PRESCAL      Channel 1 Prescale  These bits must be set before the PWM Channel 1 clock is enabled. See Table 12.3.
19     CH1_EN           Channel 1 Enable    0: Channel disabled; 1: Channel enabled
20     CH1_ACT_STA      Channel 1 Polarity  0: Channel is active low; 1: Channel is active high
21     SCLK_CH1_GATING  Channel 1 Clock     0: Clock disabled; 1: Clock enabled
22     CH1_PUL_START    Start Pulse         If configured for pulse mode, writing a 1 causes the PWM device to emit a single pulse.
23     PWM1_BYPASS      Bypass PWM          0: Output PWM device signal; 1: Output base clock
24     SCLK_CH1_MODE    Select Mode         0: PWM mode; 1: Pulse mode
27–25  Unused
28     PWM0_RDY         CH0 Period Ready    0: PWM0 Period register is ready; 1: PWM0 Period register is busy
29     PWM1_RDY         CH1 Period Ready    0: PWM1 Period register is ready; 1: PWM1 Period register is busy
31–30  Unused


Before enabling a PWM channel, the period register for that channel should be initialized. The two period registers are each organized as two 16-bit numbers. The upper 16 bits control the total number of clock cycles in one period. In other words, they control the base frequency of the PWM signal. The PWM frequency is calculated as

f = OSC24M / (PSC × (N + 1)),

where OSC24M is the frequency of the base clock (the default is 24 MHz), PSC is the prescale value set in the channel prescale bits in the PWM control register, and N is the value stored in the upper 16 bits of the channel period register.

The lower 16 bits of the channel period register control the duty cycle. The duty cycle (expressed as % of full on) can be calculated as

d = (D / N) × 100,

where N is the value stored in the upper 16 bits of the channel period register, and D is the value stored in the lower 16 bits of the channel period register. Note that the condition D ≤ N must always remain true. If the programmer allows D to become greater than N, the results are unpredictable.
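
The two formulas can be verified with a small sketch (`pwm_frequency` and `duty_cycle` are hypothetical helpers, not device code):

```python
OSC24M = 24_000_000   # base clock frequency in Hz

def pwm_frequency(psc, n):
    """f = OSC24M / (PSC * (N + 1))."""
    return OSC24M / (psc * (n + 1))

def duty_cycle(d, n):
    """d = (D / N) * 100; only valid while D <= N."""
    if d > n:
        raise ValueError("D must not exceed N")
    return d / n * 100

# Prescale /240 and N = 999 give 24 MHz / (240 * 1000) = 100 Hz
print(pwm_frequency(240, 999))   # 100.0
print(duty_cycle(250, 1000))     # 25.0
```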

The procedure for configuring the AllWinner A10/A20 PWM device is as follows:

1. Disable the desired channel:

(a) Read the PWM control register into x.

(b) Clear all of the bits in x for the desired PWM channel.

(c) Write x back to the PWM control register

2. Initialize the period register for the desired channel.

(a) Calculate the desired value for N.

(b) Let D = 0.

(c) Let y = N × 216 + D.

(d) Write y to the desired channel period register.

3. Set the prescaler.

(a) Select the four-bit code for the desired divisor from Table 12.3.

(b) Set the prescaler code bits in x.

(c) Write x back to the PWM control register.

4. Enable the PWM device.

(a) Set the appropriate bits in x to enable the desired channel, select the polarity, and enable the clock.

(b) Write x to the PWM control register.

Once the control register is configured, the duty cycle can be controlled by calculating a new value for D and then writing y = N × 216 + D to the desired channel period register.
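
The procedure can be sketched as follows. The bit positions come from Table 12.5; the specific numbers (prescale by 1, N = 239 for 100 kHz) are illustrative. Only the value computation is shown — actual configuration requires read-modify-write accesses to the memory-mapped control and period registers.

```python
# Control-register bit positions for channel 0 (Table 12.5)
CH0_PRESCAL_SHIFT = 0        # prescale code occupies bits 3-0
CH0_EN            = 1 << 4
CH0_ACT_STA       = 1 << 5   # active high
SCLK_CH0_GATING   = 1 << 6

def period_word(n, d):
    """Pack N (upper 16 bits) and D (lower 16 bits) into a period-register value."""
    assert 0 <= d <= n <= 0xFFFF
    return (n << 16) | d

# Step 2: with prescale /1, N = 239 gives 24 MHz / (1 * (239 + 1)) = 100 kHz
y = period_word(239, 0)      # start with D = 0 (output off)

# Steps 3 and 4: prescaler code 1111 (divide by 1), channel enabled,
# active high, clock enabled
x = (0b1111 << CH0_PRESCAL_SHIFT) | CH0_EN | CH0_ACT_STA | SCLK_CH0_GATING
print(hex(x))   # 0x7f

# Changing the duty cycle later to about 25%:
y = period_word(239, 60)
```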

12.5 Chapter Summary

Pulse modulation is a group of methods for generating analog signals using digital equipment, and is commonly used in control systems to regulate the power sent to motors and other devices. Pulse modulation techniques can have very low power loss compared to other methods of controlling analog devices, and the circuitry required is relatively simple.

The cycle frequency must be programmed to match the application. Typically, 10 Hz is adequate for controlling an electric heating element, while 120 Hz would be more appropriate for controlling an incandescent light bulb. Large electric motors may be controlled with a cycle frequency as low as 100 Hz, while smaller motors may need frequencies around 10,000 Hz. It can take some experimentation to find the best frequency for any given application.

Exercises

12.1 Write ARM assembly programs to configure PWM0 and the GPIO device to send a signal out on Raspberry Pi header pin 12 with:

(a) period of 1 ms and duty cycle of 25%, and

(b) frequency of 150 Hz and duty cycle of 63%.

12.2 Write ARM assembly programs to configure PWM0 and the GPIO device to send a signal out on the pcDuino PWM1/GPIO5 pin with:

(a) period of 1 ms and duty cycle of 25%, and

(b) frequency of 150 Hz and duty cycle of 63%.

Chapter 13

Common System Devices

Abstract

This chapter briefly describes some of the devices which are present in most modern computer systems. It then describes in detail the clock management devices on the Raspberry Pi and the pcDuino. Next, it gives an explanation of asynchronous serial communications, and explains how there is some tolerance for mismatch between the clock rates of the transmitter and receiver. It then explains the Universal Asynchronous Receiver/Transmitter (UART) device. Next, it covers in detail the UART devices present on the Raspberry Pi and the pcDuino. Once again, the reader is given the opportunity to compare two different devices which perform almost precisely the same functions.

Keywords

Universal asynchronous receiver/transmitter (UART); Clock manager; Serial communications; RS232

There are some classes of devices that are found in almost every system, including the smallest embedded systems. Such common devices include hardware for managing the clock signals sent to other devices, and serial communications (typically RS232). Most mid-sized or large systems also include devices for managing virtual memory, managing the cache, driving a display, interfacing with keyboard and mouse, accessing disk and other storage devices, and networking. Small embedded systems may have devices for converting analog signals to digital and vice versa, pulse width modulation, and other purposes. Some systems, such as the Raspberry Pi and pcDuino, have all or most of the devices of large systems, as well as most of the devices found on embedded systems. In this chapter, we look at two devices found on almost every system.

13.1 Clock Management Device

Very simple computer systems can be driven by a single clock. Most devices, including the CPU, are designed as state machines. The clock device sends a square-wave signal at a fixed frequency to all devices that need it. The clock signal tells the devices when to transition to the next state. Without the clock signal, none of the devices would do anything.

More complex computers may contain devices which need to run at different rates. This requires the system to have separate clock signals for each device (or group of devices). System designers often solve this problem by adding a clock manager device to the system. This device allows the programmer to configure the clock signals that are sent to the other devices in the system. Fig. 13.1 shows a typical system. The clock manager, just like any other device, is configured by the CPU writing data to its registers using the system bus.

f13-01-9780128036983
Figure 13.1 Typical system with a clock management device.

13.1.1 Raspberry Pi Clock Manager

The BCM2835 system-on-chip contains an ARM CPU and several devices. Some of the devices need their own clock to drive their operation at the correct frequency. Some devices, such as serial communications receivers and transmitters, need configurable clocks so that the programmer has control over the speed of the device. To provide this flexibility and allow the programmer to have control over the clocks for each device, the BCM2835 includes a clock manager device, which can be used to configure the clock signals driving the other devices in the system.

The Raspberry Pi has a 19.2 MHz oscillator which can be used as a base frequency for any of the clocks. The BCM2835 also has three phase-locked-loop circuits that boost the oscillator to higher frequencies. Table 13.1 shows the frequencies that are available from various sources. Each device clock can be driven by one of the PLLs, the external 19.2 MHz oscillator, a signal from the HDMI port, or either of two test/debug inputs.

Table 13.1

Clock sources available for the clocks provided by the clock manager

Number  Name            Frequency  Note
0       GND             0 Hz       Clock is stopped
1       oscillator      19.2 MHz
2       testdebug0      Unknown    Used for system testing
3       testdebug1      Unknown    Used for system testing
4       PLLA            650 MHz    May not be available
5       PLLC            200 MHz    May not be available
6       PLLD            500 MHz
7       HDMI auxiliary  Unknown
8–15    GND             0 Hz       Clock is stopped


Among the clocks controlled by the clock manager device are the core clock (CM_VPU), the system timer clock (PM_TIME) which controls the speed of the system timer, the GPIO clocks which are documented in the Raspberry Pi peripheral documentation, the pulse modulator device clocks, and the serial communications clocks. It is generally not a good idea to modify the settings of any of the clocks without good reason.

The base address of the clock manager device is 20101000₁₆. Some of the clock manager registers are shown in Table 13.2. Each clock is managed by two registers: a control register and a divisor. The control register is used to enable or disable a clock, to select which source oscillator drives the clock, and to select an optional multistage noise shaping (MASH) filter level. MASH filtering is useful for reducing the perceived noise when a clock is being used to generate an audio signal. In most cases, MASH filtering should not be used.

Table 13.2

Some registers in the clock manager device

Offset  Name         Description
070₁₆   CM_GP0_CTL   GPIO Clock 0 (GPCLK0) Control
074₁₆   CM_GP0_DIV   GPIO Clock 0 (GPCLK0) Divisor
078₁₆   CM_GP1_CTL   GPIO Clock 1 (GPCLK1) Control
07c₁₆   CM_GP1_DIV   GPIO Clock 1 (GPCLK1) Divisor
080₁₆   CM_GP2_CTL   GPIO Clock 2 (GPCLK2) Control
084₁₆   CM_GP2_DIV   GPIO Clock 2 (GPCLK2) Divisor
098₁₆   CM_PCM_CTL   Pulse Code Modulator Clock (PCM_CLK) Control
09c₁₆   CM_PCM_DIV   Pulse Code Modulator Clock (PCM_CLK) Divisor
0a0₁₆   CM_PWM_CTL   Pulse Modulator Device Clock (PWM_CLK) Control
0a4₁₆   CM_PWM_DIV   Pulse Modulator Device Clock (PWM_CLK) Divisor
0f0₁₆   CM_UART_CTL  Serial Communications Clock (UART_CLK) Control
0f4₁₆   CM_UART_DIV  Serial Communications Clock (UART_CLK) Divisor

Table 13.3 shows the meaning of the bits in the control registers for each of the clocks, and Table 13.4 shows the fields in the clock manager divisor registers. The procedure for configuring one of the clocks is:

Table 13.3

Bit fields in the clock manager control registers

Bit    Name    Description
3–0    SRC     Clock source, chosen from Table 13.1
4      ENAB    Writing a 0 causes the clock to shut down. The clock will not stop immediately. The BUSY bit will be 1 while the clock is shutting down. When the BUSY bit becomes 0, the clock has stopped and it is safe to reconfigure it. Writing a 1 to this bit causes the clock to start
5      KILL    Writing a 1 to this bit will stop and reset the clock. This does not shut down the clock cleanly, and could cause a glitch in the clock output
6      Unused
7      BUSY    A 1 in this bit indicates that the clock is running
8      FLIP    Writing a 1 to this bit will invert the clock output. Do not change this bit while the clock is running
10–9   MASH    Controls how the clock source is divided. 00: Integer division; 01: 1-stage MASH division; 10: 2-stage MASH division; 11: 3-stage MASH division. Do not change this while the clock is running
23–11  Unused
31–24  PASSWD  This field must be set to 5A₁₆ every time the clock control register is written to


Table 13.4

Bit fields in the clock manager divisor registers

Bit    Name    Description
11–0   DIVF    Fractional part of divisor. Do not change this while the clock is running
23–12  DIVI    Integer part of divisor. Do not change this while the clock is running
31–24  PASSWD  This field must be set to 5A₁₆ every time the clock divisor register is written to

1. Read the desired clock control register.

2. Clear bit 4 in the word that was read, then OR it with 5A000000₁₆ and store the result back to the desired clock control register.

3. Repeatedly read the desired clock control register, until bit 7 becomes 0.

4. Calculate the divisor required and store it into the desired clock divisor register.

5. Create a word to configure and start the clock. Begin with 5A000000₁₆, and set bits 3–0 to select the desired clock source. Set bits 10–9 to select the type of division, and set bit 4 to 1 to enable the clock.

6. Store the control word into the desired clock control register.

Selection of the divisor depends on which clock source is used, what type of division is selected, and the desired output of the clock being configured. For example, to set the PWM clock to 100 kHz, the 19.2 MHz oscillator can be used. Dividing that clock by 192 will provide a 100 kHz clock. To accomplish this, it is necessary to stop the PWM clock as described, store the value 5A0C0000₁₆ in the PWM clock divisor register, and then start the clock by writing 5A000011₁₆ into the PWM clock control register.
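
The two words in this example can be constructed as follows. This is a sketch of the bit packing only; `cm_div` and `cm_ctl` are hypothetical helpers, and the field positions come from Tables 13.3 and 13.4.

```python
PASSWD = 0x5A << 24   # must appear in bits 31-24 of every write

def cm_div(divi, divf=0):
    """Divisor word: DIVI in bits 23-12, DIVF in bits 11-0 (Table 13.4)."""
    return PASSWD | (divi << 12) | divf

def cm_ctl(src, enab=False, mash=0):
    """Control word: SRC in bits 3-0, ENAB in bit 4, MASH in bits 10-9 (Table 13.3)."""
    return PASSWD | (mash << 9) | (int(enab) << 4) | src

# 19.2 MHz oscillator (source 1) divided by 192 gives 100 kHz
print(hex(cm_div(192)))               # 0x5a0c0000
print(hex(cm_ctl(src=1, enab=True)))  # 0x5a000011
```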

13.1.2 pcDuino Clock Control Unit

The AllWinner A10/A20 SOCs have a relatively simple clock manager, which is referred to as the Clock Control Unit. All of the clock signals in the system are driven by two crystal oscillators: the main oscillator, which runs at 24 MHz, and the real-time-clock oscillator, which runs at 32,768 Hz. The real-time-clock oscillator is used only to provide a signal to the real-time-clock device.

The main clock oscillator drives many of the devices in the system, but there are seven phase-locked-loop circuits in the CCU which provide signals for devices which need clocks that are faster or slower than 24 MHz. Table 13.5 shows which devices are driven by the nine clock signals.

Table 13.5

Clock signals in the AllWinner A10/A20 SOC

Clock Domain  Modules         Frequency       Description
OSC24M        Most modules    24 MHz          Main clock
CPU32_clk     CPU             2 kHz–1.2 GHz   Drives CPU
AHB_clk       AHB devices     8 kHz–276 MHz   Drives some devices
APB_clk       Peripheral bus  500 Hz–138 MHz  Drives some devices
SDRAM_clk     SDRAM           0 Hz–400 MHz    Drives SDRAM memory
USB_clk       USB             480 MHz         Drives USB devices


13.2 Serial Communications

There are basically two methods for transferring data between two digital devices: parallel and serial. Parallel connections use multiple wires to carry several bits at one time, typically including extra wires to carry timing information. Parallel communications are used for transferring large amounts of data over very short distances. However, this approach becomes very expensive when data must be transferred more than a few meters. Serial, on the other hand, uses a single wire to transfer the data bits one at a time. When compared to parallel transfer, the speed of serial transfer typically suffers. However, because it uses significantly fewer wires, the distance may be greatly extended, reliability improved, and cost vastly reduced.

13.2.1 UART

One of the oldest and most common devices for communications between computers and peripheral devices is the Universal Asynchronous Receiver/Transmitter, or UART. The word “universal” indicates that the device is highly configurable and flexible. UARTs allow a receiver and transmitter to communicate without a synchronizing signal.

The logic signal produced by the digital UART typically oscillates between zero volts for a low level and five volts for a high level, and the amount of current that the UART can supply is limited. For transmitting the data over long distances, the signals may go through a level-shifting or amplification stage. The circuit used to accomplish this is typically called a line driver. This circuit boosts the signal provided by the UART and also protects the delicate digital outputs from short circuits and signal spikes. Various standards, such as RS-232, RS-422, and RS-485, define the voltages that the line driver uses. For example, the RS-232 standard specifies that valid signals are in the range of +3 to +15 V or −3 to −15 V. The standards also specify the maximum time that is allowable when shifting from a high signal to a low signal and vice versa, the amount of current that the device must be capable of sourcing and sinking, and other relevant design criteria.

The UART transmits data by sending each bit sequentially. The receiving UART re-assembles the bits into the original data. Fig. 13.2 shows how the transmitting UART converts a byte of data into a serial signal, and how the receiving UART samples the signal to recover the original data. Serializing the transmission and reassembly of the data are accomplished using shift registers. The receiver and transmitter each have their own clocks, and are configured so that the clocks run at the same speed (or close to the same speed). In the example shown in the figure, the receiver's clock is running slightly slower than the transmitter's clock, but the data are still received correctly.

f13-02-9780128036983
Figure 13.2 Transmitter and receiver timings for two UARTs. (A) Waveform of a UART transmitting a byte. (B) Timing of UART receiving a byte.

To transfer a group of bits, called a data frame, the transmitter typically first sends a start bit. Most UARTs can be configured to transfer between four and eight data bits in each group. The transmitting and receiving UARTs must be configured to use the same number of data bits. After each group of data bits, the transmitter will return the signal to the low state and keep it there for some minimum period. This period is usually the time that it would take to send two bits of data, and is referred to as two stop bits. The stop bits allow the receiver to have some time to process the received byte and prepare for the next start bit. Fig. 13.2A shows what a typical RS-232 signal would look like when transferring the value 56₁₆ (the ASCII "V" character). The UART enters the idle state only if there is not another byte immediately ready to send. If the transmitter has another byte to send, then the start bit can begin at the end of the second stop bit.

Note that it is impossible to ensure that the receiver and transmitter have clocks which are running at exactly the same speed, unless they use the same clock signal. Fig. 13.2B shows how the receiver can reassemble the original data, even with a slightly different clock rate. When the start bit is detected by the receiver, it prepares to receive the data bits, which will be sent by the transmitter at an expected rate (within some tolerance). The receive circuitry of most UARTs is driven by a clock that runs 16 times as fast as the baud rate. The receive circuitry uses its faster clock to latch each bit in the middle of its expected time period. In Fig. 13.2B, the receiver clock is running slower than the transmitter clock. By the end of the data frame, the sample time is very far from the center of the bit, but the correct value is received. If the clocks differed by much more, or if more than eight data bits were sent, then it is very likely that incorrect data would be received. Thus, as long as their clocks are synchronized within some tolerance (which depends on the number of data bits and the baud rate), the data will be received correctly.

The RS-232 standard allows point-to-point communication between two devices for limited distances. With the RS-232 standard, simple one-way communications can be accomplished using only two wires: One to carry the serial bits, and another to provide a common ground. For bi-directional communication, three wires are required. In addition, the RS-232 standard specifies optional hand-shaking signals, which the UARTs can use to signal their readiness to transmit or receive data. The RS-422 and RS-485 standards allow multiple devices to be connected using only two wires.

The first UART device to enjoy widespread use was the 8250. The original version had 12 registers for configuration, sending, and receiving data. The most important registers are the ones that allow the programmer to set the transmit and receive bit rates, or baud. One baud is one bit per second. The baud is set by storing a 16-bit divisor in two of the registers in the UART. The chip is driven by an external clock, and the divisor is used to reduce the frequency of the external clock to a frequency that is appropriate for serial communication. For example, if the external clock runs at 1 MHz, and the required baud is 1200, then the ideal divisor would be 833.33…. Note that the divisor can only be an integer, so 833 is used and the device cannot achieve exactly 1200 baud; it actually runs at about 1200.48 baud. However, as explained previously, the sending and receiving devices do not have to agree precisely on the baud. During the transmission and reception of a byte, 1200.48 baud is close enough that the bits will be received correctly even if the other end is running slightly below 1200 baud. In the 8250, there was only one 8-bit register for sending data and only one 8-bit register for receiving data. The UART could send an interrupt to the CPU after each byte was transmitted or received. When receiving, the CPU had to respond to the interrupt very quickly. If the current byte was not read quickly enough by the CPU, it would be overwritten by the subsequent incoming byte. When transmitting, the CPU needed to respond quickly to interrupts to provide the next byte to be sent, or the transmission rate would suffer.
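
The divisor arithmetic can be checked with a short sketch (`uart_divisor` is a hypothetical helper, not an 8250 API):

```python
def uart_divisor(clock_hz, baud):
    """Return the integer divisor latch value and the baud actually produced."""
    divisor = round(clock_hz / baud)
    return divisor, clock_hz / divisor

div, actual = uart_divisor(1_000_000, 1200)
print(div)      # 833
print(actual)   # roughly 1200.48 baud
```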

The next generation of UART device was the 16550A. This device is the model for most UART devices today. It features 16-byte input and output buffers and the ability to trigger interrupts when a buffer is partially full or partially empty. This allows the CPU to move several bytes of data at a time and results in much lower CPU overhead and much higher data transmission and reception rates. The 16550A also supports much higher baud rates than the 8250.

13.2.2 Raspberry Pi UART0

The BCM2835 system-on-chip provides two UART devices: UART0 and UART1. UART1 is a "mini UART" that is part of the auxiliary peripherals, and is not recommended for use as a UART. UART0 is a PL011 UART, which is based on the industry standard 16550A UART. The major differences are that the PL011 allows greater flexibility in configuring the interrupt trigger levels, the registers appear in different locations, and the locations of bits in some of the registers are different. So, although it operates very much like a 16550A, things have been moved to different locations. The transmit and receive lines can be routed through GPIO pin 14 and GPIO pin 15, respectively. UART0 has 18 registers, starting at its base address of 20201000₁₆. Table 13.6 shows the name, location, and a brief description for each of the registers.

Table 13.6

Raspberry Pi UART0 register map

Offset  Name         Description
0x00    UART_DR      Data Register
0x04    UART_RSRECR  Receive Status Register/Error Clear Register
0x18    UART_FR      Flag Register
0x20    UART_ILPR    Not in use (IrDA register)
0x24    UART_IBRD    Integer Baud Rate Divisor
0x28    UART_FBRD    Fractional Baud Rate Divisor
0x2C    UART_LCRH    Line Control Register
0x30    UART_CR      Control Register
0x34    UART_IFLS    Interrupt FIFO Level Select Register
0x38    UART_IMSC    Interrupt Mask Set/Clear Register
0x3C    UART_RIS     Raw Interrupt Status Register
0x40    UART_MIS     Masked Interrupt Status Register
0x44    UART_ICR     Interrupt Clear Register
0x48    UART_DMACR   DMA Control Register
0x80    UART_ITCR    Test Control Register
0x84    UART_ITIP    Integration Test Input Register
0x88    UART_ITOP    Integration Test Output Register
0x8C    UART_TDR     Test Data Register

UART_DR: The UART Data Register is used to send and receive data. Data are sent or received one byte at a time. Writing to this register will add a byte to the transmit FIFO. Although the register is 32 bits, only the 8 least significant bits are used in transmission, and 12 least significant bits are used for reception. If the FIFO is empty, then the UART will begin transmitting the byte immediately. If the FIFO is full, then the last byte in the FIFO will be overwritten with the new byte that is written to the Data Register. When this register is read, it returns the byte at the top of the receive FIFO, along with four additional status bits to indicate if any errors were encountered. Table 13.7 specifies the names and use of the bits in the UART Data Register.

Table 13.7

Raspberry Pi UART data register

Bit    Name  Description     Values
7–0    DATA  Data            Read: last data received; Write: data byte to transmit
8      FE    Framing error   0: No error; 1: The received character did not have a valid stop bit
9      PE    Parity error    0: No error; 1: The received character did not have the correct parity, as set in the EPS and SPS bits of the Line Control Register (UART_LCRH)
10     BE    Break error     0: No error; 1: A break condition was detected (the data input line was held low for longer than the time it would take to receive a complete byte, including the start and stop bits)
11     OE    Overrun error   0: No error; 1: Data was not read quickly enough, and one or more bytes were overwritten in the input buffer
31–12  -     Not used        Write as zero, read as don't care


UART_RSRECR: The UART Receive Status Register/Error Clear Register is used to check the status of the byte most recently read from the UART Data Register, and to check for overrun conditions at any time. The status information for overrun is set immediately when an overrun condition occurs. The Receive Status Register/Error Clear Register provides the same four status bits as the Data Register (but in bits 3–0 rather than bits 11–8). The received data character must be read first from the Data Register, before reading the error status associated with that data character from the RSRECR register. Since the Data Register also contains these 4 bits, this register may not be required, depending on how the software is written. Table 13.8 describes the bits in this register.

Table 13.8

Raspberry Pi UART receive status register/error clear register

Bit   Name  Description     Values
0     FE    Framing error   0: No error; 1: The received character did not have a valid stop bit
1     PE    Parity error    0: No error; 1: The received character did not have the correct parity, as set in the EPS and SPS bits of the Line Control Register (UART_LCRH)
2     BE    Break error     0: No error; 1: A break condition was detected (the data input line was held low for longer than the time it would take to receive a complete byte, including the start and stop bits)
3     OE    Overrun error   0: No error; 1: Data was not read quickly enough, and one or more bytes were overwritten in the input buffer
31–4  -     Not used        Write as zero, read as don't care


UART_FR: The UART Flag Register can be read to determine the status of the UART. The bits in this register are used mainly when sending and receiving data using the FIFOs. When several bytes need to be sent, the TXFF flag should be checked to ensure that the transmit FIFO is not full before each byte is written to the data register. When receiving data, the RXFE bit can be used to determine whether or not there is more data to be read from the FIFO. Table 13.9 describes the flags in this register.

Table 13.9

Raspberry Pi UART flags register bits

Bit   Name  Description           Values
0     CTS   Clear To Send         0: The other end is not ready to receive; 1: The other end indicates it is ready to receive
1     DSR   Data Set Ready        Not implemented: write as zero, read as don't care
2     DCD   Data Carrier Detect   Not implemented: write as zero, read as don't care
3     BUSY  UART is busy          0: UART is not transmitting data; 1: UART is transmitting a byte
4     RXFE  Receive FIFO empty    0: Receive FIFO contains bytes that have been received; 1: Receive FIFO is empty
5     TXFF  Transmit FIFO full    0: There is room for at least one more byte in the transmit FIFO; 1: Transmit FIFO is full; do not write to the data register at this time
6     RXFF  Receive FIFO full     0: There is still some space in the receive FIFO; 1: There is no more room in the receive FIFO
7     TXFE  Transmit FIFO empty   0: There is at least one byte waiting to be transmitted; 1: There are no bytes waiting to be transmitted
8     RI    Ring Indicator        Not implemented: write as zero, read as don't care
31–9  -     Not used              Write as zero, read as don't care


UART_ILPR: This is the IrDA register, which is supported by some PL011 UARTs. IrDA stands for the Infrared Data Association, which is a group of companies that cooperate to provide specifications for a complete set of protocols for wireless infrared communications. The name “IrDA” also refers to that set of protocols. IrDA is not implemented on the Raspberry Pi UART. Writing to this register has no effect and reading returns 0.

UART_IBRD and UART_FBRD: UART_FBRD is the fractional part of the baud rate divisor value, and UART_IBRD is the integer part. The baud rate divisor is calculated as follows:

BAUDDIV = UARTCLK / (16 × Baud rate)        (13.1)

where UARTCLK is the frequency of the UART_CLK that is configured in the Clock Manager device. The default value is 3 MHz. BAUDDIV is stored in two registers. UART_IBRD holds the integer part and UART_FBRD holds the fractional part. Thus BAUDDIV should be calculated as a U(16,6) fixed point number. The contents of the UART_IBRD and UART_FBRD registers may be written at any time, but the change will not have any effect until transmission or reception of the current character is complete. Table 13.10 shows the arrangement of the integer baud rate divisor register, and Table 13.11 shows the arrangement of the fractional baud rate divisor register.

Table 13.10

Raspberry Pi UART integer baud rate divisor

Bit    Name  Description                Values
15–0   IBRD  Integer Baud Rate Divisor  See Eq. (13.1)
31–16  -     Not used                   Write as zero, read as don't care


Table 13.11

Raspberry Pi UART fractional baud rate divisor

Bit   Name  Description                   Values
5–0   FBRD  Fractional Baud Rate Divisor  See Eq. (13.1)
31–6  -     Not used                      Write as zero, read as don't care


UART_LCRH: UART_LCRH is the line control register. It is used to configure the communication parameters. This register must not be changed until the UART has been disabled by writing zero to bit 0 of UART_CR and the BUSY flag in UART_FR is clear. Table 13.12 shows the layout of the line control register.

Table 13.12

Raspberry Pi UART line control register bits

Bit   Name  Description         Values
0     BRK   Send Break          0: Normal operation; 1: After the current character is sent, take the TXD output to a low level and keep it there
1     PEN   Parity Enable       0: Parity checking and generation is disabled; 1: Generate and send parity bit and check parity on received data
2     EPS   Even Parity Select  0: Odd parity; 1: Even parity
3     STP2  Two Stop Bits       0: Send one stop bit for each data word; 1: Send two stop bits for each data word
4     FEN   FIFO Enable         0: Transmit and receive FIFOs are disabled; 1: Transmit and receive FIFOs are enabled
6–5   WLEN  Word Length         00: 5 bits per data word; 01: 6 bits; 10: 7 bits; 11: 8 bits
31–7  -     Not used            Write as zero, read as don't care


UART_CR: The UART Control Register is used for configuring, enabling, and disabling the UART. Table 13.13 shows the layout of the control register. To enable transmission, the TXE bit and UARTEN bit must be set to 1. To enable reception, the RXE bit and UARTEN bit must be set to 1. In general, the following steps should be used to configure or re-configure the UART:

Table 13.13

Raspberry Pi UART control register bits

Bit    Name    Description             Values
0      UARTEN  UART Enable             0: UART disabled; 1: UART enabled
1      SIREN   Not used                Write as zero, read as don't care
2      SIRLP   Not used                Write as zero, read as don't care
6–3    -       Not used                Write as zero, read as don't care
7      LBE     Loopback Enable         0: Loopback disabled; 1: Loopback enabled (transmitted data is also fed back to the receiver)
8      TXE     Transmit Enable         0: Transmitter is disabled; 1: Transmitter is enabled
9      RXE     Receive Enable          0: Receiver is disabled; 1: Receiver is enabled
10     DTR     Not used                Write as zero, read as don't care
11     RTS     Complement of nUARTRTS
12     OUT1    Not used                Write as zero, read as don't care
13     OUT2    Not used                Write as zero, read as don't care
14     RTSEN   RTS Enable              0: Hardware RTS disabled; 1: Hardware RTS enabled
15     CTSEN   CTS Enable              0: Hardware CTS disabled; 1: Hardware CTS enabled
31–16  -       Not used                Write as zero, read as don't care


(a) Disable the UART.

(b) Wait for the end of transmission or reception of the current character.

(c) Flush the transmit FIFO by setting the FEN bit to 0 in the Line Control Register.

(d) Reprogram the Control Register.

(e) Enable the UART.

Interrupt Control: The UART can signal the CPU by asserting an interrupt when certain conditions occur. This will be covered in more detail in Chapter 14. For now, it is enough to know that there are five additional registers which are used to configure and use the interrupt mechanism.
UART_IFLS defines the FIFO level that triggers the assertion of the interrupt signal. One interrupt is generated when the FIFO reaches the specified level. The CPU must clear the interrupt before another can be generated.
UART_IMSC is the interrupt mask set/clear register. It is used to enable or disable specific interrupts. This register determines which of the possible interrupt conditions are allowed to generate an interrupt to the CPU.
UART_RIS is the raw interrupt status register. It can be read to obtain the raw status of the interrupt conditions before any masking is performed.
UART_MIS is the masked interrupt status register. It contains the masked status of the interrupts. This is the register that the operating system should use to determine the cause of a UART interrupt.
UART_ICR is the interrupt clear register. Writing to it clears the interrupt conditions. The operating system should use this register to clear interrupts before returning from the interrupt service routine.

UART_DMACR: The DMA control register is used to configure the UART to access memory directly, so that the CPU does not have to move each byte of data to or from the UART. DMA will be explained in more detail in Chapter 14.

Additional Registers: The remaining registers, UART_ITCR, UART_ITIP, UART_ITOP, and UART_TDR, are either unimplemented or are used only for testing the UART. These registers should not be used.

13.2.3 Basic Programming for the Raspberry Pi UART

Listing 13.1 shows four basic functions for initializing the UART, changing the baud rate, sending a character, and receiving a character using UART0 on the Raspberry Pi. Note that a large part of the code simply defines the location and offset for all of the registers (and bits) that can be used to control the UART.

Listing 13.1 Assembly functions for using the Raspberry Pi UART.

13.2.4 pcDuino UART

The AllWinner A10/A20 SOC includes eight UART devices. They are all fully compatible with the 16550A UART, and also provide some enhancements. All of them provide transmit (TX) and receive (RX) signals. UART0 has the full set of RS232 signals, including RTS, CTS, DTR, DSR, DCD, and RING. UART1 has the RTS and CTS signals. The remaining six UARTs only provide the TX and RX signals. They can all be configured for serial IrDA. Table 13.14 shows the base address for each of the eight UART devices.

Table 13.14

pcDuino UART addresses

Name   Address
UART0  0x01C28000
UART1  0x01C28400
UART2  0x01C28800
UART3  0x01C28C00
UART4  0x01C29000
UART5  0x01C29400
UART6  0x01C29800
UART7  0x01C29C00

When the 16550 UART was designed, 8-bit processors were common, and most of them provided only 16 address bits. Memory was typically limited to 64 kB, and every byte of address space was important. Because of these considerations, the designers of the 16550 decided to limit the number of addresses used to 8, and to only use eight bits of data per address. There are 10 registers in the 16550 UART, but some of them share the same address. For example, there are three registers mapped to an offset address of zero, two registers mapped at offset four, and two registers mapped at offset eight. Bit seven in the Line Control Register is used to determine which of the registers is active for a given address.

Because they are meant to be fully backwards-compatible with the 16550, the AllWinner A10/A20 SOC UART devices also use only 8 bits for each register, and the first 12 registers correspond exactly with the 16550 UART. The only differences are that the pcDuino uses word addresses rather than byte addresses, and they provide four additional registers that are used for IrDA mode. Table 13.15 shows the arrangement of the registers in each of the 8 UARTs on the pcDuino. The following sections will explain the registers.

Table 13.15

pcDuino UART register offsets

Register Name  Offset  Description
UART_RBR       0x00    UART Receive Buffer Register
UART_THR       0x00    UART Transmit Holding Register
UART_DLL       0x00    UART Divisor Latch Low Register
UART_DLH       0x04    UART Divisor Latch High Register
UART_IER       0x04    UART Interrupt Enable Register
UART_IIR       0x08    UART Interrupt Identity Register
UART_FCR       0x08    UART FIFO Control Register
UART_LCR       0x0C    UART Line Control Register
UART_MCR       0x10    UART Modem Control Register
UART_LSR       0x14    UART Line Status Register
UART_MSR       0x18    UART Modem Status Register
UART_SCH       0x1C    UART Scratch Register
UART_USR       0x7C    UART Status Register
UART_TFL       0x80    UART Transmit FIFO Level
UART_RFL       0x84    UART Receive FIFO Level
UART_HALT      0xA4    UART Halt TX Register

The baud rate is set using a 16-bit Baud Rate Divisor, according to the following equation:

BAUDDIV = sclk / (16 × Baud rate)        (13.2)

where sclk is the frequency of the UART serial clock, which is configured by the Clock Manager device. The default frequency of the clock is 24 MHz. BAUDDIV is stored in two registers: UART_DLL holds the least significant 8 bits, and UART_DLH holds the most significant 8 bits. Thus BAUDDIV should be calculated as a 16-bit unsigned integer. Note that for high baud rates, it may not be possible to get exactly the rate desired. For example, a baud rate of 115200 would require a divisor of 13.0208333…. Since the baud rate divisor can only be given as an integer, the rate must be based on a divisor of 13, so the true baud rate will be 24000000/(16 × 13) ≈ 115384.6, or about 0.16% faster than desired. Although slightly fast, this is well within the tolerance for RS-232 communication.

UART_RBR: The UART Receive Buffer Register is used to receive data, 1 byte at a time. If the receive FIFO is enabled, then as the UART receives data, it places the data into a receive FIFO. Reading from this address removes 1 byte from the receive FIFO. If the FIFO becomes full and another data byte arrives, then the new data are lost and an overrun error occurs. Table 13.16 shows the layout of the receive buffer register.

Table 13.16

pcDuino UART receive buffer register

Bit   Name  Description  Values
7–0   RBR   Data         Read only: one byte of received data. Bit 7 of LCR must be zero.
31–8  -     Unused


UART_THR: Writing to the Transmit Holding Register will cause that byte to be transmitted by the UART. If the transmit FIFO is enabled, then the byte will be added to the end of the transmit FIFO. If the FIFO is empty, then the UART will begin transmitting the byte immediately. If the FIFO is full, then the new data byte will be lost. Table 13.17 shows the layout of the transmit holding register.

Table 13.17

pcDuino UART transmit holding register

Bit   Name  Description  Values
7–0   THR   Data         Write only: one byte of data to transmit. Bit 7 of LCR must be zero.
31–8  -     Unused


UART_DLL: The UART Divisor Latch Low register is used to set the least significant byte of the baud rate divisor. When bit 7 of the Line Control Register is set to one, writing to this address will access the DLL register. If bit 7 of the Line Control Register is set to zero, then writing to this address will access the transmit holding register. Table 13.18 shows the layout of the UART_DLL register.

Table 13.18

pcDuino UART divisor latch low register

Bit   Name  Description  Values
7–0   DLL   Data         Write only: least significant eight bits of the Baud Rate Divisor. Bit 7 of LCR must be one.
31–8  -     Unused


UART_DLH: The UART Divisor Latch High register is used to set the most significant byte of the baud rate divisor. When bit 7 of the Line Control Register is set to one, writing to this address will access the DLH register. If bit 7 of the Line Control Register is set to zero, then writing to this address will access the Interrupt Enable Register rather than the Divisor Latch High register. Table 13.19 shows the layout of the UART_DLH register.
If the two Divisor Latch Registers (DLL and DLH) are set to zero, the baud clock is disabled and no serial communications occur. DLH should be set before DLL, and at least eight clock cycles of the UART clock should be allowed to pass before data are transmitted or received.

Table 13.19

pcDuino UART divisor latch high register

Bit   Name  Description  Values
7–0   DLH   Data         Write only: most significant eight bits of the Baud Rate Divisor. Bit 7 of LCR must be one.
31–8  -     Unused


UART_FCR: The UART FIFO Control Register is used to enable or disable the receive and transmit FIFOs (buffers), flush their contents, set the levels at which the transmit and receive FIFOs trigger an interrupt, and control Direct Memory Access (DMA). Table 13.20 shows the layout of the UART_FCR register.

Table 13.20

pcDuino UART FIFO control register

Bit   Name    Description
0     FIFOE   FIFO Enable. 0: transmit and receive FIFOs disabled; 1: transmit and receive FIFOs enabled
1     RFIFOR  Receive FIFO Reset: writing a 1 to this bit causes the receive FIFO to be reset and then continue normal operation
2     XFIFOR  Transmit FIFO Reset: writing a 1 to this bit causes the transmit FIFO to be reset and then continue normal operation
3     DMAM    DMA Mode. 0: Mode 0; 1: Mode 1
5–4   TET     Transmit Empty Trigger: controls the level at which the Transmit Holding Register Empty interrupt is triggered. 00: FIFO is completely empty; 01: there are two characters in the FIFO; 10: the FIFO is 25% full; 11: the FIFO is 50% full. This setting has no effect if THRE_MODE_USER is disabled.
7–6   RT      Receive Trigger: controls the level at which the Received Data Available interrupt is triggered. 00: there is one character in the FIFO; 01: the FIFO is 25% full; 10: the FIFO is 50% full; 11: there is room for two more characters in the FIFO. This setting has no effect if THRE_MODE_USER is disabled.
31–8  -       Unused


UART_LCR: The Line Control Register is used to control the parity, number of data bits, and number of stop bits for the serial port. Bit 7 also controls which registers are mapped at offsets 0, 4, and 8 from the device base address. Table 13.21 shows the layout of the UART_LCR register.

Table 13.21

pcDuino UART line control register

Bit   Name  Description
1–0   DLS   Number of data bits. 00: 5 data bits; 01: 6 data bits; 10: 7 data bits; 11: 8 data bits
2     STOP  Number of stop bits used for transmitting and receiving data. 0: 1 stop bit; 1: 1.5 stop bits if DLS is 00, otherwise 2 stop bits
3     PEN   Parity Enable. 0: parity disabled; 1: parity enabled
4     EPS   Even Parity Select. 0: odd parity; 1: even parity
5     -     Unused
6     BC    Writing a one to this bit causes a break to be sent. This bit must be set to zero for normal operation.
7     DLAB  The Divisor Latch Access Bit controls the behavior of other registers. 0: the RBR, THR, and IER registers are accessible (RBR for reads at offset 0, THR for writes at offset 0); 1: the DLL and DLH registers are accessible
31–8  -     Unused


UART_LSR: The Line Status Register is used to read status information from the UART. Table 13.22 shows the layout of the UART_LSR register.

Table 13.22

pcDuino UART line status register

Bit   Name     Description
0     DR       When the Data Ready bit is 1, at least one byte is ready to be read from the receive FIFO or RBR.
1     OE       When the Overrun Error bit is 1, an overrun error occurred for the byte at the top of the receive FIFO.
2     PE       When the Parity Error bit is 1, a parity error occurred for the byte at the top of the receive FIFO.
3     FE       When the Framing Error bit is 1, a framing error occurred for the byte at the top of the receive FIFO.
4     BI       When the Break Interrupt bit is 1, a break has been received.
5     THRE     When the Transmit Holding Register Empty bit is 1, there are no bytes waiting to be transmitted, but there may be a byte currently being transmitted.
6     TEMT     When the Transmitter Empty bit is 1, there are no bytes waiting to be transmitted and no byte currently being transmitted.
7     FIFOERR  When this bit is 1, an error (PE, FE, or BI) has occurred in the receive FIFO. This bit is cleared when the Line Status Register is read.
31–8  -        Unused

UART_USR: The UART Status Register is used to read information about the status of the transmit and receive FIFOs, and the current state of the receiver and transmitter. Table 13.23 shows the layout of the UART_USR register. This register contains essentially the same information as the Flag Register (UART_FR) in the Raspberry Pi UART.

Table 13.23

pcDuino UART status register

Bit   Name  Description
0     BUSY  When the Busy bit is 1, the UART is currently busy; when it is 0, the UART is idle or inactive.
1     TFNF  When the Transmit FIFO Not Full bit is 1, at least one more byte can be safely written to the transmit FIFO.
2     TFE   When the Transmit FIFO Empty bit is 1, there are no bytes remaining in the transmit FIFO.
3     RFNE  When the Receive FIFO Not Empty bit is 1, at least one byte is waiting to be read from the receive FIFO.
4     RFF   When the Receive FIFO Full bit is 1, there is no more room in the receive FIFO. If data are not read before the next character arrives, an overrun error will occur.
31–5  -     Unused

UART_TFL: The UART Transmit FIFO Level register allows the programmer to determine exactly how many bytes are currently in the transmit FIFO. Table 13.24 shows the layout of the UART_TFL register.

Table 13.24

pcDuino UART transmit FIFO level register

Bit   Name  Description
6–0   TFL   The Transmit FIFO Level field contains the number of bytes currently in the transmit FIFO.
31–7  -     Unused

UART_RFL: The UART Receive FIFO Level register allows the programmer to determine exactly how many bytes are currently in the receive FIFO. Table 13.25 shows the layout of the UART_RFL register.

Table 13.25

pcDuino UART receive FIFO level register

Bit   Name  Description
6–0   RFL   The Receive FIFO Level field contains the number of bytes currently in the receive FIFO.
31–7  -     Unused

UART_HALT: The UART transmit halt register is used to halt the UART so that it can be reconfigured. After the configuration is performed, it is then used to signal the UART to restart with the new settings. It can also be used to invert the receive and transmit polarity. Table 13.26 shows the layout of the UART_HALT register.

Table 13.26

pcDuino UART transmit halt register

Bit   Name            Description
0     -               Unused
1     CHCFG_AT_BUSY   Setting this bit to 1 allows the Line Control Register (except the DLAB bit) and the baud rate to be changed even while the UART is busy. When this bit is 0, changes can only be made when the BUSY bit in the UART Status Register is 0.
2     CHANGE_UPDATE   After writing 1 to CHCFG_AT_BUSY and performing the configuration, write 1 to this bit to signal that the UART should restart with the new configuration. This bit stays at 1 while the new configuration is loaded, and returns to 0 when the restart is complete.
3     -               Unused
4     SIR_TX_INVERT   Polarity of the transmitter. 0: normal polarity; 1: polarity inverted
5     SIR_RX_INVERT   Polarity of the receiver. 0: normal polarity; 1: polarity inverted
31–6  -               Unused


Interrupt Control: The UART can signal the CPU by asserting an interrupt when certain conditions occur. This will be covered in more detail in Chapter 14. For now, it is enough to know that there are two additional registers which are used to configure and use the interrupt mechanism. (The FIFO levels that trigger interrupts are set in the UART_FCR register, described previously.)
UART_IER is the interrupt enable register. It is used to enable or disable the generation of interrupts for specific conditions.
UART_IIR is the Interrupt Identity Register. When an interrupt occurs, the CPU can read this register to determine what caused the interrupt.

Additional Registers: There are several additional registers which are not needed for basic use of the UART.
UART_MCR is the Modem Control Register. It is used to configure the port for IrDA mode, enable Automatic Flow Control, and manage the RS-232 RTS and DTR hardware handshaking signals for the ports in which they are implemented. The default configuration disables these extra features.
UART_MSR is the Modem Status Register, which is used to read the state of the RS-232 modem control and status lines on ports that implement them. This register can be ignored unless a telephone modem is being used on the port.
UART_SCH is the Scratch Register. It provides 8 bits of storage for temporary data values. In the days of 8- and 16-bit computers, when the 16550 UART was designed, this extra byte of storage was useful.

13.3 Chapter Summary

Most modern computer systems have some type of Universal Asynchronous Receiver/Transmitter. These are serial communications devices, and are meant to provide communications with other systems using RS-232 (most commonly) or some other standard serial protocol. Modern systems often have a large number of other devices as well. Each device may need its own clock source to drive it at the correct frequency for its operation. The clock sources for all of the devices are often controlled by yet another device: the clock manager.

Although two systems may have different UARTs, these devices perform the same basic functions. The specifics about how they are programmed will vary from one system to another. However, there is always enough similarity between devices of the same class that a programmer who is familiar with one specific device can easily learn to program another similar device. The more experience a programmer has, the less time it takes to learn how to control a new device.

Exercises

13.1 Write a function for setting the PWM clock on the Raspberry Pi to 2 MHz.

13.2 The UART_GET_BYTE function in Listing 13.1 contains skeleton code for handling errors, but does not actually do anything when errors occur. Describe at least two ways that the errors could be handled.

13.3 Listing 13.1 provides four functions for managing the UART on the Raspberry Pi. Write equivalent functions for the pcDuino UART.

Chapter 14

Running Without an Operating System

Abstract

This chapter starts by describing the extra responsibilities that the programmer must assume when writing code to run without an operating system (bare metal). It then explains privileged and user modes and describes all of the privileged modes available on the ARM processor. Next, it gives an overview of exception processing, and provides example code for setting up the vector table stubs for exception handling functions on the ARM processor. Next, it describes the boot processes on the Raspberry Pi and the pcDuino. After that, it shows how to write a basic bare metal program, without any exception processing. The chapter finishes by showing a more efficient version of the bare metal program using an interrupt.

Keywords

Bare metal; Exception; Vector table; Exception handler; Sleep mode; User mode; Privileged mode; Startup code; Linker script; Boot loader; Interrupt

The previous chapters assumed that the software would be running in user mode under an operating system. Sometimes, it is necessary to write assembly code to run on “bare metal,” which simply means: without an operating system. For example, when we write an operating system kernel, it must run on bare metal and a significant part of the code (especially during the boot process) must be written in assembly language. Coding on bare metal is useful to deeply understand how the hardware works and what happens in the lowest levels of an operating system. There are some significant differences between code that is meant to run under an operating system and code that is meant to run on bare metal.

The operating system takes care of many details for the programmer. For instance, it sets up the stack, text, and data sections, initializes static variables, provides an interface to input and output devices, and gives the programmer an abstracted view of the machine. When accessing data on a disk drive, the programmer uses the file abstraction. The underlying hardware only knows about blocks of data. The operating system provides the data structures and operations which allow the programmer to think of data in terms of files and streams of bytes. A user program may be scattered in physical memory, but the hardware memory management unit, managed by the operating system, allows the programmer to view memory as a simple memory map (such as shown in Fig. 1.7). The programmer uses system calls to access the abstractions provided by the operating system. On bare metal, there are no abstractions, unless the programmer creates them.

However, there are some software packages to help bare-metal programmers. For example, Newlib is a C standard library intended for use in bare-metal programs. Its major features are that:

 it implements the hardware-independent parts of the standard C library,

 for I/O, it relies on only a few low-level functions that must be implemented specifically for the target hardware, and

 many target machines are already supported in the Newlib source code.

To support a new machine, the programmer only has to write a few low-level functions in C and/or Assembly to initialize the system and perform low-level I/O on the target hardware.

14.1 ARM CPU Modes

Many early computers were not capable of protecting the operating system from user programs. That problem was solved mostly by building CPUs that support multiple “levels of privilege” for running programs. Almost all modern CPUs have the ability to operate in at least two modes:

User mode is the mode that normal user programs use when running under an operating system, and

Privileged mode is reserved for operating system code. There are operations that can be performed in privileged mode which cannot be performed in user mode.

The ARM processor provides six privileged modes and one user mode. Five of the privileged modes have their own stack pointer (r13) and link register (r14). When the processor mode is changed, the corresponding link register and stack pointer become active, “replacing” the user stack pointer and link register.

In any of the six privileged modes, the link registers and stack pointers of the other modes can be accessed. The privileged-mode stack pointers and link registers are not accessible from user mode. One of the privileged modes, FIQ, has five additional registers which become active when the processor enters FIQ mode. These registers “replace” registers r8 through r12. Additionally, five of the privileged modes have a Saved Program Status Register (SPSR). When one of those privileged modes is entered, the CPSR is copied into the corresponding SPSR. This allows the CPSR to be restored to its original contents when the privileged code returns to the previously active mode. The full register set for all modes is shown in Table 14.1. Registers r0 through r7 and the program counter are shared by all modes. Some processors have an additional monitor mode, introduced as part of the TrustZone Security Extensions.

Table 14.1

The ARM user and system registers

Register      usr/sys    svc        abt        und        irq        fiq
r0 – r7       shared by all modes
r8            r8         r8         r8         r8         r8_fiq
r9            r9         r9         r9         r9         r9_fiq
r10           r10        r10        r10        r10        r10_fiq
r11 (fp)      r11        r11        r11        r11        r11_fiq
r12 (ip)      r12        r12        r12        r12        r12_fiq
r13 (sp)      r13        r13_svc    r13_abt    r13_und    r13_irq    r13_fiq
r14 (lr)      r14        r14_svc    r14_abt    r14_und    r14_irq    r14_fiq
r15 (pc)      shared by all modes
CPSR          shared by all modes
SPSR          —          SPSR_svc   SPSR_abt   SPSR_und   SPSR_irq   SPSR_fiq

All of the bits of the Program Status Register (PSR) are shown in Fig. 14.1. The processor mode is selected by writing a bit pattern into the mode bits (M[4:0]) of the PSR. The bit pattern assignment for each processor mode is shown in Table 14.2. Not all combinations of the mode bits define a valid processor mode. An illegal value programmed into M[4:0] causes the processor to enter an unrecoverable state. If this occurs, a hardware reset must be used to restart the processor. Programs running in user mode cannot modify these bits directly. User programs can only change the processor mode by executing the software interrupt (swi) instruction (also known as the svc instruction), which automatically gives control to privileged code in the operating system. The hardware is carefully designed so that the user program cannot run its own code in privileged mode.

f14-01-9780128036983
Figure 14.1 The ARM process status register.

Table 14.2

Mode bits in the PSR

M[4:0]   Mode   Name                    Register Set
10000    usr    User                    R0–R14, CPSR, PC
10001    fiq    Fast Interrupt          R0–R7, R8_fiq–R14_fiq, CPSR, SPSR_fiq, PC
10010    irq    Interrupt Request       R0–R12, R13_irq, R14_irq, CPSR, SPSR_irq, PC
10011    svc    Supervisor              R0–R12, R13_svc, R14_svc, CPSR, SPSR_svc, PC
10111    abt    Abort                   R0–R12, R13_abt, R14_abt, CPSR, SPSR_abt, PC
11011    und    Undefined Instruction   R0–R12, R13_und, R14_und, CPSR, SPSR_und, PC
11111    sys    System                  R0–R14, CPSR, PC


The swi instruction does not really cause an interrupt, but the hardware and operating system handle it in a very similar way. The software interrupt is used by user programs to request that the operating system perform some task on their behalf. Another general class of interrupt is the “hardware interrupt.” This class of interrupt may occur at any time and is used by hardware devices to signal that they require service. Another type of interrupt may be generated within the CPU when certain conditions arise, such as attempting to execute an unknown instruction. These are generally known as “exceptions” to distinguish them from hardware interrupts. On the ARM processor, there are three bits in the CPSR which affect interrupt processing:

I: when set to one, normal hardware interrupts are disabled,

F: when set to one, fast hardware interrupts are disabled, and

A: (only on ARMv6 and later processors) when set to one, imprecise aborts are disabled (this is an abort on a memory write that has been held in a write buffer in the processor and not written to memory until later, perhaps after another abort).

Programs running in user mode cannot modify these bits. Therefore, the operating system gains control of the CPU whenever an interrupt occurs and the user program cannot disable interrupts and continue to run. Most operating systems use a hardware timer to generate periodic interrupts, thus they are able to regain control of the CPU every few milliseconds.
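The mode encodings of Table 14.2 and the interrupt-mask bits can be captured concretely in C. This is a host-side illustration, not target code; the bit positions used (I in bit 7, F in bit 6 of the CPSR) are the architectural ones.

```c
#include <stdint.h>

/* Mode encodings for M[4:0] of the CPSR, from Table 14.2. */
enum arm_mode {
    MODE_USR = 0x10, MODE_FIQ = 0x11, MODE_IRQ = 0x12, MODE_SVC = 0x13,
    MODE_ABT = 0x17, MODE_UND = 0x1B, MODE_SYS = 0x1F
};

#define CPSR_I (1u << 7)   /* 1 = normal interrupts disabled */
#define CPSR_F (1u << 6)   /* 1 = fast interrupts disabled   */

/* Extract the mode field from a CPSR value. */
static inline unsigned cpsr_mode(uint32_t cpsr) { return cpsr & 0x1Fu; }

/* Build a CPSR value with the mode field replaced, which is what an
   msr instruction executed by privileged code accomplishes. */
static inline uint32_t cpsr_set_mode(uint32_t cpsr, enum arm_mode m) {
    return (cpsr & ~0x1Fu) | (uint32_t)m;
}
```

For example, a CPSR value of 0xD3 decodes as supervisor mode with both interrupt types disabled, which is the state the processor is in immediately after reset.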

14.2 Exception Processing

Most of the privileged modes are entered automatically by the hardware when certain exceptional circumstances occur. For example, when a hardware device needs attention, it can signal the processor by causing an interrupt. When this occurs, the processor immediately enters IRQ mode and begins executing the IRQ exception handler function. Some devices can cause a fast interrupt, which causes the processor to immediately enter FIQ mode and begin executing the FIQ exception handler function. There are six possible exceptions that can occur, each one corresponding to one of the six privileged modes. Each exception must be handled by a dedicated function, with one additional function required to handle CPU reset events. The first instruction of each of these seven exception handlers is stored in a vector table at a known location in memory (usually address 0). When an exception occurs, the CPU automatically loads the appropriate instruction from the vector table and executes it. Table 14.3 shows the address, exception type, and the mode that the processor will be in, for each entry in the ARM vector table. The vector table usually contains branch instructions. Each branch instruction will jump to the correct function for handling a specific exception type. Listing 14.1 shows a short section of assembly code which provides definitions for the ARM CPU modes.

Table 14.3

ARM vector table

Address       Exception                Mode
0x00000000    Reset                    svc
0x00000004    Undefined Instruction    und
0x00000008    Software Interrupt       svc
0x0000000C    Prefetch Abort           abt
0x00000010    Data Abort               abt
0x00000014    (Reserved)               —
0x00000018    Interrupt Request        irq
0x0000001C    Fast Interrupt Request   fiq
f14-02-9780128036983
Listing 14.1 Definitions for ARM CPU modes.

Many bare-metal programs consist of a single thread of execution running in user mode to perform some task. This main program is occasionally interrupted by the occurrence of some exception. The exception is processed, and then control returns to the main thread. Fig. 14.2 shows the sequence of events when an exception occurs in such a system. The main program typically would be running with the CPU in user mode. When the exception occurs, the CPU executes the corresponding instruction in the vector table, which branches to the exception handler. The exception handler must save any registers that it is going to use, execute the code required to handle the exception, then restore the registers. When it returns to the user mode process, everything will be as it was before the exception occurred. The user mode program continues executing as if the exception never occurred.

f14-03-9780128036983
Figure 14.2 Basic exception processing.

More complex systems may have multiple tasks, threads of execution, or user processes running concurrently. In a single-processor system, only one task, thread, or user process can actually be executing at any given instant, but when an exception occurs, the exception handler may change the currently active task, thread, or user process. This is the basis for all modern multiprocessing systems. Fig. 14.3 shows how an exception may be processed on such a system. It is common on multi-processing systems for a timer device to be used to generate periodic interrupts, which allows the currently active task, thread, or user process to be changed at a fixed frequency.

f14-04-9780128036983
Figure 14.3 Exception processing with multiple user processes.

When any exception occurs, it causes the ARM CPU hardware to perform a very well-defined sequence of actions:

1. The CPSR is copied into the SPSR for the mode corresponding to the type of exception that has occurred.

2. The CPSR mode bits are changed, switching the CPU into the appropriate privileged mode.

3. The banked registers for the new mode become active.

4. The I bit of the CPSR is set, which disables normal interrupts.

5. If the exception was an FIQ, or if a reset has occurred, then the F bit is also set, disabling fast interrupts.

6. The program counter is copied to the link register for the new mode.

7. The program counter is loaded with the address in the vector table corresponding with the exception that has occurred.

8. The processor then fetches the next instruction using the program counter as usual. However, the program counter has been set so that it loads an instruction from the vector table.

The instruction in the vector table should cause the CPU to branch to a function which handles the exception. At the end of that function, the program counter must be loaded with the address of the instruction where the exception occurred, and the SPSR must be copied back into the CPSR. That will cause the processor to branch back to where it was when the exception occurred, and return to the mode that it was in at that time.
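The entry sequence above can be sketched as a toy simulation in C. The struct, the single banked SPSR/LR pair, and the omission of the pipeline offset and the F bit (step 5, which applies only to FIQ and reset) are illustrative simplifications, not a model of the real register file.

```c
#include <stdint.h>

#define PSR_I (1u << 7)   /* IRQ-disable bit in the CPSR */

/* Toy model of the registers involved in IRQ exception entry. */
struct cpu {
    uint32_t cpsr, pc;
    uint32_t spsr_irq, lr_irq;   /* banked registers for IRQ mode */
};

/* Steps 1-7 of the entry sequence, for an IRQ exception. */
static void take_irq(struct cpu *c) {
    c->spsr_irq = c->cpsr;                 /* 1: save CPSR into SPSR_irq   */
    c->cpsr = (c->cpsr & ~0x1Fu) | 0x12;   /* 2-3: switch to IRQ mode      */
    c->cpsr |= PSR_I;                      /* 4: disable normal interrupts */
    c->lr_irq = c->pc;                     /* 6: save the return address   */
    c->pc = 0x18;                          /* 7: IRQ entry in vector table */
}
```

Returning from the exception reverses the process: the saved SPSR goes back into the CPSR and the banked link register goes back into the program counter, in a single instruction.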

14.2.1 Handling Exceptions

Listing 14.2 shows in detail how the vector table is initialized. The vector table contains eight identical instructions. These instructions load the program counter, which causes a branch. In each case, the program counter is loaded with a value at the memory location that is 32 bytes greater than the corresponding load instruction. An offset of 24 is used because the program counter will have advanced 8 bytes by the time the load instruction is executed. The addresses of the exception handlers have been stored in a second table that begins at an address 32 bytes after the first load instruction. Thus, each instruction in the vector table loads a unique address into the program counter. Note that one of the slots in the vector table is not used and is reserved by ARM for future use. That slot is treated like all of the others, but it will never be used on any current ARM processor.

f14-05-9780128036983
Listing 14.2 Function to set up the ARM exception table.
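The offset arithmetic can be checked with a few lines of host-side C (an illustration, not target code): the pc visible to an executing ARM instruction reads 8 bytes ahead of that instruction's address, so an encoded offset of 24 reaches the literal stored 32 bytes after the instruction.

```c
#include <stdint.h>

/* When "ldr pc, [pc, #24]" at address A executes, the pc it sees is
   A + 8 (pipeline offset), so the load fetches from A + 8 + 24 =
   A + 32: the matching slot in the handler-address table placed
   32 bytes after the vector table. */
static uint32_t vector_literal_address(uint32_t instr_addr) {
    return (instr_addr + 8) + 24;
}
```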

Listing 14.3 shows the stub functions for each of the exception handlers.

f14-06a-9780128036983f14-06b-9780128036983
Listing 14.3 Stubs for the exception handlers.

Note that the return sequence depends on the type of exception. For some exceptions, the return address must be adjusted. This is because the program counter may have been advanced past the instruction where the exception occurred. These stub functions simply return the processor to the mode and location at which the exception occurred. To be useful, they will need to be extended significantly. Note that these functions all return using a data processing instruction with the optional s specified and with the program counter as the destination register. This special form of data processing instruction indicates that the SPSR should be copied into the CPSR at the same time that the program counter is loaded with the return address. Thus, the function returns to the point where the exception occurred, and the processor switches back into the mode that it was in when the exception occurred.

A special form of the ldm instruction can also be used to return from an exception processing function. In order to use that method, the exception handler should start by adjusting the link register (depending on the type of exception) and then pushing it onto the stack. The handler should also push any other registers that it will need to use. At the end of the function, an ldmfd is used to restore the registers, but instead of restoring the link register, it loads the program counter. Also, a caret (^) is added to the end of the instruction. Listing 14.4 shows the skeleton for an exception handler function using this method.

f14-07-9780128036983
Listing 14.4 Skeleton for an exception handler.

14.3 The Boot Process

In order to create a bare-metal program, we must understand what the processor does when power is first applied or after a reset. The ARM CPU begins to execute code at a predetermined address. Depending on the configuration of the ARM processor, the program counter starts either at address 0 or 0xFFFF0000. In order for the system to work, the startup code must be at the correct address when the system starts up.

On the Raspberry Pi, when power is first applied, the ARM CPU is disabled and the graphics processing unit (GPU) is enabled. The GPU runs a program that is stored in ROM. That program, called the first stage boot loader, reads the second stage boot loader from a file named bootcode.bin on the SD card. That program enables the SDRAM, and then loads the third stage boot loader, start.elf. At this point, some basic hardware configuration is performed, and then the kernel is loaded to address 0x8000 from the kernel.img file on the SD card. Once the kernel image file is loaded, a “b #0x8000” instruction is placed at address 0, and the ARM CPU is enabled. The ARM CPU executes the branch instruction at address 0, which immediately jumps to the kernel code at address 0x8000.

To run a bare-metal program on the Raspberry Pi, it is only necessary to build an executable image and store it as kernel.img on the SD card. Then, the boot process will load the bare-metal program instead of the Linux kernel image. Care must be taken to ensure that the linker prepares the program to run at address 0x8000 and places the first executable instruction at the beginning of the image file. It is also important to make a copy of the original kernel image so that it can be restored (using another computer). If the original kernel image is lost, then there will be no way to boot Linux until it is replaced.

The pcDuino uses u-boot, which is a highly configurable open-source boot loader. The boot loader is configured to attempt booting from the SD card. If a bootable SD card is detected, then it is used. Otherwise, the pcDuino boots from its internal NAND flash. In either case, u-boot finds the Linux kernel image file, named uImage, loads it at address 0x40008000, and then jumps to that location. The easiest way to run bare-metal code on the pcDuino is to create a duplicate of the operating system on an SD card, then replace the uImage file with another executable image. Care must be taken to ensure that the linker prepares the program to run at address 0x40008000 and places the first executable instruction at the beginning of the image file. If the SD card is inserted, then the bare-metal code will be loaded. Otherwise, it will boot normally from the NAND flash memory.

14.4 Writing a Bare-Metal Program

A bare-metal program should be divided into several files. Some of the code may be written in assembly, and other parts in C or some other language. The initial startup code, and the entry and exit from exception handlers, must be written in assembly. However, it may be much more productive to write the main program and the remainder of the exception handlers as C functions and have the assembly code call them.

14.4.1 Startup Code

Other than the code being loaded at different addresses, there is very little difference between getting bare-metal code running on the Raspberry Pi and the pcDuino. For either platform, the bare-metal program must include some start-up code. The startup code will:

 initialize the stack pointers for all of the modes,

 set up interrupt and exception handling,

 initialize the .bss section,

 configure the CPU and critical systems (optional),

 set up memory management (optional),

 set up process and/or thread management (optional),

 initialize devices (optional), and

 call the main function.

The startup code requires some knowledge of the target platform, and must be at least partly written in assembly language. Listing 14.5 shows a function named _start which sets up the stacks, initializes the .bss section, calls a function to set up the vector table, then calls the main function:

f14-08a-9780128036983f14-08b-9780128036983f14-08c-9780128036983
Listing 14.5 ARM startup code.

The first task for the startup code is to ensure that the stack pointer for each processor mode is initialized. When an exception or interrupt occurs, the processor will automatically change into the appropriate mode and begin executing an exception handler, using the stack pointer for that mode. Hardware interrupts can be disabled, but some exceptions cannot be disabled. In order to guarantee correct operation, a stack must be set up for each processor mode, and an exception handler must be provided. The exception handler does not actually have to do anything.

On the Raspberry Pi, memory is mapped to begin at address 0, and all models have at least 256 MB of memory. Therefore, it is safe to assume that the last valid memory address is 0x0FFFFFFF. If each mode is given 4 kB of stack space, then all of the stacks together will consume 32 kB, and the initial stack addresses can be easily calculated. Since the C compiler uses a full descending stack, the initial stack pointers can be assigned addresses 0x10000000, 0x0FFFF000, 0x0FFFE000, etc.

For the pcDuino, there is a small amount of memory mapped at address 0, but most of the available memory is in the region between 0x40000000 and 0xBFFFFFFF. The pcDuino has at least 1 GB of memory. One possible way to assign the stack locations is: 0x50000000, 0x4FFFF000, 0x4FFFE000, etc. This assignment of addresses will make it easy to write one piece of code to set up the stacks for either the Raspberry Pi or the pcDuino.
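The stack-address arithmetic described for both boards can be expressed directly. The 4 kB region size and the two base addresses come from the text; the helper function name is of course illustrative.

```c
#include <stdint.h>

/* Initial stack pointer for a full descending stack: mode i is given
   a 4 kB region ending 4096*i bytes below the chosen top-of-stack
   address (0x10000000 on the Raspberry Pi, 0x50000000 on the pcDuino). */
static uint32_t mode_stack_top(uint32_t base, unsigned mode_index) {
    return base - mode_index * 0x1000u;
}
```

Keeping the per-mode offsets identical on both platforms means only the base address differs, so one piece of startup code serves both boards.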

After initializing the stacks, the startup code must set all bytes in the .bss section to zero. Recall that the .bss section is used to hold data that is initialized to zero, but the program file does not actually contain all of those zeros. Programs running under an operating system can rely on the C standard library to initialize the .bss section. If a bare-metal program is not linked with a C library, it must set all of the bytes in the .bss section to zero itself.
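A minimal .bss-clearing loop might look like the following. The bounds of the section normally come from symbols defined in the linker script (commonly named __bss_start__ and __bss_end__, though the names vary by script); taking them as parameters here lets the sketch run on a host.

```c
#include <stdint.h>

/* Clear a byte range, as startup code must do for the .bss section
   before calling main(). */
static void zero_bss(uint8_t *start, uint8_t *end) {
    while (start < end)
        *start++ = 0;
}
```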

14.4.2 Main Program

The final part of this bare-metal program is the main function. Listing 14.6 shows a very simple main program which reads from three GPIO pins which have pushbuttons connected to them, and controls three other pins that have LEDs connected to them. When a button is pressed the LED associated with it is illuminated. The only real difference between the pcDuino and Raspberry Pi versions of this program is in the functions which drive the GPIO device. Therefore, those functions have been removed from the main program file. This makes the main program portable; it can run on the pcDuino or the Raspberry Pi. It could also run on any other ARM system, with the addition of another file to implement the mappings and functions for using the GPIO device for that system.

f14-09-9780128036983
Listing 14.6 A simple main program.
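The structure of such a main loop can be sketched in C. The pin numbers are illustrative, and the GPIO driver is simulated with an array; on real hardware the two helpers would call the platform layer's GPIO_get_pin and GPIO_set_pin functions.

```c
/* Simulated pin states: buttons on pins 0-2 drive LEDs on pins 3-5
   (assignments are illustrative, not taken from either board). */
int sim_pins[6];

static int  gpio_get(int pin)            { return sim_pins[pin]; }
static void gpio_set(int pin, int state) { sim_pins[pin] = state; }

/* One iteration of the main loop: each LED mirrors its button. */
void mirror_buttons(void) {
    for (int i = 0; i < 3; i++)
        gpio_set(i + 3, gpio_get(i));
}
```

Because only gpio_get and gpio_set touch the hardware, porting the loop to another board means replacing those two functions and nothing else.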

14.4.3 The Linker Script

When compiling the program, it is necessary to perform a few extra steps to ensure that the program is ready to be loaded and run by the boot code. The last step in compiling a program is to link all of the object files together, possibly also including some object files from system libraries. A linker script is a file that tells the linker which sections to include in the output file, as well as which order to put them in, what type of file is to be produced, and what is to be the address of the first instruction. The default linker script used by GCC creates an ELF executable file, which includes startup code from the C library and also includes information which tells the loader where the various sections reside in memory. The default linker script creates a file that can be loaded by the operating system kernel, but which cannot be executed on bare metal.

For a bare-metal program, the linker must be configured to link the program so that the first instruction of the startup function is given the correct address in memory. This address depends on how the boot loader will load and execute the program. On the Raspberry Pi this address is 0x8000, and on the pcDuino this address is 0x40008000. The linker will automatically adjust any other addresses as it links the code together. The most efficient way to accomplish this is by providing a custom linker script to be used instead of the default system script. Additionally, either the linker must be instructed to create a flat binary file, rather than an ELF executable file, or a separate program (objcopy) must be used to convert the ELF executable into a flat binary file.

Listing 14.7 is an example of a linker script that can be used to create a bare-metal program. The first line is just a comment. The second line specifies the name of the function where the program begins execution. In this case, it specifies that a function named _start is where the program will begin execution. Next, the file specifies the sections that the output file will contain. For each output section, it lists the input sections that are to be used.

f14-10-9780128036983
Listing 14.7 A sample Gnu linker script.

The first output section is the .text section, and it is composed of any sections whose names end in .text.boot followed by any sections whose names end in .text. In Listing 14.5, the _start function was placed in the .text.boot section, and it is the only thing in that section. Therefore the linker will put the _start function at the very beginning of the program. The remaining text sections will be appended, and then the remaining sections, in the order that they appear. After the sections are concatenated together, the linker will make a pass through the resulting file, correcting the addresses of branch and load instructions as necessary so that the program will execute correctly.
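A minimal script of the kind described might look like the following sketch. The load address 0x8000 is the Raspberry Pi case from the text; the section placement and formatting are illustrative, and the book's Listing 14.7 is the authoritative version.

```
/* Illustrative bare-metal linker script */
ENTRY(_start)
SECTIONS
{
    . = 0x8000;                        /* load address (Raspberry Pi) */
    .text   : { *(.text.boot) *(.text) }
    .rodata : { *(.rodata) }
    .data   : { *(.data) }
    .bss    : { *(.bss) *(COMMON) }
}
```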

14.4.4 Putting it All Together

Compiling a program that consists of multiple source files, a custom linker script, and special commands to create an executable image can become tedious. The make utility was created specifically to help in this situation. Listing 14.8 shows a make script that can be used to combine all of the elements of the program together and produce a uImage file for the pcDuino and a kernel.img file for the Raspberry Pi. Listing 14.9 shows how the program can be built by typing “make” at the command line.

f14-11-9780128036983
Listing 14.8 A sample make file.
f14-12-9780128036983
Listing 14.9 Running make to build the image.
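In outline, such a make script contains rules like the following. The arm-none-eabi- toolchain prefix and the object file names are assumptions for illustration; the book's Listing 14.8 is the authoritative version.

```
CROSS = arm-none-eabi-
OBJS  = start.o main.o gpio.o

# Convert the linked ELF file into a flat binary the boot loader can run.
kernel.img: program.elf
	$(CROSS)objcopy -O binary program.elf kernel.img

# Link with the custom linker script instead of the default one.
program.elf: $(OBJS) linker.ld
	$(CROSS)ld -T linker.ld -o program.elf $(OBJS)

%.o: %.s
	$(CROSS)as -o $@ $<

%.o: %.c
	$(CROSS)gcc -ffreestanding -c -o $@ $<
```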

14.5 Using an Interrupt

The main program shown in Listing 14.6 is extremely wasteful because it runs the CPU in a loop, repeatedly checking the status of the GPIO pins. It uses far more CPU time (and electrical power) than is necessary. In reality, the pins are unlikely to change state very often, and it is sufficient to check them a few times per second. It only takes a few nanoseconds to check the input pins and set the output pins so the CPU only needs to be running for a few nanoseconds at a time, a few times per second.

A much more efficient implementation would set up a timer to send interrupts at a fixed frequency. Then the main loop can check the buttons, set the outputs, and put the CPU to sleep. Listing 14.10 shows the main program, modified to put the processor to sleep after each iteration of the main loop. The only difference between this main function and the one in Listing 14.6 is the addition of a wfi instruction at line 43. The new implementation will consume far less electrical power and allow the CPU to run cooler, thereby extending its life. However, some additional work must be performed in order to set up the timer and interrupt system before the main function is called.

f14-13-9780128036983
Listing 14.10 An improved main program.

14.5.1 Startup Code

Some changes must be made to the startup code in Listing 14.5 so that after setting up the vector table, it calls a function to initialize the interrupt controller and then calls another function to set up the timer. Listing 14.11 shows the modified startup function.

Lines 50 through 57 have been added to initialize the interrupt controller, enable the timer, and change the CPU into user mode before calling main. Of course, the hardware timers and interrupt controllers on the pcDuino and Raspberry Pi are very different.

14.5.2 Interrupt Controllers

The pcDuino has an ARM Generic Interrupt Controller (GIC-400) device to manage interrupts. The GIC device can handle a large number of interrupts. Each one is a separate input signal to the GIC. The GIC hardware prioritizes each input, and assigns each one a unique integer identifier. When the CPU receives an interrupt, it simply reads the GIC to determine which hardware device signaled the interrupt, calls the function which handles that device, then writes to one of the GIC registers to indicate that the interrupt has been processed. Listing 14.12 provides a few basic functions for managing this device.

f14-15a-9780128036983f14-15b-9780128036983f14-15c-9780128036983f14-15d-9780128036983
Listing 14.12 Functions to manage the pcDuino interrupt controller.
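The read-dispatch-acknowledge sequence described above can be sketched in C. The GIC registers are simulated with plain variables, and interrupt ID 29 is purely illustrative; on hardware, the reads and writes would target the memory-mapped GIC registers at platform-specific addresses.

```c
#define NUM_IRQS 96                 /* illustrative table size */

/* Simulated GIC registers; memory-mapped on real hardware. */
unsigned sim_gic_pending = 0;       /* stands in for interrupt-acknowledge */
unsigned sim_gic_eoi = ~0u;         /* last ID written to end-of-interrupt */

typedef void (*irq_handler_t)(void);
static irq_handler_t handlers[NUM_IRQS];

void irq_register(unsigned id, irq_handler_t fn) {
    if (id < NUM_IRQS)
        handlers[id] = fn;
}

/* Called from the IRQ exception handler: read the interrupt ID,
   service that device, then acknowledge completion. */
void irq_dispatch(void) {
    unsigned id = sim_gic_pending;
    if (id < NUM_IRQS && handlers[id])
        handlers[id]();
    sim_gic_eoi = id;
}

/* Demonstration: register a handler on (hypothetical) ID 29 and
   simulate one interrupt from that source. */
static int timer_ticks = 0;
static void timer_handler(void) { timer_ticks++; }

int demo_one_tick(void) {
    irq_register(29, timer_handler);
    sim_gic_pending = 29;
    irq_dispatch();
    return timer_ticks;
}
```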

The Raspberry Pi has a much simpler interrupt controller. It can enable and disable interrupt sources, and requires that the programmer read up to three registers to determine the source of an interrupt. For our purposes, we only need to manage the ARM timer interrupt. Listing 14.13 provides a few basic functions for using this device to enable the timer interrupt. Extending these functions to provide functionality equal to the GIC would not be very difficult, but would take some time. It would be necessary to set up a mapping from the interrupt bits in the interrupt controller's registers to integer values, so that each interrupt source has a unique identifier. Then the functions could be written to use those identifiers. The result would be a software implementation providing capabilities equivalent to the GIC.

f14-16a-9780128036983f14-16b-9780128036983
Listing 14.13 Functions to manage the Raspberry Pi interrupt controller.

Note that although the devices are very different internally, they perform basically the same function. With the addition of a software driver layer, implemented in Listings 14.12 and 14.13, the devices become interchangeable, and other parts of the bare-metal program do not have to be changed when porting from one platform to the other.

f14-14a-9780128036983f14-14b-9780128036983
Listing 14.11 ARM startup code with timer interrupt.

14.5.3 Timers

The pcDuino provides several timers that could be used; Timer0 was chosen arbitrarily. Listing 14.14 provides a few basic functions for managing this device.

f14-17a-9780128036983f14-17b-9780128036983
Listing 14.14 Functions to manage the pcDuino timer0 device.

The Raspberry Pi also provides several timers that could be used, but the ARM timer is the easiest to configure. Listing 14.15 provides a few basic functions for managing this device:

f14-18a-9780128036983f14-18b-9780128036983
Listing 14.15 Functions to manage the Raspberry Pi timer0 device.

14.5.4 Exception Handling

The final step in writing the bare-metal code to operate in an interrupt-driven fashion is to modify the IRQ handler from Listing 14.3. Listing 14.16 shows a new version of the IRQ exception handler which checks and clears the timer interrupt, then returns to the location and CPU mode that were current when the interrupt occurred. This code works for both platforms.

f14-19-9780128036983
Listing 14.16 IRQ handler to clear the timer interrupt.

14.5.5 Building the Interrupt-Driven Program

Finally, the make file must be modified to include the new source code that was added to the program. Listing 14.17 shows the modified make script. The only change is that two extra object files have been added. When make is run, those files will be compiled and linked with the program. Listing 14.18 shows how the program can be built by typing “make” at the command line.

f14-20a-9780128036983
Listing 14.17 A sample make file.

14.6 ARM Processor Profiles

Since its introduction in 1985 as the Acorn RISC Machine, the flagship processor for Acorn Computers, the ARM processor has gone through many changes. Over the years, ARM processors have always maintained a good balance of simplicity, performance, and efficiency. Although originally intended as a desktop processor, the ARM architecture has been more successful than any other architecture for use in embedded applications. That is at least partially because of good choices made by its original designers. The architectural decisions resulted in a processor that provides relatively high computing power with a relatively small number of transistors. This design also results in relatively low power consumption.

Today, there are almost 20 major versions of the ARM architecture, targeted for everything from smart sensors to desktops and servers, and sales of ARM-based processors outnumber those of all other processor architectures combined. Historically, ARM has given numbers to successive versions of the architecture. With ARMv7, they introduced a simpler scheme to describe different versions of the processor. They divided their processor families into three major profiles:

ARMv7-A: Applications processors are capable of running a full, multiuser, virtual memory, multiprocessing operating system.

ARMv7-R: Real-time processors are for embedded systems that may need powerful processors, cache, and/or large amounts of memory.

ARMv7-M: Microcontroller processors only execute Thumb instructions and are intended for use in very small cost-sensitive embedded systems. They provide low cost, low power, and small size, and may not have hardware floating point or other high-performance features.

In 2014, ARM introduced the ARMv8 architecture. This is the first radical change in the ARM architecture in over 30 years. The new architecture extends the register set to thirty-one 64-bit general purpose registers, and has a completely new instruction set. Compatibility with ARMv7 and earlier code is supported by switching the processor into 32-bit mode, so that it

f14-20b-9780128036983
Listing 14.18 Running make to build the image.

executes the 32-bit ARM instruction set. This is somewhat similar to the way that Thumb instructions are supported on 32-bit ARM cores, except that the switch between 32-bit and 64-bit execution can only occur when the processor takes an exception into privileged code, or returns from one and drops back to the less privileged mode.

14.7 Chapter Summary

Writing bare-metal programs can be a daunting task. However, that task can be made easier by writing and testing code under an operating system before attempting to run it bare metal. There are some functions which cannot be tested in this way. In those cases, it is best to keep those functions as simple as possible. Once the program works on bare metal, extra capabilities can be added.

Interrupt-driven processing is the basis for all modern operating systems. The system timer allows the O/S to take control periodically and select a different process to run on the CPU. Interrupts allow hardware devices to do their jobs independently and signal the CPU when they need service. The ability to restrict user access to devices and certain processor features provides the basis for a secure and robust system.

Exercises

14.1 What are the advantages of a CPU which supports user mode and privileged mode over a CPU which does not?

14.2 What are the six privileged modes supported by the ARM architecture?

14.3 The interrupt handling mechanism is somewhat complex and requires significant programming effort to use. Why is it preferred over simply having the processor poll I/O devices?

14.4 Where does program control transfer to when a hardware interrupt occurs?

14.5 What is the purpose of the Undefined Instruction exception? How can it be used to allow an older processor to run programs that have new instructions? What other uses does it have?

14.6 What is an swi instruction? What is its use in operating systems? What is the key difference between an swi instruction and an interrupt?

14.7 Which of the following operations should be allowed only in privileged mode? Briefly explain your decision for each one.

(a) Execute an swi instruction.

(b) Disable all interrupts.

(c) Read the time-of-day clock.

(d) Receive a packet of data from the network.

(e) Shut down the computer.

14.8 The main program in Listing 14.10 has two different methods to put the processor to sleep waiting for an interrupt. One method is for the Raspberry Pi, while the other is for the pcDuino. In order to compile the code, the correct lines must be uncommented and the unneeded lines must be commented out or removed. Explain two ways to change the code so that exactly the same main program can be used on both systems.

14.9 The programs in this chapter assumed the existence of libraries of functions for controlling the GPIO pins on the Raspberry Pi and the pcDuino. Both libraries provide the same high-level functions, but one operates on the Raspberry Pi GPIO device and the other operates on the pcDuino GPIO device. The C prototypes for the functions are: int GPIO_get_pin(int pin), void GPIO_set_pin(int pin, int state), void GPIO_dir_input(int pin), and void GPIO_dir_output(int pin). Write these libraries in ARM assembly language for both platforms.

14.10 Write an interrupt-driven program to read characters from the serial port on either the Raspberry Pi or the pcDuino. The UART on either system can be configured to send an interrupt when a character is received.
When a character is received through the UART and an interrupt occurs, the character should be echoed by transmitting it back to the sender. The character should also be stored in a buffer. If the character received is a newline ('\n'), or if the buffer becomes full, then the contents of the buffer should be transmitted through the UART. The buffer should then be cleared and prepared to receive more characters.

Index

Note: Page numbers followed by b indicate boxes, f indicate figures and t indicate tables.

A

Absolute difference 339–340
Absolute value 340–341
Abstract data type (ADT) 
in assembly language 138–139
big integer ADT 195–196, 211
in C header file 138
implementation of 137
interface 137
Therac-25 
design flaws 163–165
history of 162–163
X-ray therapy 161
use of 137
word frequency counts 
better performance 150–161
C header for 141–142
C implementation 141–142, 145
C program to compute 140–141
makefile for 141–142, 146
revised makefile for 148–150
sorting by 147–150
wl_print_numerical function 147–150, 157–161
Accessing devices, Linux 365–376
Acorn Archimedes™ 8
Acorn RISC Machine (ARM) processor 8–9
Addition 
in decimal and binary 173b
fixed-point operation 231–232
floating point operation 246–247
subtraction by 172
vector 335–337
VFP 278
American Standard Code for Information Interchange (ASCII) 
control characters 20, 21t
converting character strings to ASCII codes 21–23, 23t, 24t
interpreting data as ASCII strings 23–24, 24t
ISO extensions to ASCII 24–25, 25t
unicode and UTF-8 25–28, 27t
Arbitrary base 
base ten to 11
to decimal, conversion 220–223
Arithmetic and logic unit (ALU) 54–55
Arithmetic instructions, ARM 83–85
Arithmetic instructions, NEON 335–343
absolute difference 339–340
absolute value and negate 340–341
add vector elements pairwise 338–339
count bits 342–343
select maximum/minimum elements 341–342
vector addition and subtraction 335–337
ARM assembly 
automatic variables 118–119
calling scanf and printf 110–111
complex selection 103–104
function call using stack 115–116
for loop re-written as a post-test loop 107–108
post-test loop 106, 108
pre-test loop 105–107
program 36
reverse function implementation 121–122
simple function call 114
structured data type 124–126
unconditional loop 104–105
ARM condition modifiers 59t
ARM CPU modes 432–435
ARM instruction set architecture 95–96
data processing instructions 79–80
arithmetic operations 83–85
comparison operations 81–82
data movement operations 86–87
division operations 89–90
logical operations 85–86
multiply operations with 32-bit results 87–88
multiply operations with 64-bit results 88–89
Operand2 80, 80t, 81t
pseudo-instructions, ARM 93
no operation 93–94
shifts 94–95
special instructions 
accessing CPSR and SPSR 91
count leading zeros 90
software interrupt 91–92
thumb mode 92–93
ARM processor 
architecture 54f
ARM user registers 55–58, 56f, 57f
branch instructions 70
branch 70–71
branch and link 71–72
load/store instructions 60–61
addressing modes 61–63, 61t
exclusive load/store 69–70
multiple register 65–68
single register 64
swap 68–69
profiles 461–464
pseudo-instructions 73
load address 75–76
load immediate 73–75
ARM user program registers 112f
Assembler 38–40
Assembly language 3
ADTs 138–139
reason to learn 4–8
Atomic Energy of Canada Limited (AECL) 161–162

B

Bare-metal programs 
coding on 431
compiling 449
exception processing 435–436
features 432
linker script 447–448
main program 445–447
Raspberry Pi 442
startup code 443–445
writing 442–449
Base address 
clock manager device 407
for GPIO device 378–379
in memory 367
PWM device 398
Big integer ADT 195–216
bigint_adc function 213–216
C source code file 211
factorial function calculation 212
header file 196
Binary division 
constant 190–194
flowchart for 183f
large numbers 194–195
power of two 181
64-bit functions, signed and unsigned 190
32-bit functions, signed and unsigned 190
variable 182–186
Binary multiplication 
algorithm for 175
large numbers 179–181, 180f
power of two 173
signed multiplication 178–179, 179f, 180b
64-bit numbers 175–176
32-bit numbers 176–177
of two variables 173–176
variable by constant 177–178
Binary tree, of word frequency 151f
index added 157f
sorted index 158f
Binimals 223–224
non-terminating, repeating 223b
terminating 224
Bitwise logical operations, NEON 326–327
with immediate data 327–328, 352–353
insertion and selection 328–329
Boot loader 442, 447
Boot process 442
Branch instructions, ARM processor 70
branch 70–71
branch and link 71–72

C

Central processing unit (CPU) 
components and data paths 54–55
description 3–4
C language 
array of integers 124
array of structured data 127
calling scanf and printf 110
complex selection 103
larger function call 114
for loop 106
program 36
using recursion to reverse a string 120–121
Clock Control Unit (CCU) 409
Clock management device 405–409, 406f
control registers 408t
divisor registers 408t
pcDuino CCU 409
Raspberry Pi 406–409
registers 407t
Communications 
parallel 409
serial 409–429
pcDuino UART 422–429
Raspberry Pi UART0 413–422
UART 410–412
Compare instruction 
ARM 81–82
vector 323–324
vector absolute 353–354
VFP 279
Compilation sequence 5, 6f
Compiler, GNU C 38–40
Complex Instruction Set Computing (CISC) processor 8
Computer data 9
base conversion 
base b to decimal 11–12, 12b
base conversion 10t, 11–15
bases, powers-of-two 14–15, 14f
conversion between arbitrary bases 13b
decimal to base b 12, 13b
characters 20–28, 21t, 22t
non-printing 20–21
printing 20, 22t
ISO 24–25
Unicode and UTF-8 25–28
integers 15, 16f
complement representation 16–19, 17f, 18b, 19b
excess-(2^(n−1)−1) representation 16
sign-magnitude representation 15
natural numbers 9–11
Conditional assembly 46–47
Control registers 
clock management device 407, 408t
pcDuino UART FIFO 425t
Raspberry Pi UART 416, 417t
Cosine function 
ARM assembly implementation 251, 257
battery powered systems 260
double precision software float C 259
double precision VFP C 260
factorial terms, formats and constants for 249–251
formats for powers of x 248–249
intermediate calculations 251
performance comparison 259–260
performance implementations 259t
properties 247–248
single precision software float C 259
single precision VFP C 259
table printing 251, 258
32-bit fixed point assembly 259
32-bit fixed point C 259
Count bits 342–343
Current Program Status Register (CPSR) 57–58
accessing 91
flag bits 58, 58t

D

Data conversion instructions 
NEON 321–322
fixed point and single-precision 321–322
half-precision and single-precision 322
vector floating point 
fixed point to single precision 284–285
floating point to integer 282–284
Data frame 410
Data movement instructions 
ARM 86–87
NEON 309–320
change size of elements in vector 311–312
duplicate scalar 312–313
extract elements 313–314
move immediate data 310–311
moving between NEON scalar and integer register 309–310
reverse elements 314–315
swap vectors 315–316
table lookup 317–319
transpose matrix 316–317
zip/unzip vectors 319–320
vector floating point 279–282
ARM register and VFP system register 282
between two VFP registers 279–280
VFP register and one integer register 280–281
VFP register and two integer registers 281
Data processing instructions, ARM 79–80
arithmetic operations 83–85
comparison operations 81–82
data movement operations 86–87
division operations 89–90
logical operations 85–86
multiply operations 
with 64-bit results 88–89
with 32-bit results 87–88
Operand2 80, 80t, 81t
vector floating point 277–279
compare instruction 279
mathematical operations 278
unary operations 277–278
Data register, Raspberry Pi UART 413, 414t
Data section, memory 28–29
Decimal 223–224
to arbitrary base, conversion 220–223
terminating 224
Direct Memory Access (DMA) 377–378
control register 418
Division 
binary 
constant 190–194
flowchart for 183f
large numbers 194–195
power of two 181
64-bit functions, signed and unsigned 190
32-bit functions, signed and unsigned 190
variable 182–186
by constant 236–241
in decimal and binary 181f
fixed-point operation 234–236
floating point operation 247
maintaining precision 236
mixed 235
NEON 343
results of 234–235
signed 235
unsigned 235
of variable by constant 193
VFP 278
Divisor registers 
clock management device 408t
divisor latch high register 424, 425t
divisor latch low register 424, 424t
Double-precision floating point number 
IEEE 754 245–246
sine function 355, 357
Duty cycle 395

E

Exception handling 438–441, 461
skeleton for 441
stub functions 438–441
Exception processing 434–441, 436f
ARM vector table 434–435, 435t
bare-metal programs 435–436
handling exceptions 438–441
skeleton for 441
stub functions 438–441
with multiple user processes 436, 437f
Executing program, memory layout of 28–31, 29f, 30f
Extract elements 313–314

F

Fault Tree Analysis 162
FIFO control register 425t
Fixed-point numbers 
interpreting 226–230
properties of 230–231
Q notation 230
signed 227–228
two’s complement 229
unsigned 226, 228
Fixed-point operation 
addition 231–232
division 
by constant 236–241
maintaining precision 236
mixed 235
results of 234–235
signed 235
unsigned 235
multiplication 232–233
to single-precision 284–285, 321–322
subtraction 231–232
Flags register 414, 415t
Floating-point Exception register (FPEXC) 274
Floating point numbers 
binimal representation 242–243
IEEE 754 
double-precision 245–246
half-precision 243–245
quad-precision 246
single-precision 245
to integer 282–284
Floating point operations 
addition 246–247
division 247
multiplication 247
subtraction 246–247
Floating Point Status and Control Register (FPSCR) 268–273
bits in 268–269, 268f
performance vs. compliance 271–272
vector mode 272–273
Floating-point System ID register (FPSID) 274
Fractional baud rate divisor 414, 416t
Fractional numbers, base conversion 223–225
arbitrary base to decimal 220
decimal to arbitrary base 220–223
powers-of-two 222–223
Full-compliance mode 272
Fused multiply accumulate operation 346

G

General Purpose I/O (GPIO) device 376–392, 395
applications 377–378
features 377–378
GPIO pin event detect status registers 382
GPIO pin pull-up/down registers 381–382
input and output 378f
parallel printer port 377
pcDuino 382–392
detecting GPIO events 390
enabling internal pull-up/pull-down 389–390
function select code assignments 392t
GPIO pins available on 390–392
header pin assignments 391f
reading and setting GPIO pins 388–389
setting GPIO pin function 384–385
pin function select bits 380t
port 376–377
Raspberry Pi 378–382
detecting GPIO events 382
enabling internal pull-up/pull-down 381–382
GPIO pins available on 382
header pin assignments 384f
reading GPIO input pins 381
setting GPIO output pins 380–381
setting GPIO pin function 379–380
Generic Interrupt Controller (GIC) device 449–451
GNU assembler (GAS) 35, 40
directives 40
allocating space for variables and constants 41–43, 42f
conditional assembly 46–47
current section selection 40–41
filling and aligning 43–45
including other source files 47–48
macros 48–50
setting and manipulating symbols 45–47
program structure 36, 38
assembler directives 36–38
assembly instructions 36, 38
comments 37
labels 37
GNU C compiler 38–40, 57

H

Half-precision floating point number 
IEEE 754 243–245
to single-precision 322
Hardware interrupt 434
High-level language 
description 4–5
structured data type 73–74
Hindu-Arabic number system 9–10

I

IBM PC 377
Image data type 138–139
Immediate data 
bitwise logical operations with 327–328, 352–353
data movement NEON instructions 310–311
Information hiding 137
Instruction components 58
immediate values 59–60, 60t
setting and using condition flags 58–59, 58t
Instruction set architecture (ISA) 53
Instruction stream 3
Integer baud rate divisor 414, 416t
Integer mathematics 
big integer ADT 195–216
binary division 
constant 190–194
large numbers 194–195
power of two 181
variable 182–186
binary multiplication by 
large numbers 179–181
power of two 173
signed multiplication 178–179, 180b
two variables 173–176
variable by constant 177–178
division 236, 239
floating point to 282–284
overflow 171
subtraction by addition 172
Integer register 
moving between NEON scalar and 309–310
VFP register and 280–281
Interrupt clear register 418
Interrupt controllers 449–451
Interrupt-driven program 461
Interrupt enable register 429
Interrupt Identity Register 429
Interrupt mask set/clear register 417

L

Least significant bit (LSB) 11
LED, GPIO device 377–378
Line control register 
pcDuino UART 425, 426t
Raspberry Pi UART 416, 416t
Line driver 410
Line status register 426, 427t
Linked list 
index creation 147, 157f
re-ordering 147
sorted index 158f
sorting 147
Linker 38–40, 46
Linker script 447–448
Linux, accessing devices under 365–376
Load and store instructions 60–61
ARM 55–58
addressing modes 61–63, 61t
exclusive load/store 69–70
multiple register 65–68
NEON 302–309
load copies of structure to all lanes 305–307
multiple structures data 307–309
single structure using one lane 303–305, 304t
single register 64
swap 68–69
Load constant 351–352
Loop unrolling 355
Low pass filter 395–396, 398

M

Macros, GNU assembly directives 48–50
Masked interrupt status register 418
Mathematical operations, VFP 278
Memory 
base address in 367
of executing program 28–31, 29f, 30f
hardware address mapping for 366f
on Raspberry Pi 372
Modem Control Register 429
Modem Scratch Register 429
Modem Status Register 429
Monostable multivibrator 400
Most significant bit (MSB) 11
Multiplication 
binary 
algorithm for 175
large numbers 179–181, 180f
power of two 173
signed multiplication 178–179, 179f, 180b
64-bit numbers 175–176
32-bit numbers 176–177
of two variables 173–176
variable by constant 177–178
in decimal and binary 174b
fixed-point operation 232–233
floating point operation 247
mixed 233
NEON 343–351
estimate reciprocals 348–349
fused multiply accumulate 346
reciprocal step 349–351
saturating multiply and double 347–348
by scalar 345–346
signed 233
unsigned 233
VFP 278
Multistage noise shaping (MASH) filtering 407

N

NEON instructions 298–299, 358–361
arithmetic instructions 335–343
absolute difference 339–340
absolute value and negate 340–341
add vector elements pairwise 338–339
count bits 342–343
select maximum/minimum elements 341–342
vector addition and subtraction 335–337
bitwise logical operations 326–327
with immediate data 327–328
insertion and selection 328–329
comparison operations 322–326
vector absolute compare 324–325
vector comparison 323–324
vector test bits 325–326
data conversion between 
fixed point and single-precision 321–322
half-precision and single-precision 322
data movement instructions 309–320
change size of elements in vector 311–312
duplicate scalar 312–313
extract elements 313–314
move immediate data 310–311
moving between NEON scalar and integer register 309–310
reverse elements 314–315
swap vectors 315–316
table lookup 317–319
transpose matrix 316–317
zip/unzip vectors 319–320
intrinsics functions 299
load and store instructions 302–309
load copies of structure to all lanes 305–307, 308t
multiple structures 306t, 307–309
single structure using one lane 303–305, 304t
multiplication and division 343–351
estimate reciprocals 348–349
fused multiply accumulate 346
reciprocal step 349–351
saturating multiply and double 347–348
by scalar 345–346
pseudo-instructions 351–354
bitwise logical operations with immediate data 352–353
load constant 351–352
vector absolute compare 353–354
shift instructions 329–334
saturating shift right by immediate 332–333
shift and insert 333–334
shift left by immediate 329–330
shift left/right by variable 330–331
shift right by immediate 331–332
sine function 354–358, 357t
double precision 355, 357
performance comparison 357–358, 357t
single precision 354–355
syntax of 299–302
user program registers 300f
Newlib 432
Newton-Raphson method 343, 348–349
for improving reciprocal estimates 349–350
Non-integral mathematics 
fixed-point numbers 
interpreting 226–230
properties of 230–231
Q notation 230
fixed-point operations 
addition and subtraction 231–232
division 234–241
multiplication 232–233
floating point numbers 
double-precision, IEEE 754 245–246
half-precision, IEEE 754 243–245
quad-precision, IEEE 754 246
single-precision, IEEE 754 245
floating point operations 
addition and subtraction 246–247
multiplication and division 247
fractional numbers, base conversion 
arbitrary base to decimal 220
decimal to arbitrary base 220–223
fractions and bases 223–225
Patriot missile failure 261–263
sine and cosine function 
factorial terms, formats and constants 249–251
formats for powers of x 248–249
performance comparison 259–260
table printing 258
using fixed-point calculations 257

O

Operand2 80, 80t, 81t
Operating system 431–432
designers 365–366

P

Parallel communications 409
Patriot missile failure 261–263
pcDuino 382–392
bare-metal programs 
linker script 447
main program 445–447
startup code 445
boot process 442
Clock Control Unit 409
GPIO 
detecting events 390
enabling internal pull-up/pull-down 389–390
function select code assignments 392t
header locations 390f
header pin assignments 391f
pin function setting 384–385
pins available on 390–392
reading and setting GPIO pins 388–389
user program memory space on 372, 376
interrupt controllers 449–451, 457
PWM device 400–403
configuring 403
control register bits 402t
prescaler bits 401t
register map 401t
timer0 device 458, 460
UART 422–429
addresses 422t
divisor latch high register 424, 425t
divisor latch low register 424, 424t
FIFO control register 425t
interrupt control 429
interrupt enable register 429
Interrupt Identity Register 429
line control register 425, 426t
line status register 426, 427t
Modem Control Register 429
Modem Scratch Register 429
Modem Status Register 429
receive buffer register 423, 424t
receive FIFO level register 426, 428t
register offsets 423t
status register 426, 427t
transmit FIFO level register 426, 428t
transmit halt register 428, 428t
transmit holding register 424, 424t
PDP-11 163
Privileged mode 432–433
Program Status Register (PSR) 433–434
mode bits 434t
Pseudo-instructions, ARM processor 73, 93
load address 75–76
load immediate 73–75
NEON 351–354
bitwise logical operations with immediate data 352–353
load constant 351–352
vector absolute compare 353–354
no operation 93–94
shifts 94–95
Pulse density modulation (PDM) 396, 396f
Pulse frequency modulation (PFM) 396, 396f
Pulse modulation 
pcDuino PWM device 400–403
PDM 396, 396f
PWM 397, 397f
Raspberry Pi PWM device 398–400, 400b
types 395
Pulse width modulation (PWM) 397, 397f
pcDuino PWM device 400–403
Raspberry Pi PWM device 398–400, 400b

Q

Q notation 230
Quad-precision floating point number 246

R

Radix point 220
Radix ten Hindu-Arabic system 10
Raspberry Pi 365–367
bare-metal programs 442
linker script 447
main program 445–447
startup code 445
clock management device 406–409
GPIO 378–382
detecting events 382
enabling internal pull-up/pull-down 381–382
header pin assignments 384f
output pins setting 380–381
pin alternate functions 385t
pin function setting 379–380
pins available on 382
reading input pins 381
register 379t
user program memory on 372
header location 383f
interrupt controllers 441, 451
PWM device 398–400, 400b
clock values on 400
control register bits 399t
register map 398t
timer0 device 458–461
UART 413–418
assembly functions for 422
basic programming for 418–422
control register 416, 417t
data register 413, 414t
DMA control register 418
flags register bits 414, 415t
fractional baud rate divisor 414, 416t
integer baud rate divisor 414, 416t
interrupt clear register 418
interrupt control 417
interrupt mask set/clear register 417
line control register bits 416, 416t
masked interrupt status register 418
raw interrupt status register 418
receive status register/error clear register 414, 415t
registers 413t
Raw interrupt status register 418
Receive buffer register, UART 423, 424t
Receive FIFO level register, UART 426, 428t
Receive status register/error clear register 414, 415t
Reciprocals 
estimate 348–349
step 349–351
Reduced Instruction Set Computing (RISC) processor 8
Reverse elements 314–315
RS-232 standard 410, 412
RS-422 standards 410, 412
RS-485 standards 410, 412
RunFast mode 272

S

Saved Process Status Register (SPSR) 432–433
Scalar 
duplication 312–313
multiplication by 345–346
sine function using 285–286
Serial communications 409–429
pcDuino UART 422–429
addresses 422t
divisor latch high register 424, 425t
divisor latch low register 424, 424t
FIFO control register 425t
interrupt control 429
line control register 425, 426t
line status register 426, 427t
receive buffer register 423, 424t
receive FIFO level register 426, 428t
register offsets 423t
status register 426, 427t
transmit FIFO level register 426, 428t
transmit halt register 428, 428t
transmit holding register 424, 424t
Raspberry Pi UART0 413–418
assembly functions for 422
basic programming for 418–422
control register 416, 417t
data register 413, 414t
flags register bits 414, 415t
fractional baud rate divisor 414, 416t
integer baud rate divisor 414, 416t
interrupt control 417
line control register bits 416, 416t
receive status register/error clear register 414, 415t
register map 413t
UART 410–412
Serial Peripheral Interface (SPI) functions 382
Shift instructions, NEON 329–334
saturating shift right by immediate 332–333
shift and insert 333–334
shift left by immediate 329–330
shift left/right by variable 330–331
shift right by immediate 331–332
Sine function 
ARM assembly implementation 251, 257
battery powered systems 260
double precision software float C 259
double precision VFP C 260
factorial terms, formats and constants for 249–251
formats for powers of x 248–249
intermediate calculations 251
double precision 355, 357
performance comparison 357–358, 357t
single precision 354–355
performance comparison 259–260
performance implementations 259t
properties 247–248
scalar implementation 286–287
single precision software float C 259
single precision VFP C 259
sinq 248
table printing 251, 258
32-bit fixed point assembly 259
32-bit fixed point C 259
vector implementation 289, 291
VFP 
performances 291, 292t
scalar mode 285–286
vector mode 287–291
Single instruction multiple data (SIMD) instructions 5
Single-precision floating point number 
fixed point to 284–285, 321–322
half-precision to 322
IEEE 754 245
sine function 354–355
Sorting 
linked list 147
by word frequency 147–150
Spaghetti code 100
Special instructions, ARM 
accessing CPSR and SPSR 91
count leading zeros 90
software interrupt 91–92
thumb mode 92–93
Stack and Heap segments 28–29
Status register 
ARM process 433f
pcDuino UART 426, 427t
Structured programming 
aggregate data types 123–131
arrays 124–125
arrays of structured data 126–131
structured data 124–126
description 99–100
iteration 104–108
for loop 106–108
post-test loop 106
pre-test loop 105
selection 101–104
complex selection 103–104
using branch instructions 102
using conditional execution 101–102
sequencing 100–101
subroutines 108–122
advantages 109
automatic variables 118–119
calling 113–117
disadvantages 110
passing parameters 110–113
recursive functions 119–122
standard C library functions 110
writing 117–118
Subtraction 
by addition 172
in decimal and binary 173b
fixed-point operation 231–232
floating point operation 246–247
ten’s complement 172b
vector 335–337
VFP 278
Swap vectors 315–316

T

Table lookup 317–319
Text section, memory 28–29
Therac-25 
for cancer 161
design flaws 163–165
double pass accelerator 161
history of 162–163
overdose 162–163
X-ray therapy 161
Three address instruction 80
Transmit FIFO level register 426, 428t
Transmit halt register 428, 428t
Transmit holding register 424, 424t
Transpose matrix 316–317

U

UCS Transformation Format-8-bit (UTF-8) 26–27
Unary operations 277–278
Universal Asynchronous Receiver/Transmitter (UART) 410–412
line driver 410
pcDuino 422–429
addresses 422t
divisor latch high register 424, 425t
divisor latch low register 424, 424t
FIFO control register 425t
interrupt control 429
line control register 425, 426t
line status register 426, 427t
receive buffer register 423, 424t
receive FIFO level register 426, 428t
register offsets 423t
status register 426, 427t
transmit FIFO level register 426, 428t
transmit halt register 428, 428t
transmit holding register 424, 424t
Raspberry Pi 413–418
assembly functions for 422
basic programming for 418–422
control register 416, 417t
data register 413, 414t
flags register bits 414, 415t
fractional baud rate divisor 414, 416t
integer baud rate divisor 414, 416t
interrupt control 417
line control register bits 416, 416t
receive status register/error clear register 414, 415t
register map 413t
standards 410
transmitter and receiver timings for 411f
Universal Character Set (UCS) code 26
Unzip vectors 319–320
User mode 432

V

Vector absolute comparison 324–325, 353–354
Vector floating point (VFP) 
code meanings for 271t
compare instruction 279
coprocessor 266–268
data conversion instructions 282–285
data movement instructions between 279–282
ARM register and VFP system register 282
two VFP register 279–280
VFP register and one integer register 280–281
VFP register and two integer register 281
data processing instructions 277–279
compare instruction 279
mathematical operations 278
unary operations 277–278
FPSCR 268–273
instructions 292
load and store instructions 274–277
overview 266–268
register usage rules 273–274
sine function 
performance 291, 292t
using scalar mode 285–286
using vector mode 287–291
user program registers 267f
Vectors 268
addition and subtraction 335–337
change size of elements 311–312
comparison operation 323–324
FPSCR 272–273
sine function using 287–291
swapping 315–316
unzip 319–320
Vector table 434–435, 435t
Vector test bits 325–326

W

wl_print_numerical function 147–150
Word frequency counts, ADT 
better performance 150–161
binary tree of 151f, 157f, 158f
C header for 141–142
C implementation 141–142, 145, 150–151, 157
C program to compute 140–141
makefile for 141–142, 146
revised makefile for 148–150
sorting by 147–150

Z

Zip vectors 319–320