First Edition

Part I: Assembly as a Language
1.4 Memory Layout of an Executing Program
Chapter 2: GNU Assembly Syntax
2.1 Structure of an Assembly Program
Chapter 3: Load/Store and Branch Instructions
3.1 CPU Components and Data Paths
Chapter 4: Data Processing and Other Instructions
4.1 Data Processing Instructions
4.4 Alphabetized List of ARM Instructions
Chapter 5: Structured Programming
Chapter 6: Abstract Data Types
6.3 Ethics Case Study: Therac-25
Part II: Performance Mathematics
Chapter 7: Integer Mathematics
Chapter 8: Non-Integral Mathematics
8.1 Base Conversion of Fractional Numbers
8.8 Ethics Case Study: Patriot Missile Failure
Chapter 9: The ARM Vector Floating Point Coprocessor
9.1 Vector Floating Point Overview
9.2 Floating Point Status and Control Register
9.5 Data Processing Instructions
9.6 Data Movement Instructions
9.7 Data Conversion Instructions
9.8 Floating Point Sine Function
9.9 Alphabetized List of VFP Instructions
Chapter 10: The ARM NEON Extensions
10.3 Load and Store Instructions
10.4 Data Movement Instructions
10.7 Bitwise Logical Operations
10.10 Multiplication and Division
10.12 Performance Mathematics: A Final Look at Sine
10.13 Alphabetized List of NEON Instructions
11.1 Accessing Devices Directly Under Linux
11.2 General Purpose Digital Input/Output
Chapter 13: Common System Devices
Chapter 14: Running Without an Operating System
Newnes is an imprint of Elsevier
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, USA
Copyright © 2016 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress
British Library Cataloging-in-Publication Data
A catalogue record for this book is available from the British Library
ISBN: 978-0-12-803698-3
For information on all Newnes publications visit our website at https://www.elsevier.com/

Publisher: Joe Hayton
Acquisition Editor: Tim Pitts
Editorial Project Manager: Charlotte Kent
Production Project Manager: Julie-Ann Stansfield
Designer: Mark Rogers
Typeset by SPi Global, India
Table 1.1 Values represented by two bits 9
Table 1.2 The first 21 integers (starting with 0) in various bases 10
Table 1.3 The ASCII control characters 21
Table 1.4 The ASCII printable characters 22
Table 1.5 Binary equivalents for each character in “Hello World” 23
Table 1.6 Binary, hexadecimal, and decimal equivalents for each character in “Hello World” 24
Table 1.7 Interpreting a hexadecimal string as ASCII 24
Table 1.8 Variations of the ISO 8859 standard 25
Table 1.9 UTF-8 encoding of the ISO/IEC 10646 code points 27
Table 3.1 Flag bits in the CPSR register 58
Table 3.2 ARM condition modifiers 59
Table 3.3 Legal and illegal values for #<immediate|symbol> 60
Table 3.4 ARM addressing modes 61
Table 3.5 ARM shift and rotate operations 61
Table 4.1 Shift and rotate operations in Operand2 80
Table 4.2 Formats for Operand2 81
Table 8.1 Format for IEEE 754 half-precision 244
Table 8.2 Result formats for each term 252
Table 8.3 Shifts required for each term 252
Table 8.4 Performance of sine function with various implementations 259
Table 9.1 Condition code meanings for ARM and VFP 271
Table 9.2 Performance of sine function with various implementations 292
Table 10.1 Parameter combinations for loading and storing a single structure 304
Table 10.2 Parameter combinations for loading multiple structures 306
Table 10.3 Parameter combinations for loading copies of a structure 308
Table 10.4 Performance of sine function with various implementations 357
Table 11.1 Raspberry Pi GPIO register map 379
Table 11.2 GPIO pin function select bits 380
Table 11.3 GPPUD control codes 381
Table 11.4 Raspberry Pi expansion header useful alternate functions 385
Table 11.5 Number of pins available on each of the AllWinner A10/A20 PIO ports 385
Table 11.6 Registers in the AllWinner GPIO device 386
Table 11.7 Allwinner A10/A20 GPIO pin function select bits 388
Table 11.8 Pull-up and pull-down resistor control codes 389
Table 11.9 pcDuino GPIO pins and function select code assignments 392
Table 12.1 Raspberry Pi PWM register map 398
Table 12.2 Raspberry Pi PWM control register bits 399
Table 12.3 Prescaler bits in the pcDuino PWM device 401
Table 12.4 pcDuino PWM register map 401
Table 12.5 pcDuino PWM control register bits 402
Table 13.1 Clock sources available for the clocks provided by the clock manager 407
Table 13.2 Some registers in the clock manager device 407
Table 13.3 Bit fields in the clock manager control registers 408
Table 13.4 Bit fields in the clock manager divisor registers 408
Table 13.5 Clock signals in the AllWinner A10/A20 SOC 409
Table 13.6 Raspberry Pi UART0 register map 413
Table 13.7 Raspberry Pi UART data register 414
Table 13.8 Raspberry Pi UART receive status register/error clear register 415
Table 13.9 Raspberry Pi UART flags register bits 415
Table 13.10 Raspberry Pi UART integer baud rate divisor 416
Table 13.11 Raspberry Pi UART fractional baud rate divisor 416
Table 13.12 Raspberry Pi UART line control register bits 416
Table 13.13 Raspberry Pi UART control register bits 417
Table 13.14 pcDuino UART addresses 422
Table 13.15 pcDuino UART register offsets 423
Table 13.16 pcDuino UART receive buffer register 424
Table 13.17 pcDuino UART transmit holding register 424
Table 13.18 pcDuino UART divisor latch low register 424
Table 13.19 pcDuino UART divisor latch high register 425
Table 13.20 pcDuino UART FIFO control register 425
Table 13.21 pcDuino UART line control register 426
Table 13.22 pcDuino UART line status register 427
Table 13.23 pcDuino UART status register 427
Table 13.24 pcDuino UART transmit FIFO level register 428
Table 13.25 pcDuino UART receive FIFO level register 428
Table 13.26 pcDuino UART transmit halt register 428
Table 14.1 The ARM user and system registers 433
Table 14.2 Mode bits in the PSR 434
Table 14.3 ARM vector table 435
Figure 1.1 Simplified representation of a computer system 4
Figure 1.2 Stages of a typical compilation sequence 6
Figure 1.3 Tables used for converting between binary, octal, and hex 14
Figure 1.4 Four different representations for binary integers 16
Figure 1.5 Complement tables for bases ten and two 17
Figure 1.6 A section of memory 29
Figure 1.7 Typical memory layout for a program with a 32-bit address space 30
Figure 2.1 Equivalent static variable declarations in assembly and C 42
Figure 3.1 The ARM processor architecture 54
Figure 3.2 The ARM user program registers 56
Figure 3.3 The ARM process status register 57
Figure 5.1 ARM user program registers 112
Figure 6.1 Binary tree of word frequencies 151
Figure 6.2 Binary tree of word frequencies with index added 157
Figure 6.3 Binary tree of word frequencies with sorted index 158
Figure 7.1 In signed 8-bit math, 11011001₂ is −39₁₀ 179
Figure 7.2 In unsigned 8-bit math, 11011001₂ is 217₁₀ 179
Figure 7.3 Multiplication of large numbers 180
Figure 7.4 Longhand division in decimal and binary 181
Figure 7.5 Flowchart for binary division 183
Figure 8.1 Examples of fixed-point signed arithmetic 232
Figure 9.1 ARM integer and vector floating point user program registers 267
Figure 9.2 Bits in the FPSCR 268
Figure 10.1 ARM integer and NEON user program registers 300
Figure 10.2 Pixel data interleaved in three doubleword registers 302
Figure 10.3 Pixel data de-interleaved in three doubleword registers 303
Figure 10.4 Example of vext.8 d12,d4,d9,#5 313
Figure 10.5 Examples of the vrev instruction. (A) vrev16.8 d3,d4; (B) vrev32.16 d8,d9; (C) vrev32.8 d5,d7 315
Figure 10.6 Examples of the vtrn instruction. (A) vtrn.8 d14,d15; (B) vtrn.32 d31,d15 316
Figure 10.7 Transpose of a 3 × 3 matrix 317
Figure 10.8 Transpose of a 4 × 4 matrix of 32-bit numbers 318
Figure 10.9 Example of vzip.8 d9,d4 320
Figure 10.10 Effects of vsli.32 d4,d9,#6 334
Figure 11.1 Typical hardware address mapping for memory and devices 366
Figure 11.2 GPIO pins being used for input and output. (A) GPIO pin being used as input to read the state of a push-button switch. (B) GPIO pin being used as output to drive an LED 378
Figure 11.3 The Raspberry Pi expansion header location 383
Figure 11.4 The Raspberry Pi expansion header pin assignments 384
Figure 11.5 Bit-to-pin assignments for PIO control registers 388
Figure 11.6 The pcDuino header locations 390
Figure 11.7 The pcDuino header pin assignments 391
Figure 12.1 Pulse density modulation 396
Figure 12.2 Pulse width modulation 397
Figure 13.1 Typical system with a clock management device 406
Figure 13.2 Transmitter and receiver timings for two UARTs. (A) Waveform of a UART transmitting a byte. (B) Timing of UART receiving a byte 411
Figure 14.1 The ARM process status register 433
Figure 14.2 Basic exception processing 436
Figure 14.3 Exception processing with multiple user processes 437
Listing 2.1 “Hello World” program in ARM assembly 36
Listing 2.2 “Hello World” program in C 37
Listing 2.3 “Hello World” assembly listing 39
Listing 2.4 A listing with misaligned data 43
Listing 2.5 A listing with properly aligned data 45
Listing 2.6 Defining a symbol for the number of elements in an array 47
Listing 5.1 Selection in C 101
Listing 5.2 Selection in ARM assembly using conditional execution 102
Listing 5.3 Selection in ARM assembly using branch instructions 102
Listing 5.4 Complex selection in C 103
Listing 5.5 Complex selection in ARM assembly 104
Listing 5.6 Unconditional loop in ARM assembly 105
Listing 5.7 Pre-test loop in ARM assembly 105
Listing 5.8 Post-test loop in ARM assembly 106
Listing 5.9 for loop in C 106
Listing 5.10 for loop rewritten as a pre-test loop in C 107
Listing 5.11 Pre-test loop in ARM assembly 107
Listing 5.12 for loop rewritten as a post-test loop in C 108
Listing 5.13 Post-test loop in ARM assembly 108
Listing 5.14 Calling scanf and printf in C 111
Listing 5.15 Calling scanf and printf in ARM assembly 111
Listing 5.16 Simple function call in C 114
Listing 5.17 Simple function call in ARM assembly 114
Listing 5.18 A larger function call in C 114
Listing 5.19 A larger function call in ARM assembly 115
Listing 5.20 A function call using the stack in C 115
Listing 5.21 A function call using the stack in ARM assembly 116
Listing 5.22 A function call using stm to push arguments onto the stack 116
Listing 5.23 A small function in C 118
Listing 5.24 A small function in ARM assembly 118
Listing 5.25 A small C function with a register variable 119
Listing 5.26 Automatic variables in ARM assembly 119
Listing 5.27 A C program that uses recursion to reverse a string 120
Listing 5.28 ARM assembly implementation of the reverse function 121
Listing 5.29 Better implementation of the reverse function 122
Listing 5.30 Even better implementation of the reverse function 122
Listing 5.31 String reversing in C using pointers 123
Listing 5.32 String reversing in assembly using pointers 123
Listing 5.33 Initializing an array of integers in C 124
Listing 5.34 Initializing an array of integers in assembly 125
Listing 5.35 Initializing a structured data type in C 125
Listing 5.36 Initializing a structured data type in ARM assembly 126
Listing 5.37 Initializing an array of structured data in C 127
Listing 5.38 Initializing an array of structured data in assembly 128
Listing 5.39 Improved initialization in assembly 129
Listing 5.40 Very efficient initialization in assembly 130
Listing 6.1 Definition of an Abstract Data Type in a C header file 138
Listing 6.2 Definition of the image structure may be hidden in a separate header file 139
Listing 6.3 Definition of an ADT in Assembly 140
Listing 6.4 C program to compute word frequencies 140
Listing 6.5 C header for the wordlist ADT 142
Listing 6.6 C implementation of the wordlist ADT 143
Listing 6.7 Makefile for the wordfreq program 146
Listing 6.8 ARM assembly implementation of wl_print_numerical() 148
Listing 6.9 Revised makefile for the wordfreq program 149
Listing 6.10 C implementation of the wordlist ADT using a tree 151
Listing 6.11 ARM assembly implementation of wl_print_numerical() with a tree 158
Listing 7.1 ARM assembly code for adding two 64 bit numbers 176
Listing 7.2 ARM assembly code for multiplication with a 64 bit result 176
Listing 7.3 ARM assembly code for multiplication with a 32 bit result 177
Listing 7.4 ARM assembly implementation of signed and unsigned 32-bit and 64-bit division functions 187
Listing 7.5 ARM assembly code for division by a constant 192
Listing 7.6 ARM assembly code for division of a variable by a constant without using a multiply instruction 193
Listing 7.7 Header file for a big integer abstract data type 195
Listing 7.8 C source code file for a big integer abstract data type 196
Listing 7.9 Program using the bigint ADT to calculate the factorial function 211
Listing 7.10 ARM assembly implementation of the bigint_adc function 213
Listing 8.1 Examples of fixed-point multiplication in ARM assembly 233
Listing 8.2 Dividing x by 23 239
Listing 8.3 Dividing x by 23 using only shift and add 240
Listing 8.4 Dividing x by −50 242
Listing 8.5 Inefficient representation of a binimal 242
Listing 8.6 Efficient representation of a binimal 243
Listing 8.7 ARM assembly implementation of sin x and cos x using fixed-point calculations 252
Listing 8.8 Example showing how the sin x and cos x functions can be used to print a table 257
Listing 9.1 Simple scalar implementation of the sin x function using IEEE single precision 285
Listing 9.2 Simple scalar implementation of the sin x function using IEEE double precision 286
Listing 9.3 Vector implementation of the sin x function using IEEE single precision 288
Listing 9.4 Vector implementation of the sin x function using IEEE double precision 289
Listing 10.1 NEON implementation of the sin x function using single precision 354
Listing 10.2 NEON implementation of the sin x function using double precision 355
Listing 11.1 Function to map devices into the user program memory on a Raspberry Pi 367
Listing 11.2 Function to map devices into the user program memory space on a pcDuino 372
Listing 11.3 ARM assembly code to set GPIO pin 26 to alternate function 1 381
Listing 11.4 ARM assembly code to configure PA10 for output 388
Listing 11.5 ARM assembly code to set PA10 to output a high state 389
Listing 11.6 ARM assembly code to read the state of PI14 and set or clear the Z flag 389
Listing 13.1 Assembly functions for using the Raspberry Pi UART 418
Listing 14.1 Definitions for ARM CPU modes 435
Listing 14.2 Function to set up the ARM exception table 439
Listing 14.3 Stubs for the exception handlers 440
Listing 14.4 Skeleton for an exception handler 441
Listing 14.5 ARM startup code 443
Listing 14.6 A simple main program 446
Listing 14.7 A sample GNU linker script 448
Listing 14.8 A sample make file 450
Listing 14.9 Running make to build the image 451
Listing 14.10 An improved main program 452
Listing 14.11 ARM startup code with timer interrupt 453
Listing 14.12 Functions to manage the pcDuino interrupt controller 454
Listing 14.13 Functions to manage the Raspberry Pi interrupt controller 457
Listing 14.14 Functions to manage the pcDuino timer0 device 459
Listing 14.15 Functions to manage the Raspberry Pi timer0 device 460
Listing 14.16 IRQ handler to clear the timer interrupt 462
Listing 14.17 A sample make file 463
Listing 14.18 Running make to build the image 464
This book is intended to be used in a first course in assembly language programming for Computer Science (CS) and Computer Engineering (CE) students. It is assumed that students using this book have already taken courses in programming and data structures, and are competent programmers in at least one high-level language. Many of the code examples in the book are written in C, with an assembly implementation following. The assembly examples can stand on their own, but students who are familiar with C, C++, or Java should find the C examples helpful.
Computer Science and Computer Engineering are very large fields. It is impossible to cover everything that a student may eventually need to know. There are a limited number of course hours available, so educators must strive to deliver degree programs that make a compromise between the number of concepts and skills that the students learn and the depth at which they learn those concepts and skills. Obviously, with these competing goals it is difficult to reach consensus on exactly what courses should be included in a CS or CE curriculum.
Traditionally, assembly language courses have consisted of a mechanistic learning of a set of instructions, registers, and syntax. Partially because of this approach, over the years, assembly language courses have been marginalized in, or removed altogether from, many CS and CE curricula. The author feels that this is unfortunate, because a solid understanding of assembly language leads to better understanding of higher-level languages, compilers, interpreters, architecture, operating systems, and other important CS and CE concepts.
One of the goals of this book is to make a course in assembly language more valuable by introducing methods (and a bit of theory) that are not covered in any other CS or CE courses, while using assembly language to implement the methods. In this way, the course in assembly language goes far beyond the traditional assembly language course, and can once again play an important role in the overall CS and CE curricula.
Because of their ubiquity, x86-based systems have been the platforms of choice for most assembly language courses over the last two decades. The author believes that this is unfortunate, because in every respect other than ubiquity, the x86 architecture is the worst possible choice for learning and teaching assembly language. The newer chips in the family have hundreds of instructions, and irregular rules govern how those instructions can be used. In an attempt to make it possible for students to succeed, typical courses use antiquated assemblers, interface with the equally antiquated IBM PC BIOS, and exercise only a small subset of the modern x86 instruction set. The resulting programming environment has little or no relevance to modern computing.
Partially because of this tendency to use x86 platforms, and the resulting unnecessary burden placed on students and instructors, as well as the reliance on antiquated and irrelevant development environments, assembly language is often viewed by students as very difficult and lacking in value. The author hopes that this textbook helps students to realize the value of knowing assembly language. The relatively simple ARM processor family was chosen in hopes that the students also learn that although assembly language programming may be more difficult than high-level languages, it can be mastered.
The recent development of very low-cost ARM-based Linux computers has caused a surge of interest in the ARM architecture as an alternative to the x86 architecture, which has become increasingly complex over the years. This book should provide a solution for a growing need.
Many students have difficulty with the concept that a register can hold variable x at one point in the program, and hold variable y at some other point. They also often have difficulty with the concept that, before it can be involved in any computation, data has to be moved from memory into the CPU. Using a load-store architecture helps the students to more readily grasp these concepts.
Another common difficulty that students have is in relating the concepts of an address and a pointer variable. You can almost see the little light bulbs go on over their heads when they have the “eureka!” moment and realize that pointers are just variables that hold an address. The author hopes that the approach taken in this book will make it easier for students to have that “eureka!” moment. The author believes that load-store architectures make that realization easier.
Many students also struggle with the concept of recursion, regardless of what language is used. In assembly, the mechanisms involved are exposed and directly manipulated by the programmer. Examples of recursion are scattered throughout this textbook. Again, the clean architecture of the ARM makes it much easier for the students to understand what is going on.
Some students have difficulty understanding the flow of a program, and tend to put many unnecessary branches into their code. Many assembly language courses spend so much time and space on learning the instruction set that they never have time to teach good programming practices. This textbook puts strong emphasis on using structured programming concepts. The relative simplicity of the ARM architecture makes this possible.
One of the major reasons to learn and use assembly language is that it allows the programmer to create very efficient mathematical routines. The concepts introduced in this book will enable students to perform efficient non-integral math on any processor. These techniques are rarely taught because of the time that it takes to cover the x86 instruction set. With the ARM processor, less time is spent on the instruction set, and more time can be spent teaching how to optimize the code.
The combination of the ARM processor and the Linux operating system provides the least costly hardware platform and development environment available. A cluster of 10 Raspberry Pis, or similar hosts, with power supplies and networking, can be assembled for 500 US dollars or less. This cluster can support up to 50 students logging in through ssh. If their client platform supports the X Window System, then they can run GUI-enabled applications. Alternatively, most low-cost ARM systems can directly drive a display and take input from a keyboard and mouse. With the addition of an NFS server (which itself could be a low-cost ARM system and a hard drive), an entire ARM-based Linux laboratory of 20 workstations could be built for 250 US dollars per seat or less. Admittedly, it would not be a high-performance laboratory, but it could be used to teach C, assembly, and other languages. The author would argue that inexperienced programmers should learn to program on low-performance machines, because it reinforces a life-long tendency towards efficiency.
The approach of this book is to present concepts in different ways throughout the book, slowly building from simple examples towards complex programming on bare-metal embedded systems. Students who don’t understand a concept when it is explained in a certain way may easily grasp the concept when it is presented later from a different viewpoint.
The main objective of this book is to provide an improved course in assembly language by replacing the x86 platform with one that is less costly, more ubiquitous, well-designed, powerful, and easier to learn. Since students are able to master the basics of assembly language quickly, it is possible to teach a wider range of topics, such as fixed and floating point mathematics, ethical considerations, performance tuning, and interrupt processing. The author hopes that courses using this book will better prepare students for the junior and senior level courses in operating systems, computer architecture, and compilers.
Please visit the companion web site to access additional resources. Instructors may download the author’s lecture slides and solution manual for the exercises. Students and instructors may also access the laboratory manual and additional code examples. The author welcomes suggestions for additional lecture slides, laboratory assignments, or other materials.
I would like to thank Randy Warner for reading the manuscript, catching errors, and making helpful suggestions. I would also like to thank the following students for suggesting exercises with answers and catching numerous errors in the drafts: Zach Buechler, Preston Cook, Joshua Daybrest, Matthew DeYoung, Josh Dodd, Matt Dyke, Hafiza Farzami, Jeremy Goens, Lawrence Hoffman, Colby Johnson, Benjamin Kaiser, Lauren Keene, Jayson Kjenstad, Murray LaHood-Burns, Derek Lane, Yanlin Li, Luke Meyer, Matthew Mielke, Forrest Miller, Christopher Navarro, Girik Ranchhod, Josh Schweigert, Christian Sieh, Weston Silbaugh, Jacob St. Amand, Njaal Tengesdal, Dylan Thoeny, Michael Vortherms, Dicheng Wu, and Kekoa (Peter) Yamaguchi. Finally, I am also very grateful for my assistants, Scott Logan, Ian Carlson, and Derek Stotz, who gave very valuable feedback during the writing of this book.
Assembly as a Language
This chapter first gives a very high-level description of the major components and function of a computer system. It then motivates the reader by giving reasons why learning assembly language is important for Computer Scientists and Computer Engineers. It then explains why the ARM processor is a good choice for a first assembly language. Next it explains binary data representations, including various integer formats, ASCII, and Unicode. Finally, it describes the memory sections for a typical program during execution. By the end of the chapter, the groundwork has been laid for learning to program in assembly language.
Instruction; Instruction stream; Central processing unit; Memory; Input/output device; High-level language; Assembly language; ARM processor; Binary; Hexadecimal; Decimal; Radix or base system; Base conversion; Sign magnitude; Unsigned; Complement; Excess-n; ASCII; Unicode; UTF-8; Stack; Heap; Data section; Text section
An executable computer program is, ultimately, just a series of numbers that have very little or no meaning to a human being. We have developed a variety of human-friendly languages in which to express computer programs, but in order for the program to execute, it must eventually be reduced to a stream of numbers. Assembly language is one step above writing the stream of numbers. The stream of numbers is called the instruction stream. Each number in the instruction stream instructs the computer to perform one (usually small) operation. Although each instruction does very little, the ability of the programmer to specify any sequence of instructions and the ability of the computer to perform billions of these small operations every second makes modern computers very powerful and flexible tools. In assembly language, one line of code usually gets translated into one machine instruction. In high-level languages, a single line of code may generate many machine instructions.
A simplified model of a computer system, as shown in Fig. 1.1, consists of memory, input/output devices, and a central processing unit (CPU), connected together by a system bus. The bus can be thought of as a roadway that allows data to travel between the components of the computer system. The CPU is the part of the system where most of the computation occurs, and the CPU controls the other devices in the system.

Memory can be thought of as a series of mailboxes. Each mailbox can hold a single postcard with a number written on it, and each mailbox has a unique numeric identifier. The identifier, x, is called the memory address, and the number stored in the mailbox is called the contents of address x. Some of the mailboxes contain data, and others contain instructions which control what actions are performed by the CPU.
The CPU also contains a much smaller set of mailboxes, which we call registers. Data can be copied from cards stored in memory to cards stored in the CPU, or vice-versa. Once data has been copied into one of the CPU registers, it can be used in computation. For example, in order to add two numbers in memory, they must first be copied into registers on the CPU. The CPU can then add the numbers together and store the result in one of the CPU registers. The result of the addition can then be copied back into one of the mailboxes in the memory.
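The copy-then-compute sequence described above can be sketched in GNU ARM assembly. This is a minimal, hypothetical fragment: the labels x, y, and z are assumed to name words declared elsewhere in the data section.

```
@ Add the words stored at x and y, placing the sum at z.
@ Both operands must first be copied from memory into registers.
        ldr     r0, =x          @ r0 = address of x
        ldr     r1, [r0]        @ r1 = contents of x
        ldr     r0, =y          @ r0 = address of y
        ldr     r2, [r0]        @ r2 = contents of y
        add     r3, r1, r2      @ the addition happens inside the CPU
        ldr     r0, =z          @ r0 = address of z
        str     r3, [r0]        @ copy the result back into memory
```

The load, add, and store instructions used here are covered in detail in Chapters 3 and 4.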
Modern computers execute instructions sequentially. In other words, the next instruction to be executed is at the memory address immediately following the current instruction. One of the registers in the CPU, the program counter (PC), keeps track of the location from which the next instruction is to be fetched. The CPU follows a very simple sequence of actions. It fetches an instruction from memory, increments the PC, executes the instruction, and then repeats the process with the next instruction. However, some instructions may change the PC, so that the next instruction is fetched from a non-sequential address.
There are many high-level programming languages, such as Java, Python, C, and C++ that have been designed to allow programmers to work at a high level of abstraction, so that they do not need to understand exactly what instructions are needed by a particular CPU. For compiled languages, such as C and C++, a compiler handles the task of translating the program, written in a high-level language, into assembly language for the particular CPU on the system. An assembler then converts the program from assembly language into the binary codes that the CPU reads as instructions.
High-level languages can greatly enhance programmer productivity. However, there are some situations where writing assembly code directly is desirable or necessary. For example, assembly language may be the best choice when writing
• the first steps in booting the computer,
• code to handle interrupts,
• low-level locking code for multi-threaded programs,
• code for machines where no compiler exists,
• code which needs to be optimized beyond the limits of the compiler,
• code for computers with very limited memory, and
• code that requires low-level access to architectural and/or processor features.
Aside from sheer necessity, there are several other reasons why it is still important for computer scientists to learn assembly language.
One example where knowledge of assembly is indispensable is when designing and implementing compilers for high-level languages. As shown in Fig. 1.2, a typical compiler for a high-level language must generate assembly language as its output. Most compilers are designed to have multiple stages. In the input stage, the source language is read and converted into a graph representation. The graph may be optimized before being passed to the output, or code generation, stage where it is converted to assembly language. The assembly is then fed into the system’s assembler to generate an object file. The object file is linked with other object files (which are often combined into libraries) to create an executable program.

The code generation stage of a compiler must traverse the graph and emit assembly code. The quality of the assembly code that is generated can have a profound influence on the performance of the executable program. Therefore, the programmer responsible for the code generation portion of the compiler must be well versed in assembly programming for the target CPU.
Some people believe that a good optimizing compiler will generate better assembly code than a human programmer. This belief is not justified. Highly optimizing compilers employ many clever algorithms, but like all programs, they are not perfect. Outside of the cases that they were designed for, they do not optimize well. Many newer CPUs have instructions which operate on multiple items of data at once. However, compilers rarely make use of these powerful single instruction, multiple data (SIMD) instructions. Instead, it is common for programmers to write functions in assembly language to take advantage of SIMD instructions. The assembly functions are assembled into object file(s), then linked with the object file(s) generated from the high-level language compiler.
Many modern processors also have some support for processing vectors (arrays). Compilers are usually not very good at making effective use of the vector instructions. In order to achieve excellent vector performance for audio or video codecs and other time-critical code, it is often necessary to resort to small pieces of assembly code in the performance-critical inner loops. A good example of this type of code is when performing vector and matrix multiplies. Such operations are commonly needed in processing images and in graphical applications. The ARM vector instructions are explained in Chapter 9.
Another reason to learn assembly is for writing certain parts of an operating system. Although modern operating systems are mostly written in high-level languages, some portions of the code can only be written in assembly. Typical uses of assembly language include writing device drivers, saving the state of a running program so that another program can use the CPU, restoring the saved state of a program so that it can resume executing, and managing memory and memory protection hardware. There are many other tasks central to a modern operating system which can only be accomplished in assembly language. Careful design of the operating system can minimize the amount of assembly required, but cannot eliminate it completely.
Another good reason to learn assembly is for debugging. Simply understanding what is going on “behind the scenes” of compiled languages such as C and C++ can be very valuable when trying to debug programs. If there is a problem in a call to a third-party library, sometimes the only way a developer can isolate and diagnose the problem is to run the program under a debugger and step through it one machine instruction at a time. This does not require deep expertise in writing assembly, but it does require at least a passing familiarity with reading it. Analysis of assembly code is an important skill for C and C++ programmers, who may occasionally have to diagnose a fault by examining the contents of CPU registers and single-stepping through machine instructions.
Assembly language is an important part of the path to understanding how the machine works. Even though only a small percentage of computer scientists will be lucky enough to work on the code generator of a compiler, they all can benefit from the deeper level of understanding gained by learning assembly language. Many programmers do not really understand pointers until they have written assembly language.
Without first learning assembly language, it is impossible to learn advanced concepts such as microcode, pipelining, instruction scheduling, out-of-order execution, threading, branch prediction, and speculative execution. There are many other concepts, especially when dealing with operating systems and computer architecture, which require some understanding of assembly language. The best programmers understand why some language constructs perform better than others, how to reduce cache misses, and how to prevent buffer overruns that destroy security.
Every program is meant to run on a real machine. Even though there are many languages, compilers, virtual machines, and operating systems to enable the programmer to use the machine more conveniently, the strengths and weaknesses of that machine still determine what is easy and what is hard. Learning assembly is a fundamental part of understanding enough about the machine to make informed choices about how to write efficient programs, even when writing in a high-level language.
As an analogy, most people do not need to know a lot about how an internal combustion engine works in order to operate an automobile. A race car driver needs a much better understanding of exactly what happens when he or she steps on the accelerator pedal in order to be able to judge precisely when (and how hard) to do so. Also, who would trust their car to a mechanic who could not tell the difference between a spark plug and a brake caliper? Worse still, should we trust an engineer to build a car without that knowledge? Even in this day of computerized cars, someone needs to know the gritty details, and they are paid well for that knowledge. Knowledge of assembly language is one of the things that defines the computer scientist and engineer.
When learning assembly language, the specific instruction set is not critically important, because what is really being learned is the fine detail of how a typical stored-program machine uses different storage locations and logic operations to convert a string of bits into a meaningful calculation. However, when it comes to learning assembly languages, some processors make it more difficult than it needs to be. Because some processors have an instruction set that is extremely irregular, non-orthogonal, large, and poorly designed, they are not a good choice for learning assembly. The author feels that teaching students their first assembly language on one of those processors should be considered a crime, or at least a form of mental abuse. Luckily, there are processors that are readily available, low-cost, and relatively easy to learn assembly with. This book uses one of them as the model for assembly language.
In the late 1970s, the microcomputer industry was a fierce battleground, with several companies competing to sell computers to small business and home users. One of those companies, based in the United Kingdom, was Acorn Computers Ltd. Acorn’s flagship product, the BBC Micro, was based on the same processor that Apple Computer had chosen for their Apple II™ line of computers: the 8-bit 6502 made by MOS Technology. As the 1980s approached, microcomputer manufacturers were looking for more powerful 16-bit and 32-bit processors. The engineers at Acorn considered the processor chips that were available at the time, and concluded that there was nothing available that would meet their needs for the next generation of Acorn computers.
The only reasonably-priced processors that were available were the Motorola 68000 (a 32-bit processor used in the Apple Macintosh and most high-end Unix workstations) and the Intel 80286 (a 16-bit processor used in less powerful personal computers such as the IBM PC). During the previous decade, a great deal of research had been conducted on developing high-performance computer architectures. One of the outcomes of that research was the development of a new paradigm for processor design, known as Reduced Instruction Set Computing (RISC). One advantage of RISC processors was that they could deliver higher performance with a much smaller number of transistors than the older Complex Instruction Set Computing (CISC) processors such as the 68000 and 80286. The engineers at Acorn decided to design and produce their own processor. They used the BBC Micro to design and simulate their new processor, and in 1987, they introduced the Acorn Archimedes™. The Archimedes™ was arguably the most powerful home computer in the world at that time, with graphics and audio capabilities that IBM PC™ and Apple Macintosh™ users could only dream about. Thus began the long and successful dynasty of the Acorn RISC Machine (ARM) processor.
Acorn never made a big impact on the global computer market. Although Acorn eventually went out of business, the processor that they created has lived on. It was re-named to the Advanced RISC Machine, and is now known simply as ARM. Stewardship of the ARM processor belongs to ARM Holdings, which manages the design of new ARM architectures and licenses the manufacturing rights to other companies. ARM Holdings does not manufacture any processor chips, yet more ARM processors are produced annually than all other processor designs combined. Most ARM processors are used as components for embedded systems and portable devices. If you have a smart phone or similar device, then there is a very good chance that it has an ARM processor in it. Because of its enormous market presence, clean architecture, and small, orthogonal instruction set, the ARM is a very good choice for learning assembly language.
Although it dominates the portable device market, the ARM processor has almost no presence in the desktop or server market. However, that may change. In 2012, ARM Holdings announced the ARM64 architecture, which is the first major redesign of the ARM architecture in 30 years. The ARM64 is intended to compete for the desktop and server market with other high-end processors such as the Sun SPARC and Intel Xeon. Regardless of whether or not the ARM64 achieves much market penetration, the original ARM 32-bit processor architecture is so ubiquitous that it clearly will be around for a long time.
The basic unit of data in a digital computer is the binary digit, or bit. A bit can have a value of zero or one. In order to store numbers larger than 1, bits are combined into larger units. For instance, using two bits, it is possible to represent any number between zero and three. This is shown in Table 1.1. When stored in the computer, all data is simply a string of binary digits. There is more than one way that such a fixed-length string of binary digits can be interpreted.
Computers have been designed using many different bit group sizes, including 4, 8, 10, 12, and 14 bits. Today most computers recognize a basic grouping of 8 bits, which we call a byte. Some computers can work in units of 4 bits, which is commonly referred to as a nibble (sometimes spelled “nybble”). A nibble is a convenient size because it can exactly represent one hexadecimal digit. Additionally, most modern computers can also work with groupings of 16, 32 and 64 bits. The CPU is designed with a default word size. For most modern CPUs, the default word size is 32 bits. Many processors support 64-bit words, which is increasingly becoming the default size.
A numeral system is a writing system for expressing numbers. The most common system is the Hindu-Arabic number system, which is now used throughout the world. Almost from the first day of formal education, children begin learning how to add, subtract, and perform other operations using the Hindu-Arabic system. After years of practice, performing basic mathematical operations using strings of digits between 0 and 9 seems natural. However, there are other ways to count and perform arithmetic, such as Roman numerals, unary systems, and Chinese numerals. With a little practice, it is possible to become as proficient at performing mathematics with other number systems as with the Hindu-Arabic system.
The Hindu-Arabic system is a base ten or radix ten system, because it uses the ten digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. For our purposes, the words radix and base are equivalent, and refer to the number of individual digits available in the numbering system. The Hindu-Arabic system is also a positional system, or a place-value notation, because the value of each digit in a number depends on its position in the number. The radix ten Hindu-Arabic system is only one of an infinite family of closely related positional systems. The members of this family differ only in the radix used (and therefore, the number of characters used). For bases greater than base ten, characters are borrowed from the alphabet and used to represent digits. For example, the first column in Table 1.2 shows the character “A” being used as a single digit representation for the number 10.
Table 1.2
The first 21 integers (starting with 0) in various bases
| Base | |||||||||
| 16 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 10 |
| 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 10 | 11 |
| 4 | 4 | 4 | 4 | 4 | 4 | 4 | 10 | 11 | 100 |
| 5 | 5 | 5 | 5 | 5 | 5 | 10 | 11 | 12 | 101 |
| 6 | 6 | 6 | 6 | 6 | 10 | 11 | 12 | 20 | 110 |
| 7 | 7 | 7 | 7 | 10 | 11 | 12 | 13 | 21 | 111 |
| 8 | 8 | 8 | 10 | 11 | 12 | 13 | 20 | 22 | 1000 |
| 9 | 9 | 10 | 11 | 12 | 13 | 14 | 21 | 100 | 1001 |
| A | 10 | 11 | 12 | 13 | 14 | 20 | 22 | 101 | 1010 |
| B | 11 | 12 | 13 | 14 | 15 | 21 | 23 | 102 | 1011 |
| C | 12 | 13 | 14 | 15 | 20 | 22 | 30 | 110 | 1100 |
| D | 13 | 14 | 15 | 16 | 21 | 23 | 31 | 111 | 1101 |
| E | 14 | 15 | 16 | 20 | 22 | 24 | 32 | 112 | 1110 |
| F | 15 | 16 | 17 | 21 | 23 | 30 | 33 | 120 | 1111 |
| 10 | 16 | 17 | 20 | 22 | 24 | 31 | 100 | 121 | 10000 |
| 11 | 17 | 18 | 21 | 23 | 25 | 32 | 101 | 122 | 10001 |
| 12 | 18 | 20 | 22 | 24 | 30 | 33 | 102 | 200 | 10010 |
| 13 | 19 | 21 | 23 | 25 | 31 | 34 | 103 | 201 | 10011 |
| 14 | 20 | 22 | 24 | 26 | 32 | 40 | 110 | 202 | 10100 |

In base ten, we think of numbers as strings of the 10 digits, “0”–“9”. Each digit counts 10 times the amount of the digit to its right. If we restrict ourselves to integers, then the digit furthest to the right is always the ones digit. It is also referred to as the least significant digit. The digit immediately to the left of the ones digit is the tens digit. To the left of that is the hundreds digit, and so on. The leftmost digit is referred to as the most significant digit. The following equation shows how a number can be decomposed into its constituent digits:

57839₁₀ = 5 × 10^4 + 7 × 10^3 + 8 × 10^2 + 3 × 10^1 + 9 × 10^0

Note that the subscript of “10” on 57839₁₀ indicates that the number is given in base ten.
Imagine that we only had 7 digits: 0, 1, 2, 3, 4, 5, and 6. We need 10 digits for base ten, so with only 7 digits we are limited to base seven. In base seven, each digit in the string represents a power of seven rather than a power of ten. We can represent any integer in base seven, but it may take more digits than in base ten. Other than using a different base for the power of each digit, the math works exactly the same as for base ten. For example, suppose we have the following number in base seven: 330425₇. We can convert this number to base ten as follows:

330425₇ = 3 × 7^5 + 3 × 7^4 + 0 × 7^3 + 4 × 7^2 + 2 × 7^1 + 5 × 7^0 = 50421 + 7203 + 0 + 196 + 14 + 5 = 57839₁₀
Base two, or binary, is the “native” number system for modern digital systems. The reason for this is mainly because it is relatively easy to build circuits with two stable states: on and off (or 1 and 0). Building circuits with more than two stable states is much more difficult and expensive, and any computation that can be performed in a higher base can also be performed in binary. The least significant (rightmost) digit in binary is referred to as the least significant bit, or LSB, while the leftmost binary digit is referred to as the most significant bit, or MSB.
The most common bases used by programmers are base two (binary), base eight (octal), base ten (decimal) and base sixteen (hexadecimal). Octal and hexadecimal are common because, as we shall see later, they can be translated quickly and easily to and from base two, and are often easier for humans to work with than base two. Note that for base sixteen, we need 16 characters. We use the digits 0 through 9 plus the letters A through F. Table 1.2 shows the equivalents for all numbers between 0 and 20 in base two through base ten, and base sixteen.
Before learning assembly language it is essential to know how to convert from any base to any other base. Since we are already comfortable working in base ten, we will use that as an intermediary when converting between two arbitrary bases. For instance, if we want to convert a number in base three to base five, we will do it by first converting the base three number to base ten, then from base ten to base five. By using this two-stage process, we will only need to learn to convert between base ten and any arbitrary base b.
Converting from an arbitrary base b to base ten simply involves multiplying each base b digit d by b^n, where n is the significance of digit d, and summing all of the results. For example, converting the base five number 3421₅ to base ten is performed as follows:

3421₅ = 3 × 5^3 + 4 × 5^2 + 2 × 5^1 + 1 × 5^0 = 375 + 100 + 10 + 1 = 486₁₀
This conversion procedure works for converting any integer from any arbitrary base b to its equivalent representation in base ten. Example 1.1 gives another specific example of how to convert from base b to base ten.
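The digit-weighting procedure can be sketched in C as follows. This is a minimal illustration rather than code from this book; the function names digit_value and from_base are arbitrary, and invalid digit characters are not checked.

```c
/* Value of a single digit character: '0'-'9' or 'A'-'F'.
   Returns -1 for characters that are not valid digits. */
int digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Interpret the string s as a number in the given base and return its
   value as a machine integer.  Multiplying the running result by the
   base at each step weights every digit by the proper power of the
   base, exactly as in the summation shown above. */
long from_base(const char *s, int base)
{
    long result = 0;
    while (*s)
        result = result * base + digit_value(*s++);
    return result;
}
```

For instance, from_base("3421", 5) yields 486, matching the worked example above.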
Converting from base ten to an arbitrary base b involves repeated division by the base, b. After each division, the remainder is used as the next more significant digit in the base b number, and the quotient is used as the dividend for the next iteration. The process is repeated until the quotient is zero. For example, converting 56₁₀ to base four is accomplished as follows:

56 ÷ 4 = 14 remainder 0
14 ÷ 4 = 3 remainder 2
3 ÷ 4 = 0 remainder 3

Reading the remainders from the last computed to the first yields: 320₄. This result can be double-checked by converting it back to base ten as follows:

320₄ = 3 × 4^2 + 2 × 4^1 + 0 × 4^0 = 48 + 8 + 0 = 56₁₀

Since we arrived at the same number we started with, we have verified that 56₁₀ = 320₄. This conversion procedure works for converting any integer from base ten to any arbitrary base b. Example 1.2 gives another example of converting from base ten to another base b.
Although it is possible to perform the division and multiplication steps in any base, most people are much better at working in base ten. For that reason, the easiest way to convert from any base a to any other base b is to use a two-step process. The first step is to convert from base a to decimal. The second step is to convert from decimal to base b. Example 1.3 shows how to convert from any base to any other base.
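The repeated-division procedure can also be sketched in C. This is a minimal illustration, not code from this book; the function name to_base is arbitrary, the base is assumed to be between 2 and 16, and n is assumed to be non-negative.

```c
#include <string.h>

/* Write the non-negative value n as a string of digits in the given
   base (2 to 16) using repeated division.  Each remainder becomes the
   next more significant digit, so digits are produced least
   significant first and are written into buf from the right, then
   shifted to the front. */
void to_base(long n, int base, char *buf, size_t buflen)
{
    const char digits[] = "0123456789ABCDEF";
    size_t i = buflen - 1;
    buf[i] = '\0';
    if (n == 0)
        buf[--i] = '0';
    while (n > 0) {
        buf[--i] = digits[n % base];   /* remainder is the next digit  */
        n /= base;                     /* quotient feeds the next step */
    }
    memmove(buf, buf + i, buflen - i); /* move the result to the front */
}
```

Combined with a base-to-decimal routine, this completes the two-step conversion between arbitrary bases described above.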
In addition to the methods above, there is a simple method for quickly converting between base two, base eight, and base sixteen. These shortcuts rely on the fact that 2, 8, and 16 are all powers of two. Because of this, it takes exactly four binary digits (bits) to represent exactly one hexadecimal digit. Likewise, it takes exactly three bits to represent an octal digit. Conversely, each hexadecimal digit can be converted to exactly four binary digits, and each octal digit can be converted to exactly three binary digits. This relationship makes it possible to do very fast conversions using the tables shown in Fig. 1.3.

When converting from hexadecimal to binary, all that is necessary is to replace each hex digit with the corresponding binary digits from the table. For example, to convert 5AC4₁₆ to binary, we just replace “5” with “0101,” replace “A” with “1010,” replace “C” with “1100,” and replace “4” with “0100.” So, just by referring to the table, we can immediately see that 5AC4₁₆ = 0101101011000100₂. This method works exactly the same for converting from octal to binary, except that it uses the table on the right side of Fig. 1.3.
Converting from binary to hexadecimal is also very easy using the table. Given a binary number, n, take the four least significant digits of n and find them in the table on the left side of Fig. 1.3. The hexadecimal digit on the matching line of the table is the least significant hex digit. Repeat the process with the next set of four bits and continue until there are no bits remaining in the binary number. For example, to convert 0011100101010111₂ to hexadecimal, just divide the number into groups of four bits, starting on the right, to get: 0011|1001|0101|0111₂. Now replace each group of four bits by looking up the corresponding hex digit in the table on the left side of Fig. 1.3, to convert the binary number to 3957₁₆. In the case where the binary number does not have enough bits, simply pad with zeros in the high-order bits. For example, dividing the number 1001100010011₂ into groups of four yields 1|0011|0001|0011₂ and padding with zeros in the high-order bits results in 0001|0011|0001|0011₂. Looking up the four groups in the table reveals that 0001|0011|0001|0011₂ = 1313₁₆.
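The table-lookup method can be sketched in C. This is only an illustration, not code from this book: the nibble table mirrors the left side of Fig. 1.3, the function name hex_to_bin is mine, and the input is assumed to contain only valid uppercase hex digits.

```c
#include <string.h>

/* Table mapping each hex digit 0-F to its four-bit pattern, as in the
   table on the left side of Fig. 1.3. */
static const char *nibble[16] = {
    "0000","0001","0010","0011","0100","0101","0110","0111",
    "1000","1001","1010","1011","1100","1101","1110","1111"
};

/* Convert hexadecimal to binary by replacing each hex digit with its
   four-bit group from the table.  Assumes out is large enough and hex
   contains only '0'-'9' and 'A'-'F'. */
void hex_to_bin(const char *hex, char *out)
{
    out[0] = '\0';
    for (; *hex; hex++) {
        int v = (*hex <= '9') ? *hex - '0' : *hex - 'A' + 10;
        strcat(out, nibble[v]);
    }
}
```

For example, hex_to_bin applied to "5AC4" produces "0101101011000100", matching the worked example above.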
The computer stores groups of bits, but the bits by themselves have no meaning. The programmer gives them meaning by deciding what the bits represent, and how they are interpreted. Interpreting a group of bits as unsigned integer data is relatively simple. Each bit is weighted by a power-of-two, and the value of the group of bits is the sum of the non-zero bits multiplied by their respective weights. However, programmers often need to represent negative as well as non-negative numbers, and there are many possibilities for storing and interpreting integers whose value can be both positive and negative. Programmers and hardware designers have developed several standard schemes for encoding such numbers. The three main methods for storing and interpreting signed integer data are two’s complement, sign-magnitude, and excess-N. Fig. 1.4 shows how the same binary pattern of bits can be interpreted as a number in four different ways.

The sign-magnitude representation simply reserves the most significant bit to represent the sign of the number, and the remaining bits are used to store the magnitude of the number. This method has the advantage that it is easy for humans to interpret, with a little practice. However, addition and subtraction are slightly complicated. The addition/subtraction logic must compare the sign bits, complement one of the inputs if they are different, implement an end-around carry, and complement the result if there was no carry from the most significant bit. Complements are explained in Section 1.3.3. Because of the complexity, most integer CPUs do not directly support addition and subtraction of integers in sign-magnitude form. However, this method is commonly used for the mantissa in floating-point numbers, as will be explained in Chapter 8. Another drawback to sign-magnitude is that it has two representations for zero, which can cause problems if the programmer is not careful.
Another method for representing both positive and negative numbers is by using an excess-N representation. With this representation, the number that is stored is N greater than the actual value. This representation is relatively easy for humans to interpret. Addition and subtraction are easily performed using the complement method, which is explained in Section 1.3.3. This representation is the same as unsigned math, with the addition of a bias, which is usually 2^(n−1) − 1. So, zero is represented as zero plus the bias. With n = 12 bits, the bias is 2^(12−1) − 1 = 2047₁₀, or 011111111111₂. This method is commonly used to store the exponent in floating-point numbers, as will be explained in Chapter 8.
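As a sketch of how the same stored bits yield different values under different interpretations, the following C functions decode an n-bit pattern as excess-N (with the usual bias of 2^(n−1) − 1) and as two's complement. The function names are mine, and n is assumed to be between 2 and 31.

```c
#include <stdint.h>

/* Interpret the low n bits of 'bits' as an excess-N number: the value
   is the stored unsigned number minus the bias 2^(n-1) - 1. */
int32_t excess_value(uint32_t bits, int n)
{
    int32_t bias = (1 << (n - 1)) - 1;
    return (int32_t)bits - bias;
}

/* Interpret the low n bits of 'bits' as a two's complement number: if
   the sign bit is set, subtract 2^n from the unsigned value. */
int32_t twos_value(uint32_t bits, int n)
{
    if (bits & (1u << (n - 1)))
        return (int32_t)bits - (1 << n);
    return (int32_t)bits;
}
```

With n = 12, the stored pattern 011111111111₂ (2047) decodes to zero in excess-2047, while the same pattern read as two's complement is simply 2047.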
A very efficient method for dealing with signed numbers involves representing negative numbers as the radix complements of their positive counterparts. The complement is the amount that must be added to something to make it “whole.” For instance, in geometry, two angles are complementary if they add to 90°. In radix mathematics, the complement of a digit x in base b is simply b − x. For example, in base ten, the complement of 4 is 10 − 4 = 6.
In complement representation, the most significant digit of a number is reserved to indicate whether or not the number is negative. If the first digit is less than b/2 (where b is the radix), then the number is positive. If the first digit is greater than or equal to b/2, then the number is negative. The first digit is not part of the magnitude of the number, but only indicates the sign of the number. For example, numbers in ten’s complement notation are positive if the first digit is less than 5, and negative if the first digit is greater than 4. This works especially well in binary, since the number is considered positive if the first bit is zero and negative if the first bit is one. The magnitude of a negative number can be obtained by taking the radix complement. Because of the nice properties of the complement representation, it is the most common method for representing signed numbers in digital computers.
Finding the complement: The radix complement of an n digit number y in radix (base) b is defined as:

C(y) = b^n − y  (1.4)
For example, the ten’s complement of the four digit number 8734₁₀ is 10^4 − 8734 = 1266. In this example, we directly applied the definition of the radix complement from Eq. (1.4). That is easy in base ten, but not so easy in an arbitrary base, because it involves performing a subtraction. However, there is a very simple method for calculating the complement which does not require subtraction. This method involves finding the diminished radix complement, which is (b^n − 1) − y, by substituting each digit with its complement from a complement table. The radix complement is found by adding one to the diminished radix complement. Fig. 1.5 shows the complement tables for bases ten and two. Examples 1.4 and 1.5 show how the complement is obtained in bases ten and two respectively. Examples 1.6 and 1.7 show additional conversions between binary and decimal.
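The table method for base ten can be sketched in C. This is an illustration, not code from the book; tens_complement is a made-up name, and the input is assumed to be a non-empty string of decimal digits.

```c
#include <string.h>

/* Compute the ten's complement of the decimal digit string y, in
   place: first replace each digit d with 9 - d (the diminished, or
   nines', complement, as read from the complement table), then add
   one to the result. */
void tens_complement(char *y)
{
    size_t n = strlen(y);
    for (size_t i = 0; i < n; i++)
        y[i] = '9' - (y[i] - '0');      /* diminished radix complement */
    for (size_t i = n; i-- > 0; ) {     /* add one, propagating carry  */
        if (y[i] == '9') { y[i] = '0'; }
        else { y[i]++; break; }
    }
}
```

Applied to "8734", the digit substitutions give "1265", and adding one yields "1266", matching the direct calculation above.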

Subtraction using complements: One very useful feature of complement notation is that it can be used to perform subtraction by using addition. Given two numbers x and y in base b, the difference can be computed as:

x − y = x + C(y) − b^n
where C(y) is the radix complement of y. Assume that x and y are both positive, that y ≤ x, and that both numbers have the same number of digits n (y may have leading zeros). In this case, the result of x + C(y) will always be greater than or equal to b^n, but less than 2 × b^n. This means that the result of x + C(y) will always begin with a ‘1’ in the n + 1 digit position. Dropping the initial ‘1’ is equivalent to subtracting b^n, making the result x − y + b^n − b^n, or just x − y, which is the desired result. This can be reduced to a simple procedure. When y and x are both positive and y ≤ x, the following four steps are performed:
1. pad the subtrahend (y) with leading zeros, as necessary, so that both numbers have the same number of digits (n),
2. find the b’s complement of the subtrahend,
3. add the complement to the minuend,
4. discard the leading ‘1’.
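The four steps can be sketched in C using machine integers to stand in for the digit strings. This is an illustration only: complement_subtract is a made-up name, and x and y are assumed to be non-negative, with y ≤ x and both representable in n digits of base b.

```c
/* Subtract y from x in base b by the complement procedure: form the
   radix complement of y, add it to the minuend, then discard the
   leading '1' (that is, subtract b^n). */
long complement_subtract(long x, long y, long b, int n)
{
    long bn = 1;
    for (int i = 0; i < n; i++)
        bn *= b;                /* b^n                                  */
    long comp = bn - y;         /* radix complement of the subtrahend   */
    long sum  = x + comp;       /* add the complement to the minuend    */
    return sum - bn;            /* drop the leading '1' (subtract b^n)  */
}
```

For example, with b = 10 and n = 3, subtracting 487 from 641 proceeds as 641 + 513 = 1154; dropping the leading ‘1’ leaves 154.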
The complement notation provides a very easy way to represent both positive and negative integers using a fixed number of digits, and to perform subtraction by using addition. Since modern computers typically use a fixed number of bits, complement notation provides a very convenient and efficient way to store signed integers and perform mathematical operations on them. Hardware is simplified because there is no need to build a specialized subtractor circuit. Instead, a very simple complement circuit is built and the adder is reused to perform subtraction as well as addition.
In the previous section, we discussed how the computer stores information as groups of bits, and how we can interpret those bits as numbers in base two. Given that the computer can only store information using groups of bits, how can we store textual information? The answer is that we create a table, which assigns a numerical value to each character in our language.
Early in the development of computers, several computer manufacturers developed such tables, or character coding schemes. These schemes were incompatible and computers from different manufacturers could not easily exchange textual data without the use of translation software to convert the character codes from one coding scheme to another.
Eventually, a standard coding scheme, known as the American Standard Code for Information Interchange (ASCII) was developed. Work on the ASCII standard began on October 6, 1960, with the first meeting of the American Standards Association’s (ASA) X3.2 subcommittee. The first edition of the standard was published in 1963. The standard was updated in 1967 and again in 1986, and has been stable since then. Within a few years of its development, ASCII was accepted by all major computer manufacturers, although some continue to support their own coding schemes as well.
ASCII was designed for American English, and does not support some of the characters that are used by non-English languages. For this reason, ASCII has been extended to create more comprehensive coding schemes. Most modern multilingual coding schemes are based on ASCII, though they support a wider range of characters.
At the time that it was developed, transmission of digital data over long distances was very slow, and usually involved converting each bit into an audio signal which was transmitted over a telephone line using an acoustic modem. In order to maximize performance, the standards committee chose to define ASCII as a 7-bit code. Because of this decision, all textual data could be sent using seven bits rather than eight, resulting in approximately 10% better overall performance when transmitting data over a telephone modem. A possibly unforeseen benefit was that this also provided a way for the code to be extended in the future. Since there are 128 possible values for a 7-bit number, the ASCII standard provides 128 characters. However, 33 of the ASCII characters are non-printing control characters. These characters, shown in Table 1.3, are mainly used to send information about how the text is to be displayed and/or printed. The remaining 95 printable characters are shown in Table 1.4.
Table 1.3
The ASCII control characters
| Binary | Oct | Dec | Hex | Abbr | Glyph | Name |
| 000 0000 | 000 | 0 | 00 | NUL | ˆ@ | Null character |
| 000 0001 | 001 | 1 | 01 | SOH | ˆA | Start of header |
| 000 0010 | 002 | 2 | 02 | STX | ˆB | Start of text |
| 000 0011 | 003 | 3 | 03 | ETX | ˆC | End of text |
| 000 0100 | 004 | 4 | 04 | EOT | ˆD | End of transmission |
| 000 0101 | 005 | 5 | 05 | ENQ | ˆE | Enquiry |
| 000 0110 | 006 | 6 | 06 | ACK | ˆF | Acknowledgment |
| 000 0111 | 007 | 7 | 07 | BEL | ˆG | Bell |
| 000 1000 | 010 | 8 | 08 | BS | ˆH | Backspace |
| 000 1001 | 011 | 9 | 09 | HT | ˆI | Horizontal tab |
| 000 1010 | 012 | 10 | 0A | LF | ˆJ | Line feed |
| 000 1011 | 013 | 11 | 0B | VT | ˆK | Vertical tab |
| 000 1100 | 014 | 12 | 0C | FF | ˆL | Form feed |
| 000 1101 | 015 | 13 | 0D | CR | ˆM | Carriage return |
| 000 1110 | 016 | 14 | 0E | SO | ˆN | Shift out |
| 000 1111 | 017 | 15 | 0F | SI | ˆO | Shift in |
| 001 0000 | 020 | 16 | 10 | DLE | ˆP | Data link escape |
| 001 0001 | 021 | 17 | 11 | DC1 | ˆQ | Device control 1 (oft. XON) |
| 001 0010 | 022 | 18 | 12 | DC2 | ˆR | Device control 2 |
| 001 0011 | 023 | 19 | 13 | DC3 | ˆS | Device control 3 (oft. XOFF) |
| 001 0100 | 024 | 20 | 14 | DC4 | ˆT | Device control 4 |
| 001 0101 | 025 | 21 | 15 | NAK | ˆU | Negative acknowledgement |
| 001 0110 | 026 | 22 | 16 | SYN | ˆV | Synchronous idle |
| 001 0111 | 027 | 23 | 17 | ETB | ˆW | End of transmission Block |
| 001 1000 | 030 | 24 | 18 | CAN | ˆX | Cancel |
| 001 1001 | 031 | 25 | 19 | EM | ˆY | End of medium |
| 001 1010 | 032 | 26 | 1A | SUB | ˆZ | Substitute |
| 001 1011 | 033 | 27 | 1B | ESC | ˆ[ | Escape |
| 001 1100 | 034 | 28 | 1C | FS | ˆ\ | File separator |
| 001 1101 | 035 | 29 | 1D | GS | ˆ] | Group separator |
| 001 1110 | 036 | 30 | 1E | RS | ˆˆ | Record separator |
| 001 1111 | 037 | 31 | 1F | US | ˆ_ | Unit separator |
| 111 1111 | 177 | 127 | 7F | DEL | ˆ? | Delete |

Table 1.4
The ASCII printable characters
| Binary | Oct | Dec | Hex | Glyph |
| 010 0000 | 040 | 32 | 20 | (space) |
| 010 0001 | 041 | 33 | 21 | ! |
| 010 0010 | 042 | 34 | 22 | " |
| 010 0011 | 043 | 35 | 23 | # |
| 010 0100 | 044 | 36 | 24 | $ |
| 010 0101 | 045 | 37 | 25 | % |
| 010 0110 | 046 | 38 | 26 | & |
| 010 0111 | 047 | 39 | 27 | ' |
| 010 1000 | 050 | 40 | 28 | ( |
| 010 1001 | 051 | 41 | 29 | ) |
| 010 1010 | 052 | 42 | 2A | * |
| 010 1011 | 053 | 43 | 2B | + |
| 010 1100 | 054 | 44 | 2C | , |
| 010 1101 | 055 | 45 | 2D | - |
| 010 1110 | 056 | 46 | 2E | . |
| 010 1111 | 057 | 47 | 2F | / |
| 011 0000 | 060 | 48 | 30 | 0 |
| 011 0001 | 061 | 49 | 31 | 1 |
| 011 0010 | 062 | 50 | 32 | 2 |
| 011 0011 | 063 | 51 | 33 | 3 |
| 011 0100 | 064 | 52 | 34 | 4 |
| 011 0101 | 065 | 53 | 35 | 5 |
| 011 0110 | 066 | 54 | 36 | 6 |
| 011 0111 | 067 | 55 | 37 | 7 |
| 011 1000 | 070 | 56 | 38 | 8 |
| 011 1001 | 071 | 57 | 39 | 9 |
| 011 1010 | 072 | 58 | 3A | : |
| 011 1011 | 073 | 59 | 3B | ; |
| 011 1100 | 074 | 60 | 3C | < |
| 011 1101 | 075 | 61 | 3D | = |
| 011 1110 | 076 | 62 | 3E | > |
| 011 1111 | 077 | 63 | 3F | ? |
| 100 0000 | 100 | 64 | 40 | @ |
| 100 0001 | 101 | 65 | 41 | A |
| 100 0010 | 102 | 66 | 42 | B |
| 100 0011 | 103 | 67 | 43 | C |
| 100 0100 | 104 | 68 | 44 | D |
| 100 0101 | 105 | 69 | 45 | E |
| 100 0110 | 106 | 70 | 46 | F |
| 100 0111 | 107 | 71 | 47 | G |
| 100 1000 | 110 | 72 | 48 | H |
| 100 1001 | 111 | 73 | 49 | I |
| 100 1010 | 112 | 74 | 4A | J |
| 100 1011 | 113 | 75 | 4B | K |
| 100 1100 | 114 | 76 | 4C | L |
| 100 1101 | 115 | 77 | 4D | M |
| 100 1110 | 116 | 78 | 4E | N |
| 100 1111 | 117 | 79 | 4F | O |
| 101 0000 | 120 | 80 | 50 | P |
| 101 0001 | 121 | 81 | 51 | Q |
| 101 0010 | 122 | 82 | 52 | R |
| 101 0011 | 123 | 83 | 53 | S |
| 101 0100 | 124 | 84 | 54 | T |
| 101 0101 | 125 | 85 | 55 | U |
| 101 0110 | 126 | 86 | 56 | V |
| 101 0111 | 127 | 87 | 57 | W |
| 101 1000 | 130 | 88 | 58 | X |
| 101 1001 | 131 | 89 | 59 | Y |
| 101 1010 | 132 | 90 | 5A | Z |
| 101 1011 | 133 | 91 | 5B | [ |
| 101 1100 | 134 | 92 | 5C | \ |
| 101 1101 | 135 | 93 | 5D | ] |
| 101 1110 | 136 | 94 | 5E | ^ |
| 101 1111 | 137 | 95 | 5F | _ |
| 110 0000 | 140 | 96 | 60 | ` |
| 110 0001 | 141 | 97 | 61 | a |
| 110 0010 | 142 | 98 | 62 | b |
| 110 0011 | 143 | 99 | 63 | c |
| 110 0100 | 144 | 100 | 64 | d |
| 110 0101 | 145 | 101 | 65 | e |
| 110 0110 | 146 | 102 | 66 | f |
| 110 0111 | 147 | 103 | 67 | g |
| 110 1000 | 150 | 104 | 68 | h |
| 110 1001 | 151 | 105 | 69 | i |
| 110 1010 | 152 | 106 | 6A | j |
| 110 1011 | 153 | 107 | 6B | k |
| 110 1100 | 154 | 108 | 6C | l |
| 110 1101 | 155 | 109 | 6D | m |
| 110 1110 | 156 | 110 | 6E | n |
| 110 1111 | 157 | 111 | 6F | o |
| 111 0000 | 160 | 112 | 70 | p |
| 111 0001 | 161 | 113 | 71 | q |
| 111 0010 | 162 | 114 | 72 | r |
| 111 0011 | 163 | 115 | 73 | s |
| 111 0100 | 164 | 116 | 74 | t |
| 111 0101 | 165 | 117 | 75 | u |
| 111 0110 | 166 | 118 | 76 | v |
| 111 0111 | 167 | 119 | 77 | w |
| 111 1000 | 170 | 120 | 78 | x |
| 111 1001 | 171 | 121 | 79 | y |
| 111 1010 | 172 | 122 | 7A | z |
| 111 1011 | 173 | 123 | 7B | { |
| 111 1100 | 174 | 124 | 7C | | |
| 111 1101 | 175 | 125 | 7D | } |
| 111 1110 | 176 | 126 | 7E | ~ |


The non-printing characters are used to provide hints or commands to the device that is receiving, displaying, or printing the data. The FF character, when sent to a printer, will cause the printer to eject the current page and begin a new one. The LF character causes the printer or terminal to end the current line and begin a new one. The CR character causes the terminal or printer to move to the beginning of the current line. Many text editing programs allow the user to enter these non-printing characters by using the control key on the keyboard. For instance, to enter the BEL character, the user would hold the control key down and press the G key. This character, when sent to a character display terminal, will cause it to emit a beep. Many of the other control characters can be used to control specific features of the printer, display, or other device that the data is being sent to.
Suppose we wish to convert a string of characters, such as “Hello World”, to an ASCII representation. We can use an 8-bit byte to store each character. It is also common practice to include an additional byte at the end of the string. This additional byte holds the ASCII NUL character, which marks the end of the string. Such an arrangement is referred to as a null-terminated string.
To convert the string “Hello World” into a null-terminated string, we can build a table with each character on the left and its equivalent binary, octal, hexadecimal, or decimal value (as defined in the ASCII table) on the right. Table 1.5 shows the characters in “Hello World” and their equivalent binary representations, found by looking in Table 1.4. Since most modern computers use 8-bit bytes (or multiples thereof) as the basic storage unit, an extra zero bit is shown in the most significant bit position.
Table 1.5
Binary equivalents for each character in “Hello World”
| Character | Binary |
| H | 01001000 |
| e | 01100101 |
| l | 01101100 |
| l | 01101100 |
| o | 01101111 |
| (space) | 00100000 |
| W | 01010111 |
| o | 01101111 |
| r | 01110010 |
| l | 01101100 |
| d | 01100100 |
| NUL | 00000000 |
Reading the Binary column from top to bottom results in the following sequence of bytes: 01001000 01100101 01101100 01101100 01101111 00100000 01010111 01101111 01110010 01101100 01100100 00000000. To convert the same string to a hexadecimal representation, we can use the shortcut method that was introduced previously to convert each 4-bit nibble into its hexadecimal equivalent, or read the hexadecimal value from the ASCII table. Table 1.6 shows the result of extending Table 1.5 to include hexadecimal and decimal equivalents for each character. The string can now be converted to hexadecimal or decimal simply by reading the correct column in the table. So “Hello World” expressed as a null-terminated string in hexadecimal is “48 65 6C 6C 6F 20 57 6F 72 6C 64 00” and in decimal it is “72 101 108 108 111 32 87 111 114 108 100 0”.
Table 1.6
Binary, hexadecimal, and decimal equivalents for each character in “Hello World”
| Character | Binary | Hexadecimal | Decimal |
| H | 01001000 | 48 | 72 |
| e | 01100101 | 65 | 101 |
| l | 01101100 | 6C | 108 |
| l | 01101100 | 6C | 108 |
| o | 01101111 | 6F | 111 |
| (space) | 00100000 | 20 | 32 |
| W | 01010111 | 57 | 87 |
| o | 01101111 | 6F | 111 |
| r | 01110010 | 72 | 114 |
| l | 01101100 | 6C | 108 |
| d | 01100100 | 64 | 100 |
| NUL | 00000000 | 00 | 0 |

It is sometimes necessary to convert a string of bytes in hexadecimal into ASCII characters. This is accomplished simply by building a table with the hexadecimal value of each byte in the left column, then looking in the ASCII table for each value and entering the equivalent character representation in the right column. Table 1.7 shows how the ASCII table is used to interpret the hexadecimal string “466162756C6F75732100” as an ASCII string.
ASCII was developed to encode all of the most commonly used characters in North American English text. The encoding uses only 128 of the 256 codes that are available in an 8-bit byte. ASCII does not include symbols frequently used in other countries, such as the British pound symbol (£) or accented characters (ü). However, the International Organization for Standardization (ISO) has created several extensions to ASCII to enable the representation of characters from a wider variety of languages.
The ISO has defined a set of related standards known collectively as ISO 8859. ISO 8859 is an 8-bit extension to ASCII which includes the 128 ASCII characters along with an additional 128 characters, such as the British Pound symbol and the American cent symbol. Several variations of the ISO 8859 standard exist for different language families. Table 1.8 provides a brief description of the various ISO standards.
Table 1.8
Variations of the ISO 8859 standard
| Name | Alias | Languages |
| ISO8859-1 | Latin-1 | Western European languages |
| ISO8859-2 | Latin-2 | Non-Cyrillic Central and Eastern European languages |
| ISO8859-3 | Latin-3 | Southern European languages and Esperanto |
| ISO8859-4 | Latin-4 | Northern European and Baltic languages |
| ISO8859-5 | Latin/Cyrillic | Slavic languages that use a Cyrillic alphabet |
| ISO8859-6 | Latin/Arabic | Common Arabic language characters |
| ISO8859-7 | Latin/Greek | Modern Greek language |
| ISO8859-8 | Latin/Hebrew | Modern Hebrew language |
| ISO8859-9 | Latin-5 | Turkish |
| ISO8859-10 | Latin-6 | Nordic languages |
| ISO8859-11 | Latin/Thai | Thai language |
| ISO8859-12 | Latin/Devanagari | Never completed. Abandoned in 1997 |
| ISO8859-13 | Latin-7 | Some Baltic languages not covered by Latin-4 or Latin-6 |
| ISO8859-14 | Latin-8 | Celtic languages |
| ISO8859-15 | Latin-9 | Update to Latin-1 that replaces some characters. Most notably, it includes the euro symbol (€), which did not exist when Latin-1 was created |
| ISO8859-16 | Latin-10 | Covers several languages not covered by Latin-9 and includes the euro symbol (€) |
Although the ISO extensions helped to standardize text encodings for several languages that were not covered by ASCII, there were still some issues. The first issue is that the display and input devices must be configured for the correct encoding, and displaying or printing documents with multiple encodings requires some mechanism for changing the encoding on-the-fly. Another issue has to do with the lexicographical ordering of characters. Although two languages may share a character, that character may appear in a different place in the alphabets of the two languages. This leads to issues when programmers need to sort strings into lexicographical order. The ISO extensions help to unify character encodings across multiple languages, but do not solve all of the issues involved in defining a universal character set.
In the late 1980s, there was growing interest in developing a universal character encoding for all languages. People from several computer companies worked together and, by 1990, had developed a draft standard for Unicode. In 1991, the Unicode Consortium was formed and charged with guiding and controlling the development of Unicode. The Unicode Consortium has worked closely with the ISO to define, extend, and maintain the international standard for a Universal Character Set (UCS). This standard is known as the ISO/IEC 10646 standard. The ISO/IEC 10646 standard defines the mapping of code points (numbers) to glyphs (characters), but does not specify character collation or other language-dependent properties. UCS code points are commonly written in the form U+XXXX, where XXXX is the numerical code point in hexadecimal. For example, the code point for the ASCII DEL character is written as U+007F. Unicode extends the ISO/IEC standard and specifies language-specific features.
Originally, Unicode was designed as a 16-bit encoding. It was not fully backward-compatible with ASCII, and could encode only 65,536 code points. Eventually, the Unicode character set grew to encompass 1,112,064 code points (17 planes of 65,536 code points each, minus the 2,048 surrogate values reserved by UTF-16), which requires 21 bits per character for a straightforward binary encoding. By early 1992, it was clear that some clever and efficient method for encoding character data was needed.
UTF-8 (UCS Transformation Format-8-bit) was proposed and accepted as a standard in 1993. UTF-8 is a variable-width encoding that can represent every character in the Unicode character set using between one and four bytes. It was designed to be backward compatible with ASCII and to avoid the major issues of previous encodings. Code points in the Unicode character set with lower numerical values tend to occur more frequently than code points with higher numerical values. UTF-8 encodes frequently occurring code points with fewer bytes than those which occur less frequently. For example, the first 128 characters of the UTF-8 encoding are exactly the same as the ASCII characters, requiring only 7 bits to encode each ASCII character. Thus any valid ASCII text is also valid UTF-8 text. UTF-8 is now the most common character encoding for the World Wide Web, and is the recommended encoding for email messages.
In November 2003, UTF-8 was restricted by RFC 3629 to end at code point U+10FFFF. This allows UTF-8 to encode 1,114,112 code points (U+0000 through U+10FFFF), which is slightly more than the 1,112,064 valid code points defined in the ISO/IEC 10646 standard. Table 1.9 shows how ISO/IEC 10646 code points are mapped to a variable-length encoding in UTF-8. Note that the encoding allows each byte in a stream of bytes to be placed in one of the following three distinct categories:
Table 1.9
UTF-8 encoding of the ISO/IEC 10646 code points
| UCS Bits | First Code Point | Last Code Point | Bytes | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
| 7 | U+0000 | U+007F | 1 | 0xxxxxxx | | | |
| 11 | U+0080 | U+07FF | 2 | 110xxxxx | 10xxxxxx | | |
| 16 | U+0800 | U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 21 | U+10000 | U+10FFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |

1. If the most significant bit of a byte is zero, then it is a single-byte character, and is completely ASCII-compatible.
2. If the two most significant bits in a byte are set to one, then the byte is the beginning of a multi-byte character.
3. If the most significant bit is set to one, and the second most significant bit is set to zero, then the byte is part of a multi-byte character, but is not the first byte in that sequence.
The UTF-8 encoding of the UCS characters has several important features:
Backwards compatible with ASCII: This allows the vast number of existing ASCII documents to be interpreted as UTF-8 documents without any conversion.
Self-synchronization: Because of the way code points are assigned, it is possible to find the beginning of each character by looking only at the top two bits of each byte. This can have important performance implications when performing searches in text.
Encoding of code sequence length: The number of bytes in the sequence is indicated by the pattern of bits in the first byte of the sequence. Thus, the beginning of the next character can be found quickly. This feature can also have important performance implications when performing searches in text.
Efficient code structure: UTF-8 efficiently encodes the UCS code points. The high-order bits of the code point go in the lead byte. Lower-order bits are placed in continuation bytes. The number of bytes in the encoding is the minimum required to hold all the significant bits of the code point.
Easily extended to include new languages: This feature will be greatly appreciated when we contact intelligent species from other star systems.
With UTF-8 encoding, the first 128 characters of the UCS are each encoded in a single byte. The next 1,920 characters require two bytes to encode. The two-byte encoding covers almost all Latin alphabets, and also Arabic, Armenian, Cyrillic, Coptic, Greek, Hebrew, Syriac and Tāna alphabets. It also includes combining diacritical marks, which are used in combination with another character, such as á, ñ, and ö. Most of the Chinese, Japanese, and Korean (CJK) characters are encoded using three bytes. Four bytes are needed for the less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).
Consider the UTF-8 encoding for the British Pound symbol (£), which is UCS code point U+00A3. Since the code point is greater than 7F₁₆ but less than 800₁₆, it will require two bytes to encode. The encoding will be 110xxxxx 10xxxxxx, where the x characters are replaced with the 11 least-significant bits of the code point, which are 00010100011. Thus, the character £ is encoded in UTF-8 as 11000010 10100011 in binary, or C2 A3 in hexadecimal.
The UCS code point for the Euro symbol (€) is U+20AC. Since the code point is between 800₁₆ and FFFF₁₆, it will require three bytes to encode in UTF-8. The three-byte encoding is 1110xxxx 10xxxxxx 10xxxxxx, where the x characters are replaced with the 16 least-significant bits of the code point. In this case the code point in binary is 0010000010101100. Therefore, the UTF-8 encoding for € is 11100010 10000010 10101100 in binary, or E2 82 AC in hexadecimal.
In summary, there are three components to modern language support. The ISO/IEC 10646 standard defines a mapping from code points (numbers) to glyphs (characters). UTF-8 defines an efficient variable-length encoding for code points (text data) in the ISO/IEC 10646 standard. Unicode adds language-specific properties to the ISO/IEC 10646 character set. Together, these three elements currently provide support for textual data in almost every human written language, and they continue to be extended and refined.
Computer memory consists of a number of storage locations, or cells, each of which has a unique numeric address. Addresses are usually written in hexadecimal. Each storage location can contain a fixed number of binary digits. The most common size is one byte. Most computers group bytes together into words. A computer CPU that is capable of accessing a single byte of memory is said to have byte addressable memory. Some CPUs are only capable of accessing memory in word-sized groups. They are said to have word addressable memory.
Fig. 1.6A shows a section of memory containing some data. Each byte has a unique address that is used when data is transferred to or from that memory cell. Most processors can also move data in word-sized chunks. On a 32-bit system, four bytes are grouped together to form a word. There are two ways that this grouping can be done. Systems that store the most significant byte of a word in the smallest address, and the least significant byte in the largest address, are said to be big-endian. The big-endian interpretation of a region of memory is shown in Fig. 1.6B. As shown in Fig. 1.6C, little-endian systems store the least significant byte in the lowest address and the most significant byte in the highest address. Some processors, such as the ARM, can be configured as either little-endian or big-endian. The Linux operating system, by default, configures the ARM processor to run in little-endian mode.

The memory layout for a typical program is shown in Fig. 1.7. The program is divided into four major memory regions, or sections. The programmer specifies the contents of the Text and Data sections. The Stack and Heap segments are defined when the program is loaded for execution. The Stack and Heap may grow and shrink as the program executes, while the Text and Data segments are set to fixed sizes by the compiler, linker, and loader. The Text section contains the executable instructions, while the Data section contains constants and statically allocated variables. The sizes of the Text and Data segments depend on how large the program is, and how much static data storage has been declared by the programmer. The heap contains variables that are allocated dynamically, and the stack is used to store parameters for function calls, return addresses, and local (automatic) variables.

In a high-level language, storage space for a variable can be allocated in one of three ways: statically, dynamically, or automatically. Statically allocated variables are allocated from the .data section. The storage space is reserved, and usually initialized, when the program is loaded and begins execution. The address of a statically allocated variable is fixed at the time the program begins running, and cannot be changed. Automatically allocated variables, often referred to as local variables, are stored on the stack. The stack pointer is adjusted down to make space for the newly allocated variable. The address of an automatic variable is always computed as an offset from the stack pointer. Dynamic variables are allocated from the heap, using malloc, new, or a language-dependent equivalent. The address of a dynamic variable is always stored in another variable, known as a pointer, which may be an automatic or static variable, or even another dynamic variable. The four major sections of program memory correspond to executable code, statically allocated variables, dynamically allocated variables, and automatically allocated variables.
There are several reasons for Computer Scientists and Computer Engineers to learn at least one assembly language. There are programming tasks that can only be performed using assembly language, and some tasks can be written to run much more efficiently and/or quickly if written in assembly language. Programmers with assembly language experience tend to write better code even when using a high-level language, and are usually better at finding and fixing bugs.
Although it is possible to construct a computer capable of performing arithmetic in any base, it is much cheaper to build one that works in base two. It is relatively easy to build an electrical circuit with two states, using two discrete voltage levels, but much more difficult to build a stable circuit with 10 discrete voltage levels. Therefore, modern computers work in base two.
Computer data can be viewed as simple bit strings. The programmer is responsible for supplying interpretations to give meaning to those bit strings. A set of bits can be interpreted as a number, a character, or anything that the programmer chooses. There are standard methods for encoding and interpreting characters and numbers. Fig. 1.4 shows some common methods for encoding integers. The most common encodings for characters are UTF-8 and ASCII.
Computer memory can be viewed as a sequence of bytes. Each byte has a unique address. A running program has four regions of memory. One region holds the executable code. The other three regions hold different types of variables.
1.1 What is the two’s complement of 11011101?
1.2 Perform the base conversions to fill in the blank spaces in the following table:
1.3 What is the 8-bit ASCII binary representation for the following characters?
(b) “a”
(c) “!”
1.4 What is \ minus ! given that \ and ! are ASCII characters? Give your answer in binary.
1.5 (a) Convert the string “Super!” to its ASCII representation. Show your result as a sequence of hexadecimal values.
(b) Convert the hexadecimal sequence into a sequence of values in base four.
1.6 Suppose that the string “This is a nice day” is stored beginning at address 4B3269AC₁₆. What are the contents of the byte at address 4B3269B1₁₆ in hexadecimal?
1.7 (a) Convert 101101₂ to base ten.
(b) Convert 1023₁₀ to base nine.
(c) Convert 1023₁₀ to base two.
(d) Convert 301₁₀ to base 16.
(e) Convert 301₁₀ to base 2.
(f) Represent 301₁₀ as a null-terminated ASCII string (write your answer in hexadecimal).
(g) Convert 3420₅ to base ten.
(h) Convert 2314₅ to base nine.
(i) Convert 116₇ to base three.
(j) Convert 1294₁₁ to base 5.
1.8 Given the following binary string:
01001001 01110011 01101110 00100111 01110100 00100000 01000001 01110011 01110011 01100101 01101101 01100010 01101100 01111001 00100000 01000110 01110101 01101110 00111111 00000000
(a) Convert it to a hexadecimal string.
(b) Convert the first four bytes to a string of base ten numbers.
(c) Convert the first (little-endian) halfword to base ten.
(d) Convert the first (big-endian) halfword to base ten.
(e) If this string of bytes were sent to an ASCII printer or terminal, what would be printed?
1.9 The number 1,234,567 is stored as a 32-bit word starting at address F0439000₁₆. Show the address and contents of each byte of the 32-bit word on a
(a) little-endian system.
(b) big-endian system.
1.10 The ISO/IEC 10646 standard defines 1,112,064 code points (glyphs). Each code point could be encoded using 24 bits, or three bytes. The UTF-8 encoding uses up to four bytes to encode a code point. Give three reasons why UTF-8 is preferred over a simple 3-byte per code point encoding.
1.11 UTF-8 is often referred to as Unicode. Why is this not correct?
1.12 Skilled assembly programmers can convert small numbers between binary, hexadecimal, and decimal in their heads. Without referring to any tables or using a calculator or pencil, fill in the blanks in the following table:
1.13 What are the differences between a CPU register and a memory location?
This chapter begins with a high-level description of assembly language and the assembler. It then explains the five elements of assembly language syntax, and gives some examples. It then goes into more depth about how the assembler converts assembly language files into object files, which are then linked with other object files to create an executable file. Then it explains the most commonly used directives for the GNU assembler, and gives some examples to help relate the assembly code to equivalent C code.
Compiler; Assembler; Linker; Labels; Comments; Directives; Instructions; Sections; Symbols
All modern computers consist of three main components: the central processing unit (CPU), memory, and devices. It can be argued that the major factor that distinguishes one computer from another is the CPU architecture. The architecture determines the set of instructions that can be performed by the CPU. The human-readable language which is closest to the CPU architecture is assembly language.
When a new processor architecture is developed, its creators also define an assembly language for the new architecture. In most cases, a precise assembly language syntax is defined and an assembler is created by the processor developers. Because of this, there is no single syntax for assembly language, although most assembly languages are similar in many ways and have certain elements in common.
The GNU assembler (GAS) is a highly portable re-configurable assembler. GAS uses a simple, general syntax that works for a wide variety of architectures. Although the syntax used by GAS for the ARM processor is slightly different from the syntax defined by the developers of the ARM processor, it provides the same capabilities.
An assembly program consists of four basic elements: assembler directives, labels, assembly instructions, and comments. Assembler directives allow the programmer to reserve memory for the storage of variables, control which program section is being used, define macros, include other files, and perform other operations that control the conversion of assembly instructions into machine code. The assembly instructions are given as mnemonics, or short character strings that are easier for human brains to remember than sequences of binary, octal, or hexadecimal digits. Each assembly instruction may have an optional label, and most assembly instructions require the programmer to specify one or more operands.
Most assembly language programs are written in lines of 80 characters organized into four columns. The first column is for optional labels. The second column is for assembly instructions or assembler directives. The third column is for specifying operands, and the fourth column is for comments. Traditionally, the first two columns are 8 characters wide, the third column is 16 characters wide, and the last column is 48 characters wide. However, most modern assemblers (including GAS) do not require fixed column widths. Listing 2.1 shows a basic “Hello World” program written in GNU ARM Assembly to run under Linux. For comparison, Listing 2.2 shows an equivalent program written in C. The assembly language version of the program is significantly longer than the C version, and will only work on an ARM processor. The C version is at a higher level of abstraction, and can be compiled to run on any system that has a C compiler. Thus, C is referred to as a high-level language, and assembly is a low-level language.


Most modern assemblers are called two-pass assemblers because they read the input file twice. On the first pass, the assembler keeps track of the location of each piece of data and each instruction, and assigns an address or numerical value to each label and symbol in the input file. The main goal of the first pass is to build a symbol table, which maps each label or symbol to a numerical value.
On the second pass, the assembler converts the assembly instructions and data declarations into binary, using the symbol table to supply numerical values whenever they are needed. In Listing 2.1, there are two labels: main and str. During assembly, those labels are assigned the value of the address counter at the point where they appear. Labels can be used anywhere in the program to refer to the address of data, functions, or blocks of code. In GNU assembly syntax, labels always end with a colon (:) character.
There are two basic comment styles: multi-line and single-line. Multi-line comments start with /* and everything is ignored until a matching sequence of */ is found. These comments are exactly the same as multi-line comments in C and C++. In ARM assembly, single line comments begin with an @ character, and all remaining characters on the line are ignored. Listing 2.1 shows both types of comment. In addition, if the name of the file ends in .S, then single line comments can begin with //. If the file name does not end with a capital .S, then the // syntax is not allowed.
Directives are used mainly to define symbols, allocate storage, and control the behavior of the assembler, allowing the programmer to control how the assembler does its job. The GNU assembler has many directives, but assembly programmers typically need to know only a few of them. All assembler directives begin with a period “.” which is followed by a sequence of letters, usually in lower case. Listing 2.1 uses the .data, .asciz, .text, and .globl directives. The most commonly used directives are discussed later in this chapter. There are many other directives available in the GNU Assembler which are not covered here. Complete documentation is available online as part of the GNU Binutils package.
Assembly instructions are the program statements that will be executed on the CPU. Most instructions cause the CPU to perform one low-level operation. In most assembly languages, operations can be divided into a few major types. Some instructions move data from one location to another. Others perform addition, subtraction, and other computational operations. Another class of instructions is used to perform comparisons and control which part of the program is to be executed next. Chapters 3 and 4 explain most of the assembly instructions that are available on the ARM processor.
Listing 2.3 shows how the GNU assembler will assemble the “Hello World” program from Listing 2.1. The assembler converts the string on input line 2 into the binary representation of the string. The results are shown in hexadecimal in the Code column of the listing. The first byte of the string is stored at address zero in the .data section of the program, as shown by the 0000 in the Addr column on line 2.

On line 4, the assembler switches to the .text section of the program and begins converting instructions into binary. The first instruction, on line 9, is converted into its 4-byte machine code, 00402DE9₁₆, and stored at location 0000 in the .text section of the program, as shown in the Code and Addr columns on line 6.
Next, the assembler converts the ldr instruction on line 10 into the four-byte machine instruction 0C009FE5₁₆ and stores it at address 0004. It repeats this process with each remaining instruction until the end of the program. The assembler writes the resulting data into a specially formatted file, called an object file. Note that the assembler was unable to locate the printf function. The linker will take care of that. The object file created by the assembler, hello.o, contains the data in the Code column of Listing 2.3, along with information to help the linker to link (or “patch”) the instruction on line 11 so that printf is called correctly.
After creating the object file, the next step in creating an executable program would be to invoke the linker and request that it link hello.o with the C standard library. The linker will generate the final executable file, containing the code assembled from hello.S, along with the printf function and other start-up code from the C standard library. The GNU C compiler is capable of automatically invoking the assembler for files that end in .s or .S, and can also be used to invoke the linker. For example, if Listing 2.1 is stored in a file named hello.S in the current directory, then the command
gcc -o hello hello.S
will run the GNU C compiler, telling it to create an executable program file named hello, and to use hello.S as the source file for the program. The C compiler will notice the .S extension, and invoke the assembler to create an object file which is stored in a temporary file, possibly named hello.o. Then the C compiler will invoke the linker to link hello.o with the C standard library, which provides the printf function and some start-up code which calls the main function. The linker will create an executable file named hello. When the linker has finished, the C compiler will remove the temporary object file.
Each processor architecture has its own assembly language, created by the designers of the architecture. Although there are many similarities between assembly languages, the designers may choose different names for various directives. The GNU assembler supports a relatively large set of directives, some of which have more than one name. This is because it is designed to handle assembling code for many different processors without drastically changing the assembly language designed by the processor manufacturers. We will now cover some of the most commonly used directives for the GNU assembler.
The instructions and data that make up a program are stored in different sections of the program file. There are several standard sections that the programmer can choose to put code and data in. Sections can also be further divided into numbered subsections. Each section has its own address counter, which is used to keep track of the location of bytes within that section. When a label is encountered, it is assigned the value of the current address counter for the currently active section.
Selecting a section and subsection is done by using the appropriate assembly directive. Once a section has been selected, all of the instructions and/or data will go into that section until another section is selected. The most important directives for selecting a section are:
.data subsection

Instructs the assembler to append the following instructions or data to the end of the data subsection numbered subsection. If the subsection number is omitted, it defaults to zero. This section is normally used for global variables and constants which have labels.
.text subsection

Tells the assembler to append the following statements to the end of the text subsection numbered subsection. If the subsection number is omitted, subsection number zero is used. This section is normally used for executable instructions, but may also contain constant data.
.bss subsection

The bss (short for Block Started by Symbol) section is used for defining data storage areas that should be initialized to zero at the beginning of program execution. The .bss directive tells the assembler to append the following statements to the end of the bss subsection numbered subsection. If the subsection number is omitted, subsection number zero is used. This section is normally used for global variables which need to be initialized to zero. Regardless of what is placed into the section at compile-time, all bytes will be set to zero when the program begins executing. This section does not actually consume any space in the object or executable file. It is really just a request for the loader to reserve some space when the program is loaded into memory.
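As a minimal sketch of how these three directives partition a program (the label names and values here are illustrative, not taken from the text):

```
        .data                   @ initialized global variables
count:  .word   100             @ a labeled, initialized word
        .text                   @ executable instructions
main:   mov     r0, #0
        .bss                    @ zero-initialized storage
buffer: .skip   64              @ 64 bytes, zeroed when the program is loaded
```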
.section name

In addition to the three common sections, the programmer can create other sections using the .section directive. However, in order for custom sections to be linked into a program, the linker must be made aware of them. Controlling the linker is covered in Section 14.4.3.
There are several directives that allow the programmer to allocate and initialize static storage space for variables and constants. The assembler supports bytes, integer types, floating point types, and strings. These directives are used to allocate a fixed amount of space in memory and optionally initialize the memory. Some of these directives allow the memory to be initialized using an expression. An expression can be a simple integer, or a C-style expression. The directives for allocating storage are as follows:
.byte expressions

The .byte directive expects zero or more expressions, separated by commas. Each expression is assembled into the next byte. If no expressions are given, then the address counter is not advanced and no bytes are reserved.
.hword expressions
.short expressions
For the ARM processor, these two directives do exactly the same thing. They expect zero or more expressions, separated by commas, and emit a 16-bit number for each expression. If no expressions are given, then the address counter is not advanced and no bytes are reserved.
.word expressions
.long expressions
For the ARM processor, these two directives do exactly the same thing. They expect zero or more expressions, separated by commas. They will emit four bytes for each expression given. If no expressions are given, then the address counter is not advanced and no bytes are reserved.
.ascii "string"

The .ascii directive expects zero or more string literals, each enclosed in quotation marks and separated by commas. It assembles each string (with no trailing ASCII NULL character) into consecutive addresses.
.asciz "string"
.string "string"
The .asciz directive is similar to the .ascii directive, but each string is followed by an ASCII NULL character (zero). The “z” in .asciz stands for zero. .string is just another name for .asciz.
.float flonums
.single flonums
This directive assembles zero or more floating point numbers, separated by commas. On the ARM, they are 4-byte IEEE standard single precision numbers. .float and .single are synonymous.
.double flonums

The .double directive expects zero or more floating point numbers, separated by commas. On the ARM, they are stored as 8-byte IEEE standard double precision numbers.
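Taken together, these storage directives might be used as follows (a sketch; the label names and initial values are illustrative, not taken from the figures):

```
        .data
i:      .word   0               @ four bytes, initialized to 0
j:      .hword  0x1234          @ two bytes
ch:     .byte   'A'             @ one byte
e:      .double 2.718281828     @ eight-byte IEEE double precision
fmt:    .asciz  "%d\n"          @ four bytes, including the trailing NULL
```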
Fig. 2.1A shows how these directives are used to declare variables and constants. Fig. 2.1B shows the equivalent statements for creating global variables in C or C++. Note that in both cases, the variables created will be visible anywhere within the file that they are declared, but not visible in other files which are linked into the program.

In C, the declaration of an array can be performed by leaving out the number of elements and specifying an initializer, as shown in the last three lines of Fig. 2.1B. In assembly, the equivalent is accomplished by providing a label, a type, and a list of values, as shown in the last three lines of Fig. 2.1A. The syntax is different, but the result is precisely the same.
Listing 2.4 shows how the assembler assigns addresses to these labels. The second column of the listing shows the address (in hexadecimal) that is assigned to each label. The variable i is assigned the first address. Since it is a word variable, the address counter is incremented by four bytes and the next address is assigned to the variable j. The address counter is incremented again, and fmt is assigned the address 0008. The fmt variable consumes seven bytes, so the ch variable gets address 000f. Finally, the array of words named ary begins at address 0012. Note that 12₁₆ = 18₁₀ is not evenly divisible by four, which means that the word variables in ary are not aligned on word boundaries.

On the ARM CPU, data can be moved to and from memory one byte at a time, two bytes at a time (half-word), or four bytes at a time (word). Moving a word between the CPU and memory takes significantly more time if the address of the word is not aligned on a four-byte boundary (one where the least significant two bits are zero). Similarly, moving a half-word between the CPU and memory takes significantly more time if the address of the half-word is not aligned on a two-byte boundary (one where the least significant bit is zero). Therefore, when declaring storage, it is important that words and half-words are stored on appropriate boundaries. The following directives allow the programmer to insert as much space as necessary to align the next item on any boundary desired.
.align abs-expr, abs-expr, abs-expr
Pad the location counter (in the current subsection) to a particular storage boundary. For the ARM processor, the first expression specifies the number of low-order zero bits the location counter must have after advancement. The second expression gives the fill value to be stored in the padding bytes. It (and the comma) may be omitted. If it is omitted, then the fill value is assumed to be zero. The third expression is also optional. If it is present, it is the maximum number of bytes that should be skipped by this alignment directive.
.balign[wl] abs-expr, abs-expr, abs-expr
These directives adjust the location counter to a particular storage boundary. The first expression is the byte-multiple for the alignment request. For example, .balign 16 will insert fill bytes until the location counter is an even multiple of 16. If the location counter is already a multiple of 16, then no fill bytes will be created. The second expression gives the fill value to be stored in the fill bytes. It (and the comma) may be omitted. If it is omitted, then the fill value is assumed to be zero. The third expression is also optional. If it is present, it is the maximum number of bytes that should be skipped by this alignment directive.
The .balignw and .balignl directives are variants of the .balign directive. The .balignw directive treats the fill pattern as a 2-byte word value, and .balignl treats the fill pattern as a 4-byte long word value. For example, “.balignw 4,0x368d” will align to a multiple of four bytes. If it skips two bytes, they will be filled in with the value 0x368d (the exact placement of the bytes depends upon the endianness of the processor).
.skip size, fill
.space size, fill
Sometimes it is desirable to allocate a large area of memory and initialize it all to the same value. This can be accomplished by using these directives. These directives emit size bytes, each of value fill. Both size and fill are absolute expressions. If the comma and fill are omitted, fill is assumed to be zero. For the ARM processor, the .space and .skip directives are equivalent. This directive is very useful for declaring large arrays in the .bss section.
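For example (an illustrative sketch, not taken from the listings), a .balign directive restores word alignment after byte data, and .skip reserves a large zero-filled buffer:

```
        .data
ch:     .byte   'A'             @ the address counter is now misaligned
        .balign 4               @ pad with zero bytes to a multiple of 4
ary:    .word   1, 2, 3         @ the word data is now word-aligned
        .bss
buf:    .skip   1024            @ reserve 1024 zero-filled bytes
```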
Listing 2.5 shows how the code in Listing 2.4 can be improved by adding an alignment directive at line 6. The directive causes the assembler to emit two zero bytes between the end of the ch variable and the beginning of the ary variable. These extra “padding” bytes cause the following word data to be word aligned, thereby improving performance when the word data is accessed. It is a good practice to always put an alignment directive after declaring character or half-word data.

The assembler provides support for setting and manipulating symbols that can then be used in other places within the program. The labels that can be assigned to assembly statements and directives are one type of symbol. The programmer can also declare other symbols and use them throughout the program. Such symbols may not have an actual storage location in memory, but they are included in the assembler’s symbol table, and can be used anywhere that their value is required. The most common use for defined symbols is to allow numerical constants to be declared in one place and easily changed. The .equ directive allows the programmer to use a label instead of a number throughout the program. This contributes to readability, and has the benefit that the constant value can then be easily changed every place that it is used, just by changing the definition of the symbol. The most important directives related to symbols are:
.equ symbol, expression
.set symbol, expression
This directive sets the value of symbol to expression. It is similar to the C language #define directive.
.equiv symbol, expression

The .equiv directive is like .equ and .set, except that the assembler will signal an error if the symbol is already defined.
.global symbol
.globl symbol
This directive makes the symbol visible to the linker. If symbol is defined within a file, and this directive is used to make it global, then it will be available to any file that is linked with the one containing the symbol. Without this directive, symbols are visible only within the file where they are defined.
.comm symbol, length

This directive declares symbol to be a common symbol, meaning that if it is defined in more than one file, then all instances should be merged into a single symbol. If the symbol is not defined anywhere, then the linker will allocate length bytes of uninitialized memory. If there are multiple definitions for symbol, and they have different sizes, the linker will merge them into a single instance using the largest size defined.
Listing 2.6 shows how the .equ directive can be used to create a symbol holding the number of elements in an array. The symbol arysize is defined as the value of the current address counter (denoted by the .) minus the value of the ary symbol, divided by four (each word in the array is four bytes). The listing shows all of the symbols defined in this program segment. Note that the four variables are shown to be in the data segment, and the arysize symbol is marked as an “absolute” symbol, which simply means that it is a number and not an address. The programmer can now use the symbol arysize to control looping when accessing the array data. If the size of the array is changed by adding or removing constant values, the value of arysize will change automatically, and the programmer will not have to search through the code to change the original value, 5, to some other value in every place it is used.
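Since Listing 2.6 is not reproduced in this text, a sketch in its spirit (the element values are illustrative) would be:

```
        .data
ary:    .word   10, 20, 30, 40, 50      @ five word-sized elements
        .equ    arysize, (. - ary) / 4  @ (current address - ary) / 4 = 5
```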

Sometimes it is desirable to skip assembly of portions of a file. The assembler provides some directives to allow conditional assembly. One use for these directives is to optionally assemble code to aid in debugging.
.if absolute expression

The .if directive marks the beginning of a section of code which is only considered part of the source program being assembled if the argument (which must be an absolute expression) is non-zero. The end of the conditional section of code must be marked by the .endif directive. Optionally, code may be included for the alternative condition by using the .else directive.
.ifdef symbol

Assembles the following section of code if the specified symbol has been defined.
.ifndef symbol

Assembles the following section of code if the specified symbol has not been defined.
.else

Assembles the following section of code only if the condition for the preceding .if or .ifdef was false.
.endif

Marks the end of a block of code that is only assembled conditionally.
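A typical use is to assemble extra debugging data only when a symbol is defined (a sketch; the symbol name and strings are illustrative):

```
        .equ    DEBUG, 1        @ remove this line to disable debug output

        .ifdef  DEBUG
dbgmsg: .asciz  "entering main\n"
        .else
dbgmsg: .asciz  ""
        .endif
```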
.include "file"

This directive provides a way to include supporting files at specified points in the source program. The code from the included file is assembled as if it followed the point of the .include directive. When the end of the included file is reached, assembly of the original file continues. The search paths used can be controlled with the -I command line parameter. Quotation marks are required around file. This assembler directive is similar to including header files in C and C++ using the #include compiler directive.
The directives .macro and .endm allow the programmer to define macros that the assembler expands to generate assembly code. The GNU assembler supports simple macros. Some other assemblers have much more powerful macro capabilities.
.macro macname
.macro macname macargs …
Begin the definition of a macro called macname. If the macro definition requires arguments, their names are specified after the macro name, separated by commas or spaces. The programmer can supply a default value for any macro argument by following the name with ‘=deflt’.
The following begins the definition of a macro called reserve_str, with two arguments. The first argument has a default value, but the second does not:
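The definition is not reproduced in this text; following the corresponding example in the GNU assembler manual, it presumably begins:

```
        .macro  reserve_str p1=0 p2
```

so that p1 defaults to 0 when it is omitted, while p2 must always be supplied.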

When a macro is called, the argument values can be specified either by position, or by keyword. For example, reserve_str 9,17 is equivalent to reserve_str p2=17,p1=9. After the definition is complete, the macro can be called either as reserve_str x,y (with \p1 evaluating to x and \p2 evaluating to y), or as reserve_str p2=y (with \p1 evaluating as the default, in this case 0, and \p2 evaluating to y). Other examples of valid .macro statements are:


.endm

End the current macro definition.
.exitm

Exit early from the current macro definition. This is usually used only within a .if or .ifdef directive.
\@

This is a pseudo-variable used by the assembler to maintain a count of how many macros it has executed. That number can be accessed with ‘\@’, but only within a macro definition.
The following definition specifies a macro SHIFT that will emit the instruction to shift a given register left by a specified number of bits. If the number of bits specified is negative, then it will emit the instruction to perform a right shift instead of a left shift.

After that definition, the following code:

will generate these instructions:

The meaning of these instructions will be covered in Chapters 3 and 4.
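The SHIFT listings above are not reproduced in this text; a sketch of such a macro (the argument names and register choices are assumptions, not taken from the original) could be:

```
        .macro  SHIFT reg, count
        .if     \count < 0
        mov     \reg, \reg, lsr #(0-\count) @ negative count: shift right
        .else
        mov     \reg, \reg, lsl #\count     @ non-negative count: shift left
        .endif
        .endm

        SHIFT   r0, 4           @ expands to: mov r0, r0, lsl #4
        SHIFT   r1, -2          @ expands to: mov r1, r1, lsr #2
```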
The following definition specifies a macro enum that puts a sequence of numbers into memory by using a recursive macro call to itself:

With that definition, ‘enum 0,5’ is equivalent to this assembly input:
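Neither the definition nor the expansion appears in this text; following the nearly identical recursive macro example in the GNU assembler manual, they are presumably:

```
        .macro  enum first=0, last=5
        .long   \first
        .if     \last-\first
        enum    "(\first+1)", \last     @ recurse until first reaches last
        .endif
        .endm

@ 'enum 0,5' then expands to:
@       .long 0
@       .long 1
@       .long 2
@       .long 3
@       .long 4
@       .long 5
```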

There are four elements to assembly syntax: labels, directives, instructions, and comments. Directives are used mainly to define symbols, allocate storage, and control the behavior of the assembler. The most common assembler directives were introduced in this chapter, but there are many other directives available in the GNU assembler. Complete documentation is available online as part of the GNU Binutils package.
Directives are used to declare statically allocated storage, which is equivalent to declaring global static variables in C. In assembly, labels and other symbols are visible only within the file in which they are declared, unless they are explicitly made visible to other files with the .global directive. In C, variables that are declared outside of any function are visible to all files in the program, unless the static keyword is used to make them visible only within the file where they are declared. Thus, both C and assembly support file and global scope for static variables, but with opposite defaults and different syntax.
Directives can also be used to declare macros. Macros are expanded by the assembler and may generate multiple statements. Careful use of macros can automate some simple tasks, allowing several lines of assembly code to be replaced with a single macro invocation.
2.1 What is the difference between
(a) the .data section and .bss section?
(b) the .ascii and .asciz directives?
(c) the .word and the .long directives?
2.2 What is the purpose of the .align assembler directive? What does “.align 2” do in GNU ARM assembly?
2.3 Assembly language has four main elements. What are they?
2.4 Using the directives presented in this chapter, show three different ways to create a null-terminated string containing the phrase “segmentation fault”.
2.5 What is the total memory, in bytes, allocated for the following variables?

2.6 Identify the directive(s), label(s), comment(s), and instruction(s) in the following code:

2.7 Write assembly code to declare variables equivalent to the following C code:

2.8 Show how to store the following text as a single string in assembly language, while making it readable and keeping each line shorter than 80 characters:
The three goals of the mission are:
1) Keep each line of code under 80 characters,
2) Write readable comments,
3) Learn a valuable skill for readability.
2.9 Insert the minimum number of .align directives necessary in the following code so that all word variables are aligned on word boundaries and all halfword variables are aligned on halfword boundaries, while minimizing the amount of wasted space.

2.10 Re-order the directives in the previous problem so that no .align directives are necessary to ensure proper alignment. How many bytes of storage were wasted by the original ordering of directives, compared to the new one?
2.11 What are the most important directives for selecting a section?
2.12 Why are .ascii and .asciz directives usually followed by an .align directive, but .word directives are not?
2.13 Using the “Hello World” program shown in Listing 2.1 as a template, write a program that will print your name.
2.14 Listing 2.3 shows that the assembler will assign the location 00000000₁₆ to the main symbol and also to the str symbol. Why does this not cause problems?
This chapter explains how a particular assembly language is related to the architectural design of a particular CPU family. It then gives an overview of the ARM architecture. Next, it describes the ARM register set and data paths, including the Process Status Register, and the flags which are used to control conditional execution. Then it introduces the concept of instructions and operands, and explains immediate data used as an operand. Next it describes the load and store instructions and all of the addressing modes available on the ARM processor. Then it explains the branch and conditional branch instructions. The chapter ends with some examples showing how the branch and link instruction can be used to call functions from the C standard library.
Architecture; Instruction set architecture; Data path; Register; Memory; Load; Store; Branch; Address; Addressing mode; Conditional execution; Function or subroutine call
The part of the computer architecture related to programming is referred to as the instruction set architecture (ISA). The ISA includes the set of registers that the user program can access, and the set of instructions that the processor supports, as well as data paths and processing elements within the processor. The first step in learning a new assembly language is to become familiar with the ISA. For most modern computer systems, data must be loaded in a register before it can be used for any data processing instruction, but there are a limited number of registers. Memory provides a place to store data that is not currently needed. Program instructions are also stored in memory and fetched into the CPU as they are needed. This chapter introduces the ISA for the ARM processor.
The CPU is composed of data storage and computational components connected together by a set of buses. The most important components of the CPU are the registers, where data is stored, and the arithmetic and logic unit (ALU), where arithmetic and logical operations are performed on the data. Some CPUs also have dedicated hardware units for multiplication and/or division. Fig. 3.1 shows the major components of the ARM CPU and the buses that connect the components together. These buses provide pathways for the data to move between the computational and storage components. The organization of the components and buses in a CPU govern what types of operations can be performed.

The set of instructions and addressing modes available on the ARM processor is closely related to the architecture shown in Fig. 3.1. The architecture provides for certain operations to be performed efficiently, and this has a direct relationship to the types of instructions that are supported.
Note that on the ARM, two source registers can be selected for an instruction, using the A and B buses. The data on the B bus is routed through a shifter, and then to the ALU. This allows the second operand of most instructions to be shifted an arbitrary amount before it reaches the ALU. The data on the A bus goes directly to the ALU. Additionally, the A and B buses can provide operands for the multiplier, and the multiplier can provide data for the A and B buses.
Data coming in from memory or an input/output device is fed directly onto the ALU bus. From there, it can be stored in one of the general-purpose registers. Data being written to memory or an input/output device is taken directly from the B bus, which means that store operations can move data from a register, but cannot modify the data on the way to memory or input/output devices.
The address register is a temporary register that is used by the CPU whenever it needs to read or write to memory or I/O devices. It is used every time an instruction is fetched from memory, and is used for all load and store operations. The address register can be loaded from the program counter, for fetching the next instruction. Also the address register can be loaded from the ALU, which allows the processor to support addressing modes where a register is used as a base pointer and an offset is calculated on-the-fly. After its contents are used to access memory or I/O devices, the base address can be incremented and the incremented value can be stored back into a register. This allows the processor to increment the program counter after each instruction, and to implement certain addressing modes where a pointer is automatically incremented after each memory access.
As shown in Fig. 3.2, the ARM processor provides 13 general-purpose registers, named r0 through r12. These registers can each store 32 bits of data. In addition to the 13 general-purpose registers, the ARM has three other special-purpose registers.

The program counter, r15, always contains the address of the next instruction that will be executed. The processor increments this register by four, automatically, after each instruction is fetched from memory. By moving an address into this register, the programmer can cause the processor to fetch the next instruction from the new address. This gives the programmer the ability to jump to any address and begin executing code there.
The link register, r14, is used to hold the return address for subroutines. Certain instructions cause the program counter to be copied to the link register, then the program counter is loaded with a new address. These branch-and-link instructions are briefly covered in Section 3.5 and in more detail in Section 5.4.
The program stack was introduced in Section 1.4. The stack pointer, r13, is used to hold the address where the stack ends. This is commonly referred to as the top of the stack, although on most systems the stack grows downwards and the stack pointer really refers to the bottom of the stack. The address where the stack ends may change when registers are pushed onto the stack, or when temporary local variables (automatic variables) are allocated or deleted. The use of the stack for storing automatic variables is described in Chapter 5. The use of r13 as the stack pointer is a programming convention. Some instructions (e.g., branches) implicitly modify the program counter and link registers, but there are no special instructions involving the stack pointer. As far as the hardware is concerned, r13 is exactly the same as registers r0–r12, but all ARM programmers use it for the stack pointer.
Although register r13 is normally used as the stack pointer, it can be used as a general-purpose register if the stack is not used. However the high-level language compilers always use it as the stack pointer, so using it as a general-purpose register will result in code that cannot inter-operate with code generated using high-level languages. The link register, r14, can also be used as a general-purpose register, but its contents are modified by hardware when a subroutine is called. Using r13 and r14 as general-purpose registers is dangerous and strongly discouraged.
There are also two other registers which may have special purposes. As with the stack pointer, these are programming conventions. There are no special instructions involving these registers. The frame pointer (r11) is used by high-level language compilers to track the current stack frame. This is sometimes useful when running your program under a debugger, and can sometimes help the compiler to generate more efficient code for returning from a subroutine. The GNU C compiler can be instructed to use r11 as a general-purpose register by using the -fomit-frame-pointer command line option. The inter-procedure scratch register r12 is used by the C library when calling functions in dynamically linked libraries. The contents may change, seemingly at random, when certain functions (such as printf) are called.
The final register in the ARM user programming model is the Current Program Status Register (CPSR). This register contains bits that indicate the status of the current program, including information about the results of previous operations. Fig. 3.3 shows the bits in the CPSR. The first four bits, N, Z, C, and V are the condition flags. Most instructions can modify these flags, and later instructions can use the flags to modify their operation. Their meaning is as follows:

Negative: This bit is set to one if the signed result of an operation is negative, and set to zero if the result is positive or zero.
Zero: This bit is set to one if the result of an operation is zero, and set to zero if the result is non-zero.
Carry: This bit is set to one if an add operation results in a carry out of the most significant bit, or if a subtract operation results in a borrow. For shift operations, this flag is set to the last bit shifted out by the shifter.
oVerflow: For addition and subtraction, this flag is set if a signed overflow occurred.
The remaining bits are used by the operating system or for bare-metal programs, and are described in Section 14.1.
The ARM processor supports a relatively small set of instructions, grouped into four basic instruction types, or categories. Most instructions have optional modifiers which can be used to change their behavior. For example, many instructions can have modifiers which set or check condition codes in the CPSR. The combination of basic instructions with optional modifiers results in an extremely rich assembly language. The following sections give a brief overview of the features which are common to instructions in each category. The individual instructions are explained later in this chapter, and in the following chapter.
As mentioned previously, the CPSR contains four flag bits (bits 28–31), which can be used to control whether or not certain instructions are executed. Most of the data processing instructions have an optional modifier to control whether or not the flag bits are affected when the instruction is executed. For example, the basic instruction for addition is add. When the add instruction is executed, the result is stored in a register, but the flag bits in the CPSR are not affected.
However, the programmer can add the s modifier to the add instruction to create the adds instruction. When it is executed, this instruction will affect the CPSR flag bits. The flag bits can be used by subsequent instructions to control execution and branching. The meaning of the flags depends on the type of instruction that last set the flags. Table 3.1 shows the names and meanings of the four bits depending on the type of instruction that set or cleared them. Most instructions support the s modifier to control setting the flags.
Table 3.1
Flag bits in the CPSR register
| Name | Logical Instruction | Arithmetic Instruction |
|---|---|---|
| N (Negative) | No meaning | Bit 31 of the result is set. Indicates a negative number in signed operations |
| Z (Zero) | Result is all zeroes | Result of operation was zero |
| C (Carry) | After Shift operation, ‘1’ was left in carry flag | Result was greater than 32 bits |
| V (oVerflow) | No meaning | The signed two’s complement result requires more than 32 bits. Indicates a possible corruption of the result |

Most ARM instructions can have a condition modifier attached. If present, the modifier controls, at run-time, whether or not the instruction is actually executed. These condition modifiers are added to basic instructions to create conditional instructions. Table 3.2 shows the condition modifiers that can be attached to base instructions. For example, to create an instruction that adds only if the CPSR Z flag is set, the programmer would add the eq condition modifier to the basic add instruction to create the addeq instruction.
Table 3.2
ARM condition modifiers
| <cond> | English Meaning |
|---|---|
| al | always (this is the default <cond>) |
| eq | Z set (=) |
| ne | Z clear (≠) |
| ge | N set and V set, or N clear and V clear (≥) |
| lt | N set and V clear, or N clear and V set (<) |
| gt | Z clear, and either N set and V set, or N clear and V clear (>) |
| le | Z set, or N set and V clear, or N clear and V set (≤) |
| hi | C set and Z clear (unsigned >) |
| ls | C clear or Z set (unsigned ≤) |
| hs | C set (unsigned ≥) |
| cs | Alternate name for HS |
| lo | C clear (unsigned <) |
| cc | Alternate name for LO |
| mi | N set (result < 0) |
| pl | N clear (result ≥ 0) |
| vs | V set (overflow) |
| vc | V clear (no overflow) |
Setting and using condition flags are orthogonal operations. This means that they can be used in combination. Using the previous example, the programmer could add the s modifier to create the addeqs instruction, which executes only if the Z bit is set, and updates the CPSR flags only if it executes.
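For example, one instruction can set the flags and later instructions can be conditioned on them (the register usage here is illustrative):

```
        subs    r0, r0, #1      @ subtract 1 and update the CPSR flags
        addeq   r1, r1, #4      @ executed only if the result was zero
        movne   r2, #0          @ executed only if the result was non-zero
```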
An immediate value in assembly language is a constant value that is specified by the programmer. Some assembly languages encode the immediate value as part of the instruction. Other assembly languages create a table of immediate values in a literal pool and insert appropriate instructions to access them. ARM assembly language provides both methods.
Immediate values can be specified in decimal, octal, hexadecimal, or binary. Octal values must begin with a zero, and hexadecimal values must begin with “0x”. Likewise, immediate values that start with “0b” are interpreted as binary numbers. Any value that does not begin with 0, 0x, or 0b will be interpreted as a decimal value.
There are two ways that immediate values can be specified in GNU ARM assembly. The =<immediate|symbol> syntax can be used to specify any immediate 32-bit number, or to specify the 32-bit value of any symbol in the program. Symbols include program labels (such as main) and symbols that are defined using .equ and similar assembler directives. However, this syntax can only be used with load instructions, and not with data processing instructions. This restriction is necessary because of the way the ARM machine instructions are encoded. For data processing instructions, there are a limited number of bits that can be devoted to storing immediate data as part of the instruction.
The #<immediate|symbol> syntax is used to specify immediate data values for data processing instructions. The #<immediate|symbol> syntax has some restrictions. Basically, the assembler must be able to construct the specified value using only eight bits of data, a shift or rotate, and/or a complement. For immediate values that cannot be constructed by shifting or rotating and complementing an 8-bit value, the programmer must use an ldr instruction with the =<immediate|symbol> syntax to specify the value. That method is covered in Section 3.4. Some examples of immediate values are shown in Table 3.3.
Table 3.3
Legal and illegal values for #<immediate|symbol>
| #32 | Ok because it can be stored as an 8-bit value |
| #1021 | Illegal because the number cannot be created from an 8-bit value using shift or rotate and complement |
| #1024 | Ok because it is 1 shifted left 10 bits |
| #0b1011 | Ok because it fits in 8 bits |
| #-1 | Ok because it is the one’s complement of 0 |
| #0xFFFFFFFE | Ok because it is the one’s complement of 1 |
| #0xEFFFFFFF | Ok because it is the one’s complement of 1 shifted left 28 bits |
| #strsize | Ok if the value of strsize can be created from an 8-bit value using shift or rotate and complement |

The ARM processor has a strict separation between instructions that perform computation and those that move data between the CPU and memory. Because of this separation between load/store operations and computational operations, it is a classic example of a load-store architecture. The programmer can transfer bytes (8 bits), half-words (16 bits), and words (32 bits), from memory into a register, or from a register into memory. The programmer can also perform computational operations (such as adding) using two source operands and one register as the destination for the result. All computational instructions assume that the registers already contain the data. Load instructions are used to move data into the registers, while store instructions are used to move data from the registers to memory.
Most of the load/store instructions use an <address>, which is one of the six options shown in Table 3.4. The <shift_op> can be any of the shift operations from Table 3.5, and <shift> must be a number between 0 and 31. Although there are really only six addressing modes, there are eleven variations of the assembly language syntax. Four of the variations are simply shorthand notations. One of the variations allows an immediate data value or the address of a label to be loaded into a register, and may result in the assembler generating more than one instruction. The following section describes each addressing mode in detail.
Table 3.4
ARM addressing modes
| Syntax | Name |
| [Rn, #±<offset_12>] | Immediate offset |
| [Rn, ±Rm, <shift_op> #<shift>] | Scaled register offset |
| [Rn, #±<offset_12>]! | Immediate pre-indexed |
| [Rn, ±Rm, <shift_op> #<shift>]! | Scaled register pre-indexed |
| [Rn], #±<offset_12> | Immediate post-indexed |
| [Rn], ±Rm, <shift_op> #<shift> | Scaled register post-indexed |
Table 3.5
ARM shift and rotate operations
| <shift_op> | Meaning |
| lsl | Logical Shift Left by specified amount |
| lsr | Logical Shift Right by specified amount |
| asr | Arithmetic Shift Right by specified amount |
Immediate offset: [Rn, #±<offset_12>]
The immediate offset (which may be positive or negative) is added to the contents of Rn. The result is used as the address of the item to be loaded or stored. For example, the following line of code:
ldr r0, [r1, #12]
calculates a memory address by adding 12 to the contents of register r1. It then loads four bytes of data, starting at the calculated memory address, into register r0. Similarly, the line:
str r9, [r6, #-8]
subtracts 8 from the contents of r6 and uses the result as the address where it stores the contents of r9 in memory.
Register immediate: [Rn]
When using immediate offset mode with an offset of zero, the comma and offset can be omitted. That is, [Rn] is just shorthand notation for [Rn, #0]. This shorthand is referred to as register immediate mode. For example, the following line of code:
ldr r3, [r2]
uses the contents of register r2 as a memory address and loads four bytes of data, starting at that address, into register r3. Likewise,
str r8, [r0]
copies the contents of r8 to the four bytes of memory starting at the address that is in r0.
Scaled register offset: [Rn, ±Rm, <shift_op> #<shift>]
Rm is shifted as specified, then added to or subtracted from Rn. The result is used as the address of the item to be loaded or stored. For example,
ldr r3, [r2, r1, lsl #2]
shifts the contents of r1 left two bits, adds the result to the contents of r2, and uses the sum as an address in memory from which it loads four bytes into r3. Recall that shifting a binary number left by two bits is equivalent to multiplying that number by four. This addressing mode is typically used to access an array, where r2 contains the address of the beginning of the array, and r1 is an integer index. The shift amount depends on the size of the objects in the array. To store an item from register r0 into an array of half-words, the following instruction could be used:
strh r0, [r4, r5, lsl #1]
where r4 holds the address of the first byte of the array, and r5 holds the integer index for the desired array item.
Register offset: [Rn, ±Rm]
When using scaled register offset mode with a shift amount of zero, the comma and shift specification can be omitted. That is, [Rn, ±Rm] is just shorthand notation for [Rn, ±Rm, lsl #0]. This shorthand is referred to as register offset mode.
Immediate pre-indexed: [Rn, #±<offset_12>]!
The address is computed in the same way as immediate offset mode, but after the load or store, the address that was used is stored in Rn. This mode can be used to step through elements in an array, updating a pointer to the next array element before each element is accessed.
Scaled register pre-indexed: [Rn, ±Rm, <shift_op> #<shift>]!
The address is computed in the same way as scaled register offset mode, but after the load or store, the address that was used is stored in Rn. This mode can be used to step through elements in an array, updating a pointer to the current array element before each access.
Register pre-indexed: [Rn, ±Rm]!
When using scaled register pre-indexed mode with a shift amount of zero, the comma and shift specification can be omitted. That is, [Rn, ±Rm]! is shorthand notation for [Rn, ±Rm, lsl #0]!. This shorthand is referred to as register pre-indexed mode.
Immediate post-indexed: [Rn], #±<offset_12>
Register Rn is used as the address of the value to be loaded or stored. After the value is loaded or stored, the value in Rn is updated by adding the immediate offset, which may be negative or positive. This mode can be used to step through elements in an array, updating a pointer to point at the next array element after each one is accessed.
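For instance, stepping through an array of words might use post-indexing like this (the register choices and the four-byte element size are illustrative):

```asm
    ldr r3, [r0], #4    @ load the word at the address in r0,
                        @ then add 4 to r0 (next source element)
    str r3, [r1], #4    @ store the word at the address in r1,
                        @ then add 4 to r1 (next destination element)
```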
Scaled register post-indexed: [Rn], ±Rm, <shift_op> #<shift>
Register Rn is used as the address of the value to be loaded or stored. After the value is loaded or stored, the value in Rn is updated by adding or subtracting the contents of Rm shifted as specified. This mode can be used to step through elements in an array, updating a pointer to point at the next array element after each one is accessed.
Register post-indexed: [Rn], ±Rm
When using scaled register post-indexed mode with a shift amount of zero, the comma and shift specification can be omitted. That is, [Rn], ±Rm is shorthand notation for [Rn], ±Rm, lsl #0. This shorthand is referred to as register post-indexed mode.
Load Immediate: =<immediate|symbol>
This is really a pseudo-instruction. The assembler will generate a mov instruction if possible. Otherwise it will store the value of immediate or the address of symbol in a “literal table” and generate a load instruction, using one of the previous addressing modes, to load the value into a register. This addressing mode can only be used with the ldr instruction.
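A brief sketch of how the assembler might treat such loads (the constant values are arbitrary):

```asm
    ldr r0, =0xFF         @ small value: the assembler substitutes
                          @ mov r0, #0xFF
    ldr r1, =0x12345678   @ cannot be built from an 8-bit value: it
                          @ is placed in a literal pool and loaded
                          @ with a pc-relative ldr
    ldr r2, =main         @ loads the 32-bit address of the label main
```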
The load and store instructions allow the programmer to move data from memory to registers or from registers to memory. The load/store instructions can be grouped into the following types:
• single register,
• multiple register, and
• atomic.
The following sections describe the seven load and store instructions that are available, and all of their variations.
These instructions transfer a single word, half-word, or byte from a register to memory or from memory to a register:
ldr Load Register, and
str Store Register.
• The optional <cond> can be any of the codes from Table 3.2 on page 59 specifying conditional execution.
• The optional <size> is one of:
b unsigned byte
h unsigned half-word
sb signed byte
sh signed half-word
• The <address> is any valid address specifier described in Section 3.4.1.
ARM has two instructions for loading and storing multiple registers:
ldm Load Multiple Registers, and
stm Store Multiple Registers.
These instructions are used to store registers on the program stack, and for copying blocks of data. The ldm and stm instructions each have four variants, and each variant has two equivalent names. So, although there are only two basic instructions, there are sixteen mnemonics. These are the most complex instructions in the ARM assembly language.
• <variant> is chosen from the following tables:
| Block Copy Method | Stack Type | |||
| Variant | Description | Variant | Description | |
| ia | Increment After | ea | Empty Ascending | |
| ib | Increment Before | fa | Full Ascending | |
| da | Decrement After | ed | Empty Descending | |
| db | Decrement Before | fd | Full Descending | |

• The optional ! specifies that the address register Rd should be updated after the registers are transferred.
• An optional trailing ˆ can only be used by operating system code. It causes the transfer to affect user registers instead of operating system registers.
There are two equivalent mnemonics for each load/store multiple instruction. For example, ldmia is exactly the same instruction as ldmfd, and stmdb is exactly the same instruction as stmfd. There are two different names so that the programmer can indicate what the instruction is being used for.
The mnemonics in the Block Copy Method table are used when the programmer is using the instructions to move blocks of data. For instance, the programmer may want to copy eight words from one address in memory to another address. One very efficient way to do that is to:
1. load the address of the first byte of the source into a register,
2. load the address of the first byte of the destination into another register,
3. use ldmia (load multiple increment after) to load eight registers from the source address, then
4. use stmia (store multiple increment after) to store the registers to the destination address.
Assuming source and dest are labeled blocks of data declared elsewhere, the following listing shows the exact instructions needed to move eight words from source to dest:
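A sketch of the four steps (the choice of r2-r9 as the data registers is an assumption):

```asm
    ldr   r0, =source     @ step 1: address of the source block
    ldr   r1, =dest       @ step 2: address of the destination block
    ldmia r0, {r2-r9}     @ step 3: load eight words from source
    stmia r1, {r2-r9}     @ step 4: store eight words to dest
```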

The mnemonics in the Stack Type table are used when the programmer is performing stack operations. The most common variants are stmfd and ldmfd, which are used for pushing registers onto the program stack and later popping them back off, respectively. In Linux, the C compiler always uses the stmfd and ldmfd versions for accessing the stack. The following code shows how the programmer could save the contents of registers r0-r9 on the stack, use them to perform a block copy, then restore their contents:
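A sketch of such code, assuming source and dest are labels declared elsewhere:

```asm
    stmfd sp!, {r0-r9}    @ push r0-r9 onto the full descending stack
    ldr   r0, =source     @ pointer to the source block
    ldr   r1, =dest       @ pointer to the destination block
    ldmia r0, {r2-r9}     @ copy eight words from source...
    stmia r1, {r2-r9}     @ ...to dest
    ldmfd sp!, {r0-r9}    @ pop r0-r9 back off the stack
```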

Note that in the previous example, after the stmfd sp!, { r0-r9 } instruction, sp will contain the address of the last word on the stack, because the optional ! was used to indicate that the register should be updated.
| Name | Description |
| ldmia and ldmfd | Load multiple registers from memory, starting at the address in Rd and incrementing the address by four bytes after each load. |
| stmia and stmea | Store multiple registers in memory, starting at the address in Rd and incrementing the address by four bytes after each store. |
| ldmib and ldmed | Load multiple registers from memory, starting at the address in Rd and incrementing the address by four bytes before each load. |
| stmib and stmfa | Store multiple registers in memory, starting at the address in Rd and incrementing the address by four bytes before each store. |
| ldmda and ldmfa | Load multiple registers from memory, starting at the address in Rd and decrementing the address by four bytes after each load. |
| stmda and stmed | Store multiple registers in memory, starting at the address in Rd and decrementing the address by four bytes after each store. |
| ldmdb and ldmea | Load multiple registers from memory, starting at the address in Rd and decrementing the address by four bytes before each load. |
| stmdb and stmfd | Store multiple registers in memory, starting at the address in Rd and decrementing the address by four bytes before each store. |

In every case, if the optional ! is present, Rd is updated with the final address when the transfer completes.


Multiprogramming and threading require the ability to set and test values atomically. This instruction is used by the operating system or threading libraries to guarantee mutual exclusion:
Note: swp and swpb are deprecated in favor of ldrex and strex, which work on multiprocessor systems as well as uni-processor systems.
• The optional b suffix (swpb) transfers a byte instead of a word.
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

These instructions are used by the operating system or threading libraries to guarantee mutual exclusion, even on multiprocessor systems:
ldrex Load Register Exclusive, and
strex Store Register Exclusive.
Exclusive load (ldrex) reads data from memory, tagging the memory address at the same time. Exclusive store (strex) stores data to memory, but only if the tag is still valid. A strex to the same address as the previous ldrex will invalidate the tag. A str to the same address may invalidate the tag (implementation defined). The strex instruction sets a bit in the specified register which indicates whether or not the store succeeded. This allows the programmer to implement semaphores on uni-processor and multiprocessor systems.
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.
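As a sketch of how these instructions support mutual exclusion, the following loop attempts to acquire a simple spin lock (the lock-word address in r0 and the label name are assumptions):

```asm
acquire:
    mov   r1, #1          @ value meaning "locked"
    ldrex r2, [r0]        @ read the lock word and tag the address
    cmp   r2, #0          @ is the lock free?
    bne   acquire         @ no: try again
    strex r2, r1, [r0]    @ try to claim it; r2 = 0 only on success
    cmp   r2, #0
    bne   acquire         @ lost the race: start over
    @ the lock is now held
```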

Branch instructions allow the programmer to change the address of the next instruction to be executed. They are used to implement loops, if-then structures, subroutines, and other flow control structures. There are two basic branch instructions:
• Branch (loops and conditional structures), and
• Branch and Link (subroutine call).
This instruction is used to perform conditional and unconditional branches in program execution:
It is used for creating loops and if-then-else constructs.
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.
• The target_label can be any label in the current file, or any label that is defined as .global or .globl in any file that is linked in.

The following instruction is used to call subroutines:
The branch and link instruction is identical to the branch instruction, except that it copies the address of the next instruction into the link register before performing the branch. This allows the programmer to copy the link register back into the program counter at some later point. This is how subroutines are called, and how subroutines return and resume execution at the instruction after the one that called them.
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.
• The target_label can be any label in the current file, or any label that is defined as .global or .globl in any file that is linked in.

Example 3.1 shows how the bl instruction can be used to call a function from the C standard library to read a single character from standard input. By convention, when a function is called, it will leave its return value in r0. Example 3.2 shows how the bl instruction can be used to call another function from the C standard library to print a message to standard output. By convention, when a function is called, it will expect to find its first argument in r0. There are other rules, which all ARM programmers must follow, regarding which registers are used when passing arguments to functions and procedures. Those rules will be explained fully in Section 5.4.
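The examples themselves are not reproduced here, but the calling pattern they illustrate looks roughly like this (msg is an assumed label on a string declared elsewhere):

```asm
    bl  getchar       @ call the C library function; the character
                      @ read is returned in r0
    ldr r0, =msg      @ first argument (address of the string) in r0
    bl  printf        @ print the message
```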
The assembler provides a small number of pseudo-instructions. From the perspective of the programmer, these instructions are indistinguishable from standard instructions. However, when the assembler encounters a pseudo-instruction, it may substitute a different instruction or generate a short sequence of machine instructions.
This pseudo-instruction loads a register with any 32-bit value:
When this pseudo-instruction is encountered, the assembler first determines whether or not it can substitute a mov Rd,#<immediate> or mvn Rd,#<immediate> instruction. If that is not possible, then it reserves four bytes in a “literal pool” and stores the immediate value there. Then, the pseudo-instruction is translated into an ldr instruction using Immediate Offset addressing mode with the pc as the base register.
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.
• The <immediate> parameter is any valid 32-bit quantity.
Example 3.3 shows how the assembler generates code from the load immediate pseudo-instruction. Line 2 of the example listing just declares two 32-bit words. They cause the next variable to be given a non-zero address for demonstration purposes, and are not used anywhere in the program, but line 3 declares a string of characters in the data section. The string is located at offset 0x00000008 from the beginning of the data section. The linker is responsible for calculating the actual address, when it assigns a location for the data section. Line 6 shows how a register can be loaded with an immediate value using the mov instruction. The next line shows the equivalent using the ldr pseudo-instruction. Note that the assembler generates the same machine instruction (FD5FE0E3) for both lines.
Line 8 shows the ldr pseudo-instruction being used to load a value that cannot be loaded using the mov instruction. The assembler generated a load half-word instruction using the program counter as the base register, and an offset to the location where the value is stored. The value is actually stored in a literal pool at the end of the text segment. The listing has three lines labeled 11. The first line 11 is an instruction. The remaining lines are the literal pool.
On line 9, the programmer used the ldr pseudo-instruction to request that the address of str be loaded into r4. The assembler created a storage location to hold the address of str, and generated a load word instruction using the program counter as the base register and an offset to the location where the address is stored. The address of str is actually stored in the text segment, on the third line 11.
These pseudo instructions are used to load the address associated with a label:
adr Load Address, and
adrl Load Address Long.
They are more efficient than the ldr rx,=label instruction, because they are translated into one or two add or subtract operations, and do not require a load from memory. However, the address must be in the same section as the adr or adrl pseudo-instruction, so they cannot be used to load addresses of labels in the .data section.
• The adr pseudo-instruction will be translated into one or two pc-relative add or sub instructions.
• The adrl pseudo-instruction will always be translated into two instructions. The second instruction may be a nop instruction.
• The label must be defined in the same file and section where these pseudo-instructions are used.
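A short sketch (close and far are assumed labels in the same .text section):

```asm
    adr  r0, close    @ assembled as a single pc-relative add or sub
    adrl r1, far      @ assembled as two instructions; reaches
                      @ addresses that adr cannot
```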

The ARM Instruction Set Architecture includes 17 registers and four basic instruction types. This chapter explained the instructions used for
• moving data between memory and registers, and
• branching and calling subroutines.
The load and store operations are used to move data between memory and registers. The basic load and store operations, ldr and str, have a very powerful set of addressing modes. To facilitate moving multiple registers to or from memory, the ARM ISA provides the ldm and stm instructions, which each have several variants. The assembler provides two pseudo-instructions for loading addresses and immediate values.
The ARM processor provides only two types of branch instruction. The bl instruction is used to call subroutines (functions). The b instruction can be used to create loops and to create if-then-else constructs. The ability to append a condition to almost any instruction results in a very rich instruction set.
3.1 Which registers hold the stack pointer, return address, and program counter?
3.2 Which is more efficient for loading a constant value, the ldr pseudo-instruction, or the mov instruction? Explain.
3.3 Which two variants of the Store Multiple instruction are used most often, and why?
3.4 The stm and ldm instructions include an optional ‘!’ after the address register. What does it do?
3.5 The following C statement declares an array of four integers, and initializes their values to 7, 3, 21, and 10, in that order.
int x[4] = { 7, 3, 21, 10 };
(a) Write the equivalent in GNU ARM assembly.
(b) Write the ARM assembly instructions to load all four numbers into registers r3, r5, r6, and r9, respectively, using:
i. a single ldm instruction, and
ii. four ldr instructions.
3.6 What is the difference between a memory location and a CPU register?
3.7 How many registers are provided by the ARM Instruction Set Architecture?
3.8 Use ldm and stm to write a short sequence of ARM assembly language to copy 16 words of data from a source address to a destination address. Assume that the source address is already loaded in r0 and the destination address is already loaded in r1. You may use registers r2 through r5 to hold values as needed. Your code is allowed to modify r0 and/or r1.
3.9 Assume that x is an array of integers. Convert the following C statements into ARM assembly language.
(b) x[10] = x[0];
(c) x[9] = x[3];
3.10 Assume that x is an array of integers, and i and j are integers. Convert the following C statements into ARM assembly language.
(b) x[j] = x[i];
(c) x[i] = x[j*2];
3.11 What is the difference between the b instruction and the bl instruction? What is each used for?
3.12 What are the meanings of the following instructions?
(b) ldrlt
(c) bgt
(d) bne
(e) bge
This chapter begins by explaining Operand2, which is used by most ARM data processing instructions to specify one of the source operands for the data processing operation. It explains all of the shift operations and how they can be combined with other data processing operations in a single instruction. It then explains each of the data processing instructions, giving a short example showing how they can be used. Short examples, relating the assembly instructions to C statements, are incorporated throughout the chapter. One of the examples shows how to construct a loop. After the data processing instructions are explained, the chapter covers the special instructions and pseudo-instructions.
Operand2; Data processing; Shift; Loop; Comparison; Data movement; Three address instruction; Two address instruction
The ARM processor has approximately 25 data processing instructions. The exact number depends on the processor version. For example, older versions of the architecture did not have the six multiply instructions, and the Cortex M3 and newer processors have two division instructions. There are also a few special instructions that are used infrequently to perform operations that are not classified as load/store, branch, or data processing.
The data processing instructions operate only on CPU registers, so data must first be moved from memory into a register before processing can be performed. Most of these instructions use two source operands and one destination register. Each instruction performs one basic arithmetical or logical operation. The operations are grouped in the following categories:
• Logical Operations,
• Comparison Operations,
• Data Movement Operations,
• Status Register Operations,
• Multiplication Operations, and
• Division Operations.
Most of the data processing instructions require the programmer to specify two source operands and one destination register for the result. Because three items must be specified for these instructions, they are known as three address instructions. The use of the word address in this case has nothing to do with memory addresses. The term three address instruction comes from earlier processor architectures that allow arithmetic operations to be performed with data that is stored in memory rather than registers. The first source operand specifies a register whose contents will be on the A bus in Fig. 3.1. The second source operand will be on the B bus and is referred to as Operand2. Operand2 can be any one of the following three things:
• a register (r0-r15),
• a register (r0-r15) and a shift operation to modify it, or
• a 32-bit immediate value that can be constructed by shifting, rotating, and/or complementing an 8-bit value.
The options for Operand2 allow a great deal of flexibility. Many operations that would require two instructions on most processors can be performed using a single ARM instruction. Table 4.1 shows the mnemonics used for specifying shift operations, which we refer to as <shift_op>.
The lsl operation shifts each bit left by a specified amount n. Zeros are shifted into the n least significant bits, and the most significant n bits are lost. The lsr operation shifts each bit right by a specified amount n. Zeros are shifted into the n most significant bits, and the least significant n bits are lost. The asr operation shifts each bit right by a specified amount n. The n most significant bits become copies of the sign bit (bit 31), and the least significant n bits are lost. The ror operation rotates each bit right by a specified amount n. The n least significant bits wrap around to become the n most significant bits. The rrx operation rotates one place to the right, but the CPSR carry flag, C, is included: the carry flag and the register together form a 33-bit quantity to be rotated. The carry flag is rotated into the most significant bit of the register, and the least significant bit of the register is rotated into the carry flag. Table 4.2 shows all of the possible forms for Operand2.
Table 4.2
Formats for Operand2
| #<immediate|symbol> | A 32-bit immediate value that can be constructed from an 8-bit value |
| Rm | Any of the 16 registers r0-r15 |
| Rm, <shift_op> #<shift_imm> | The contents of a register shifted or rotated by an immediate amount between 0 and 31 |
| Rm, <shift_op> Rs | The contents of a register shifted or rotated by an amount specified by the contents of another register |
| Rm, rrx | The contents of a register rotated right by one bit through the carry flag |
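A few instructions showing the flexibility of Operand2 (the register contents are assumed):

```asm
    add r0, r1, #100           @ immediate form
    add r0, r1, r2             @ plain register form
    add r0, r1, r2, lsl #3     @ r0 = r1 + (r2 * 8), one instruction
    add r0, r1, r2, lsr r3     @ shift amount taken from r3
    mov r0, r1, rrx            @ rotate right through the carry flag
```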
These four comparison operations update the CPSR flags, but have no other effect:
cmp Compare,
cmn Compare Negative,
tst Test Bits, and
teq Test Equivalence.
They each perform an arithmetic or logical operation, but the result of the operation is discarded. Only the CPSR condition flags are affected.
• <op> is either cmp, cmn, tst, or teq.
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

Example 4.1 shows how conditional execution and the test instruction can be used together to create an if-then-else structure. Note that in this case, the assembly code is more concise than the C code. That is not generally true.
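Example 4.1 is not reproduced here, but the pattern it illustrates can be sketched as follows (keeping x in r0 and y in r1 is an assumption):

```asm
    @ C equivalent: if (x & 1) y = 1; else y = 2;
    tst   r0, #1      @ sets Z from (r0 AND 1)
    movne r1, #1      @ bit 0 was set: the "then" branch
    moveq r1, #2      @ bit 0 was clear: the "else" branch
```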
There are six basic arithmetic operations:
add Add,
adc Add with Carry,
sub Subtract,
sbc Subtract with Carry,
rsb Reverse Subtract, and
rsc Reverse Subtract with Carry.
All of them involve two 32-bit source operands and a destination register.
• <op> is one of add, adc, sub, sbc, rsb, or rsc.
• The optional s specifies whether or not the instruction should affect the bits in the CPSR.
• The optional <cond> can be any of the codes from Table 3.2 on page 59 specifying conditional execution.

Example 4.2 shows a complete program for adding the contents of two statically allocated variables and printing the result. The printf () function expects to find the address of a string in r0. As it prints the string, it finds the %d formatting command, which indicates that the value of an integer variable should be printed. It expects the variable to be stored in r1. Note that the variable sum does not need to be stored in memory. It is stored in r1, where printf () expects to find it.
Example 4.3 shows how the compare, branch, and add instructions can be used to create a loop. There are basically three steps for creating a loop: allocating and initializing the loop variable, testing the loop variable, and modifying the loop variable. In general, any of the registers r0-r12 can be used to hold the loop variable. Section 5.4 introduces some considerations for choosing an appropriate register. For now, it is assumed that r0 is available for use as the loop variable for this example.
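Example 4.3 itself is not shown here; the three steps can be sketched like this (the ten-iteration count and the labels are placeholders):

```asm
        mov  r0, #0        @ step 1: initialize the loop variable
loop:   cmp  r0, #10       @ step 2: test the loop variable
        bge  done          @ exit after ten iterations
        @ ... loop body ...
        add  r0, r0, #1    @ step 3: modify the loop variable
        b    loop
done:
```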
There are five basic logical operations:
and Bitwise AND,
orr Bitwise OR,
eor Bitwise Exclusive OR,
orn Bitwise OR NOT, and
bic Bit Clear.
All of them involve two source operands and a destination register.
• <op> is either and, eor, orr, orn, or bic.
• The optional s specifies whether or not the instruction should affect the bits in the CPSR.
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

The data movement operations copy data from one register to another:
mov Move,
mvn Move Not, and
movt Move Top.
The movt instruction copies 16 bits of data into the upper 16 bits of the destination register, without affecting the lower 16 bits. It is available on ARMv6T2 and newer processors.
• The optional s specifies whether or not the instruction should affect the bits in the CPSR.
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.
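A few illustrative moves (the constant values are arbitrary):

```asm
    mov  r0, #0xFF       @ r0 = 0x000000FF
    mvn  r1, #0          @ r1 = 0xFFFFFFFF (bitwise NOT of 0)
    movt r1, #0x1234     @ r1 = 0x1234FFFF: only the upper
                         @ 16 bits are replaced
```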

These two instructions perform multiplication using two 32-bit registers to form a 32-bit result:
mul Multiply, and
mla Multiply and Accumulate.
The mla instruction adds a third register to the result of the multiplication.
• The optional s specifies whether or not the instruction should affect the bits in the CPSR.
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.
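For instance (the register contents are assumed):

```asm
    mul r0, r1, r2        @ r0 = low 32 bits of r1 * r2
    mla r0, r1, r2, r3    @ r0 = (r1 * r2) + r3
```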

These instructions perform multiplication using two 32-bit registers to form a 64-bit result:
smull Signed Multiply Long,
umull Unsigned Multiply Long,
smlal Signed Multiply and Accumulate Long, and
umlal Unsigned Multiply and Accumulate Long.
The smlal and umlal instructions add a 64-bit quantity to the result of the multiplication.
• <type> must be either s for signed or u for unsigned.
• <op> must be either mul, or mla.
• The optional s specifies whether or not the instruction should affect the bits in the CPSR.
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.
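The 64-bit result is returned in a register pair, with the low half in the first destination register. A sketch, assuming the operands are in r2 and r3:

```
        umull   r0, r1, r2, r3      @ r1:r0 = r2 * r3 (unsigned 64-bit result)
        smull   r0, r1, r2, r3      @ r1:r0 = r2 * r3 (signed 64-bit result)
        umlal   r0, r1, r2, r3      @ r1:r0 = r1:r0 + (r2 * r3), unsigned
```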

Some ARM processors have the following instructions to perform division:
sdiv Signed Divide, and
udiv Unsigned Divide.
The divide instructions are available on the Cortex-M3 and newer ARM processors. The processor used on the original Raspberry Pi does not have these instructions. The Raspberry Pi 2 does have them.
• <type> must be either s for signed or u for unsigned.
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.
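A sketch using hypothetical values (these instructions will not assemble for processors that lack the divide extensions):

```
        @ Assume r1 = -7 and r2 = 2
        sdiv    r0, r1, r2          @ r0 = -3 (signed division, rounds toward zero)
        udiv    r0, r1, r2          @ treats r1 and r2 as unsigned values
```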

There are a few instructions that do not fit into any of the previous categories. They are used to request operating system services and access advanced CPU features.
The clz instruction counts the number of leading zeros in the operand register and stores the result in the destination register:
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.
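For example, with a hypothetical operand:

```
        ldr     r1, =0x00010000     @ bit 16 is the highest set bit
        clz     r0, r1              @ r0 = 15, the number of leading zero bits in r1
```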

These two instructions allow the programmer to access the status bits of the CPSR and SPSR:
mrs Move Status to Register, and
msr Move Register to Status.
The SPSR is covered in Section 14.1.
• The optional <fields> is any combination of:
c control field
x extension field
s status field
f flags field
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.
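A minimal sketch; note that in user mode only some fields of the CPSR may actually be modified:

```
        mrs     r0, cpsr            @ copy the current CPSR into r0 for inspection
        msr     cpsr_f, r1          @ write only the flags field (N, Z, C, V) from r1
```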

The following instruction allows a user program to perform a system call to request operating system services:
In Unix and Linux, the system calls are documented in the second section of the online manual. Each system call has a unique id number which is defined in the /usr/include/syscall.h file.
• The <syscall_number> is encoded in the instruction. The operating system may examine it to determine which operating system service is being requested.
• In Linux, <syscall_number> is ignored. The system call number is passed in r7, and up to seven parameters are passed in r0-r6. No Linux system call requires more than seven parameters.
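As a minimal sketch, a program can terminate itself by invoking the exit system call (number 1 on ARM Linux):

```
        mov     r0, #0              @ first parameter: exit status
        mov     r7, #1              @ r7 holds the system call number for exit
        swi     0                   @ trap to the kernel; Linux ignores this number
```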

The ARM processor has an alternate mode where it executes a 16-bit instruction set known as Thumb. This instruction allows the programmer to change the processor mode and branch to Thumb code:
The Thumb instruction set is sometimes more efficient than the full ARM instruction set, and may offer advantages on small systems.

The assembler provides a small number of pseudo-instructions. From the perspective of the programmer, these instructions are indistinguishable from standard instructions. However, when the assembler encounters a pseudo-instruction, it may substitute a different instruction or generate a short sequence of machine instructions.
This pseudo-instruction does nothing, but takes one clock cycle to execute.
This is equivalent to a mov r0,r0 instruction.

These pseudo-instructions are assembled into mov instructions with an appropriate shift of Operand2:
lsl Logical Shift Left,
lsr Logical Shift Right,
asr Arithmetic Shift Right,
ror Rotate Right, and
rrx Rotate Right with eXtend.
• <op> must be either lsl, lsr, asr, or ror.
• Rs is a register holding the shift amount. Only the least significant byte is used.
• shift must be between 1 and 32.
• If the optional s is specified, then the N and Z flags are updated according to the result, and the C flag is updated to the last bit shifted out.
• The optional <cond> can be any of the codes from Table 3.2 on page 59 specifying conditional execution.
| Name | Effect | Description |
| lsl | vacated low-order bits are filled with zeros | Shift Left |
| lsr | vacated high-order bits are filled with zeros | Shift Right |
| asr | vacated high-order bits are filled with copies of the sign bit | Shift Right with sign extend |
| rrx | 33-bit rotate to the right through the carry flag | Rotate Right with eXtend |
The rrx operation rotates one place to the right but the CPSR carry flag, C, is included. The carry flag and the register together create a 33-bit quantity to be rotated. The carry flag is rotated into the most significant bit of the register, and the least significant bit of the register is rotated into the carry flag.
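A sketch of the shift pseudo-instructions with a hypothetical starting value, followed by a common use of rrx: shifting a 64-bit quantity (high word in r1, low word in r0) right by one bit:

```
        @ Assume r1 = 0x80000001
        lsl     r0, r1, #1          @ r0 = 0x00000002; same as mov r0, r1, lsl #1
        lsr     r0, r1, #1          @ r0 = 0x40000000 (zero fill from the left)
        asr     r0, r1, #1          @ r0 = 0xC0000000 (sign bit is copied)
        ror     r0, r1, #1          @ r0 = 0xC0000000 (low bit rotates to the top)

        @ 64-bit right shift using rrx
        lsrs    r1, r1, #1          @ shift the high word; its low bit goes into C
        rrx     r0, r0              @ rotate C into the top bit of the low word
```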

This chapter and the previous one introduced the core set of ARM instructions. Most of these instructions were introduced with the very first ARM processors. There are approximately 50 additional instructions and pseudo instructions that were introduced with the ARMv6 and later versions of the architecture, or that only appear in specific versions of the ARM. There are also additional instructions available on systems that have the Vector Floating Point (VFP) coprocessor and/or the NEON extensions. The instructions introduced so far are:
| Name | Page | Operation |
| adc | 83 | Add with Carry |
| add | 83 | Add |
| adr | 75 | Load Address |
| adrl | 75 | Load Address Long |
| and | 85 | Bitwise AND |
| asr | 94 | Arithmetic Shift Right |
| b | 70 | Branch |
| bic | 86 | Bit Clear |
| bl | 71 | Branch and Link |
| bx | 92 | Branch and Exchange |
| clz | 90 | Count Leading Zeros |
| cmn | 81 | Compare Negative |
| cmp | 81 | Compare |
| eor | 85 | Bitwise Exclusive OR |
| ldm | 65 | Load Multiple Registers |
| ldr | 73 | Load Immediate |
| ldr | 64 | Load Register |
| ldrex | 69 | Load Register Exclusive |
| lsl | 94 | Logical Shift Left |
| lsr | 94 | Logical Shift Right |
| mla | 87 | Multiply and Accumulate |
| mov | 86 | Move |
| movt | 86 | Move Top |
| mrs | 91 | Move Status to Register |
| msr | 91 | Move Register to Status |
| mul | 87 | Multiply |
| mvn | 86 | Move Not |
| nop | 93 | No Operation |
| orn | 86 | Bitwise OR NOT |
| orr | 85 | Bitwise OR |
| ror | 94 | Rotate Right |
| rrx | 94 | Rotate Right with eXtend |
| rsb | 83 | Reverse Subtract |
| rsc | 83 | Reverse Subtract with Carry |
| sbc | 83 | Subtract with Carry |
| sdiv | 89 | Signed Divide |
| smlal | 88 | Signed Multiply and Accumulate Long |
| smull | 88 | Signed Multiply Long |
| stm | 65 | Store Multiple Registers |
| str | 64 | Store Register |
| strex | 69 | Store Register Exclusive |
| sub | 83 | Subtract |
| swi | 91 | Software Interrupt |
| swp | 68 | Swap Register and Memory |
| teq | 81 | Test Equivalence |
| tst | 81 | Test Bits |
| udiv | 89 | Unsigned Divide |
| umlal | 88 | Unsigned Multiply and Accumulate Long |
| umull | 88 | Unsigned Multiply Long |

The ARM Instruction Set Architecture includes 17 registers and four basic instruction types. This chapter introduced the instructions used for
• moving data from one register to another,
• performing computational operations with two source operands and one destination register,
• multiplication and division,
• performing comparisons, and
• performing special operations.
Most of the data processing instructions are three address instructions, because they involve two source operands and produce one result. For most instructions, the second source operand can be a register, a rotated or shifted register, or an immediate value. This flexibility results in a relatively powerful assembly language. In addition, almost all instructions can be executed conditionally, which, if used properly, results in very efficient and compact code.
4.1 If r0 initially contains 1, what will it contain after the third instruction in the sequence below?

4.2 What will r0 and r1 contain after each of the following instructions? Give your answers in base 10.

4.3 What is the difference between lsr and asr?
4.4 Write the ARM assembly code to load the numbers stored in num1 and num2, add them together, and store the result in numsum. Use only r0 and r1.
4.5 Given the following variable definitions:

where you do not know the values of x and y, write a short sequence of ARM assembly instructions to load the two numbers, compare them, and move the largest number into register r0.
4.6 Assuming that a is stored in register r0 and b is stored in register r1, show the ARM assembly code that is equivalent to the following C code.

4.7 Without using the mul instruction, give the instructions to multiply r3 by the following constants, leaving the result in r0. You may also use r1 and r2 to hold temporary results, and you do not need to preserve the original contents of r3.
(b) 100
(c) 575
(d) 123
4.8 Assume that r0 holds the least significant 32 bits of a 64-bit integer a, and r1 holds the most significant 32 bits of a. Likewise, r2 holds the least significant 32 bits of a 64-bit integer b, and r3 holds the most significant 32 bits of b. Show the shortest instruction sequences necessary to:
(a) compare a to b, setting the CPSR flags,
(b) shift a left by one bit, storing the result in b,
(c) add b to a, and
(d) subtract b from a.
4.9 Write a loop to count the number of bits in r0 that are set to 1. Use any other registers that are necessary.
4.10 The C standard library provides the open() function, which is documented in the second section of the Linux manual pages. This function is a very small “wrapper” to allow C programmers to access the open() system call. Assembly programmers can access the system call directly. In ARM Linux, the system call number for open() is 5. The values for flag constants used with open() are defined in
Write the ARM assembly instructions and directives necessary to make a Linux system call to open a file named input.txt for reading, without using the C standard library. In other words, write the assembly equivalent to open("input.txt", O_RDONLY); using the swi instruction.
This chapter first introduces structured programming concepts and describes the principles of good software design. It then shows how the language elements covered in the previous three chapters are used to create the elements required by structured programming, giving comparative examples of these elements in C and assembly language. It covers programming elements for sequencing, selection, and iteration. It then covers in greater detail how to access the standard C library functions from assembly language, and how to access assembly language functions from C. It then explains how automatic variables are allocated, and covers writing recursive functions in assembly language. Finally, it explains the implementation of C structs and shows how they can be accessed from assembly language, then covers arrays in the same way.
Structured programming; Sequencing; Selection; Iteration; Loop; Subroutine; Function; Recursion; Struct; Aggregate data; Array
Before IBM released FORTRAN in 1957, almost all programming was done in assembly language. Part of the reason for this is that nobody knew how to design a good high-level language, nor did they know how to write a compiler to generate efficient code. Early attempts at high-level languages resulted in languages that were not well structured, difficult to read, and difficult to debug. The first release of FORTRAN was not a particularly elegant language by today’s standards, but it did generate efficient code.
In the 1960s, a new paradigm for designing high-level languages emerged. This new paradigm emphasized grouping program statements into blocks of code that execute from beginning to end. These basic blocks have only one entry point and one exit point. Control of which basic blocks are executed, and in what order, is accomplished with highly structured flow control statements. The structured program theorem provides the theoretical basis of structured programming. It states that there are three ways of combining basic blocks: sequencing, selection, and iteration. These three mechanisms are sufficient to express any computable function. It has been proven that all programs can be written using only basic blocks, the pre-test loop, and if-then-else structure. Although most high-level languages provide additional statements for the convenience of the programmer, they are just “syntactical sugar.” Other structured programming concepts include well-formed functions and procedures, pass-by-reference and pass-by-value, separate compilation, and information hiding.
These structured programming languages enabled programmers to become much more productive. Well-written programs that adhere to structured programming principles are much easier to write, understand, debug, and maintain. Most successful high-level languages are designed to enforce, or at least facilitate, good programming techniques. This is not generally true for assembly language. The burden of writing well-structured code lies with the programmer, not with the language.
The best assembly programmers rely heavily on structured programming concepts. Failure to do so results in code that contains unnecessary branch instructions and, in the worst cases, results in something called spaghetti code. Consider a code listing where a line has been drawn from each branch instruction to its destination. If the result looks like someone spilled a plate of spaghetti on the page, then the listing is spaghetti code. If a program is spaghetti code, then the flow of control is difficult to follow. Spaghetti code is much more likely to have bugs and is extremely difficult to debug. If the flow of control is too complex for the programmer to follow, then it cannot be adequately debugged. It is the responsibility of the assembly language programmer to write code that uses a block-structured approach.
Adherence to structured programming principles results in code that has a much higher probability of working correctly. Well-written code also has fewer branch statements, so the ratio of data processing statements to branch statements is higher. High data processing density results in higher throughput of data. In other words, writing code in a structured manner leads to higher efficiency.
Sequencing simply means executing statements (or instructions) in a linear sequence. When statement n is completed, statement n + 1 will be executed next. Uninterrupted sequences of statements form basic blocks. Basic blocks have exactly one entry point and one exit point. Flow control is used to select which basic block should be executed next.
The first control structure that we will examine is the basic selection construct. It is called selection because it selects one of the two (or possibly more) blocks of code to execute, based on some condition. In its most general form, the condition could be computed in a variety of ways, but most commonly it is the result of some comparison operation or the result of evaluating a Boolean expression.
Most languages support selection in the form of an if-then-else statement. Selection can be implemented very easily in ARM assembly language with a two-stage process:
1. perform an operation that updates the CPSR flags, and
2. use conditional execution to select a block of instructions to execute.
Because the ARM architecture supports conditional execution on almost every instruction, there are two basic ways to implement this control structure: by using conditional execution on all instructions in a block, or by using branch instructions. The conditional execution can be applied directly to instructions following the flag update, or to branch instructions that transfer execution to another location. Listing 5.1 shows a typical if-then-else statement in C.

Listing 5.2 shows the ARM code equivalent to Listing 5.1, using conditional execution. The then and else are written with one instruction each on lines 7 and 8. The then section is written as a conditional instruction with the lt condition attached. The else section is a single instruction with the opposite (ge) condition. Therefore only one of the two instructions will actually execute, depending on the results of the cmp instruction. If there are three or fewer instructions in each block that can be selected, then this is the preferred and most efficient method of writing the bodies of the then and else selections.

Listing 5.3 shows the ARM code equivalent to Listing 5.1, using branch instructions. Note that this method requires a conditional branch, an unconditional branch, and two labels. If there are more than three instructions in either basic block, then this is the preferred and most efficient method of writing the bodies of the then and else selections.

More complex selection structures should be written with care. Listing 5.4 shows a fragment of C code which compares the variables a, b, and c, and sets the variable x to the least of the three values. In C, Boolean expressions use short-circuit evaluation. For example, consider the Boolean AND operator in the expression ((a<b)&&(a<c)). If the first sub-expression evaluates to false, then the truth value of the complete expression can be immediately determined to be false, so the second sub-expression is not evaluated. This usually results in the compiler generating very efficient assembly code. Good programmers can take advantage of short-circuiting by checking array bounds early in a Boolean expression and accessing array elements later in the expression. For example, the expression ((i<15)&&(array[i]<0)) makes sure that the index i is less than 15 before attempting to access the array. If the index is greater than 14, the array access will not take place. This prevents the program from attempting to access the 16th element of an array that has only 15 elements.

Listing 5.5 shows an ARM assembly code fragment which is equivalent to Listing 5.4. In this code fragment, r0 is used to store a temporary value for the variable x, and the value is only stored to memory once at the end of the fragment of code. The outer if-then-else statement is implemented using branch instructions. The first comparison is performed on line 8. If the comparison evaluates to false, then it immediately branches to the else block of the outer if-then-else statement. But if the first comparison evaluates to true, then it performs the second comparison. Again, if that comparison evaluates to false, then it branches to the else block of the outer if-then-else statement. If both comparisons evaluate to true, then it executes the then block of the outer if-then-else statement, and then branches to the statement following the else block.

The if-then-else statement on line 5 of Listing 5.4 is implemented using conditional execution. The comparison is performed on line 13 of Listing 5.5. Lines 14 and 15 contain instructions that are conditionally executed. Since they have complementary conditions, it is guaranteed that one of them will move a value into r0. The comparison on line 13 determines which statement executes.
Note that the number of comparisons performed will always be minimized, and the number of branches has also been minimized. The only way that line 13 can be reached is if one of the first two comparisons evaluates to false. If line 2 is executed, then no matter which sequence of events occurs, the program fragment will always reach line 16 and a value will be stored in x. Thus, the ARM assembly code fragment in Listing 5.5 can be considered to be a block of code with exactly one entry point and one exit point.
When writing nested selection structures, it is important to maintain a block structure, even if the bodies of the blocks consist of only a single instruction. It is often very helpful to write the algorithm in pseudo-code or a high-level language, such as C or Java, before converting it to assembly. Prolific commenting of the code is also strongly encouraged.
Iteration involves the transfer of control from a statement in a sequence to a previous statement in the sequence. The simplest type of iteration is the unconditional loop, also known as the infinite loop. This type of loop may be used in programs or tasks that should continue running indefinitely. Listing 5.6 shows an ARM assembly fragment containing an unconditional loop. Few high-level languages provide a true unconditional loop, but the high-level programmer can achieve a similar effect by using a conditional loop and specifying a condition that always evaluates to true.

A pre-test loop is a loop in which a test is performed before the block of instructions forming the loop body is executed. If the test evaluates to true, then the loop body is executed. The last instruction in the loop body is a branch back to the beginning of the test. If the test evaluates to false, then execution branches to the first instruction following the loop body. All structured programming languages have a pre-test loop construct. For example, in C, the pre-test loop is called a while loop. In assembly, a pre-test loop is constructed very similarly to an if-then statement. The only difference is that it includes an additional branch instruction at the end of the sequence of instructions that form the body. Listing 5.7 shows a pre-test loop in ARM assembly.

In a post-test loop, the test is performed after the loop body is executed. If the test evaluates to true, then execution branches to the first instruction in the loop body. Otherwise, execution continues sequentially. Most structured programming languages have a post-test loop construct. For example, in C, the post-test loop is called a do-while loop. Listing 5.8 shows a post-test loop in ARM assembly. The body of a post-test loop will always be executed at least once.

Many structured programming languages have a for loop construct, which is a type of counting loop. The for loop is not essential, and is only included as a matter of syntactical convenience. In some cases, a for loop is easier to write and understand than an equivalent pre-test or post-test loop. However, with the addition of an if-then construct, any loop can be implemented as a pre-test loop. The following sections show how loops can be converted from one form to another.
Listing 5.9 shows a simple C program with a for loop. The program prints “Hello World” 10 times, appending an integer to the end of each line.

In order to write an equivalent program in assembly, the programmer must first rewrite the for loop as a pre-test loop. Listing 5.10 shows the program rewritten so that it is easier to translate into assembly. Note that the initialization of the loop variable has been moved to its own line before the while statement. Also, the loop variable is modified on the last line of the loop body. This is a straightforward conversion from one type of loop to another type. Listing 5.11 shows a translation of the pre-test loop structure into ARM assembly.


If the programmer can guarantee that the body of a for loop will always execute at least once, then the for loop can be converted to an equivalent post-test loop. This form of loop is more efficient, because the loop control variable is tested one fewer time than in a pre-test loop. Also, a post-test loop requires only one label and one conditional branch instruction, whereas a pre-test loop requires two labels, a conditional branch, and an unconditional branch.
Since the loop in Listing 5.9 always executes the body exactly 10 times, we know that the body will always execute at least once. Therefore, the loop can be converted to a post-test loop. Listing 5.12 shows the program rewritten as a post-test loop so that it is easier to translate into assembly. Note that, as in the previous example, the initialization of the loop variable has been moved to its own line before the do-while loop, and the loop variable is modified on the last line of the loop body. This post-test version will produce the same output as the pre-test version. This is a straightforward conversion from one type of loop to an equivalent type. Listing 5.13 shows a straightforward translation of the post-test loop structure into ARM assembly.


A subroutine is a sequence of instructions to perform a specific task, packaged as a single unit. Depending on the particular programming language, a subroutine may be called a procedure, a function, a routine, a method, a subprogram, or some other name. Some languages, such as Pascal, make a distinction between functions and procedures. A function must return a value and must not alter its input arguments or have any other side effects (such as producing output or changing static or global variables). A procedure returns no value, but may alter the value of its arguments or have other side effects.
Other languages, such as C, make no distinction between procedures and functions. In these languages, functions may be described as pure or impure. A function is pure if:
1. the function always evaluates the same result value when given the same argument value(s), and
2. evaluation of the result does not cause any semantically observable side effect or output.
The first condition implies that the result of the function cannot depend on any hidden information or state that may change as program execution proceeds, or between different executions of the program, nor can it depend on any external input from I/O devices. The result value of a pure function does not depend on anything other than the argument values. If the function returns multiple result values, then these two conditions must apply to all returned values. Otherwise the function is impure. Another way to state this is that impure functions have side effects while pure functions have no side effects.
Assembly language does not impose any distinction between procedures and functions, pure or impure. Although every assembly language will provide a way to call subroutines and return from them, it is up to the programmer to decide how to pass arguments to the subroutines and how to pass return values back to the section of code that called the subroutine. Once again, the expert assembly programmer will use structured programming concepts to write efficient, readable, debuggable, and maintainable code.
Subroutines help programmers to design reliable programs by decomposing a large problem into a set of smaller problems. It is much easier to write and debug a set of small code pieces than it is to work on one large piece of code. Careful use of subroutines will often substantially reduce the cost of developing and maintaining a large program, while increasing its quality and reliability. The advantages of breaking a program into subroutines include:
• enabling reuse of code across multiple programs,
• reducing duplicate code within a program,
• enabling the programming task to be divided between several programmers or teams,
• decomposing a complex programming task into simpler steps that are easier to write, understand, and maintain,
• enabling the programming task to be divided into stages of development, to match various stages of a project, and
• hiding implementation details from users of the subroutine (a programming principle known as information hiding).
There are two minor disadvantages in using subroutines. First, invoking a subroutine (versus using in-line code) imposes overhead. The arguments to the subroutine must be put into some known location where the subroutine can find them. If the subroutine is a function, then the return value must be put into a known location where the caller can find it. Second, a subroutine typically requires some standard entry and exit code to manage the stack and save and restore the return address.
In most languages, the cost of using subroutines is hidden from the programmer. In assembly, however, the programmer is often painfully aware of the cost, since they have to explicitly write the entry and exit code for each subroutine, and must explicitly write the instructions to pass the data into the subroutine. However, the advantages usually outweigh the costs. Assembly programs can get very large and failure to modularize the code by using subroutines will result in code that cannot be understood or debugged, much less maintained and extended.
Subroutines may be defined within a program, or a set of subroutines may be packaged together in a library. Libraries of subroutines may be used by multiple programs, and most languages provide some built-in library functions. The C language has a very large set of functions in the C standard library. All of the functions in the C standard library are available to any program that has been linked with the C standard library. Even assembly programs can make use of this library. Linking is done automatically when gcc is used to assemble the program source. All that the programmer needs to know is the name of the function and how to pass arguments to it.
Listing 5.14 shows a very simple C program which reads an integer from standard input using scanf and prints the integer to standard output using printf. An equivalent program written in ARM assembly is shown in Listing 5.15. These examples show how arguments can be passed to subroutines in C and equivalently in assembly language.


All processor families have their own standard methods, or function calling conventions, which specify how arguments are passed to subroutines and how function values are returned. The function call standard allows programmers to write subroutines and libraries of subroutines that can be called by other programmers. In most cases, the function calling standards are not enforced by hardware, but assembly programmers and compiler writers conform to the standards in order to make their code accessible to other programmers. The basic subroutine calling rules for the ARM processor are simple:
• The first four arguments go in registers r0-r3.
• Any remaining arguments are pushed to the stack.
If the subroutine returns a value, then it is stored in r0 before the function returns to its caller. Calling a subroutine in ARM assembly usually requires several lines of code. The number of lines required depends on how many arguments the subroutine requires and where the data for those arguments are stored. Some variables may already be in the correct register. Others may need to be moved from one register to another. Still others may need to be pushed onto the stack. Careful programming is required to minimize the amount of work that must be done just to move the subroutine arguments into their required locations.
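As a sketch (the subroutine name and argument values are hypothetical), calling a five-argument subroutine places the first four arguments in r0-r3 and passes the fifth on the stack:

```
        mov     r0, #1              @ first argument
        mov     r1, #2              @ second argument
        mov     r2, #3              @ third argument
        mov     r3, #4              @ fourth argument
        mov     r4, #5
        push    {r4}                @ fifth argument is passed on the stack
        bl      my_function         @ hypothetical subroutine; any result comes back in r0
        add     sp, sp, #4          @ remove the fifth argument from the stack
```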
The ARM register set was introduced in Chapter 3. Some registers have special purposes that are dictated by the hardware design. Others have special purposes that are dictated by programming conventions. Programmers follow these conventions so that their subroutines are compatible with each other. These conventions are simply a set of rules for how registers should be used. In ARM assembly, all registers have alternate names which can be used to help remember the rules for using them. Fig. 5.1 shows an expanded view of the ARM registers, including their alternate names and conventional use.

Registers r0-r3 are also known as a1-a4, because they are used for passing arguments to subroutines. Registers r4-r11 are also known as v1-v8, because they are used for holding local variables in a subroutine. As mentioned in Section 3.2, register r11 can also be referred to as fp because it is used by the C compiler to track the stack frame, unless the code is compiled using the -fomit-frame-pointer command line option.
The intra-procedure-call scratch register, r12, is used by the C library when calling dynamically linked functions. If a subroutine does not call any C library functions, then it can use r12 as another register to store local variables. If a C library function is called, it may change the contents of r12. Therefore, if r12 is being used to store a local variable, it should be saved to another register or to the stack before a C library function is called.
The stack pointer (sp), link register (lr), and program counter (pc), along with the argument registers, are all involved in performing subroutine calls. The calling subroutine must place arguments in the argument registers, and possibly on the stack as well. Placing the arguments in their proper locations is known as marshaling the arguments. After marshaling the arguments, the calling subroutine executes the bl instruction, which will modify the program counter and link register. The bl instruction copies the return address (the address of the instruction following the bl) into the link register, then loads the program counter with the address of the first instruction in the subroutine that is being called. The CPU will then fetch and execute its next instruction from the address in the program counter, which is the first instruction of the subroutine that is being called.
Our first examples of calling a function will involve the printf function from the C standard library. The printf function can be a bit confusing at first, but it is an extremely useful and flexible function for printing formatted output. The printf function examines its first argument to determine how many other arguments have been passed to it. The first argument is a format string, which is a null-terminated ASCII string. The format string may include conversion specifiers, which start with the % character. For each conversion specifier, printf assumes that an argument has been passed in the correct register or location on the stack. The argument is retrieved, converted according to the specified format, and printed. The %d specifier prints the matching argument as an integer in base 10. Other specifiers include %X to print the matching argument as an integer in hexadecimal, %c to print the matching argument as an ASCII character, and %s to print a null-terminated string. The integer specifiers can include an optional width and zero-padding specification. For example, %8X will print an integer in hexadecimal using eight characters, padding on the left with spaces. The specifier %08X will also print an integer in hexadecimal using eight characters, but pads on the left with zeros. Similarly, %15d can be used to print an integer in base 10 using spaces to pad the number up to 15 characters, while %015d will print an integer in base 10 using zeros to pad up to 15 characters.
Listing 5.16 shows a call to printf in C. The printf function requires one argument, and can accept more than one. In this case, there is only one argument, the format string. Listing 5.17 shows an equivalent call made in ARM assembly language. The single argument is loaded into r0 in conformance with the ARM subroutine calling convention.


Listing 5.18 shows a call to printf in C having four arguments. The format string is the first argument. The format string contains three conversion specifiers, and is followed by three more arguments. Arguments are matched to conversion specifiers according to their positions. The type of each argument matches the type indicated in the conversion specifier. The first conversion specifier is applied to the second argument, the second conversion specifier is applied to the third argument, and the third conversion specifier is applied to the fourth argument. The \%d conversion specifiers indicate that the arguments are to be interpreted as integers and printed in base 10.

Listing 5.19 shows an equivalent call made in ARM assembly language. The arguments are loaded into r0-r3 in conformance with the ARM subroutine calling convention. Note that we assume that formatstr has previously been defined using a .asciz or .string assembler directive or equivalent method. As long as there are four or fewer arguments that must be passed, they can all fit in registers r0-r3 (a.k.a a1-a4), but when there are more arguments, things become a little more complicated. Any remaining arguments must be passed on the program stack, using the stack pointer r13. Care must be taken to ensure that the arguments are pushed to the stack in the proper order. Also, after the function call, the arguments must be removed from the stack, so that the stack pointer is restored to its original value.

Listing 5.20 shows a call to printf in C having more than four arguments. The format string is the first argument. The format string contains five conversion specifiers, which implies that the format string must be followed by five additional arguments. Arguments are matched to conversion specifiers according to their positions. The type of each argument matches the type indicated in the conversion specifier. The first conversion specifier is applied to the second argument, the second conversion specifier is applied to the third argument, the third conversion specifier is applied to the fourth argument, etc. The \%d conversion specifiers indicate that the arguments are to be interpreted as integers and printed in base 10.

Listing 5.21 shows an equivalent call made in ARM assembly language. Since there are six arguments, the last two must be pushed to the program stack. Each of those arguments is loaded into r0 in turn, and the pre-indexed addressing mode with writeback is used to subtract four bytes from the stack pointer and then store the argument at the top of the stack. Note that the sixth argument is pushed to the stack first, followed by the fifth argument. The remaining arguments are loaded into r0-r3. Note that we assume that formatstr has previously been defined as a format string beginning with "The results are:" using a .asciz or .string assembler directive.

Listing 5.22 shows how the fifth and sixth arguments can be pushed to the stack using a single stmfd instruction. The sixth argument is loaded into r3 and the fifth argument is loaded into r0, then the stmfd instruction is used to store them on the stack and adjust the stack pointer. A little care must be taken to ensure that the arguments are stored in the correct order on the stack. Remember that the stmfd instruction will always push the lowest-numbered register to the lowest address, and the stack grows downward. Therefore, r3, the sixth argument, will be pushed onto the stack first, making it grow downward by four bytes. Next, r0 is pushed, making the stack grow downward by four more bytes. As in the previous example, the remaining four arguments are loaded into a1-a4.

After the printf function is called, the fifth and sixth arguments must be popped from the stack. If those values are no longer needed, then there is no need to load them into registers. The quickest way to pop them from the stack is to simply adjust the stack pointer back to its original value. In this case, we pushed two arguments onto the stack, using a total of eight bytes. Therefore, all we need to do is add eight to the stack pointer, thereby restoring its original value.
We have looked at the conventions that are followed for calling functions. Now we will examine these same conventions from the point of view of the function being called. Because of the calling conventions, the programmer writing a function can assume that
• the first four arguments are in r0-r3,
• any additional arguments can be accessed with ldr rd,[sp,# offset ],
• the calling function will remove arguments from the stack, if necessary,
• if the function return type is not void, then it must ensure that the return value is in r0 (and possibly r1, r2, r3), and
• the return address will be in lr.
Also because of the conventions, there are certain registers that can be used freely while others must be preserved or restored so that the calling function can continue operating correctly. Registers which can be used freely are referred to as volatile, and registers which must be preserved or restored before returning are referred to as non-volatile. When writing a subroutine (function),
• registers r0-r3 and r12 are volatile,
• registers r4-r11 and r13 are non-volatile (they can be used, but their contents must be restored to their original value before the function returns),
• register r14 can be used by the function, but its contents must be saved so that the return address can be loaded into r15 when the function returns to its caller,
• if the function calls another function, then it must save register r14 either on the stack or in a non-volatile register before making the call.
Listing 5.23 shows a small C function that simply returns the sum of its six arguments. The ARM assembly version of that function is shown in Listing 5.24. Note that on line 5, the fifth argument is loaded from the stack, and on line 7, the sixth argument is loaded in a similar way, using an offset from the stack pointer. If the calling function has followed the conventions, then the fifth and sixth arguments will be where they are expected to be in relation to the stack pointer.
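Listing 5.23 is not reproduced here, but a C function matching its description might look like the following sketch (the name sum6 is an assumption):

```c
#include <assert.h>

/* Returns the sum of its six integer arguments.  Under the ARM calling
 * convention, a through d arrive in r0-r3, while e and f are passed on
 * the stack and must be loaded using offsets from the stack pointer. */
int sum6(int a, int b, int c, int d, int e, int f) {
    return a + b + c + d + e + f;
}
```

A call such as sum6(1, 2, 3, 4, 5, 6) returns 21; the assembly version must fetch 5 and 6 from the stack.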


In block-structured high-level languages, an automatic variable is a variable that is local to a block of code and not declared with static duration. It has a lifetime that lasts only as long as its block is executing. Automatic variables can be stored in one of two ways:
1. the stack is temporarily adjusted to hold the variable, or
2. the variable is held in a register during its entire life.
When writing a subroutine in assembly, it is the responsibility of the programmer to decide what automatic variables are required and where they will be stored. In high-level languages this decision is usually made by the compiler. In some languages, including C, it is possible to request that an automatic variable be held in a register. The compiler will attempt to comply with the request, but it is not guaranteed. Listing 5.25 shows a small function which requests that one of its variables be kept in a register instead of on the stack.
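Listing 5.25 is not reproduced here; a sketch consistent with its description, assuming a 20-element local array (80 bytes on the stack) and a register-resident loop counter, might be:

```c
#include <assert.h>

/* Hypothetical sketch: the array lives on the stack, while the loop
 * counter is requested (but not guaranteed) to be kept in a register. */
int sum_squares(void) {
    int arr[20];        /* 80 bytes, allocated on the stack           */
    register int i;     /* request: keep i in a register              */
    int total = 0;

    for (i = 0; i < 20; i++)
        arr[i] = i * i;
    for (i = 0; i < 20; i++)
        total += arr[i];
    return total;       /* 0^2 + 1^2 + ... + 19^2 = 2470 */
}
```

The register keyword is only a hint; a modern compiler will usually make this decision on its own.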

Listing 5.26 shows how the function could be implemented in assembly. Note that the array of integers consumes 80 bytes of storage on the stack, and could not possibly fit into the registers available on the ARM processor. However, the loop control variable can easily be stored in one of the registers for the duration of the function. Also notice that on line 1 the storage for the array is allocated simply by adjusting the stack pointer, and on line 9 the storage is released by restoring the stack pointer to its original value. It is critical that the stack pointer be restored, no matter how the function returns; otherwise, the calling function will probably fail in mysterious ways. For this reason, each function should have exactly one block of instructions for returning. If the function needs to return from some location other than the end, then it should branch to the return block rather than returning directly.

A function that calls itself is said to be recursive. Certain problems are easy to implement recursively, but are more difficult to solve iteratively. A problem exhibits recursive behavior when it can be defined by two properties:
1. a simple base case (or cases), and
2. a set of rules that reduce all other cases toward the base case.
For example, we can define a person’s ancestors recursively as follows:
1. one’s parents are one’s ancestors (base case),
2. the ancestors of one’s ancestors are also one’s ancestors (recursion step).
Recursion is a very powerful concept in programming. Many functions are naturally recursive, and can be expressed very concisely in a recursive way. Numerous mathematical axioms are based upon recursive rules. For example, the formal definition of the natural numbers by the Peano axioms can be formulated as:
1. zero is a natural number, and
2. each natural number has a successor, which is also a natural number.
Using one base case and one recursive rule, it is possible to generate the set of all natural numbers. Other recursively defined mathematical objects include functions and sets.
Listing 5.27 shows the C code for a small program which uses recursion to reverse the order of characters in a string. The base case where recursion ends is when there are fewer than two characters remaining to be swapped. The recursive rule is that the reverse of a string can be created by swapping the first and last characters and then reversing the string between them. In short, a string is reversed if:

1. the string has a length of zero or one character, or
2. the first and last characters have been swapped and the remaining characters have been reversed.
In Listing 5.27, line 3 checks for the base case. If the string has not been reversed according to the first rule, then the second rule is applied. Lines 5–7 swap the first and last characters, and line 8 recursively reverses the characters between them.
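Listing 5.27 is not reproduced here; a C sketch consistent with its description (the parameter names are assumptions) is:

```c
#include <string.h>
#include <assert.h>

/* Reverses s[first..last] in place by swapping the end characters
 * and recursing inward on the characters between them. */
void reverse(char s[], int first, int last) {
    if (last <= first)              /* base case: 0 or 1 chars remain */
        return;
    char tmp = s[first];            /* swap first and last characters */
    s[first] = s[last];
    s[last]  = tmp;
    reverse(s, first + 1, last - 1);
}
```

Called as reverse(str, 0, strlen(str) - 1), this turns "hello" into "olleh".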
Listing 5.28 shows how the reverse function can be implemented using recursion in ARM assembly. Line 1 saves the link register to the stack and decrements the stack pointer. Next, storage is allocated for an automatic variable. Lines 3 and 4 test for the base case. If the current case is the base case, then the function simply returns (restoring the stack as it goes). Otherwise, the first and last characters are swapped in lines 5 through 10 and a recursive call is made in lines 11 through 13.

The code in Listing 5.28 can be made a bit more efficient. First, the test for the base case can be performed before anything else is done, as shown in Listing 5.29. Also, the local variable tmp can be stored in a volatile register rather than stored on the stack, because it is only needed for lines 4 through 8. It is not needed after the recursive call, so there is really no need to preserve it on the stack. This means that our function can use half as much stack space and will run much faster. This further refined version is shown in Listing 5.30. This version uses ip (r12) as the tmp variable instead of using the stack.


The previous examples used the concept of an array of characters to access the string that is being reversed. Listing 5.31 shows how this problem can be solved in C using pointers to the first and last characters rather than array indices. This version only has two parameters in the reverse function, and uses pointer dereferencing rather than array indexing to access each character. Other than that difference, it works the same as the original version. Listing 5.32 shows how the reverse function can be implemented efficiently in ARM assembly. This implementation has the same number of instructions as the previous version, but lines 4 through 7 use a different addressing mode. On the ARM processor, the pointer method and the array index method are equally efficient. However, many processors do not have the rich set of addressing modes available on the ARM. On those processors, the pointer method may be significantly more efficient.
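A sketch of the pointer-based version described above (the name reverse_p is an assumption, used here to distinguish it from the array-index version):

```c
#include <string.h>
#include <assert.h>

/* Two-parameter version: pointers to the first and last characters
 * replace the array indices; dereferencing replaces indexing. */
void reverse_p(char *first, char *last) {
    if (last <= first)              /* base case: pointers met or crossed */
        return;
    char tmp = *first;              /* swap the two end characters */
    *first = *last;
    *last  = tmp;
    reverse_p(first + 1, last - 1);
}
```

The ARM's pre- and post-indexed addressing modes make both versions equally cheap in assembly; on processors with fewer addressing modes, the pointer form often wins.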


An aggregate data item can be referenced as a single entity, and yet consists of more than one piece of data. Aggregate data types are used to keep related data together, so that the programmer’s job becomes easier. Some examples of aggregate data are arrays, structures or records, and objects. In most programming languages, aggregate data types can be defined to create higher-level structures. Most high-level languages allow aggregates to be composed of basic types as well as other aggregates. Proper use of structured data helps to make programs less complicated and easier to understand and maintain.
In high-level languages, there are several benefits to using aggregates. Aggregates make the relationships between data clear, and allow the programmer to perform operations on blocks of data. Aggregates also make passing parameters to functions simpler and easier to read.
The most common aggregate data type is an array. An array contains zero or more values of the same data type, such as characters, integers, floating point numbers, or fixed point numbers. An array may also contain values of another aggregate data type. Every element in an array must have the same type. Each data item in an array can be accessed by its array index.
Listing 5.33 shows how an array can be allocated and initialized in C. Listing 5.34 shows the equivalent code in ARM assembly. Note that in this case, the scaled register offset addressing mode was used to access each element in the array. This mode is often convenient when the size of each element in the array is an integer power of 2. If that is not the case, then it may be necessary to use a different addressing mode. An example of this will be given in Section 5.5.3.


The second common aggregate data type is implemented as the struct in C or the record in Pascal. It is commonly referred to as a structured data type or a record. This data type can contain multiple fields. The individual fields in the structured data may also be referred to as structured data elements, or simply elements. In most high-level languages, each element of a structured data type may be one of the base types, an array type, or another structured data type. Listing 5.35 shows how a struct can be declared, allocated, and initialized in C. Listing 5.36 shows the equivalent code in ARM assembly.
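As a generic illustration (the structure and field names below are assumptions, not those of Listing 5.35), a struct can be declared, allocated, and initialized in C like this:

```c
#include <string.h>
#include <assert.h>

/* A structure whose elements mix base types and an array type. */
struct point {
    int  x;
    int  y;
    char label[8];
};

/* Allocate a struct point on the stack and initialize its fields. */
struct point make_origin(void) {
    struct point p;
    p.x = 0;
    p.y = 0;
    strcpy(p.label, "origin");
    return p;
}
```

In assembly, each field access becomes a load or store at a fixed offset from the structure's base address.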


Care must be taken using assembly to access data structures that were declared in higher level languages such as C and C++. The compiler will typically pad a data structure to ensure that the data fields are aligned for efficiency. On most systems, it is more efficient for the processor to access word-sized data if the data is aligned to a word boundary. Some processors simply cannot load or store a word from an address that is not on a word boundary, and attempting to do so will result in an exception. The assembly programmer must somehow determine the relative address of each field within the higher-level language structure. One way that this can be accomplished in C is by writing a small function which prints out the offsets to each field in the C structure. The offsets can then be used to access the fields of the structure from assembly language. Another method for finding the offsets is to run the program under a debugger and examine the data structure.
It is often useful to create arrays of structured data. For example, a color image may be represented as a two-dimensional array of pixels, where each pixel consists of three integers which specify the amount of red, green, and blue that are present in the pixel. Typically, each of the three values is represented using an unsigned eight bit integer. Image processing software often adds a fourth value, α, specifying the transparency of each pixel.
Listing 5.37 shows how an array of pixels can be allocated and initialized in C. The listing uses the malloc() function from the C standard library to allocate storage for the pixels from the heap (see Section 1.4). Note that the code uses the sizeof operator to determine how many bytes of memory are consumed by a single pixel, then multiplies that by the width and height of the image. Listing 5.38 shows the equivalent code in ARM assembly.
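A sketch of the allocation described, assuming a minimal three-component pixel structure (the names are assumptions):

```c
#include <stdlib.h>
#include <assert.h>

/* Each pixel holds red, green, and blue as unsigned 8-bit integers. */
typedef struct {
    unsigned char red, green, blue;
} pixel;

/* Allocate a width-by-height image from the heap.  sizeof gives the
 * bytes in one pixel; multiplying by the pixel count gives the total. */
pixel *alloc_image(int width, int height) {
    return malloc(sizeof(pixel) * (size_t)width * (size_t)height);
}
```

Because sizeof(pixel) is 3 rather than a power of 2, scaled register offset addressing cannot index this array directly, which is why the assembly version must compute each pixel's address explicitly.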


Note that the code in Listing 5.38 is far from optimal. It can be greatly improved by combining the two loops into one loop. This will remove the need for the multiply on line 28 and the addition on line 29, and will simplify the code structure. An additional improvement would be to increment the single loop counter by three on each loop iteration, making it very easy to calculate the pointer for each pixel. Listing 5.39 shows the ARM assembly implementation with these optimizations.

Although the implementation shown in Listing 5.39 is more efficient than the previous version, there are several more improvements that can be made. If we consider that the goal of the code is to allocate some number of bytes and initialize them all to zero, then the code can be written more efficiently. Rather than using three separate store instructions to set 3 bytes to zero on each iteration of the loop, why not use a single store instruction to set four bytes to zero on each iteration? The only problem with this approach is that we must consider the possibility that the array may end in the middle of a word. However, this can be dealt with by using two consecutive loops. The first loop sets one word of the array to zero on each iteration, and the second loop finishes off any remaining bytes. Listing 5.40 shows the results of these additional improvements. This third implementation will run much faster than the previous implementations.
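The two-loop strategy can be sketched in C as follows (assuming the buffer is word-aligned, as malloc guarantees):

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Zero n bytes starting at p: one 32-bit word per iteration in the
 * first loop, then a second loop for the 0-3 leftover bytes.
 * Assumes p is word-aligned. */
void zero_bytes(unsigned char *p, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4)          /* word-at-a-time loop */
        *(uint32_t *)(p + i) = 0;
    for (; i < n; i++)                  /* byte cleanup loop   */
        p[i] = 0;
}
```

For an array whose length is not a multiple of four, the first loop handles the bulk of the work and the second loop finishes the tail, exactly as described for Listing 5.40.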

Spaghetti code is the bane of assembly programming, but it can easily be avoided. Although assembly language does not enforce structured programming, it does provide the low-level mechanisms required to write structured programs. The assembly programmer must be aware of, and assiduously practice, proper structured programming techniques. The burden of writing properly structured code blocks, with selection structures and iteration structures, lies with the programmer, and failure to apply structured programming techniques will result in code that is difficult to understand, debug, and maintain.
Subroutines provide a way to split programs into smaller parts, each of which can be written and debugged individually. This allows large projects to be divided among team members. In assembly language, defining and using subroutines is not as easy as in higher level languages. However, the benefits usually outweigh the costs. The C library provides a large number of functions. These can be accessed by an assembly program as long as it is linked with the C standard library.
Assembly provides the mechanisms to access aggregate data types. Arrays can be accessed using various addressing modes on the ARM processor. The pre-indexing and post-indexing modes allow array elements to be accessed using pointers, with the pointers being incremented after each element access. Fields in structured data records can be accessed using immediate offset addressing mode. The rich set of addressing modes available on the ARM processor allows the programmer to use aggregate data types more efficiently than on most processors.
5.1 What does it mean for a register to be volatile? Which ARM registers are considered volatile according to the ARM function calling convention?
5.2 Fully explain the differences between static variables and automatic variables.
5.3 In ARM assembly language, write a function that is equivalent to the following C function.

5.4 What are the two places where an automatic variable can be stored?
5.5 You are writing a function and you decided to use registers r4 and r5 within the function. Your function will not call any other functions; it is self-contained. Modify the following skeleton structure to ensure that r4 and r5 can be used within the function and are restored to comply with the ARM standards, but without unnecessary memory accesses.

5.6 Convert the following C program to ARM assembly, using a post-test loop:

5.7 Write a complete ARM function to shift a 64-bit value left by any given amount between 0 and 63 bits. The function should expect its arguments to be in registers r0, r1, and r2. The lower 32 bits of the value are passed in r0, the upper 32 bits of the value are passed in r1, and the shift amount is passed in r2.
5.8 Write a complete subroutine in ARM assembly that is equivalent to the following C subroutine.

5.9 Write a complete function in ARM assembly that is equivalent to the following C function.


5.10 Write an ARM assembly function to calculate the average of an array of integers, given a pointer to the array and the number of items in the array. Your assembly function must implement the following C function prototype:
Assume that the processor does not support the div instruction, but there is a function available to divide two integers. You do not have to write this function, but you may need to call it. Its C prototype is:
5.11 Write a complete function in ARM assembly that is equivalent to the following C function. Note that a and b must be allocated on the stack, and their addresses must be passed to scanf so that it can place their values into memory.

5.12 The factorial function can be defined recursively as 0! = 1, and x! = x × (x−1)! for x > 0.
The following C program repeatedly reads x from the user and calculates x! It quits when it reads end-of-file or when the user enters a negative number or something that is not an integer.
Write this program in ARM assembly.

5.13 For large x, the factorial function is slow. However, a lookup table can be added to the function to improve average performance. This technique is commonly known as memoization or tabling, but is sometimes called dynamic programming. The following C implementation of the factorial function uses memoization. Modify your ARM assembly program from the previous problem to include memoization.
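A hedged reconstruction of a memoized factorial in C (the table size and names are assumptions; 12! is the largest factorial that fits in an unsigned 32-bit integer):

```c
#include <assert.h>

#define MAXFACT 13

static unsigned int table[MAXFACT];   /* 0 marks "not yet computed" */

/* Recursive factorial with memoization: each result is stored on the
 * way up, so later calls return immediately from the lookup table. */
unsigned int factorial(int x) {
    if (x <= 1)
        return 1;                      /* base case */
    if (x < MAXFACT && table[x] != 0)
        return table[x];               /* answer already memoized */
    unsigned int r = (unsigned int)x * factorial(x - 1);
    if (x < MAXFACT)
        table[x] = r;                  /* remember for next time */
    return r;
}
```

After one call to factorial(12), every smaller factorial is answered in constant time from the table.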


This chapter extends the coverage of structured programming to include abstract data types (ADT). It begins by defining abstract data types and giving a small example of an ADT that could be used to read, process, and write Netpbm images. The next section introduces an ADT written in C to perform word frequency counts, and shows how performance can be greatly improved by using better algorithms and/or by writing some functions in assembly language. It also shows how a binary tree structure created by C code can be traversed in assembly language. The chapter ends with an ethics module about the Therac-25 cancer treatment device.
Abstract data type; Word frequency count; Binary tree; Index; Sort; Ethics
An abstract data type (ADT) is composed of data and the operations that work on that data. The ADT is one of the cornerstones of structured programming. Proper use of ADTs has many benefits. Most importantly, abstract data types help to support information hiding. A software module hides information by encapsulating the information into a module or other construct which presents an interface. The interface typically consists of the names of data types provided by the ADT and a set of subroutine definitions, or prototypes, for operating on the data types. The implementation of the ADT is hidden from the client code that uses the ADT.
A common use of information hiding is to hide the physical storage layout for data so that if it is changed, the change is restricted to a small subset of the total program. For example, if a three-dimensional point (x,y,z) is represented in a program with three floating point scalar variables, and the representation is later changed to a single array variable of size three, a module designed with information hiding in mind would protect the remainder of the program from such a change.
Information hiding reduces software development risk by shifting the code’s dependency on an uncertain implementation onto a well-defined interface. Clients of the interface perform operations purely through the interface, which does not change. If the implementation changes, the client code does not have to change.
Encapsulating software and data structures behind an interface allows the construction of objects that mimic the behavior and interactions of objects in the real world. For example, a simple digital alarm clock is a real-world object that most people can use and understand. They can understand what the alarm clock does, and how to use it through the provided interface (buttons and display) without needing to understand every part inside of the clock. If the internal circuitry of the clock were to be replaced with a different implementation, people could continue to use it in the same way, provided that the interface did not change.
As with all other structured programming concepts, ADTs can be implemented in assembly language. In fact, most high-level compilers convert structured programming code into assembly during compilation. All that is required is that the programmer define the data structure(s), and the set of operations that can be used on the data. Listing 6.1 gives an example of an ADT interface in C. The type Image is not fully defined in the interface. This prevents client software from accessing the internal structure of the image data type. Therefore, programmers using the ADT can modify images only by using the provided functions. Other structured programming and object-oriented programming languages such as C++, Java, Pascal, and Modula 2 provide similar protection for data structures so that client code can access the data structure only through the provided interface. Note that only the pval definition is exposed, indicating to client programs that the red, green, and blue components of a pixel must be a number between 0 and 255. In C, as with other structured programming languages, the implementation of the subroutines can also be hidden by placing them in separate compilation modules. Those modules will have access to the internal structure of the Image data type.
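The opaque-type technique can be sketched as follows; only the names Image and pval come from Listing 6.1, while the function names and internal fields are assumptions. In a real program the interface and implementation would live in separate files:

```c
#include <stdlib.h>
#include <assert.h>

/* ---- public interface (what would appear in the header) ---- */
typedef unsigned char pval;       /* exposed: components are 0-255 */
typedef struct image Image;       /* opaque: fields remain hidden  */

Image *image_create(int width, int height);
int    image_width(const Image *img);
void   image_destroy(Image *img);

/* ---- private implementation (normally a separate .c module) ---- */
struct image {
    int   width, height;
    pval *data;                   /* width * height * 3 components */
};

Image *image_create(int width, int height) {
    Image *img  = malloc(sizeof *img);
    img->width  = width;
    img->height = height;
    img->data   = calloc((size_t)width * height * 3, sizeof(pval));
    return img;
}

int image_width(const Image *img) { return img->width; }

void image_destroy(Image *img) { free(img->data); free(img); }
```

Client code can hold Image pointers and call these functions, but cannot touch width, height, or data directly, so the storage layout can change without breaking any client.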

Assembly language does not have the ability to define a data structure as such, but it does provide the mechanisms needed to specify the location of each field with respect to the beginning of a data structure, as well as the overall size of the data structure. With a little thought and effort, it is possible to implement ADTs in Assembly language. Listing 6.2 shows the private implementation of the Image data type, which is included by the C files which implement the Image data type. Listing 6.3 shows how the data structures from the previous listings can be defined in assembly language. With those definitions, any of the functions declared in Listing 6.1 can be written in assembly language.


Counting the frequency of words in written text has several uses. In digital forensics, it can be used to provide evidence as to the author of written communications. Different people have different vocabularies, and use words with differing frequency. Word counts can also be used to classify documents by type. Scientific articles from different fields contain words specific to that field, and historical novels will differ from western novels in word frequency.
Listing 6.4 shows the main function for a simple C program which reads a text file and creates a list of all the words contained in a file, along with their frequency of occurrence. The program has been divided into two parts: the main program, and an ADT which is used to keep track of the words and their frequencies, and to print a table of word frequencies.


The interface for the ADT is shown in Listing 6.5. There are several ways that the ADT could be implemented. Note that the interface given in the header file does not show the internal fields of the word list data type. Thus, any file which includes this header is allowed to declare pointers to wordlist data types, but cannot access or modify any internal fields. The list of words could be stored in an array, a linked list, a binary tree, or some other data structure. The subroutines could be implemented in C or in some other language, including assembly. Listing 6.6 shows an implementation in C using a linked list. Note that the function for printing the word frequency list in numerical order has not been implemented. It will be written in assembly language. Since the program is split into multiple files, it is a good idea to use the make utility to build the executable program. A basic makefile is shown in Listing 6.7.
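The alphabetically-ordered linked-list insertion described for Listing 6.6 can be sketched as follows (the node layout and the name wl_add are assumptions, not the book's definitions):

```c
#include <stdlib.h>
#include <string.h>
#include <assert.h>

struct wnode {
    char         *word;
    int           count;
    struct wnode *next;
};

/* Insert word into the alphabetically-ordered list, or, if it is
 * already present, just increment its count.  Returns the new head. */
struct wnode *wl_add(struct wnode *head, const char *word) {
    struct wnode **pp = &head;
    while (*pp != NULL && strcmp((*pp)->word, word) < 0)
        pp = &(*pp)->next;            /* walk to the insertion point */
    if (*pp != NULL && strcmp((*pp)->word, word) == 0) {
        (*pp)->count++;               /* word already in the list */
    } else {
        struct wnode *n = malloc(sizeof *n);
        n->word  = malloc(strlen(word) + 1);
        strcpy(n->word, word);
        n->count = 1;
        n->next  = *pp;               /* splice in before *pp */
        *pp = n;
    }
    return head;
}
```

Keeping the list alphabetical makes duplicate detection a single comparison at the insertion point.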





Suppose we wish to implement one of the functions from Listing 6.6 in ARM assembly language. We would delete the function from the C file, create a new file with the assembly version of the function, and modify the makefile so that the new file is included in the program. The header file and the main program file would not require any changes. The header file provides function prototypes that the C compiler uses to determine how parameters should be passed to the functions. As long as our new assembly function conforms to its C header definition, the program will work correctly.
The linked list is created in alphabetical order, but the wl_print_numerical() function is required to print it sorted in reverse order of number of occurrences. There are several ways in which this could be accomplished, with varying levels of efficiency. The possible approaches include, but are not limited to:
• Re-ordering the linked list using an insertion sort: This approach creates a complete new list by removing each item, one at a time, from the original list, and inserting it into a new list sorted by the number of occurrences rather than the words themselves. The time complexity for this approach would be O(N²), but it would require no additional storage. However, if the list were later needed in alphabetical order, or any more words were to be added, then it would need to be re-sorted in the original order.
• Sorting the linked list using a merge sort algorithm: Merge sort is one of the most efficient sorting algorithms known and can be efficiently applied to data in files and linked lists. The merge sort works as follows:
1. The sub-list size, i, is set to 1.
2. The list is divided into sub-lists, each containing i elements. Each sub-list is assumed to be sorted. (A sub-list of length one is sorted by definition.)
3. The sub-lists are merged together to create a list of sub-lists of size 2i, where each sub-list is sorted.
4. The sub-list size, i, is set to 2i.
5. The process is repeated from step 2 until i ≥ N, where N is the number of items to be sorted.
The time complexity for the merge sort algorithm is O(N log₂ N), which is far more efficient than the insertion sort. This approach would also require no additional storage. However, if the list were later needed in alphabetical order, or any more words were to be added, then it would need to be re-sorted into the original alphabetical order.
• Create an index, and sort the index rather than rebuilding the list. Since the number of elements in the list is known, we can allocate an array of pointers. Each pointer in the array is then initialized to point to one element in the linked list. The array forms an index, and the pointers in the array can be re-sorted into any desired order, using any common sorting method such as bubble sort (O(N²)), in-place insertion sort (O(N²)), quick sort (O(N log₂ N) on average), or others. This approach requires additional storage, but has the advantage that it does not need to modify the original linked list.
There are many other possibilities for re-ordering the list. Regardless of which method is chosen, the main program and the interface (header file) need not be changed. Different implementations of the sorting function can be substituted without affecting any other code.
The wl_print_numerical() function can be implemented in assembly as shown in Listing 6.8. The function operates by re-ordering the linked list using an insertion sort, as described above. Listing 6.9 shows the change that must be made to the makefile. Now, when make is run, it compiles the two C files and the assembly file into object files, then links them all together. The C implementation of wl_print_numerical() in list.c must be deleted or commented out, or the linker will emit an error indicating that it found two definitions of wl_print_numerical().
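The insertion-sort re-ordering described above can be sketched in C. This is only an illustration of the technique, not the book's listing; the node layout (word, count, next) and the function name sort_by_count are our assumptions, and the real wordlistnode fields may differ.

```c
#include <stdlib.h>

/* Hypothetical node layout; the book's wordlistnode may differ. */
typedef struct node {
    char word[32];
    int count;
    struct node *next;
} node;

/* Remove each node from the input list, one at a time, and insert it
   into a new list kept in descending order of count. O(N^2) overall.
   Returns the head of the re-ordered list. */
node *sort_by_count(node *head) {
    node *sorted = NULL;
    while (head != NULL) {
        node *n = head;
        head = head->next;
        /* Walk the sorted list to find the insertion point. */
        node **p = &sorted;
        while (*p != NULL && (*p)->count >= n->count)
            p = &(*p)->next;
        n->next = *p;
        *p = n;
    }
    return sorted;
}
```

Using a pointer-to-pointer for the insertion point avoids a special case for inserting at the head of the list.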




The word frequency counter, as previously implemented, takes several minutes to count the frequency of words in the author’s manuscript for this textbook on a Raspberry Pi. Most of the time is spent building the list of words and re-sorting the list in order of word frequency. Most of the time for both of these operations is spent in searching for the word in the list before incrementing its count or inserting it in the list. There are more efficient ways to build ordered lists of data.
Since the code is well modularized using an ADT, the internal mechanism of the list can be modified without affecting the main program. A major improvement can be made by changing the data structure from a linked list to a binary tree. Fig. 6.1 shows an example binary tree storing word frequency counts. The time required to insert into a linked list is O(N), but the time required to insert into a balanced binary tree is O(log₂ N). To give some perspective, the author's manuscript for this textbook contains about 125,000 words. Since log₂(125,000) < 17, we would expect the linked list implementation to require roughly 125,000/17 ≈ 7,350 times as long as a binary tree implementation to process the author's manuscript for this textbook. In reality, there is some overhead to the binary tree implementation. Even with the extra overhead, we should see a significant speedup. Listing 6.10 shows the C implementation using a balanced binary tree instead of a linked list.
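The core of the tree-based approach can be sketched in C. For brevity this sketch uses a plain (unbalanced) binary search tree, whereas Listing 6.10 uses a balanced tree; the node layout and the name tree_add are our assumptions.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical node layout for the word-frequency tree. */
typedef struct tnode {
    char word[32];
    int count;
    struct tnode *left, *right;
} tnode;

/* Find word in the tree and increment its count, or insert a new node
   with count 1. Each call walks one root-to-leaf path, so the cost is
   O(log2 N) when the tree is balanced. Returns the (possibly new) root. */
tnode *tree_add(tnode *root, const char *word) {
    if (root == NULL) {
        tnode *n = calloc(1, sizeof(tnode));
        strncpy(n->word, word, sizeof(n->word) - 1);
        n->count = 1;
        return n;
    }
    int cmp = strcmp(word, root->word);
    if (cmp == 0)
        root->count++;                            /* word already present */
    else if (cmp < 0)
        root->left = tree_add(root->left, word);  /* smaller words go left */
    else
        root->right = tree_add(root->right, word);
    return root;
}
```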








With the tree implementation, wl_print_numerical() could build a new tree, sorted on the word frequency counts. However, it may be more efficient to build a separate index, and sort the index by word frequency counts. The assembly code will allocate an array of pointers, and set each pointer to one of the nodes in the tree, as shown in Fig. 6.2. Then, it will use a quick sort to sort the pointers into descending order by word frequency count, as shown in Fig. 6.3. This implementation is shown in Listing 6.11.
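The index-building step can be sketched in C (Listing 6.11 does the equivalent work in assembly). The tree-node layout is repeated here so the sketch is self-contained, and the function names are our assumptions; the sketch uses the standard library qsort rather than a hand-written quick sort.

```c
#include <stdlib.h>

/* Hypothetical node layout, matching the word-frequency tree. */
typedef struct tnode {
    char word[32];
    int count;
    struct tnode *left, *right;
} tnode;

/* In-order walk that stores a pointer to every node in the index array. */
static void fill_index(tnode *root, tnode **index, int *pos) {
    if (root == NULL) return;
    fill_index(root->left, index, pos);
    index[(*pos)++] = root;
    fill_index(root->right, index, pos);
}

/* qsort comparator: descending by count, written to avoid overflow. */
static int by_count_desc(const void *a, const void *b) {
    const tnode *x = *(tnode * const *)a;
    const tnode *y = *(tnode * const *)b;
    return (y->count > x->count) - (y->count < x->count);
}

/* Build an index of the n tree nodes and sort it by descending count.
   The tree itself is never modified; the caller frees the index. */
tnode **make_sorted_index(tnode *root, int n) {
    tnode **index = malloc(n * sizeof(tnode *));
    int pos = 0;
    fill_index(root, index, &pos);
    qsort(index, n, sizeof(tnode *), by_count_desc);
    return index;
}
```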






The tree-based implementation gets most of its speed improvement through using two O(N log₂ N) algorithms to replace O(N²) algorithms. These examples show how a small part of a program can be implemented in assembly language, and how to access C data structures from assembly language. The functions could just as easily have been written in C rather than assembly, without greatly affecting performance. Later chapters will show examples where the assembly implementation does have significantly better performance than the C implementation.
The Therac-25 was a device designed for radiation treatment of cancer. It was produced by Atomic Energy of Canada Limited (AECL), which had previously produced the Therac-6 and Therac-20 units in partnership with CGR of France. It was capable of treating tumors close to the skin surface using electron beam therapy, but could also be configured for Megavolt X-ray therapy to treat deeper tumors. The X-ray therapy required the use of a tungsten radiation shield to limit the area of the body that was exposed to the potentially lethal radiation produced by the device.
The Therac-25 used a double pass accelerator, which provided more power, in a smaller space, at less cost, compared to its predecessors. The second major innovation was that computer control was a central part of the design, rather than an add-on component as in its predecessors. Most of the hardware safety interlocks that were integral to the designs of the Therac-6 and Therac-20 were seen as unnecessary, because the software would perform those functions. Computer control was intended to allow operators to set up the machine more quickly, allowing them to spend more time communicating with patients and to treat more patients per day. It was also seen as a way to reduce production costs by relying on software, rather than hardware, safety interlocks.
There were design issues with both the software and the hardware. Although this machine was built with the goal of saving lives, between 1985 and 1986, three deaths and other injuries were attributed to the hardware and software design of this machine. Death due to radiation exposure is usually slow and painful, and the problem was not identified until the damage had been done.
AECL was required to obtain US Food and Drug Administration (FDA) approval before releasing the Therac-25 to the US market. They obtained approval quickly by declaring “pre-market equivalence,” effectively claiming that the new machine was not significantly different from its predecessors. This practice was common in 1984, but was overly optimistic, considering that most of the safety features had been changed from hardware to software implementations. With FDA approval, AECL made the Therac-25 commercially available and performed a Fault Tree Analysis to evaluate the safety of the device.
Fault Tree Analysis, as its name implies, requires building a tree to describe every possible fault and assigning probabilities to those faults. After building the tree, the probabilities of hazards, such as overdose, can be calculated. Unfortunately, the engineers assumed that the software (much of which was re-used from the previous Therac models) would operate correctly. This turned out not to be the case, because the hardware interlocks present in the previous models had hidden some of the software faults. The analysts did consider some possible computer faults, such as an error being caused by cosmic rays, but assigned extremely low probabilities to those faults. As a result, the assessment was very inaccurate.
When the first report of an overdose was reported to AECL in 1985, they sent an engineer to the site to investigate. They also filed a report with the FDA and the Canadian Radiation Protection Board (CRPB). AECL also notified all users of the fact that there had been a report and recommended that operators should visually confirm hardware settings before each treatment. The AECL engineers were unable to reproduce the fault, but suspected that it was due to the design and placement of a microswitch. They redesigned the microswitch and modified all of the machines that had been deployed. They also retracted their recommendation that operators should visually confirm hardware settings before each treatment.
Later that year, a second incident occurred. In this case, there is no evidence that AECL took any action. In January of 1986, AECL received another incident report. An employee at AECL responded by denying that the Therac-25 was at fault, and stated that no other similar incidents had been reported. Another incident occurred in March of that year. AECL sent an engineer to investigate. The engineer was unable to determine the cause, and suggested that it was due to an electrical problem, which may have caused an electrical shock. An independent engineering firm was called to examine the machine and reported that it was very unlikely that the machine could have delivered an electrical shock to the patient. In April of 1986, another incident was reported. In this case, the AECL engineers, working with the medical physicist at the hospital, were able to reproduce the sequence of events that led to the overdose.
As required by law, AECL filed a report with the FDA. The FDA responded by declaring the Therac-25 defective. AECL was ordered to notify all of the sites where the Therac-25 was in use, investigate the problem, and file a corrective action plan. AECL notified all sites, and recommended removing certain keys from the keyboard on the machines. The FDA responded by requiring them to send another notification with more information about the defect and the consequent hazards. Later in 1986, AECL filed a revised corrective action plan.
Another overdose occurred in January 1987, and was attributed to a different software fault. In February, the FDA and CRPB both ordered that all Therac-25 units be shut down, pending effective and permanent modifications. AECL spent six months developing a new corrective action plan, which included a major overhaul of the software, the addition of mechanical safety interlocks, and other safety-related modifications.
The Therac-25 was controlled by a DEC PDP-11 computer, which was the most popular minicomputer ever produced. Around 600,000 were produced between 1970 and 1990 and used for a variety of purposes, including embedded systems, education, and general data processing. It was a 16-bit computer and was far less powerful than a Raspberry Pi. The Therac-25 computer was programmed in assembly language by one programmer and the source code was not documented. Documentation for the hardware components was written in French. After the faults were discovered, a commission concluded that the primary problems with the Therac-25 were attributable to poor software design practices, and not due to any one of several specific coding errors. This is probably the best known case where a poor overall software design, and insufficient testing, led to loss of life.
The worst problems in the design and engineering of the machine were:
• The code was not subjected to independent review.
• The software design was not considered during the assessment of how the machine could fail or malfunction.
• The operator could ignore malfunctions and cause the machine to proceed with treatment.
• The hardware and software were designed separately and not tested as a complete system until the unit was assembled at the hospitals where it was to be used.
• The design of the earlier Therac-6 and Therac-20 machines included hardware interlocks which would ensure that the X-ray mode could not be activated unless the tungsten radiation shield was in place. The hardware interlock was replaced with a software interlock in the Therac-25.
• Errors were displayed as numeric codes, and there was no indication of the severity of the error condition.
The operator interface consisted of a keyboard and text-mode monitor, which was common in the early 1980s. The interface had a data entry area in the middle of the screen and a command line at the bottom. The operator was required to enter parameters in the data entry area, then move the cursor to the command line to initiate treatment. When the operator moved the cursor to the command line, internal variables were updated and a flag variable was set to indicate that data entry was complete. That flag was cleared when a command was entered on the command line. If the operator moved the cursor back to the data entry area without entering a command, then the flag was not cleared, and any subsequent changes to the data entry area did not affect the internal variables.
A global variable was used to indicate that the magnets were currently being adjusted. This variable was modified by two functions, and did not always contain the correct value. Adjusting the magnets required about eight seconds, and the flag was correct for only a small period at the beginning of this time period.
Due to the errors in the design and implementation of the software, the following sequence of events could result in the machine causing injury to, or even the death of, the patient:
1. The operator mistakenly specified high-power mode during data entry.
2. The operator moved the cursor to the command line area.
3. The operator noticed the mistake, and moved the cursor back to the data entry area without entering a command.
4. The operator corrected the mistake and moved the cursor back to the command line.
5. The operator entered the command line area, left it, made the correction, and returned within the eight-second window required for adjusting the magnets.
If the above sequence occurred, then the operator screen could indicate that the machine was in low power mode, although it was actually set in high-power mode. During a final check before initiating the beam, the software would find that the magnets were set for high-power mode but the operator setting was for low power mode. It displayed a numeric error code and prevented the machine from starting. The operator could clear the error code by resetting the computer (which only required one key to be pressed on the keyboard). This caused the tungsten shield to withdraw but left the machine in X-ray mode. When the operator entered the command to start the beam, the machine could be in high-power mode without having the tungsten shield in place. X-rays were applied to the unprotected patient.
It took some time for this critical flaw to appear. The failure only occurred when the operator initially made a one-keystroke mistake in entering the prescription data, moved to the command area, and then corrected the mistake within eight seconds. Initially, operators were slow to enter data, and spent a lot of time making sure that the prescription was correct before initiating treatment. As they became more familiar with the machine, they were able to enter data and correct mistakes more quickly. Eventually, operators became familiar enough with the machine that they could enter data, make a correction, and return to the command area within the critical eight-second window. Also, the operators became familiar with the machine reporting numeric error codes without any indication of the severity of the code. The operators were given a table of codes and their meanings. The code reported was “no dose” and indicated “treatment pause.” There is no reason why the operator should consider that to be a serious problem; they had become accustomed to frequent malfunctions that did not have any consequences to the patient.
Although the code was written in assembly language, that fact was not cited as an important factor. The fundamental problems were poor software design and overconfidence. The reuse of code in an application for which it was not initially designed also may have contributed to the system flaws. A proper design using established software design principles, including structured programming and abstract data types, would almost certainly have avoided these fatalities.
The abstract data type is a structured programming concept which contributes to software reliability, eases maintenance, and allows for major revisions to be performed in a safe way. Many high-level languages enforce, or at least facilitate, the use of ADTs. Assembly language does not. However, the ethical assembly language programmer will make the extra effort to write code that conforms to the standards of structured programming and use abstract data types to help ensure safety, reliability, and maintainability.
ADTs also facilitate the implementation of software modules in more than one language. The interface specifies the components of the ADT, but not the implementation. The implementation can be in any language. As long as assembly programmers and compiler authors generate code that conforms to a well-known standard, their code can be linked with code written in other languages.
Poor coding practices and poor design can lead to dire consequences, including loss of life. It is the responsibility of the programmer, regardless of the language used, to make ethical decisions in the design and implementation of software. Above all, the programmer must be aware of the possible consequences of the decisions they make.
6.1 What are the advantages of designing software using abstract data types?
6.2 Why is the internal structure of the Pixel data type hidden from client code in Listing 6.2?
6.3 High-level languages provide mechanisms for information hiding, but assembly does not. Why should the assembly programmer not simply bypass all information hiding and access the internal data structures of any ADT directly?
6.4 The assembly code in wl_print_numerical() accesses the internal structure of the wordlistnode data type. Why is it allowed to do so? Should it be allowed to do so?
6.5 Given the following definitions for a stack ADT:


Write the InitStack() function in ARM assembly language.
6.6 Referring to the previous question, write the Push() function in ARM assembly language.
6.7 Referring to the previous two questions, write the Pop() function in ARM assembly language.
6.8 Referring to the previous three questions, write the Top() function in ARM assembly language.
6.9 Referring to the previous three questions, write the PrintStack() function in ARM assembly language.
6.10 Re-implement all of the previous stack functions using a linked list rather than a static array.
6.11 The “Software Engineering Code of Ethics and Professional Practice” states that a responsible software engineer should “Approve software only if they have a well-founded belief that it is safe, meets specifications, passes appropriate tests…” (sub-principle 1.03) and “Ensure adequate testing, debugging, and review of software…on which they work.” (sub-principle 3.10). Unfortunately, defects did make their way into the system.
The software engineering code of ethics also states that a responsible software engineer should “Treat all forms of software maintenance with the same professionalism as new development.”
(a) Explain how the Software Engineering Code of Ethics and Professional Practice was violated by the Therac-25 developers.
(b) How should the engineers and managers at AECL have responded when problems were reported?
(c) What other ethical and non-ethical considerations may have contributed to the deaths and injuries?
Performance Mathematics
This chapter introduces the concept of high performance mathematics. The chapter starts by explaining basic math in bases other than 10. It explains subtraction using complement mathematics. Next it gives efficient algorithms for performing signed and unsigned multiplication in binary. It explains how multiplication by a constant can often be converted into a much more efficient sequence of shift and add or subtract operations, and gives a method for multiplying two arbitrarily large numbers. Next, an efficient algorithm is given for binary division, followed by a technique for converting division by a constant into multiplication by a related constant. The next section introduces an ADT, written in C, which can be used to perform basic mathematical operations on integers of any size. The chapter concludes by showing that the ADT can be made much more efficient by replacing some of the functions with assembly language implementations.
Addition; Subtraction; Complement; Multiplication; Division; Big integer; High performance; Abstract data type
There are some differences between the way calculations are performed in a computer versus the way most of us were taught as children. The first difference is that calculations are performed in binary instead of base ten. Another difference is that the computer is limited to a fixed number of binary digits, which raises the possibility of having a result that is too large to fit in the number of bits available. This occurrence is referred to as overflow. The third difference is that subtraction is performed using complement addition.
Addition in base b is very similar to base ten addition, except that the result of each column is limited to b − 1. For example, binary addition works exactly the same as decimal addition, except that the result of each column is limited to 0 or 1. The following figure shows an addition in base ten and the equivalent addition in base two.

The carry from one column to the next is shown as a small number above the column that it is being carried into. Note that carries from one column to the next are done the same way in both bases. The only difference is that there are more columns in the base two addition because it takes more digits to represent a number in binary than it does in decimal.
Finding the complement was explained in Section 1.3.3. Subtraction can be computed by adding the radix complement of the subtrahend to the minuend. Example 7.1 shows a complement subtraction with a positive result. When x < y, the result will be negative. In the complement method, this means that there will be a ‘1’ in the most significant bit, and in order to convert the result to base ten, we must take the radix complement. Example 7.2 shows complement subtraction with a negative result. Example 7.3 shows several more signed addition and subtraction operations in base ten and binary.
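The complement method can be demonstrated in C for 8-bit values. This is a sketch for illustration only; the function name is ours. The two's complement of y is formed as ~y + 1, and the carry out of the top bit is simply discarded, which is exactly what happens in fixed-width hardware.

```c
#include <stdint.h>

/* Subtract y from x by radix-complement addition on 8-bit values:
   x - y == x + (two's complement of y), discarding the final carry. */
uint8_t complement_sub(uint8_t x, uint8_t y) {
    uint8_t comp = (uint8_t)(~y + 1);  /* two's complement of y */
    return (uint8_t)(x + comp);        /* carry out of bit 7 is discarded */
}
```

When x < y the result has a 1 in the most significant bit; reinterpreting those bits as a signed value (or taking the complement again) recovers the negative answer.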
Many processors have hardware multiply instructions. However hardware multipliers require a large number of transistors, and consume significant power. Processors designed for extremely low power consumption or very small size usually do not implement a multiply instruction, or only provide multiply instructions that are limited to a small number of bits. On these systems, the programmer must implement multiplication using basic data processing instructions.
If the multiplier is a power of two, then multiplication can be accomplished with a shift to the left. Consider the 4-bit binary number x = x₃ × 2³ + x₂ × 2² + x₁ × 2¹ + x₀ × 2⁰, where xₙ denotes bit n of x. If x is shifted left by one bit, introducing a zero into the least significant bit, then it becomes x₃ × 2⁴ + x₂ × 2³ + x₁ × 2² + x₀ × 2¹ = 2 × x.
Therefore, a shift of one bit to the left is equivalent to multiplication by two. This argument can be extended to prove that a shift left by n bits is equivalent to multiplication by 2n.
Most techniques for binary multiplication involve computing a set of partial products and then summing them together. This is the same process taught to primary schoolchildren for long multiplication of base ten integers: the partial products are calculated, shifted to the left, and then added together. The most difficult part is obtaining the partial products, since that involves multiplying a long number by one base ten digit. The following example shows how the partial products are formed when multiplying 123 by 456.

The first partial product can be written as 123 × 6 × 10⁰ = 738. The second is 123 × 5 × 10¹ = 6150, and the third is 123 × 4 × 10² = 49200. In practice, we usually leave out the trailing zeros. The procedure is the same in binary, but is simpler because each partial product involves multiplying a long number by a single base two digit. Since the multiplier digit is always either zero or one, the partial product is very easy to compute. The product of multiplying any binary number x by a single binary digit is always either 0 or x. Therefore, the multiplication of two binary numbers comes down to shifting the multiplicand left appropriately for each non-zero bit in the multiplier, and then adding the shifted numbers together.
Suppose we wish to multiply two four-bit numbers, 1011 and 1010:

Notice in the previous example that each partial sum is either zero or x shifted by some amount. A slightly quicker way to perform the multiplication is to leave out any partial sum which is zero. Example 7.4 shows the results of multiplying 101₁₀ by 89₁₀ in decimal and binary using this shorter method. For implementation in hardware and software, it is easier to accumulate the partial products by adding each to a running sum, rather than building a circuit to add multiple binary numbers at once.
Binary multiplication can be implemented as a sequence of shift and add instructions. Given two registers, x and y, and an accumulator register a, the product of x and y can be computed using Algorithm 1. When applying the algorithm, it is important to remember that, in the general case, the result of multiplying an n-bit number by an m-bit number is (at most) an (n + m)-bit number. For instance, 11₂ × 11₂ = 1001₂. Therefore, when applying Algorithm 1, it is necessary to know the number of bits in x and y. Since x is shifted left on each iteration of the loop, the registers used to store x and a must both be at least as large as the number of bits in x plus the number of bits in y.

Assume we wish to multiply two numbers, x = 01101001 and y = 01011010. Applying Algorithm 1 results in the following sequence:
| a | x | y | Next operation |
| 0000000000000000 | 0000000001101001 | 01011010 | shift only |
| 0000000000000000 | 0000000011010010 | 00101101 | add, then shift |
| 0000000011010010 | 0000000110100100 | 00010110 | shift only |
| 0000000011010010 | 0000001101001000 | 00001011 | add, then shift |
| 0000010000011010 | 0000011010010000 | 00000101 | add, then shift |
| 0000101010101010 | 0000110100100000 | 00000010 | shift only |
| 0000101010101010 | 0001101001000000 | 00000001 | add, then shift |
| 0010010011101010 | 0011010010000000 | 00000000 | shift only |
Final result: 105 × 90 = 9450.
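The trace above can be reproduced with a short C sketch of the shift-and-add procedure (an illustration of Algorithm 1, not the book's listing; the function name is ours). On each iteration the low bit of y decides whether the current shifted copy of x is added to the accumulator; x is then shifted left and y shifted right.

```c
#include <stdint.h>

/* Shift-and-add multiplication of two 16-bit numbers, producing a
   32-bit result: the accumulator must hold n + m bits. */
uint32_t shift_add_mul(uint16_t x, uint16_t y) {
    uint32_t a = 0;    /* accumulator, wide enough for the full product */
    uint32_t xs = x;   /* copy of x, shifted left on each iteration */
    while (y != 0) {
        if (y & 1)     /* low bit of y selects this partial product */
            a += xs;
        xs <<= 1;      /* next partial product is twice as large */
        y >>= 1;       /* consume one multiplier bit */
    }
    return a;
}
```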

To multiply two n-bit numbers, you must be able to add two 2n-bit numbers. On the ARM processor, n is usually assumed to be 32, because that is the natural word size for the ARM processor. Adding 64-bit numbers requires two add instructions, and the carry from the least-significant 32 bits must be added to the sum of the most-significant 32 bits. The ARM processor provides a convenient way to perform the add with carry. Assume we have two 64-bit numbers, x and y. We have x in r0, r1 and y in r2, r3, where the high-order words of each number are in the higher-numbered registers, and we want to calculate x = x + y. Listing 7.1 shows a two-instruction sequence for the ARM processor. The first instruction adds the two least-significant words together and sets (or clears) the carry bit and other flags in the CPSR. The second instruction adds the two most-significant words along with the carry bit.
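The same add-with-carry idea can be sketched in C using only 32-bit word operations (a sketch mirroring the ADDS/ADC pair; the function name is ours). Since C exposes no carry flag, the carry is detected by checking for unsigned wraparound in the low-word sum.

```c
#include <stdint.h>

/* 64-bit addition from 32-bit words, mirroring the ARM ADDS/ADC pair:
   add the low words, capture the carry, then fold the carry into the
   sum of the high words. */
void add64(uint32_t xlo, uint32_t xhi, uint32_t ylo, uint32_t yhi,
           uint32_t *slo, uint32_t *shi) {
    uint32_t lo = xlo + ylo;
    uint32_t carry = (lo < xlo) ? 1 : 0;  /* unsigned wraparound => carry */
    *slo = lo;
    *shi = xhi + yhi + carry;
}
```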

On the ARM processor, the algorithm to multiply two 32-bit unsigned integers is very efficient. Listing 7.2 shows one possible algorithm for multiplying two 32-bit numbers to obtain a 64-bit result. The code is a straightforward implementation of the algorithm, and some modifications can be made to improve efficiency. For example, if we only want a 32-bit result, we do not need to perform 64-bit addition. This significantly simplifies the code, as shown in Listing 7.3.


If x or y is a constant, then a loop is not necessary. The multiplication can be directly translated into a sequence of shift and add operations. This will result in much more efficient code than the general algorithm. If we inspect the constant multiplier, we can usually find a pattern to exploit that will save a few instructions. For example, suppose we want to multiply a variable x by 10₁₀. The multiplier 10₁₀ = 1010₂, so we only need to add x shifted left 1 bit to x shifted left 3 bits as shown below:

Now suppose we want to multiply a number x by 11₁₀. The multiplier 11₁₀ = 1011₂, so we will add x to x shifted left one bit plus x shifted left 3 bits as in the following:

If we wish to multiply a number x by 1000₁₀, we note that 1000₁₀ = 1111101000₂. It looks like we need one shift plus five add/shift operations, or six operations in total. With a little thought, the number of operations can be reduced from six to five as shown below:

Applying the basic multiplication algorithm to multiply a number x by 255₁₀ would result in seven add/shift operations, but we can do it with only three operations, using only one register, as shown below:

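The constant-multiplier patterns above can be written directly in C, where the compiler typically emits the corresponding shift/add/subtract instructions (a sketch; the function names are ours):

```c
#include <stdint.h>

/* x * 10: 10 = 1010 in binary, so add the bit-3 and bit-1 shifts. */
uint32_t mul10(uint32_t x)  { return (x << 3) + (x << 1); }

/* x * 11: 11 = 1011 in binary, so add x itself as well. */
uint32_t mul11(uint32_t x)  { return x + (x << 1) + (x << 3); }

/* x * 255: instead of summing seven shifted partial products,
   use 255 = 256 - 1, i.e. one shift and one subtract. */
uint32_t mul255(uint32_t x) { return (x << 8) - x; }
```

The mul255 pattern generalizes to any multiplier of the form 2ⁿ − 1.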
Most modern systems have assembly language instructions for multiplication, but hardware multiply units require a relatively large number of transistors. For that reason, processors intended for small embedded applications often do not have a multiply instruction. Even when a hardware multiplier is available, on some processors it can be more efficient to use shift, add, and subtract operations when multiplying by a constant. The hardware multiplier units that are available on most ARM processors are very powerful. They can typically perform multiplication with a 32-bit result in as little as one clock cycle. The long multiply instructions take between three and five clock cycles, depending on the size of the operands. Using the multiply instruction on an ARM processor to multiply by a constant usually requires loading the constant into a register before performing the multiply. Therefore, if the multiplication can be performed using three or fewer shift, add, and subtract instructions, then it will be equal to or better than using the multiply instruction.
Consider the two multiplication problems shown in Figs. 7.1 and 7.2. Note that the result of a multiply depends on whether the numbers are interpreted as unsigned numbers or signed numbers. For this reason, most computer CPUs have two different multiply operations for signed and unsigned numbers.


If the CPU provides only an unsigned multiply, then a signed multiply can be accomplished by using the unsigned multiply operation along with a conditional complement. The following procedure can be used to implement signed multiplication.
1. if the multiplier is negative, take the two’s complement,
2. if the multiplicand is negative, take the two’s complement,
3. perform unsigned multiply, and
4. if the multiplier or multiplicand was negative (but not both), then take two’s complement of result.
Example 7.5 demonstrates this method using one negative number.
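The four-step procedure can be sketched in C (an illustration only; the function name is ours). A flag tracks whether exactly one operand was negative, and the final complement is applied only in that case.

```c
#include <stdint.h>

/* Signed multiply built from an unsigned multiply plus conditional
   two's complements, following the four-step procedure above. */
int32_t signed_mul(int16_t a, int16_t b) {
    int neg = 0;                     /* toggled once per negative operand */
    uint32_t ua, ub;
    if (a < 0) { ua = (uint32_t)(-(int32_t)a); neg ^= 1; }
    else       { ua = (uint32_t)a; }
    if (b < 0) { ub = (uint32_t)(-(int32_t)b); neg ^= 1; }
    else       { ub = (uint32_t)b; }
    uint32_t p = ua * ub;            /* step 3: unsigned multiply */
    /* step 4: complement the result if exactly one input was negative */
    return neg ? -(int32_t)p : (int32_t)p;
}
```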
Consider the method used for multiplying two-digit numbers in base ten, using only the one-digit multiplication tables. Fig. 7.3 shows how a two-digit number a = a₁ × 10¹ + a₀ × 10⁰ is multiplied by another two-digit number b = b₁ × 10¹ + b₀ × 10⁰ to produce a four-digit result using basic multiplication operations which only take one digit from a and one digit from b at each step.

This technique can be used for numbers in any base and for any number of digits. Recall that one hexadecimal digit is equivalent to exactly four binary digits. If a and b are both 8-bit numbers, then they are also 2-digit hexadecimal numbers. In other words 8-bit numbers can be divided into groups of four bits, each representing one digit in base sixteen. Given a multiply operation that is capable of producing an 8-bit result from two 4-bit inputs, the technique shown above can then be used to multiply two 8-bit numbers using only 4-bit multiplication operations.
Carrying this one step further, suppose we are given two 16-bit numbers, but our computer only supports multiplying eight bits at a time and producing a 16-bit result. We can consider each 16-bit number to be a two-digit number in base 256, and use the above technique to perform four 8-bit multiplies with 16-bit results, then shift and add the 16-bit results to obtain the final 32-bit result. This approach can be extended to implement efficient multiplication of arbitrarily large numbers, using a fixed-size multiplication operation.
Binary division can be implemented as a sequence of shift and subtract operations. When performing binary division by hand, it is convenient to perform the operation in a manner very similar to the way that decimal division is performed. As shown in Fig. 7.4, the operation is identical, but takes more steps in binary.

If the divisor is a power of two, then division can be accomplished with a shift to the right. Using the same approach as was used in Section 7.2.1, it can be shown that a shift right by n bits is equivalent to division by 2ⁿ. However, care must be taken to ensure that an arithmetic shift is used if the numerator is a signed two’s complement number, and a logical shift is used if the numerator is unsigned.
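The two cases can be sketched in C. Note that C leaves right-shifting a signed value implementation-defined; the sketch assumes a compiler, such as GCC, that implements it as an arithmetic shift. An arithmetic shift rounds toward negative infinity, whereas C's `/` operator rounds toward zero:

```c
#include <stdint.h>

/* Division by 2^n with a logical shift right (unsigned numerator). */
uint32_t udiv_pow2(uint32_t x, unsigned n) {
    return x >> n;
}

/* Division by 2^n with an arithmetic shift right (signed numerator).
   Assumes the compiler shifts signed values arithmetically, so the
   sign bit is replicated; the result is the floor of x / 2^n. */
int32_t sdiv_pow2_floor(int32_t x, unsigned n) {
    return x >> n;
}
```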
The algorithm for dividing binary numbers is somewhat more complicated than the algorithm for multiplication. The algorithm consists of two main phases:
1. shift the divisor left until it is greater than the dividend and count the number of shifts, then
2. repeatedly shift the divisor back to the right and subtract whenever possible.
Fig. 7.5 shows the algorithm in more detail. Because of the complexity of the algorithm, division in hardware requires a significant number of transistors. The ARM architecture did not introduce a divide instruction until ARMv7, and even then it was not implemented on all processors. Many ARM systems (including the Raspberry Pi) do not have hardware division. However, the ARM processor instruction set makes it possible to write very efficient code for division.

Before we introduce the ARM code, we will take some time to step through the algorithm using an example. Let us begin by dividing 94 by 7. The result is shown below:

To implement the algorithm, we need three registers, one for the dividend, one for the divisor, and one for a counter. The dividend and divisor are loaded into their registers and the counter is initialized to zero as shown below:
Next, the divisor is shifted left and the counter incremented repeatedly until the divisor is greater than the dividend. This is shown in the following sequence:
Next, we allocate a register for the quotient and initialize it to zero. Then, according to the algorithm, we repeatedly subtract if possible, shift to the right, and decrement the counter. This sequence continues until the counter becomes negative. For our example this results in the following sequence:






When the algorithm terminates, the quotient register contains the result of the division, and the modulus (remainder) is in the dividend register. Thus, one algorithm is used to compute both the quotient and the modulus at the same time. There are variations on this algorithm. For example, one variation is to shift a single bit left in a register, rather than incrementing a count. This variation has the same two phases as the previous algorithm, but counts in powers of two rather than by ones. The following sequence shows what occurs after each iteration of the first loop in the algorithm.
The divisor is greater than the dividend, so the algorithm proceeds to the second phase. In this phase, if the divisor is less than or equal to the dividend, then the power register is added to the quotient and the divisor is subtracted from the dividend. Then, the power and divisor registers are shifted to the right. The process is repeated until the power register is zero. The following sequence shows what the registers will contain at the end of each iteration of the second loop.






As with the previous version, when the algorithm terminates, the quotient register contains the result of the division, and the modulus (remainder) is in the dividend register. Listing 7.4 shows the ARM assembly code to implement this version of the division algorithm for 32-bit numbers, and the counting method for 64-bit numbers.
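The power-register variation of the algorithm can be sketched in C. The name `divmod32` is made up for this illustration, and the divisor must be non-zero:

```c
#include <stdint.h>

/* Shift-and-subtract division: a "power" register holds a single bit
   that is shifted left along with the divisor, then both are shifted
   back right while the quotient is built.  The remainder is left in
   the dividend register.  The divisor must be non-zero. */
void divmod32(uint32_t dividend, uint32_t divisor,
              uint32_t *quotient, uint32_t *modulus) {
    uint32_t power = 1;
    /* Phase 1: shift the divisor left until it exceeds the dividend
       (stopping early if its top bit is about to be shifted out). */
    while (divisor <= dividend && !(divisor & 0x80000000u)) {
        divisor <<= 1;
        power <<= 1;
    }
    /* Phase 2: shift back right, subtracting whenever possible. */
    *quotient = 0;
    while (power != 0) {
        if (divisor <= dividend) {
            dividend -= divisor;
            *quotient += power;
        }
        divisor >>= 1;
        power >>= 1;
    }
    *modulus = dividend;
}
```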




In general, division is slow. Newer ARM processors provide a hardware divide instruction which requires between two and twelve clock cycles to produce a result, depending on the size of the operands. Older processors must perform division using software, as previously described. In either case, division is by far the slowest of the basic mathematical operations. However, division by a constant c can be converted to a multiply by the reciprocal of c. It is obviously much more efficient to use a multiply instead of a divide wherever possible. Efficient division of a variable by a constant is achieved by applying the following equality:
The only difficulty is that we have to do it in binary, using only integers. If we modify the right-hand side by multiplying and dividing by some power of two (2ⁿ), we can rewrite Eq. (7.1) as follows:
Recall that, in binary, multiplying by 2ⁿ is the same as shifting left by n bits, while multiplying by 2⁻ⁿ is done by shifting right by n bits. Therefore, Eq. (7.2) is just Eq. (7.1) with two shift operations added. The two shift operations cancel each other out. Now, let
We can rewrite Eq. (7.2) as:
We now have a method for dividing by a constant c which involves multiplying by a different constant, m, and shifting the result. In order to achieve the best precision, we want to choose n such that m is as large as possible with the number of bits we have available.
Suppose we want efficient code to calculate x ÷ 23 using 8-bit signed integer multiplication. Our first task is to find
such that 01111111₂ ≥ m ≥ 01000000₂. In other words, we want to find the value of n where the most significant bit of m is zero, and the next most significant bit of m is one. If we choose n = 11, then
Rounding to the nearest integer gives m = 89. In 8 bits, m is 01011001₂ or 59₁₆. We now have values for m and n, and therefore we can apply Eq. (7.4) to divide any number x by 23. The procedure is simple: calculate y = x × m, then shift y right by 11 bits.
However, there are two more considerations. First, when the divisor is positive, the result for some values of x may be incorrect due to rounding error. It is usually sufficient to increment the reciprocal value by one in order to avoid these errors. In the previous example, the number would be changed from 59₁₆ to 5A₁₆. When implementing this technique for finding the reciprocal, the programmer should always verify that the results are correct for all input values. The second consideration is when the dividend is negative. In that case it is necessary to subtract one from the final result.
For example, to calculate 101₁₀ ÷ 23₁₀ in binary, with eight bits of precision, we first perform the multiplication as follows:

Then shift the result right by 11 bits. 10001100011101₂ shifted right 11₁₀ bits is 100₂ = 4₁₀. If the modulus is required, it can be calculated as 101 mod 23 = 101 − (4 × 23) = 9, which once again requires multiplication by a constant.
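The multiply-and-shift method can be sketched in C, using the incremented reciprocal m = 5A₁₆ (90) and n = 11 from the text. As the text warns, an 8-bit reciprocal is only exact over a limited range of x, so the results must be verified for the inputs the program will actually see. The function names are made up for this illustration:

```c
#include <stdint.h>

/* Division by the constant 23 as a multiply and a shift, using the
   incremented 8-bit reciprocal m = 0x5A (90) and n = 11.  Only exact
   over a limited range of x; verify before use. */
uint32_t div23_approx(uint32_t x) {
    return (x * 90u) >> 11;
}

/* The modulus costs one more multiplication by the constant 23. */
uint32_t mod23_approx(uint32_t x) {
    return x - div23_approx(x) * 23u;
}
```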
In the previous example the shift amount of 11 bits provided the best precision possible. But how was that number chosen? The shift amount, n, can be directly computed as
where p is the desired number of bits of precision. The value of m can then be computed as
For example, to divide by the constant 33, with 16 bits of precision, we compute n as
and then we compute m as
Therefore, multiplying a 16-bit number by 7C20₁₆ and then shifting right 20 bits is equivalent to dividing by 33.
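One way to find m and n for an arbitrary constant divisor is a simple search: take the largest n for which m = ⌈2ⁿ/c⌉ still fits in the required pattern (leading bit zero, next bit one, i.e. 2^(p−2) ≤ m < 2^(p−1)). This sketch is not the book's closed-form equation, but it reproduces the values used in the examples; `reciprocal` is a hypothetical name and the search is assumed to succeed for c ≥ 2:

```c
#include <stdint.h>

/* Search for the reciprocal m and shift amount n for dividing by the
   constant c with p bits of precision.  Uses m = ceil(2^n / c), which
   already includes the increment the text recommends. */
void reciprocal(uint32_t c, unsigned p, uint32_t *m, unsigned *n) {
    uint64_t limit = 1ull << (p - 1);   /* m must stay below 2^(p-1) */
    for (unsigned shift = 0; shift < 63; shift++) {
        uint64_t candidate = ((1ull << shift) + c - 1) / c; /* ceil(2^shift/c) */
        if (candidate >= limit) {
            *n = shift - 1;             /* largest shift whose m still fits */
            *m = (uint32_t)(((1ull << *n) + c - 1) / c);
            return;
        }
    }
}
```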
Example 7.6 shows how to calculate m and n for division by 193. On the ARM processor, division by a constant can be performed very efficiently. Listing 7.5 shows how division by 193 can be implemented using only a few lines of code. In the listing, the numbers are 32 bits in length, so the constant m is much larger than in the example that was multiplied by hand, but otherwise the method is the same.

On processors without the multiply instruction, we can use the technique of shifting and adding shown previously. If we wish to divide by 23 using 32 bits of precision, we compute the multiplier as
That is 01011001000010110010000101100101₂. Note that there are only 13 non-zero bits, and the pattern 1011001 appears three times in the 32-bit multiplier. The multiply can be implemented as 2²⁴(2⁶x + 2⁴x + 2³x + 2⁰x) + 2¹³(2⁶x + 2⁴x + 2³x + 2⁰x) + 2²(2⁶x + 2⁴x + 2³x + 2⁰x) + 2⁰x. So the following code sequence can be used on processors that do not have the multiply instruction:
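The same shift-and-add sequence can be expressed in C: the seven-bit pattern 1011001 (that is, x·2⁶ + x·2⁴ + x·2³ + x) is formed once, then reused at bit positions 24, 13, and 2, with one final add of x. The name `mul_reciprocal23` is made up for this illustration:

```c
#include <stdint.h>

/* Multiply by the 32-bit reciprocal 0x590B2165 without a multiply
   instruction, by reusing the repeated bit pattern 1011001. */
uint64_t mul_reciprocal23(uint64_t x) {
    uint64_t t = (x << 6) + (x << 4) + (x << 3) + x;  /* x * 1011001b */
    return (t << 24) + (t << 13) + (t << 2) + x;
}
```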


Section 7.2.5 showed how large numbers can be multiplied by breaking them into smaller numbers and using a series of multiplication operations. There is no similar method for synthesizing a large division operation with an arbitrary number of digits in the dividend and divisor. However, there is a method for dividing a large dividend by a divisor given that the division operation can operate on numbers with at least the same number of digits as in the divisor.
Suppose we wish to perform division of an arbitrarily large dividend by a one digit divisor using a basic division operation that can divide a two digit dividend by a one digit divisor. The operation can be performed in multiple steps as follows:
1. Divide the most significant digit of the dividend by the divisor. The result is the most significant digit of the quotient.
2. Prepend the remainder from the previous division step to the next digit of the dividend, forming a two-digit number, and divide that by the divisor. This produces the next digit of the result.
3. Repeat from step 2 until all digits of the dividend have been processed.
4. Take the final remainder as the modulus.
The following example shows how to divide 6189 by 7 using only two digits at a time:
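The steps above can be sketched in C, with the dividend held as a string of decimal digits. The name `divide_digits` is made up for this illustration; note that the quotient keeps any leading zero produced by the first step:

```c
#include <string.h>

/* Divide an arbitrarily long decimal number by a one-digit divisor,
   two digits at a time: prepend the running remainder to each digit
   of the dividend, divide, and keep the final remainder as the
   modulus.  Returns the remainder; writes the quotient digits. */
int divide_digits(const char *dividend, int divisor, char *quotient) {
    int remainder = 0;
    size_t len = strlen(dividend);
    for (size_t i = 0; i < len; i++) {
        /* prepend the previous remainder to the next digit */
        int value = remainder * 10 + (dividend[i] - '0');
        quotient[i] = (char)('0' + value / divisor);
        remainder = value % divisor;
    }
    quotient[len] = '\0';
    return remainder;
}
```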
This method can be applied in any base and with any number of digits. The only restriction is that the basic division operation must be capable of dividing a 2n digit number by an n digit number and producing a 2n digit quotient and an n digit remainder. For example, the udiv instruction available on the Cortex-M3 and newer processors is capable of dividing a 32-bit dividend by a 32-bit divisor, producing a 32-bit quotient. The remainder can be calculated by multiplying the quotient by the divisor and subtracting the product from the dividend. Using this division operation, it is possible to divide an arbitrarily large number by a 16-bit divisor.
We have seen that, given a divide operation capable of dividing an n digit number by an n digit number, it is possible to divide a dividend with any number of digits by a divisor with
digits. Unfortunately, there is no similar method to deal with an arbitrarily large divisor, or to divide an arbitrarily large dividend by a divisor with more than
digits. In those cases the division must be performed using a general division algorithm as shown previously.
For some programming tasks, it may be helpful to deal with arbitrarily large integers. For example, the factorial function and Ackermann’s function grow very quickly and will overflow a 32-bit integer for small input values. In this section, we will outline an abstract data type which provides basic operations for arbitrarily large integer values. Listing 7.7 shows the C header for this ADT, and Listing 7.8 shows the C implementation. Listing 7.9 shows a small program that uses the bigint ADT to create a table of x! for all x between 0 and 100.




















The implementation could be made more efficient by writing some of the functions in assembly language. One opportunity for improvement is in the add function, which must calculate the carry from one chunk of bits to the next. In assembly, the programmer has direct access to the carry bit, so carry propagation should be much faster.
When attempting to speed up a C program by converting selected parts of it to assembly language, it is important to first determine where the most significant gains can be made. A profiler, such as gprof, can be used to help identify the sections of code that will matter most. It is also important to make sure that the result is not just highly optimized C code. If the code cannot benefit from some features offered by assembly, then it may not be worth the effort of re-writing in assembly. The code should be re-written from a pure assembly language viewpoint.
It is also important to avoid premature assembly programming. Make sure that the C algorithms and data structures are efficient before moving to assembly. If a better algorithm can give better performance, then assembly may not be required at all. Once the assembly is written, it is more difficult to make major changes to the data structures and algorithms. Assembly language optimization is the final step in optimization, not the first one.
Well-written C code is modularized, with many small functions. This helps readability, promotes code reuse, and may allow the compiler to achieve better optimization. However, each function call has some associated overhead. If optimal performance is the goal, then calling many small functions should be avoided. For instance, if the piece of code to be optimized is in a loop body, then it may be best to write the entire loop in assembly, rather than writing a function and calling it each time through the loop. Writing in assembly is not a guarantee of performance. Spaghetti code is slow. Load/store instructions are slow. Multiplication and division are slow. The secret to good performance is avoiding things that are slow. Good optimization requires rethinking the code to take advantage of assembly language.
The bigint_adc function was re-written in assembly, as shown in Listing 7.10. This function is used internally by several other functions in the bigint ADT to perform addition and subtraction. The profiler indicated that it is used more than any other function. If assembly language can make this function run faster, then it should have a profound effect on the program.




The bigfact main function was executed 50 times on a Raspberry Pi, using the C version of bigint_adc and then with the assembly version. The total time required using the C version was 27.65 seconds, and the program spent 54.0% of its time (14.931 seconds) in the bigint_adc function. The assembly version ran in 15.07 seconds, and the program spent 15.3% of its time (2.306 seconds) in the bigint_adc function. Therefore the assembly version of the function achieved a speedup of 6.47 over the C implementation. Overall, the program achieved a speedup of 1.83 by writing one function in assembly.
Running gprof on the improved program reveals that most of the time is now spent in the bigint_mul function (63.2%) and two functions that it calls: bigint_mul_uint (39.1%) and bigint_shift_left_chunk (21.6%). It seems clear that optimizing those two functions would further improve performance.
Complement mathematics provides a method for performing all basic operations using only the complement, add, and shift operations. Addition and subtraction are fast, but multiplication and division are relatively slow. In particular, division should be avoided whenever possible. The exception to this rule is division by a power of the radix, which can be implemented as a shift. Good assembly programmers replace division by a constant c with multiplication by the reciprocal of c. They also replace the multiply instruction with a series of shifts and add or subtract operations when it makes sense to do so. These optimizations can make a big difference in performance.
Writing sections of a program in assembly can result in better performance, but it is not guaranteed. The chance of achieving significant performance improvement is increased if the following rules are used:
1. Only optimize the parts that really matter.
2. Design data structures with assembly in mind.
3. Use efficient algorithms and data structures.
4. Write the assembly code last.
5. Ignore the C version and write good, clean, assembly.
6. Reduce function calls wherever it makes sense.
7. Avoid unnecessary memory accesses.
8. Write good code. The compiler will beat poor assembly every time, but good assembly will beat the compiler every time.
Understanding the basic mathematical operations can enable the assembly programmer to work with integers of any arbitrary size with efficiency that cannot be matched by a C compiler. However, it is best to focus the assembly programming on areas where the greatest gains can be made.
7.1 Multiply −90 by 105 using signed 8-bit binary multiplication to form a signed 16-bit result. Show all of your work.
7.2 Multiply 166 by 105 using unsigned 8-bit binary multiplication to form an unsigned 16-bit result. Show all of your work.
7.3 Write a section of ARM assembly code to multiply the value in r1 by 13₁₀ using only shift and add operations.
7.4 The following code will multiply the value in r0 by a constant C. What is C?

7.5 Show the optimally efficient instruction(s) necessary to multiply a number in register r0 by the constant 67₁₀.
7.6 Show how to divide 78₁₀ by 6₁₀ using binary long division.
7.7 Demonstrate the division algorithm using a sequence of tables as shown in Section 7.3.2 to divide 155₁₀ by 11₁₀.
7.8 When dividing by a constant value, why is it desirable to have m as large as possible?
7.9 Modify your program from Exercise 5.13 in Chapter 5 to produce a 64-bit result, rather than a 32-bit result.
7.10 Modify your program from Exercise 5.13 in Chapter 5 to produce a 128-bit result, rather than a 32-bit result. How would you do this in C?
7.11 Write the bigint_shift_left_chunk function from Listing 7.8 in ARM assembly, and measure the performance improvement.
7.12 Write the bigint_mul_uint function in ARM assembly, and measure the performance improvement.
7.13 Write the bigint_mul function in ARM assembly, and measure the performance improvement.
This chapter starts by demonstrating how to convert fractional numbers to radix notation in any base. It then presents a theorem that can be used to determine in which bases a given fraction will terminate rather than repeating. That theorem is then used to explain why some base ten fractional numbers cannot be represented in binary with a finite number of bits. Next, fixed-point numbers are introduced. The rules for addition, subtraction, multiplication, and division are given. Division by a constant is explained in terms of fixed-point mathematics. Next, the IEEE floating point formats are explained. The chapter ends with an example showing how fixed-point mathematics can be used to write functions for sine and cosine which give better precision and higher performance than the functions provided by GCC.
Fixed point; Radix point; Non-terminating repeating fraction; S/U notation; Q notation; Floating point; Performance
Chapter 7 introduced methods for performing computation using integers. Although many problems can be solved using only integers, it is often necessary (or at least more convenient) to perform computation using real numbers or even complex numbers. For our purposes, a non-integral number is any number that is not an integer. Many systems are only capable of performing computation using binary integers, and have no hardware support for non-integral calculations. In this chapter, we will examine methods for performing non-integral calculations using only integer operations.
Section 1.3.2 explained how to convert integers in a given base into any other base. We will now extend the methods to convert fractional values. A fractional number can be viewed as consisting of an integer part, a radix point, and a fractional part. In base 10, the radix point is also known as the decimal point. In base 2, it is called the binimal point. For base 16, it is the heximal point, and in base 8 it is an octimal point. The term radix point is used as a general term for a location that divides a number into integer and fractional parts, without specifying the base.
The procedure for converting fractions from a given base b into base ten is very similar to the procedure used for integers. The only difference is that the digit to the left of the radix point is weighted by b⁰ and the exponents become increasingly negative for each digit right of the radix point. The basic procedure is the same for any base b. For example, the value 101.0101₂ can be converted to base ten by expanding it as follows:
Likewise, the hexadecimal fraction 4F2.9A0₁₆ can be converted to base ten by expanding it as follows:
When converting from base ten into another base, the integer and fractional parts are treated separately. The base conversion for the integer part is performed in exactly the same way as in Section 1.3.2, using repeated division by the base b. The fractional part is converted using repeated multiplication. For example, to convert the decimal value 5.6875₁₀ to a binary representation:
1. Convert the integer portion, 510 into its binary equivalent, 1012.
2. Multiply the decimal fraction by two. The integer part of the result is the first binary digit to the right of the radix point.
Because x = 0.6875 × 2 = 1.375, the first binary digit to the right of the point is a 1. So far, we have 5.6875₁₀ = 101.1₂
3. Multiply the fractional part of x by 2 once again.
Because x = 0.375 × 2 = 0.75, the second binary digit to the right of the point is a 0. So far, we have 5.6875₁₀ = 101.10₂
4. Multiply the fractional part of x by 2 once again.
Because x = 0.75 × 2 = 1.50, the third binary digit to the right of the point is a 1. So now we have 5.6875₁₀ = 101.101₂
5. Multiply the fractional part of x by 2 once again.
Because x = 0.5 × 2 = 1.00, the fourth binary digit to the right of the point is a 1. So now we have 5.6875₁₀ = 101.1011₂
6. Since the fractional part is now zero, we know that all remaining digits will be zero.
The procedure for obtaining the fractional part can be accomplished easily using a tabular method, as shown below:
| Operation | Integer | Fraction |
| 0.6875 × 2 = 1.375 | 1 | 0.375 |
| 0.375 × 2 = 0.75 | 0 | 0.75 |
| 0.75 × 2 = 1.5 | 1 | 0.5 |
| 0.5 × 2 = 1.0 | 1 | 0.0 |

Putting it all together, 5.6875₁₀ = 101.1011₂. After converting a fraction from base 10 into another base, the result should be verified by converting back into base 10. The results from the previous example can be expanded as follows:
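The repeated-multiplication procedure from the table can be sketched in C. The name `frac_to_binary` is made up for this illustration, and `double` is used only as a stand-in for the hand calculation, so fractions that are exactly representable in binary (such as 0.6875) convert without rounding:

```c
/* Convert a fraction to binary digits by repeated multiplication:
   each multiplication by two peels off the next binary digit. */
void frac_to_binary(double frac, char *digits, int ndigits) {
    for (int i = 0; i < ndigits; i++) {
        frac *= 2.0;
        if (frac >= 1.0) {          /* integer part of the result is 1 */
            digits[i] = '1';
            frac -= 1.0;            /* keep only the fractional part */
        } else {
            digits[i] = '0';
        }
    }
    digits[ndigits] = '\0';
}
```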
Converting decimal fractions to base sixteen is accomplished in a very similar manner. To convert 842.234375₁₀ into base 16, we first convert the integer portion by repeatedly dividing by 16 to yield 34A₁₆. We then repeatedly multiply the fractional part, extracting the integer portion of the result each time as shown in the table below:
In the second line, the integer part is 12, which must be replaced with a hexadecimal digit. The hexadecimal digit for 12₁₀ is C, so the fractional part is 3C. Therefore, 842.234375₁₀ = 34A.3C₁₆. The result is verified by converting it back into base 10 as follows:
Converting fractional values between binary, hexadecimal, and octal can be accomplished in the same manner as with integer values. However, care must be taken to align the radix point properly. As with integers, converting from hexadecimal or octal to binary is accomplished by replacing each hex or octal digit with the corresponding binary digits from the appropriate table shown in Fig. 1.3.
For example, to convert 5AC.43B₁₆ to binary, we just replace “5” with “0101,” “A” with “1010,” “C” with “1100,” “4” with “0100,” “3” with “0011,” and “B” with “1011.” So, using the table, we can immediately see that 5AC.43B₁₆ = 010110101100.010000111011₂. This method works exactly the same way for converting from octal to binary, except that it uses the table on the right side of Fig. 1.3.
Converting fractional numbers from binary to hexadecimal or octal is also very easy when using the tables. The procedure is to split the binary string into groups of bits, working outwards from the radix point, then replace each group with its hexadecimal or octal equivalent. For example, to convert 01110010.1010111₂ to hexadecimal, just divide the number into groups of four bits, starting at the radix point and working outwards in both directions. It may be necessary to pad with zeroes to make a complete group on the left or right, or both. Our example is grouped as follows: |0000|0111|0010.1010|1110|₂. Now each group of four bits is converted to hexadecimal by looking up the corresponding hex digit in the table on the left side of Fig. 1.3. This yields 072.AE₁₆. For octal, the binary number would be grouped as follows: |001|110|010.101|011|100|₂. Now each group of three bits is converted to octal by looking up the corresponding digit in the table on the right side of Fig. 1.3. This yields 162.534₈.
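The grouping of the fractional bits can be sketched in C. The name `bin_frac_to_hex` is made up for this illustration; it works left to right from the radix point, padding with zeroes on the right:

```c
#include <string.h>

/* Convert the fractional part of a binary number (a string of '0'
   and '1' digits to the right of the radix point) into hexadecimal,
   four bits per hex digit, padding the last group with zeroes. */
void bin_frac_to_hex(const char *bits, char *hex) {
    static const char digit[] = "0123456789ABCDEF";
    size_t len = strlen(bits);
    size_t groups = (len + 3) / 4;
    for (size_t g = 0; g < groups; g++) {
        int value = 0;
        for (size_t b = 0; b < 4; b++) {
            size_t i = g * 4 + b;
            value = value * 2 + (i < len ? bits[i] - '0' : 0); /* pad right */
        }
        hex[g] = digit[value];
    }
    hex[groups] = '\0';
}
```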
One interesting phenomenon that is often encountered is that fractions which terminate in one base may become non-terminating, repeating fractions in another base. For example, the binary representation of the decimal fraction
is a repeating fraction, as shown in Example 8.1. The resulting fractional part from the last step performed is exactly the same as in the second step. Therefore, the sequence will repeat. If we continue, we will repeat the sequence of steps 2–5 forever. Hence, the final binary representation will be:
Because of this phenomenon, it is impossible to exactly represent 1.1₁₀ (and many other fractional quantities) as a binary fraction in a finite number of bits.
The fact that some base 10 fractions cannot be exactly represented in binary has led to many subtle software bugs and round-off errors, when programmers attempt to work with currency (and other quantities) as real-valued numbers. In this section, we explore the idea that the representation problem can be avoided by working in some base other than base 2. If that is the case, then we can simply build hardware (or software) to work in that base, and will be able to represent any fractional value precisely using a finite number of digits. For brevity, we will refer to a binary fractional quantity as a binimal and a decimal fractional quantity as a decimal. We would like to know whether there are more non-terminating decimals than binimals, more non-terminating binimals than decimals, or neither. Since there are an infinite number of non-terminating decimals and an infinite number of non-terminating binimals, we could be tempted to conclude that they are equal. However, that is an oversimplification. If we ask the question differently, we can discover some important information. A better way to ask the question is as follows:
Question: Is the set of terminating decimals a subset of the set of terminating binimals, or vice versa, or neither?
We start by introducing a lemma which can be used to predict whether or not a terminating fraction in one base will terminate in another base. We introduce the notation x|y (read as “x divides y”) to indicate that y can be evenly divided by x.
Answer: The set of terminating binimals is a subset of the set of terminating decimals, but the set of terminating decimals is not a subset of the set of terminating binimals.
Theorem 8.2.1 implies that any binary fraction can be expressed exactly as a decimal fraction, but Theorem 8.2.2 implies that there are decimal fractions which cannot be expressed exactly in binary. Every fraction (when expressed in lowest terms) which has a non-zero power of five in its denominator cannot be represented in binary with a finite number of bits. Another implication is that some fractions cannot be expressed exactly in either binary or decimal. For example, let B = 30 = 2 × 3 × 5. Then any number with denominator
terminates in base 30. However, if k₂ ≠ 0, then the fraction will terminate in neither base two nor base ten, because three is not a prime factor of ten or two.
Another implication of the theorem is that the more prime factors we have in our base, the more fractions we can express exactly. For instance, the smallest base that has two, three, and five as prime factors is base 30. Using that base, we can exactly express fractions in radix notation that cannot be expressed in base ten or in base two with a finite number of digits. For example, in base 30, the fraction
will terminate after one division since 15 = 3¹5¹. To see what the number will look like, let us extend the hexadecimal system of using letters to represent digits beyond 9. So we get this chart for base 30:
Since
, the fraction can be expressed precisely as 0.M₃₀. Likewise, the fraction
is
but terminates in base 30. Since 45 = 3²5¹, this number will have two or fewer digits following the radix point. To compute the value, we will have to raise it to higher terms. Using 30² as the denominator gives us:
Now we can convert it to base 30 by repeated division.
with remainder 20. Since 20 < 30, we cannot divide again. Therefore,
in base 30 is 0.8K.
Although base 30 can represent all fractions that can be expressed in bases two and ten, there are still fractions that cannot be represented in base 30. For example,
has the prime factor seven in its denominator, and therefore will only terminate in bases where seven is a factor of the base. The fraction
will terminate in base 7, base 14, base 21, base 42 and many others, but not in base 30. Since there are an infinite number of primes, no number system is immune from this problem. No matter what base the computer works in, there are fractions that cannot be expressed exactly with a finite number of digits. Therefore, it is incumbent upon programmers and hardware designers to be aware of round-off errors and take appropriate steps to minimize their effects.
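The termination property discussed above can be tested mechanically: a fraction in lowest terms terminates in base b exactly when every prime factor of its denominator also divides b. Repeatedly stripping from the denominator the factors it shares with the base implements that check without explicit factoring. The names `gcd` and `terminates` are made up for this sketch:

```c
/* Greatest common divisor by Euclid's algorithm. */
static unsigned gcd(unsigned a, unsigned b) {
    while (b != 0) { unsigned t = a % b; a = b; b = t; }
    return a;
}

/* Return 1 if num/den has a terminating representation in the given
   base, 0 if it is a non-terminating repeating fraction. */
int terminates(unsigned num, unsigned den, unsigned base) {
    den /= gcd(num, den);        /* reduce to lowest terms */
    for (;;) {
        unsigned g = gcd(den, base);
        if (g == 1)
            break;               /* no shared factors remain */
        while (den % g == 0)
            den /= g;            /* strip the shared factors */
    }
    return den == 1;
}
```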
For example, there is no reason why the hardware clocks in a computer should work in base ten. They can be manufactured to measure time in base two. Instead of counting seconds in tenths, hundredths or thousandths, they could be calibrated to measure in fourths, eighths, sixteenths, 1024ths, etc. This would eliminate the round-off error problem in keeping track of time.
As shown in the previous section, given a finite number of bits, a computer can only approximately represent non-integral numbers. It is often necessary to accept that limitation and perform computations involving approximate values. With due care and diligence, the results will be accurate within some acceptable error tolerance. One way to deal with real-valued numbers is to simply treat the data as fixed-point numbers. Fixed-point numbers are treated as integers, but the programmer must keep track of the radix point during each operation. We will present a systematic approach to designing fixed-point calculations.
When using fixed-point arithmetic, the programmer needs a convenient way to describe the numbers that are being used. Most languages have standard data types for integers and floating point numbers, but very few have support for fixed-point numbers. Notable exceptions include PL/1 and Ada, which provide support for fixed-point binary and fixed-point decimal numbers. We will focus on fixed-point binary, but the techniques presented can also be applied to fixed-point numbers in any base.
Each fixed-point binary number has three important parameters that describe it:
1. whether the number is signed or unsigned,
2. the position of the radix point in relation to the right side of the sign bit (for signed numbers) or the position of the radix point in relation to the most significant bit (for unsigned numbers), and
3. the number of fractional bits stored.
Unsigned fixed-point numbers will be specified as U(i,f), where i is the position of the radix point in relation to the left side of the most significant bit, and f is the number of bits stored in the fractional part.
For example, U(10,6) indicates that there are six bits of precision in the fractional part of the number, and the radix point is ten bits to the right of the most significant bit stored. The layout for this number is shown graphically as:

where i is an integer bit and f is a fractional bit. Very small numbers with no integer part may have a negative i. For example, U(−8,16) specifies an unsigned number with no integer part, eight leading zero bits which are not actually stored, and 16 bits of fractional precision. The layout for this number is shown graphically as:

Likewise, signed fixed-point numbers will be specified using the following notation: S(i,f), where i is the position of the radix point in relation to the right side of the sign bit, and f is the number of fractional bits stored. As with integer two’s-complement notation, the sign bit is always the leftmost bit stored. For example, S(9,6) indicates that there are six bits in the fractional part of the number, and the radix point is nine bits to the right of the sign bit. The layout for this number is shown graphically as:

where i is an integer bit and f is a fractional bit. Very small numbers with no integer part may have a negative i. For example, S(−7,16) specifies a signed number with no integer part, six leading sign bits which are not actually stored, a sign bit that is stored and 15 bits of fraction. The layout for this number is shown graphically as:

Note that the “hidden” bits in a signed number are assumed to be copies of the sign bit, while the “hidden” bits in an unsigned number are assumed to be zero.
The following figure shows an unsigned fixed-point number with seven bits in the integer part and nine bits in the fractional part. It is a U(7,9) number. Note that the total number of bits is 7 + 9 = 16.

The value of this number in base 10 can be computed by summing the values of each non-zero bit as follows:
Likewise, the following figure shows a signed fixed-point number with nine bits in the integer part and six bits in the fractional part. It is an S(9,6) number. Note that the total number of bits is 9 + 6 + 1 = 16.

The value of this number in base 10 can be computed by summing the values of each non-zero bit as follows:
Note that in the above two examples, the patterns of bits are identical. The value of a number depends upon how it is interpreted. The notation that we have introduced allows us to specify exactly how a number is to be interpreted. For signed values, if the sign bit is non-zero, then the two’s complement should be taken before the number is evaluated. For example, the following figure shows an S(8,7) number that has a negative value.

The value of this number in base 10 can be computed by taking the two’s complement, summing the values of the non-zero bits, and adding a negative sign to the result. The two’s complement of 1011010101111010 is 0100101010000101 + 1 = 0100101010000110. The value of this number is −010010101.0000110₂ = −149.046875₁₀.
For a final example we will interpret this bit pattern as an S(−5,16). In that format, the layout is:

The value of this number in base ten can be computed by taking the two’s complement, summing the values of the non-zero bits, and adding a negative sign to the result. The two’s complement is:

The value of this number interpreted as an S(−5,16) is −0100101010000110₂ × 2^−20 ≈ −0.018194₁₀.
Fixed-point number formats can also be represented using Q notation, which was developed by Texas Instruments. Q notation is equivalent to the S/U format used in this book, except that the integer portion is not always fully specified. In general, Q formats are specified as Qm.n, where m is the number of integer bits, and n is the number of fractional bits. If a fixed word size w is being used, then m may be omitted, and is assumed to be w − n. For example, a Q10 number has 10 fractional bits, and the number of integer bits is not specified, but is assumed to be the number of bits required to complete a word of data. A Q2.4 number has two integer bits and four fractional bits in a 6-bit word. There are two conflicting conventions for dealing with the sign bit. In one convention, the sign bit is included as part of m, and in the other convention, it is not. When using Q notation, it is important to state which convention is being used. Additionally, a U may be prefixed to indicate an unsigned value. For example, UQ8.8 is equivalent to U(8,8), and Q7.9 is equivalent to S(7,9).
Once the decision has been made to use fixed-point calculations, the programmer must make some decisions about the specific representation of each fixed-point variable. The combination of size and radix will affect several properties of the numbers, including:
Precision: the maximum number of non-zero bits representable,
Resolution: the smallest non-zero magnitude representable,
Accuracy: the magnitude of the maximum difference between a true real value and its approximate representation,
Range: the difference between the largest and smallest number that can be represented, and
Dynamic range: the ratio of the maximum absolute value to the minimum positive absolute value representable.
Given a number specified using the notation introduced previously, we can determine its properties. For example, an S(9,6) number has the following properties:
Resolution: R = 2^−6 = 0.015625
Accuracy: A = R ÷ 2 = 2^−7 = 0.0078125
Range: Minimum value is 1000000000.000000 = −512. Maximum value is 0111111111.111111 = 511.984375. Range is G = 511.984375 + 512 = 1023.984375.
Dynamic range: For a signed fixed-point rational representation, S(i,f), the dynamic range is
D = 2^i ÷ 2^−f = 2^(i+f)
Therefore, the dynamic range of an S(9,6) is 2^15 = 32768.
Being aware of these properties, the programmer can select fixed-point representations that fit the task that they are trying to solve. This allows the programmer to strive for very efficient code by using the smallest fixed-point representation possible, while still guaranteeing that the results of computations will be within some limits for error tolerance.
Fixed-point numbers are actually stored as integers, and all of the integer mathematical operations can be used. However, some care must be taken to track the radix point at each stage of the computation. The advantages of fixed-point calculations are that the operations are very fast and can be performed on any computer, even if it does not have special hardware support for non-integral numbers.
Fixed-point addition and subtraction work exactly like their integer counterparts. Fig. 8.1 gives some examples of fixed-point addition with signed numbers. Note that in each case, the numbers are aligned so that they have the same number of bits in their fractional part. This requirement is the only difference between integer and fixed-point addition. In fact, integer arithmetic is just fixed-point arithmetic with no bits in the fractional part. The arithmetic that was covered in Chapter 7 was fixed-point arithmetic using only S(i,0) and U(i,0) numbers. Now we are simply extending our knowledge to deal with numbers where f≠0. There are some rules which must be followed to ensure that the results are correct. The rules for subtraction are the same as the rules for addition. Since we are using two’s complement math, subtraction is performed using addition.

Suppose we want to add an S(7,8) number to an S(7,4) number. The radix points are at different locations, so we cannot simply add them. Instead, we must shift one of the numbers, changing its format, until the radix points are aligned. The choice of which one to shift depends on what format we desire for the result. If we desire eight bits of fraction in our result, then we would shift the S(7,4) left by four bits, converting it into an S(7,8). With the radix points aligned, we simply use an integer addition operation to add the two numbers. The result will have its radix point in the same location as the two numbers being added.
Recall that the result of multiplying an n bit number by an m bit number is an n + m bit number. In the case of fixed-point numbers, the size of the fractional part of the result is the sum of the number of fractional bits of each number, and the total size of the result is the sum of the total number of bits in each number. Consider the following example where two U(5,3) numbers are multiplied together:

The result is a U(10,6) number. The number of bits in the result is the sum of all of the bits of the multiplicand and the multiplier. The number of fractional bits in the result is the sum of the number of fractional bits in the multiplicand and the multiplier. There are three simple rules to predict the resulting format when multiplying any two fixed-point numbers.
Unsigned Multiplication The result of multiplying two unsigned numbers U(i1,f1) and U(i2,f2) is a U(i1 + i2,f1 + f2) number.
Mixed Multiplication The result of multiplying a signed number S(i1,f1) and an unsigned number U(i2,f2) is an S(i1 + i2,f1 + f2) number.
Signed Multiplication The result of multiplying two signed numbers S(i1,f1) and S(i2,f2) is an S(i1 + i2 + 1,f1 + f2) number.
Note that this rule works for integers as well as fixed-point numbers, since integers are really fixed-point numbers with f = 0. If the programmer desires a particular format for the result, then the multiply is followed by an appropriate shift.
Listing 8.1 gives some examples of fixed-point multiplication using the ARM multiply instructions. In each case, the result is shifted to produce the desired format. It is the responsibility of the programmer to know what type of fixed-point number is produced after each multiplication and to adjust the result by shifting if necessary.

Derivation of the rule for determining the format of the result of division is more complicated than the one for multiplication. We will first consider only unsigned division of a dividend with format U(i1,f1) by a divisor with format U(i2,f2).
Consider the results of dividing two fixed-point numbers, using integer operations with limited precision. The value of the least significant bit of the dividend N is 2^−f1, and the value of the least significant bit of the divisor D is 2^−f2. In order to perform the division using integer operations, it is necessary to multiply N by 2^f1 and multiply D by 2^f2 so that both numbers are integers. Therefore, the division operation can be written as:

N ÷ D = [(N × 2^f1) ÷ (D × 2^f2)] × 2^(f2−f1)
Note that no multiplication is actually performed. Instead, the programmer mentally shifts the radix point of the divisor and dividend, then computes the radix point of the result. For example, given two U(5,3) numbers, the division operation is accomplished by converting them both to integers, performing the division, then computing the location of the radix point:

N ÷ D = (N × 2^3) ÷ (D × 2^3) × 2^0, which is a U(8,0) number.

Note that the result is an integer. If the programmer wants to have some fractional bits in the result, then the dividend must be shifted to the left before the division is performed.
If the programmer wants to have fq fractional bits in the quotient, then the amount that the dividend must be shifted can easily be computed as s = fq + f2 − f1.
For example, suppose the programmer wants to divide 01001.011 stored as a U(28,3) by 00011.110 which is also stored as a U(28,3), and wishes to have six fractional bits in the result. The programmer would first shift 01001.011 to the left by six bits, then perform the division and compute the position of the radix in the result as shown:

Since the divisor may be between zero and one, the quotient may actually require more integer bits than there are in the dividend. Consider that the largest possible value of the dividend is 2^i1 − 2^−f1, and the smallest positive value for the divisor is 2^−f2. Therefore, the maximum quotient is given by:

q_max = (2^i1 − 2^−f1) ÷ 2^−f2 = 2^(i1+f2) − 2^(f2−f1)

Taking the limit of the previous equation as f1 → ∞ provides the following bound on how many bits are required in the integer part of the quotient:

q_max < 2^(i1+f2)
Therefore, in the worst case, the quotient will require i1 + f2 integer bits. For example, if we divide a U(3,5), a = 111.11111 = 7.96875₁₀, by a U(5,3), b = 00000.001 = 0.125₁₀, we end up with a U(6,2), q = 111111.11 = 63.75₁₀.
The same thought process can be used to determine the results for signed division as well as mixed division between signed and unsigned numbers. The results can be reduced to the following three rules:
Unsigned Division The result of dividing an unsigned fixed-point number U(i1,f1) by an unsigned number U(i2,f2) is a U(i1 + f2,f1 − f2) number.
Mixed Division The result of dividing two fixed-point numbers where one of them is signed and the other is unsigned is an S(i1 + f2,f1 − f2) number.
Signed Division The result of dividing two signed fixed-point numbers is an S(i1 + f2 + 1,f1 − f2) number.
Consider the results when a U(3,3), a = 000.001 = 0.125₁₀, is divided by a U(4,5), b = 1000.00000 = 8.0₁₀. The quotient is q = 0.000001₂ = 0.015625₁₀, which requires six bits in the fractional part. However, if we simply perform the division, then according to the rules shown above, the result will be a U(8,−2). There is no such thing as a U(8,−2), so the result is meaningless.
When f2 > f1, blindly applying the rules will result in a negative fractional part. To avoid this, the dividend can be shifted left so that it has at least as many fractional bits as the divisor. This leads to the following rule: if f2 > f1, then convert the dividend to an S(i1,x), where x ≥ f2, then apply the appropriate rule. For example, dividing an S(5,2) by a U(3,12) would result in an S(17,−10). But shifting the S(5,2) 16 bits to the left will result in an S(5,18), and dividing that by a U(3,12) will result in an S(17,6).
Recall that integer division produces a result and a remainder. In order to maintain precision, it is necessary to perform the integer division operation in such a way that all of the significant bits are in the result and only insignificant bits are left in the remainder. The easiest way to accomplish this is by shifting the dividend to the left before the division is performed.
To find a rule for determining the shift necessary to maintain full precision in the quotient, consider the worst case. The minimum positive value of the dividend is 2^−f1, and the largest positive value for the divisor is 2^i2 − 2^−f2. Therefore, the minimum positive quotient is given by:

q_min = 2^−f1 ÷ (2^i2 − 2^−f2) > 2^−f1 ÷ 2^i2 = 2^−(i2+f1)
Therefore, in the worst case, the quotient will require i2 + f1 fractional bits to maintain precision. However, fewer bits can be reserved if full precision is not required.
Recall that the least significant bit of the quotient will be 2^−(i2+f1). Shifting the dividend left by i2 + f2 bits will convert it into a U(i1,i2 + f1 + f2). Using the rule above, when it is divided by a U(i2,f2), the result is a U(i1 + f2,i2 + f1). This is the minimum size which is guaranteed to preserve all bits of precision. The general method for performing fixed-point division while maintaining maximum precision is as follows:
1. shift the dividend left by i2 + f2, then
2. perform integer division.
The result will be a U(i1 + f2,i2 + f1) for unsigned division, or an S(i1 + f2 + 1,i2 + f1) for signed division. The result for mixed division is left as an exercise for the student.
Section 7.3.3 introduced the idea of converting division by a constant into multiplication by the reciprocal of that constant. In that section it was shown that by pre-multiplying the reciprocal by a power of two (a shift operation), then dividing the final result by the same power of two (a shift operation), division by a constant could be performed using only integer operations with a more efficient multiply replacing the (usually) very slow divide.
This section presents an alternate way to achieve the same results, by treating division by an integer constant as an application of fixed-point multiplication. Again, the integer constant divisor is converted into its reciprocal, but this time the process is considered from the viewpoint of fixed-point mathematics. Both methods will achieve exactly the same results, but some people tend to grasp the fixed-point approach better than the purely integer approach.
When writing code to divide by a constant, the programmer must strive to achieve the largest number of significant bits possible, while using the shortest (and most efficient) representation possible. On modern computers, this usually means using 32-bit integers and integer multiply operations which produce 64-bit results. That would be extremely tedious to show in a textbook, so the principles will be demonstrated here using 8-bit integers and an integer multiply which produces a 16-bit result.
Suppose we want efficient code to calculate x ÷ 23 using only 8-bit signed integer multiplication. The reciprocal of 23, in binary, is the repeating fraction R = 1 ÷ 23 = 0.00001011001 00001011001…₂
If we store R as an S(1,11), it would look like this:

Note that in this format, the reciprocal of 23 has five leading zeros. We can store R in eight bits by shifting it left to remove some of the leading zeros. Each shift to the left changes the format of R. After removing the first leading zero bit, we have:

After removing the second leading zero bit, we have:

After removing the third leading zero bit, we have:

Note that the number in the previous format has a “hidden” bit between the radix point and the sign bit. That bit is not actually stored, but is assumed to be identical to the sign bit. Removing the fourth leading zero produces:

The number in the previous format has two “hidden” bits between the radix point and the sign bit. Those bits are not actually stored, but are assumed to be identical to the sign bit. Removing the fifth leading zero produces:

We can only remove five leading zero bits, because removing one more would change the sign bit from 0 to 1, resulting in a completely different number. Note that the final format has three “hidden” bits between the radix point and the sign bit. These bits are all copies of the sign bit. It is an S(−4,8) number because the sign is four bits to the right of the radix point (resulting in the three “hidden” bits). According to the rules of fixed-point multiplication given earlier, an S(7,0) number x multiplied by an S(−4,8) number R will yield an S(4,8) number y. The value y will be approximately (x ÷ 23) × 2^3, because we have three “hidden” bits to the right of the radix point. Therefore, x ÷ 23 ≈ y × 2^−3, indicating that after the multiplication, we must shift the result right by three bits to restore the radix. Since 1 ÷ 23 is positive, the number R must be increased by one to avoid round-off error. Therefore, we will use R + 1 = 01011010₂ = 90₁₀ in our multiply operation. To calculate y = 101₁₀ ÷ 23₁₀, we can multiply and perform a shift as follows:

Because our task is to implement integer division, everything to the right of the radix point can be immediately discarded, keeping only the upper eight bits as the integer portion of the result. The integer portion, 100011₂, shifted right three bits, is 100₂ = 4₁₀. If the modulus is required, it can be calculated as: 101 − (4 × 23) = 9. Some processors, such as the Motorola HC11, have a special multiply instruction which keeps only the upper half of the result. This method would be especially efficient on such a processor. Listing 8.2 shows how the 8-bit division code would be implemented in ARM assembly. Listing 8.3 shows an alternate implementation which uses shift and add operations rather than a multiply.


The procedure is exactly the same for dividing by a negative constant. Suppose we want efficient code to calculate x ÷ (−50) using 16-bit signed integers. We first convert 1 ÷ 50 into binary:
1 ÷ 50 = 0.0000010100011110101110…₂
The two’s complement of 1 ÷ 50 is
−1 ÷ 50 = 11.1111101011100001010…₂
We can represent −1 ÷ 50 as the following S(1,21) fixed-point number:

Note that the upper seven bits are all one. We can remove six of those bits and adjust the format as follows. After removing the first leading one, the reciprocal is:

Removing another leading one changes the format to:

On the next step, the format is:

Note that we now have a “hidden” bit between the radix point and the sign bit. The hidden bit is not actually part of the number that we store and use in the computation, but it is assumed to be the same as the sign bit.
After three more leading ones are removed, the format is:

Note that there are four “hidden” bits between the radix point and the sign. Since the reciprocal is negative, we do not need to round by adding one to the number R. Therefore, we will use R = 1010111000010101₂ = AE15₁₆ in our multiply operation.
Since we are using 16-bit integer operations, the dividend, x, will be an S(15,0). The product of an S(15,0) and an S(−5,16) will be an S(11,16). We will remove the 16 fractional bits by shifting right. The four “hidden” bits indicate that the result must be shifted an additional four bits to the right, resulting in a total shift of 20 bits. Listing 8.4 shows how the 16-bit division code would be implemented in ARM assembly.

Sometimes we need more range than we can easily get from fixed precision. One approach to solving this problem is to create an aggregate data type that can represent a fractional number by having fields for an exponent, a sign bit, and an integer mantissa. For example, in C, we could represent a fractional number using the data structure shown in Listing 8.5. That data structure, along with some subroutines for addition, subtraction, multiplication and division, would provide the capability to perform arithmetic without explicitly tracking the radix point. The subroutines for the basic arithmetical operations could do that, thereby freeing the programmer to work at a higher level.

The structure shown in Listing 8.5 is a rather inefficient way to represent a fractional number, and may create different data structures on different machines. The sign only requires one bit, and the sizes of the exponent and mantissa fields depend upon the machine on which the code is compiled.
The C language includes the notion of bit fields. These allow the programmer to specify exactly how many bits are to be used for each field within a struct. Listing 8.6 shows a C data structure that consumes 32 bits on all machines and architectures. It provides the same fields as the structure in Listing 8.5, but specifies exactly how many bits each field consumes: the sign will use one bit, the exponent eight bits, and the mantissa 23 bits.

The compiler will compress this data structure into 32 bits, regardless of the natural word size of the machine.
The method of representing fractional numbers as a sign, exponent, and mantissa is very powerful, and the IEEE has set standards for various floating point formats. These formats can be described using bit fields in C, as described above. Many processors have hardware that is specifically designed to perform arithmetic using the standard IEEE formatted data. The following sections describe the most common of the IEEE-defined numerical formats.
The IEEE standard specifies the bitwise representation for numbers, and specifies parameters for how arithmetic is to be performed. The IEEE standard for numbers includes the possibility of having numbers that cannot be easily represented. For example, any quantity that is greater than the most positive representable value is positive infinity, and any quantity that is less than the most negative representable value is negative infinity. There are special bit patterns to encode these quantities. The programmer or hardware designer is responsible for ensuring that their implementation conforms to the IEEE standards. The following sections describe some of the IEEE standard data formats.
The half-precision format gives a 16-bit encoding for fractional numbers with a small range and low precision. There are situations where this format is adequate. If the computation is being performed on a very small machine, then using this format may result in significantly better performance than could be attained using one of the larger IEEE formats. However, in most situations, the programmer can achieve better performance and/or precision by using a fixed-point representation. The format is as follows:

• The Significand (a.k.a. “Mantissa”) is stored using a sign-magnitude coding, with bit 15 being the sign bit.
• The exponent is an excess-15 number. That is, the number stored is 15 greater than the actual exponent.
• There are 10 bits of significand, but there are 11 bits of significand precision. There is a “hidden” bit, m10, between m9 and e0. When a number is stored in this format, it is shifted until its leftmost non-zero bit is in the hidden bit position, and the hidden bit is not actually stored. The exception to this rule is when the number is zero or very close to zero. The radix point is assumed to be between the hidden bit and the first bit stored. The radix point is then shifted by the exponent.
Table 8.1 shows how to interpret IEEE 754 Half-Precision numbers. The exponents 00000 and 11111 have special meaning. The value 00000 is used to represent zero and numbers very close to zero, and the exponent value 11111 is used to represent infinity and NaN. NaN, which is the abbreviation for not a number, is a value representing an undefined or unrepresentable value. One way to get NaN as a result is to divide infinity by infinity. Another is to divide zero by zero. The NaN value can indicate that there is a bug in the program, or that a calculation must be performed using a different method.
Table 8.1
Format for IEEE 754 half-precision
| Exponent | Significand = 0 | Significand ≠ 0 | Equation |
| 00000 | ±0 | subnormal | (−1)^sign × 2^−14 × 0.significand |
| 00001 … 11110 | normalized value | normalized value | (−1)^sign × 2^(exp−15) × 1.significand |
| 11111 | ±∞ | NaN | |

Subnormal means that the value is too close to zero to be completely normalized. The minimum strictly positive (subnormal) value is 2−24 ≈ 5.96 × 10−8. The minimum positive normal value is 2−14 ≈ 6.10 × 10−5. The maximum exactly representable value is (2 − 2−10) × 215 = 65504.

The single precision format provides a 23-bit mantissa and an 8-bit exponent, which is enough to represent a reasonably large range with reasonable precision. This type can be stored in 32 bits, so it is relatively compact. At the time that the IEEE standards were defined, most machines used a 32-bit word, and were optimized for moving and processing data in 32-bit quantities. For many applications this format represents a good trade-off between performance and precision.

The double-precision format was designed to provide enough range and precision for most scientific computing requirements. It provides an 11-bit exponent and a 52-bit mantissa (53 bits of precision, counting the hidden bit). When the IEEE 754 standard was introduced, this format was not supported by most hardware. That has changed. Most modern floating point hardware is optimized for the IEEE 754 double-precision standard, and most modern processors are designed to move 64-bit or larger quantities. On modern floating-point hardware, this is the most efficient representation.
However, processing large arrays of double-precision data requires twice as much memory, and twice as much memory bandwidth, as single-precision.

The IEEE 754 quad-precision format was designed to provide enough range and precision for very demanding applications. It provides a 15-bit exponent and a 112-bit mantissa (113 bits of precision, counting the hidden bit). This format is still not supported by most hardware. The first hardware floating point unit to support this format was the SPARC V8 architecture. As of this writing, the popular Intel x86 family, including the 64-bit versions of the processor, does not have hardware support for the IEEE 754 quad-precision format. On modern high-end processors such as the SPARC, this may be an efficient representation. However, for mid-range processors such as the Intel x86 family and the ARM, this format is definitely out of their league.

Many processors do not have hardware support for floating point. On those processors, all floating point must be accomplished through software. Processors that do support floating point in hardware must have quite sophisticated circuitry to manage the basic operations on data in the IEEE 754 standard formats. Regardless of whether the operations are carried out in software or hardware, the basic arithmetic operations require multiple steps.
The steps required for addition and subtraction of floating point numbers are the same, regardless of the specific format. The steps for adding or subtracting two floating point numbers a and b are as follows:
1. Extract the exponents Ea and Eb.
2. Extract the significands Ma and Mb, and convert them into 2’s complement numbers, using the signs Sa and Sb.
3. Shift the significand with the smaller exponent right by |Ea − Eb|.
4. Perform addition (or subtraction) on the significands to get the significand of the result, Mr. Remember that the result may require one more significant bit to avoid overflow.
5. If Mr is negative, then take the 2’s complement and set Sr to 1. Otherwise set Sr to 0.
6. Shift Mr until the leftmost 1 is in the “hidden” bit position, and adjust the larger of the two exponents by the shift amount to form the new exponent Er.
7. Combine the sign Sr, the exponent Er, and significand Mr to form the result.
The complete algorithm must also provide for correct handling of infinity and NaN.
Multiplication and division of floating point numbers also requires several steps. The steps for multiplication and division of two floating point numbers a and b are as follows:
1. Calculate the sign of the result Sr.
2. Extract the exponents Ea and Eb.
3. Extract the significands Ma and Mb.
4. Multiply (or divide) the significands to form Mr.
5. Add (or subtract) the exponents to get Er, remembering to correct for the excess-N bias.
6. Shift Mr until the leftmost 1 is in the “hidden” bit position, and add the shift amount to Er.
7. Combine the sign Sr, the exponent Er, and significand Mr to form the result.
The complete algorithm must also provide for correct handling of infinity and NaN.
It has been said, and is commonly accepted, that “you can’t beat the compiler.” The meaning of this statement is that using hand-coded assembly language is futile and/or worthless because the compiler is “smarter” than a human. This statement is a myth, as will now be demonstrated.
There are many mathematical functions that are useful in programming. Two of the most useful functions are sin(x) and cos(x). However, these functions are not always implemented in hardware, particularly for fixed-point representations. If these functions are required for fixed-point computation, then they must be written in software. These two functions have some nice properties that can be exploited. In particular:
• If we have the sin(x) function, then we can calculate cos(x) using the relationship cos(x) = sin(x + π ÷ 2). Therefore, we only need to get the sine function working, and then we can implement cosine with only a little extra effort.
• sin(x) is cyclical, so sin(x) = sin(x + 2πk) for any integer k. This means that we can limit the domain of our function to the range [−π,π].
• sin(x) is symmetric, so that sin(−x) = −sin(x). This means that we can further restrict the domain to [0,π].
• After we restrict the domain to [0,π], we notice another symmetry, sin(π − x) = sin(x), and we can further restrict the domain to [0,π ÷ 2].
• The range of both functions, sin(x) and cos(x), is in the range [−1,1].
If we exploit all of these properties, then we can write a single shared function to be used by both sine and cosine. We will name this function sinq, and choose the following fixed-point formats:
• sinq will accept x as an S(1,30), and
• sinq will return an S(1,30).
These formats were chosen because S(1,30) is a good format for storing a signed number between zero and π ÷ 2, and also the optimal format for storing a signed number between one and negative one.
The sine function will map x into the domain accepted by sinq and then call sinq to do the actual work. If the result should be negative, then the sine function will negate it before returning. The cosine function will use the relationship previously mentioned, and call the sine function.
We have now reduced the problem to one of approximating sin(x) within the range [0,π ÷ 2]. An approximation to the function sin(x) can be calculated using the Taylor series:

sin(x) = x − x³ ÷ 3! + x⁵ ÷ 5! − x⁷ ÷ 7! + x⁹ ÷ 9! − x¹¹ ÷ 11! + x¹³ ÷ 13! − …

The first few terms of the series should be sufficient to achieve a good approximation. The maximum value possible for the seventh term is (π ÷ 2)¹³ ÷ 13! ≈ 2^−24, which indicates that our function should be accurate to at least 25 bits using seven terms. If more accuracy is desired, then additional terms can be added.
The numerators in the first nine terms of the Taylor series approximation are: x, x³, x⁵, x⁷, x⁹, x¹¹, x¹³, x¹⁵, and x¹⁷. Given an S(1,30) format for x, we can predict the format for the numerator of each successive term in the Taylor series. If we simply perform successive multiplies, then we would get the following formats for the powers of x:
| Term | Format | 32-bit |
| --- | --- | --- |
| x | S(1,30) | S(1,30) |
| x³ | S(3,90) | S(3,28) |
| x⁵ | S(5,150) | S(5,26) |
| x⁷ | S(7,210) | S(7,24) |
| x⁹ | S(9,270) | S(9,22) |
| x¹¹ | S(11,330) | S(11,20) |
| x¹³ | S(13,390) | S(13,18) |
| x¹⁵ | S(15,450) | S(15,16) |
| x¹⁷ | S(17,510) | S(17,14) |
The middle column in the table shows that the format for x¹⁷ would require 528 bits if all of the fractional bits are retained. Dealing with a number at that level of precision would be slow and impractical. We will, of necessity, need to limit the number of bits used. Since the ARM processor provides a multiply instruction involving two 32-bit numbers, we choose to truncate the numerators to 32 bits. The third column in the table indicates the resulting format for each term if precision is limited to 32 bits.
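The effect of this truncation can be shown in C. Multiplying two S(1,30) values gives a 64-bit product with 60 fraction bits; keeping only the upper 32 bits of that product leaves an S(3,28) result, matching the table (a sketch using the text's format conventions):

```c
#include <stdint.h>

/* Multiply two S(1,30) values, keeping the upper 32 bits of the
   64-bit product. The result has the S(3,28) format from the table. */
int32_t mul_s130(int32_t a, int32_t b) {
    int64_t p = (int64_t)a * (int64_t)b;  /* full product: 60 fraction bits */
    return (int32_t)(p >> 32);            /* top 32 bits: 28 fraction bits */
}
```

For example, 0.5 in S(1,30) is 1 << 29, and squaring it yields 1 << 26, which is 0.25 in S(3,28).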
On further consideration of the Taylor series, we notice that each of the above terms will be divided by a constant. Instead of dividing, we can multiply by the reciprocal of the constant. We will create a similar table holding the formats and constants for the factorial terms. With a bit of luck, the division (implemented as multiplication) in each term will result in a reasonable format for each resulting term.
The first term of the Taylor series is x, so we can simply skip the division. The second term is −x³/3! and the third term is +x⁵/5!. We can convert 1/3! = 1/6 to binary as follows:

0.166666… × 2 = 0.333333… → 0
0.333333… × 2 = 0.666666… → 0
0.666666… × 2 = 1.333333… → 1
0.333333… × 2 = 0.666666… → 0
0.666666… × 2 = 1.333333… → 1

Since the pattern repeats, we can conclude that 1/6 = 0.00101010…₂. Since we need a negative number, we take the two's complement, resulting in −1/6 = 11.110101…₂. Represented as an S(1,30), this would be

11110101010101010101010101010101₂

Since the first four bits are one, we can remove three bits and store it as:

10101010101010101010101010101010₂

In hexadecimal, this is AAAAAAAA₁₆.
Performing the same operations, we find that 1/5! = 1/120 can be converted to binary as follows:

0.008333… × 2 = 0.016666… → 0
0.016666… × 2 = 0.033333… → 0
0.033333… × 2 = 0.066666… → 0
0.066666… × 2 = 0.133333… → 0
0.133333… × 2 = 0.266666… → 0
0.266666… × 2 = 0.533333… → 0
0.533333… × 2 = 1.066666… → 1

Since the fraction in the seventh row is the same as the fraction in the third row, we know that the table will repeat forever. Therefore, 1/120 = 0.0000001000100010001…₂. Since the first six bits to the right of the radix are all zero, we can remove the first five bits. Also adding one to the least significant bit to account for rounding error yields the following S(−6,32):

01000100010001000100010001000101₂

In hexadecimal, the number to be multiplied is 44444445₁₆. Note that since 1/120 is a positive number, the reciprocal was incremented by one to avoid round-off errors. We can apply the same procedure to the remaining terms, resulting in the following table:
We want to keep as much precision as is reasonably possible for our intermediate calculations. Using 64 bits of precision for all intermediate calculations will give a good trade-off between performance and precision. The integer portion should never require more than two bits, so we choose an S(2,61) as our intermediate representation. If we combine the previous two tables, we can determine what the format of each complete term will be. This is shown in Table 8.2.
Table 8.2
Result formats for each term
| Term | Numerator Value | Numerator Format | Reciprocal Value | Reciprocal Format | Reciprocal (Hex) | Result Format |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | x | S(1,30) | Extend to 64 bits and shift right | | | S(2,61) |
| 2 | x³ | S(3,28) | −1/3! | S(−2,32) | AAAAAAAA | S(2,61) |
| 3 | x⁵ | S(5,26) | 1/5! | S(−6,32) | 44444445 | S(0,63) |
| 4 | x⁷ | S(7,24) | −1/7! | S(−12,32) | 97F97F97 | S(−4,64) |
| 5 | x⁹ | S(9,22) | 1/9! | S(−18,32) | 5C778E96 | S(−8,64) |
| 6 | x¹¹ | S(11,20) | −1/11! | S(−25,32) | 9466EA60 | S(−13,64) |
| 7 | x¹³ | S(13,18) | 1/13! | S(−32,32) | 5849184F | S(−18,64) |
Note that the formats were truncated to fit in a 64-bit result. We can now see that the formats for the first seven terms of the Taylor series are reasonably similar. They all require exactly 64 bits, and the radix points can be shifted so that they are aligned for addition. In order to make the shifting and adding process easier, we will pre-compute the shift amounts and store them in a look-up table.
Table 8.3 shows the shifts that are necessary to convert each term to an S(2,61) so that it can be added to the running total.
Table 8.3
Shifts required for each term
| Term Number | Original Format | Shift Amount | Resulting Format |
| --- | --- | --- | --- |
| 1 | S(1,30) | 1 | S(2,61) |
| 2 | S(2,61) | 0 | S(2,61) |
| 3 | S(0,63) | 2 | S(2,61) |
| 4 | S(−4,64) | 6 | S(2,61) |
| 5 | S(−8,64) | 10 | S(2,61) |
| 6 | S(−13,64) | 15 | S(2,61) |
| 7 | S(−18,64) | 20 | S(2,61) |
Note that the seventh term contributes very little to the final 32-bit sum, which is stored in the upper 32 bits of the running total. We now have all of the information that we need in order to implement the function. Listing 8.7 shows how the sine and cosine functions can be implemented in ARM assembly using fixed-point computation, and Listing 8.8 shows a main program which prints a table of values and their sines and cosines.
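Each entry in the shift table drives one align-and-accumulate step: the 64-bit term is arithmetically shifted right by the tabulated amount and added to the S(2,61) running total. A minimal sketch of that step (hypothetical helper, using the formats described above):

```c
#include <stdint.h>

/* Align a 64-bit term to the S(2,61) accumulator and add it in.
   The shift amount comes from the look-up table (Table 8.3). */
int64_t accumulate(int64_t total, int64_t term, int shift) {
    return total + (term >> shift);  /* arithmetic shift aligns radix points */
}
```

For example, 0.5 in S(0,63) is 1 << 62; shifting right by 2 gives 1 << 60, which is 0.5 in S(2,61).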
Listing 8.7 Computing sin and cos using fixed-point calculations.

Listing 8.8 Showing how the sin and cos functions can be used to print a table.

In some situations it can be very advantageous to use fixed-point math. For example, when using an ARMv6 or older processor, there may not be a hardware floating point unit available. Table 8.4 shows the CPU time required for running a program to compute the sine function on 10,000,000 random values, using various implementations of the sine function. In each case, the program's main() function was written in C. The only difference between the six implementations was the data type (which could be fixed-point, IEEE single precision, or IEEE double precision) and the sine function that was used. The times shown in the table include only the amount of CPU time actually used in the sine function, and do not include the time required for program startup, storage allocation, random number generation, printing results, or program exit. The six implementations are as follows:
Table 8.4
Performance of sine function with various implementations
| Optimization | Implementation | CPU seconds |
| --- | --- | --- |
| None | 32-bit Fixed Point Assembly | 3.85 |
| None | 32-bit Fixed Point C | 18.99 |
| None | Single Precision Software Float C | 56.69 |
| None | Double Precision Software Float C | 55.95 |
| None | Single Precision VFP C | 11.60 |
| None | Double Precision VFP C | 11.48 |
| Full | 32-bit Fixed Point Assembly | 3.22 |
| Full | 32-bit Fixed Point C | 5.02 |
| Full | Single Precision Software Float C | 20.53 |
| Full | Double Precision Software Float C | 54.51 |
| Full | Single Precision VFP C | 3.70 |
| Full | Double Precision VFP C | 11.08 |
32-bit Fixed Point Assembly The sine function is computed using the code shown in Listing 8.7.
32-bit Fixed Point C The sine function is computed using exactly the same algorithm as in Listing 8.7, but it is implemented in C rather than Assembly.
Single Precision Software Float C Sine is computed using the floating point sine function which is provided by the GCC C compiler. The code is compiled for an ARMv6 or earlier processor without hardware floating point support. The C code is written to use IEEE single precision floating point numbers.
Double Precision Software Float C Exactly the same as the previous method, but using IEEE double precision instead of single precision.
Single Precision VFP C Sine is computed using the floating point sine function which is provided by the GCC C compiler. The code is compiled for an ARMv6 or later processor using hardware floating point support. The C code is written to use IEEE single precision floating point numbers.
Double Precision VFP C Same as the previous method, but using IEEE double precision instead of single precision.
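The "32-bit Fixed Point C" variant follows the same structure as Listing 8.7. A rough sketch (hypothetical code, not the book's listing; for brevity it divides by small constants instead of multiplying by the pre-computed reciprocals) looks like this:

```c
#include <stdint.h>

/* Sketch of sinq in C: x is S(1,30) in [0, pi/2]; returns sin(x) as S(1,30).
   A 64-bit accumulator provides headroom for the intermediate terms. */
int32_t sinq_fixed(int32_t x) {
    int64_t x2 = ((int64_t)x * x) >> 30;    /* x^2, scaled back to 30 fraction bits */
    int64_t term = x;                       /* current Taylor term */
    int64_t sum = x;                        /* running total */
    static const int32_t div[6] = {6, 20, 42, 72, 110, 156}; /* (2k)(2k+1) */
    for (int i = 0; i < 6; i++) {
        term = (term * x2) >> 30;           /* multiply by x^2 */
        term /= div[i];                     /* step to the next factorial */
        sum += (i & 1) ? term : -term;      /* alternating signs */
    }
    return (int32_t)sum;
}
```

Because every operation is an integer multiply, shift, or divide, this version needs no floating point support at all, which is why it dominates the software-float rows of Table 8.4.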
Each of the six implementations was compiled both with and without compiler optimizations, resulting in a total of 12 test cases. All cases were run on a standard Raspberry Pi model B with the default CPU clock rate.
From Table 8.4, it is clear that the fixed-point implementation written in assembly beats the code generated by the compiler in every case. The closest that the compiler can get is when it can use the VFP hardware floating point unit and the compiler is run with full optimization. Even in that case the fixed-point assembly implementation is almost 15% faster than the single precision floating point implementation, and has 33% more precision (32 bits versus 24 bits). In the worst case, when a VFP hardware unit is not available, the assembly code beats the compiler by a whopping 638% in speed and 33% in precision for single precision floats, and is 1692% faster than double precision floating point at a cost of 41% in precision. Note that even with floating point hardware support, fixed point in assembly is still 3.44 times as fast as the C compiler code.
Similar results could be obtained on any processor architecture, and for any reasonably complex mathematical problem. When developing software for small systems, the developer must weigh the costs and benefits of alternative implementations. For battery-powered systems, it is important to realize that choices of hardware and software can affect power consumption even more strongly than computing performance. First, the power used by a system which includes a hardware floating point processor will be consistently higher than that of a system without one. Second, the reduction in processing time required for the job is closely related to the reduction in power required. Therefore, for battery-operated systems, a fixed-point implementation could greatly extend battery life. The following statements summarize the results from the experiment in this section:
1. A competent assembly programmer can beat the compiler, in some cases by a very large margin.
2. If computational performance is critical, then a well-designed fixed-point implementation will usually outperform even a hardware-accelerated floating point implementation.
3. If there is no hardware support for floating point, then floating point performance is extremely poor, and fixed point will always provide the best performance.
4. If battery life is a consideration, then a fixed-point implementation can have an enormous advantage.
Note also from the table that the assembly language version of the fixed-point sine function beats the identical C version by a wide margin. Section 9.8.2 will demonstrate that a good assembly language programmer who is familiar with the floating point hardware can beat the compiler by an even wider performance margin.
Fixed-point arithmetic is very efficient on modern computers. However it is incumbent upon the programmer to track the radix point at all stages of the computation, and to ensure that a sufficient number of bits are provided on both sides of the radix point. The programmer must ensure that all computations are carried out with the desired level of precision, resolution, accuracy, range, and dynamic range. Failure to do so can have serious consequences.
On February 25, 1991, during the Gulf War, an American Patriot missile battery in Dhahran, Saudi Arabia, failed to intercept an incoming Iraqi SCUD missile. The SCUD struck an American Army barracks, killing 28 soldiers and injuring around 98 other people. The cause was an inaccurate calculation of the time elapsed since the system was last booted.
The hardware clock on the system counted the time in tenths of a second since the last reboot. Current time, in seconds, was calculated by multiplying that number by 1/10. For this calculation, 1/10 was represented as a U(1,23) fixed-point number. Since 1/10 cannot be represented precisely in a fixed number of bits, there was round-off error in the calculations. The small imprecision, when multiplied by a large number, resulted in significant error. The longer the system ran after boot, the larger the error became.
The system determined whether or not it should fire by predicting where the incoming missile would be at a specific time in the future. The time and predicted location were then fed to a second system which was responsible for locking onto the target and firing the Patriot missile. The system would only fire when the missile was at the proper location at the specified time. If the radar did not detect the incoming missile at the correct time and location, then the system would not fire.
At the time of the failure, the Patriot battery had been up for around 100 h. We can estimate the error in the timing calculations by considering how the binary number was stored. The binary representation of 1/10 is 0.000110011001100110011…₂. Note that it is a non-terminating, repeating binimal. The 24-bit register in the Patriot could only hold the following set of bits:

0.00011001100110011001100₂

This resulted in an error of 0.00000000000000000000000110011001100…₂. The error can be computed in base 10 as:

e = 1/10 − 838860/8388608 ≈ 9.5 × 10⁻⁸
To find out how much error was in the total time calculation, we multiply e by the number of tenths of a second in 100 h. This gives 9.5 × 10⁻⁸ × 100 × 60 × 60 × 10 = 0.34 s. A SCUD missile travels at about 1,676 m/s. Therefore it travels about 570 m in 0.34 s. Because of this, the targeting and firing system was expecting to find the SCUD at a location that was over half a kilometer from where it really was. This was far enough that the incoming SCUD was outside the “range gate” that the Patriot tracked. It did not detect the SCUD at its predicted location, so it could not lock on and fire the Patriot.
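The arithmetic above can be reproduced directly (a sketch; the scale factor assumes 23 fraction bits, matching the U(1,23) format described earlier):

```c
#include <math.h>

/* Error introduced by truncating 1/10 to 23 binary fraction bits,
   accumulated over 100 hours of tenth-second clock ticks. */
double patriot_drift_seconds(void) {
    double stored = floor(0.1 * (1 << 23)) / (1 << 23); /* truncated 1/10 */
    double e = 0.1 - stored;                            /* about 9.5e-8 */
    double ticks = 100.0 * 60.0 * 60.0 * 10.0;          /* tenths in 100 h */
    return e * ticks;                                   /* about 0.34 s */
}
```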
This is an example of how a seemingly insignificant error can lead to a major failure. In this case, it led to loss of life and serious injury. Ironically, one factor that contributed to the problem was that part of the code had been modified to provide more accurate timing calculations, while another part had not. This meant that the inaccuracies did not cancel each other. Had both sections of code been re-written, or neither section changed, then the issue probably would not have surfaced.
The Patriot system was originally designed in 1974 to be mobile and to defend against aircraft, which move much more slowly than ballistic missiles. It was expected that the system would be moved often, and therefore the computer would be rebooted frequently. Also, a slow-moving aircraft is much easier to track, and the error in predicting its future position would not be significant. The system was modified in 1986 to be capable of shooting down Soviet ballistic missiles. A SCUD missile travels at about twice the speed of the Soviet missiles that the system was re-designed for.
The system was deployed to the Gulf region in 1990, and successfully shot down a SCUD missile in January of 1991. In mid-February of 1991, Israeli troops discovered that the system became inaccurate if it was allowed to run for long periods of time. They claimed that the system would become unreliable after 20 hours of operation. The U.S. military did not think the discovery was significant, but on February 16th, a software update was released. Unfortunately, the update could not immediately reach all units because of wartime difficulties in transportation. The Army released a memo on February 21st, stating that the system was not to be run for “very long times,” but did not specify how long a “very long time” would be. The software update reached Dhahran one day after the Patriot missile system failed to intercept a SCUD missile, resulting in the death of 28 Americans and many more injuries.
Part of the reason this error was not found sooner was that the program was written in assembly language, and had been patched several times in its 15-year life. The code was difficult to understand and maintain, and did not conform to good programming practices. The people who worked to modify the code to handle the SCUD missiles were not as familiar with the code as they would have been if it had been written more recently, and time was a critical factor. Prolonged testing could have caused a disaster by keeping the system out of the hands of soldiers in a time of war. The people at Raytheon Labs had some tough decisions to make, and it cannot be said that Raytheon was guilty of negligence or malpractice. The problem was not necessarily the fault of the developers, but that the system was modified often and in inconsistent ways, without a complete understanding of its behavior.
Sometimes it is desirable to perform calculations involving non-integral numbers. The two common ways to represent non-integral numbers in a computer are fixed point and floating point. A fixed point representation allows the programmer to perform calculations with non-integral numbers using only integer operations. With fixed point, the programmer must track the radix point throughout the computation. Floating point representations allow the radix point to be tracked automatically, but require much more complex software and/or hardware. Fixed point will usually provide better performance than floating point, but requires more programming skill.
Fractional numbers in radix notation may not terminate in all bases. Numbers which terminate in base two will also terminate in base ten, but the converse is not true. Programmers should avoid counting using fractions which do not terminate in base two, because it leads to the accumulation of round-off errors.
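The danger of counting with a fraction that does not terminate in base two is easy to demonstrate: repeatedly adding 0.1 never lands exactly on 1.0, while adding 0.125 (which terminates in binary) does. A minimal illustration:

```c
/* Accumulate a fractional step n times in double precision. */
double accumulate_steps(double step, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += step;        /* round-off accumulates if step is inexact */
    return sum;
}
```

Eight steps of 0.125 give exactly 1.0; ten steps of 0.1 give a value near, but not equal to, 1.0.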
8.1 Perform the following base conversions:
(a) Convert 10110.001₂ to base ten.
(b) Convert 11000.0101₂ to base ten.
(c) Convert 10.125₁₀ to binary.
8.2 Complete the following table (assume all values represent positive fixed-point numbers):
8.3 You are working on a problem involving real numbers between −2 and 2 on a computer that has 16-bit integer registers and no hardware floating point support. You decide to use 16-bit fixed-point arithmetic.
(a) What fixed-point format should you use?
(b) Draw a diagram showing the sign, if any, radix point, integer part, and fractional part.
(c) What is the precision, resolution, accuracy, and range of your format?
8.4 What is the resulting type of each of the following fixed-point operations?
(b) S(3,4)÷U(4,20)
8.5 Convert 26.640625₁₀ to a binary U(18,14) representation. Show the ARM assembly code necessary to load that value into register r4.
8.6 For each of the following fractions, indicate whether or not it will terminate in bases 2, 5, 7, and 10.
(b) 
(c) 
(d) 
(e) 
8.7 What is the exact value of the binary number 0011011100011010 when interpreted as an IEEE half-precision number? Give your answer in base ten.
8.8 The “Software Engineering Code of Ethics And Professional Practice” states that a responsible software engineer should “Approve software only if they have well-founded belief that it is safe, meets specifications, passes appropriate tests…” (sub-principle 1.03) and “Ensure adequate testing, debugging, and review of software…on which they work” (sub-principle 3.10).
The software engineering code of ethics also states that a responsible software engineer should “Treat all forms of software maintenance with the same professionalism as new development.”
(a) Explain how the Software Engineering Code of Ethics And Professional Practice was violated by the Patriot Missile system developers.
(b) How should the engineers and managers at Raytheon have responded when they were asked to modify the Patriot Missile System to work outside of its original design parameters?
(c) What other ethical and non-ethical considerations may have contributed to the disaster?
This chapter begins by giving an overview of the ARM Vector Floating Point (VFP) coprocessor and the ARM VFP register set. Next, it gives an overview of the Floating Point Status and Control Register (FPSCR). It then explains RunFast mode, which gives higher performance but is not fully compliant with the IEEE floating point standards. That is followed by an explanation of vector mode, which can give an additional performance boost in some situations. Then, after a short discussion of the register usage rules, it describes each of the VFP instructions, providing a short description of each one. Next, it presents four implementations of a function to calculate sine using the ARM VFP coprocessor, and shows that they are all significantly faster than the implementation provided by GCC.
Floating point; Vector; IEEE Compliance; Performance
Some ARM processors have dedicated hardware to support floating point operations. For ARMv7 and previous architectures, floating point is provided by an optional Vector Floating Point (VFP) coprocessor. Many newer processors also support the NEON extensions, which are covered in Chapter 10. The remainder of this chapter will explain the VFP coprocessor.
There are four major revisions of the VFP coprocessor:
VFPv1: The original version, which is obsolete and no longer supported.
VFPv2: An optional extension to the ARMv5 and ARMv6 processors. VFPv2 has 16 64-bit FPU registers.
VFPv3: An optional extension to the ARMv7 processors. It is backwards compatible with VFPv2, except that it cannot trap floating-point exceptions. VFPv3-D32 has 32 64-bit FPU registers. Some processors have VFPv3-D16, which supports only 16 64-bit FPU registers. VFPv3 adds several new instructions to the VFP instruction set.
VFPv4: Implemented on some Cortex ARMv7 processors. VFPv4 has 32 64-bit FPU registers. It adds both half-precision extensions and multiply-accumulate instructions to the features of VFPv3. Some processors have VFPv4-D16, which supports only 16 64-bit FPU registers.
Fig. 9.1 shows the 16 ARM integer registers, and the additional registers provided by the VFP coprocessor. Banks four through seven are only present on the VFPv3-D32 and VFPv4-D32 versions of the coprocessor. Note that each register in Banks zero through three can be used to store either one 64-bit number or two 32-bit numbers. For example, double precision register d0 may also be referred to as single precision registers s0 and s1. Each 32-bit VFP register can hold an integer or a single precision floating point number. Registers in Banks four through seven cannot be used as single precision registers.

The VFP adds about 23 new instructions to the ARM instruction set. The exact number of VFP instructions depends on the specific version of the VFP coprocessor. Instructions are provided to:
• transfer floating point values between VFP registers,
• transfer floating-point values between the VFP coprocessor registers and main memory,
• transfer 32-bit values between the VFP coprocessor registers and the ARM integer registers,
• perform addition, subtraction, multiplication, and division, involving two source registers and a destination register,
• compute the square root of a value,
• perform combined multiply-accumulate operations,
• perform conversions between various integer, fixed point, and floating point representations, and
• compare floating-point values.
In addition to performing basic operations involving two source registers and one destination register, VFP instructions can also perform operations involving registers arranged as short vectors (arrays) of up to eight single-precision values or four double-precision values. A single instruction can be used to perform operations on all of the elements of such vectors. This feature can substantially accelerate computation on arrays and matrices of floating point data. This type of data is common in graphics and signal processing applications. Vector mode can reduce code size and increase speed of execution by supporting parallel operations and multiple transfers.
The Floating Point Status and Control Register (FPSCR) is similar to the CPSR register. The FPSCR stores status bits from floating point operations in much the same way as the CPSR stores status bits from integer operations. The programmer can also write to certain bits in the FPSCR to control the behavior of the VFP coprocessor. The layout of the FPSCR is shown in Fig. 9.2. The meaning of each field is as follows:

N The Negative flag is set to one by vcmp if Fd < Fm.
Z The Zero flag is set to one by vcmp if Fd = Fm.
C The Carry flag is set to one by vcmp if Fd = Fm, or Fd > Fm, or Fd and Fm are unordered.
V The oVerflow flag is set to one by vcmp if Fd and Fm are unordered.
QC NEON only. The saturation cumulative flag is set to one by saturating instructions if saturation has occurred.
DN Default NaN mode control bit:
0: Disable Default NaN mode. NaN operands propagate through to the output of a floating-point operation.
1: Enable Default NaN mode. Any operation involving one or more NaNs returns the default NaN.
The default single precision NaN is 7FC0000016 and the default double-precision NaN is 7FF800000000000016. Default NaN mode does not comply with IEEE 754 standard, but may increase performance. NEON instructions ignore this bit and always use Default NaN mode.
FZ Flush-to-Zero mode control bit:
0: Disable Flush-to-Zero mode.
1: Enable Flush-to-Zero mode.
Flush-to-Zero mode replaces subnormal numbers with 0. This does not comply with the IEEE 754 standard, but may increase performance. NEON instructions ignore this bit and always use Flush-to-Zero mode.
RMode Sets the rounding mode:
00 Round to Nearest (RN).
01 Round towards Plus infinity (RP).
10 Round towards Minus infinity (RM).
11 Round towards Zero (RZ).
NEON instructions ignore these bits and always use Round to Nearest mode.
STRIDE Sets the stride (distance between items) for vector operations:
00 Stride is 1.
01 Reserved.
10 Reserved.
11 Stride is 2.
LEN Sets the vector length for vector operations:
000 Vector length is 1 (scalar mode).
001 Vector length is 2.
010 Vector length is 3.
011 Vector length is 4.
100 Vector length is 5.
101 Vector length is 6.
110 Vector length is 7.
111 Vector length is 8.
IDE Input Denormal (subnormal) exception Enable:
1: An exception is generated when one or more operands are subnormal.
IXE IneXact exception Enable:
1: An exception is generated when the result contains more significand bits than the destination format can hold, and must be rounded.
UFE UnderFlow exception Enable:
1: An exception is generated when the result is closer to zero than can be represented by the destination format.
OFE OverFlow exception Enable:
1: An exception is generated when the result is farther from zero than can be represented by the destination format.
DZE Division by Zero exception Enable:
1: An exception is generated by divide instructions when the divisor is zero or subnormal.
IOE Invalid Operation exception Enable:
1: An exception is generated when the result is not defined, or cannot be represented. For example, adding positive and negative infinity gives an invalid result.
IDC The Input Subnormal Cumulative flag is set to one when an IDE condition has occurred.
IXC The IneXact Cumulative flag is set to one when an IXE condition has occurred.
UFC The UnderFlow Cumulative flag is set to one when a UFE condition has occurred.
OFC The OverFlow Cumulative flag is set to one when an OFE condition has occurred.
DZC The Division by Zero Cumulative flag is set to one when a DZE condition has occurred.
IOC The Invalid Operation Cumulative flag is set to one when an IOE condition has occurred.
The only VFP instruction that can be used to update the status flags in the FPSCR is vcmp, which is similar to the integer cmp instruction. To use the FPSCR flags to control conditional instructions, including conditional VFP instructions, they must first be moved into the CPSR register. Table 9.1 shows the meanings of the FPSCR flags when they are transferred to the CPSR and used for conditional execution on following instructions. The following rules govern how the bits in the FPSCR may be changed by subroutines:
Table 9.1
Condition code meanings for ARM and VFP
| <cond> | ARM Data Processing Instruction | VFP vcmp Instruction |
| --- | --- | --- |
| AL | Always | Always |
| EQ | Equal | Equal |
| NE | Not equal | Not equal, or unordered |
| GE | Signed greater than or equal | Greater than or equal |
| LT | Signed less than | Less than, or unordered |
| GT | Signed greater than | Greater than |
| LE | Signed less than or equal | Less than or equal, or unordered |
| HI | Unsigned higher | Greater than, or unordered |
| LS | Unsigned lower or same | Less than or equal |
| HS | Carry set/unsigned higher or same | Greater than or equal, or unordered |
| CS | Same as HS | Same as HS |
| LO | Carry clear/unsigned lower | Less than |
| CC | Same as LO | Same as LO |
| MI | Negative | Less than |
| PL | Positive or zero | Greater than or equal, or unordered |
| VS | Overflow | Unordered (at least one NaN operand) |
| VC | No overflow | Not unordered |
1. Bits 27–31, 0–4, and 7 do not need to be preserved.
2. Subroutines may modify bits 8–12, 15, and 22–25, but the practice is discouraged. These bits should only be changed by specific support subroutines which change the global state of the program. If they are modified within a subroutine, then their original value must be restored before the function returns or calls another function.
3. Bits 16–18 and bits 20–21 may be changed by a subroutine, but must be set to zero before the function returns or calls another function.
4. All other bits are reserved for future use and must not be modified.
Floating point operations are complex, and there are many special cases, such as dealing with NaNs, infinities, and subnormals. These special cases are a normal part of performing floating point math, but they are relatively infrequent. In order to simplify the hardware, many special situations which occur infrequently are handled by software. When one of these exceptional situations occurs, the VFP hardware sets the appropriate flags in the FPSCR and generates an interrupt. The ARM CPU then executes an interrupt handler to deal with the exceptional situation. When the handler finishes, it returns to the point where the exception occurred and execution resumes just as if the situation had been dealt with by the hardware. This approach is taken by many processor architectures to reduce the complexity, cost, and/or power consumption of the floating point hardware. This approach also allows the programmer to make a trade-off between performance and strict IEEE 754 compliance.
The support code for dealing with VFP exceptions is included in most ARM-based operating systems. Even bare-metal embedded systems can include the VFP support service routines. With the support code enabled, the VFP coprocessor is fully compliant with the IEEE 754 standard. However, using the fully compliant mode does increase the average run-time for floating point code, and increases the size of the operating system kernel or embedded system code.
When all of the VFP exceptions are disabled, Default NaN mode is enabled, and Flush-to-Zero is enabled, the VFP is not fully compliant with the IEEE 754 standard. However, floating point code runs significantly faster. For that reason, the state when bits 8–12 and bit 15 are set to zero while bits 24 and 25 are set to one is referred to as RunFast mode. There is some loss of accuracy for very small values, but the hardware no longer has to check for many of the conditions that may stall the floating point pipeline. This results in fewer stalls and much higher throughput in the hardware, as well as eliminating the necessity to handle exceptions in software. Many other floating point architectures have similar modes, so the GCC developers have found it worthwhile to provide programmers with the option of using them. User applications can be compiled to use this mode with GCC by using the -ffast-math and/or -Ofast options during compilation and linking. The startup code in the C standard library will then set the VFP to RunFast mode before calling the main function.
A VFP vector consists of up to eight single-precision registers, or up to four double-precision registers. All of the registers in a vector must be in the same bank. Also, vectors cannot be stored in Bank 0 or Bank 4. For example, registers s8 through s10 could be treated as a vector of three single-precision values. Registers s14 through s17 cannot be treated as a vector because some of those registers are in Bank 1 and others are in Bank 2. Registers d0 through d3 cannot be treated as a vector because they are in Bank 0.
The LEN field in the FPSCR controls the length of vectors that are used for vector operations. In vector operations, the first register in the vector is given as the operand, and the remaining registers are inferred from the settings of LEN and STRIDE. The STRIDE field allows data to be interleaved. For example, if the stride is set to two, and length is set to four, then the vector starting at s8 would consist of registers s8, s10, s12, and s14, while the vector starting at s9 would consist of registers s9, s11, s13, and s15. If a vector runs off the end of a bank, then the address wraps around to the first register in the bank. For example, if length is set to six and stride is set to one, then the vector starting at s13 would consist of s13, s14, s15, s8, s9, and s10, in that order.
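The register-selection rule can be modeled in a few lines of C: the bank (upper bits of the register number) stays fixed, while the offset within the 8-register bank advances by the stride and wraps. A sketch for single-precision registers:

```c
/* Return the i-th single-precision register number in a VFP vector
   that starts at register s with the given stride. The offset wraps
   within the 8-register bank, as described in the text. */
int vfp_vector_reg(int s, int stride, int i) {
    int bank = s & ~7;                  /* bank base: s0, s8, s16, or s24 */
    int offset = (s + i * stride) & 7;  /* position wraps within the bank */
    return bank + offset;
}
```

For the example above, a length-6, stride-1 vector starting at s13 visits s13, s14, s15, s8, s9, s10.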
The vector-capable data-processing instructions have one of the following two forms:

Op{<cond>}.<prec> Fd, Fm
Op{<cond>}.<prec> Fd, Fn, Fm
where Op is the VFP instruction, Fd is the destination register (or the first register in a vector), Fn is an operand register (or the first register in a vector), and Fm is an operand register (or the first register in a vector). Most data-processing instructions can operate in scalar mode, mixed mode, or vector mode. The mode depends on the LEN bits in the FPSCR, as well as on which register banks contain the destination and operand(s).
• The operation is scalar if the LEN field is set to zero (scalar mode) or the destination operand, Fd, is in Bank 0 or Bank 4. The operation acts on Fm (and Fn if the operation uses two operands) and places the result in Fd.
• The operation is mixed if the LEN field is not set to zero and Fm is in Bank 0 or Bank 4 but Fd is not. If the operation has only one operand, then the operation is applied to Fm and copies of the result are stored into each register in the destination vector. If the operation has two operands, then it is applied with the scalar Fm and each element in the vector starting at Fn, and the result is stored in the vector beginning at Fd.
• The operation is vector if the LEN field is not set to zero and neither Fd nor Fm is in Bank 0 or Bank 4. If the operation has only one operand, then the operation is applied to the vector starting at Fm and the results are placed in the vector starting at Fd. If the operation has two operands, then it is applied with corresponding elements from the vectors starting at Fm and Fn, and the result is stored in the vector beginning at Fd.
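The three modes can be illustrated with a short sketch (the register choices here are illustrative, not from the text), assuming the LEN field has already been set to 3 so that vectors have four elements:

```asm
vadd.f32 s0,  s1,  s2     @ scalar: Fd (s0) is in Bank 0
vadd.f32 s8,  s16, s2     @ mixed:  s8-s11 = s16-s19 + scalar s2
vadd.f32 s8,  s16, s24    @ vector: s8-s11 = s16-s19 + s24-s27
```

In the mixed form, the scalar operand s2 lies in Bank 0, so it is applied to every element of the vector starting at s16.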
As with the integer registers, there are rules for using the VFP registers. These rules are a convention, and following the convention ensures interoperability between code written by different programmers and compilers. Registers s16 through s31 are non-volatile. This implies that d8 through d15 are also non-volatile, since they are really the same registers. The contents of these registers must be preserved across subroutine calls. The remaining registers (s0 through s15, also known as d0 through d7) are volatile. They are used for passing arguments, returning results, and for holding local variables. They do not need to be preserved by subroutines. If registers d16 through d31 are present, then they are also considered volatile.
In addition to the FPSCR, all VFP implementations contain at least two additional system registers. The Floating-point System ID register (FPSID) is a read-only register whose value indicates which VFP implementation is being provided. The contents of the FPSID can be transferred to an ARM integer register, then examined to determine which VFP version is available. There is also a Floating-point Exception register (FPEXC). Two bits of the FPEXC register provide system-level status and control. The remaining bits of this register are defined by the sub-architecture. These additional system registers should not be accessed by user applications.
The VFP provides several instructions for moving data between memory and the VFP registers. There are instructions for loading and storing single and double precision registers, and for moving multiple registers to or from memory. All of the load and store instructions require a memory address to be in one of the ARM integer registers.
The following instructions are used to load or store a single VFP register:
vldr Load VFP Register, and
vstr Store VFP Register.
• <op> may be either ld or st.
• Fd may be any single or double precision register.
• Rn may be any ARM integer register.
• <cond> is an optional condition code.
• <prec> may be either f32 or f64.
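As a hedged sketch (the label and offsets are assumed for illustration), loading and storing single registers looks like this:

```asm
ldr      r0, =val         @ address of a float in memory (label assumed)
vldr.f32 s0, [r0]         @ s0 = 32-bit value at [r0]
vldr.f64 d1, [r0, #8]     @ d1 = 64-bit value at [r0+8]; offsets are multiples of 4
vstr.f32 s0, [r0, #16]    @ store s0 at [r0+16]
```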

These instructions load or store multiple floating-point registers:
vldm Load Multiple VFP Registers, and
vstm Store Multiple VFP Registers.
As with the integer ldm and stm instructions, there are multiple versions for use in moving data and accessing stacks.
• <op> may be either ld or st.
• <mode> may be either:
ia Increment address after each transfer, or
db Decrement address before each transfer.
• Rn may be any ARM integer register.
• <cond> is an optional condition code.
• <prec> may be either f32 or f64.
• <list> may be any set of contiguous single precision registers, or any set of contiguous double precision registers.
• If mode is db then the ! is required.
• vpop <list> is equivalent to vldmia sp!, <list>.
• vpush <list> is equivalent to vstmdb sp!, <list>.
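A typical use, sketched here under the register-use rules described in this chapter (the subroutine body is only a placeholder), is preserving non-volatile VFP registers on the stack:

```asm
myfunc:
    vpush    {s16-s19}    @ same as vstmdb sp!, {s16-s19}
    vmov.f32 s16, s0      @ body may now use s16-s19 freely
    vpop     {s16-s19}    @ same as vldmia sp!, {s16-s19}
    bx       lr
```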
| Name | Description |
| vldmia | Load multiple registers from memory, starting at the address in Rn and incrementing the address after each load. If ! is present, Rn is updated to the final address. |
| vstmia | Store multiple registers to memory, starting at the address in Rn and incrementing the address after each store. If ! is present, Rn is updated to the final address. |
| vldmdb | Load multiple registers from memory, decrementing the address before each load. The ! is required, so Rn is always updated to the final (lowest) address. |
| vstmdb | Store multiple registers to memory, decrementing the address before each store. The ! is required, so Rn is always updated to the final (lowest) address. |


These operations are vector-capable. For details on how to use vector mode, refer to Section 9.2.2. Instructions are provided to perform the four basic arithmetic functions, plus absolute value, negation, and square root. There are also special forms of the multiply instructions that perform multiply-accumulate.
The unary operations require one source operand and a destination register. The source and destination can be the same register. There are four unary operations:
vcpy Copy VFP Register (equivalent to move),
vabs Absolute Value,
vneg Negate, and
vsqrt Square Root.
• <op> is one of cpy, abs, neg, or sqrt.
• <cond> is an optional condition code.
• <prec> may be either f32 or f64.
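For example (a sketch; the register choices are arbitrary):

```asm
vcpy.f32  s1, s0          @ s1 = s0
vabs.f32  s2, s0          @ s2 = |s0|
vneg.f64  d3, d4          @ d3 = -d4
vsqrt.f32 s5, s5          @ s5 = sqrt(s5); source and destination may match
```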

The basic mathematical operations require two source operands and one destination. There are five basic mathematical operations:
vadd Add,
vsub Subtract,
vmul Multiply,
vnmul Negate and Multiply, and
vdiv Divide.
• <op> is one of add, sub, mul, nmul, or div.
• <cond> is an optional condition code.
• <prec> may be either f32 or f64.
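As a brief sketch (register assignments assumed), the binary operations can be combined to evaluate ax² + bx, with x in s0, a in s1, and b in s2:

```asm
vmul.f32  s3, s0, s0      @ s3 = x*x
vmul.f32  s3, s3, s1      @ s3 = a*x*x
vmul.f32  s4, s0, s2      @ s4 = b*x
vadd.f32  s3, s3, s4      @ s3 = a*x*x + b*x
```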

The compare instruction subtracts the value in Fm from the value in Fd and sets the flags in the FPSCR based on the result. The comparison operation will raise an exception if either operand is a signaling NaN. There is also a version of the instruction that will raise an exception if either operand is any type of NaN. The two comparison instructions are:
vcmp Compare, and
vcmpe Compare with Exception.
• If e is present, an exception is raised if either operand is any kind of NaN. Otherwise, an exception is raised only if either operand is a signaling NaN.
• <cond> is an optional condition code.
• <prec> may be either f32 or f64.
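A comparison is typically followed by a vmrs instruction to copy the FPSCR flags to the ARM flags before a conditional branch. A sketch, with the branch target assumed:

```asm
vcmp.f32  s0, s1           @ set FPSCR flags based on s0 - s1
vmrs      APSR_nzcv, fpscr @ copy FPSCR flags to the ARM APSR
bge       done             @ taken if s0 >= s1 (label assumed)
```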

With the addition of all of the VFP registers, there are many more possibilities for how data can be moved. There are many more registers, and VFP registers may be 32 or 64 bits wide. This results in several possible combinations for moving data among all of the registers. The VFP instruction set includes instructions for moving data between two VFP registers, between VFP and integer registers, and between the various system registers.
The most basic move instruction involving VFP registers simply moves data between two floating point registers. The instruction is:
vmov Move Between VFP Registers.
• Fd and Fm must be the same size.
• <cond> is an optional condition code.
• <prec> is either f32 or f64.

This version of the move instruction allows 32 bits of data to be moved between an ARM integer register and a floating point register. The instruction is:
vmov Move Between VFP and One ARM Integer Register.
• Rd is an ARM integer register.
• Sd is a VFP single precision register.
• <cond> is an optional condition code.

This version of the move instruction is used to transfer 64 bits of data between ARM integer registers and floating point registers:
vmov Move Between VFP and Two ARM Integer Registers.
• Source and destination must be VFP or integer registers. One of them must be a set of ARM integer registers, and the other must be VFP coprocessor registers. The following table shows the possible choices for sources and destinations.
• Sd and Sd’ must be adjacent, and Sd’ must be the higher-numbered register.
• <cond> is an optional condition code.
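The forms of vmov can be sketched as follows (register choices are illustrative):

```asm
vmov.f32 s1, s0           @ between two VFP registers, 32 bits
vmov.f64 d2, d3           @ between two VFP registers, 64 bits
vmov     r0, s4           @ single precision register to integer register
vmov     s4, r0           @ integer register to single precision register
vmov     r0, r1, d5       @ r0 = low word of d5, r1 = high word
vmov     d5, r0, r1       @ two integer registers to a double register
vmov     r2, r3, s6, s7   @ adjacent singles to two integer registers
```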

There are two instructions which allow the programmer to examine and change bits in the VFP system register(s):
vmrs Move From VFP System Register to ARM Register, and
vmsr Move From ARM Register to VFP System Register.
User programs should only access the FPSCR to check the flags and control vector mode.
• VFPsysreg can be any of the VFP system registers.
• Rd can be APSR_nzcv or any ARM integer register.
• <cond> is an optional condition code.
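For example, vector mode can be entered by setting the LEN field, which occupies bits 16–18 of the FPSCR (this read-modify-write sketch assumes a scratch register r0 is free):

```asm
vmrs  r0, fpscr           @ read the current FPSCR
bic   r0, r0, #0x00370000 @ clear LEN (bits 16-18) and STRIDE (bits 20-21)
orr   r0, r0, #0x00030000 @ LEN = 3, so vector length is LEN + 1 = 4
vmsr  fpscr, r0           @ write it back
```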

The ARM VFP provides several instructions for converting between various floating point and integer formats. Some VFP versions also have instructions for converting between fixed point and floating point formats.
These instructions are used to convert integers to single or double precision floating point, or for converting single or double precision to integer:
vcvt Convert Between Floating Point and Integer
vcvtr Convert Floating Point to Integer with Rounding
These instructions always use a single precision register for the integer, but the floating point argument can be single precision or double precision. Some versions of the VFP do not support the double precision versions.
• The optional r makes the operation use the rounding mode specified in the FPSCR. The default is to round toward zero.
• <cond> is an optional condition code.
• The <type> can be either u32 or s32 to specify unsigned or signed integer.
• These instructions can also convert from fixed point to floating point if followed by an appropriate vmul.
| Opcode | Description |
| vcvt.f64.s32 | Convert signed integer to double |
| vcvt.f32.s32 | Convert signed integer to single |
| vcvt.f64.u32 | Convert unsigned integer to double |
| vcvt.f32.u32 | Convert unsigned integer to single |
| vcvt.s32.f32 | Convert single to signed integer |
| vcvt.u32.f32 | Convert single to unsigned integer |
| vcvt.s32.f64 | Convert double to signed integer |
| vcvt.u32.f64 | Convert double to unsigned integer |
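A round-trip sketch (registers assumed), converting an integer in r0 to double precision and back:

```asm
vmov          s0, r0      @ move the integer bits into a VFP register
vcvt.f64.s32  d1, s0      @ d1 = (double)(signed int in s0)
vcvt.s32.f64  s2, d1      @ back to signed int, rounding toward zero
vcvtr.s32.f64 s3, d1      @ same, but using the FPSCR rounding mode
vmov          r1, s2      @ move the integer result back to an ARM register
```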

VFPv3 and higher coprocessors have additional instructions used for converting between fixed point and single precision floating point:
vcvt Convert To or From Fixed Point.
• <cond> is an optional condition code.
• <td> specifies the type and size of the fixed point number, and must be one of the following:
s32 signed 32 bit value,
u32 unsigned 32 bit value,
s16 signed 16 bit value, or
u16 unsigned 16 bit value.
• The #fbits operand specifies the number of fraction bits in the fixed point number, and must be less than or equal to the size of the fixed point number indicated by <td>.
| Name | Description |
| vcvt.s32.f32 | Convert single precision to 32-bit signed fixed point. |
| vcvt.u32.f32 | Convert single precision to 32-bit unsigned fixed point. |
| vcvt.s16.f32 | Convert single precision to 16-bit signed fixed point. |
| vcvt.u16.f32 | Convert single precision to 16-bit unsigned fixed point. |
| vcvt.f32.s32 | Convert signed 32-bit fixed point to single precision. |
| vcvt.f32.u32 | Convert unsigned 32-bit fixed point to single precision. |
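For instance, a signed 32-bit fixed point value with 16 fraction bits can be converted in place (note that in this form the source and destination are the same register):

```asm
vcvt.f32.s32  s0, s0, #16 @ fixed point (16 fraction bits) to single
vcvt.s32.f32  s0, s0, #16 @ single to fixed point with 16 fraction bits
```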

A fixed point implementation of the sine function was discussed in Section 8.7, and shown to be superior to the floating point sine function provided by GCC. Now that we have covered the VFP instructions, we can write an assembly version using floating point which also performs better than the routines provided by GCC.
Listing 9.1 shows a single precision floating point implementation of the sine function, using the ARM VFPv3 instruction set. It works in a similar way to the previous fixed point code. There is a table of constants, each of which is the reciprocal of one of the factorial divisors in the Taylor series for sine. The subroutine calculates the powers of x one-by-one, and multiplies each power by the next constant in the table, summing the results as it goes. Note that the table of constants is shorter than the fixed point version of the code, because there are fewer bits of precision in a single precision floating point number than there are in the fixed point representation that was used previously.

Listing 9.2 shows a double precision floating point implementation of the sine function, using the ARM VFPv3 instruction set. Again, there is a table of constants, each of which is the reciprocal of one of the factorial divisors in the Taylor series for sine. The subroutine calculates the powers of x one-by-one, and multiplies each power by the next constant in the table, summing the results as it goes. Note that the table of constants is longer than the fixed point version of the code, because there are more bits of precision in a double precision floating point number than there are in the fixed point representation that was used previously.

The previous implementations are already faster than the implementations provided by GCC. However, it may be possible to gain a little more performance by using VFP vector mode. In the single precision code, there are five terms to be added. Since single precision vectors can have up to eight elements, the code should not require any loop at all.
Listing 9.3 shows a single precision floating point implementation of the sine function, using the ARM VFPv3 instruction set in vector mode. It performs the same operations as the previous implementation, but instead of using a loop, all of the data is pre-loaded into vector banks and then a vector multiply operation is performed. The processor is then returned to scalar mode, and the summation is performed. This implementation is slightly faster than the previous version.

Listing 9.4 shows a double precision floating point implementation of the sine function, using the ARM VFPv3 instruction set in vector mode. It performs the same operations as the previous implementation, but performs the nine multiplications in three groups of three, using vector operations. Also, computing the powers of x is done within the loop, using a vector multiply. In this case, the vector code is significantly faster than the scalar version.



Table 9.2 shows the performance of various implementations of the sine function, with and without compiler optimization. The Single Precision C and Double Precision C implementations are the standard implementations provided by GCC.
Table 9.2
Performance of sine function with various implementations
| Optimization | Implementation | CPU seconds |
| None | Single Precision Scalar Assembly | 2.96 |
| None | Single Precision Vector Assembly | 2.63 |
| None | Single Precision C | 8.75 |
| None | Double Precision Scalar Assembly | 4.59 |
| None | Double Precision Vector Assembly | 3.75 |
| None | Double Precision C | 9.21 |
| Full | Single Precision Scalar Assembly | 2.16 |
| Full | Single Precision Vector Assembly | 2.06 |
| Full | Single Precision C | 2.59 |
| Full | Double Precision Scalar Assembly | 3.88 |
| Full | Double Precision Vector Assembly | 3.16 |
| Full | Double Precision C | 8.49 |
When compiler optimization is not used, the single precision scalar VFP implementation achieves a speedup of about 2.96, and the vector implementation achieves a speedup of about 3.33 compared to the GCC implementation. The double precision scalar VFP implementation achieves a speedup of about 2.01, and the vector implementation achieves a speedup of about 2.46 compared to the GCC implementation.
When the best possible compiler optimization is used (-Ofast), the single precision scalar VFP implementation achieves a speedup of about 1.20, and the vector implementation achieves a speedup of about 1.26 compared to the GCC implementation. The double precision scalar VFP implementation achieves a speedup of about 2.19, and the vector implementation achieves a speedup of about 2.69 compared to the GCC implementation.
In most cases, the assembly versions were significantly faster than the functions provided by GCC. GCC with full optimization using single-precision numbers was competitive, but the assembly language vector implementation still beat it by over 25%. It is clear that writing some functions in assembly can result in large performance gains.
| Name | Operation |
| vabs | Absolute Value |
| vadd | Add |
| vcmp | Compare |
| vcmpe | Compare with Exception |
| vcpy | Copy VFP Register |
| vcvt | Convert Between Floating Point and Integer |
| vcvt | Convert To or From Fixed Point |
| vcvtr | Convert Floating Point to Integer with Rounding |
| vdiv | Divide |
| vldm | Load Multiple VFP Registers |
| vldr | Load VFP Register |
| vmov | Move Between VFP and One ARM Integer Register |
| vmov | Move Between VFP and Two ARM Integer Registers |
| vmov | Move Between VFP Registers |
| vmrs | Move From VFP System Register to ARM Register |
| vmsr | Move From ARM Register to VFP System Register |
| vmul | Multiply |
| vneg | Negate |
| vnmul | Negate and Multiply |
| vsqrt | Square Root |
| vstm | Store Multiple VFP Registers |
| vstr | Store VFP Register |
| vsub | Subtract |
The ARM VFP coprocessor adds a great deal of power to the ARM architecture. The register set is expanded to hold up to four times the amount of data that can be held in the ARM integer registers. The additional instructions allow the programmer to deal directly with the most common IEEE 754 formats for floating point numbers. The ability to treat groups of registers as vectors adds a significant performance improvement. Access to the vector features is only possible through assembly language. The GCC compiler is not capable of using these advanced features, which gives the assembly programmer a big advantage when high-performance code is needed.
9.1 How many registers does the VFP coprocessor add to the ARM architecture?
9.2 What is the purpose of the FZ, DN, IDE, IXE, UFE, OFE, DZE, and IOE bits in the FPSCR? What is it called when FZ and DN are set to one and all of the others are set to zero?
9.3 If a VFP coprocessor is present, how are floating point parameters passed to subroutines? How is a pointer to a floating point value (or array of values) passed to a subroutine?
9.4 Write the following C code in ARM assembly:

9.5 In the previous exercise, the C code contains a subtle bug.
b. Show two ways to fix the code in ARM assembly. Hint: One way is to change the amount of the increment, which will change the number of times that the loop executes.
9.6 The fixed point sine function from the previous chapter was not compared directly to the hand-coded VFP implementation. Based on the information in Tables 9.2 and 8.4, would you expect the fixed point sine function from the previous chapter to beat the hand-coded assembly VFP sine function in this chapter? Why or why not?
9.7 3-D objects are often stored as an array of points, where each point is a vector (array) consisting of four values, x, y, z, and the constant 1.0. Rotation, translation, scaling and other operations are accomplished by multiplying each point by a 4 × 4 transformation matrix. The following C code shows the data types and the transform operation:

Write the equivalent ARM assembly code.
9.8 Optimize the ARM assembly code you wrote in the previous exercise. Use vector mode if possible.
9.9 Since the fourth element of the point is always 1.0, there is no need to actually store it. This will reduce memory requirements by about 25%, and require one fewer multiply. The C code would look something like this:

Write optimal ARM VFP code to implement this function.
9.10 The function in the previous problem would typically be called multiple times to process an array of points, as in the following function:

This could be somewhat inefficient. Re-write this function in assembly so that the transformation of each point is done without resorting to a function call. Make your code as efficient as possible.
This chapter begins with an overview of the NEON extensions and the relationship between VFP and NEON. The NEON registers are described, along with the syntax for NEON instructions. Next, each of the NEON instructions is explained, with short examples. In some cases, extended examples and figures are provided to help explain the operation of complex instructions. After all of the instructions have been covered, another implementation of sine is presented and compared with the previous implementations and with the GCC sine function. It is shown that NEON gives a significant performance advantage over VFP, and that hand-coded assembly is much faster than the sin function provided by the compiler.
Single instruction multiple data (SIMD); Vector; Vector element; Instruction level parallelism; Lane
The ARM VFP coprocessor has been replaced or augmented by the NEON architecture on ARMv7 and higher systems. NEON extends the VFP instruction set with about 125 instructions and pseudo-instructions to support not only floating point, but also integer and fixed point. NEON also supports Single Instruction, Multiple Data (SIMD) operations. All NEON processors have the full set of 32 double precision VFP registers, but NEON adds the ability to view the register set as 16 128-bit (quadruple-word) registers, named q0 through q15.
A single NEON instruction can operate on up to 128 bits, which may represent multiple integer, fixed point, or floating point numbers. For example, if two of the 128-bit registers each contain eight 16-bit integers, then a single NEON instruction can add all eight integers from one register to the corresponding integers in the other register, resulting in eight simultaneous additions. For certain applications, this SIMD architecture can result in extremely fast and efficient implementations. NEON is particularly useful for handling streaming video and audio, and can also give very good performance on floating point intensive tasks. NEON instructions perform parallel operations on vectors. NEON deprecates the use of VFP vector mode covered in Section 9.2.2. On most NEON systems, using the VFP vector mode will result in an exception, which transfers control to the support code that emulates vector mode in software. This causes a severe performance penalty, so VFP vector mode should not be used on NEON systems.
Fig. 10.1 shows the ARM integer, VFP, and NEON register set. NEON views each register as containing a vector of 1, 2, 4, 8, or 16 elements, all of the same size and type. Individual elements of each vector can also be accessed as scalars. A scalar can be 8 bits, 16 bits, 32 bits, or 64 bits. The instruction syntax is extended to refer to scalars using an index, x, in a doubleword register. Dm[x] is element x in register Dm. The size of the elements is given as part of the instruction. Instructions that access scalars can access any element in the register bank.

The GCC compiler gives C (and C++) programs direct access to the NEON instructions through the NEON intrinsics. The intrinsics are a large set of functions that are built into the compiler. Most of the intrinsic functions map to a single NEON instruction. Additional functions are provided for typecasting (reinterpreting) NEON vectors, so that the C compiler does not complain about mismatched types. It is usually shorter and more efficient to write the NEON code directly as assembly language functions and link them to the C code. However, only those who know assembly language are capable of doing that.
Some instructions require specific register types. Other instructions allow the programmer to choose single word, double word, or quad word registers. If the instruction requires single precision registers, then the registers are specified as Sd for the destination register, Sn for the first operand register, and Sm for the second operand register. If the instruction requires only two registers, then Sn is not used. The lower-case letter is replaced with a valid register number. The register name is not case sensitive, so S10 and s10 are both valid names for single precision register 10.
The syntax of the NEON instructions can be described using a relatively simple notation. The notation consists of the following elements:
{item} Braces around an item indicate that the item is optional. For example, many operations have an optional condition, which is written as {<cond>}.
Ry An ARM integer register. y can be any number in the range 0-15.
Sy A 32-bit or single precision register. y can be any number in the range 0-31.
Dy A 64-bit or double precision register. y can be any number in the range 0-31.
Qy A quad word register. y can be any number in the range 0-15.
Fy A VFP register. F must be either s for a single word register, or d for a double word register. y can be any valid register number.
Ny A NEON or VFP register. N must be either s for a single word register, d for a double word register, or q for a quad word register. y can be any valid register number.
Vy A NEON vector register. V must be replaced with d for a double word register, or q for a quad word register. y can be any valid register number.
Vy[x] A NEON scalar (vector element). The size of the scalar is defined as part of the instruction. V must be replaced with d for a double word register, or q for a quad word register. y can be any valid register number. x specifies which scalar element of Vy is to be used. Valid values for x can be deduced by the size of Vy and the size of the scalars that the instruction uses.
<op> Operation specific part of a general instruction format
<n> An integer usually indicating a specific instruction version
<size> An integer indicating the number of bits used
<cond> ARM condition code from Table 3.2
<type> Many instructions operate on one or more of the following specific data types:
i16 Untyped 16 bits
i32 Untyped 32 bits
i64 Untyped 64 bits
s8 Signed 8-bit integer
s16 Signed 16-bit integer
s32 Signed 32-bit integer
s64 Signed 64-bit integer
u8 Unsigned 8-bit integer
u16 Unsigned 16-bit integer
u32 Unsigned 32-bit integer
u64 Unsigned 64-bit integer
f16 IEEE 754 half precision floating point
f32 IEEE 754 single precision floating point
f64 IEEE 754 double precision floating point
<list> A brace-delimited list of up to four NEON registers, vectors, or scalars. The general form is {Dn,D(n+a),D(n+2a),D(n+3a)} where a is either 1 or 2.
<align> Specifies the memory alignment of structured data for certain load and store operations.
<imm> An immediate value. The required format for immediate values depends on the instruction.
<fbits> Specifies the number of fraction bits in fixed point numbers.
The following function definitions are used in describing the effects of many of the instructions:
The floor function maps a real number, x, to the largest integer that is less than or equal to x.
The saturate function limits the value of x to the highest or lowest value that can be stored in the destination register.
The round function maps a real number, x, to the nearest integer.
The narrow function reduces a 2n bit number to an n bit number, by taking the n least significant bits.
The extend function converts an n bit number to a 2n bit number, performing zero extension if the number is unsigned, or sign extension if the number is signed.
These instructions can be used to perform interleaving of data when structured data is loaded or stored. The data should be properly aligned for best performance. These instructions are very useful for common multimedia data types.
For example, image data is typically stored in arrays of pixels, where each pixel is a small data structure such as the pixel struct shown in Listing 5.37. Since each pixel is three bytes, and a d register is 8 bytes, loading a single pixel into one register would be inefficient. It would be much better to load multiple pixels at once, but an even number of pixels will not fit in a register. It will take three doubleword or quadword registers to hold an even number of pixels without wasting space, as shown in Fig. 10.2. This is the way data would be loaded using a VFP vldr or vldm instruction. Many image processing operations work best if each color “channel” is processed separately. The NEON load and store vector instructions can be used to split the image data into color channels, where each channel is stored in a different register, as shown in Fig. 10.3.


Other examples of interleaved data include stereo audio, which is two interleaved channels, and surround sound, which may have up to nine interleaved channels. In all of these cases, most processing operations are simplified when the data is separated into non-interleaved channels.
These instructions are used to load and store structured data across multiple registers:
vld<n> Load Structured Data, and
vst<n> Store Structured Data.
They can be used for interleaving or deinterleaving the data as it is loaded or stored, as shown in Fig. 10.3.
• <op> must be either ld or st.
• <n> must be one of 1, 2, 3, or 4.
• <size> must be one of 8, 16, or 32.
• <list> specifies the list of registers. There are four list formats:
1. {Dd[x]}
2. {Dd[x], D(d+a)[x]}
3. {Dd[x], D(d+a)[x], D(d+2a)[x]}
4. {Dd[x], D(d+a)[x], D(d+2a)[x], D(d+3a)[x]}
where a can be either 1 or 2. Every register in the list must be in the range d0-d31.
• Rn is the ARM register containing the base address. Rn cannot be pc.
• <align> specifies an optional alignment. If <align> is not specified, then standard alignment rules apply.
• The optional ! indicates that Rn is updated after the data is transferred. This is similar to the ldm and stm instructions.
• Rm is an ARM register containing an offset from the base address. If Rm is present, Rn is updated to Rn + Rm after the address is used to access memory. Rm cannot be sp or pc.
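For example, one three-byte pixel can be deinterleaved into lane 0 of three registers with a single instruction (a sketch; r0 is assumed to hold the pixel's address):

```asm
vld3.8  {d0[0], d1[0], d2[0]}, [r0]   @ d0[0]=red, d1[0]=green, d2[0]=blue
vst3.8  {d0[0], d1[0], d2[0]}, [r0]!  @ re-interleave, store, and advance r0
```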
Table 10.1 shows all valid combinations of parameters for these instructions. Note that the same vector element (scalar) x must be used in each register. Up to four registers can be specified. If the structure has more than four fields, then these instructions can be used repeatedly to load or store all of the fields.
Table 10.1
Parameter combinations for loading and storing a single structure
| <n> | <size> | <list> | <align> | Alignment |
| 1 | 8 | Dd[x] | Standard only | |
| 1 | 16 | Dd[x] | 16 | 2 byte |
| 1 | 32 | Dd[x] | 32 | 4 byte |
| 2 | 8 | Dd[x], D(d+1)[x] | 16 | 2 byte |
| 2 | 16 | Dd[x], D(d+1)[x] | 32 | 4 byte |
| 2 | 16 | Dd[x], D(d+2)[x] | 32 | 4 byte |
| 2 | 32 | Dd[x], D(d+1)[x] | 64 | 8 byte |
| 2 | 32 | Dd[x], D(d+2)[x] | 64 | 8 byte |
| 3 | 8 | Dd[x], D(d+1)[x], D(d+2)[x] | Standard only | |
| 3 | 16 or 32 | Dd[x], D(d+1)[x], D(d+2)[x] | Standard only | |
| 3 | 16 or 32 | Dd[x], D(d+2)[x], D(d+4)[x] | Standard only | |
| 4 | 8 | Dd[x], D(d+1)[x], D(d+2)[x], D(d+3)[x] | 32 | 4 byte |
| 4 | 16 | Dd[x], D(d+1)[x], D(d+2)[x], D(d+3)[x] | 64 | 8 byte |
| 4 | 16 | Dd[x], D(d+2)[x], D(d+4)[x], D(d+6)[x] | 64 | 8 byte |
| 4 | 32 | Dd[x], D(d+1)[x], D(d+2)[x], D(d+3)[x] | 64 or 128 | (<align> ÷ 8) bytes |
| 4 | 32 | Dd[x], D(d+2)[x], D(d+4)[x], D(d+6)[x] | 64 or 128 | (<align> ÷ 8) bytes |

| Name | Description |
| vld<n> | Load one or more data items from memory into a single lane of each register in <list>. If ! is present, Rn is updated past the data transferred; otherwise, if Rm is specified, Rn is updated to Rn + Rm. |
| vst<n> | Store one or more data items to memory from a single lane of each register in <list>. If ! is present, Rn is updated past the data transferred; otherwise, if Rm is specified, Rn is updated to Rn + Rm. |

This instruction is used to load multiple copies of structured data across multiple registers:
vld<n> Load Copies of Structured Data.
The data is copied to all lanes. This instruction is useful for initializing vectors for use in later instructions.
• <n> must be one of 1, 2, 3, or 4.
• <size> must be one of 8, 16, or 32.
• <list> specifies the list of registers. There are four list formats:
1. {Dd[]}
2. {Dd[], D(d+a)[]}
3. {Dd[], D(d+a)[], D(d+2a)[]}
4. {Dd[], D(d+a)[], D(d+2a)[], D(d+3a)[]}
where a can be either 1 or 2. Every register in the list must be in the range d0-d31.
• Rn is the ARM register containing the base address. Rn cannot be pc.
• <align> specifies an optional alignment. If <align> is not specified, then standard alignment rules apply.
• The optional ! indicates that Rn is updated after the data is transferred. This is similar to the ldm and stm instructions.
• Rm is an ARM register containing an offset from the base address. If Rm is present, Rn is updated to Rn + Rm after the address is used to access memory. Rm cannot be sp or pc.
Table 10.2 shows all valid combinations of parameters for this instruction. Note that the vector element number is not specified, but the brackets [] must be present. Up to four registers can be specified. If the structure has more than four fields, then this instruction can be repeated to load or store all of the fields.
Table 10.2
Parameter combinations for loading copies of a structure
| <n> | <size> | <list> | <align> | Alignment |
| 1 | 8 | Dd[] | Standard only | |
| | | Dd[], D(d+1)[] | Standard only | |
| | 16 | Dd[] | 16 | 2 byte |
| | | Dd[], D(d+1)[] | 16 | 2 byte |
| | 32 | Dd[] | 32 | 4 byte |
| | | Dd[], D(d+1)[] | 32 | 4 byte |
| 2 | 8 | Dd[], D(d+1)[] | 8 | 1 byte |
| | | Dd[], D(d+2)[] | 8 | 1 byte |
| | 16 | Dd[], D(d+1)[] | 16 | 2 byte |
| | | Dd[], D(d+2)[] | 16 | 2 byte |
| | 32 | Dd[], D(d+1)[] | 32 | 4 byte |
| | | Dd[], D(d+2)[] | 32 | 4 byte |
| 3 | 8, 16, or 32 | Dd[], D(d+1)[], D(d+2)[] | Standard only | |
| | | Dd[], D(d+2)[], D(d+4)[] | Standard only | |
| 4 | 8 | Dd[], D(d+1)[], D(d+2)[], D(d+3)[] | 32 | 4 byte |
| | | Dd[], D(d+2)[], D(d+4)[], D(d+6)[] | 32 | 4 byte |
| | 16 | Dd[], D(d+1)[], D(d+2)[], D(d+3)[] | 64 | 8 byte |
| | | Dd[], D(d+2)[], D(d+4)[], D(d+6)[] | 64 | 8 byte |
| | 32 | Dd[], D(d+1)[], D(d+2)[], D(d+3)[] | 64 or 128 | (<align> ÷ 8) bytes |
| | | Dd[], D(d+2)[], D(d+4)[], D(d+6)[] | 64 or 128 | (<align> ÷ 8) bytes |


These instructions are used to load and store multiple data structures across multiple registers with interleaving or deinterleaving:
vld<n> Load Multiple Structured Data, and
vst<n> Store Multiple Structured Data.
• <op> must be either ld or st.
• <n> must be one of 1, 2, 3, or 4.
• <size> must be one of 8, 16, or 32.
• <list> specifies the list of registers. There are four list formats:
1. {Dd}
2. {Dd, D(d+a)}
3. {Dd, D(d+a), D(d+2a)}
4. {Dd, D(d+a), D(d+2a), D(d+3a)}
where a can be either 1 or 2. Every register in the list must be in the range d0-d31.
• Rn is the ARM register containing the base address. Rn cannot be pc.
• <align> specifies an optional alignment. If <align> is not specified, then standard alignment rules apply.
• The optional ! indicates that Rn is updated after the data is transferred, similar to the ldm and stm instructions.
• Rm is an ARM register containing an offset from the base address. If Rm is present, Rn is updated to Rn + Rm after the address is used to access memory. Rm cannot be sp or pc.
Table 10.3 shows all valid combinations of parameters for these instructions. Note that no vector element (scalar) is specified; the instructions operate on all of the vector elements. Up to four registers can be specified. If the structure has more than four fields, then these instructions can be repeated to load or store all of the fields.
Table 10.3
Parameter combinations for loading and storing multiple structures
| <n> | <size> | <list> | <align> | Alignment |
| 1 | 8, 16, 32, or 64 | Dd | 64 | 8 bytes |
| | | Dd, D(d+1) | 64 or 128 | (<align> ÷ 8) bytes |
| | | Dd, D(d+1), D(d+2) | 64 | 8 bytes |
| | | Dd, D(d+1), D(d+2), D(d+3) | 64, 128, or 256 | (<align> ÷ 8) bytes |
| 2 | 8, 16, or 32 | Dd, D(d+1) | 64 or 128 | (<align> ÷ 8) bytes |
| | | Dd, D(d+2) | 64 or 128 | (<align> ÷ 8) bytes |
| | | Dd, D(d+1), D(d+2), D(d+3) | 64, 128, or 256 | (<align> ÷ 8) bytes |
| 3 | 8, 16, or 32 | Dd, D(d+1), D(d+2) | 64 | 8 bytes |
| | | Dd, D(d+2), D(d+4) | 64 | 8 bytes |
| 4 | 8, 16, or 32 | Dd, D(d+1), D(d+2), D(d+3) | 64, 128, or 256 | (<align> ÷ 8) bytes |
| | | Dd, D(d+2), D(d+4), D(d+6) | 64, 128, or 256 | (<align> ÷ 8) bytes |

| Name | Effect | Description |
| vld<n> | for 0 ≤ x < nlanes do for D ∈ regs(<list>) do D[x] ← memory[addr]; addr ← addr + (<size> ÷ 8) end for end for; if ! is present then Rn ← Rn + (total bytes transferred) else if Rm is specified then Rn ← Rn + Rm end if end if | Load multiple structures from memory, deinterleaving the fields into the registers in <list> |
| vst<n> | for 0 ≤ x < nlanes do for D ∈ regs(<list>) do memory[addr] ← D[x]; addr ← addr + (<size> ÷ 8) end for end for; if ! is present then Rn ← Rn + (total bytes transferred) else if Rm is specified then Rn ← Rn + Rm end if end if | Store multiple structures to memory, interleaving the fields from the registers in <list> |


Because they use the same set of registers, VFP and NEON share some instructions for loading, storing, and moving registers. The shared instructions are vldr, vstr, vldm, vstm, vpop, vpush, vmov, vmrs, and vmsr. These were explained in Chapter 9. NEON extends the vmov instructions to allow specification of NEON scalars and quadwords, and adds the ability to perform one’s complement during a move.
This version of the move instruction allows data to be moved between the NEON registers and the ARM integer registers as 8-bit, 16-bit, or 32-bit NEON scalars:
vmov Move Between NEON and ARM.
• <cond> is an optional condition code.
• <size> must be 8, 16, or 32, and specifies the number of bits that are to be moved.
• The <type> must be u8, u16, u32, s8, s16, s32, or f32, and specifies the number of bits that are to be moved and whether or not the result should be sign-extended in the ARM integer destination register.

NEON extends the VFP vmov instruction to include the ability to move an immediate value, or the one’s complement of an immediate value, to every element of a register. The instructions are:
vmov Move Immediate, and
vmvn Move Immediate NOT.
• <op> must be either mov or mvn.
• <type> must be i8, i16, i32, f32, or i64, and specifies the size of items in the vector.
• V can be s, d, or q.
• <imm> is an immediate value that matches <type>, and is copied to every element in the vector. The following table shows valid formats for imm:
| <type> | vmov | vmvn |
| i8 | 0xXY | 0xXY |
| i16 | 0x00XY | 0xFFXY |
| 0xXY00 | 0xXYFF | |
| i32 | 0x000000XY | 0xFFFFFFXY |
| 0x0000XY00 | 0xFFFFXYFF | |
| 0x00XY0000 | 0xFFXYFFFF | |
| 0xXY000000 | 0xXYFFFFFF | |
| i64 | 0xABCDEFGH | 0xABCDEFGH |
| | Each letter represents a byte, and must be either FF or 00 | |
| f32 | Any number that can be written as ± n × 2^(− r), where n and r are integers, such that 16 ≤ n ≤ 31 and 0 ≤ r ≤ 7 | |
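The byte-pattern rule for i32 immediates can be sketched in a few lines of Python. This is an illustration of the table above, not an encoder for the real instruction, and the function names are ours:

```python
def vmov_i32_encodable(imm):
    """True if imm matches 0x000000XY, 0x0000XY00, 0x00XY0000, or 0xXY000000."""
    imm &= 0xFFFFFFFF
    return any((imm & ~(0xFF << shift)) == 0 for shift in (0, 8, 16, 24))

def vmvn_i32_encodable(imm):
    """vmvn stores the one's complement, so imm is usable when ~imm is encodable."""
    return vmov_i32_encodable(~imm & 0xFFFFFFFF)
```

For example, 0x00AB0000 fits the vmov pattern, while 0xFFFFFF12 fits the vmvn pattern because its complement has only one nonzero byte.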


It is sometimes useful to increase or decrease the number of bits per element in a vector. NEON provides these instructions to convert a doubleword vector with elements of size y to a quadword vector with elements of size 2y, or to perform the inverse operation:
vmovl Move and Lengthen,
vmovn Move and Narrow,
vqmovn Saturating Move and Narrow, and
vqmovun Saturating Move and Narrow Unsigned.
• The valid choices for <type> are given in the following table:
| Opcode | Valid Types |
| vmovl | s8, s16, s32, u8, u16, or u32 |
| vmovn | i8, i16, or i32 |
| vqmovn | s8, s16, s32, u8, u16, or u32 |
| vqmovun | s8, s16, or s32 |
• q indicates that the results are saturated.
| Name | Effect | Description |
| vmovl | for 0 ≤ i < n do Qd[i] ← extend(Dm[i]) end for | Sign or zero extends (depending on <type>) each element of a doubleword vector to twice its length |
| v{q}movn | for 0 ≤ i < n do if q is present then Dd[i] ← saturate(Qm[i]) else Dd[i] ← low_half(Qm[i]) end if end for | Copy the least significant half of each element of a quadword vector to the corresponding elements of a doubleword vector. If q is present, then the value is saturated |
| vqmovun | for 0 ≤ i < n do Dd[i] ← saturate_unsigned(Qm[i]) end for | Copy each element of the operand vector to the corresponding element of the destination vector. The destination element is unsigned, and the value is saturated |
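As a sketch of the narrowing behavior, here is a small Python model of vmovn.i16 and vqmovn.s16 narrowing 16-bit elements to 8 bits (the names are ours; the real instructions operate on NEON registers):

```python
def vmovn_i16(vec):
    """Keep only the low byte of each 16-bit element (result shown as signed)."""
    return [((x & 0xFF) ^ 0x80) - 0x80 for x in vec]

def vqmovn_s16(vec):
    """Narrow each signed 16-bit element to 8 bits, saturating out-of-range values."""
    return [max(-128, min(127, x)) for x in vec]
```

The plain move simply discards the upper bits, while the saturating move clamps values that do not fit, which is usually what signal-processing code wants.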


The duplicate instruction copies a scalar into every element of the destination vector. The scalar can be in a NEON register or an ARM integer register. The instruction is:
vdup Duplicate Scalar.
• <size> must be one of 8, 16, or 32.
• V can be d or q.
• Rm cannot be r15.

This instruction extracts 8-bit elements from two vectors and concatenates them. Fig. 10.4 gives an example of what this instruction does. The instruction is:
vext Extract.
• <size> must be one of 8, 16, 32, or 64.
• V can be d or q.
• <imm> is the number of elements to extract from the bottom of Vm. The remaining elements required to fill Vd are taken from the top of Vn.
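The extract-and-concatenate behavior is easy to model in Python. In this sketch, vn and vm stand for the two source registers as lists of eight byte-sized elements:

```python
def vext8(vn, vm, imm):
    """Model of vext.8 Dd, Dn, Dm, #imm: the bottom imm elements of Dn are
    dropped, and the result is filled out with the bottom imm elements of Dm."""
    assert len(vn) == len(vm) == 8 and 0 <= imm < 8
    return vn[imm:] + vm[:imm]
```

With imm = 3, the result is the top five elements of the first operand followed by the bottom three elements of the second.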

This instruction reverses the order of data in a register:
vrev<n> Reverse Vector Elements.
One use of this instruction is for converting data from big-endian to little-endian order, or from little-endian to big-endian order. It could also be useful for swapping data and transforming matrices. Fig. 10.5 shows three examples.

• <n> must be one of 16, 32, or 64.
• <size> is either 8, 16, or 32 and indicates the size of the elements to be reversed. <size> must be less than <n>.
• V can be q or d.
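The reversal pattern — reverse <size>-bit elements within each <n>-bit group — can be sketched in Python, treating a register as a list of elements:

```python
def vrev(n, size, elems):
    """Reverse size-bit elements within each n-bit group of a vector,
    where elems is the vector as a list of size-bit elements."""
    per_group = n // size
    out = []
    for i in range(0, len(elems), per_group):
        out.extend(reversed(elems[i:i + per_group]))
    return out
```

For example, vrev(32, 8, …) reverses the bytes within each 32-bit word, which is exactly the big-endian/little-endian conversion mentioned above.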

This instruction simply swaps two NEON registers:
vswp Swap Vectors.
• <type> can be any NEON data type. The assembler ignores the type, but it can be useful to the programmer as extra documentation.
• V can be q or d.

This instruction transposes 2 × 2 matrices:
vtrn Transpose.
Fig. 10.6 shows two examples of this instruction. Larger matrices can be transposed using a divide-and-conquer approach.

• <size> is either 8, 16, or 32 and indicates the size of the elements in the matrix (or matrices).
• V can be q or d.

Fig. 10.7 shows how the vtrn instruction can be used to transpose a 3 × 3 matrix. Transposing a 4 × 4 matrix requires the transposition of 13 2 × 2 matrices. However, this instruction can operate on multiple 2 × 2 sub-matrices in parallel, and can group elements into different sized sub-matrices. There is also a very useful swap instruction that can exchange the rows of a matrix. Using the swap and transpose instructions, transposing a 4 × 4 matrix of 16-bit elements can be done with only four instructions, as shown in Fig. 10.8.
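The 2 × 2 transpose pattern can be modeled in Python: each even/odd element pair of the two operand vectors forms one 2 × 2 sub-matrix, and the instruction swaps its off-diagonal elements. This is a sketch of the semantics, not ARM code:

```python
def vtrn(vn, vm):
    """Transpose the 2x2 sub-matrices whose rows are held across vn and vm."""
    vn, vm = vn[:], vm[:]
    for i in range(0, len(vn), 2):
        # swap the off-diagonal elements of the 2x2 block at columns i, i+1
        vn[i + 1], vm[i] = vm[i], vn[i + 1]
    return vn, vm
```

Applied to rows [1, 2] and [5, 6], the block [[1, 2], [5, 6]] becomes [[1, 5], [2, 6]], and a four-element register holds two such blocks side by side.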


The table lookup instructions use indices held in one vector to lookup values from a table held in one or more other vectors. The resulting values are stored in the destination vector. The table lookup instructions are:
vtbl Table Lookup, and
vtbx Table Lookup with Extend.
• <list> specifies the list of registers. There are five list formats:
1. {Dn},
2. {Dn, D(n+1)},
3. {Dn, D(n+1), D(n+2)},
4. {Dn, D(n+1), D(n+2), D(n+3)}, or
5. {Qn, Q(n+1)}.
• Dm is the register holding the indices.
• The table can contain up to 32 bytes.
| Name | Effect | Description |
| vtbl | for 0 ≤ i < 8 do r ← Dm[i]; if r > Maxr then Dd[i] ← 0 else Dd[i] ← Table[r] end if end for | Use indices in Dm to look up values in a table and store them in Dd. If an index is out of range, zero is stored in the corresponding destination element. |
| vtbx | for 0 ≤ i < 8 do r ← Dm[i]; if r ≤ Maxr then Dd[i] ← Table[r] end if end for | Use indices in Dm to look up values in a table and store them in Dd. If an index is out of range, the corresponding destination element is unchanged. |
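A Python sketch of the lookup semantics, where table stands for the concatenated bytes of the registers in <list>:

```python
def vtbl(table, indices):
    """Out-of-range indices yield zero in the destination."""
    return [table[i] if i < len(table) else 0 for i in indices]

def vtbx(dest, table, indices):
    """Out-of-range indices leave the destination element unchanged."""
    return [table[i] if i < len(table) else d for d, i in zip(dest, indices)]
```

The extend form is useful when a large table is processed in 32-byte pieces: each pass fills in only the lanes whose indices fall in the current piece.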


These instructions are used to interleave or deinterleave the data from two vectors:
vzip Zip Vectors, and
vuzp Unzip Vectors.
Fig. 10.9 gives an example of the vzip instruction. The vuzp instruction performs the inverse operation.

• <size> is either 8, 16, or 32 and indicates the size of the elements in the vectors.
• V can be q or d.
| Name | Effect | Description |
| vzip | for 0 ≤ i < n ÷ 2 do tmp[2i] ← Vn[i]; tmp[2i + 1] ← Vm[i]; tmp[n + 2i] ← Vn[(n ÷ 2) + i]; tmp[n + 2i + 1] ← Vm[(n ÷ 2) + i] end for; Vn ← tmp[0 … n − 1]; Vm ← tmp[n … 2n − 1] | Interleave (zip) the data from two vectors. tmp is a vector of suitable size. |
| vuzp | tmp ← Vn followed by Vm; for 0 ≤ i < n do Vn[i] ← tmp[2i]; Vm[i] ← tmp[2i + 1] end for | Deinterleave (unzip) the data from two vectors. tmp is a vector of suitable size. |
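The zip/unzip pair is easy to model in Python, and the model also shows that vuzp inverts vzip:

```python
def vzip(vn, vm):
    """Interleave two vectors; the low half of the result replaces vn."""
    tmp = [x for pair in zip(vn, vm) for x in pair]
    half = len(vn)
    return tmp[:half], tmp[half:]

def vuzp(vn, vm):
    """Deinterleave: even-numbered elements go to vn, odd-numbered to vm."""
    tmp = vn + vm
    return tmp[0::2], tmp[1::2]
```

This is the register-to-register counterpart of the interleaving performed by the structure load/store instructions earlier in the chapter.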


When high precision is not required, the IEEE half-precision format can be used to store floating point numbers in memory. This can reduce memory requirements by up to 50%, and can also result in a significant performance improvement, since only half as much data needs to be moved between the CPU and main memory. However, on most processors half-precision data must be converted to single precision before it is used in calculations. NEON provides enhanced versions of the vcvt instruction which support conversion to and from IEEE half precision. There are also versions of vcvt which operate on vectors, and perform integer or fixed-point to floating-point conversions.
This instruction can be used to perform a data conversion between single precision and fixed point on each element in a vector:
The elements in the vector must be 32-bit single precision floating point numbers or 32-bit integers. Fixed point (or integer) arithmetic operations are up to twice as fast as floating point operations. In some cases it is much more efficient to make this conversion, perform the calculations, then convert the results back to floating point.
• <cond> is an optional condition code.
• <type> must be either s32 or u32.
• The optional <fbits> operand specifies the number of fraction bits for a fixed point number, and must be between 0 and 32. If it is omitted, then it is assumed to be zero.
| Name | Effect | Description |
| vcvt.s32.f32 | Vd[i] ← round_toward_zero(Vm[i] × 2^<fbits>) | Convert single precision to 32-bit signed fixed point or integer. |
| vcvt.u32.f32 | Vd[i] ← round_toward_zero(Vm[i] × 2^<fbits>) | Convert single precision to 32-bit unsigned fixed point or integer. |
| vcvt.f32.s32 | Vd[i] ← Vm[i] ÷ 2^<fbits> | Convert signed 32-bit fixed point or integer to single precision |
| vcvt.f32.u32 | Vd[i] ← Vm[i] ÷ 2^<fbits> | Convert unsigned 32-bit fixed point or integer to single precision |
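A Python model of the signed conversions, with round-toward-zero on the float-to-fixed direction as in the table above (function names are ours):

```python
import math

def vcvt_s32_f32(vec, fbits=0):
    """Single precision to signed fixed point: scale by 2**fbits,
    truncate toward zero, and saturate to the 32-bit range."""
    out = []
    for x in vec:
        v = math.trunc(x * (1 << fbits))
        out.append(max(-2**31, min(2**31 - 1, v)))
    return out

def vcvt_f32_s32(vec, fbits=0):
    """Signed fixed point to single precision: divide by 2**fbits."""
    return [x / (1 << fbits) for x in vec]
```

With fbits = 8, the value 1.75 becomes the fixed point integer 448 (1.75 × 256), and converting back recovers 1.75 exactly because it is representable in 8 fraction bits.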

NEON systems with the half-precision extension provide the following instruction to perform conversion between single precision and half precision floating point formats:
vcvt Convert Between Half and Single.
• The <op> must be either b or t and specifies whether the top or bottom half of the register should be used for the half-precision number.
• <cond> is an optional condition code.
| Name | Effect | Description |
| vcvtb.f16.f32 | Sd<15:0> ← half(Sm) | Convert single precision to half precision and store in bottom half of destination |
| vcvtt.f16.f32 | Sd<31:16> ← half(Sm) | Convert single precision to half precision and store in top half of destination |
| vcvtb.f32.f16 | Sd ← single(Sm<15:0>) | Convert half precision number from bottom half of source to single precision |
| vcvtt.f32.f16 | Sd ← single(Sm<31:16>) | Convert half precision number from top half of source to single precision |
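Python's struct module supports the IEEE half-precision format directly (format character 'e'), so the conversions can be illustrated without NEON hardware:

```python
import struct

def f32_to_f16_bits(x):
    """Round a float to IEEE half precision and return the 16-bit pattern."""
    return struct.unpack('<H', struct.pack('<e', x))[0]

def f16_bits_to_f32(bits):
    """Widen a 16-bit half-precision pattern back to a float (always exact)."""
    return struct.unpack('<e', struct.pack('<H', bits))[0]
```

Narrowing can lose precision, but widening is exact; values such as 1.5 round-trip unchanged, and 1.0 encodes as the well-known half-precision pattern 0x3C00.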

NEON adds the ability to perform integer comparisons between vectors. Since there are multiple pairs of items to be compared, the comparison instructions set one element in a result vector for each pair of items. After the comparison operation, each element of the result vector will have every bit set to zero (for false) or one (for true). Note that if the elements of the result vector are interpreted as signed two’s-complement numbers, then the value 0 represents false and the value − 1 represents true.
The following instructions perform comparisons of all of the corresponding elements of two vectors in parallel:
vceq Compare Equal,
vcge Compare Greater Than or Equal,
vcgt Compare Greater Than,
vcle Compare Less Than or Equal, and
vclt Compare Less Than.
The vector compare instructions compare each element of a vector with the corresponding element in a second vector, and set an element in the destination vector for each comparison. If the comparison is true, then all bits in the result element are set to one. Otherwise, all bits in the result element are set to zero. Note that summing the elements of the result vector (as signed integers) gives the two’s complement of the number of comparisons which were true.
Note: vcle and vclt are actually pseudo-instructions. They are equivalent to vcgt and vcge with the operands reversed.
• <op> must be one of eq, ge, gt, le, or lt.
• If <op> is eq, then <type> must be i8, i16, i32, or f32.
• If <op> is not eq and the third operand is #0, then <type> must be s8, s16, s32, or f32.
• If <op> is not eq and the third operand is a register, then <type> must be s8, s16, s32, u8, u16, u32, or f32.
• The result data type is determined from the following table:
• If the third operand is #0, then it is taken to be a vector of the correct size in which every element is zero.
• V can be d or q.
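The all-ones/all-zeros mask convention, and the counting trick mentioned above, can be sketched in Python:

```python
def vcgt(vn, vm):
    """Per-element compare: -1 (all bits set) for true, 0 for false."""
    return [-1 if a > b else 0 for a, b in zip(vn, vm)]

# Summing the mask as signed integers gives the two's complement of the
# number of true comparisons, so negating the sum counts the true lanes.
mask = vcgt([5, 2, 9, 0], [3, 4, 1, 0])
true_count = -sum(mask)
```

The mask form is convenient because it feeds directly into the bitwise select instructions described later in this chapter.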

The following instructions perform comparisons between the absolute values of all of the corresponding elements of two vectors in parallel:
vacgt Absolute Compare Greater Than, and
vacge Absolute Compare Greater Than or Equal.
The vector absolute compare instruction compares the absolute value of each element of a vector with the absolute value of the corresponding element in a second vector, and sets an element in the destination vector for each comparison. If the comparison is true, then all bits in the result element are set to one. Otherwise, all bits in the result element are set to zero. Note that summing the elements of the result vector (as signed integers) will give the two’s complement of the number of comparisons which were true.
• <op> must be either ge or gt.
• V can be d or q.
• The operand element type must be f32.
• The result element type is i32.

NEON provides the following vector version of the ARM tst instruction:
The vector test bits instruction performs a logical AND operation between each element of a vector and the corresponding element in a second vector. If the result is not zero, then every bit in the corresponding element of the result vector is set to one. Otherwise, every bit in the corresponding element of the result vector is set to zero.
• <size> must be one of 8, 16, or 32.
• The result element type is defined by the following table:

NEON adds the ability to perform integer and bitwise logical operations on the VFP register set. Recall that integer operations can also be used on fixed-point data. These operations add a great deal of power to the ARM processor.
NEON includes vector versions of the following five basic logical operations:
vand Bitwise AND,
veor Bitwise Exclusive-OR,
vorr Bitwise OR,
vorn Bitwise Complement and OR, and
vbic Bit Clear.
All of them involve two source operands and a destination register.
• <op> must be one of and, eor, orr, orn, or bic.
• V must be either q or d.
• <type> must be i8, i16, i32, or i64. For these bitwise logical operations, the type does not matter.

It is often useful to clear and/or set specific bits in a register. The NEON instruction set provides the following vector versions of the logical OR and bit clear instructions:
vorr Bitwise OR Immediate, and
vbic Bit Clear Immediate.
• <op> must be either orr or bic.
• V must be either q or d to specify whether the operation involves quadwords or doublewords.
• <type> must be i16 or i32.
• <imm> is a 16-bit or 32-bit immediate value, which is interpreted as a pattern for filling the immediate operand. The following table shows acceptable patterns for <imm>, based on what was chosen for <type>:

NEON provides three instructions which can be used to combine the bits in two registers or to extract specific bits from a register, according to a pattern:
vbit Bitwise Insert if True,
vbif Bitwise Insert if False, and
vbsl Bitwise Select.
• <op> can be bif, bit, or bsl.
• V can be d or q.
• The <type> must be i8, i16, i32, or i64, and specifies the size of items in the vectors. Note that for these bitwise logical operations, the type does not matter, so the assembler ignores it. However, it can be useful to the programmer as extra documentation.
| Name | Effect | Description |
| vbit | Vd ← (Vn AND Vm) OR (Vd AND NOT Vm) | Insert each bit from the first operand into the destination if the corresponding bit of the second operand is 1 |
| vbif | Vd ← (Vn AND NOT Vm) OR (Vd AND Vm) | Insert each bit from the first operand into the destination if the corresponding bit of the second operand is 0 |
| vbsl | Vd ← (Vn AND Vd) OR (Vm AND NOT Vd) | Select each bit for the destination from the first operand if the corresponding bit of the destination is 1, or from the second operand if the corresponding bit of the destination is 0 |
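The three formulas differ only in which register supplies the select mask. Here is a Python sketch of vbsl on 8-bit elements, where vd plays the role of the mask held in the destination register:

```python
def vbsl(vd, vn, vm, bits=8):
    """Vd <- (Vn AND Vd) OR (Vm AND NOT Vd): where a mask bit in vd is 1,
    take the bit from vn; otherwise take it from vm."""
    full = (1 << bits) - 1
    return [(n & d) | (m & (~d & full)) for d, n, m in zip(vd, vn, vm)]
```

Combined with a compare instruction that produces all-ones/all-zeros masks, this gives a branch-free per-element conditional select.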

The NEON shift instructions operate on vectors. Shifts are often used for multiplication and division by powers of two. The result of a left shift may be too large for the destination element, resulting in overflow. A right shift is equivalent to division by a power of two, and in some cases it may be useful to round the result of the division rather than truncate it. NEON provides versions of the shift instructions which perform saturation and/or rounding of the result.
These instructions shift each element in a vector left by an immediate value:
vshl Shift Left Immediate,
vqshl Saturating Shift Left Immediate,
vqshlu Saturating Shift Left Immediate Unsigned, and
vshll Shift Left Immediate Long.
Overflow conditions can be avoided by using the saturating version, or by using the long version, in which case the destination is twice the size of the source.
• If u is present, then the results are unsigned.
• The valid choices for <type> are given in the following table:
| Name | Effect | Description |
| vshl | Vd[i] ← Vm[i] × 2^<imm> | Each element of Vm is shifted left by the immediate value and stored in the corresponding element of Vd. Bits shifted past the end of an element are lost. |
| vshll | Qd[i] ← extend(Dm[i]) × 2^<imm> | Each element of Dm is shifted left by the immediate value and stored in the corresponding element of Qd. The values are sign or zero extended, depending on <type>. |
| vqshl{u} | Vd[i] ← saturate(Vm[i] × 2^<imm>) | Each element of Vm is shifted left by the immediate value and stored in the corresponding element of Vd. If the result of the shift is outside the range of the destination element, then the value is saturated. If u was specified, then the destination is unsigned. Otherwise, it is signed. |


These instructions shift each element in a vector, using the least significant byte of the corresponding element of a second vector as the shift amount:
vshl Shift Left or Right by Variable,
vrshl Shift Left or Right by Variable and Round,
vqshl Saturating Shift Left or Right by Variable, and
vqrshl Saturating Shift Left or Right by Variable and Round.
If the shift value is positive, the operation is a left shift. If the shift value is negative, then it is a right shift. A shift value of zero is equivalent to a move. If the operation is a right shift, and r is specified, then the result is rounded rather than truncated. Results are saturated if q is specified.
• If q is present, then the results are saturated.
• If r is present, then right shifted values are rounded rather than truncated.
• V can be d or q.
• <type> must be one of s8, s16, s32, s64, u8, u16, u32, or u64.

These instructions shift each element in a vector right by an immediate value:
vshr Shift Right Immediate,
vrshr Shift Right Immediate and Round,
vshrn Shift Right Immediate and Narrow,
vrshrn Shift Right Immediate Round and Narrow,
vsra Shift Right and Accumulate Immediate, and
vrsra Shift Right Round and Accumulate Immediate.
• If r is present, then right shifted values are rounded rather than truncated.
• <cond> is an optional condition code.
• The valid choices for <type> are given in the following table:
| Name | Effect | Description |
| v{r}shr | if r is present then Vd[i] ← round(Vm[i] ÷ 2^<imm>) else Vd[i] ← Vm[i] ÷ 2^<imm> end if | Each element of Vm is shifted right with sign or zero extension by the immediate value and stored in the corresponding element of Vd. Results can optionally be rounded. |
| v{r}shrn | if r is present then Dd[i] ← narrow(round(Qm[i] ÷ 2^<imm>)) else Dd[i] ← narrow(Qm[i] ÷ 2^<imm>) end if | Each element of Qm is shifted right by the immediate value, optionally rounded, then narrowed and stored in the corresponding element of Dd. |
| v{r}sra | if r is present then Vd[i] ← Vd[i] + round(Vm[i] ÷ 2^<imm>) else Vd[i] ← Vd[i] + Vm[i] ÷ 2^<imm> end if | Each element of Vm is shifted right with sign or zero extension by the immediate value and accumulated in the corresponding element of Vd. Results can optionally be rounded. |


These instructions shift each element in a quad word vector right by an immediate value:
vqshrn Saturating Shift Right Immediate,
vqrshrn Saturating Shift Right Immediate Round,
vqshrun Saturating Shift Right Immediate Unsigned, and
vqrshrun Saturating Shift Right Immediate Round Unsigned.
The result is optionally rounded, then saturated, narrowed, and stored in a double word vector.
• If r is present, then right shifted values are rounded rather than truncated.
• If u is present, then the results are unsigned, regardless of the type of elements in Qm.
• The valid choices for <type> are given in the following table:
• <imm> is the amount that elements are to be shifted, and must be between zero and one less than the number of bits in <type>.
| Name | Effect | Description |
| vq{r}shrn | if r is present then Dd[i] ← saturate(narrow(round(Qm[i] ÷ 2^<imm>))) else Dd[i] ← saturate(narrow(Qm[i] ÷ 2^<imm>)) end if | Each element of Qm is shifted right with sign extension by the immediate value, optionally rounded, then saturated and narrowed, and stored in the corresponding element of Dd. |
| vq{r}shrun | if r is present then Dd[i] ← saturate_unsigned(narrow(round(Qm[i] ÷ 2^<imm>))) else Dd[i] ← saturate_unsigned(narrow(Qm[i] ÷ 2^<imm>)) end if | Each element of Qm is shifted right with zero extension by the immediate value, optionally rounded, then saturated and narrowed, and stored in the corresponding element of Dd. |
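The shift-round-saturate-narrow pipeline can be modeled in Python. This sketch follows vqrshrn.s16 narrowing signed 16-bit elements to signed bytes (the function name and the Python-level representation are ours):

```python
def vqrshrn_s16(vec, shift, rounding=True):
    """Shift each signed 16-bit element right, optionally round,
    then saturate to the signed 8-bit range."""
    out = []
    for x in vec:
        if rounding:
            x += 1 << (shift - 1)   # add half of the discarded weight
        x >>= shift                  # Python >> is an arithmetic shift
        out.append(max(-128, min(127, x)))
    return out
```

This combination of operations is exactly what is needed when scaling a wider intermediate result (for example, a multiply accumulator) back down to a narrow output format.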


These instructions perform bitwise shifting of each element in a vector, then combine the results with the contents of the destination register:
vsli Shift Left and Insert, and
vsri Shift Right and Insert.
Fig. 10.10 provides an example.

• <dir> must be l for a left shift, or r for a right shift.
• <size> must be 8, 16, 32, or 64.
• <imm> is the amount that elements are to be shifted, and must be between zero and <size>− 1 for vsli, or between one and <size> for vsri.
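A Python sketch of the insert behavior on 8-bit elements, with the kept-bit masks computed from the shift amount (names are ours):

```python
def vsli(vd, vm, imm, bits=8):
    """Shift each vm element left by imm; the low imm bits keep vd's value."""
    full = (1 << bits) - 1
    keep = (1 << imm) - 1
    return [((m << imm) & full) | (d & keep) for d, m in zip(vd, vm)]

def vsri(vd, vm, imm, bits=8):
    """Shift each vm element right by imm; the high imm bits keep vd's value."""
    full = (1 << bits) - 1
    keep = (full << (bits - imm)) & full
    return [(m >> imm) | (d & keep) for d, m in zip(vd, vm)]
```

These instructions make it easy to pack bit fields from two registers into one without a separate mask-and-OR sequence.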

NEON provides several instructions for addition, subtraction, and multiplication, but does not provide a divide instruction. Whenever possible, division should be performed by multiplying by the reciprocal. When dividing by constants, the reciprocal can be calculated in advance, as shown in Chapter 8. For dividing by variables, NEON provides instructions for quickly calculating the reciprocals of all elements in a vector. In most cases, this is faster than using a divide instruction. When division is absolutely unavoidable, the VFP divide instructions can be used.
The following eight instructions perform vector addition and subtraction:
vadd Add
vqadd Saturating Add
vaddl Add Long
vaddw Add Wide
vsub Subtract
vqsub Saturating Subtract
vsubl Subtract Long
vsubw Subtract Wide
The Vector Add (vadd) instruction adds corresponding elements in two vectors and stores the results in the corresponding elements of the destination register. The Vector Subtract (vsub) instruction subtracts elements in one vector from corresponding elements in another vector and stores the results in the corresponding elements of the destination register. Other versions allow mismatched operand and destination sizes, and the saturating versions prevent overflow by limiting the range of the results.
• The valid choices for <type> are given in the following table:
| Name | Effect | Description |
| v<op> | Vd[i] ← Vn[i] <op> Vm[i] | The operation is applied to corresponding elements of Vn and Vm. The results are stored in the corresponding elements of Vd. |
| vq<op> | Vd[i] ← saturate(Vn[i] <op> Vm[i]) | The operation is applied to corresponding elements of Vn and Vm. The results are saturated then stored in the corresponding elements of Vd. |
| v<op>l | Qd[i] ← extend(Dn[i]) <op> extend(Dm[i]) | The operation is applied to corresponding elements of Dn and Dm. The operands are zero or sign extended and the results are stored in the corresponding elements of Qd. |
| v<op>w | Qd[i] ← Qn[i] <op> extend(Dm[i]) | The elements of Dm are sign or zero extended, then the operation is applied with corresponding elements of Qn. The results are stored in the corresponding elements of Qd. |
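Saturating addition is simple to model. A Python sketch for 8-bit elements (names are ours):

```python
def vqadd_s8(vn, vm):
    """Signed saturating add: results clamp to [-128, 127]."""
    return [max(-128, min(127, a + b)) for a, b in zip(vn, vm)]

def vqadd_u8(vn, vm):
    """Unsigned saturating add: results clamp to [0, 255]."""
    return [min(255, a + b) for a, b in zip(vn, vm)]
```

Saturation is usually preferable to wraparound in media code: a slightly-too-bright pixel clamps to white instead of wrapping to black.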


These instructions add or subtract the corresponding elements of two vectors, and narrow by taking the most significant half of the result:
vaddhn Add and Narrow
vraddhn Add, Round, and Narrow
vsubhn Subtract and Narrow
vrsubhn Subtract, Round, and Narrow
The results are stored in the corresponding elements of the destination register. Results can be optionally rounded instead of truncated.
• If <r> is specified, then the result is rounded instead of truncated.
• <type> must be either i16, i32, or i64.

These instructions add or subtract corresponding elements from two vectors then shift the result right by one bit:
vhadd Halving Add
vrhadd Halving Add and Round
vhsub Halving Subtract
The results are stored in corresponding elements of the destination vector. If the operation is addition, then the results can be optionally rounded.
• If <r> is specified, then the result is rounded instead of truncated.
• <type> must be either s8, s16, s32, u8, u16, or u32.
| Name | Effect | Description |
| v{r}hadd | for 0 ≤ i < n do if r is present then Vd[i] ← (Vn[i] + Vm[i] + 1) ÷ 2 else Vd[i] ← (Vn[i] + Vm[i]) ÷ 2 end if end for | The corresponding elements of Vn and Vm are added together, optionally rounded, then shifted right one bit. Results are stored in the corresponding elements of Vd. |
| vhsub | for 0 ≤ i < n do Vd[i] ← (Vn[i] − Vm[i]) ÷ 2 end for | The elements of Vm are subtracted from the corresponding elements of Vn. Results are shifted right one bit and stored in the corresponding elements of Vd. |


These instructions add vector elements pairwise:
vpadd Add Pairwise
vpaddl Add Pairwise Long
vpadal Add Pairwise and Accumulate Long
The long versions can be used to prevent overflow.
• <op> must be one of add, addl, or adal.
• The valid choices for <type> are given in the following table:
| Name | Effect | Description |
| vpadd | for 0 ≤ i < n ÷ 2 do Dd[i] ← Dn[2i] + Dn[2i + 1] end for; for n ÷ 2 ≤ i < n do Dd[i] ← Dm[2i − n] + Dm[2i − n + 1] end for | Add elements of two vectors pairwise and store the results in another vector. |
| vpaddl | for 0 ≤ i < n ÷ 2 do Vd[i] ← extend(Vm[2i]) + extend(Vm[2i + 1]) end for | Add elements of a vector pairwise, widening the sums, and store the results in another vector. |
| vpadal | for 0 ≤ i < n ÷ 2 do Vd[i] ← Vd[i] + extend(Vm[2i]) + extend(Vm[2i + 1]) end for | Add elements of a vector pairwise and accumulate the widened sums in another vector. |
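Pairwise addition can be sketched in Python. vpadd draws its pairs from the concatenation of both operands, while vpaddl pairs up a single vector and widens the sums (Python integers model the widened result):

```python
def vpadd(dn, dm):
    """Add adjacent pairs from the concatenation of two vectors."""
    src = dn + dm
    return [src[i] + src[i + 1] for i in range(0, len(src), 2)]

def vpaddl(vm):
    """Add adjacent pairs of one vector; sums are held at double width."""
    return [vm[i] + vm[i + 1] for i in range(0, len(vm), 2)]
```

Repeated pairwise addition is the standard NEON idiom for reducing a whole vector to a single sum.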


These instructions subtract the elements of one vector from another and store or accumulate the absolute value of the results:
vaba Absolute Difference and Accumulate
vabal Absolute Difference and Accumulate Long
vabd Absolute Difference
vabdl Absolute Difference Long
The long versions can be used to prevent overflow.
• The valid choices for <type> are given in the following table:
| Name | Effect | Description |
| vabd | Vd[i] ← abs(Vn[i] − Vm[i]) | Subtract corresponding elements and take the absolute value |
| vaba | Vd[i] ← Vd[i] + abs(Vn[i] − Vm[i]) | Subtract corresponding elements and take the absolute value. Accumulate the results |
| vabdl | Qd[i] ← abs(extend(Dn[i]) − extend(Dm[i])) | Extend and subtract corresponding elements, then take the absolute value |
| vabal | Qd[i] ← Qd[i] + abs(extend(Dn[i]) − extend(Dm[i])) | Extend and subtract corresponding elements, then take the absolute value. Accumulate the results |


These operations compute the absolute value or negate each element in a vector:
vabs Absolute Value
vneg Negate
vqabs Saturating Absolute Value
vqneg Saturating Negate
The saturating versions can be used to prevent overflow.
• If q is present then results are saturated.
• <op> is either abs or neg.
• The valid choices for <type> are given in the following table:

The following four instructions select the maximum or minimum elements and store the results in the destination vector:
vmax Maximum
vmin Minimum
vpmax Pairwise Maximum
vpmin Pairwise Minimum
• <type> must be one of s8, s16, s32, u8, u16, u32, or f32.
| Name | Effect | Description |
| vmax | for 0 ≤ i < n do if Vn[i] > Vm[i] then Vd[i] ← Vn[i] else Vd[i] ← Vm[i] end if end for | Compare corresponding elements and copy the greater of each pair into the corresponding element in the destination vector |
| vpmax | for 0 ≤ i < n ÷ 2 do if Dn[2i] > Dn[2i + 1] then Dd[i] ← Dn[2i] else Dd[i] ← Dn[2i + 1] end if end for; for n ÷ 2 ≤ i < n do if Dm[2i − n] > Dm[2i − n + 1] then Dd[i] ← Dm[2i − n] else Dd[i] ← Dm[2i − n + 1] end if end for | Compare elements pairwise and copy the greater of each pair into an element of the destination vector |
| vmin | for 0 ≤ i < n do if Vn[i] < Vm[i] then Vd[i] ← Vn[i] else Vd[i] ← Vm[i] end if end for | Compare corresponding elements and copy the lesser of each pair into the corresponding element in the destination vector |
| vpmin | for 0 ≤ i < n ÷ 2 do if Dn[2i] < Dn[2i + 1] then Dd[i] ← Dn[2i] else Dd[i] ← Dn[2i + 1] end if end for; for n ÷ 2 ≤ i < n do if Dm[2i − n] < Dm[2i − n + 1] then Dd[i] ← Dm[2i − n] else Dd[i] ← Dm[2i − n + 1] end if end for | Compare elements pairwise and copy the lesser of each pair into an element of the destination vector |


These instructions can be used to count leading sign bits or zeros, or to count the number of bits that are set for each element in a vector:
vcls Count Leading Sign Bits
vclz Count Leading Zero Bits
vcnt Count Set Bits
• <op> is either cls, clz, or cnt.
• The valid choices for <type> are given in the following table:
| Name | Effect | Description |
| vcls | for 0 ≤ i < n do Fd[i] ← CountLeadingSignBits(Fm[i]) end for | Count the number of consecutive bits that are the same as the sign bit for each element in Fm, and store the counts in the corresponding elements of Fd |
| vclz | for 0 ≤ i < n do Fd[i] ← CountLeadingZeros(Fm[i]) end for | Count the number of leading zero bits for each element in Fm, and store the counts in the corresponding elements of Fd. |
| vcnt | for 0 ≤ i < n do Fd[i] ← CountSetBits(Fm[i]) end for | Count the number of bits in Fm that are set to one, and store the counts in the corresponding elements of Fd |


There is no vector divide instruction in NEON. Division is accomplished with multiplication by the reciprocals of the divisors. The reciprocals are found by making an initial estimate, then using the Newton-Raphson method to improve the approximation. This can actually be faster than using a hardware divider. NEON supports single precision floating point and unsigned fixed point reciprocal calculation. Fixed point reciprocals provide higher precision. Division using the NEON reciprocal method may not provide the best precision possible. If the best possible precision is required, then the VFP divide instruction should be used.
These instructions are used to multiply the corresponding elements from two vectors:
vmul Multiply
vmla Multiply Accumulate
vmls Multiply Subtract
vmull Multiply Long
vmlal Multiply Accumulate Long
vmlsl Multiply Subtract Long
The long versions can be used to avoid overflow.
• <op> is either mul, mla, or mls.
• The valid choices for <type> are given in the following table:
| Name | Description |
| vmul | Multiply corresponding elements from two vectors and store the results in a third vector |
| vmla | Multiply corresponding elements from two vectors and add the results to a third vector |
| vmls | Multiply corresponding elements from two vectors and subtract the results from a third vector |
| vmull | Multiply corresponding elements from two vectors and store the results in a third, wider vector |
| vmlal | Multiply corresponding elements from two vectors and add the results to a third, wider vector |
| vmlsl | Multiply corresponding elements from two vectors and subtract the results from a third, wider vector |


These instructions are used to multiply each element in a vector by a scalar:
vmul Multiply by Scalar
vmla Multiply Accumulate by Scalar
vmls Multiply Subtract by Scalar
vmull Multiply Long by Scalar
vmlal Multiply Accumulate Long by Scalar
vmlsl Multiply Subtract Long by Scalar
The long versions can be used to avoid overflow.
• <op> is either mul, mla, or mls.
• The valid choices for <type> are given in the following table:
| Opcode | Valid Types |
| vmul | i16, i32, or f32 |
| vmla | i16, i32, or f32 |
| vmls | i16, i32, or f32 |
| vmull | s16, s32, u16, or u32 |
| vmlal | s16, s32, u16, or u32 |
| vmlsl | s16, s32, u16, or u32 |
• x must be valid for the chosen <type>.
| Name | Description |
| vmul | Multiply each element in a vector by a scalar and store the results in the destination vector |
| vmla | Multiply each element in a vector by a scalar and add the results to the destination vector |
| vmls | Multiply each element in a vector by a scalar and subtract the results from the destination vector |
| vmull | Multiply each element in a vector by a scalar and store the results in a wider destination vector |
| vmlal | Multiply each element in a vector by a scalar and add the results to a wider destination vector |
| vmlsl | Multiply each element in a vector by a scalar and subtract the results from a wider destination vector |


A fused multiply accumulate operation does not perform rounding between the multiply and add operations. The two operations are fused into one. NEON provides the following fused multiply accumulate instructions:
vfma Fused Multiply Accumulate
vfnma Fused Negate Multiply Accumulate
vfms Fused Multiply Subtract
vfnms Fused Negate Multiply Subtract
Using the fused multiply accumulate can result in improved speed and accuracy for many computations that involve the accumulation of products.
<op> is one of vfma, vfnma, vfms, or vfnms.
<cond> is an optional condition code.
<prec> may be either f32 or f64.

These instructions perform multiplication, double the results, and perform saturation:
vqdmull Saturating Multiply Double (Low)
vqdmlal Saturating Multiply Double Accumulate (Low)
vqdmlsl Saturating Multiply Double Subtract (Low)
• <op> is either mul, mla, or mls.
• <type> must be either s16 or s32.


These instructions perform multiplication, double the results, perform saturation, and store the high half of the results:
vqdmulh Saturating Multiply Double (High)
vqrdmulh Saturating Multiply Double (High) and Round
| Name | Effect | Description |
| vqdmulh | for 0 ≤ i < n do Vd[i] ← sat(high half of (2 × Vn[i] × op2[i])) end for, where op2[i] is the scalar if the second operand is a scalar, and Vm[i] otherwise | Multiply elements, double the results, and store the high half in the destination vector with saturation |
| vqrdmulh | for 0 ≤ i < n do Vd[i] ← sat(high half of (2 × Vn[i] × op2[i] + rounding constant)) end for, where op2[i] is the scalar if the second operand is a scalar, and Vm[i] otherwise | Multiply elements, double the results, round, and store the high half in the destination vector with saturation |


These instructions perform the initial estimates of the reciprocal values:
vrecpe Reciprocal Estimate
vrsqrte Reciprocal Square Root Estimate
These work on floating point and unsigned fixed point vectors. The estimates from this instruction are accurate to within about eight bits. If higher accuracy is desired, then the Newton-Raphson method can be used to improve the initial estimates. For more information, see the Reciprocal Step instruction.
• <op> is either recpe or rsqrte.
• <type> must be either u32 or f32.
• If <type> is u32, then the elements are assumed to be U(1,31) fixed point numbers, and the most significant fraction bit (bit 30) must be 1, and the integer part must be zero. The vclz and shift by variable instructions can be used to put the data in the correct format.
• The result elements are always f32.

These instructions are used to perform one Newton-Raphson step for improving the reciprocal estimates:
vrecps Reciprocal Step
vrsqrts Reciprocal Square Root Step
For each element in the vector, the following equation can be used to improve the estimates of the reciprocals:

x_{n+1} = x_n (2 − d·x_n),

where x_n is the estimated reciprocal from the previous step, and d is the number for which the reciprocal is desired. This equation converges to 1/d if x_0 is obtained using vrecpe on d. The vrecps instruction computes 2 − d·x_n, so one additional multiplication is required to complete the update step. The initial estimate x_0 must be obtained using the vrecpe instruction.
For each element in the vector, the following equation can be used to improve the estimates of the reciprocals of the square roots:

x_{n+1} = x_n (3 − d·x_n²) ÷ 2,

where x_n is the estimated reciprocal square root from the previous step, and d is the number for which the reciprocal square root is desired. This equation converges to 1/√d if x_0 is obtained using vrsqrte on d. The vrsqrts instruction computes (3 − d·x_n²) ÷ 2, so two additional multiplications are required to complete the update step. The initial estimate x_0 must be obtained using the vrsqrte instruction.
• <op> is either recps or rsqrts.
• <type> must be either u32 or f32.

The GNU assembler supports five pseudo-instructions for NEON. Two of them are vcle and vclt, which were covered in Section 10.6.1. The other three are explained in the following sections.
This pseudo-instruction loads a constant value into every element of a NEON vector, or into a VFP single-precision or double-precision register:
This pseudo-instruction will use vmov if possible. Otherwise, it will create an entry in the literal pool and use vldr.
• <cond> is an optional condition code.
• <type> must be one of i8, i16, i32, i64, s8, s16, s32, s64, u8, u16, u32, u64, f32, or f64.
• <imm> is a value appropriate for the specified <type>.

It is often useful to clear and/or set specific bits in a register. The following pseudo-instructions can provide bitwise logical operations:
vand Bitwise AND Immediate
vorn Bitwise Complement and OR Immediate
• <op> must be either and or orn.
• V must be either q or d to specify whether the operation involves quadwords or doublewords.
• <type> must be i8, i16, i32, or i64.
• <imm> is a 16-bit or 32-bit immediate value, which is interpreted as a pattern for filling the immediate operand. The following table shows acceptable patterns for <imm>, based on what was chosen for <type>:

The following pseudo-instructions perform comparisons between the absolute values of all of the corresponding elements of two vectors in parallel:
vacle Absolute Compare Less Than or Equal
vaclt Absolute Compare Less Than
The vector absolute compare instruction compares the absolute value of each element of a vector with the absolute value of the corresponding element in a second vector, and sets an element in the destination vector for each comparison. If the comparison is true, then all bits in the result element are set to one. Otherwise, all bits in the result element are set to zero. Note that summing the elements of the result vector (as signed integers) will give the two’s complement of the number of comparisons which were true.
• <op> must be either le or lt.
• V can be d or q.
• The operand element type must be f32.
• The result element type is i32.

In Chapter 9, four versions of the sine function were given. Those implementations used scalar and VFP vector modes for single-precision and double-precision. Those implementations are already faster than the ones provided by GCC. However, it may be possible to gain a little more performance by taking advantage of the NEON architecture. All versions of NEON are guaranteed to have a very large register set, and that fact can be used to attain better performance.
Listing 10.1 shows a single precision floating point implementation of the sine function, using the ARM NEON instruction set. It performs the same operations as the previous implementations of the sine function, but performs many of the calculations in parallel. This implementation is slightly faster than the previous version.

Listing 10.2 shows a double precision floating point implementation of the sine function. This code is intended to run on ARMv7 and earlier NEON/VFP systems with the full set of 32 double-precision registers. NEON systems prior to ARMv8 do not have NEON SIMD instructions for double precision operations. This implementation is faster than Listing 9.4 because it uses a large number of registers, does not contain a loop, and is written carefully so that multiple instructions can be at different stages in the pipeline at the same time. This technique of gaining performance is known as loop unrolling.



Table 10.4 compares the implementations from Listings 10.1 and 10.2 with the VFP vector implementations from Chapter 9 and the sine function provided by GCC. Notice that in every case, using vector mode VFP instructions is slower than the scalar VFP version. As mentioned previously, vector mode is deprecated on NEON processors and is emulated in software: each vector instruction causes an exception, and the operating system takes over and substitutes a series of scalar floating point operations on-the-fly. Although vector mode is still supported, a great deal of time is spent by the operating system in emulating the VFP hardware, so using it results in severely reduced performance.
Table 10.4
Performance of sine function with various implementations
| Optimization | Implementation | CPU seconds |
| None | Single Precision VFP scalar Assembly | 1.74 |
| | Single Precision VFP vector Assembly | 27.09 |
| | Single Precision NEON Assembly | 1.32 |
| | Single Precision C | 4.36 |
| | Double Precision VFP scalar Assembly | 2.83 |
| | Double Precision VFP vector Assembly | 106.46 |
| | Double Precision NEON Assembly | 2.24 |
| | Double Precision C | 4.59 |
| Full | Single Precision VFP scalar Assembly | 1.11 |
| | Single Precision VFP vector Assembly | 27.15 |
| | Single Precision NEON Assembly | 0.96 |
| | Single Precision C | 1.69 |
| | Double Precision VFP scalar Assembly | 2.56 |
| | Double Precision VFP vector Assembly | 107.53 |
| | Double Precision NEON Assembly | 2.05 |
| | Double Precision C | 4.27 |
When compiler optimization is not used, the single precision scalar VFP implementation achieves a speedup of about 2.51, and the NEON implementation achieves a speedup of about 3.30 compared to the GCC implementation. The double precision scalar VFP implementation achieves a speedup of about 1.62, and the loop-unrolled NEON implementation achieves a speedup of about 2.05 compared to the GCC implementation.
When the best possible compiler optimization is used (-Ofast), the single precision scalar VFP implementation achieves a speedup of about 1.52, and the NEON implementation achieves a speedup of about 1.76 compared to the GCC implementation. The double precision scalar VFP implementation achieves a speedup of about 1.67, and the loop-unrolled NEON implementation achieves a speedup of about 2.08 compared to the GCC implementation. The single precision NEON version was 1.16 times as fast as the VFP scalar version and the double precision NEON implementation was 1.25 times as fast as the VFP scalar implementation.
Although the VFP versions of the sine function ran without modification on the NEON processor, rewriting them for NEON resulted in significant performance improvement. Performance of the vectorized VFP code running on a NEON processor was abysmal. The take-away lesson is that a programmer can improve performance by writing some functions in assembly that are specifically targeted to run on a specific platform. However, assembly code which improves performance on one platform may actually result in very poor performance on a different platform. To achieve optimal or near-optimal performance, it is important for the programmer to be aware of exactly which hardware platform is being used.
| Name | Page | Operation |
| vaba | 339 | Absolute Difference and Accumulate |
| vabal | 339 | Absolute Difference and Accumulate Long |
| vabd | 339 | Absolute Difference |
| vabdl | 339 | Absolute Difference Long |
| vabs | 340 | Absolute Value |
| vacge | 324 | Absolute Compare Greater Than or Equal |
| vacgt | 324 | Absolute Compare Greater Than |
| vacle | 353 | Absolute Compare Less Than or Equal |
| vaclt | 353 | Absolute Compare Less Than |
| vadd | 335 | Add |
| vaddhn | 336 | Add and Narrow |
| vaddl | 335 | Add Long |
| vaddw | 335 | Add Wide |
| vand | 326 | Bitwise AND |
| vand | 352 | Bitwise AND Immediate |
| vbic | 326 | Bit Clear |
| vbic | 327 | Bit Clear Immediate |
| vbif | 328 | Bitwise Insert if False |
| vbit | 328 | Bitwise Insert |
| vbsl | 328 | Bitwise Select |
| vceq | 323 | Compare Equal |
| vcge | 323 | Compare Greater Than or Equal |
| vcgt | 323 | Compare Greater Than |
| vcle | 323 | Compare Less Than or Equal |
| vcls | 342 | Count Leading Sign Bits |
| vclt | 323 | Compare Less Than |
| vclz | 342 | Count Leading Zero Bits |
| vcnt | 342 | Count Set Bits |
| vcvt | 322 | Convert Between Half and Single |
| vcvt | 321 | Convert Data Format |
| vdup | 312 | Duplicate Scalar |
| veor | 326 | Bitwise Exclusive-OR |
| vext | 313 | Extract Elements |
| vfma | 346 | Fused Multiply Accumulate |
| vfms | 346 | Fused Multiply Subtract |
| vfnma | 346 | Fused Negate Multiply Accumulate |
| vfnms | 346 | Fused Negate Multiply Subtract |
| vhadd | 337 | Halving Add |
| vhsub | 337 | Halving Subtract |
| vld<n> | 305 | Load Copies of Structured Data |
| vld<n> | 307 | Load Multiple Structured Data |
| vld<n> | 303 | Load Structured Data |
| vldr | 351 | Load Constant |
| vmax | 341 | Maximum |
| vmin | 341 | Minimum |
| vmla | 343 | Multiply Accumulate |
| vmla | 345 | Multiply Accumulate by Scalar |
| vmlal | 344 | Multiply Accumulate Long |
| vmlal | 345 | Multiply Accumulate Long by Scalar |
| vmls | 343 | Multiply Subtract |
| vmls | 345 | Multiply Subtract by Scalar |
| vmlsl | 344 | Multiply Subtract Long |
| vmlsl | 345 | Multiply Subtract Long by Scalar |
| vmov | 310 | Move Immediate |
| vmov | 309 | Move Between NEON and ARM |
| vmovl | 311 | Move and Lengthen |
| vmovn | 311 | Move and Narrow |
| vmul | 343 | Multiply |
| vmul | 345 | Multiply by Scalar |
| vmull | 343 | Multiply Long |
| vmull | 345 | Multiply Long by Scalar |
| vmvn | 310 | Move Immediate Negative |
| vneg | 340 | Negate |
| vorn | 326 | Bitwise Complement and OR |
| vorn | 352 | Bitwise Complement and OR Immediate |
| vorr | 326 | Bitwise OR |
| vorr | 327 | Bitwise OR Immediate |
| vpadal | 338 | Add Pairwise and Accumulate Long |
| vpadd | 338 | Add Pairwise |
| vpaddl | 338 | Add Pairwise Long |
| vpmax | 341 | Pairwise Maximum |
| vpmin | 341 | Pairwise Minimum |
| vqabs | 340 | Saturating Absolute Value |
| vqadd | 335 | Saturating Add |
| vqdmlal | 347 | Saturating Multiply Double Accumulate (Low) |
| vqdmlsl | 347 | Saturating Multiply Double Subtract (Low) |
| vqdmulh | 348 | Saturating Multiply Double (High) |
| vqdmull | 347 | Saturating Multiply Double (Low) |
| vqmovn | 311 | Saturating Move and Narrow |
| vqmovun | 311 | Saturating Move and Narrow Unsigned |
| vqneg | 340 | Saturating Negate |
| vqrdmulh | 348 | Saturating Multiply Double (High) and Round |
| vqrshl | 330 | Saturating Shift Left or Right by Variable and Round |
| vqrshrn | 332 | Saturating Shift Right Immediate Round |
| vqrshrun | 333 | Saturating Shift Right Immediate Round Unsigned |
| vqshl | 329 | Saturating Shift Left Immediate |
| vqshl | 330 | Saturating Shift Left or Right by Variable |
| vqshlu | 329 | Saturating Shift Left Immediate Unsigned |
| vqshrn | 332 | Saturating Shift Right Immediate |
| vqshrun | 333 | Saturating Shift Right Immediate Unsigned |
| vqsub | 335 | Saturating Subtract |
| vraddhn | 336 | Add, Round, and Narrow |
| vrecpe | 348 | Reciprocal Estimate |
| vrecps | 349 | Reciprocal Step |
| vrev | 314 | Reverse Elements |
| vrhadd | 337 | Halving Add and Round |
| vrshl | 330 | Shift Left or Right by Variable and Round |
| vrshr | 331 | Shift Right Immediate and Round |
| vrshrn | 331 | Shift Right Immediate Round and Narrow |
| vrsqrte | 348 | Reciprocal Square Root Estimate |
| vrsqrts | 349 | Reciprocal Square Root Step |
| vrsra | 331 | Shift Right Round and Accumulate Immediate |
| vrsubhn | 336 | Subtract, Round, and Narrow |
| vshl | 329 | Shift Left Immediate |
| vshl | 330 | Shift Left or Right by Variable |
| vshll | 329 | Shift Left Immediate Long |
| vshr | 331 | Shift Right Immediate |
| vshrn | 331 | Shift Right Immediate and Narrow |
| vsli | 334 | Shift Left and Insert |
| vsra | 331 | Shift Right and Accumulate Immediate |
| vsri | 334 | Shift Right and Insert |
| vst<n> | 307 | Store Multiple Structured Data |
| vst<n> | 303 | Store Structured Data |
| vsub | 335 | Subtract |
| vsubhn | 336 | Subtract and Narrow |
| vsubl | 335 | Subtract Long |
| vsubw | 335 | Subtract Wide |
| vswp | 315 | Swap Vectors |
| vtbl | 318 | Table Lookup |
| vtbx | 318 | Table Lookup with Extend |
| vtrn | 316 | Transpose Matrix |
| vtst | 325 | Test Bits |
| vuzp | 319 | Unzip Vectors |
| vzip | 319 | Zip Vectors |


NEON can dramatically improve the performance of algorithms that can take advantage of data parallelism. However, compiler support for automatically vectorizing code using NEON instructions is still immature. NEON intrinsics allow C and C++ programmers to access NEON instructions by making them look like C functions. It is usually just as easy, and more concise, to write NEON assembly code as it is to use the intrinsic functions. A careful assembly language programmer can usually beat the compiler, sometimes by a wide margin. The greatest gains usually come from converting an algorithm to avoid floating point, and from taking advantage of data parallelism.
10.1 What is the advantage of using IEEE half-precision? What is the disadvantage?
10.2 NEON achieved relatively modest performance gains on the sine function, when compared to VFP.
(b) List some tasks for which NEON could significantly outperform VFP.
10.3 There are some limitations on the size of the structure that can be loaded or stored using the vld<n> and vst<n> instructions. What are the limitations?
10.4 The sine function in Listing 10.2 uses a technique known as “loop unrolling” to achieve higher performance. Name at least three reasons why this code is more efficient than using a loop.
10.5 Reimplement the fixed-point sine function from Listing 8.7 using NEON instructions. Hint: you should not need to use a loop. Compare the performance of your NEON implementation with the performance of the original implementation.
10.6 Reimplement Exercise 9.10 using NEON instructions.
10.7 Fixed point operations may be faster than floating point operations. Modify your code from the previous exercise so that it uses the following definitions for points and transformation matrices:

Use saturating instructions and/or any other techniques necessary to prevent overflow. Compare the performance of the two implementations.
Accessing Devices
This chapter starts with a high-level explanation of how devices may be accessed in a modern computer system, and then explains that most devices on modern architectures are memory-mapped. Next, it explains how memory-mapped devices can be accessed by user processes under Linux, by making use of the mmap system call. Code examples are given, showing how several devices can be mapped into the memory of a user-level program on the Raspberry Pi and pcDuino. Finally, the General Purpose I/O devices on both systems are explained, giving the reader the opportunity to compare two different devices which perform almost precisely the same functions.
Device; Memory map; General purpose I/O (GPIO); I/O Pin; Header; Pull-up and pull-down resistor; LED; Switch
As mentioned in Chapter 1, a computer system consists of three main parts: the CPU, memory, and devices. The typical computing system has many devices of various types for performing specific functions. Some devices, such as data caches, are closely coupled to the CPU, and are typically controlled by executing special CPU instructions that can only be accessed in assembly language. However, most of the devices on a typical system are accessed and controlled through the system data bus. These devices appear to the programmer to be ordinary memory locations. The hardware in the system bus decodes the addresses coming from the CPU, and some addresses correspond to devices rather than memory. Fig. 11.1 shows the memory layout for a typical system. The exact locations of the devices and memory are chosen by the system hardware designers. From the programmer’s standpoint, writing data to certain memory addresses results in the data being transferred to a device rather than stored in memory. The programmer must read documentation on the hardware design to determine exactly where the devices are in memory.

There are devices that allow data to be read or written from external sources, devices that can measure time, devices for moving data from one location in memory to another, devices for modifying the addresses of memory regions, and devices for even more esoteric purposes. Some devices are capable of sending signals to the CPU to indicate that they need attention, while others simply wait for the CPU to check on their status.
A modern computer system, such as the Raspberry Pi, has dozens or even hundreds of devices. Programmers write device driver software for each device. A device driver provides a few standard function calls for each device, so that it can be used easily. The specific set of functions depends on the type of device and the design of the operating system. Operating system designers strive to define a small set of device types, and to define a standard software interface for each type in order to make devices interchangeable.
Devices are typically controlled by writing specific values to the device’s internal device registers. For the ARM processor, access to most device registers is accomplished using the load and store instructions. Each device is assigned a base address in memory. This address corresponds with the first register inside the device. The device may also have other registers that are accessible at some pre-defined offset address from the base address. Some registers are read-only, some are write-only, and some are read-write. To use the device, the programmer must read from, and write appropriate data to, the correct device registers. For every device, there is a programmer’s model and documentation explaining what each register in the device does. Some devices are well designed, easy to use, and well documented. Some devices are not, and the programmer must work harder to write software to use them.
Linux is a powerful, multiuser, multitasking operating system. The Linux kernel manages all of the devices and protects them from direct access by user programs. User programs are intended to access devices by making system calls. The kernel accesses the devices on behalf of the user programs, ensuring that an errant user program cannot misuse the devices and other resources on the system. Attempting to directly access the registers in any device will result in an exception. The kernel will take over and kill the offending process.
However, our programs will need direct access to the device registers. Linux allows user programs to gain direct access through the mmap() system call. Listing 11.1 shows how four devices can be mapped into the memory space of a user program on a Raspberry Pi. In most cases, the user program will need administrator privileges in order to perform the mapping. The operating system does not usually give permission for ordinary users to access devices directly. However Linux does provide the ability to change permissions on /dev/mem, or for user programs to run with elevated privileges.






Listing 11.2 shows how four devices can be mapped into the memory space of a user program on a pcDuino. The devices are equivalent to the devices mapped in Listing 11.1. Some of the devices are described in the following sections of this chapter. The pcDuino devices and Raspberry Pi devices operate differently, but provide similar functionality. Note that most of the code is the same for both listings. The only real differences between Listings 11.1 and 11.2 are the names of the devices and their hardware addresses.





One type of device, commonly found on embedded systems, is the General Purpose I/O (GPIO) device. Although there are many variations on this device provided by different manufacturers, they all provide similar capabilities. The device provides a set of input and/or output bits, which allow signals to be transferred to or from the outside world. Each bit of input or output in a GPIO device is generally referred to as a pin, and a group of pins is referred to as a GPIO port. Ports commonly support 8 bits of input or output, but some devices have 16 or 32 bit ports. Some GPIO devices support multiple ports, and some systems have multiple GPIO devices in them.
A system with a GPIO device usually has some type of connector or wires that allow external inputs or outputs to be connected to the system. For example, the IBM PC has a type of GPIO device that was originally intended for communications with a parallel printer. On that platform, the GPIO device is commonly referred to as the parallel printer port.
Some GPIO devices, such as the one on the IBM PC, are arranged as sets of pins that can be switched as a group to either input or output. In many modern GPIO devices, each pin can be individually configured to accept or source different input and output voltages. On some devices, the amount of drive current available can be configured. Some include the ability to configure built-in pull-up and/or pull-down resistors. On most older GPIO devices, the input and output voltages are typically limited to the supply voltage of the GPIO device, and the device may be damaged by greater voltages. Newer GPIO devices generally can tolerate 5 V on inputs, regardless of the supply voltage of the device.
GPIO devices are very common in systems that are intended to be used for embedded applications. For most GPIO devices:
• individual pins or groups of pins can be configured,
• pins can be configured to be input or output,
• pins can be disabled so that they are neither input nor output,
• input values can be read by the CPU (typically high=1, low=0),
• output values can be read or written by the CPU, and
• input pins can be configured to generate interrupt requests.
Some GPIO devices may also have more advanced features, such as the ability to use Direct Memory Access (DMA) to send data without requiring the CPU to move each byte or word. Fig. 11.2 shows two common ways to use GPIO pins. Fig. 11.2A shows a GPIO pin that has been configured for input, and connected to a push-button switch. When the switch is open, the pull-up resistor pulls the voltage on the pin to a high state. When the switch is closed, the pin is pulled to a low state and some current flows through the pull-up resistor to ground. Typically, the pull-up resistor would be around 10 kΩ. The specific value is not critical, but it must be high enough to limit the current to a small amount when the switch is closed. Fig. 11.2B shows a GPIO pin that is configured as an output and is being used to drive an LED. When a 1 is output on the pin, it is at the same voltage as Vcc (the power supply voltage), and no current flows. The LED is off. When a 0 is output on the pin, current is drawn through the resistor and the LED, and through the pin to ground. This causes the LED to be illuminated. Selection of the resistor is not critical, but it must be small enough to light the LED without allowing enough current to destroy either the LED or the GPIO circuitry. This is typically around 1 kΩ. Note that, in general, GPIO pins can sink more current than they can source, so it is most common to connect LEDs and other devices in the way shown.

The Broadcom BCM2835 system-on-chip contains 54 GPIO pins that are split into two banks. The GPIO pins are named using the following format: GPIOx, where x is a number between 0 and 53. The GPIO pins are highly configurable. Each pin can be used for general purpose I/O, or can be configured to serve up to six pre-defined alternate functions. Configuring a GPIO pin for an alternate function usually allows some other device within the BCM2835 to use the pin. For example, GPIO4 can be used
• to send the signal generated by General Purpose Clock 0 to external devices,
• to send bit one of the Secondary Address Bus to external devices, or
• to receive JTAG data for programming the firmware of the device.
The last eight GPIO pins, GPIO46–GPIO53, have no alternate functions, and are used only for GPIO.
In addition to the alternate function, all GPIO pins can be configured individually as input or output. When configured as input, a pin can also be configured to detect when the signal changes, and to send an interrupt to the ARM CPU. Each input pin also has internal pull-up and pull-down resistors, which can be enabled or disabled by the programmer.
The GPIO pins on the BCM2835 SOC are very flexible and are quite complex, but are well designed and not difficult to program, once the programmer understands how the pins operate and what the various registers do. There are 41 registers that control the GPIO pins. The base address for the GPIO device is 20200000₁₆. The 41 registers and their offsets from the base address are shown in Table 11.1.
Table 11.1
Raspberry Pi GPIO register map
| Offset | Name | Description | Size | R/W |
| 00₁₆ | GPFSEL0 | GPIO Function Select 0 | 32 | R/W |
| 04₁₆ | GPFSEL1 | GPIO Function Select 1 | 32 | R/W |
| 08₁₆ | GPFSEL2 | GPIO Function Select 2 | 32 | R/W |
| 0C₁₆ | GPFSEL3 | GPIO Function Select 3 | 32 | R/W |
| 10₁₆ | GPFSEL4 | GPIO Function Select 4 | 32 | R/W |
| 14₁₆ | GPFSEL5 | GPIO Function Select 5 | 32 | R/W |
| 1C₁₆ | GPSET0 | GPIO Pin Output Set 0 | 32 | W |
| 20₁₆ | GPSET1 | GPIO Pin Output Set 1 | 32 | W |
| 28₁₆ | GPCLR0 | GPIO Pin Output Clear 0 | 32 | W |
| 2C₁₆ | GPCLR1 | GPIO Pin Output Clear 1 | 32 | W |
| 34₁₆ | GPLEV0 | GPIO Pin Level 0 | 32 | R |
| 38₁₆ | GPLEV1 | GPIO Pin Level 1 | 32 | R |
| 40₁₆ | GPEDS0 | GPIO Pin Event Detect Status 0 | 32 | R/W |
| 44₁₆ | GPEDS1 | GPIO Pin Event Detect Status 1 | 32 | R/W |
| 4C₁₆ | GPREN0 | GPIO Pin Rising Edge Detect Enable 0 | 32 | R/W |
| 50₁₆ | GPREN1 | GPIO Pin Rising Edge Detect Enable 1 | 32 | R/W |
| 58₁₆ | GPFEN0 | GPIO Pin Falling Edge Detect Enable 0 | 32 | R/W |
| 5C₁₆ | GPFEN1 | GPIO Pin Falling Edge Detect Enable 1 | 32 | R/W |
| 64₁₆ | GPHEN0 | GPIO Pin High Detect Enable 0 | 32 | R/W |
| 68₁₆ | GPHEN1 | GPIO Pin High Detect Enable 1 | 32 | R/W |
| 70₁₆ | GPLEN0 | GPIO Pin Low Detect Enable 0 | 32 | R/W |
| 74₁₆ | GPLEN1 | GPIO Pin Low Detect Enable 1 | 32 | R/W |
| 7C₁₆ | GPAREN0 | GPIO Pin Async. Rising Edge Detect 0 | 32 | R/W |
| 80₁₆ | GPAREN1 | GPIO Pin Async. Rising Edge Detect 1 | 32 | R/W |
| 88₁₆ | GPAFEN0 | GPIO Pin Async. Falling Edge Detect 0 | 32 | R/W |
| 8C₁₆ | GPAFEN1 | GPIO Pin Async. Falling Edge Detect 1 | 32 | R/W |
| 94₁₆ | GPPUD | GPIO Pin Pull-up/down Enable | 32 | R/W |
| 98₁₆ | GPPUDCLK0 | GPIO Pin Pull-up/down Enable Clock 0 | 32 | R/W |
| 9C₁₆ | GPPUDCLK1 | GPIO Pin Pull-up/down Enable Clock 1 | 32 | R/W |

The first six 32-bit registers in the device select the function for each of the 54 GPIO pins. The function of each pin is controlled by a group of three bits in one of these registers, and the mapping is very regular. Bits 0–2 of GPFSEL0 control the function of GPIO pin 0, bits 3–5 of GPFSEL0 control GPIO pin 1, and so on, up to bits 27–29 of GPFSEL0, which control GPIO pin 9. The next pin, pin 10, is controlled by bits 0–2 of GPFSEL1, and the pins are assigned in sequence through the remaining bits, until bits 27–29, which control GPIO pin 19. The remaining four GPFSEL registers control the remaining GPIO pins in the same way. Note that bits 30 and 31 of all of the GPFSEL registers are unused, and most of the bits in GPFSEL5 are not assigned to any pin. The meaning of each combination of the three bits is shown in Table 11.2. Note that the encoding is not as simple as one might expect.
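The register/bit arithmetic just described can be captured in a few small helpers. This is a sketch in C, not code from the book, and the helper names are ours:

```c
#include <assert.h>
#include <stdint.h>

/* Each GPFSEL register holds ten 3-bit function fields in bits 0-29. */
static inline int gpfsel_index(int pin)      { return pin / 10; }       /* which GPFSEL register, 0..5 */
static inline int gpfsel_shift(int pin)      { return (pin % 10) * 3; } /* bit position of the field    */
static inline uint32_t gpfsel_offset(int pin){ return 4u * (uint32_t)(pin / 10); } /* byte offset      */
```

For example, GPIO pin 19 maps to GPFSEL1, bits 27–29, matching the text above.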
Table 11.2
GPIO pin function select bits
| MSB-LSB | Function |
| 000 | Pin is an input |
| 001 | Pin is an output |
| 100 | Pin performs alternate function 0 |
| 101 | Pin performs alternate function 1 |
| 110 | Pin performs alternate function 2 |
| 111 | Pin performs alternate function 3 |
| 011 | Pin performs alternate function 4 |
| 010 | Pin performs alternate function 5 |
The procedure for setting the function of a GPIO pin is as follows:
• Determine which GPIOFSEL register controls the desired pin.
• Determine which bits of the GPIOFSEL register are used.
• Determine what the bit pattern should be.
• Read the GPIOFSEL register.
• Clear the correct bits using the bic instruction.
• Set them to the correct pattern using the orr instruction.
For example, Listing 11.3 shows the sequence of code which would be used to set GPIO pin 26 to alternate function 1.
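The book's listing is in assembly; the same read-modify-write sequence can be sketched in C as follows. The `gpio` pointer is an assumption: it is taken to point at the mapped GPIO register block (e.g. obtained by mmap()ing physical address 20200000₁₆, as described in Section 11.1). GPIO 26 lives in GPFSEL2, bits 18–20, and alternate function 1 is code 101 per Table 11.2:

```c
#include <stdint.h>

/* Sketch: set GPIO 26 to alternate function 1 without disturbing
 * the other function-select fields in GPFSEL2. */
void gpio26_alt1(volatile uint32_t *gpio)
{
    uint32_t fsel = gpio[2];   /* read GPFSEL2 (controls pins 20-29)      */
    fsel &= ~(7u << 18);       /* clear bits 18-20 (the bic step)         */
    fsel |=  (5u << 18);       /* 101 = alternate function 1 (the orr step) */
    gpio[2] = fsel;            /* write the register back                 */
}
```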

To use a GPIO pin for output, the function select bits for that pin must be set to 001. Once that is done, the output can be driven high or low by using the GPSET and GPCLR registers. GPIO pin 0 is set to a high output by writing a 1 to bit 0 of GPSET0, and it is set to low output by writing a 1 to bit 0 of GPCLR0. GPIO pin 1 is similarly controlled by bit 1 in GPSET0 and GPCLR0. Each of the GPIO pins numbered 0 through 31 is assigned one bit in GPSET0 and one bit in GPCLR0. GPIO pin 32 is assigned to bit 0 of GPSET1 and GPCLR1, GPIO pin 33 is assigned to bit 1 of GPSET1 and GPCLR1, and so on. Since there are only 54 GPIO pins, bits 22–31 of GPSET1 and GPCLR1 are not used. The programmer can set or clear several outputs simultaneously by writing the appropriate bits in the GPSET and GPCLR registers.
To use a GPIO pin for input, the function select bits for that pin must be set to 000. Once that is done, the input can be read at any time by reading the appropriate GPLEV register and examining the bit that corresponds with the input pin. GPIO pin 0 is read as bit 0 of GPLEV0, and GPIO pin 1 is similarly read as bit 1 of GPLEV0. Each of the GPIO pins numbered 0 through 31 is assigned one bit in GPLEV0. GPIO pin 32 is assigned to bit 0 of GPLEV1, GPIO pin 33 is assigned to bit 1 of GPLEV1, and so on. Since there are only 54 GPIO pins, bits 22–31 of GPLEV1 are not used. The programmer can read the status of several inputs simultaneously by reading one of the GPLEV registers and examining the bits corresponding to the appropriate pins.
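The output and input accesses described above can be sketched in C using word offsets derived from Table 11.1. As before, `gpio` is assumed to map the register block, and the helper names are ours:

```c
#include <stdint.h>

/* Word (32-bit) indices of the first register in each pair, per Table 11.1. */
enum { GPSET0 = 0x1C / 4, GPCLR0 = 0x28 / 4, GPLEV0 = 0x34 / 4 };

/* Drive an output pin high or low; writing a 1 to GPSET/GPCLR affects
 * only that pin, so no read-modify-write is needed. */
void pin_high(volatile uint32_t *gpio, int pin) { gpio[GPSET0 + pin / 32] = 1u << (pin % 32); }
void pin_low (volatile uint32_t *gpio, int pin) { gpio[GPCLR0 + pin / 32] = 1u << (pin % 32); }

/* Read the current level of a pin from the GPLEV registers. */
int pin_read(volatile uint32_t *gpio, int pin)
{
    return (int)((gpio[GPLEV0 + pin / 32] >> (pin % 32)) & 1u);
}
```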
Input pins can be configured with internal pull-up or pull-down resistors. This can simplify the design of the system. For instance, Fig. 11.2A shows a push-button switch connected to an input, with an external pull-up resistor. That resistor is unnecessary if the internal pull-up for that pin is enabled.
Enabling the pull-up or pull-down is a two-step process. The first step is to configure the type of change to be made, and the second step is to perform that change on the selected pin(s). The first step is accomplished by writing to the GPPUD register. The valid binary control codes are shown in Table 11.3.
Table 11.3
GPPUD control codes
| Code | Function |
| 00 | Disable pull-up and pull-down |
| 01 | Enable pull-down |
| 10 | Enable pull-up |
Once the GPPUD register is configured, the selected operation can be performed on multiple pins by writing to one or both of the GPPUDCLK registers. GPIO pins are assigned to bits in these two registers in the same way as the pins are assigned in the GPLEV, GPSET, and GPCLR registers. Writing 1 to bit 0 of GPPUDCLK0 will configure the pull-up or pull-down for GPIO pin 0, according to the control code that is currently in the GPPUD register.
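The two-step procedure can be sketched in C. One detail not mentioned above: the BCM2835 ARM Peripherals manual calls for short setup/hold waits (around 150 base clock cycles) before and after the GPPUDCLK write, which this sketch approximates with busy loops:

```c
#include <stdint.h>

enum { GPPUD = 0x94 / 4, GPPUDCLK0 = 0x98 / 4 };
enum { PUD_OFF = 0, PUD_DOWN = 1, PUD_UP = 2 };   /* codes from Table 11.3 */

/* Sketch: apply pull-up/-down control code `code` to one pin. */
void set_pull(volatile uint32_t *gpio, int pin, uint32_t code)
{
    gpio[GPPUD] = code;                           /* step 1: choose the change     */
    for (volatile int i = 0; i < 150; i++) ;      /* required setup time           */
    gpio[GPPUDCLK0 + pin / 32] = 1u << (pin % 32);/* step 2: clock it into the pin */
    for (volatile int i = 0; i < 150; i++) ;      /* required hold time            */
    gpio[GPPUD] = 0;                              /* remove the control signal     */
    gpio[GPPUDCLK0 + pin / 32] = 0;               /* remove the clock              */
}
```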
The GPEDS registers are used for detecting events that have occurred on the GPIO pins. For instance, a pin may have transitioned from low to high, and back to low. If the CPU does not read the GPLEV register often enough, then such an event could be missed. The GPEDS registers can be configured to capture such events so that the CPU can detect that they occurred.
GPIO pins are assigned to bits in these two registers in the same way as the pins are assigned in the GPLEV, GPSET, and GPCLR registers. If bit 0 of GPEDS0 is set, then an event has occurred on GPIO pin 0. Writing a 1 to that bit will clear it and allow the event detector to detect another event. Each pin can be configured to detect specific types of events by writing to the GPREN, GPHEN, GPLEN, GPAREN, and GPAFEN registers. For more information, refer to the BCM2835 ARM Peripherals manual.
The Raspberry Pi provides access to several of the 54 GPIO pins through the expansion header. The expansion header is a group of physical pins located in the corner of the Raspberry Pi board. Fig. 11.3 shows where the header is located on the Raspberry Pi. Wires can be connected to these pins and then the GPIO device can be programmed to send and/or receive digital information. Fig. 11.4 shows which signals are attached to the various pins. Some of the pins are used to provide power and ground to the external devices.


Table 11.4 shows some useful alternate functions available on each pin of the Raspberry Pi expansion header. Many of the alternate functions available on these pins are not generally useful, so they have been omitted from the table. The most useful alternate functions are probably those on GPIO 14 and 15, which can be used for serial communication, and on GPIO 18, which can be used for pulse width modulation. Pulse width modulation is covered in Section 12.2, and serial communication is covered in Section 13.2. The Serial Peripheral Interface (SPI) functions could also be useful for connecting the Raspberry Pi to other devices which support SPI. Also, the SDA and SCL functions could be used to communicate with I2C devices.
The AllWinner A10/A20 system-on-chip contains 175 GPIO pins, which are arranged in nine ports. Each port is identified by a letter between “A” and “I.” The ports are part of the PIO device, which is mapped at address 01C20800₁₆. The GPIO pins are named using the following format: PNx, where N is a letter between “A” and “I” indicating the port, and x is a number indicating a pin on the given port. The assignment of pins to ports is somewhat irregular, as shown in Table 11.5. Some ports have as many as 28 physical pins, while others have as few as six. However, the layout of the registers in the device is very regular. Given any port and pin combination, finding the correct registers and sets of bits within the registers is very straightforward.
Table 11.5
Number of pins available on each of the AllWinner A10/A20 PIO ports
| Port | Pins |
| A | 18 |
| B | 24 |
| C | 25 |
| D | 28 |
| E | 12 |
| F | 6 |
| G | 12 |
| H | 28 |
| I | 22 |
Each of the nine ports is controlled by a set of nine registers, for a total of 81 registers. There are seven additional registers that can be used to configure pins as interrupt sources. Interrupt processing is explained in Section 14.2. All of the port and interrupt registers together make a total of 88 registers for the GPIO device. The complete register map with the offset of each register from the device base address is shown in Table 11.6.
Table 11.6
Registers in the AllWinner GPIO device
| Offset | Name | Description |
| 000₁₆ | PA_CFG0 | Function select for Port A, Pins 0–7 |
| 004₁₆ | PA_CFG1 | Function select for Port A, Pins 8–15 |
| 008₁₆ | PA_CFG2 | Function select for Port A, Pins 16–17 |
| 00C₁₆ | PA_CFG3 | Not used |
| 010₁₆ | PA_DAT | Port A Data Register |
| 014₁₆ | PA_DRV0 | Port A Multi-driving, Pins 0–15 |
| 018₁₆ | PA_DRV1 | Port A Multi-driving, Pins 16–17 |
| 01C₁₆ | PA_PULL0 | Port A Pull-Up/-Down, Pins 0–15 |
| 020₁₆ | PA_PULL1 | Port A Pull-Up/-Down, Pins 16–17 |
| 024₁₆ | PB_CFG0 | Function select for Port B, Pins 0–7 |
| 028₁₆ | PB_CFG1 | Function select for Port B, Pins 8–15 |
| 02C₁₆ | PB_CFG2 | Function select for Port B, Pins 16–23 |
| 030₁₆ | PB_CFG3 | Not used |
| 034₁₆ | PB_DAT | Port B Data Register |
| 038₁₆ | PB_DRV0 | Port B Multi-driving, Pins 0–15 |
| 03C₁₆ | PB_DRV1 | Port B Multi-driving, Pins 16–23 |
| 040₁₆ | PB_PULL0 | Port B Pull-Up/-Down, Pins 0–15 |
| 044₁₆ | PB_PULL1 | Port B Pull-Up/-Down, Pins 16–23 |
| 048₁₆ | PC_CFG0 | Function select for Port C, Pins 0–7 |
| 04C₁₆ | PC_CFG1 | Function select for Port C, Pins 8–15 |
| 050₁₆ | PC_CFG2 | Function select for Port C, Pins 16–23 |
| 054₁₆ | PC_CFG3 | Function select for Port C, Pin 24 |
| 058₁₆ | PC_DAT | Port C Data Register |
| 05C₁₆ | PC_DRV0 | Port C Multi-driving, Pins 0–15 |
| 060₁₆ | PC_DRV1 | Port C Multi-driving, Pins 16–23 |
| 064₁₆ | PC_PULL0 | Port C Pull-Up/-Down, Pins 0–15 |
| 068₁₆ | PC_PULL1 | Port C Pull-Up/-Down, Pins 16–23 |
| 06C₁₆ | PD_CFG0 | Function select for Port D, Pins 0–7 |
| 070₁₆ | PD_CFG1 | Function select for Port D, Pins 8–15 |
| 074₁₆ | PD_CFG2 | Function select for Port D, Pins 16–23 |
| 078₁₆ | PD_CFG3 | Function select for Port D, Pins 24–27 |
| 07C₁₆ | PD_DAT | Port D Data Register |
| 080₁₆ | PD_DRV0 | Port D Multi-driving, Pins 0–15 |
| 084₁₆ | PD_DRV1 | Port D Multi-driving, Pins 16–27 |
| 088₁₆ | PD_PULL0 | Port D Pull-Up/-Down, Pins 0–15 |
| 08C₁₆ | PD_PULL1 | Port D Pull-Up/-Down, Pins 16–27 |
| 090₁₆ | PE_CFG0 | Function select for Port E, Pins 0–7 |
| 094₁₆ | PE_CFG1 | Function select for Port E, Pins 8–11 |
| 098₁₆ | PE_CFG2 | Not used |
| 09C₁₆ | PE_CFG3 | Not used |
| 0A0₁₆ | PE_DAT | Port E Data Register |
| 0A4₁₆ | PE_DRV0 | Port E Multi-driving, Pins 0–11 |
| 0A8₁₆ | PE_DRV1 | Not used |
| 0AC₁₆ | PE_PULL0 | Port E Pull-Up/-Down, Pins 0–11 |
| 0B0₁₆ | PE_PULL1 | Not used |
| 0B4₁₆ | PF_CFG0 | Function select for Port F, Pins 0–5 |
| 0B8₁₆ | PF_CFG1 | Not used |
| 0BC₁₆ | PF_CFG2 | Not used |
| 0C0₁₆ | PF_CFG3 | Not used |
| 0C4₁₆ | PF_DAT | Port F Data Register |
| 0C8₁₆ | PF_DRV0 | Port F Multi-driving, Pins 0–5 |
| 0CC₁₆ | PF_DRV1 | Not used |
| 0D0₁₆ | PF_PULL0 | Port F Pull-Up/-Down, Pins 0–5 |
| 0D4₁₆ | PF_PULL1 | Not used |
| 0D8₁₆ | PG_CFG0 | Function select for Port G, Pins 0–7 |
| 0DC₁₆ | PG_CFG1 | Function select for Port G, Pins 8–11 |
| 0E0₁₆ | PG_CFG2 | Not used |
| 0E4₁₆ | PG_CFG3 | Not used |
| 0E8₁₆ | PG_DAT | Port G Data Register |
| 0EC₁₆ | PG_DRV0 | Port G Multi-driving, Pins 0–11 |
| 0F0₁₆ | PG_DRV1 | Not used |
| 0F4₁₆ | PG_PULL0 | Port G Pull-Up/-Down, Pins 0–11 |
| 0F8₁₆ | PG_PULL1 | Not used |
| 0FC₁₆ | PH_CFG0 | Function select for Port H, Pins 0–7 |
| 100₁₆ | PH_CFG1 | Function select for Port H, Pins 8–15 |
| 104₁₆ | PH_CFG2 | Function select for Port H, Pins 16–23 |
| 108₁₆ | PH_CFG3 | Function select for Port H, Pins 24–27 |
| 10C₁₆ | PH_DAT | Port H Data Register |
| 110₁₆ | PH_DRV0 | Port H Multi-driving, Pins 0–15 |
| 114₁₆ | PH_DRV1 | Port H Multi-driving, Pins 16–27 |
| 118₁₆ | PH_PULL0 | Port H Pull-Up/-Down, Pins 0–15 |
| 11C₁₆ | PH_PULL1 | Port H Pull-Up/-Down, Pins 16–27 |
| 120₁₆ | PI_CFG0 | Function select for Port I, Pins 0–7 |
| 124₁₆ | PI_CFG1 | Function select for Port I, Pins 8–15 |
| 128₁₆ | PI_CFG2 | Function select for Port I, Pins 16–21 |
| 12C₁₆ | PI_CFG3 | Not used |
| 130₁₆ | PI_DAT | Port I Data Register |
| 134₁₆ | PI_DRV0 | Port I Multi-driving, Pins 0–15 |
| 138₁₆ | PI_DRV1 | Port I Multi-driving, Pins 16–21 |
| 13C₁₆ | PI_PULL0 | Port I Pull-Up/-Down, Pins 0–15 |
| 140₁₆ | PI_PULL1 | Port I Pull-Up/-Down, Pins 16–21 |
| 200₁₆ | PIO_INT_CFG0 | PIO Interrupt Configure Register 0 |
| 204₁₆ | PIO_INT_CFG1 | PIO Interrupt Configure Register 1 |
| 208₁₆ | PIO_INT_CFG2 | PIO Interrupt Configure Register 2 |
| 20C₁₆ | PIO_INT_CFG3 | PIO Interrupt Configure Register 3 |
| 210₁₆ | PIO_INT_CTL | PIO Interrupt Control Register |
| 214₁₆ | PIO_INT_STATUS | PIO Interrupt Status Register |
| 218₁₆ | PIO_INT_DEB | PIO Interrupt Debounce Register |


The GPIO pins are highly configurable. Each pin can be used for general purpose I/O, or can be configured to serve one of up to six pre-defined alternate functions. Configuring a GPIO pin for an alternate function usually allows some other device within the A10/A20 SOC to use the pin. For example, PB2 (pin 2 of port B) can be used for general purpose I/O, or can be used to output the signal from a Pulse Width Modulator (PWM) device (explained in Section 12.2). Each input pin also has internal pull-up and pull-down resistors, which can be enabled or disabled by the programmer.
The first four registers for each port are used to configure the functions for each of the pins. The function of each pin is controlled by three bits in one of the four configuration registers. Pins 0–7 are controlled using configuration register 0. Pins 8–15 are controlled by configuration register 1, and so on. The assignment of pins to control bits is shown in Fig. 11.5. Note that eight pins are controlled by each register, and there is an unused bit between each group of three bits.

Each GPIO pin can be configured by writing a 3-bit code to the appropriate location in the correct port configuration register. The meaning of each possible code is shown in Table 11.7. For example, to configure port A, pin 10 (PA10) for output, the 3-bit code 001 must be written to bits 8–10 of the PA_CFG1 register, without changing any other bits in the register. Listing 11.4 shows how this operation can be accomplished.
Table 11.7
Allwinner A10/A20 GPIO pin function select bits
| MSB-LSB | Function |
| 000 | Pin is an input |
| 001 | Pin is an output |
| 010 | Pin performs alternate function 0 |
| 011 | Pin performs alternate function 1 |
| 100 | Pin performs alternate function 2 |
| 101 | Pin performs alternate function 3 |
| 110 | Pin performs alternate function 4 |
| 111 | Pin performs alternate function 5 |
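The operation of Listing 11.4 (configuring PA10 as an output) can be sketched in C. The `pio` pointer is an assumption: it is taken to point at the mapped PIO register block at 01C20800₁₆. PA10 falls in PA_CFG1 (pins 8–15, four bits per pin), so its field occupies bits 8–10:

```c
#include <stdint.h>

/* Sketch: set PA10 to function code 001 (output) via read-modify-write. */
void pa10_output(volatile uint32_t *pio)
{
    uint32_t cfg = pio[0x04 / 4];  /* PA_CFG1 controls port A pins 8-15 */
    cfg &= ~(7u << 8);             /* clear the 3-bit field for pin 10  */
    cfg |=  (1u << 8);             /* 001 = pin is an output            */
    pio[0x04 / 4] = cfg;
}
```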

An output pin can be set to a high state by setting the corresponding bit in the correct port data register. Likewise, the pin can be set to a low state by clearing its corresponding bit. Care must be taken to avoid changing any other bits in the port data register. Listing 11.5 shows how this operation can be accomplished for driving a pin high. To drive the pin low, the orr instruction would be replaced with a bic instruction.
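In C, the same read-modify-write looks like the sketch below. Note the contrast with the BCM2835: the AllWinner part has a single data register per port, so the programmer must preserve the other bits, whereas GPSET/GPCLR writes on the BCM2835 affect only the written bits. The helper names are ours, and `dat` is assumed to point at a port's _DAT register:

```c
#include <stdint.h>

/* Sketch: drive one pin of an AllWinner port high or low. */
void port_pin_high(volatile uint32_t *dat, int pin) { *dat |=  (1u << pin); } /* orr */
void port_pin_low (volatile uint32_t *dat, int pin) { *dat &= ~(1u << pin); } /* bic */
```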

To determine the current state of an output pin or read an input pin, the programmer can read the contents of the correct port data register and use bitwise logical operations to isolate the appropriate bit. For example, to read the state of pin 14 of port I (PI14), the programmer would read the PI_DAT register and mask all bits except bit 14. Listing 11.6 shows how this operation can be accomplished. Another method would be to use the tst instruction, rather than the ands instruction, to set the CPSR flags.
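A C sketch of Listing 11.6's operation follows (the function name is ours; `pio` is assumed to map the PIO register block). PI_DAT sits at offset 130₁₆, per Table 11.6:

```c
#include <stdint.h>

/* Sketch: return nonzero if PI14 is high, by masking bit 14 of PI_DAT. */
int read_pi14(volatile uint32_t *pio)
{
    return (pio[0x130 / 4] & (1u << 14)) != 0;
}
```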

Input pins can be configured with internal pull-up or pull-down resistors. This can simplify the design of the system. For instance, Fig. 11.2A shows a push-button switch connected to an input with an external pull-up resistor. That resistor is unnecessary if the internal pull-up for that pin is enabled. Each pin is assigned two bits in one of the port pull-up/-down registers. The pull-up and pull-down resistors for pin 0 on port B are controlled using bits 0 and 1 of the PB_PULL0 register. Likewise, the pull-up and pull-down resistors for pin 19 of port C are controlled using bits 6 and 7 of the PC_PULL1 register. Table 11.8 shows the bit patterns used to configure the pull-up and pull-down resistors for a pin.
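The two-bits-per-pin arithmetic can be captured with a pair of hypothetical helpers (a sketch; each _PULL register covers 16 pins):

```c
#include <assert.h>

/* Which _PULL register (0 or 1) and which bit pair control a given pin. */
static inline int pull_reg  (int pin) { return pin / 16; }
static inline int pull_shift(int pin) { return (pin % 16) * 2; }
```

As a check against the text: pin 0 of port B maps to PB_PULL0, bits 0–1, and pin 19 of port C maps to PC_PULL1, bits 6–7.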
When configured as an input, most of the pins on the pcDuino can be configured to generate an interrupt, which notifies the CPU that an event has occurred. Configuration of interrupts is beyond the scope of this chapter. It is accomplished using the PIO_INT registers.
The pcDuino provides access to several of the 175 GPIO pins through the expansion headers. Fig. 11.6 shows where the headers are located on the pcDuino. Wires can be plugged into the holes in these headers and then the GPIO device can be programmed to send and/or receive digital and/or analog signals. The physical layout of the pcDuino header makes it compatible with a wide range of expansion modules designed for the Arduino family of microcontroller boards.

Some of the header holes can provide power and ground to the external devices. Analog signals can be read into the pcDuino using the ADC header connections. Fig. 11.7 shows the pcDuino names for the signals that are available on the headers. Table 11.9 shows how the pcDuino header signal names are mapped to the actual port pins on the AllWinner A10/A20 chip. It also shows the most useful alternate functions available on each of the pins. Many alternate functions are left out of the table because they are not generally useful. Note that the pcDuino and the Raspberry Pi both provide pins to perform PWM, UART communications, and SPI.

Table 11.9
pcDuino GPIO pins and function select code assignments.
| pcDuino Pin Name | Port | Pin | Code 010 | Code 011 | Code 100 | Code 110 |
| UART-Rx(GPIO0) | I | 19 | UART2_RX | EINT31 | ||
| UART-Tx(GPIO1) | I | 18 | UART2_TX | EINT30 | ||
| GPIO3(GPIO2) | H | 7 | UART5_RX | EINT7 | ||
| PWM0(GPIO3) | H | 6 | UART5_TX | EINT6 | ||
| GPIO4 | H | 8 | EINT8 | |||
| PWM1(GPIO5) | B | 2 | PWM0 | |||
| PWM2(GPIO6) | I | 3 | PWM1 | |||
| GPIO7 | H | 9 | EINT9 | |||
| GPIO8 | H | 10 | EINT10 | |||
| PWM3(GPIO9) | H | 5 | EINT5 | |||
| SPI_CS(GPIO10) | I | 10 | SPI0_CS0 | UART5_TX | EINT22 | |
| SPI_MOSI(GPIO11) | I | 12 | SPI0_MOSI | UART6_TX | CLK_OUT_A | EINT24 |
| SPI_MISO(GPIO12) | I | 13 | SPI0_MISO | UART6_RX | CLK_OUT_B | EINT25 |
| SPI_CLK(GPIO13) | I | 11 | SPI0_CLK | UART5_RX | EINT23 | |

All input and output are accomplished by using devices. There are many types of devices, and each device has its own set of registers which are used to control it. The programmer must understand the operation of the device and the use of each register in order to use the device at a low level. Computer system manufacturers usually provide documentation containing the information necessary for low-level programming. The quality of that documentation varies greatly, and a general understanding of various types of devices can help in deciphering poor or incomplete documentation.
There are two major situations in which programming devices at the register level is required: writing operating system drivers and programming very small embedded systems. Operating systems provide an abstract view of each device, which allows programmers to use devices more easily. However, someone must write each driver, and that person must have intimate knowledge of the device. On very small systems, there may not be a driver available; in that case, the device must be accessed directly. Even when an operating system provides a driver, it is sometimes necessary or desirable for the programmer to access the device directly. For example, some devices may provide modes of operation or capabilities that are not supported by the operating system driver. Linux provides a mechanism which allows the programmer to map a physical device into the program’s memory space, thereby gaining access to the raw device registers.
11.1 Explain the relationships and differences between device registers, memory locations, and CPU registers.
11.2 Why is it necessary to map the device into user program memory before accessing it under Linux? Would this step be necessary under all operating systems or in the case where there is no operating system and our code is running on the “bare metal?”
11.3 What is the purpose of a GPIO device?
11.4 The Raspberry Pi and the pcDuino have very different GPIO devices.
(a) Are they functionally equivalent?
(b) Are they equally programmer-friendly?
(c) If you have answered no to either of the previous questions, then what are the differences?
11.5 Draw a circuit diagram showing how to connect:
(a) a pushbutton switch to GPIO 23 and an LED to GPIO 27 on the Raspberry Pi, and
(b) a pushbutton switch to GPIO12 and an LED to GPIO13 on the pcDuino.
11.6 Assuming the systems are wired according to the previous exercise, write two functions. One function must initialize the GPIO pins, and the other function must read the state of the switch and turn the LED on if the button is pressed, and off if the button is not pressed. Write the two functions for
(a) a Raspberry Pi, and
(b) a pcDuino.
11.7 Write the code necessary to route the output from PWM0 to GPIO 18 on a Raspberry Pi.
11.8 Write the code necessary to route the output from PWM0 to GPIO 5 on a pcDuino.
This chapter begins by explaining pulse density and pulse width modulation in general terms. It then introduces and describes the PWM device on the Raspberry Pi. Following that, it covers the pcDuino PWM device. This gives the reader another opportunity to see two different devices which both perform essentially the same functions.
Pulse width modulation; Pulse density modulation; Digital to analog; Low pass filter
The GPIO device provides a method for sending digital signals to external devices. This is useful for controlling devices that have only two states: on and off. In some situations, however, it is useful to be able to turn a device on at varying levels. For instance, it could be useful to run a motor at any required speed, or to control the brightness of a light source. One way that this can be accomplished is through pulse modulation.
The basic idea is that the computer sends a stream of pulses to the device. The device acts as a low-pass filter, which averages the digital pulses into an analog voltage. By varying the percentage of time that the pulses are high, versus low, the computer can control how much average energy is sent to the device. The percentage of time that the pulses are high versus low is known as the duty cycle. Varying the duty cycle is referred to as modulation. There are two major types of pulse modulation: pulse density modulation (PDM) and pulse width modulation (PWM). Most pulse modulation devices are configured in three steps as follows:
1. The base frequency of the clock that drives the PWM device is configured. This step is usually optional.
2. The mode of operation for the pulse modulation device is configured by writing to one or more configuration registers in the pulse modulation device.
3. The cycle time is set by writing a “range” value into a register in the pulse modulation device. This value is usually set as a multiple of the base clock cycle time.
Once the device is configured, the duty cycle can be changed easily by writing to one or more registers in the pulse modulation device.
With PDM, also known as pulse frequency modulation (PFM), the duration of the positive pulses does not change, but the time between them (the pulse density) is modulated. When using PDM devices, the programmer typically sets the device cycle time tc in a register, then uses another register to specify the number of pulses d that are to be sent during a device cycle. The number of pulses is typically referred to as the duty cycle and must be chosen such that 0 ≤ d ≤ tc. For instance, if tc = 1024, then the device cycle time is 1024 times the cycle time of the clock that drives the device. If d = 512, then the device will send 512 pulses, evenly spaced, during the device cycle. Each pulse will have the same duration as the base clock. The device will continue to output this pulse pattern until d is changed.
Fig. 12.1 shows a signal that is being sent using PDM, and the resulting set of pulses. Each pulse transfers a fixed amount of energy to the device. When the pulses arrive at the device, they are effectively filtered using a low pass filter. The resulting received signal is also shown. Notice that the received signal has a delay, or phase shift, caused by the low-pass filtering. This approach is suitable for controlling certain types of devices, such as lights and speakers.

However, when driving such devices directly with the digital pulses, care must be taken that the minimum frequency of pulses remains above the threshold that can be detected by human senses. For instance, when driving a speaker, the minimum pulse frequency must be high enough that the individual pulses cannot be distinguished by the human ear. This minimum frequency is around 40 kHz. Likewise, when driving an LED directly, the minimum frequency must be high enough that the eye cannot detect the individual pulses, which would otherwise be seen as flicker. That minimum frequency is around 70 Hz. To reduce or alleviate this problem, designers may add a low-pass filter between the PWM device and the device that is being driven.
In PWM, the frequency of the pulses remains fixed, but the duration of the positive pulse (the pulse width) is modulated. When using PWM devices, the programmer typically sets the device cycle time tc in a register, then uses another register to specify the number of base clock cycles, d, for which the output should be high. The percentage d/tc × 100% is typically referred to as the duty cycle, and d must be chosen such that 0 ≤ d ≤ tc. For instance, if tc = 1024, then the device cycle time is 1024 times the cycle time of the clock that drives the device. If d = 512, then the device will output a high signal for 512 clock cycles, then output a low signal for 512 clock cycles. It will continue to repeat this pattern of pulses until d is changed.
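The duty-cycle arithmetic can be made concrete with a small helper (a sketch; the function name is ours):

```c
/* Percentage of each device cycle during which the PWM output is high.
 * `d` is the high time in base clock cycles, `tc` the device cycle time. */
static inline double duty_percent(unsigned d, unsigned tc)
{
    return 100.0 * (double)d / (double)tc;
}
```

With tc = 1024 and d = 512, as in the example above, this gives a 50% duty cycle.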
Fig. 12.2 shows a signal that is being sent using PWM. The pulses are also shown. Each pulse transfers some energy to the device. The width of each pulse determines how much energy is transferred. When the pulses arrive at the device, they are effectively filtered using a low-pass filter. The resulting received signal is shown by the dashed line. As with PDM, the received signal has a delay, or phase shift, caused by the low-pass filtering.

One advantage of PWM over PDM is that the digital circuit is not as complex. Another advantage of PWM over PDM is that the frequency of the pulses does not vary, so it is easier for the programmer to set the base frequency high enough that the individual pulses cannot be detected by human senses. Also, when driving motors it is usually necessary to match the pulse frequency to the size and type of motor. Mismatching the frequency can cause loss of efficiency as well as overheating of the motor and drive electronics. In severe cases, this can cause premature failure of the motor and/or drive electronics. With PWM, it is easier for the programmer to control the base frequency, and thereby avoid those problems.
The Broadcom BCM2835 system-on-chip includes a device that can create two PWM signals. One of the signals (PWM0) can be routed through GPIO pin 18 (alternate function 5), where it is available on the Raspberry Pi expansion header at pin 12. PWM0 can also be routed through GPIO pin 40, which on the Raspberry Pi is sent through a low-pass filter and then to the audio output port as the right stereo channel. The other signal (PWM1) can be routed through GPIO pin 45, and from there it is sent through a low-pass filter and then to the audio output port as the left stereo channel. So, both PWM channels are accessible, but PWM1 is only accessible through the audio output port after it has been low-pass filtered, while the raw PWM0 signal is available on the expansion header.
There are three modes of operation for the BCM2835 PWM device:
1. PDM mode,
2. PWM mode, and
3. serial transmission mode.
The following paragraphs explain how the device can be used in basic PWM mode, which is the simplest and most straightforward mode for this device. Information on how to use the PDM and serial transmission modes, the FIFO, and DMA is available in the BCM2835 ARM Peripherals manual.
The base address of the PWM device is 2020C000₁₆, and it contains eight registers. Table 12.1 shows the offset, name, and a short description for each of the registers. The mode of operation is selected for each channel independently by writing appropriate bits in the PWMCTL register. The base clock frequency is controlled by the clock manager device, which is explained in Section 13.1. By default, the system startup code sets the base clock for the PWM device to 100 MHz.
Table 12.1
Raspberry Pi PWM register map
| Offset | Name | Description | Size | R/W |
| 00₁₆ | PWMCTL | PWM Control | 32 | R/W |
| 04₁₆ | PWMSTA | PWM FIFO Status | 32 | R/W |
| 08₁₆ | PWMDMAC | PWM DMA Configuration | 32 | R/W |
| 10₁₆ | PWMRNG1 | PWM Channel 1 Range | 32 | R/W |
| 14₁₆ | PWMDAT1 | PWM Channel 1 Data | 32 | R/W |
| 18₁₆ | PWMFIF1 | PWM FIFO Input | 32 | R/W |
| 20₁₆ | PWMRNG2 | PWM Channel 2 Range | 32 | R/W |
| 24₁₆ | PWMDAT2 | PWM Channel 2 Data | 32 | R/W |

Table 12.2 shows the names and short descriptions of the bits in the PWMCTL register. There are 8 bits used for controlling channel 1 and 8 bits for controlling channel 2. PWENn is the master enable bit for channel n. Setting that bit to 0 disables the PWM channel, while setting it to 1 enables the channel. MODEn is used to select whether the channel is in serial transmission mode or in the PDM/PWM mode. If MODEn is set to 0, then MSENn is used to choose whether channel n is in PDM mode or PWM mode. If MODEn is set to 1, then RPTLn, SBITn, USEFn, and CLRFn are used to manage the operation of the FIFO for channel n. POLAn is used to enable or disable inversion of the output signal for channel n.
Table 12.2
Raspberry Pi PWM control register bits

The PWMRNGn registers are used to define the base period for the corresponding channel. In PDM mode, evenly distributed pulses are sent within a period of length defined by this register, and the number of pulses sent during the base period is controlled by writing to the corresponding PWMDATn register. In PWM mode, the PWMRNGn register defines the base frequency for the pulses, and the duty cycle is controlled by writing to the corresponding PWMDATn register. Example 12.1 gives an overview of the steps needed to configure PWM0 for use in PWM mode.
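Example 12.1's steps can be sketched in C. The `gpio` and `pwm` pointers are assumed to map the GPIO block (20200000₁₆) and PWM block (2020C000₁₆). The PWMCTL bit positions used here (PWEN1 = bit 0, MSEN1 = bit 7) come from the BCM2835 ARM Peripherals manual rather than the text above, so treat them as an assumption to verify:

```c
#include <stdint.h>

/* Sketch: route PWM0 to GPIO 18 (alternate function 5, code 010) and
 * enable channel 1 in PWM mode with the given range and data values. */
void pwm0_setup(volatile uint32_t *gpio, volatile uint32_t *pwm,
                uint32_t range, uint32_t data)
{
    uint32_t fsel = gpio[1];       /* GPFSEL1 controls GPIO 10-19       */
    fsel &= ~(7u << 24);           /* clear the field for GPIO 18       */
    fsel |=  (2u << 24);           /* 010 = alternate function 5        */
    gpio[1] = fsel;

    pwm[0x10 / 4] = range;         /* PWMRNG1: device cycle length      */
    pwm[0x14 / 4] = data;          /* PWMDAT1: high time (duty cycle)   */
    pwm[0] = (1u << 7) | 1u;       /* PWMCTL: MSEN1 (PWM mode) | PWEN1  */
}
```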
The AllWinner A10/A20 SOCs have a hardware PWM device which is capable of generating two PWM signals. The PWM device is driven by the OSC24M signal, which is generated by the Clock Control Unit (CCU) in the AllWinner SOC. This base clock runs at 24 MHz by default, and changing the base frequency could affect many other devices in the system. The base clock can be divided by one of 11 predefined values using a prescaler built into the PWM device. Each of the two channels has its own prescaler. Table 12.3 shows the possible settings for the prescalers.
Table 12.3
Prescaler bits in the pcDuino PWM device
| Value | Effect |
| 0000 | Base clock is divided by 120 |
| 0001 | Base clock is divided by 180 |
| 0010 | Base clock is divided by 240 |
| 0011 | Base clock is divided by 360 |
| 0100 | Base clock is divided by 480 |
| 0101,0110,0111 | Not used |
| 1000 | Base clock is divided by 1200 |
| 1001 | Base clock is divided by 2400 |
| 1010 | Base clock is divided by 3600 |
| 1011 | Base clock is divided by 4800 |
| 1100 | Base clock is divided by 7200 |
| 1101,1110 | Not used |
| 1111 | Base clock is divided by 1 |
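Since only eleven of the sixteen prescaler codes are defined, a lookup helper is a convenient way to map a divisor to its code. The function below is illustrative only, not part of any vendor API; it simply encodes Table 12.3.

```c
/* Map a desired clock divisor to the 4-bit prescaler code from Table 12.3.
 * Returns -1 if the divisor is not one of the supported values.
 * (pcduino_prescaler_code is an invented name, for illustration only.) */
static int pcduino_prescaler_code(unsigned divisor)
{
    switch (divisor) {
    case 120:  return 0x0;
    case 180:  return 0x1;
    case 240:  return 0x2;
    case 360:  return 0x3;
    case 480:  return 0x4;
    case 1200: return 0x8;
    case 2400: return 0x9;
    case 3600: return 0xA;
    case 4800: return 0xB;
    case 7200: return 0xC;
    case 1:    return 0xF;
    default:   return -1;   /* divisor not supported by the hardware */
    }
}
```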
There are two modes of operation for the PWM device. In the first mode, the device operates like a standard PWM device as described in Section 12.2. In the second mode, it sends a single pulse and then waits until it is triggered again by the CPU. In this mode, it is a monostable multivibrator, also known as a one-shot multivibrator, or just one-shot. The duration of the pulse is controlled using the pre-scaler and the period register.
The PWM device is mapped at address 01C20C00₁₆. Table 12.4 shows the registers and their offsets from the base address. All of the device configuration is done through a single control register, which can also be read in order to determine the status of the device. The bits in the control register are shown in Table 12.5.
Table 12.4
pcDuino PWM register map
| Offset | Name | Description |
| 200₁₆ | PWMCTL | PWM Control |
| 204₁₆ | PWM_CH0_PERIOD | PWM Channel 0 Period |
| 208₁₆ | PWM_CH1_PERIOD | PWM Channel 1 Period |
Table 12.5
pcDuino PWM control register bits
| Bit | Name | Description | Values |
| 3–0 | CH0_PRESCAL | Channel 0 Prescale | These bits must be set before PWM Channel 0 clock is enabled. See Table 12.3. |
| 4 | CH0_EN | Channel 0 Enable | 0: Channel disabled. 1: Channel enabled |
| 5 | CH0_ACT_STA | Channel 0 Polarity | 0: Channel is active low. 1: Channel is active high |
| 6 | SCLK_CH0_GATING | Channel 0 Clock | 0: Clock disabled. 1: Clock enabled |
| 7 | CH0_PUL_START | Start pulse | If configured for pulse mode, writing a 1 causes the PWM device to emit a single pulse. |
| 8 | PWM0_BYPASS | Bypass PWM | 0: Output PWM device signal. 1: Output base clock |
| 9 | SCLK_CH0_MODE | Select Mode | 0: PWM mode. 1: Pulse mode |
| 14–10 | – | Not used | |
| 18–15 | CH1_PRESCAL | Channel 1 Prescale | These bits must be set before PWM Channel 1 clock is enabled. See Table 12.3. |
| 19 | CH1_EN | Channel 1 Enable | 0: Channel disabled. 1: Channel enabled |
| 20 | CH1_ACT_STA | Channel 1 Polarity | 0: Channel is active low. 1: Channel is active high |
| 21 | SCLK_CH1_GATING | Channel 1 Clock | 0: Clock disabled. 1: Clock enabled |
| 22 | CH1_PUL_START | Start pulse | If configured for pulse mode, writing a 1 causes the PWM device to emit a single pulse. |
| 23 | PWM1_BYPASS | Bypass PWM | 0: Output PWM device signal. 1: Output base clock |
| 24 | SCLK_CH1_MODE | Select Mode | 0: PWM mode. 1: Pulse mode |
| 27–25 | – | Not used | |
| 28 | PWM0_RDY | CH0 Period Ready | 0: PWM0 Period register is ready. 1: PWM0 Period register is busy |
| 29 | PWM1_RDY | CH1 Period Ready | 0: PWM1 Period register is ready. 1: PWM1 Period register is busy |
| 31–30 | – | Not used | |

Before enabling a PWM channel, the period register for that channel should be initialized. The two period registers are each organized as two 16-bit numbers. The upper 16 bits control the total number of clock cycles in one period. In other words, they control the base frequency of the PWM signal. The PWM frequency is calculated as

PWM frequency = OSC24M ÷ (PSC × N),
where OSC24M is the frequency of the base clock (the default is 24 MHz), PSC is the prescale value set in the channel prescale bits in the PWM control register, and N is the value stored in the upper 16 bits of the channel period register.
The lower 16 bits of the channel period register control the duty cycle. The duty cycle (expressed as % of full on) can be calculated as

duty cycle = (D ÷ N) × 100%,
where N is the value stored in the upper 16 bits of the channel period register, and D is the value stored in the lower 16 bits of the channel period register. Note that the condition D ≤ N must always remain true. If the programmer allows D to become greater than N, the results are unpredictable.
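Rearranging the frequency formula gives N = OSC24M ÷ (PSC × f), and the duty-cycle formula then gives D. A small C sketch (the helper name is invented; the 24 MHz default base clock is from the text):

```c
#include <stdint.h>

#define OSC24M 24000000u  /* default base clock frequency, Hz */

/* Illustrative calculation of the upper (N) and lower (D) halves of an
 * A10/A20 channel period register, given a prescale divisor and the
 * desired PWM frequency and duty cycle.  Note that D <= N must hold. */
static void a10_pwm_period(uint32_t prescale, uint32_t freq_hz,
                           uint32_t duty_pct, uint32_t *n, uint32_t *d)
{
    *n = OSC24M / (prescale * freq_hz); /* clock cycles per PWM period   */
    *d = (*n * duty_pct) / 100;         /* cycles spent active; D <= N   */
}
```

With a prescale of 240 (a 100 kHz channel clock), a 1 kHz output at 25% duty cycle gives N = 100 and D = 25.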
The procedure for configuring the AllWinner A10/A20 PWM device is as follows:
1. Disable the desired channel:
(a) Read the PWM control register into x.
(b) Clear all of the bits in x for the desired PWM channel.
(c) Write x back to the PWM control register.
2. Initialize the period register for the desired channel.
(a) Calculate the desired value for N.
(b) Let D = 0.
(c) Let y = N × 2¹⁶ + D.
(d) Write y to the desired channel period register.
3. Set the prescaler.
(a) Select the four-bit code for the desired divisor from Table 12.3.
(b) Set the prescaler code bits in x.
(c) Write x back to the PWM control register.
4. Enable the PWM device.
(a) Set the appropriate bits in x to enable the desired channel, select the polarity, and enable the clock.
(b) Write x to the PWM control register.
Once the control register is configured, the duty cycle can be controlled by calculating a new value for D and then writing y = N × 2¹⁶ + D to the desired channel period register.
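The arithmetic in these steps can be captured in a couple of C helpers. This is a sketch of the bit packing only, using the field positions from Table 12.5; the function and macro names are invented for illustration, and no hardware registers are actually accessed.

```c
#include <stdint.h>

/* Pack N (base period) and D (duty) into the 32-bit value written to a
 * channel period register: y = N * 2^16 + D. */
static uint32_t a10_period_word(uint16_t n, uint16_t d)
{
    return ((uint32_t)n << 16) | d;
}

/* Control-register fields for channel 0, from Table 12.5. */
#define CH0_PRESCAL(code) ((uint32_t)(code) & 0xF) /* bits 3-0          */
#define CH0_EN            (1u << 4)                /* channel enable    */
#define CH0_ACT_STA       (1u << 5)                /* 1: active high    */
#define SCLK_CH0_GATING   (1u << 6)                /* clock enable      */

/* Steps 3 and 4: set the prescaler code, then enable the channel,
 * select the polarity, and enable the clock. */
static uint32_t a10_ch0_ctl(uint32_t prescale_code)
{
    return CH0_PRESCAL(prescale_code) | CH0_EN | CH0_ACT_STA | SCLK_CH0_GATING;
}
```

For example, `a10_period_word(100, 25)` yields 00640019₁₆, and `a10_ch0_ctl(2)` (prescale code 2, i.e. divide by 240) yields 72₁₆.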
Pulse modulation is a group of methods for generating analog signals using digital equipment, and is commonly used in control systems to regulate the power sent to motors and other devices. Pulse modulation techniques can have very low power loss compared to other methods of controlling analog devices, and the circuitry required is relatively simple.
The cycle frequency must be programmed to match the application. Typically, 10 Hz is adequate for controlling an electric heating element, while 120 Hz would be more appropriate for controlling an incandescent light bulb. Large electric motors may be controlled with a cycle frequency as low as 100 Hz, while smaller motors may need frequencies around 10,000 Hz. It can take some experimentation to find the best frequency for any given application.
12.1 Write ARM assembly programs to configure PWM0 and the GPIO device to send a signal out on Raspberry Pi header pin 12 with:
(a) period of 1 ms and duty cycle of 25%, and
(b) frequency of 150 Hz and duty cycle of 63%.
12.2 Write ARM assembly programs to configure PWM0 and the GPIO device to send a signal out on the pcDuino PWM1/GPIO5 pin with:
(a) period of 1 ms and duty cycle of 25%, and
(b) frequency of 150 Hz and duty cycle of 63%.
This chapter briefly describes some of the devices which are present in most modern computer systems. It then describes in detail the clock management devices on the Raspberry Pi and the pcDuino. Next, it gives an explanation of asynchronous serial communications, and explains why there is some tolerance for mismatch between the clock rates of the transmitter and receiver. It then explains the Universal Asynchronous Receiver/Transmitter (UART) device. Next it covers in detail the UART devices present on the Raspberry Pi and the pcDuino. Once again, the reader is given the opportunity to compare two different devices which perform almost precisely the same functions.
Universal asynchronous receiver/transmitter (UART); Clock manager; Serial communications; RS232
There are some classes of devices that are found in almost every system, including the smallest embedded systems. Such common devices include hardware for managing the clock signals sent to other devices, and serial communications (typically RS232). Most mid-sized or large systems also include devices for managing virtual memory, managing the cache, driving a display, interfacing with keyboard and mouse, accessing disk and other storage devices, and networking. Small embedded systems may have devices for converting analog signals to digital and vice versa, pulse width modulation, and other purposes. Some systems, such as the Raspberry Pi and pcDuino, have all or most of the devices of large systems, as well as most of the devices found on embedded systems. In this chapter, we look at two devices found on almost every system.
Very simple computer systems can be driven by a single clock. Most devices, including the CPU, are designed as state machines. The clock device sends a square-wave signal at a fixed frequency to all devices that need it. The clock signal tells the devices when to transition to the next state. Without the clock signal, none of the devices would do anything.
More complex computers may contain devices which need to run at different rates. This requires the system to have separate clock signals for each device (or group of devices). System designers often solve this problem by adding a clock manager device to the system. This device allows the programmer to configure the clock signals that are sent to the other devices in the system. Fig. 13.1 shows a typical system. The clock manager, just like any other device, is configured by the CPU writing data to its registers using the system bus.

The BCM2835 system-on-chip contains an ARM CPU and several devices. Some of the devices need their own clock to drive their operation at the correct frequency. Some devices, such as serial communications receivers and transmitters, need configurable clocks so that the programmer has control over the speed of the device. To provide this flexibility and allow the programmer to have control over the clocks for each device, the BCM2835 includes a clock manager device, which can be used to configure the clock signals driving the other devices in the system.
The Raspberry Pi has a 19.2 MHz oscillator which can be used as a base frequency for any of the clocks. The BCM2835 also has three phase-locked-loop circuits that boost the oscillator to higher frequencies. Table 13.1 shows the frequencies that are available from various sources. Each device clock can be driven by one of the PLLs, the external 19.2 MHz oscillator, a signal from the HDMI port, or either of two test/debug inputs.
Table 13.1
Clock sources available for the clocks provided by the clock manager
| Number | Name | Frequency | Note |
| 0 | GND | 0 Hz | Clock is stopped |
| 1 | oscillator | 19.2 MHz | |
| 2 | testdebug0 | Unknown | Used for system testing |
| 3 | testdebug1 | Unknown | Used for system testing |
| 4 | PLLA | 650 MHz | May not be available |
| 5 | PLLC | 200 MHz | May not be available |
| 6 | PLLD | 500 MHz | |
| 7 | HDMI auxiliary | Unknown | |
| 8–15 | GND | 0 Hz | Clock is stopped |

Among the clocks controlled by the clock manager device are the core clock (CM_VPU), the system timer clock (PM_TIME) which controls the speed of the system timer, the GPIO clocks which are documented in the Raspberry Pi peripheral documentation, the pulse modulator device clocks, and the serial communications clocks. It is generally not a good idea to modify the settings of any of the clocks without good reason.
The base address of the clock manager device is 20101000₁₆. Some of the clock manager registers are shown in Table 13.2. Each clock is managed by two registers: a control register and a divisor. The control register is used to enable or disable a clock, to select which source oscillator drives the clock, and to select an optional multistage noise shaping (MASH) filter level. MASH filtering is useful for reducing the perceived noise when a clock is being used to generate an audio signal. In most cases, MASH filtering should not be used.
Table 13.2
Some registers in the clock manager device
| Offset | Name | Description |
| 070₁₆ | CM_GP0_CTL | GPIO Clock 0 (GPCLK0) Control |
| 074₁₆ | CM_GP0_DIV | GPIO Clock 0 (GPCLK0) Divisor |
| 078₁₆ | CM_GP1_CTL | GPIO Clock 1 (GPCLK1) Control |
| 07c₁₆ | CM_GP1_DIV | GPIO Clock 1 (GPCLK1) Divisor |
| 080₁₆ | CM_GP2_CTL | GPIO Clock 2 (GPCLK2) Control |
| 084₁₆ | CM_GP2_DIV | GPIO Clock 2 (GPCLK2) Divisor |
| 098₁₆ | CM_PCM_CTL | Pulse Code Modulator Clock (PCM_CLK) Control |
| 09c₁₆ | CM_PCM_DIV | Pulse Code Modulator Clock (PCM_CLK) Divisor |
| 0a0₁₆ | CM_PWM_CTL | Pulse Modulator Device Clock (PWM_CLK) Control |
| 0a4₁₆ | CM_PWM_DIV | Pulse Modulator Device Clock (PWM_CLK) Divisor |
| 0f0₁₆ | CM_UART_CTL | Serial Communications Clock (UART_CLK) Control |
| 0f4₁₆ | CM_UART_DIV | Serial Communications Clock (UART_CLK) Divisor |
Table 13.3 shows the meaning of the bits in the control registers for each of the clocks, and Table 13.4 shows the fields in the clock manager divisor registers. The procedure for configuring one of the clocks is:
Table 13.3
Bit fields in the clock manager control registers
| Bit | Name | Description |
| 3–0 | SRC | Clock source chosen from Table 13.1 |
| 4 | ENAB | Writing a 0 causes the clock to shut down. The clock will not stop immediately. The BUSY bit will be 1 while the clock is shutting down. When the BUSY bit becomes 0, the clock has stopped and it is safe to reconfigure it. Writing a 1 to this bit causes the clock to start |
| 5 | KILL | Writing a 1 to this bit will stop and reset the clock. This does not shut down the clock cleanly, and could cause a glitch in the clock output |
| 6 | - | Unused |
| 7 | BUSY | A 1 in this bit indicates that the clock is running |
| 8 | FLIP | Writing a 1 to this bit will invert the clock output. Do not change this bit while the clock is running |
| 10–9 | MASH | Controls how the clock source is divided. 00: Integer division. 01: 1-stage MASH division. 10: 2-stage MASH division. 11: 3-stage MASH division. Do not change this while the clock is running |
| 23–11 | – | Unused |
| 31–24 | PASSWD | This field must be set to 5A₁₆ every time the clock control register is written to |

Table 13.4
Bit fields in the clock manager divisor registers
| Bit | Name | Description |
| 11–0 | DIVF | Fractional part of divisor. Do not change this while the clock is running |
| 23–12 | DIVI | Integer part of divisor. Do not change this while the clock is running |
| 31–24 | PASSWD | This field must be set to 5A₁₆ every time the clock divisor register is written to |
1. Read the desired clock control register.
2. Clear bit 4 in the word that was read, then OR it with 5A000000₁₆ and store the result back to the desired clock control register.
3. Repeatedly read the desired clock control register, until bit 7 becomes 0.
4. Calculate the divisor required and store it into the desired clock divisor register.
5. Create a word to configure and start the clock. Begin with 5A00000016, and set bits 3–0 to select the desired clock source. Set bits 10–9 to select the type of division, and set bit 4 to 1 to enable the clock.
6. Store the control word into the desired clock control register.
Selection of the divisor depends on which clock source is used, what type of division is selected, and the desired output of the clock being configured. For example, to set the PWM clock to 100 kHz, the 19.2 MHz oscillator can be used. Dividing that clock by 192 will provide a 100 kHz clock. To accomplish this, it is necessary to stop the PWM clock as described, store the value 5A0C0000₁₆ in the PWM clock divisor register, and then start the clock by writing 5A000011₁₆ into the PWM clock control register.
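The two register values from this example can be checked with a short C sketch that packs the PASSWD, DIVI/DIVF, SRC, and ENAB fields described in Tables 13.3 and 13.4 (the helper names are mine, for illustration only):

```c
#include <stdint.h>

#define CM_PASSWD 0x5A000000u  /* bits 31-24 must be 5A on every write */

/* Divisor register: DIVI in bits 23-12, DIVF in bits 11-0. */
static uint32_t cm_div_word(uint32_t divi, uint32_t divf)
{
    return CM_PASSWD | ((divi & 0xFFF) << 12) | (divf & 0xFFF);
}

/* Control register: SRC (Table 13.1) in bits 3-0, ENAB is bit 4. */
static uint32_t cm_ctl_word(uint32_t src, int enable)
{
    return CM_PASSWD | (enable ? (1u << 4) : 0) | (src & 0xF);
}
```

A divisor of 192 with no fractional part gives 5A0C0000₁₆, and selecting source 1 (the oscillator) with the clock enabled gives 5A000011₁₆, matching the values above.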
The AllWinner A10/A20 SOCs have a relatively simple clock manager, which is referred to as the Clock Control Unit. All of the clock signals in the system are driven by two crystal oscillators: the main oscillator, which runs at 24 MHz, and the real-time-clock oscillator, which runs at 32,768 Hz. The real-time-clock oscillator is used only to provide a signal to the real-time-clock device.
The main clock oscillator drives many of the devices in the system, but there are seven phase-locked-loop circuits in the CCU which provide signals for devices which need clocks that are faster or slower than 24 MHz. Table 13.5 shows which devices are driven by the nine clock signals.
Table 13.5
Clock signals in the AllWinner A10/A20 SOC
| Clock Domain | Modules | Frequency | Description |
| OSC24M | Most modules | 24 MHz | Main clock |
| CPU32_clk | CPU | 2 kHz–1.2 GHz | Drives CPU |
| AHB_clk | AHB devices | 8 kHz–276 MHz | Drives some devices |
| APB_clk | Peripheral bus | 500 Hz–138 MHz | Drives some devices |
| SDRAM_clk | SDRAM | 0 Hz–400 MHz | Drives SDRAM memory |
| USB_clk | USB | 480 MHz | Drives USB devices |

There are basically two methods for transferring data between two digital devices: parallel and serial. Parallel connections use multiple wires to carry several bits at one time, typically including extra wires to carry timing information. Parallel communications are used for transferring large amounts of data over very short distances. However, this approach becomes very expensive when data must be transferred more than a few meters. Serial, on the other hand, uses a single wire to transfer the data bits one at a time. When compared to parallel transfer, the speed of serial transfer typically suffers. However, because it uses significantly fewer wires, the distance may be greatly extended, reliability improved, and cost vastly reduced.
One of the oldest and most common devices for communications between computers and peripheral devices is the Universal Asynchronous Receiver/Transmitter, or UART. The word “universal” indicates that the device is highly configurable and flexible. UARTs allow a receiver and transmitter to communicate without a synchronizing signal.
The logic signal produced by the digital UART typically oscillates between zero volts for a low level and five volts for a high level, and the amount of current that the UART can supply is limited. For transmitting the data over long distances, the signals may go through a level-shifting or amplification stage. The circuit used to accomplish this is typically called a line driver. This circuit boosts the signal provided by the UART and also protects the delicate digital outputs from short circuits and signal spikes. Various standards, such as RS-232, RS-422, and RS-485 define the voltages that the line driver uses. For example, the RS-232 standard specifies that valid signals are in the range of +3 to +15 V, or −3 to −15 V. The standards also specify the maximum time that is allowable when shifting from a high signal to a low signal and vice versa, the amount of current that the device must be capable of sourcing and sinking, and other relevant design criteria.
The UART transmits data by sending each bit sequentially. The receiving UART re-assembles the bits into the original data. Fig. 13.2 shows how the transmitting UART converts a byte of data into a serial signal, and how the receiving UART samples the signal to recover the original data. Serializing the transmission and reassembly of the data are accomplished using shift registers. The receiver and transmitter each have their own clocks, and are configured so that the clocks run at the same speed (or close to the same speed). In this case, the receiver’s clock is running slightly slower than the transmitter’s clock, but the data are still received correctly.

To transfer a group of bits, called a data frame, the transmitter typically first sends a start bit. Most UARTs can be configured to transfer between four and eight data bits in each group. The transmitting and receiving UARTS must be configured to use the same number of data bits. After each group of data bits, the transmitter will return the signal to the low state and keep it there for some minimum period. This period is usually the time that it would take to send two bits of data, and is referred to as the two stop bits. The stop bits allow the receiver to have some time to process the received byte and prepare for the next start bit. Fig. 13.2A shows what a typical RS-232 signal would look like when transferring the value 56₁₆ (the ASCII “V” character). The UART enters the idle state only if there is not another byte immediately ready to send. If the transmitter has another byte to send, then the start bit can begin at the end of the second stop bit.
Note that it is impossible to ensure that the receiver and transmitter have clocks which are running at exactly the same speed, unless they use the same clock signal. Fig. 13.2B shows how the receiver can reassemble the original data, even with a slightly different clock rate. When the start bit is detected by the receiver, it prepares to receive the data bits, which will be sent by the transmitter at an expected rate (within some tolerance). The receive circuitry of most UARTS is driven by a clock that runs 16 times as fast as the baud rate. The receive circuitry uses its faster clock to latch each bit in the middle of its expected time period. In Fig. 13.2B, the receiver clock is running slower than the transmitter clock. By the end of the data frame, the sample time is very far from the center of the bit, but the correct value is received. If the clocks differed by much more, or if more than eight data bits were sent, then it is very likely that incorrect data would be received. Thus, as long as their clocks are synchronized within some tolerance (which is dependent on the number of data bits and the baud rate), the data will be received correctly.
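A back-of-the-envelope sketch of this tolerance can be written in C. The (n + 1.5) bit-period figure below is my own simplification, not from the text: with one start bit, n data bits, and one stop bit, the last sample (the center of the stop bit) falls n + 1.5 bit periods after the start edge, and the accumulated drift there must stay under half a bit period. This ignores the 1/16-bit sampling granularity, so treat the result as a rough upper bound.

```c
/* Rough estimate of the fractional clock mismatch a UART frame can
 * tolerate: accumulated drift over (data_bits + 1.5) bit periods must
 * stay under 0.5 bit period for every sample to land in the right bit. */
static double uart_clock_tolerance(int data_bits)
{
    return 0.5 / (data_bits + 1.5);
}
```

For 8 data bits this works out to about 5.3%, which is consistent with the observation that more data bits per frame leaves less room for clock mismatch.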
The RS-232 standard allows point-to-point communication between two devices for limited distances. With the RS-232 standard, simple one-way communications can be accomplished using only two wires: One to carry the serial bits, and another to provide a common ground. For bi-directional communication, three wires are required. In addition, the RS-232 standard specifies optional hand-shaking signals, which the UARTs can use to signal their readiness to transmit or receive data. The RS-422 and RS-485 standards allow multiple devices to be connected using only two wires.
The first UART device to enjoy widespread use was the 8250. The original version had 12 registers for configuration, sending, and receiving data. The most important registers are the ones that allow the programmer to set the transmit and receive bit rates, or baud. One baud is one bit per second. The baud is set by storing a 16-bit divisor in two of the registers in the UART. The chip is driven by an external clock, and the divisor is used to reduce the frequency of the external clock to a frequency that is appropriate for serial communication. For example, if the external clock runs at 1 MHz, and the required baud is 1200, then the divisor must be 1,000,000 ÷ 1200 ≈ 833.33, which is truncated to 833. Note that the divisor can only be an integer, so the device cannot achieve exactly 1200 baud; with a divisor of 833, the actual rate is 1,000,000 ÷ 833 ≈ 1200.48 baud. However, as explained previously, the sending and receiving devices do not have to agree precisely on the baud. During the transmission and reception of a byte, 1200.48 baud is close enough that the bits will be received correctly even if the other end is running slightly below 1200 baud. In the 8250, there was only one 8-bit register for sending data and only one 8-bit register for receiving data. The UART could send an interrupt to the CPU after each byte was transmitted or received. When receiving, the CPU had to respond to the interrupt very quickly. If the current byte was not read quickly enough by the CPU, it would be overwritten by the subsequent incoming byte. When transmitting, the CPU needed to respond quickly to interrupts to provide the next byte to be sent, or the transmission rate would suffer.
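The divisor arithmetic above can be expressed directly in C (the helper names are hypothetical; the formula is the one used in this example):

```c
#include <stdint.h>

/* Integer divisor for the 8250's baud generator, as in the text's
 * example: a 1 MHz external clock and a desired rate of 1200 baud. */
static unsigned uart8250_divisor(unsigned clock_hz, unsigned baud)
{
    return clock_hz / baud;  /* truncated to an integer by C division */
}

/* The rate actually produced by a given integer divisor. */
static double uart8250_actual_baud(unsigned clock_hz, unsigned divisor)
{
    return (double)clock_hz / divisor;
}
```

With a 1 MHz clock and 1200 baud requested, the divisor is 833 and the actual rate is about 1200.48 baud.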
The next generation of UART device was the 16550A. This device is the model for most UART devices today. It features 16-byte input and output buffers and the ability to trigger interrupts when a buffer is partially full or partially empty. This allows the CPU to move several bytes of data at a time and results in much lower CPU overhead and much higher data transmission and reception rates. The 16550A also supports much higher baud rates than the 8250.
The BCM2835 system-on-chip provides two UART devices: UART0 and UART1. UART1 is part of the auxiliary peripheral block, and is not recommended for use as a UART. UART0 is a PL011 UART, which is based on the industry standard 16550A UART. The major differences are that the PL011 allows greater flexibility in configuring the interrupt trigger levels, the registers appear in different locations, and the locations of bits in some of the registers are different. So, although it operates very much like a 16550A, things have been moved to different locations. The transmit and receive lines can be routed through GPIO pin 14 and GPIO pin 15, respectively. UART0 has 18 registers, starting at its base address of 20201000₁₆. Table 13.6 shows the name, location, and a brief description for each of the registers.
Table 13.6
Raspberry Pi UART0 register map
| Offset | Name | Description |
| 00₁₆ | UART_DR | Data Register |
| 04₁₆ | UART_RSRECR | Receive Status Register/Error Clear Register |
| 18₁₆ | UART_FR | Flag Register |
| 20₁₆ | UART_ILPR | Not used |
| 24₁₆ | UART_IBRD | Integer Baud Rate Divisor |
| 28₁₆ | UART_FBRD | Fractional Baud Rate Divisor |
| 2c₁₆ | UART_LCRH | Line Control Register |
| 30₁₆ | UART_CR | Control Register |
| 34₁₆ | UART_IFLS | Interrupt FIFO Level Select Register |
| 38₁₆ | UART_IMSC | Interrupt Mask Set Clear Register |
| 3c₁₆ | UART_RIS | Raw Interrupt Status Register |
| 40₁₆ | UART_MIS | Masked Interrupt Status Register |
| 44₁₆ | UART_ICR | Interrupt Clear Register |
| 48₁₆ | UART_DMACR | DMA Control Register |
| 80₁₆ | UART_ITCR | Test Control Register |
| 84₁₆ | UART_ITIP | Integration Test Input Register |
| 88₁₆ | UART_ITOP | Integration Test Output Register |
| 8c₁₆ | UART_TDR | Test Data Register |
UART_DR: The UART Data Register is used to send and receive data. Data are sent or received one byte at a time. Writing to this register will add a byte to the transmit FIFO. Although the register is 32 bits, only the 8 least significant bits are used in transmission, and the 12 least significant bits are used for reception. If the FIFO is empty, then the UART will begin transmitting the byte immediately. If the FIFO is full, then the last byte in the FIFO will be overwritten with the new byte that is written to the Data Register. When this register is read, it returns the byte at the top of the receive FIFO, along with four additional status bits to indicate if any errors were encountered. Table 13.7 specifies the names and use of the bits in the UART Data Register.
Table 13.7
Raspberry Pi UART data register
| Bit | Name | Description | Values |
| 7–0 | DATA | Data | Write: Data byte to transmit |
| 8 | FE | Framing error | 1: The received character did not have a valid stop bit |
| 9 | PE | Parity error | 1: The received character did not have the correct parity, as set in the EPS and SPS bits of the Line Control Register (UART_LCRH) |
| 10 | BE | Break error | 1: A break condition was detected. The data input line was held low for longer than the time it would take to receive a complete byte, including the start and stop bits. |
| 11 | OE | Overrun error | 1: Data was not read quickly enough, and one or more bytes were overwritten in the input buffer |
| 31–12 | – | Not used | Write as zero, read as don't care |

UART_RSRECR: The UART Receive Status Register/Error Clear Register is used to check the status of the byte most recently read from the UART Data Register, and to check for overrun conditions at any time. The status information for overrun is set immediately when an overrun condition occurs. The Receive Status Register/Error Clear Register provides the same four status bits as the Data Register (but in bits 3–0 rather than bits 11–8). The received data character must be read first from the Data Register, before reading the error status associated with that data character from the RSRECR register. Since the Data Register also contains these 4 bits, this register may not be required, depending on how the software is written. Table 13.8 describes the bits in this register.
Table 13.8
Raspberry Pi UART receive status register/error clear register
| Bit | Name | Description | Values |
| 0 | FE | Framing error | 1: The received character did not have a valid stop bit |
| 1 | PE | Parity error | 1: The received character did not have the correct parity, as set in the EPS and SPS bits of the Line Control Register (UART_LCRH) |
| 2 | BE | Break error | 1: A break condition was detected. The data input line was held low for longer than the time it would take to receive a complete byte, including the start and stop bits. |
| 3 | OE | Overrun error | 1: Data was not read quickly enough, and one or more bytes were overwritten in the input buffer |
| 31–4 | – | Not used | Write as zero, read as don't care |

UART_FR: The UART Flag Register can be read to determine the status of the UART. The bits in this register are used mainly when sending and receiving data using the FIFOs. When several bytes need to be sent, the TXFF flag should be checked to ensure that the transmit FIFO is not full before each byte is written to the data register. When receiving data, the RXFE bit can be used to determine whether or not there is more data to be read from the FIFO. Table 13.9 describes the flags in this register.
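A minimal polled-transmit routine built on the Flag Register might look like the following C sketch. The register offsets come from Table 13.6, but the TXFF bit position (bit 5) is an assumption drawn from standard PL011 documentation rather than from this text, and `uart_putc` is an invented name.

```c
#include <stdint.h>

/* Word offsets of the data and flag registers (byte offsets 00 and 18
 * hex, from Table 13.6). */
enum { UART_DR = 0x00 / 4, UART_FR = 0x18 / 4 };

/* Transmit-FIFO-full flag.  Bit position assumed from the PL011
 * documentation, since Table 13.9 lists the flag definitions. */
#define FR_TXFF (1u << 5)

/* Polled transmit: spin until the transmit FIFO has room, then write
 * the byte to the data register. */
static void uart_putc(volatile uint32_t *uart, uint8_t c)
{
    while (uart[UART_FR] & FR_TXFF)
        ;                      /* wait while the transmit FIFO is full */
    uart[UART_DR] = c;
}
```

On real hardware, `uart` would point at the UART0 base address; the same loop structure with the RXFE bit covers polled reception.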
Table 13.9
Raspberry Pi UART flags register bits

UART_ILPR: This is the IrDA register, which is supported by some PL011 UARTs. IrDA stands for the Infrared Data Association, which is a group of companies that cooperate to provide specifications for a complete set of protocols for wireless infrared communications. The name “IrDA” also refers to that set of protocols. IrDA is not implemented on the Raspberry Pi UART. Writing to this register has no effect and reading returns 0.
UART_IBRD and UART_FBRD: UART_FBRD is the fractional part of the baud rate divisor value, and UART_IBRD is the integer part. The baud rate divisor is calculated as follows:

BAUDDIV = UARTCLK ÷ (16 × baud rate), (13.1)
where UARTCLK is the frequency of the UART_CLK that is configured in the Clock Manager device. The default value is 3 MHz. BAUDDIV is stored in two registers. UART_IBRD holds the integer part and UART_FBRD holds the fractional part. Thus BAUDDIV should be calculated as a U(16,6) fixed point number. The contents of the UART_IBRD and UART_FBRD registers may be written at any time, but the change will not have any effect until transmission or reception of the current character is complete. Table 13.10 shows the arrangement of the integer baud rate divisor register, and Table 13.11 shows the arrangement of the fractional baud rate divisor register.
Table 13.10
Raspberry Pi UART integer baud rate divisor
| Bit | Name | Description | Values |
| 15–0 | IBRD | Integer Baud Rate Divisor | See Eq. (13.1) |
| 31–16 | – | Not used | Write as zero, read as don't care |

Table 13.11
Raspberry Pi UART fractional baud rate divisor
| Bit | Name | Description | Values |
| 5–0 | FBRD | Fractional Baud Rate Divisor | See Eq. (13.1) |
| 31–6 | – | Not used | Write as zero, read as don't care |
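Splitting BAUDDIV into the two registers can be sketched in C. Computing BAUDDIV scaled by 64 keeps everything in integer arithmetic; rounding to the nearest 1/64 is my assumption rather than something the text specifies.

```c
#include <stdint.h>

/* Split BAUDDIV = UARTCLK / (16 * baud) into the integer part for
 * UART_IBRD and the 6-bit fractional part for UART_FBRD.  BAUDDIV is
 * computed scaled by 64 (i.e. UARTCLK * 4 / baud), rounded to the
 * nearest 1/64 -- the rounding is an assumption, for illustration. */
static void pl011_baud_divisors(uint32_t uartclk, uint32_t baud,
                                uint32_t *ibrd, uint32_t *fbrd)
{
    uint32_t div64 = (uartclk * 4 + baud / 2) / baud; /* BAUDDIV * 64 */
    *ibrd = div64 >> 6;    /* integer part */
    *fbrd = div64 & 0x3F;  /* fractional part, in 64ths */
}
```

With the default 3 MHz UART_CLK and 115,200 baud, BAUDDIV ≈ 1.6276, giving IBRD = 1 and FBRD = 40.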

UART_LCRH: UART_LCRH is the line control register. It is used to configure the communication parameters. This register must not be changed until the UART is disabled by writing zero to bit 0 of UART_CR, and the BUSY flag in UART_FR is clear. Table 13.12 shows the layout of the line control register.
Table 13.12
Raspberry Pi UART line control register bits

UART_CR: The UART Control Register is used for configuring, enabling, and disabling the UART. Table 13.13 shows the layout of the control register. To enable transmission, the TXE bit and UARTEN bit must be set to 1. To enable reception, the RXE bit and UARTEN bit must be set to 1. In general, the following steps should be used to configure or re-configure the UART:
Table 13.13
Raspberry Pi UART control register bits
| Bit | Name | Description | Values |
| 0 | UARTEN | UART Enable | 1: UART enabled. |
| 1 | SIREN | Not used | Write as zero, read as don't care |
| 2 | SIRLP | Not used | Write as zero, read as don't care |
| 6–3 | – | Not used | Write as zero, read as don't care |
| 7 | LBE | Loopback Enable | 1: Loopback enabled. Transmitted data is also fed back to the receiver. |
| 8 | TXE | Transmit enable | 1: Transmitter is enabled |
| 9 | RXE | Receive enable | 1: Receiver is enabled |
| 10 | DTR | Not used | Write as zero, read as don't care |
| 11 | RTS | Request to Send | Complement of nUARTRTS |
| 12 | OUT1 | Not used | Write as zero, read as don't care |
| 13 | OUT2 | Not used | Write as zero, read as don't care |
| 14 | RTSEN | RTS Enable | 1: Hardware RTS Enabled |
| 15 | CTSEN | CTS Enable | 1: Hardware CTS Enabled |
| 31–16 | – | Not used | Write as zero, read as don't care |

(a) Disable the UART by writing zero to the UARTEN bit in the Control Register.
(b) Wait for the end of transmission or reception of the current character.
(c) Flush the transmit FIFO by setting the FEN bit to 0 in the Line Control Register.
(d) Reprogram the Control Register.
(e) Enable the UART.
Interrupt Control: The UART can signal the CPU by asserting an interrupt when certain conditions occur. This will be covered in more detail in Chapter 14. For now, it is enough to know that there are five additional registers which are used to configure and use the interrupt mechanism.
UART_IFLS defines the FIFO level that triggers the assertion of the interrupt signal. One interrupt is generated when the FIFO reaches the specified level. The CPU must clear the interrupt before another can be generated.
UART_IMSC is the interrupt mask set/clear register. It is used to enable or disable specific interrupts. This register determines which of the possible interrupt conditions are allowed to generate an interrupt to the CPU.
UART_RIS is the raw interrupt status register. It can be read to determine the raw status of the interrupt conditions before any masking is performed.
UART_MIS is the masked interrupt status register. It contains the masked status of the interrupts. This is the register that the operating system should use to determine the cause of a UART interrupt.
UART_ICR is the interrupt clear register. Writing to it clears the interrupt conditions. The operating system should use this register to clear interrupts before returning from the interrupt service routine.
UART_DMACR: The DMA control register is used to configure the UART to access memory directly, so that the CPU does not have to move each byte of data to or from the UART. DMA will be explained in more detail in Chapter 14.
Additional Registers: The remaining registers, UART_ITCR, UART_ITIP, and UART_ITOP, are either unimplemented or are used for testing the UART. These registers should not be used.
Listing 13.1 shows four basic functions for initializing the UART, changing the baud rate, sending a character, and receiving a character using UART0 on the Raspberry Pi. Note that a large part of the code simply defines the location and offset for all of the registers (and bits) that can be used to control the UART.





The AllWinner A10/A20 SOC includes eight UART devices. They are all fully compatible with the 16550A UART, and also provide some enhancements. All of them provide transmit (TX) and receive (RX) signals. UART0 has the full set of RS232 signals, including RTS, CTS, DTR, DSR, DCD, and RING. UART1 has the RTS and CTS signals. The remaining six UARTs only provide the TX and RX signals. They can all be configured for IrDA serial communication. Table 13.14 shows the base address for each of the eight UART devices.
Table 13.14
pcDuino UART addresses
| Name | Address |
| UART0 | 0x01C28000 |
| UART1 | 0x01C28400 |
| UART2 | 0x01C28800 |
| UART3 | 0x01C28C00 |
| UART4 | 0x01C29000 |
| UART5 | 0x01C29400 |
| UART6 | 0x01C29800 |
| UART7 | 0x01C29C00 |
When the 16550 UART was designed, 8-bit processors were common, and most of them provided only 16 address bits. Memory was typically limited to 64 kB, and every byte of address space was important. Because of these considerations, the designers of the 16550 decided to limit the number of addresses used to 8, and to only use eight bits of data per address. There are 12 registers in the 16550 UART, but some of them share the same address. For example, three registers are mapped to an offset address of zero, two registers are mapped at offset one, and two registers are mapped at offset two. Bit seven in the Line Control Register is used to determine which of the registers is active for a given address.
Because they are meant to be fully backwards-compatible with the 16550, the AllWinner A10/A20 SOC UART devices also use only 8 bits for each register, and the first 12 registers correspond exactly with the 16550 UART. The only differences are that the pcDuino uses word addresses rather than byte addresses, and they provide four additional registers that are used for IrDA mode. Table 13.15 shows the arrangement of the registers in each of the 8 UARTs on the pcDuino. The following sections will explain the registers.
Table 13.15
pcDuino UART register offsets
| Register Name | Offset | Description |
| UART_RBR | 0x00 | UART Receive Buffer Register |
| UART_THR | 0x00 | UART Transmit Holding Register |
| UART_DLL | 0x00 | UART Divisor Latch Low Register |
| UART_DLH | 0x04 | UART Divisor Latch High Register |
| UART_IER | 0x04 | UART Interrupt Enable Register |
| UART_IIR | 0x08 | UART Interrupt Identity Register |
| UART_FCR | 0x08 | UART FIFO Control Register |
| UART_LCR | 0x0C | UART Line Control Register |
| UART_MCR | 0x10 | UART Modem Control Register |
| UART_LSR | 0x14 | UART Line Status Register |
| UART_MSR | 0x18 | UART Modem Status Register |
| UART_SCH | 0x1C | UART Scratch Register |
| UART_USR | 0x7C | UART Status Register |
| UART_TFL | 0x80 | UART Transmit FIFO Level |
| UART_RFL | 0x84 | UART Receive FIFO Level |
| UART_HALT | 0xA4 | UART Halt TX Register |
The baud rate is set using a 16-bit Baud Rate Divisor, according to the following equation:

baud rate = sclk / (16 × BAUDDIV)

where sclk is the frequency of the UART serial clock, which is configured by the Clock Manager device. The default frequency of the clock is 24 MHz. BAUDDIV is stored in two registers. UART_DLL holds the least significant 8 bits, and UART_DLH holds the most significant 8 bits. Thus BAUDDIV should be calculated as a 16-bit unsigned integer. Note that for high baud rates, it may not be possible to get exactly the rate desired. For example, a baud rate of 115,200 would require a divisor of 24,000,000/(16 × 115,200) ≈ 13.02. Since the baud rate divisor can only be given as an integer, the desired rate must be based on a divisor of 13, so the true baud rate will be 24,000,000/(16 × 13) ≈ 115,384.6, or about 0.16% faster than desired. Although slightly fast, it is well within the tolerance for RS232 communication.
UART_RBR: The UART Receive Buffer Register is used to receive data, 1 byte at a time. If the receive FIFO is enabled, then as the UART receives data, it places the data into a receive FIFO. Reading from this address removes 1 byte from the receive FIFO. If the FIFO becomes full and another data byte arrives, then the new data are lost and an overrun error occurs. Table 13.16 shows the layout of the receive buffer register.
Table 13.16
pcDuino UART receive buffer register
| Bit | Name | Description | Values |
| 7–0 | RBR | Data | Read only: One byte of received data. Bit 7 of LCR must be zero. |
| 31–8 | Unused | | |

UART_THR: Writing to the Transmit Holding Register will cause that byte to be transmitted by the UART. If the transmit FIFO is enabled, then the byte will be added to the end of the transmit FIFO. If the FIFO is empty, then the UART will begin transmitting the byte immediately. If the FIFO is full, then the new data byte will be lost. Table 13.17 shows the layout of the transmit holding register.
Table 13.17
pcDuino UART transmit holding register
| Bit | Name | Description | Values |
| 7–0 | THR | Data | Write only: One byte of data to transmit. Bit 7 of LCR must be zero. |
| 31–8 | Unused | | |

UART_DLL: The UART Divisor Latch Low register is used to set the least significant byte of the baud rate divisor. When bit 7 of the Line Control Register is set to one, writing to this address will access the DLL register. If bit 7 of the Line Control Register is set to zero, then writing to this address will access the transmit holding register. Table 13.18 shows the layout of the UART_DLL register.
Table 13.18
pcDuino UART divisor latch low register
| Bit | Name | Description | Values |
| 7–0 | DLL | Data | Write only: Least significant eight bits of the Baud Rate Divisor. Bit 7 of LCR must be one. |
| 31–8 | Unused |

UART_DLH: The UART Divisor Latch High register is used to set the most significant byte of the baud rate divisor. When bit 7 of the Line Control Register is set to one, writing to this address will access the DLH register. If bit 7 of the Line Control Register is set to zero, then writing to this address will access the Interrupt Enable Register rather than the Divisor Latch High register. Table 13.19 shows the layout of the UART_DLH register.
If the two Divisor Latch Registers (DLL and DLH) are set to zero, the baud clock is disabled and no serial communications occur. DLH should be set before DLL, and at least eight clock cycles of the UART clock should be allowed to pass before data are transmitted or received.
Table 13.19
pcDuino UART divisor latch high register
| Bit | Name | Description | Values |
| 7–0 | DLH | Data | Write only: Most significant eight bits of the Baud Rate Divisor. Bit 7 of LCR must be one. |
| 31–8 | Unused |

UART_FCR: The UART FIFO Control Register is used to enable or disable the receive and transmit FIFOs (buffers), flush their contents, set the level at which the transmit and receive FIFOs trigger an interrupt, and control Direct Memory Access (DMA). Table 13.20 shows the layout of the UART_FCR register.
Table 13.20
pcDuino UART FIFO control register

UART_LCR: The Line Control Register is used to control the parity, number of data bits, and number of stop bits for the serial port. Bit 7 also controls which registers are mapped at offsets 0, 4, and 8 from the device base address. Table 13.21 shows the layout of the UART_LCR register.
Table 13.21
pcDuino UART line control register

UART_LSR: The Line Status Register is used to read status information from the UART. Table 13.22 shows the layout of the UART_LSR register.
Table 13.22
pcDuino UART line status register
| Bit | Name | Description |
| 0 | DR | When the Data Ready bit is set to 1, it indicates that at least one byte is ready to be read from the receive FIFO or RBR. |
| 1 | OE | When the Overrun Error bit is set to 1, it indicates that an overrun error occurred for the byte at the top of the receive FIFO. |
| 2 | PE | When the Parity Error bit is set to 1, it indicates that a parity error occurred for the byte at the top of the receive FIFO. |
| 3 | FE | When the Framing Error bit is set to 1, it indicates that a framing error occurred for the byte at the top of the receive FIFO. |
| 4 | BI | When the Break Interrupt bit is set to 1, it indicates that a break has been received. |
| 5 | THRE | When the Transmit Holding Register Empty bit is 1, it indicates that there are no bytes waiting to be transmitted, but there may be a byte currently being transmitted. |
| 6 | TEMT | When the Transmitter Empty bit is 1, it indicates that there are no bytes waiting to be transmitted and no byte currently being transmitted. |
| 7 | FIFOERR | When this bit is 1, an error has occurred (PE, FE, or BI) in the receive FIFO. This bit is cleared when the Line Status Register is read. |
| 31–8 | Unused |
UART_USR: The UART Status Register is used to read information about the status of the transmit and receive FIFOs, and the current state of the receiver and transmitter. Table 13.23 shows the layout of the UART_USR register. This register contains essentially the same information as the status register in the Raspberry Pi UART.
Table 13.23
pcDuino UART status register
| Bit | Name | Description |
| 0 | BUSY | When the Busy bit is 1, it indicates that the UART is currently busy. When it is 0, the UART is idle or inactive. |
| 1 | TFNF | When the Transmit FIFO Not Full bit is 1, it indicates that at least one more byte can be safely written to the Transmit FIFO. |
| 2 | TFE | When the Transmit FIFO Empty bit is 1, it indicates that there are no bytes remaining in the transmit FIFO. |
| 3 | RFNE | When the Receive FIFO Not Empty bit is 1, it indicates that at least one more byte is waiting to be read from the receive FIFO. |
| 4 | RFF | When the Receive FIFO Full bit is 1, it indicates that there is no more room in the receive FIFO. If data is not read before the next character is received, an overrun error will occur. |
| 31–5 | Unused |
UART_TFL: The UART Transmit FIFO Level register allows the programmer to determine exactly how many bytes are currently in the transmit FIFO. Table 13.24 shows the layout of the UART_TFL register.
Table 13.24
pcDuino UART transmit FIFO level register
| Bit | Name | Description |
| 6–0 | TFL | The Transmit FIFO level field contains an integer which indicates the number of bytes currently in the transmit FIFO. |
| 31–7 | Unused |
UART_RFL: The UART Receive FIFO Level register allows the programmer to determine exactly how many bytes are currently in the receive FIFO. Table 13.25 shows the layout of the UART_RFL register.
Table 13.25
pcDuino UART receive FIFO level register
| Bit | Name | Description |
| 6–0 | RFL | The Receive FIFO level field contains an integer which indicates the number of bytes currently in the receive FIFO. |
| 31–7 | Unused |
UART_HALT: The UART transmit halt register is used to halt the UART so that it can be reconfigured. After the configuration is performed, it is then used to signal the UART to restart with the new settings. It can also be used to invert the receive and transmit polarity. Table 13.26 shows the layout of the UART_HALT register.
Table 13.26
pcDuino UART transmit halt register
| Bit | Name | Description |
| 0 | Unused | |
| 1 | CHCFG_AT_BUSY | Setting this bit to 1 causes the UART to allow changing the Line Control Register (except the DLAB bit) and allows setting the baud rate even when the UART is busy. When this bit is set to 0, changes can only occur when the BUSY bit in the UART Status Register is 0. |
| 2 | CHANGE_UPDATE | After writing 1 to CHCFG_AT_BUSY and performing the configuration, 1 should be written to this bit to signal that the UART should re-start with the new configuration. This bit will stay at 1 while the new configuration is loaded, and go back to 0 when the re-start is complete. |
| 3 | Unused | |
| 4 | SIR_TX_INVERT | This bit allows the polarity of the transmitter to be inverted. 0: Polarity not inverted. 1: Polarity inverted. |
| 5 | SIR_RX_INVERT | This bit allows the polarity of the receiver to be inverted. 0: Polarity not inverted. 1: Polarity inverted. |
| 31–6 | Unused |

Interrupt Control: The UART can signal the CPU by asserting an interrupt when certain conditions occur. This will be covered in more detail in Chapter 14. For now, it is enough to know that there are two additional registers which are used to configure and use the interrupt mechanism.
UART_IER is the interrupt enable register. It is used to enable or disable the generation of interrupts for specific conditions.
UART_IIR is the Interrupt Identity Register. When an interrupt occurs, the CPU can read this register to determine what caused the interrupt.
Additional Registers: There are several additional registers which are not needed for basic use of the UART.
UART_MCR is the Modem Control Register. It is used to configure the port for IrDA mode, enable Automatic Flow Control, and manage the RS-232 RTS and DTR hardware handshaking signals for the ports in which they are implemented. The default configuration disables these extra features.
UART_MSR is the Modem Status Register, which is used to read the state of the RS-232 modem control and status lines on ports that implement them. This register can be ignored unless a telephone modem is being used on the port.
UART_SCH is the Scratch Register. It provides 8 bits of storage for temporary data values. In the days of 8- and 16-bit computers, when the 16550 UART was designed, this extra byte of storage was useful.
Most modern computer systems have some type of Universal Asynchronous Receiver/Transmitter. These are serial communications devices, and are meant to provide communications with other systems using RS-232 (most commonly) or some other standard serial protocol. Modern systems often have a large number of other devices as well. Each device may need its own clock source to drive it at the correct frequency for its operation. The clock sources for all of the devices are often controlled by yet another device: the clock manager.
Although two systems may have different UARTs, these devices perform the same basic functions. The specifics about how they are programmed will vary from one system to another. However, there is always enough similarity between devices of the same class that a programmer who is familiar with one specific device can easily learn to program another similar device. The more experience a programmer has, the less time it takes to learn how to control a new device.
13.1 Write a function for setting the PWM clock on the Raspberry Pi to 2 MHz.
13.2 The UART_GET_BYTE function in Listing 13.1 contains skeleton code for handling errors, but does not actually do anything when errors occur. Describe at least two ways that the errors could be handled.
13.3 Listing 13.1 provides four functions for managing the UART on the Raspberry Pi. Write equivalent functions for the pcDuino UART.
This chapter starts by describing the extra responsibilities that the programmer must assume when writing code to run without an operating system (bare metal). It then explains privileged and user modes and describes all of the privileged modes available on the ARM processor. Next, it gives an overview of exception processing, and provides example code for setting up the vector table stubs for exception handling functions on the ARM processor. Next, it describes the boot processes on the Raspberry Pi and the pcDuino. After that, it shows how to write a basic bare metal program, without any exception processing. The chapter finishes by showing a more efficient version of the bare metal program using an interrupt.
Bare metal; Exception; Vector table; Exception handler; Sleep mode; User mode; Privileged mode; Startup code; Linker script; Boot loader; Interrupt
The previous chapters assumed that the software would be running in user mode under an operating system. Sometimes, it is necessary to write assembly code to run on “bare metal,” which simply means: without an operating system. For example, when we write an operating system kernel, it must run on bare metal and a significant part of the code (especially during the boot process) must be written in assembly language. Coding on bare metal is useful to deeply understand how the hardware works and what happens in the lowest levels of an operating system. There are some significant differences between code that is meant to run under an operating system and code that is meant to run on bare metal.
The operating system takes care of many details for the programmer. For instance, it sets up the stack, text, and data sections, initializes static variables, provides an interface to input and output devices, and gives the programmer an abstracted view of the machine. When accessing data on a disk drive, the programmer uses the file abstraction. The underlying hardware only knows about blocks of data. The operating system provides the data structures and operations which allow the programmer to think of data in terms of files and streams of bytes. A user program may be scattered in physical memory, but the hardware memory management unit, managed by the operating system, allows the programmer to view memory as a simple memory map (such as shown in Fig. 1.7). The programmer uses system calls to access the abstractions provided by the operating system. On bare metal, there are no abstractions, unless the programmer creates them.
However, there are some software packages to help bare-metal programmers. For example, Newlib is a C standard library intended for use in bare-metal programs. Its major features are that:
• it implements the hardware-independent parts of the standard C library,
• for I/O, it relies on only a few low-level functions that must be implemented specifically for the target hardware, and
• many target machines are already supported in the Newlib source code.
To support a new machine, the programmer only has to write a few low-level functions in C and/or Assembly to initialize the system and perform low-level I/O on the target hardware.
Many early computers were not capable of protecting the operating system from user programs. That problem was solved mostly by building CPUs that support multiple “levels of privilege” for running programs. Almost all modern CPUs have the ability to operate in at least two modes:
User mode is the mode that normal user programs use when running under an operating system, and
Privileged mode is reserved for operating system code. There are operations that can be performed in privileged mode which cannot be performed in user mode.
The ARM processor provides six privileged modes and one user mode. Five of the privileged modes have their own stack pointer (r13) and link register (r14). When the processor mode is changed, the corresponding link register and stack pointer become active, “replacing” the user stack pointer and link register.
In any of the six privileged modes, the link registers and stack pointers of the other modes can be accessed. The privileged mode stack pointers and link registers are not accessible from user mode. One of the privileged modes, FIQ, has five additional registers which become active when the processor enters FIQ mode. These registers “replace” registers r8 through r12. Additionally, five of the privileged modes have a Saved Program Status Register (SPSR). When entering those privileged modes, the CPSR is copied into the corresponding SPSR. This allows the CPSR to be restored to its original contents when the privileged code returns to the previously active mode. The full register set for all modes is shown in Table 14.1. Registers r0 through r7 and the program counter are shared by all modes. Some processors have an additional monitor mode, which is part of the ARM Security Extensions (TrustZone).
Table 14.1
The ARM user and system registers
| usr / sys | svc | abt | und | irq | fiq |
| r0 | r0 | r0 | r0 | r0 | r0 |
| r1 | r1 | r1 | r1 | r1 | r1 |
| r2 | r2 | r2 | r2 | r2 | r2 |
| r3 | r3 | r3 | r3 | r3 | r3 |
| r4 | r4 | r4 | r4 | r4 | r4 |
| r5 | r5 | r5 | r5 | r5 | r5 |
| r6 | r6 | r6 | r6 | r6 | r6 |
| r7 | r7 | r7 | r7 | r7 | r7 |
| r8 | r8 | r8 | r8 | r8 | r8_fiq |
| r9 | r9 | r9 | r9 | r9 | r9_fiq |
| r10 | r10 | r10 | r10 | r10 | r10_fiq |
| r11 (fp) | r11 | r11 | r11 | r11 | r11_fiq |
| r12 (ip) | r12 | r12 | r12 | r12 | r12_fiq |
| r13 (sp) | r13_svc | r13_abt | r13_und | r13_irq | r13_fiq |
| r14 (lr) | r14_svc | r14_abt | r14_und | r14_irq | r14_fiq |
| r15 (pc) | r15 | r15 | r15 | r15 | r15 |
| CPSR | CPSR | CPSR | CPSR | CPSR | CPSR |
| | SPSR_svc | SPSR_abt | SPSR_und | SPSR_irq | SPSR_fiq |

All of the bits of the Program Status Register (PSR) are shown in Fig. 14.1. The processor mode is selected by writing a bit pattern into the mode bits (M[4:0]) of the PSR. The bit pattern assignment for each processor mode is shown in Table 14.2. Not all combinations of the mode bits define a valid processor mode. An illegal value programmed into M[4:0] causes the processor to enter an unrecoverable state. If this occurs, a hardware reset must be used to re-start the processor. Programs running in user mode cannot modify these bits directly. User programs can only change the processor mode by executing the software interrupt (swi) instruction (also known as the svc instruction), which automatically gives control to privileged code in the operating system. The hardware is carefully designed so that the user program cannot run its own code in privileged mode.

Table 14.2
Mode bits in the PSR
| M[4:0] | Mode | Name | Register Set |
| 10000 | usr | User | R0-R14, CPSR, PC |
| 10001 | fiq | Fast Interrupt | R0-R7, R8_fiq-R14_fiq, CPSR, SPSR_fiq, PC |
| 10010 | irq | Interrupt Request | R0-R12, R13_irq, R14_irq, CPSR, SPSR_irq, PC |
| 10011 | svc | Supervisor | R0-R12, R13_svc, R14_svc, CPSR, SPSR_svc, PC |
| 10111 | abt | Abort | R0-R12, R13_abt, R14_abt, CPSR, SPSR_abt, PC |
| 11011 | und | Undefined Instruction | R0-R12, R13_und, R14_und, CPSR, SPSR_und, PC |
| 11111 | sys | System | R0-R14, CPSR, PC |

The swi instruction does not really cause an interrupt, but the hardware and operating system handle it in a very similar way. The software interrupt is used by user programs to request that the operating system perform some task on their behalf. Another general class of interrupt is the “hardware interrupt.” This class of interrupt may occur at any time and is used by hardware devices to signal that they require service. Another type of interrupt may be generated within the CPU when certain conditions arise, such as attempting to execute an unknown instruction. These are generally known as “exceptions” to distinguish them from hardware interrupts. On the ARM processor, there are three bits in the CPSR which affect interrupt processing:
I: when set to one, normal hardware interrupts are disabled,
F: when set to one, fast hardware interrupts are disabled, and
A: (only on ARMv6 and later processors) when set to one, imprecise aborts are disabled (this is an abort on a memory write that has been held in a write buffer in the processor and not written to memory until later, perhaps after another abort).
Programs running in user mode cannot modify these bits. Therefore, the operating system gains control of the CPU whenever an interrupt occurs and the user program cannot disable interrupts and continue to run. Most operating systems use a hardware timer to generate periodic interrupts, thus they are able to regain control of the CPU every few milliseconds.
Most of the privileged modes are entered automatically by the hardware when certain exceptional circumstances occur. For example, when a hardware device needs attention, it can signal the processor by causing an interrupt. When this occurs, the processor immediately enters IRQ mode and begins executing the IRQ exception handler function. Some devices can cause a fast interrupt, which causes the processor to immediately enter FIQ mode and begin executing the FIQ exception handler function. There are six possible exceptions that can occur, each one corresponding to one of the six privileged modes. Each exception must be handled by a dedicated function, with one additional function required to handle CPU reset events. The first instruction of each of these seven exception handlers is stored in a vector table at a known location in memory (usually address 0). When an exception occurs, the CPU automatically loads the appropriate instruction from the vector table and executes it. Table 14.3 shows the address, exception type, and the mode that the processor will be in, for each entry in ARM vector table. The vector table usually contains branch instructions. Each branch instruction will jump to the correct function for handling a specific exception type. Listing 14.1 shows a short section of assembly code which provides definitions for the ARM CPU modes.
Table 14.3
ARM vector table
| Address | Exception | Mode |
| 0x00000000 | Reset | svc |
| 0x00000004 | Undefined Instruction | und |
| 0x00000008 | Software Interrupt | svc |
| 0x0000000C | Prefetch Abort | abt |
| 0x00000010 | Data Abort | abt |
| 0x00000014 | Reserved | |
| 0x00000018 | Interrupt Request | irq |
| 0x0000001C | Fast Interrupt Request | fiq |

Many bare-metal programs consist of a single thread of execution running in user mode to perform some task. This main program is occasionally interrupted by the occurrence of some exception. The exception is processed, and then control returns to the main thread. Fig. 14.2 shows the sequence of events when an exception occurs in such a system. The main program typically would be running with the CPU in user mode. When the exception occurs, the CPU executes the corresponding instruction in the vector table, which branches to the exception handler. The exception handler must save any registers that it is going to use, execute the code required to handle the exception, then restore the registers. When it returns to the user mode process, everything will be as it was before the exception occurred. The user mode program continues executing as if the exception never occurred.

More complex systems may have multiple tasks, threads of execution, or user processes running concurrently. In a single-processor system, only one task, thread, or user process can actually be executing at any given instant, but when an exception occurs, the exception handler may change the currently active task, thread, or user process. This is the basis for all modern multiprocessing systems. Fig. 14.3 shows how an exception may be processed on such a system. It is common on multi-processing systems for a timer device to be used to generate periodic interrupts, which allows the currently active task, thread, or user process to be changed at a fixed frequency.

When any exception occurs, it causes the ARM CPU hardware to perform a very well-defined sequence of actions:
1. The CPSR is copied into the SPSR for the mode corresponding to the type of exception that has occurred.
2. The CPSR mode bits are changed, switching the CPU into the appropriate privileged mode.
3. The banked registers for the new mode become active.
4. The I bit of the CPSR is set to one, which disables normal interrupts.
5. If the exception was an FIQ, or if a reset has occurred, then the F bit is also set to one, disabling fast interrupts.
6. The program counter is copied to the link register for the new mode.
7. The program counter is loaded with the address in the vector table corresponding with the exception that has occurred.
8. The processor then fetches the next instruction using the program counter as usual. However, the program counter has been set so that it loads an instruction from the vector table.
The instruction in the vector table should cause the CPU to branch to a function which handles the exception. At the end of that function, the program counter must be loaded with the address of the instruction where the exception occurred, and the SPSR must be copied back into the CPSR. That will cause the processor to branch back to where it was when the exception occurred, and return to the mode that it was in at that time.
Listing 14.2 shows in detail how the vector table is initialized. The vector table contains eight identical instructions. These instructions load the program counter, which causes a branch. In each case, the program counter is loaded with a value at the memory location that is 32 bytes greater than the corresponding load instruction. An offset of 24 is used because the program counter will have advanced 8 bytes by the time the load instruction is executed. The addresses of the exception handlers have been stored in a second table that begins at an address 32 bytes after the first load instruction. Thus, each instruction in the vector table loads a unique address into the program counter. Note that one of the slots in the vector table is not used and is reserved by ARM for future use. That slot is treated like all of the others, but it will never be used on any current ARM processor.

Listing 14.3 shows the stub functions for each of the exception handlers.


Note that the return sequence depends on the type of exception. For some exceptions, the return address must be adjusted. This is because the program counter may have been advanced past the instruction where the exception occurred. These stub functions simply return the processor to the mode and location at which the exception occurred. To be useful, they will need to be extended significantly. Note that these functions all return using a data processing instruction with the optional s specified and with the program counter as the destination register. This special form of data processing instruction indicates that the SPSR should be copied into the CPSR at the same time that the program counter is loaded with the return address. Thus, the function returns to the point where the exception occurred, and the processor switches back into the mode that it was in when the exception occurred.
A special form of the ldm instruction can also be used to return from an exception processing function. To use that method, the exception handler should start by adjusting the link register (depending on the type of exception) and then pushing it onto the stack. The handler should also push any other registers that it will need to use. At the end of the function, an ldmfd instruction is used to restore the registers, but instead of restoring the link register, it loads the program counter. A caret (^) is also appended to the instruction, which indicates that the SPSR should be copied into the CPSR. Listing 14.4 shows the skeleton for an exception handler function using this method.

In order to create a bare-metal program, we must understand what the processor does when power is first applied or after a reset. The ARM CPU begins executing code at a predetermined address: depending on the configuration of the processor, the program counter starts at either address 0 or 0xFFFF0000. For the system to work, the startup code must be located at that address.
On the Raspberry Pi, when power is first applied, the ARM CPU is disabled and the graphics processing unit (GPU) is enabled. The GPU runs a program that is stored in ROM. That program, called the first stage boot loader, reads the second stage boot loader from a file named bootcode.bin on the SD card. That program enables the SDRAM and then loads the third stage boot loader, start.elf. At this point, some basic hardware configuration is performed, and then the kernel is loaded to address 0x8000 from the kernel.img file on the SD card. Once the kernel image file is loaded, a “b 0x8000” instruction is placed at address 0, and the ARM CPU is enabled. The ARM CPU executes the branch instruction at address 0 and immediately jumps to the kernel code at address 0x8000.
To run a bare-metal program on the Raspberry Pi, it is only necessary to build an executable image and store it as kernel.img on the SD card. Then, the boot process will load the bare-metal program instead of the Linux kernel image. Care must be taken to ensure that the linker prepares the program to run at address 0x8000 and places the first executable instruction at the beginning of the image file. It is also important to make a copy of the original kernel image so that it can be restored (using another computer). If the original kernel image is lost, then there will be no way to boot Linux until it is replaced.
The pcDuino uses u-boot, which is a highly configurable open-source boot loader. The boot loader is configured to attempt booting from the SD card. If a bootable SD card is detected, then it is used. Otherwise, the pcDuino boots from its internal NAND flash. In either case, u-boot finds the Linux kernel image file, named uImage, loads it at address 0x40008000, and then jumps to that location. The easiest way to run bare-metal code on the pcDuino is to create a duplicate of the operating system on an SD card, then replace the uImage file with another executable image. Care must be taken to ensure that the linker prepares the program to run at address 0x40008000 and places the first executable instruction at the beginning of the image file. If the SD card is inserted, then the bare-metal code will be loaded. Otherwise, it will boot normally from the NAND flash memory.
A bare-metal program should be divided into several files. Some of the code may be written in assembly, and other parts in C or some other language. The initial startup code, and the entry and exit from exception handlers, must be written in assembly. However, it may be much more productive to write the main program and the remainder of the exception handlers as C functions and have the assembly code call them.
Other than the code being loaded at different addresses, there is very little difference between getting bare-metal code running on the Raspberry Pi and the pcDuino. For either platform, the bare-metal program must include some start-up code. The startup code will:
• initialize the stack pointers for all of the modes,
• set up interrupt and exception handling,
• initialize the .bss section,
• configure the CPU and critical systems (optional),
• set up memory management (optional),
• set up process and/or thread management (optional),
• initialize devices (optional), and
• call the main function.
The startup code requires some knowledge of the target platform, and must be at least partly written in assembly language. Listing 14.5 shows a function named _start which sets up the stacks, initializes the .bss section, calls a function to set up the vector table, then calls the main function:



The first task for the startup code is to ensure that the stack pointer for each processor mode is initialized. When an exception or interrupt occurs, the processor will automatically change into the appropriate mode and begin executing an exception handler, using the stack pointer for that mode. Hardware interrupts can be disabled, but some exceptions cannot be disabled. In order to guarantee correct operation, a stack must be set up for each processor mode, and an exception handler must be provided. The exception handler does not actually have to do anything.
On the Raspberry Pi, memory is mapped to begin at address 0, and all models have at least 256 MB of memory. Therefore, it is safe to assume that the last valid memory address is 0x0FFFFFFF. If each mode is given 4 kB of stack space, then all of the stacks together will consume 32 kB, and the initial stack addresses can be easily calculated. Since the C compiler uses a full descending stack, the initial stack pointers can be assigned addresses 0x10000000, 0x0FFFF000, 0x0FFFE000, etc.
For the pcDuino, there is a small amount of memory mapped at address 0, but most of the available memory is in the region between 0x40000000 and 0xBFFFFFFF. The pcDuino has at least 1 GB of memory. One possible way to assign the stack locations is: 0x50000000, 0x4FFFF000, 0x4FFFE000, etc. This assignment of addresses will make it easy to write one piece of code to set up the stacks for either the Raspberry Pi or the pcDuino.
After initializing the stacks, the startup code must set all bytes in the .bss section to zero. Recall that the .bss section holds data that is initialized to zero, but the program file does not actually contain those zeros. Programs running under an operating system can rely on the C standard library to initialize the .bss section. A bare-metal program that is not linked with a C library must zero the .bss section itself.
The final part of this bare-metal program is the main function. Listing 14.6 shows a very simple main program which reads from three GPIO pins that have pushbuttons connected to them and controls three other pins that have LEDs connected to them. When a button is pressed, the LED associated with it is illuminated. The only real difference between the pcDuino and Raspberry Pi versions of this program is in the functions which drive the GPIO device. Therefore, those functions have been removed from the main program file. This makes the main program portable; it can run on either the pcDuino or the Raspberry Pi. It could also run on any other ARM system, with the addition of another file to implement the mappings and functions for using the GPIO device on that system.

When compiling the program, it is necessary to perform a few extra steps to ensure that the program is ready to be loaded and run by the boot code. The last step in compiling a program is to link all of the object files together, possibly also including some object files from system libraries. A linker script is a file that tells the linker which sections to include in the output file, the order in which to place them, the type of file to be produced, and the address of the first instruction. The default linker script used by GCC creates an ELF executable file, which includes startup code from the C library and also includes information which tells the loader where the various sections reside in memory. The default linker script creates a file that can be loaded by the operating system kernel, but which cannot be executed on bare metal.
For a bare-metal program, the linker must be configured to link the program so that the first instruction of the startup function is given the correct address in memory. This address depends on how the boot loader will load and execute the program. On the Raspberry Pi this address is 0x8000, and on the pcDuino this address is 0x40008000. The linker will automatically adjust any other addresses as it links the code together. The most efficient way to accomplish this is by providing a custom linker script to be used instead of the default system script. Additionally, either the linker must be instructed to create a flat binary file, rather than an ELF executable file, or a separate program (objcopy) must be used to convert the ELF executable into a flat binary file.
Listing 14.7 is an example of a linker script that can be used to create a bare-metal program. The first line is just a comment. The second line specifies the name of the function where the program begins execution. In this case, it specifies that a function named _start is where the program will begin execution. Next, the file specifies the sections that the output file will contain. For each output section, it lists the input sections that are to be used.
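A minimal linker script of the kind described here might look like the following sketch. The load address shown is the Raspberry Pi's 0x8000 (it would be 0x40008000 for the pcDuino), and the __bss_start__/__bss_end__ symbols are assumptions for illustration, not taken from the text:

```
/* Sketch of a bare-metal linker script */
ENTRY(_start)
SECTIONS
{
    . = 0x8000;                        /* load address on the Raspberry Pi */
    .text   : { *(.text.boot) *(.text) }
    .rodata : { *(.rodata) }
    .data   : { *(.data) }
    .bss    : { __bss_start__ = .; *(.bss) __bss_end__ = .; }
}
```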

The first output section is the .text section, and it is composed of any sections whose names end in .text.boot, followed by any sections whose names end in .text. In Listing 14.5, the _start function was placed in the .text.boot section, and it is the only thing in that section. Therefore, the linker will put the _start function at the very beginning of the program. The remaining text sections will be appended, followed by the remaining sections in the order that they appear. After the sections are concatenated together, the linker will make a pass through the resulting file, correcting the addresses of branch and load instructions as necessary so that the program will execute correctly.
Compiling a program that consists of multiple source files, a custom linker script, and special commands to create an executable image can become tedious. The make utility was created specifically to help in this situation. Listing 14.8 shows a make script that can be used to combine all of the elements of the program together and produce a uImage file for the pcDuino and a kernel.img file for the Raspberry Pi. Listing 14.9 shows how the program can be built by typing “make” at the command line.


The main program shown in Listing 14.6 is extremely wasteful because it runs the CPU in a loop, repeatedly checking the status of the GPIO pins. It uses far more CPU time (and electrical power) than is necessary. In reality, the pins are unlikely to change state very often, and it is sufficient to check them a few times per second. It only takes a few nanoseconds to check the input pins and set the output pins, so the CPU only needs to be running for a few nanoseconds at a time, a few times per second.
A much more efficient implementation would set up a timer to send interrupts at a fixed frequency. Then the main loop can check the buttons, set the outputs, and put the CPU to sleep. Listing 14.10 shows the main program, modified to put the processor to sleep after each iteration of the main loop. The only difference between this main function and the one in Listing 14.6 is the addition of a wfi instruction at line 43. The new implementation will consume far less electrical power and allow the CPU to run cooler, thereby extending its life. However, some additional work must be performed in order to set up the timer and interrupt system before the main function is called.

Some changes must be made to the startup code in Listing 14.5 so that, after setting up the vector table, it calls a function to initialize the interrupt controller and then calls another function to set up the timer. Listing 14.11 shows the modified startup function.
Lines 50 through 57 have been added to initialize the interrupt controller, enable the timer, and change the CPU into user mode before calling main. Of course, the hardware timers and interrupt controllers on the pcDuino and Raspberry Pi are very different.
The pcDuino has an ARM Generic Interrupt Controller (GIC-400) device to manage interrupts. The GIC device can handle a large number of interrupts. Each one is a separate input signal to the GIC. The GIC hardware prioritizes each input, and assigns each one a unique integer identifier. When the CPU receives an interrupt, it simply reads the GIC to determine which hardware device signaled the interrupt, calls the function which handles that device, then writes to one of the GIC registers to indicate that the interrupt has been processed. Listing 14.12 provides a few basic functions for managing this device.




The Raspberry Pi has a much simpler interrupt controller. It can enable and disable interrupt sources, and it requires that the programmer read up to three registers to determine the source of an interrupt. For our purposes, we only need to manage the ARM timer interrupt. Listing 14.13 provides a few basic functions for using this device to enable the timer interrupt. Extending these functions to provide functionality equal to the GIC would not be very difficult, but would take some time. It would be necessary to set up a mapping from the interrupt bits in the interrupt controller's registers to integer values, so that each interrupt source has a unique identifier. Then the functions could be written to use those identifiers. The result would be a software implementation with capabilities equivalent to the GIC.


Note that although the devices are very different internally, they perform basically the same function. With the addition of a software driver layer, implemented in Listings 14.12 and 14.13, the devices become interchangeable, and other parts of the bare-metal program do not have to be changed when porting from one platform to the other.


The pcDuino provides several timers that could be used; Timer0 was chosen arbitrarily. Listing 14.14 provides a few basic functions for managing this device.


The Raspberry Pi also provides several timers that could be used, but the ARM timer is the easiest to configure. Listing 14.15 provides a few basic functions for managing this device:


The final step in writing the bare-metal code to operate in an interrupt-driven fashion is to modify the IRQ handler from Listing 14.3. Listing 14.16 shows a new version of the IRQ exception handler which checks and clears the timer interrupt, then returns to the location and CPU mode that were current when the interrupt occurred. This code works for both platforms.

Finally, the make file must be modified to include the new source code that was added to the program. Listing 14.17 shows the modified make script. The only change is that two extra object files have been added. When make is run, those files will be compiled and linked with the program. Listing 14.9 shows how the program can be built by typing “make” at the command line.

Since its introduction in 1985 as the Acorn RISC Machine, the ARM processor has gone through many changes. Throughout the years, ARM processors have always maintained a good balance of simplicity, performance, and efficiency. Although originally intended as a desktop processor, the ARM architecture has been more successful than any other architecture for use in embedded applications. That is at least partially because of good choices made by its original designers. The architectural decisions resulted in a processor that provides relatively high computing power with a relatively small number of transistors. This design also results in relatively low power consumption.
Today, there are almost 20 different implementations of the ARMv7 architecture, targeted at everything from smart sensors to desktops and servers, and sales of ARM-based processors outnumber those of all other processor architectures combined. Historically, ARM has given numbers to the various versions of the architecture. With ARMv7, they introduced a simpler scheme to describe different versions of the processor, dividing their processor families into three major profiles:
ARMv7-A: Applications processors are capable of running a full, multiuser, virtual memory, multiprocessing operating system.
ARMv7-R: Real-time processors are for embedded systems that may need powerful processors, cache, and/or large amounts of memory.
ARMv7-M: Microcontroller processors only execute Thumb instructions and are intended for use in very small cost-sensitive embedded systems. They provide low cost, low power, and small size, and may not have hardware floating point or other high-performance features.
In 2014, ARM introduced the ARMv8 architecture. This is the first radical change in the ARM architecture since its introduction. The new architecture extends the register set to thirty-one 64-bit general purpose registers and has a completely new instruction set. Compatibility with ARMv7 and earlier code is supported by switching the processor into 32-bit mode, so that it executes the 32-bit ARM instruction set. This is somewhat similar to the way that Thumb instructions are supported on 32-bit ARM cores, except that the switch between 32-bit and 64-bit execution can only occur when the processor changes privilege level.
Writing bare-metal programs can be a daunting task. However, that task can be made easier by writing and testing code under an operating system before attempting to run it bare metal. There are some functions which cannot be tested in this way. In those cases, it is best to keep those functions as simple as possible. Once the program works on bare metal, extra capabilities can be added.
Interrupt-driven processing is the basis for all modern operating systems. The system timer allows the O/S to take control periodically and select a different process to run on the CPU. Interrupts allow hardware devices to do their jobs independently and signal the CPU when they need service. The ability to restrict user access to devices and certain processor features provides the basis for a secure and robust system.
14.1 What are the advantages of a CPU which supports user mode and privileged mode over a CPU which does not?
14.2 What are the six privileged modes supported by the ARM architecture?
14.3 The interrupt handling mechanism is somewhat complex and requires significant programming effort to use. Why is it preferred over simply having the processor poll I/O devices?
14.4 Where does program control transfer to when a hardware interrupt occurs?
14.5 What is the purpose of the Undefined Instruction exception? How can it be used to allow an older processor to run programs that have new instructions? What other uses does it have?
14.6 What is an swi instruction? What is its use in operating systems? What is the key difference between an swi instruction and an interrupt?
14.7 Which of the following operations should be allowed only in privileged mode? Briefly explain your decision for each one.
(a) Execute an swi instruction.
(b) Disable all interrupts.
(c) Read the time-of-day clock.
(d) Receive a packet of data from the network.
(e) Shut down the computer.
14.8 The main program in Listing 14.10 has two different methods to put the processor to sleep waiting for an interrupt. One method is for the Raspberry Pi, while the other is for the pcDuino. In order to compile the code, the correct lines must be uncommented and the unneeded lines must be commented out or removed. Explain two ways to change the code so that exactly the same main program can be used on both systems.
14.9 The programs in this chapter assumed the existence of libraries of functions for controlling the GPIO pins on the Raspberry Pi and the pcDuino. Both libraries provide the same high-level functions, but one operates on the Raspberry Pi GPIO device and the other operates on the pcDuino GPIO device. The C prototypes for the functions are: int GPIO_get_pin(int pin), void GPIO_set_pin(int pin, int state), void GPIO_dir_input(int pin), and void GPIO_dir_output(int pin). Write these libraries in ARM assembly language for both platforms.
14.10 Write an interrupt-driven program to read characters from the serial port on either the Raspberry Pi or the pcDuino. The UART on either system can be configured to send an interrupt when a character is received.
When a character is received through the UART and an interrupt occurs, the character should be echoed by transmitting it back to the sender. The character should also be stored in a buffer. If the character received is a newline (“\n”), or if the buffer becomes full, then the contents of the buffer should be transmitted through the UART. Then the buffer should be cleared and prepared to receive more characters.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress
British Library Cataloging-in-Publication Data
A catalogue record for this book is available from the British Library
ISBN: 978-0-12-803698-3
For information on all Newnes publications visit our website at https://www.elsevier.com/

Publisher: Joe Hayton
Acquisition Editor: Tim Pitts
Editorial Project Manager: Charlotte Kent
Production Project Manager: Julie-Ann Stansfield
Designer: Mark Rogers
Typeset by SPi Global, India
Table 1.1 Values represented by two bits
Table 1.2 The first 21 integers (starting with 0) in various bases
Table 1.3 The ASCII control characters
Table 1.4 The ASCII printable characters
Table 1.5 Binary equivalents for each character in “Hello World”
Table 1.6 Binary, hexadecimal, and decimal equivalents for each character in “Hello World”
Table 1.7 Interpreting a hexadecimal string as ASCII
Table 1.8 Variations of the ISO 8859 standard
Table 1.9 UTF-8 encoding of the ISO/IEC 10646 code points
Table 3.1 Flag bits in the CPSR register
Table 3.2 ARM condition modifiers
Table 3.3 Legal and illegal values for #<immediate|symbol>
Table 3.4 ARM addressing modes
Table 3.5 ARM shift and rotate operations
Table 4.1 Shift and rotate operations in Operand2
Table 4.2 Formats for Operand2
Table 8.1 Format for IEEE 754 half-precision
Table 8.2 Result formats for each term
Table 8.3 Shifts required for each term
Table 8.4 Performance of sine function with various implementations
Table 9.1 Condition code meanings for ARM and VFP
Table 9.2 Performance of sine function with various implementations
Table 10.1 Parameter combinations for loading and storing a single structure
Table 10.2 Parameter combinations for loading multiple structures
Table 10.3 Parameter combinations for loading copies of a structure
Table 10.4 Performance of sine function with various implementations
Table 11.1 Raspberry Pi GPIO register map
Table 11.2 GPIO pin function select bits
Table 11.3 GPPUD control codes
Table 11.4 Raspberry Pi expansion header useful alternate functions
Table 11.5 Number of pins available on each of the AllWinner A10/A20 PIO ports
Table 11.6 Registers in the AllWinner GPIO device
Table 11.7 AllWinner A10/A20 GPIO pin function select bits
Table 11.8 Pull-up and pull-down resistor control codes
Table 11.9 pcDuino GPIO pins and function select code assignments
Table 12.1 Raspberry Pi PWM register map
Table 12.2 Raspberry Pi PWM control register bits
Table 12.3 Prescaler bits in the pcDuino PWM device
Table 12.4 pcDuino PWM register map
Table 12.5 pcDuino PWM control register bits
Table 13.1 Clock sources available for the clocks provided by the clock manager
Table 13.2 Some registers in the clock manager device
Table 13.3 Bit fields in the clock manager control registers
Table 13.4 Bit fields in the clock manager divisor registers
Table 13.5 Clock signals in the AllWinner A10/A20 SOC
Table 13.6 Raspberry Pi UART0 register map
Table 13.7 Raspberry Pi UART data register
Table 13.8 Raspberry Pi UART receive status register/error clear register
Table 13.9 Raspberry Pi UART flags register bits
Table 13.10 Raspberry Pi UART integer baud rate divisor
Table 13.11 Raspberry Pi UART fractional baud rate divisor
Table 13.12 Raspberry Pi UART line control register bits
Table 13.13 Raspberry Pi UART control register bits
Table 13.14 pcDuino UART addresses
Table 13.15 pcDuino UART register offsets
Table 13.16 pcDuino UART receive buffer register
Table 13.17 pcDuino UART transmit holding register
Table 13.18 pcDuino UART divisor latch low register
Table 13.19 pcDuino UART divisor latch high register
Table 13.20 pcDuino UART FIFO control register
Table 13.21 pcDuino UART line control register
Table 13.22 pcDuino UART line status register
Table 13.23 pcDuino UART status register
Table 13.24 pcDuino UART transmit FIFO level register
Table 13.25 pcDuino UART receive FIFO level register
Table 13.26 pcDuino UART transmit halt register
Table 14.1 The ARM user and system registers
Table 14.2 Mode bits in the PSR
Table 14.3 ARM vector table
Figure 1.1 Simplified representation of a computer system
Figure 1.2 Stages of a typical compilation sequence
Figure 1.3 Tables used for converting between binary, octal, and hex
Figure 1.4 Four different representations for binary integers
Figure 1.5 Complement tables for bases ten and two
Figure 1.6 A section of memory
Figure 1.7 Typical memory layout for a program with a 32-bit address space
Figure 2.1 Equivalent static variable declarations in assembly and C
Figure 3.1 The ARM processor architecture
Figure 3.2 The ARM user program registers
Figure 3.3 The ARM process status register
Figure 5.1 ARM user program registers
Figure 6.1 Binary tree of word frequencies
Figure 6.2 Binary tree of word frequencies with index added
Figure 6.3 Binary tree of word frequencies with sorted index
Figure 7.1 In signed 8-bit math, 11011001₂ is −39₁₀
Figure 7.2 In unsigned 8-bit math, 11011001₂ is 217₁₀
Figure 7.3 Multiplication of large numbers
Figure 7.4 Longhand division in decimal and binary
Figure 7.5 Flowchart for binary division
Figure 8.1 Examples of fixed-point signed arithmetic
Figure 9.1 ARM integer and vector floating point user program registers
Figure 9.2 Bits in the FPSCR
Figure 10.1 ARM integer and NEON user program registers
Figure 10.2 Pixel data interleaved in three doubleword registers
Figure 10.3 Pixel data de-interleaved in three doubleword registers
Figure 10.4 Example of vext.8 d12,d4,d9,#5
Figure 10.5 Examples of the vrev instruction. (A) vrev16.8 d3,d4; (B) vrev32.16 d8,d9; (C) vrev32.8 d5,d7
Figure 10.6 Examples of the vtrn instruction. (A) vtrn.8 d14,d15; (B) vtrn.32 d31,d15
Figure 10.7 Transpose of a 3 × 3 matrix
Figure 10.8 Transpose of a 4 × 4 matrix of 32-bit numbers
Figure 10.9 Example of vzip.8 d9,d4
Figure 10.10 Effects of vsli.32 d4,d9,#6
Figure 11.1 Typical hardware address mapping for memory and devices
Figure 11.2 GPIO pins being used for input and output. (A) GPIO pin being used as input to read the state of a push-button switch. (B) GPIO pin being used as output to drive an LED
Figure 11.3 The Raspberry Pi expansion header location
Figure 11.4 The Raspberry Pi expansion header pin assignments
Figure 11.5 Bit-to-pin assignments for PIO control registers
Figure 11.6 The pcDuino header locations
Figure 11.7 The pcDuino header pin assignments
Figure 12.1 Pulse density modulation
Figure 12.2 Pulse width modulation
Figure 13.1 Typical system with a clock management device
Figure 13.2 Transmitter and receiver timings for two UARTs. (A) Waveform of a UART transmitting a byte. (B) Timing of UART receiving a byte
Figure 14.1 The ARM process status register
Figure 14.2 Basic exception processing
Figure 14.3 Exception processing with multiple user processes
Listing 2.1 “Hello World” program in ARM assembly
Listing 2.2 “Hello World” program in C
Listing 2.3 “Hello World” assembly listing
Listing 2.4 A listing with mis-aligned data
Listing 2.5 A listing with properly aligned data
Listing 2.6 Defining a symbol for the number of elements in an array
Listing 5.1 Selection in C
Listing 5.2 Selection in ARM assembly using conditional execution
Listing 5.3 Selection in ARM assembly using branch instructions
Listing 5.4 Complex selection in C
Listing 5.5 Complex selection in ARM assembly
Listing 5.6 Unconditional loop in ARM assembly
Listing 5.7 Pre-test loop in ARM assembly
Listing 5.8 Post-test loop in ARM assembly
Listing 5.9 for loop in C
Listing 5.10 for loop rewritten as a pre-test loop in C
Listing 5.11 Pre-test loop in ARM assembly
Listing 5.12 for loop rewritten as a post-test loop in C
Listing 5.13 Post-test loop in ARM assembly
Listing 5.14 Calling scanf and printf in C
Listing 5.15 Calling scanf and printf in ARM assembly
Listing 5.16 Simple function call in C
Listing 5.17 Simple function call in ARM assembly
Listing 5.18 A larger function call in C
Listing 5.19 A larger function call in ARM assembly
Listing 5.20 A function call using the stack in C
Listing 5.21 A function call using the stack in ARM assembly
Listing 5.22 A function call using stm to push arguments onto the stack
Listing 5.23 A small function in C
Listing 5.24 A small function in ARM assembly
Listing 5.25 A small C function with a register variable
Listing 5.26 Automatic variables in ARM assembly
Listing 5.27 A C program that uses recursion to reverse a string
Listing 5.28 ARM assembly implementation of the reverse function
Listing 5.29 Better implementation of the reverse function
Listing 5.30 Even better implementation of the reverse function
Listing 5.31 String reversing in C using pointers
Listing 5.32 String reversing in assembly using pointers
Listing 5.33 Initializing an array of integers in C
Listing 5.34 Initializing an array of integers in assembly
Listing 5.35 Initializing a structured data type in C
Listing 5.36 Initializing a structured data type in ARM assembly
Listing 5.37 Initializing an array of structured data in C
Listing 5.38 Initializing an array of structured data in assembly
Listing 5.39 Improved initialization in assembly
Listing 5.40 Very efficient initialization in assembly
Listing 6.1 Definition of an Abstract Data Type in a C header file
Listing 6.2 Definition of the image structure may be hidden in a separate header file
Listing 6.3 Definition of an ADT in assembly
Listing 6.4 C program to compute word frequencies
Listing 6.5 C header for the wordlist ADT
Listing 6.6 C implementation of the wordlist ADT
Listing 6.7 Makefile for the wordfreq program
Listing 6.8 ARM assembly implementation of wl_print_numerical()
Listing 6.9 Revised makefile for the wordfreq program
Listing 6.10 C implementation of the wordlist ADT using a tree
Listing 6.11 ARM assembly implementation of wl_print_numerical() with a tree
Listing 7.1 ARM assembly code for adding two 64-bit numbers
Listing 7.2 ARM assembly code for multiplication with a 64-bit result
Listing 7.3 ARM assembly code for multiplication with a 32 bit result 177
Listing 7.4 ARM assembly implementation of signed and unsigned 32-bit and 64-bit division functions 187
Listing 7.5 ARM assembly code for division by a constant 192
Listing 7.6 ARM assembly code for division of a variable by a constant without using a multiply instruction 193
Listing 7.7 Header file for a big integer abstract data type 195
Listing 7.8 C source code file for a big integer abstract data type 196
Listing 7.9 Program using the bigint ADT to calculate the factorial function 211
Listing 7.10 ARM assembly implementation of the bigint_adc function 213
Listing 8.1 Examples of fixed-point multiplication in ARM assembly 233
Listing 8.2 Dividing x by 23 239
Listing 8.3 Dividing x by 23 Using Only Shift and Add 240
Listing 8.4 Dividing x by − 50 242
Listing 8.5 Inefficient representation of a binimal 242
Listing 8.6 Efficient representation of a binimal 243
Listing 8.7 ARM assembly implementation of sin x and cos x using fixed-point calculations 252
Listing 8.8 Example showing how the sin x and cos x functions can be used to print a table 257
Listing 9.1 Simple scalar implementation of the sin x function using IEEE single precision 285
Listing 9.2 Simple scalar implementation of the sin x function using IEEE double precision 286
Listing 9.3 Vector implementation of the sin x function using IEEE single precision 288
Listing 9.4 Vector implementation of the sin x function using IEEE double precision 289
Listing 10.1 NEON implementation of the sin x function using single precision 354
Listing 10.2 NEON implementation of the sin x function using double precision 355
Listing 11.1 Function to map devices into the user program memory on a Raspberry Pi 367
Listing 11.2 Function to map devices into the user program memory space on a pcDuino 372
Listing 11.3 ARM assembly code to set GPIO pin 26 to alternate function 1 381
Listing 11.4 ARM assembly code to configure PA10 for output 388
Listing 11.5 ARM assembly code to set PA10 to output a high state 389
Listing 11.6 ARM assembly code to read the state of PI14 and set or clear the Z flag 389
Listing 13.1 Assembly functions for using the Raspberry Pi UART 418
Listing 14.1 Definitions for ARM CPU modes 435
Listing 14.2 Function to set up the ARM exception table 439
Listing 14.3 Stubs for the exception handlers 440
Listing 14.4 Skeleton for an exception handler 441
Listing 14.5 ARM startup code 443
Listing 14.6 A simple main program 446
Listing 14.7 A sample Gnu linker script 448
Listing 14.8 A sample make file 450
Listing 14.9 Running make to build the image 451
Listing 14.10 An improved main program 452
Listing 14.11 ARM startup code with timer interrupt 453
Listing 14.12 Functions to manage the pcDuino interrupt controller 454
Listing 14.13 Functions to manage the Raspberry Pi interrupt controller 457
Listing 14.14 Functions to manage the pcDuino timer0 device 459
Listing 14.15 Functions to manage the Raspberry Pi timer0 device 460
Listing 14.16 IRQ handler to clear the timer interrupt 462
Listing 14.17 A sample make file 463
Listing 14.18 Running make to build the image 464
This book is intended to be used in a first course in assembly language programming for Computer Science (CS) and Computer Engineering (CE) students. It is assumed that students using this book have already taken courses in programming and data structures, and are competent programmers in at least one high-level language. Many of the code examples in the book are written in C, with an assembly implementation following. The assembly examples can stand on their own, but students who are familiar with C, C++, or Java should find the C examples helpful.
Computer Science and Computer Engineering are very large fields. It is impossible to cover everything that a student may eventually need to know. There are a limited number of course hours available, so educators must strive to deliver degree programs that make a compromise between the number of concepts and skills that the students learn and the depth at which they learn those concepts and skills. Obviously, with these competing goals it is difficult to reach consensus on exactly what courses should be included in a CS or CE curriculum.
Traditionally, assembly language courses have consisted of a mechanistic learning of a set of instructions, registers, and syntax. Partially because of this approach, over the years, assembly language courses have been marginalized in, or removed altogether from, many CS and CE curricula. The author feels that this is unfortunate, because a solid understanding of assembly language leads to better understanding of higher-level languages, compilers, interpreters, architecture, operating systems, and other important CS and CE concepts.
One of the goals of this book is to make a course in assembly language more valuable by introducing methods (and a bit of theory) that are not covered in any other CS or CE courses, while using assembly language to implement the methods. In this way, the course in assembly language goes far beyond the traditional assembly language course, and can once again play an important role in the overall CS and CE curricula.
Because of their ubiquity, x86 based systems have been the platforms of choice for most assembly language courses over the last two decades. The author believes that this is unfortunate, because in every respect other than ubiquity, the x86 architecture is the worst possible choice for learning and teaching assembly language. The newer chips in the family have hundreds of instructions, and irregular rules govern how those instructions can be used. In an attempt to make it possible for students to succeed, typical courses use antiquated assemblers and interface with the antiquated IBM PC BIOS, using only a small subset of the modern x86 instruction set. The programming environment has little or no relevance to modern computing.
Partially because of this tendency to use x86 platforms, and the resulting unnecessary burden placed on students and instructors, as well as the reliance on antiquated and irrelevant development environments, assembly language is often viewed by students as very difficult and lacking in value. The author hopes that this textbook helps students to realize the value of knowing assembly language. The relatively simple ARM processor family was chosen in hopes that the students also learn that although assembly language programming may be more difficult than high-level languages, it can be mastered.
The recent development of very low-cost ARM based Linux computers has caused a surge of interest in the ARM architecture as an alternative to the x86 architecture, which has become increasingly complex over the years. This book should provide a solution for a growing need.
Many students have difficulty with the concept that a register can hold variable x at one point in the program, and hold variable y at some other point. They also often have difficulty with the concept that, before it can be involved in any computation, data has to be moved from memory into the CPU. Using a load-store architecture helps the students to more readily grasp these concepts.
Another common difficulty that students have is in relating the concepts of an address and a pointer variable. You can almost see the little light bulbs light up over their heads when they have the “eureka!” moment and realize that pointers are just variables that hold an address. The author hopes that the approach taken in this book will make it easier for students to have that “eureka!” moment. The author believes that load-store architectures make that realization easier.
Many students also struggle with the concept of recursion, regardless of what language is used. In assembly, the mechanisms involved are exposed and directly manipulated by the programmer. Examples of recursion are scattered throughout this textbook. Again, the clean architecture of the ARM makes it much easier for the students to understand what is going on.
Some students have difficulty understanding the flow of a program, and tend to put many unnecessary branches into their code. Many assembly language courses spend so much time and space on learning the instruction set that they never have time to teach good programming practices. This textbook puts strong emphasis on using structured programming concepts. The relative simplicity of the ARM architecture makes this possible.
One of the major reasons to learn and use assembly language is that it allows the programmer to create very efficient mathematical routines. The concepts introduced in this book will enable students to perform efficient non-integral math on any processor. These techniques are rarely taught because of the time that it takes to cover the x86 instruction set. With the ARM processor, less time is spent on the instruction set, and more time can be spent teaching how to optimize the code.
The combination of the ARM processor and the Linux operating system provides the least costly hardware platform and development environment available. A cluster of 10 Raspberry Pis, or similar hosts, with power supplies and networking, can be assembled for 500 US dollars or less. This cluster can support up to 50 students logging in through ssh. If their client platform supports the X window system, then they can run GUI enabled applications. Alternatively, most low-cost ARM systems can directly drive a display and take input from a keyboard and mouse. With the addition of an NFS server (which itself could be a low-cost ARM system and a hard drive), an entire Linux ARM based laboratory of 20 workstations could be built for 250 US dollars per seat or less. Admittedly, it would not be a high-performance laboratory, but could be used to teach C, assembly, and other languages. The author would argue that inexperienced programmers should learn to program on low-performance machines, because it reinforces a life-long tendency towards efficiency.
The approach of this book is to present concepts in different ways throughout the book, slowly building from simple examples towards complex programming on bare-metal embedded systems. Students who don’t understand a concept when it is explained in a certain way may easily grasp the concept when it is presented later from a different viewpoint.
The main objective of this book is to provide an improved course in assembly language by replacing the x86 platform with one that is less costly, more ubiquitous, well-designed, powerful, and easier to learn. Since students are able to master the basics of assembly language quickly, it is possible to teach a wider range of topics, such as fixed and floating point mathematics, ethical considerations, performance tuning, and interrupt processing. The author hopes that courses using this book will better prepare students for the junior and senior level courses in operating systems, computer architecture, and compilers.
Please visit the companion web site to access additional resources. Instructors may download the author’s lecture slides and solution manual for the exercises. Students and instructors may also access the laboratory manual and additional code examples. The author welcomes suggestions for additional lecture slides, laboratory assignments, or other materials.
I would like to thank Randy Warner for reading the manuscript, catching errors, and making helpful suggestions. I would also like to thank the following students for suggesting exercises with answers and catching numerous errors in the drafts: Zach Buechler, Preston Cook, Joshua Daybrest, Matthew DeYoung, Josh Dodd, Matt Dyke, Hafiza Farzami, Jeremy Goens, Lawrence Hoffman, Colby Johnson, Benjamin Kaiser, Lauren Keene, Jayson Kjenstad, Murray LaHood-Burns, Derek Lane, Yanlin Li, Luke Meyer, Matthew Mielke, Forrest Miller, Christopher Navarro, Girik Ranchhod, Josh Schweigert, Christian Sieh, Weston Silbaugh, Jacob St. Amand, Njaal Tengesdal, Dylan Thoeny, Michael Vortherms, Dicheng Wu, and Kekoa (Peter) Yamaguchi. Finally, I am also very grateful for my assistants, Scott Logan, Ian Carlson, and Derek Stotz, who gave very valuable feedback during the writing of this book.
Assembly as a Language
This chapter first gives a very high-level description of the major components and functions of a computer system. It then motivates the reader by giving reasons why learning assembly language is important for Computer Scientists and Computer Engineers. It then explains why the ARM processor is a good choice for a first assembly language. Next it explains binary data representations, including various integer formats, ASCII, and Unicode. Finally, it describes the memory sections for a typical program during execution. By the end of the chapter, the groundwork has been laid for learning to program in assembly language.
Instruction; Instruction stream; Central processing unit; Memory; Input/output device; High-level language; Assembly language; ARM processor; Binary; Hexadecimal; Decimal; Radix or base system; Base conversion; Sign magnitude; Unsigned; Complement; Excess-n; ASCII; Unicode; UTF-8; Stack; Heap; Data section; Text section
An executable computer program is, ultimately, just a series of numbers that have very little or no meaning to a human being. We have developed a variety of human-friendly languages in which to express computer programs, but in order for the program to execute, it must eventually be reduced to a stream of numbers. Assembly language is one step above writing the stream of numbers. The stream of numbers is called the instruction stream. Each number in the instruction stream instructs the computer to perform one (usually small) operation. Although each instruction does very little, the ability of the programmer to specify any sequence of instructions and the ability of the computer to perform billions of these small operations every second makes modern computers very powerful and flexible tools. In assembly language, one line of code usually gets translated into one machine instruction. In high-level languages, a single line of code may generate many machine instructions.
A simplified model of a computer system, as shown in Fig. 1.1, consists of memory, input/output devices, and a central processing unit (CPU), connected together by a system bus. The bus can be thought of as a roadway that allows data to travel between the components of the computer system. The CPU is the part of the system where most of the computation occurs, and the CPU controls the other devices in the system.

Memory can be thought of as a series of mailboxes. Each mailbox can hold a single postcard with a number written on it, and each mailbox has a unique numeric identifier. The identifier, x, is called the memory address, and the number stored in the mailbox is called the contents of address x. Some of the mailboxes contain data, and others contain instructions which control what actions are performed by the CPU.
The CPU also contains a much smaller set of mailboxes, which we call registers. Data can be copied from cards stored in memory to cards stored in the CPU, or vice-versa. Once data has been copied into one of the CPU registers, it can be used in computation. For example, in order to add two numbers in memory, they must first be copied into registers on the CPU. The CPU can then add the numbers together and store the result in one of the CPU registers. The result of the addition can then be copied back into one of the mailboxes in the memory.
Modern computers execute instructions sequentially. In other words, the next instruction to be executed is at the memory address immediately following the current instruction. One of the registers in the CPU, the program counter (PC), keeps track of the location from which the next instruction is to be fetched. The CPU follows a very simple sequence of actions. It fetches an instruction from memory, increments the PC, executes the instruction, and then repeats the process with the next instruction. However, some instructions may change the PC, so that the next instruction is fetched from a non-sequential address.
There are many high-level programming languages, such as Java, Python, C, and C++ that have been designed to allow programmers to work at a high level of abstraction, so that they do not need to understand exactly what instructions are needed by a particular CPU. For compiled languages, such as C and C++, a compiler handles the task of translating the program, written in a high-level language, into assembly language for the particular CPU on the system. An assembler then converts the program from assembly language into the binary codes that the CPU reads as instructions.
High-level languages can greatly enhance programmer productivity. However, there are some situations where writing assembly code directly is desirable or necessary. For example, assembly language may be the best choice when writing
• the first steps in booting the computer,
• code to handle interrupts,
• low-level locking code for multi-threaded programs,
• code for machines where no compiler exists,
• code which needs to be optimized beyond the limits of the compiler,
• code for computers with very limited memory, and
• code that requires low-level access to architectural and/or processor features.
Aside from sheer necessity, there are several other reasons why it is still important for computer scientists to learn assembly language.
One example where knowledge of assembly is indispensable is when designing and implementing compilers for high-level languages. As shown in Fig. 1.2, a typical compiler for a high-level language must generate assembly language as its output. Most compilers are designed to have multiple stages. In the input stage, the source language is read and converted into a graph representation. The graph may be optimized before being passed to the output, or code generation, stage where it is converted to assembly language. The assembly is then fed into the system’s assembler to generate an object file. The object file is linked with other object files (which are often combined into libraries) to create an executable program.

The code generation stage of a compiler must traverse the graph and emit assembly code. The quality of the assembly code that is generated can have a profound influence on the performance of the executable program. Therefore, the programmer responsible for the code generation portion of the compiler must be well versed in assembly programming for the target CPU.
Some people believe that a good optimizing compiler will generate better assembly code than a human programmer. This belief is not justified. Highly optimizing compilers employ many clever algorithms, but like all programs, they are not perfect. Outside of the cases that they were designed for, they do not optimize well. Many newer CPUs have instructions which operate on multiple items of data at once. However, compilers rarely make use of these powerful single instruction, multiple data (SIMD) instructions. Instead, it is common for programmers to write functions in assembly language to take advantage of SIMD instructions. The assembly functions are assembled into object files, then linked with the object files generated from the high-level language compiler.
Many modern processors also have some support for processing vectors (arrays). Compilers are usually not very good at making effective use of the vector instructions. In order to achieve excellent vector performance for audio or video codecs and other time-critical code, it is often necessary to resort to small pieces of assembly code in the performance-critical inner loops. A good example of this type of code is when performing vector and matrix multiplies. Such operations are commonly needed in processing images and in graphical applications. The ARM vector instructions are explained in Chapter 9.
Another reason to use assembly language is when writing certain parts of an operating system. Although modern operating systems are mostly written in high-level languages, some portions of the code can only be written in assembly. Typical uses of assembly language include writing device drivers, saving the state of a running program so that another program can use the CPU, restoring the saved state of a running program so that it can resume executing, and managing memory and memory protection hardware. There are many other tasks central to a modern operating system which can only be accomplished in assembly language. Careful design of the operating system can minimize the amount of assembly required, but cannot eliminate it completely.
Another good reason to learn assembly is for debugging. Simply understanding what is going on “behind the scenes” of compiled languages such as C and C++ can be very valuable when trying to debug programs. If there is a problem in a call to a third party library, sometimes the only way a developer can isolate and diagnose the problem is to run the program under a debugger and step through it one machine instruction at a time. This does not require a deep knowledge of assembly language coding but at least a passing familiarity with assembly is helpful in that particular case. Analysis of assembly code is an important skill for C and C++ programmers, who may occasionally have to diagnose a fault by looking at the contents of CPU registers and single-stepping through machine instructions.
Assembly language is an important part of the path to understanding how the machine works. Even though only a small percentage of computer scientists will be lucky enough to work on the code generator of a compiler, they all can benefit from the deeper level of understanding gained by learning assembly language. Many programmers do not really understand pointers until they have written assembly language.
Without first learning assembly language, it is impossible to learn advanced concepts such as microcode, pipelining, instruction scheduling, out-of-order execution, threading, branch prediction, and speculative execution. There are many other concepts, especially when dealing with operating systems and computer architecture, which require some understanding of assembly language. The best programmers understand why some language constructs perform better than others, how to reduce cache misses, and how to prevent buffer overruns that destroy security.
Every program is meant to run on a real machine. Even though there are many languages, compilers, virtual machines, and operating systems to enable the programmer to use the machine more conveniently, the strengths and weaknesses of that machine still determine what is easy and what is hard. Learning assembly is a fundamental part of understanding enough about the machine to make informed choices about how to write efficient programs, even when writing in a high-level language.
As an analogy, most people do not need to know a lot about how an internal combustion engine works in order to operate an automobile. A race car driver needs a much better understanding of exactly what happens when he or she steps on the accelerator pedal in order to be able to judge precisely when (and how hard) to do so. Also, who would trust their car to a mechanic who could not tell the difference between a spark plug and a brake caliper? Worse still, should we trust an engineer to build a car without that knowledge? Even in this day of computerized cars, someone needs to know the gritty details, and they are paid well for that knowledge. Knowledge of assembly language is one of the things that defines the computer scientist and engineer.
When learning assembly language, the specific instruction set is not critically important, because what is really being learned is the fine detail of how a typical stored-program machine uses different storage locations and logic operations to convert a string of bits into a meaningful calculation. However, when it comes to learning assembly languages, some processors make it more difficult than it needs to be. Because some processors have an instruction set that is extremely irregular, non-orthogonal, large, and poorly designed, they are not a good choice for learning assembly. The author feels that teaching students their first assembly language on one of those processors should be considered a crime, or at least a form of mental abuse. Luckily, there are processors that are readily available, low-cost, and relatively easy to learn assembly with. This book uses one of them as the model for assembly language.
In the late 1970s, the microcomputer industry was a fierce battleground, with several companies competing to sell computers to small business and home users. One of those companies, based in the United Kingdom, was Acorn Computers Ltd. Acorn’s flagship product, the BBC Micro, was based on the same processor that Apple Computer had chosen for their Apple II™ line of computers: the 8-bit 6502 made by MOS Technology. As the 1980s approached, microcomputer manufacturers were looking for more powerful 16-bit and 32-bit processors. The engineers at Acorn considered the processor chips that were available at the time, and concluded that there was nothing available that would meet their needs for the next generation of Acorn computers.
The only reasonably-priced processors that were available were the Motorola 68000 (a 32-bit processor used in the Apple Macintosh and most high-end Unix workstations) and the Intel 80286 (a 16-bit processor used in less powerful personal computers such as the IBM PC). During the previous decade, a great deal of research had been conducted on developing high-performance computer architectures. One of the outcomes of that research was the development of a new paradigm for processor design, known as Reduced Instruction Set Computing (RISC). One advantage of RISC processors was that they could deliver higher performance with a much smaller number of transistors than the older Complex Instruction Set Computing (CISC) processors such as the 68000 and 80286. The engineers at Acorn decided to design and produce their own processor. They used the BBC Micro to design and simulate their new processor, and in 1987, they introduced the Acorn Archimedes™. The Archimedes™ was arguably the most powerful home computer in the world at that time, with graphics and audio capabilities that IBM PC™ and Apple Macintosh™ users could only dream about. Thus began the long and successful dynasty of the Acorn RISC Machine (ARM) processor.
Acorn never made a big impact on the global computer market. Although Acorn eventually went out of business, the processor that they created has lived on. It was re-named to the Advanced RISC Machine, and is now known simply as ARM. Stewardship of the ARM processor belongs to ARM Holdings plc, which manages the design of new ARM architectures and licenses the manufacturing rights to other companies. ARM Holdings does not manufacture any processor chips, yet more ARM processors are produced annually than all other processor designs combined. Most ARM processors are used as components for embedded systems and portable devices. If you have a smart phone or similar device, then there is a very good chance that it has an ARM processor in it. Because of its enormous market presence, clean architecture, and small, orthogonal instruction set, the ARM is a very good choice for learning assembly language.
Although it dominates the portable device market, the ARM processor has almost no presence in the desktop or server market. However, that may change. In 2012, ARM Holdings announced the ARM64 architecture, which is the first major redesign of the ARM architecture in 30 years. The ARM64 is intended to compete for the desktop and server market with other high-end processors such as the Sun SPARC and Intel Xeon. Regardless of whether or not the ARM64 achieves much market penetration, the original ARM 32-bit processor architecture is so ubiquitous that it clearly will be around for a long time.
The basic unit of data in a digital computer is the binary digit, or bit. A bit can have a value of zero or one. In order to store numbers larger than 1, bits are combined into larger units. For instance, using two bits, it is possible to represent any number between zero and three. This is shown in Table 1.1. When stored in the computer, all data is simply a string of binary digits. There is more than one way that such a fixed-length string of binary digits can be interpreted.
Computers have been designed using many different bit group sizes, including 4, 8, 10, 12, and 14 bits. Today most computers recognize a basic grouping of 8 bits, which we call a byte. Some computers can work in units of 4 bits, which is commonly referred to as a nibble (sometimes spelled “nybble”). A nibble is a convenient size because it can exactly represent one hexadecimal digit. Additionally, most modern computers can also work with groupings of 16, 32 and 64 bits. The CPU is designed with a default word size. For most modern CPUs, the default word size is 32 bits. Many processors support 64-bit words, which is increasingly becoming the default size.
A numeral system is a writing system for expressing numbers. The most common system is the Hindu-Arabic number system, which is now used throughout the world. Almost from the first day of formal education, children begin learning how to add, subtract, and perform other operations using the Hindu-Arabic system. After years of practice, performing basic mathematical operations using strings of digits between 0 and 9 seems natural. However, there are other ways to count and perform arithmetic, such as Roman numerals, unary systems, and Chinese numerals. With a little practice, it is possible to become as proficient at performing mathematics with other number systems as with the Hindu-Arabic system.
The Hindu-Arabic system is a base ten or radix ten system, because it uses the ten digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. For our purposes, the words radix and base are equivalent, and refer to the number of individual digits available in the numbering system. The Hindu-Arabic system is also a positional system, or a place-value notation, because the value of each digit in a number depends on its position in the number. The radix ten Hindu-Arabic system is only one of an infinite family of closely related positional systems. The members of this family differ only in the radix used (and therefore, the number of characters used). For bases greater than base ten, characters are borrowed from the alphabet and used to represent digits. For example, the first column in Table 1.2 shows the character “A” being used as a single digit representation for the number 10.
Table 1.2
The first 21 integers (starting with 0) in various bases
| Base 16 | Base 10 | Base 9 | Base 8 | Base 7 | Base 6 | Base 5 | Base 4 | Base 3 | Base 2 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 10 |
| 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 10 | 11 |
| 4 | 4 | 4 | 4 | 4 | 4 | 4 | 10 | 11 | 100 |
| 5 | 5 | 5 | 5 | 5 | 5 | 10 | 11 | 12 | 101 |
| 6 | 6 | 6 | 6 | 6 | 10 | 11 | 12 | 20 | 110 |
| 7 | 7 | 7 | 7 | 10 | 11 | 12 | 13 | 21 | 111 |
| 8 | 8 | 8 | 10 | 11 | 12 | 13 | 20 | 22 | 1000 |
| 9 | 9 | 10 | 11 | 12 | 13 | 14 | 21 | 100 | 1001 |
| A | 10 | 11 | 12 | 13 | 14 | 20 | 22 | 101 | 1010 |
| B | 11 | 12 | 13 | 14 | 15 | 21 | 23 | 102 | 1011 |
| C | 12 | 13 | 14 | 15 | 20 | 22 | 30 | 110 | 1100 |
| D | 13 | 14 | 15 | 16 | 21 | 23 | 31 | 111 | 1101 |
| E | 14 | 15 | 16 | 20 | 22 | 24 | 32 | 112 | 1110 |
| F | 15 | 16 | 17 | 21 | 23 | 30 | 33 | 120 | 1111 |
| 10 | 16 | 17 | 20 | 22 | 24 | 31 | 100 | 121 | 10000 |
| 11 | 17 | 18 | 21 | 23 | 25 | 32 | 101 | 122 | 10001 |
| 12 | 18 | 20 | 22 | 24 | 30 | 33 | 102 | 200 | 10010 |
| 13 | 19 | 21 | 23 | 25 | 31 | 34 | 103 | 201 | 10011 |
| 14 | 20 | 22 | 24 | 26 | 32 | 40 | 110 | 202 | 10100 |

In base ten, we think of numbers as strings of the 10 digits, “0”–“9”. Each digit counts 10 times the amount of the digit to its right. If we restrict ourselves to integers, then the digit furthest to the right is always the ones digit. It is also referred to as the least significant digit. The digit immediately to the left of the ones digit is the tens digit. To the left of that is the hundreds digit, and so on. The leftmost digit is referred to as the most significant digit. The following equation shows how a number can be decomposed into its constituent digits:
57839₁₀ = 5 × 10⁴ + 7 × 10³ + 8 × 10² + 3 × 10¹ + 9 × 10⁰ = 50000 + 7000 + 800 + 30 + 9

Note that the subscript of “10” on 57839₁₀ indicates that the number is given in base ten.
Imagine that we only had 7 digits: 0, 1, 2, 3, 4, 5, and 6. We need 10 digits for base ten, so with only 7 digits we are limited to base seven. In base seven, each digit in the string represents a power of seven rather than a power of ten. We can represent any integer in base seven, but it may take more digits than in base ten. Other than using a different base for the power of each digit, the math works exactly the same as for base ten. For example, suppose we have the following number in base seven: 330425₇. We can convert this number to base ten as follows:

330425₇ = 3 × 7⁵ + 3 × 7⁴ + 0 × 7³ + 4 × 7² + 2 × 7¹ + 5 × 7⁰ = 50421 + 7203 + 0 + 196 + 14 + 5 = 57839₁₀
Base two, or binary, is the “native” number system for modern digital systems. This is mainly because it is relatively easy to build circuits with two stable states: on and off (or 1 and 0). Building circuits with more than two stable states is much more difficult and expensive, and any computation that can be performed in a higher base can also be performed in binary. The least significant (rightmost) digit in binary is referred to as the least significant bit, or LSB, while the leftmost binary digit is referred to as the most significant bit, or MSB.
The most common bases used by programmers are base two (binary), base eight (octal), base ten (decimal) and base sixteen (hexadecimal). Octal and hexadecimal are common because, as we shall see later, they can be translated quickly and easily to and from base two, and are often easier for humans to work with than base two. Note that for base sixteen, we need 16 characters. We use the digits 0 through 9 plus the letters A through F. Table 1.2 shows the equivalents for all numbers between 0 and 20 in base two through base ten, and base sixteen.
Before learning assembly language it is essential to know how to convert from any base to any other base. Since we are already comfortable working in base ten, we will use that as an intermediary when converting between two arbitrary bases. For instance, if we want to convert a number in base three to base five, we will do it by first converting the base three number to base ten, then from base ten to base five. By using this two-stage process, we will only need to learn to convert between base ten and any arbitrary base b.
Converting from an arbitrary base b to base ten simply involves multiplying each base b digit d by bⁿ, where n is the significance of digit d, and summing all of the results. For example, converting the base five number 3421₅ to base ten is performed as follows:

3421₅ = 3 × 5³ + 4 × 5² + 2 × 5¹ + 1 × 5⁰ = 375 + 100 + 10 + 1 = 486₁₀
This conversion procedure works for converting any integer from any arbitrary base b to its equivalent representation in base ten. Example 1.1 gives another specific example of how to convert from base b to base ten.
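The procedure can be sketched in Python; the helper name `to_decimal` is illustrative, not from the book:

```python
# Convert a digit string in an arbitrary base b (2-16) to a base-ten integer
# by weighting each digit by a power of b.
def to_decimal(digits, base):
    value = 0
    for d in digits:                       # most significant digit first
        value = value * base + int(d, 16)  # int(d, 16) maps '0'-'9' and 'A'-'F'
    return value

print(to_decimal("3421", 5))    # 486
print(to_decimal("330425", 7))  # 57839
```

Rather than computing each power of b separately, the loop multiplies the running total by b at each step, which amounts to the same weighting.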
Converting from base ten to an arbitrary base b involves repeated division by the base, b. After each division, the remainder is used as the next more significant digit in the base b number, and the quotient is used as the dividend for the next iteration. The process is repeated until the quotient is zero. For example, converting 56₁₀ to base four is accomplished as follows:

56 ÷ 4 = 14 remainder 0
14 ÷ 4 = 3 remainder 2
3 ÷ 4 = 0 remainder 3
Reading the remainders from last to first yields: 320₄. This result can be double-checked by converting it back to base ten as follows:

320₄ = 3 × 4² + 2 × 4¹ + 0 × 4⁰ = 48 + 8 + 0 = 56₁₀
Since we arrived at the same number we started with, we have verified that 56₁₀ = 320₄. This conversion procedure works for converting any integer from base ten to any arbitrary base b. Example 1.2 gives another example of converting from base ten to another base b.
Although it is possible to perform the division and multiplication steps in any base, most people are much better at working in base ten. For that reason, the easiest way to convert from any base a to any other base b is to use a two-step process. The first step is to convert from base a to base ten; the second step is to convert from base ten to base b. Example 1.3 shows how to convert from any base to any other base.
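The repeated-division procedure and the two-stage conversion can be sketched in Python (the helper name `from_decimal` is illustrative):

```python
# Convert a non-negative base-ten integer to a digit string in base b (2-16)
# by repeated division: each remainder is the next more significant digit.
def from_decimal(n, base):
    digit_chars = "0123456789ABCDEF"
    if n == 0:
        return "0"
    out = ""
    while n > 0:
        n, r = divmod(n, base)      # quotient feeds the next iteration
        out = digit_chars[r] + out  # remainders read from last to first
    return out

print(from_decimal(56, 4))              # 320
# Two-stage conversion from base 3 to base 5, via base ten:
print(from_decimal(int("1021", 3), 5))  # 1021 (base 3) = 34 (base 10) = 114 (base 5)
```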
In addition to the methods above, there is a simple method for quickly converting between base two, base eight, and base sixteen. These shortcuts rely on the fact that 2, 8, and 16 are all powers of two. Because of this, it takes exactly four binary digits (bits) to represent exactly one hexadecimal digit. Likewise, it takes exactly three bits to represent an octal digit. Conversely, each hexadecimal digit can be converted to exactly four binary digits, and each octal digit can be converted to exactly three binary digits. This relationship makes it possible to do very fast conversions using the tables shown in Fig. 1.3.

When converting from hexadecimal to binary, all that is necessary is to replace each hex digit with the corresponding binary digits from the table. For example, to convert 5AC4₁₆ to binary, we just replace “5” with “0101,” replace “A” with “1010,” replace “C” with “1100,” and replace “4” with “0100.” So, just by referring to the table, we can immediately see that 5AC4₁₆ = 0101101011000100₂. This method works exactly the same for converting from octal to binary, except that it uses the table on the right side of Fig. 1.3.
Converting from binary to hexadecimal is also very easy using the table. Given a binary number, n, take the four least significant digits of n and find them in the table on the left side of Fig. 1.3. The hexadecimal digit on the matching line of the table is the least significant hex digit. Repeat the process with the next set of four bits and continue until there are no bits remaining in the binary number. For example, to convert 0011100101010111₂ to hexadecimal, just divide the number into groups of four bits, starting on the right, to get: 0011|1001|0101|0111₂. Now replace each group of four bits by looking up the corresponding hex digit in the table on the left side of Fig. 1.3, to convert the binary number to 3957₁₆. In the case where the binary number does not have enough bits, simply pad with zeros in the high-order bits. For example, dividing the number 1001100010011₂ into groups of four yields 1|0011|0001|0011₂ and padding with zeros in the high-order bits results in 0001|0011|0001|0011₂. Looking up the four groups in the table reveals that 0001|0011|0001|0011₂ = 1313₁₆.
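The table-lookup shortcut can be sketched in Python; the table and helper names are illustrative:

```python
# Fast hex <-> binary conversion by table lookup: exactly four bits per hex
# digit, as in Fig. 1.3.
HEX_TO_BITS = {d: format(i, "04b") for i, d in enumerate("0123456789ABCDEF")}
BITS_TO_HEX = {bits: d for d, bits in HEX_TO_BITS.items()}

def hex_to_bin(h):
    return "".join(HEX_TO_BITS[d] for d in h)

def bin_to_hex(b):
    b = b.zfill(-(-len(b) // 4) * 4)  # pad high-order bits with zeros
    return "".join(BITS_TO_HEX[b[i:i + 4]] for i in range(0, len(b), 4))

print(hex_to_bin("5AC4"))           # 0101101011000100
print(bin_to_hex("1001100010011"))  # 1313
```

No arithmetic is performed at all; every digit is replaced by a fixed pattern from the table, which is why this conversion is so fast.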
The computer stores groups of bits, but the bits by themselves have no meaning. The programmer gives them meaning by deciding what the bits represent, and how they are interpreted. Interpreting a group of bits as unsigned integer data is relatively simple. Each bit is weighted by a power of two, and the value of the group of bits is the sum of the non-zero bits multiplied by their respective weights. However, programmers often need to represent negative as well as non-negative numbers, and there are many possibilities for storing and interpreting integers whose value can be both positive and negative. Programmers and hardware designers have developed several standard schemes for encoding such numbers. The three main methods for storing and interpreting signed integer data are two’s complement, sign-magnitude, and excess-N. Fig. 1.4 shows how the same binary pattern of bits can be interpreted as a number in four different ways.

The sign-magnitude representation simply reserves the most significant bit to represent the sign of the number, and the remaining bits are used to store the magnitude of the number. This method has the advantage that it is easy for humans to interpret, with a little practice. However, addition and subtraction are slightly complicated. The addition/subtraction logic must compare the sign bits, complement one of the inputs if they are different, implement an end-around carry, and complement the result if there was no carry from the most significant bit. Complements are explained in Section 1.3.3. Because of the complexity, most integer CPUs do not directly support addition and subtraction of integers in sign-magnitude form. However, this method is commonly used for the mantissa in floating-point numbers, as will be explained in Chapter 8. Another drawback to sign-magnitude is that it has two representations for zero, which can cause problems if the programmer is not careful.
Another method for representing both positive and negative numbers is by using an excess-N representation. With this representation, the number that is stored is N greater than the actual value. This representation is relatively easy for humans to interpret. Addition and subtraction are easily performed using the complement method, which is explained in Section 1.3.3. This representation is just the same as unsigned math, with the addition of a bias which is usually (2ⁿ⁻¹ − 1). So, zero is represented as zero plus the bias. With n = 12 bits, the bias is 2¹²⁻¹ − 1 = 2047₁₀, or 011111111111₂. This method is commonly used to store the exponent in floating-point numbers, as will be explained in Chapter 8.
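A minimal sketch of excess-N encoding and decoding in Python, using the 12-bit field from the example above (function names are illustrative):

```python
# Excess-N (biased) representation: the stored value is the actual value
# plus a bias of 2**(n-1) - 1, here with an n = 12 bit field.
N_BITS = 12
BIAS = 2 ** (N_BITS - 1) - 1  # 2047

def encode_excess(x):
    return x + BIAS  # e.g. zero is stored as the bias itself

def decode_excess(stored):
    return stored - BIAS

print(format(encode_excess(0), "012b"))  # 011111111111
print(decode_excess(0b011111111111))     # 0
```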
A very efficient method for dealing with signed numbers involves representing negative numbers as the radix complements of their positive counterparts. The complement is the amount that must be added to something to make it “whole.” For instance, in geometry, two angles are complementary if they add to 90°. In radix mathematics, the complement of a digit x in base b is simply b − x. For example, in base ten, the complement of 4 is 10 − 4 = 6.
In complement representation, the most significant digit of a number is reserved to indicate whether or not the number is negative. If the first digit is less than b/2 (where b is the radix), then the number is positive. If the first digit is greater than or equal to b/2, then the number is negative. The first digit is not part of the magnitude of the number, but only indicates the sign of the number. For example, numbers in ten’s complement notation are positive if the first digit is less than 5, and negative if the first digit is greater than 4. This works especially well in binary, since the number is considered positive if the first bit is zero and negative if the first bit is one. The magnitude of a negative number can be obtained by taking the radix complement. Because of the nice properties of the complement representation, it is the most common method for representing signed numbers in digital computers.
Finding the complement: The radix complement of an n digit number y in radix (base) b is defined as

C(y) = bⁿ − y  (1.4)

For example, the ten’s complement of the four digit number 8734₁₀ is 10⁴ − 8734 = 1266. In this example, we directly applied the definition of the radix complement from Eq. (1.4). That is easy in base ten, but not so easy in an arbitrary base, because it involves performing a subtraction. However, there is a very simple method for calculating the complement which does not require subtraction. This method involves finding the diminished radix complement, which is (bⁿ − 1) − y, by substituting each digit with its complement from a complement table. The radix complement is found by adding one to the diminished radix complement. Fig. 1.5 shows the complement tables for bases ten and two. Examples 1.4 and 1.5 show how the complement is obtained in bases ten and two respectively. Examples 1.6 and 1.7 show additional conversions between binary and decimal.
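The substitution method can be sketched in Python; for bases up to ten the digit complement (b − 1) − d is computed directly rather than looked up in a table, and the helper name `radix_complement` is illustrative:

```python
# Radix complement without subtraction: substitute each digit d with its
# diminished complement (b - 1) - d, then add one.
def radix_complement(digits, base):
    diminished = "".join(str(base - 1 - int(d)) for d in digits)
    value = int(diminished, base) + 1   # add one to the diminished complement
    out = ""
    for _ in range(len(digits)):        # render back as n digits, dropping
        value, r = divmod(value, base)  # any carry out of digit n
        out = str(r) + out
    return out

print(radix_complement("8734", 10))  # 1266
print(radix_complement("0101", 2))   # 1011
```

The second example is the familiar two's complement of binary 0101 (five), giving 1011 (negative five in four bits).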

Subtraction using complements: One very useful feature of complement notation is that it can be used to perform subtraction by using addition. Given two numbers x and y in base b, the difference can be computed as:

x − y = x + C(y) − bⁿ

where C(y) is the radix complement of y. Assume that x and y are both positive, where y ≤ x, and both numbers have the same number of digits n (y may have leading zeros). In this case, the result of x + C(y) will always be greater than or equal to bⁿ, but less than 2 × bⁿ. This means that the result of x + C(y) will always begin with a ‘1’ in the n + 1 digit position. Dropping the initial ‘1’ is equivalent to subtracting bⁿ, making the result x − y + bⁿ − bⁿ, or just x − y, which is the desired result. This can be reduced to a simple procedure. When y and x are both positive and y ≤ x, the following four steps are performed:
1. pad the subtrahend (y) with leading zeros, as necessary, so that both numbers have the same number of digits (n),
2. find the b’s complement of the subtrahend,
3. add the complement to the minuend,
4. discard the leading ‘1’.
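The four steps above can be sketched in Python (the function name is illustrative, and digits are assumed to be 0–9):

```python
# Subtraction by complement addition, following the four steps above
# (assumes x and y are non-negative and y <= x).
def subtract_via_complement(x_digits, y_digits, base):
    n = len(x_digits)
    y_digits = y_digits.rjust(n, "0")            # 1. pad the subtrahend
    diminished = "".join(str(base - 1 - int(d)) for d in y_digits)
    complement = int(diminished, base) + 1       # 2. b's complement
    total = int(x_digits, base) + complement     # 3. add to the minuend
    return total - base ** n                     # 4. discard the leading '1'

print(subtract_via_complement("8734", "1266", 10))  # 7468
print(subtract_via_complement("1010", "11", 2))     # 7  (10 - 3 in binary)
```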
The complement notation provides a very easy way to represent both positive and negative integers using a fixed number of digits, and to perform subtraction by using addition. Since modern computers typically use a fixed number of bits, complement notation provides a very convenient and efficient way to store signed integers and perform mathematical operations on them. Hardware is simplified because there is no need to build a specialized subtractor circuit. Instead, a very simple complement circuit is built and the adder is reused to perform subtraction as well as addition.
In the previous section, we discussed how the computer stores information as groups of bits, and how we can interpret those bits as numbers in base two. Given that the computer can only store information using groups of bits, how can we store textual information? The answer is that we create a table, which assigns a numerical value to each character in our language.
Early in the development of computers, several computer manufacturers developed such tables, or character coding schemes. These schemes were incompatible and computers from different manufacturers could not easily exchange textual data without the use of translation software to convert the character codes from one coding scheme to another.
Eventually, a standard coding scheme, known as the American Standard Code for Information Interchange (ASCII) was developed. Work on the ASCII standard began on October 6, 1960, with the first meeting of the American Standards Association’s (ASA) X3.2 subcommittee. The first edition of the standard was published in 1963. The standard was updated in 1967 and again in 1986, and has been stable since then. Within a few years of its development, ASCII was accepted by all major computer manufacturers, although some continue to support their own coding schemes as well.
ASCII was designed for American English, and does not support some of the characters that are used by non-English languages. For this reason, ASCII has been extended to create more comprehensive coding schemes. Most modern multilingual coding schemes are based on ASCII, though they support a wider range of characters.
At the time that it was developed, transmission of digital data over long distances was very slow, and usually involved converting each bit into an audio signal which was transmitted over a telephone line using an acoustic modem. In order to maximize performance, the standards committee chose to define ASCII as a 7-bit code. Because of this decision, all textual data could be sent using seven bits rather than eight, resulting in approximately 10% better overall performance when transmitting data over a telephone modem. A possibly unforeseen benefit was that this also provided a way for the code to be extended in the future. Since there are 128 possible values for a 7-bit number, the ASCII standard provides 128 characters. However, 33 of the ASCII characters are non-printing control characters. These characters, shown in Table 1.3, are mainly used to send information about how the text is to be displayed and/or printed. The remaining 95 printable characters are shown in Table 1.4.
Table 1.3
The ASCII control characters
| Binary | Oct | Dec | Hex | Abbr | Glyph | Name |
| 000 0000 | 000 | 0 | 00 | NUL | ˆ@ | Null character |
| 000 0001 | 001 | 1 | 01 | SOH | ˆA | Start of header |
| 000 0010 | 002 | 2 | 02 | STX | ˆB | Start of text |
| 000 0011 | 003 | 3 | 03 | ETX | ˆC | End of text |
| 000 0100 | 004 | 4 | 04 | EOT | ˆD | End of transmission |
| 000 0101 | 005 | 5 | 05 | ENQ | ˆE | Enquiry |
| 000 0110 | 006 | 6 | 06 | ACK | ˆF | Acknowledgment |
| 000 0111 | 007 | 7 | 07 | BEL | ˆG | Bell |
| 000 1000 | 010 | 8 | 08 | BS | ˆH | Backspace |
| 000 1001 | 011 | 9 | 09 | HT | ˆI | Horizontal tab |
| 000 1010 | 012 | 10 | 0A | LF | ˆJ | Line feed |
| 000 1011 | 013 | 11 | 0B | VT | ˆK | Vertical tab |
| 000 1100 | 014 | 12 | 0C | FF | ˆL | Form feed |
| 000 1101 | 015 | 13 | 0D | CR | ˆM | Carriage return |
| 000 1110 | 016 | 14 | 0E | SO | ˆN | Shift out |
| 000 1111 | 017 | 15 | 0F | SI | ˆO | Shift in |
| 001 0000 | 020 | 16 | 10 | DLE | ˆP | Data link escape |
| 001 0001 | 021 | 17 | 11 | DC1 | ˆQ | Device control 1 (oft. XON) |
| 001 0010 | 022 | 18 | 12 | DC2 | ˆR | Device control 2 |
| 001 0011 | 023 | 19 | 13 | DC3 | ˆS | Device control 3 (oft. XOFF) |
| 001 0100 | 024 | 20 | 14 | DC4 | ˆT | Device control 4 |
| 001 0101 | 025 | 21 | 15 | NAK | ˆU | Negative acknowledgement |
| 001 0110 | 026 | 22 | 16 | SYN | ˆV | Synchronous idle |
| 001 0111 | 027 | 23 | 17 | ETB | ˆW | End of transmission Block |
| 001 1000 | 030 | 24 | 18 | CAN | ˆX | Cancel |
| 001 1001 | 031 | 25 | 19 | EM | ˆY | End of medium |
| 001 1010 | 032 | 26 | 1A | SUB | ˆZ | Substitute |
| 001 1011 | 033 | 27 | 1B | ESC | ˆ[ | Escape |
| 001 1100 | 034 | 28 | 1C | FS | ˆ\ | File separator |
| 001 1101 | 035 | 29 | 1D | GS | ˆ] | Group separator |
| 001 1110 | 036 | 30 | 1E | RS | ˆˆ | Record separator |
| 001 1111 | 037 | 31 | 1F | US | ˆ_ | Unit separator |
| 111 1111 | 177 | 127 | 7F | DEL | ˆ? | Delete |

Table 1.4
The ASCII printable characters
| Binary | Oct | Dec | Hex | Glyph |
| 010 0000 | 040 | 32 | 20 | (space) |
| 010 0001 | 041 | 33 | 21 | ! |
| 010 0010 | 042 | 34 | 22 | " |
| 010 0011 | 043 | 35 | 23 | # |
| 010 0100 | 044 | 36 | 24 | $ |
| 010 0101 | 045 | 37 | 25 | % |
| 010 0110 | 046 | 38 | 26 | & |
| 010 0111 | 047 | 39 | 27 | ' |
| 010 1000 | 050 | 40 | 28 | ( |
| 010 1001 | 051 | 41 | 29 | ) |
| 010 1010 | 052 | 42 | 2A | * |
| 010 1011 | 053 | 43 | 2B | + |
| 010 1100 | 054 | 44 | 2C | , |
| 010 1101 | 055 | 45 | 2D | - |
| 010 1110 | 056 | 46 | 2E | . |
| 010 1111 | 057 | 47 | 2F | / |
| 011 0000 | 060 | 48 | 30 | 0 |
| 011 0001 | 061 | 49 | 31 | 1 |
| 011 0010 | 062 | 50 | 32 | 2 |
| 011 0011 | 063 | 51 | 33 | 3 |
| 011 0100 | 064 | 52 | 34 | 4 |
| 011 0101 | 065 | 53 | 35 | 5 |
| 011 0110 | 066 | 54 | 36 | 6 |
| 011 0111 | 067 | 55 | 37 | 7 |
| 011 1000 | 070 | 56 | 38 | 8 |
| 011 1001 | 071 | 57 | 39 | 9 |
| 011 1010 | 072 | 58 | 3A | : |
| 011 1011 | 073 | 59 | 3B | ; |
| 011 1100 | 074 | 60 | 3C | < |
| 011 1101 | 075 | 61 | 3D | = |
| 011 1110 | 076 | 62 | 3E | > |
| 011 1111 | 077 | 63 | 3F | ? |
| 100 0000 | 100 | 64 | 40 | @ |
| 100 0001 | 101 | 65 | 41 | A |
| 100 0010 | 102 | 66 | 42 | B |
| 100 0011 | 103 | 67 | 43 | C |
| 100 0100 | 104 | 68 | 44 | D |
| 100 0101 | 105 | 69 | 45 | E |
| 100 0110 | 106 | 70 | 46 | F |
| 100 0111 | 107 | 71 | 47 | G |
| 100 1000 | 110 | 72 | 48 | H |
| 100 1001 | 111 | 73 | 49 | I |
| 100 1010 | 112 | 74 | 4A | J |
| 100 1011 | 113 | 75 | 4B | K |
| 100 1100 | 114 | 76 | 4C | L |
| 100 1101 | 115 | 77 | 4D | M |
| 100 1110 | 116 | 78 | 4E | N |
| 100 1111 | 117 | 79 | 4F | O |
| 101 0000 | 120 | 80 | 50 | P |
| 101 0001 | 121 | 81 | 51 | Q |
| 101 0010 | 122 | 82 | 52 | R |
| 101 0011 | 123 | 83 | 53 | S |
| 101 0100 | 124 | 84 | 54 | T |
| 101 0101 | 125 | 85 | 55 | U |
| 101 0110 | 126 | 86 | 56 | V |
| 101 0111 | 127 | 87 | 57 | W |
| 101 1000 | 130 | 88 | 58 | X |
| 101 1001 | 131 | 89 | 59 | Y |
| 101 1010 | 132 | 90 | 5A | Z |
| 101 1011 | 133 | 91 | 5B | [ |
| 101 1100 | 134 | 92 | 5C | \ |
| 101 1101 | 135 | 93 | 5D | ] |
| 101 1110 | 136 | 94 | 5E | ^ |
| 101 1111 | 137 | 95 | 5F | _ |
| 110 0000 | 140 | 96 | 60 | ` |
| 110 0001 | 141 | 97 | 61 | a |
| 110 0010 | 142 | 98 | 62 | b |
| 110 0011 | 143 | 99 | 63 | c |
| 110 0100 | 144 | 100 | 64 | d |
| 110 0101 | 145 | 101 | 65 | e |
| 110 0110 | 146 | 102 | 66 | f |
| 110 0111 | 147 | 103 | 67 | g |
| 110 1000 | 150 | 104 | 68 | h |
| 110 1001 | 151 | 105 | 69 | i |
| 110 1010 | 152 | 106 | 6A | j |
| 110 1011 | 153 | 107 | 6B | k |
| 110 1100 | 154 | 108 | 6C | l |
| 110 1101 | 155 | 109 | 6D | m |
| 110 1110 | 156 | 110 | 6E | n |
| 110 1111 | 157 | 111 | 6F | o |
| 111 0000 | 160 | 112 | 70 | p |
| 111 0001 | 161 | 113 | 71 | q |
| 111 0010 | 162 | 114 | 72 | r |
| 111 0011 | 163 | 115 | 73 | s |
| 111 0100 | 164 | 116 | 74 | t |
| 111 0101 | 165 | 117 | 75 | u |
| 111 0110 | 166 | 118 | 76 | v |
| 111 0111 | 167 | 119 | 77 | w |
| 111 1000 | 170 | 120 | 78 | x |
| 111 1001 | 171 | 121 | 79 | y |
| 111 1010 | 172 | 122 | 7A | z |
| 111 1011 | 173 | 123 | 7B | { |
| 111 1100 | 174 | 124 | 7C | | |
| 111 1101 | 175 | 125 | 7D | } |
| 111 1110 | 176 | 126 | 7E | ~ |


The non-printing characters are used to provide hints or commands to the device that is receiving, displaying, or printing the data. The FF character, when sent to a printer, will cause the printer to eject the current page and begin a new one. The LF character causes the printer or terminal to end the current line and begin a new one. The CR character causes the terminal or printer to move to the beginning of the current line. Many text editing programs allow the user to enter these non-printing characters by using the control key on the keyboard. For instance, to enter the BEL character, the user would hold the control key down and press the G key. This character, when sent to a character display terminal, will cause it to emit a beep. Many of the other control characters can be used to control specific features of the printer, display, or other device that the data is being sent to.
Suppose we wish to convert a string of characters, such as “Hello World” to an ASCII representation. We can use an 8-bit byte to store each character. Also, it is common practice to include an additional byte at the end of the string. This additional byte holds the ASCII NUL character, which indicates the end of the string. Such an arrangement is referred to as a null-terminated string.
To convert the string “Hello World” into a null-terminated string, we can build a table with each character on the left and its equivalent binary, octal, hexadecimal, or decimal value (as defined in the ASCII table) on the right. Table 1.5 shows the characters in “Hello World” and their equivalent binary representations, found by looking in Table 1.4. Since most modern computers use 8-bit bytes (or multiples thereof) as the basic storage unit, an extra zero bit is shown in the most significant bit position.
Table 1.5
Binary equivalents for each character in “Hello World”
| Character | Binary |
| H | 01001000 |
| e | 01100101 |
| l | 01101100 |
| l | 01101100 |
| o | 01101111 |
| (space) | 00100000 |
| W | 01010111 |
| o | 01101111 |
| r | 01110010 |
| l | 01101100 |
| d | 01100100 |
| NUL | 00000000 |
Reading the Binary column from top to bottom results in the following sequence of bytes: 01001000 01100101 01101100 01101100 01101111 00100000 01010111 01101111 01110010 01101100 01100100 00000000. To convert the same string to a hexadecimal representation, we can use the shortcut method that was introduced previously to convert each 4-bit nibble into its hexadecimal equivalent, or read the hexadecimal value from the ASCII table. Table 1.6 shows the result of extending Table 1.5 to include hexadecimal and decimal equivalents for each character. The string can now be converted to hexadecimal or decimal simply by reading the correct column in the table. So “Hello World” expressed as a null-terminated string in hexadecimal is “48 65 6C 6C 6F 20 57 6F 72 6C 64 00” and in decimal it is “72 101 108 108 111 32 87 111 114 108 100 0”.
Table 1.6
Binary, hexadecimal, and decimal equivalents for each character in “Hello World”
| Character | Binary | Hexadecimal | Decimal |
| H | 01001000 | 48 | 72 |
| e | 01100101 | 65 | 101 |
| l | 01101100 | 6C | 108 |
| l | 01101100 | 6C | 108 |
| o | 01101111 | 6F | 111 |
| (space) | 00100000 | 20 | 32 |
| W | 01010111 | 57 | 87 |
| o | 01101111 | 6F | 111 |
| r | 01110010 | 72 | 114 |
| l | 01101100 | 6C | 108 |
| d | 01100100 | 64 | 100 |
| NUL | 00000000 | 00 | 0 |

It is sometimes necessary to convert a string of bytes in hexadecimal into ASCII characters. This is accomplished simply by building a table with the hexadecimal value of each byte in the left column, then looking in the ASCII table for each value and entering the equivalent character representation in the right column. Table 1.7 shows how the ASCII table is used to interpret the hexadecimal string “466162756C6F75732100” as an ASCII string.
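Both directions of the conversion can be sketched in Python, which produces the same byte values that the tables list (variable names are illustrative):

```python
# Round-tripping between text and ASCII byte values, as in Tables 1.5-1.7.
# A null-terminated string appends a NUL (value 0) byte at the end.
text = "Hello World"
codes = [ord(c) for c in text] + [0]  # decimal values plus the NUL terminator
hex_codes = " ".join(format(b, "02X") for b in codes)
print(hex_codes)  # 48 65 6C 6C 6F 20 57 6F 72 6C 64 00

# Interpreting a string of hexadecimal bytes as ASCII text (as in Table 1.7):
hex_string = "466162756C6F75732100"
chars = bytes.fromhex(hex_string).split(b"\x00")[0].decode("ascii")
print(chars)  # Fabulous!
```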
ASCII was developed to encode all of the most commonly used characters in North American English text. The encoding uses only 128 of the 256 codes that are available in an 8-bit byte. ASCII does not include symbols frequently used in other countries, such as the British pound symbol (£) or accented characters (ü). However, the International Standards Organization (ISO) has created several extensions to ASCII to enable the representation of characters from a wider variety of languages.
The ISO has defined a set of related standards known collectively as ISO 8859. ISO 8859 is an 8-bit extension to ASCII which includes the 128 ASCII characters along with an additional 128 characters, such as the British Pound symbol and the American cent symbol. Several variations of the ISO 8859 standard exist for different language families. Table 1.8 provides a brief description of the various ISO standards.
Table 1.8
Variations of the ISO 8859 standard
| Name | Alias | Languages |
| ISO8859-1 | Latin-1 | Western European languages |
| ISO8859-2 | Latin-2 | Non-Cyrillic Central and Eastern European languages |
| ISO8859-3 | Latin-3 | Southern European languages and Esperanto |
| ISO8859-4 | Latin-4 | Northern European and Baltic languages |
| ISO8859-5 | Latin/Cyrillic | Slavic languages that use a Cyrillic alphabet |
| ISO8859-6 | Latin/Arabic | Common Arabic language characters |
| ISO8859-7 | Latin/Greek | Modern Greek language |
| ISO8859-8 | Latin/Hebrew | Modern Hebrew language |
| ISO8859-9 | Latin-5 | Turkish |
| ISO8859-10 | Latin-6 | Nordic languages |
| ISO8859-11 | Latin/Thai | Thai language |
| ISO8859-12 | Latin/Devanagari | Never completed. Abandoned in 1997 |
| ISO8859-13 | Latin-7 | Some Baltic languages not covered by Latin-4 or Latin-6 |
| ISO8859-14 | Latin-8 | Celtic languages |
| ISO8859-15 | Latin-9 | Update to Latin-1 that replaces some characters. Most notably, it includes the euro symbol (€), which did not exist when Latin-1 was created |
| ISO8859-16 | Latin-10 | Covers several languages not covered by Latin-9 and includes the euro symbol (€) |
Although the ISO extensions helped to standardize text encodings for several languages that were not covered by ASCII, there were still some issues. The first issue is that the display and input devices must be configured for the correct encoding, and displaying or printing documents with multiple encodings requires some mechanism for changing the encoding on-the-fly. Another issue has to do with the lexicographical ordering of characters. Although two languages may share a character, that character may appear in a different place in the alphabets of the two languages. This leads to issues when programmers need to sort strings into lexicographical order. The ISO extensions help to unify character encodings across multiple languages, but do not solve all of the issues involved in defining a universal character set.
In the late 1980s, there was growing interest in developing a universal character encoding for all languages. People from several computer companies worked together and, by 1990, had developed a draft standard for Unicode. In 1991, the Unicode Consortium was formed and charged with guiding and controlling the development of Unicode. The Unicode Consortium has worked closely with the ISO to define, extend, and maintain the international standard for a Universal Character Set (UCS). This standard is known as the ISO/IEC 10646 standard. The ISO/IEC 10646 standard defines the mapping of code points (numbers) to glyphs (characters), but does not specify character collation or other language-dependent properties. UCS code points are commonly written in the form U+XXXX, where XXXX is the numerical code point in hexadecimal. For example, the code point for the ASCII DEL character would be written as U+007F. Unicode extends the ISO/IEC standard and specifies language-specific features.
Originally, Unicode was designed as a 16-bit encoding. It was not fully backward-compatible with ASCII, and could encode only 65,536 code points. Eventually, the Unicode character set grew to encompass 1,112,064 code points, which requires 21 bits per character for a straightforward binary encoding. By early 1992, it was clear that some clever and efficient method for encoding character data was needed.
UTF-8 (UCS Transformation Format-8-bit) was proposed and accepted as a standard in 1993. UTF-8 is a variable-width encoding that can represent every character in the Unicode character set using between one and four bytes. It was designed to be backward compatible with ASCII and to avoid the major issues of previous encodings. Code points in the Unicode character set with lower numerical values tend to occur more frequently than code points with higher numerical values. UTF-8 encodes frequently occurring code points with fewer bytes than those which occur less frequently. For example, the first 128 characters of the UTF-8 encoding are exactly the same as the ASCII characters, requiring only 7 bits to encode each ASCII character. Thus any valid ASCII text is also valid UTF-8 text. UTF-8 is now the most common character encoding for the World Wide Web, and is the recommended encoding for email messages.
In November 2003, UTF-8 was restricted by RFC 3629 to end at code point U+10FFFF. This allows UTF-8 to encode 1,114,112 code points, which is slightly more than the 1,112,064 code points defined in the ISO/IEC 10646 standard. Table 1.9 shows how ISO/IEC 10646 code points are mapped to a variable-length encoding in UTF-8. Note that the encoding allows each byte in a stream of bytes to be placed in one of the following three distinct categories:
Table 1.9
UTF-8 encoding of the ISO/IEC 10646 code points
| UCS Bits | First Code Point | Last Code Point | Bytes | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
| 7 | U+0000 | U+007F | 1 | 0xxxxxxx | | | |
| 11 | U+0080 | U+07FF | 2 | 110xxxxx | 10xxxxxx | | |
| 16 | U+0800 | U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 21 | U+10000 | U+10FFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |

1. If the most significant bit of a byte is zero, then it is a single-byte character, and is completely ASCII-compatible.
2. If the two most significant bits in a byte are set to one, then the byte is the beginning of a multi-byte character.
3. If the most significant bit is set to one, and the second most significant bit is set to zero, then the byte is part of a multi-byte character, but is not the first byte in that sequence.
The UTF-8 encoding of the UCS characters has several important features:
Backwards compatible with ASCII: This allows the vast number of existing ASCII documents to be interpreted as UTF-8 documents without any conversion.
Self-synchronization: Because of the way code points are assigned, it is possible to find the beginning of each character by looking only at the top two bits of each byte. This can have important performance implications when performing searches in text.
Encoding of code sequence length: The number of bytes in the sequence is indicated by the pattern of bits in the first byte of the sequence. Thus, the beginning of the next character can be found quickly. This feature can also have important performance implications when performing searches in text.
Efficient code structure: UTF-8 efficiently encodes the UCS code points. The high-order bits of the code point go in the lead byte. Lower-order bits are placed in continuation bytes. The number of bytes in the encoding is the minimum required to hold all the significant bits of the code point.
Easily extended to include new languages: This feature will be greatly appreciated when we contact intelligent species from other star systems.
With UTF-8 encoding, the first 128 characters of the UCS are each encoded in a single byte. The next 1,920 characters require two bytes to encode. The two-byte encoding covers almost all Latin alphabets, and also Arabic, Armenian, Cyrillic, Coptic, Greek, Hebrew, Syriac and Tāna alphabets. It also includes combining diacritical marks, which are used in combination with another character, such as á, ñ, and ö. Most of the Chinese, Japanese, and Korean (CJK) characters are encoded using three bytes. Four bytes are needed for the less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).
Consider the UTF-8 encoding for the British Pound symbol (£), which is UCS code point U+00A3. Since the code point is greater than 7F₁₆, but less than 800₁₆, it will require two bytes to encode. The encoding will be 110xxxxx 10xxxxxx, where the x characters are replaced with the 11 least-significant bits of the code point, which are 00010100011. Thus, the character £ is encoded in UTF-8 as 11000010 10100011 in binary, or C2 A3 in hexadecimal.
The UCS code point for the Euro symbol (€) is U+20AC. Since the code point is between 800₁₆ and FFFF₁₆, it will require three bytes to encode in UTF-8. The three-byte encoding is 1110xxxx 10xxxxxx 10xxxxxx, where the x characters are replaced with the 16 least-significant bits of the code point. In this case the code point in binary is 0010000010101100. Therefore, the UTF-8 encoding for € is 11100010 10000010 10101100 in binary, or E2 82 AC in hexadecimal.
In summary, there are three components to modern language support. The ISO/IEC 10646 defines a mapping from code points (numbers) to glyphs (characters). UTF-8 defines an efficient variable-length encoding for code points (text data) in the ISO/IEC 10646 standard. Unicode adds language specific properties to the ISO/IEC 10646 character set. Together, these three elements currently provide support for textual data in almost every human written language, and they continue to be extended and refined.
Computer memory consists of a number of storage locations, or cells, each of which has a unique numeric address. Addresses are usually written in hexadecimal. Each storage location can contain a fixed number of binary digits. The most common size is one byte. Most computers group bytes together into words. A computer CPU that is capable of accessing a single byte of memory is said to have byte addressable memory. Some CPUs are only capable of accessing memory in word-sized groups. They are said to have word addressable memory.
Fig. 1.6A shows a section of memory containing some data. Each byte has a unique address that is used when data is transferred to or from that memory cell. Most processors can also move data in word-sized chunks. On a 32-bit system, four bytes are grouped together to form a word. There are two ways that this grouping can be done. Systems that store the most significant byte of a word in the smallest address, and the least significant byte in the largest address, are said to be big-endian. The big-endian interpretation of a region of memory is shown in Fig. 1.6B. As shown in Fig. 1.6C, little-endian systems store the least significant byte in the lowest address and the most significant byte in the highest address. Some processors, such as the ARM, can be configured as either little-endian or big-endian. The Linux operating system, by default, configures the ARM processor to run in little-endian mode.

The memory layout for a typical program is shown in Fig. 1.7. The program is divided into four major memory regions, or sections. The programmer specifies the contents of the Text and Data sections. The Stack and Heap segments are defined when the program is loaded for execution. The Stack and Heap may grow and shrink as the program executes, while the Text and Data segments are set to fixed sizes by the compiler, linker, and loader. The Text section contains the executable instructions, while the Data section contains constants and statically allocated variables. The sizes of the Text and Data segments depend on how large the program is, and how much static data storage has been declared by the programmer. The heap contains variables that are allocated dynamically, and the stack is used to store parameters for function calls, return addresses, and local (automatic) variables.

In a high-level language, storage space for a variable can be allocated in one of three ways: statically, dynamically, and automatically. Statically allocated variables are allocated from the .data section. The storage space is reserved, and usually initialized, when the program is loaded and begins execution. The address of a statically allocated variable is fixed at the time the program begins running, and cannot be changed. Automatically allocated variables, often referred to as local variables, are stored on the stack. The stack pointer is adjusted down to make space for the newly allocated variable. The address of an automatic variable is always computed as an offset from the stack pointer. Dynamic variables are allocated from the heap, using malloc, new, or a language-dependent equivalent. The address of a dynamic variable is always stored in another variable, known as a pointer, which may be an automatic or static variable, or even another dynamic variable. The four major sections of program memory correspond to executable code, statically allocated variables, dynamically allocated variables, and automatically allocated variables.
There are several reasons for Computer Scientists and Computer Engineers to learn at least one assembly language. There are programming tasks that can only be performed using assembly language, and some tasks can be written to run much more efficiently and/or quickly if written in assembly language. Programmers with assembly language experience tend to write better code even when using a high-level language, and are usually better at finding and fixing bugs.
Although it is possible to construct a computer capable of performing arithmetic in any base, it is much cheaper to build one that works in base two. It is relatively easy to build an electrical circuit with two states, using two discrete voltage levels, but much more difficult to build a stable circuit with 10 discrete voltage levels. Therefore, modern computers work in base two.
Computer data can be viewed as simple bit strings. The programmer is responsible for supplying interpretations to give meaning to those bit strings. A set of bits can be interpreted as a number, a character, or anything that the programmer chooses. There are standard methods for encoding and interpreting characters and numbers. Fig. 1.4 shows some common methods for encoding integers. The most common encodings for characters are UTF-8 and ASCII.
Computer memory can be viewed as a sequence of bytes. Each byte has a unique address. A running program has four regions of memory. One region holds the executable code. The other three regions hold different types of variables.
1.1 What is the two’s complement of 11011101?
1.2 Perform the base conversions to fill in the blank spaces in the following table:
1.3 What is the 8-bit ASCII binary representation for the following characters?
(b) “a”
(c) “!”
1.4 What is \ minus ! given that \ and ! are ASCII characters? Give your answer in binary.
(a) Convert the string “Super!” to its ASCII representation. Show your result as a sequence of hexadecimal values.
(b) Convert the hexadecimal sequence into a sequence of values in base four.
1.6 Suppose that the string “This is a nice day” is stored beginning at address 4B3269AC₁₆. What are the contents of the byte at address 4B3269B1₁₆ in hexadecimal?
(a) Convert 101101₂ to base ten.
(b) Convert 1023₁₀ to base nine.
(c) Convert 1023₁₀ to base two.
(d) Convert 301₁₀ to base 16.
(e) Convert 301₁₀ to base 2.
(f) Represent 301₁₀ as a null-terminated ASCII string (write your answer in hexadecimal).
(g) Convert 3420₅ to base ten.
(h) Convert 2314₅ to base nine.
(i) Convert 116₇ to base three.
(j) Convert 1294₁₁ to base 5.
1.8 Given the following binary string:
01001001 01110011 01101110 00100111 01110100 00100000 01000001 01110011 01110011 01100101 01101101 01100010 01101100 01111001 00100000 01000110 01110101 01101110 00111111 00000000
(a) Convert it to a hexadecimal string.
(b) Convert the first four bytes to a string of base ten numbers.
(c) Convert the first (little-endian) halfword to base ten.
(d) Convert the first (big-endian) halfword to base ten.
(e) If this string of bytes were sent to an ASCII printer or terminal, what would be printed?
1.9 The number 1,234,567 is stored as a 32-bit word starting at address F0439000₁₆. Show the address and contents of each byte of the 32-bit word on a
(a) little-endian system.
(b) big-endian system.
1.10 The ISO/IEC 10646 standard defines 1,112,064 code points (glyphs). Each code point could be encoded using 24 bits, or three bytes. The UTF-8 encoding uses up to four bytes to encode a code point. Give three reasons why UTF-8 is preferred over a simple 3-byte per code point encoding.
1.11 UTF-8 is often referred to as Unicode. Why is this not correct?
1.12 Skilled assembly programmers can convert small numbers between binary, hexadecimal, and decimal in their heads. Without referring to any tables or using a calculator or pencil, fill in the blanks in the following table:
1.13 What are the differences between a CPU register and a memory location?
This chapter begins with a high-level description of assembly language and the assembler. It then explains the basic elements of assembly language syntax and gives some examples. Next, it goes into more depth about how the assembler converts assembly language files into object files, which are then linked with other object files to create an executable file. Finally, it explains the most commonly used directives for the GNU assembler and gives some examples to help relate the assembly code to equivalent C code.
Compiler; Assembler; Linker; Labels; Comments; Directives; Instructions; Sections; Symbols
All modern computers consist of three main components: the central processing unit (CPU), memory, and devices. It can be argued that the major factor that distinguishes one computer from another is the CPU architecture. The architecture determines the set of instructions that can be performed by the CPU. The human-readable language which is closest to the CPU architecture is assembly language.
When a new processor architecture is developed, its creators also define an assembly language for the new architecture. In most cases, a precise assembly language syntax is defined and an assembler is created by the processor developers. Because of this, there is no single syntax for assembly language, although most assembly languages are similar in many ways and have certain elements in common.
The GNU assembler (GAS) is a highly portable re-configurable assembler. GAS uses a simple, general syntax that works for a wide variety of architectures. Although the syntax used by GAS for the ARM processor is slightly different from the syntax defined by the developers of the ARM processor, it provides the same capabilities.
An assembly program consists of four basic elements: assembler directives, labels, assembly instructions, and comments. Assembler directives allow the programmer to reserve memory for the storage of variables, control which program section is being used, define macros, include other files, and perform other operations that control the conversion of assembly instructions into machine code. The assembly instructions are given as mnemonics, or short character strings that are easier for human brains to remember than sequences of binary, octal, or hexadecimal digits. Each assembly instruction may have an optional label, and most assembly instructions require the programmer to specify one or more operands.
Most assembly language programs are written in lines of 80 characters organized into four columns. The first column is for optional labels. The second column is for assembly instructions or assembler directives. The third column is for specifying operands, and the fourth column is for comments. Traditionally, the first two columns are 8 characters wide, the third column is 16 characters wide, and the last column is 48 characters wide. However, most modern assemblers (including GAS) do not require fixed column widths. Listing 2.1 shows a basic “Hello World” program written in GNU ARM Assembly to run under Linux. For comparison, Listing 2.2 shows an equivalent program written in C. The assembly language version of the program is significantly longer than the C version, and will only work on an ARM processor. The C version is at a higher level of abstraction, and can be compiled to run on any system that has a C compiler. Thus, C is referred to as a high-level language, and assembly is a low-level language.


Most modern assemblers are called two-pass assemblers because they read the input file twice. On the first pass, the assembler keeps track of the location of each piece of data and each instruction, and assigns an address or numerical value to each label and symbol in the input file. The main goal of the first pass is to build a symbol table, which maps each label or symbol to a numerical value.
On the second pass, the assembler converts the assembly instructions and data declarations into binary, using the symbol table to supply numerical values whenever they are needed. In Listing 2.1, there are two labels: main and str. During assembly, those labels are assigned the value of the address counter at the point where they appear. Labels can be used anywhere in the program to refer to the address of data, functions, or blocks of code. In GNU assembly syntax, labels always end with a colon (:) character.
There are two basic comment styles: multi-line and single-line. Multi-line comments start with /* and everything is ignored until a matching */ is found. These comments are exactly the same as multi-line comments in C and C++. In ARM assembly, single-line comments begin with an @ character, and all remaining characters on the line are ignored. Listing 2.1 shows both types of comment. In addition, if the name of the file ends in a capital .S, then single-line comments may also begin with //; if the file name does not end with a capital .S, the // syntax is not allowed.
Directives are used mainly to define symbols, allocate storage, and control the behavior of the assembler, allowing the programmer to control how the assembler does its job. The GNU assembler has many directives, but assembly programmers typically need to know only a few of them. All assembler directives begin with a period “.” which is followed by a sequence of letters, usually in lower case. Listing 2.1 uses the .data, .asciz, .text, and .globl directives. The most commonly used directives are discussed later in this chapter. There are many other directives available in the GNU Assembler which are not covered here. Complete documentation is available online as part of the GNU Binutils package.
Assembly instructions are the program statements that will be executed on the CPU. Most instructions cause the CPU to perform one low-level operation. In most assembly languages, operations can be divided into a few major types. Some instructions move data from one location to another. Others perform addition, subtraction, and other computational operations. Another class of instructions is used to perform comparisons and control which part of the program is to be executed next. Chapters 3 and 4 explain most of the assembly instructions that are available on the ARM processor.
Listing 2.3 shows how the GNU assembler will assemble the “Hello World” program from Listing 2.1. The assembler converts the string on input line 2 into the binary representation of the string. The results are shown in hexadecimal in the Code column of the listing. The first byte of the string is stored at address zero in the .data section of the program, as shown by the 0000 in the Addr column on line 2.

On line 4, the assembler switches to the .text section of the program and begins converting instructions into binary. The first instruction, on line 9, is converted into its 4-byte machine code, 00402DE9₁₆, and stored at location 0000 in the .text section of the program, as shown in the Code and Addr columns on line 6.
Next, the assembler converts the ldr instruction on line 10 into the four-byte machine instruction 0C009FE5₁₆ and stores it at address 0004. It repeats this process with each remaining instruction until the end of the program. The assembler writes the resulting data into a specially formatted file, called an object file. Note that the assembler was unable to locate the printf function. The linker will take care of that. The object file created by the assembler, hello.o, contains the data in the Code column of Listing 2.3, along with information to help the linker to link (or “patch”) the instruction on line 11 so that printf is called correctly.
After creating the object file, the next step in creating an executable program would be to invoke the linker and request that it link hello.o with the C standard library. The linker will generate the final executable file, containing the code assembled from hello.S, along with the printf function and other start-up code from the C standard library. The GNU C compiler is capable of automatically invoking the assembler for files that end in .s or .S, and can also be used to invoke the linker. For example, if Listing 2.1 is stored in a file named hello.S in the current directory, then the command
will run the GNU C compiler, telling it to create an executable program file named hello, and to use hello.S as the source file for the program. The C compiler will notice the .S extension, and invoke the assembler to create an object file which is stored in a temporary file, possibly named hello.o. Then the C compiler will invoke the linker to link hello.o with the C standard library, which provides the printf function and some start-up code which calls the main function. The linker will create an executable file named hello. When the linker has finished, the C compiler will remove the temporary object file.
Each processor architecture has its own assembly language, created by the designers of the architecture. Although there are many similarities between assembly languages, the designers may choose different names for various directives. The GNU assembler supports a relatively large set of directives, some of which have more than one name. This is because it is designed to handle assembling code for many different processors without drastically changing the assembly language designed by the processor manufacturers. We will now cover some of the most commonly used directives for the GNU assembler.
The instructions and data that make up a program are stored in different sections of the program file. There are several standard sections that the programmer can choose to put code and data in. Sections can also be further divided into numbered subsections. Each section has its own address counter, which is used to keep track of the location of bytes within that section. When a label is encountered, it is assigned the value of the current address counter for the currently active section.
Selecting a section and subsection is done by using the appropriate assembly directive. Once a section has been selected, all of the instructions and/or data will go into that section until another section is selected. The most important directives for selecting a section are:
Instructs the assembler to append the following instructions or data to the data subsection numbered subsection. If the subsection number is omitted, it defaults to zero. This section is normally used for global variables and constants which have labels.
Tells the assembler to append the following statements to the end of the text subsection numbered subsection. If the subsection number is omitted, subsection number zero is used. This section is normally used for executable instructions, but may also contain constant data.
The bss (short for Block Started by Symbol) section is used for defining data storage areas that should be initialized to zero at the beginning of program execution. The .bss directive tells the assembler to append the following statements to the end of the bss subsection numbered subsection. If the subsection number is omitted, subsection number zero is used. This section is normally used for global variables which need to be initialized to zero. Regardless of what is placed into the section at compile-time, all bytes will be set to zero when the program begins executing. This section does not actually consume any space in the object or executable file. It is really just a request for the loader to reserve some space when the program is loaded into memory.
In addition to the three common sections, the programmer can create other sections using this directive. However in order for custom sections to be linked into a program, the linker must be made aware of them. Controlling the linker is covered in Section 14.4.3.
There are several directives that allow the programmer to allocate and initialize static storage space for variables and constants. The assembler supports bytes, integer types, floating point types, and strings. These directives are used to allocate a fixed amount of space in memory and optionally initialize the memory. Some of these directives allow the memory to be initialized using an expression. An expression can be a simple integer, or a C-style expression. The directives for allocating storage are as follows:
.byte expects zero or more expressions, separated by commas. Each expression is assembled into the next byte. If no expressions are given, then the address counter is not advanced and no bytes are reserved.
.hword expressions .short expressions
For the ARM processor, these two directives do exactly the same thing. They expect zero or more expressions, separated by commas, and emit a 16-bit number for each expression. If no expressions are given, then the address counter is not advanced and no bytes are reserved.
.word expressions .long expressions
For the ARM processor, these two directives do exactly the same thing. They expect zero or more expressions, separated by commas. They will emit four bytes for each expression given. If no expressions are given, then the address counter is not advanced and no bytes are reserved.
The .ascii directive expects zero or more string literals, each enclosed in quotation marks and separated by commas. It assembles each string (with no trailing ASCII NULL character) into consecutive addresses.
.asciz ” string ” .string ” string ”
The .asciz directive is similar to the .ascii directive, but each string is followed by an ASCII NULL character (zero). The “z” in .asciz stands for zero. .string is just another name for .asciz.
.float flonums .single flonums
This directive assembles zero or more floating point numbers, separated by commas. On the ARM, they are 4-byte IEEE standard single precision numbers. .float and .single are synonymous.
The .double directive expects zero or more floating point numbers, separated by commas. On the ARM, they are stored as 8-byte IEEE standard double precision numbers.
Fig. 2.1A shows how these directives are used to declare variables and constants. Fig. 2.1B shows the equivalent statements for creating global variables in C or C++. Note that in both cases, the variables created will be visible anywhere within the file that they are declared, but not visible in other files which are linked into the program.

In C, the declaration of an array can be performed by leaving out the number of elements and specifying an initializer, as shown in the last three lines of Fig. 2.1B. In assembly, the equivalent is accomplished by providing a label, a type, and a list of values, as shown in the last three lines of Fig. 2.1A. The syntax is different, but the result is precisely the same.
Listing 2.4 shows how the assembler assigns addresses to these labels. The second column of the listing shows the address (in hexadecimal) that is assigned to each label. The variable i is assigned the first address. Since it is a word variable, the address counter is incremented by four bytes and the next address is assigned to the variable j. The address counter is incremented again, and fmt is assigned the address 0008. The fmt variable consumes seven bytes, so the ch variable gets address 000f. Finally, the array of words named ary begins at address 0012. Note that 12₁₆ = 18₁₀ is not evenly divisible by four, which means that the word variables in ary are not aligned on word boundaries.

On the ARM CPU, data can be moved to and from memory one byte at a time, two bytes at a time (half-word), or four bytes at a time (word). Moving a word between the CPU and memory takes significantly more time if the address of the word is not aligned on a four-byte boundary (one where the least significant two bits are zero). Similarly, moving a half-word between the CPU and memory takes significantly more time if the address of the half-word is not aligned on a two-byte boundary (one where the least significant bit is zero). Therefore, when declaring storage, it is important that words and half-words are stored on appropriate boundaries. The following directives allow the programmer to insert as much space as necessary to align the next item on any boundary desired.
.align abs-expr, abs-expr, abs-expr
Pad the location counter (in the current subsection) to a particular storage boundary. For the ARM processor, the first expression specifies the number of low-order zero bits the location counter must have after advancement. The second expression gives the fill value to be stored in the padding bytes. It (and the comma) may be omitted. If it is omitted, then the fill value is assumed to be zero. The third expression is also optional. If it is present, it is the maximum number of bytes that should be skipped by this alignment directive.
.balign [lw] abs-expr, abs-expr, abs-expr
These directives adjust the location counter to a particular storage boundary. The first expression is the byte-multiple for the alignment request. For example, .balign 16 will insert fill bytes until the location counter is an even multiple of 16. If the location counter is already a multiple of 16, then no fill bytes will be created. The second expression gives the fill value to be stored in the fill bytes. It (and the comma) may be omitted. If it is omitted, then the fill value is assumed to be zero. The third expression is also optional. If it is present, it is the maximum number of bytes that should be skipped by this alignment directive.
The .balignw and .balignl directives are variants of the .balign directive. The .balignw directive treats the fill pattern as a 2-byte word value, and .balignl treats the fill pattern as a 4-byte long word value. For example, “.balignw 4,0x368d” will align to a multiple of four bytes. If it skips two bytes, they will be filled in with the value 0x368d (the exact placement of the bytes depends upon the endianness of the processor).
.skip size, fill .space size, fill
Sometimes it is desirable to allocate a large area of memory and initialize it all to the same value. This can be accomplished by using these directives. These directives emit size bytes, each of value fill. Both size and fill are absolute expressions. If the comma and fill are omitted, fill is assumed to be zero. For the ARM processor, the .space and .skip directives are equivalent. This directive is very useful for declaring large arrays in the .bss section.
Listing 2.5 shows how the code in Listing 2.4 can be improved by adding an alignment directive at line 6. The directive causes the assembler to emit two zero bytes between the end of the ch variable and the beginning of the ary variable. These extra “padding” bytes cause the following word data to be word aligned, thereby improving performance when the word data is accessed. It is a good practice to always put an alignment directive after declaring character or half-word data.

The assembler provides support for setting and manipulating symbols that can then be used in other places within the program. The labels that can be assigned to assembly statements and directives are one type of symbol. The programmer can also declare other symbols and use them throughout the program. Such symbols may not have an actual storage location in memory, but they are included in the assembler’s symbol table, and can be used anywhere that their value is required. The most common use for defined symbols is to allow numerical constants to be declared in one place and easily changed. The .equ directive allows the programmer to use a label instead of a number throughout the program. This contributes to readability, and has the benefit that the constant value can then be easily changed every place that it is used, just by changing the definition of the symbol. The most important directives related to symbols are:
.equ symbol, expression .set symbol, expression
This directive sets the value of symbol to expression. It is similar to the C language #define directive.
The .equiv directive is like .equ and .set, except that the assembler will signal an error if the symbol is already defined.
.global symbol
.globl symbol
This directive makes the symbol visible to the linker. If symbol is defined within a file, and this directive is used to make it global, then it will be available to any file that is linked with the one containing the symbol. Without this directive, symbols are visible only within the file where they are defined.
.comm symbol, length
This directive declares symbol to be a common symbol, meaning that if it is defined in more than one file, then all instances should be merged into a single symbol. If the symbol is not defined anywhere, then the linker will allocate length bytes of uninitialized memory. If there are multiple definitions for symbol, and they have different sizes, the linker will merge them into a single instance using the largest size defined.
Listing 2.6 shows how the .equ directive can be used to create a symbol holding the number of elements in an array. The symbol arysize is defined as the value of the current address counter (denoted by the .) minus the value of the ary symbol, divided by four (each word in the array is four bytes). The listing shows all of the symbols defined in this program segment. Note that the four variables are shown to be in the data segment, and the arysize symbol is marked as an “absolute” symbol, which simply means that it is a number and not an address. The programmer can now use the symbol arysize to control looping when accessing the array data. If the size of the array is changed by adding or removing constant values, the value of arysize will change automatically, and the programmer will not have to search through the code to change the original value, 5, to some other value in every place it is used.
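A simplified sketch in the spirit of Listing 2.6 (this is a reconstruction, not the book's exact listing, and shows only the array variable):

```asm
        .data
ary:    .word   10, 20, 30, 40, 50      @ an array of five words
        .equ    arysize, (. - ary) / 4  @ (current address - start of ary) / 4
```

After this, `arysize` is an absolute symbol with the value 5, and it changes automatically if words are added to or removed from the array.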

Sometimes it is desirable to skip assembly of portions of a file. The assembler provides some directives to allow conditional assembly. One use for these directives is to optionally assemble code to aid in debugging.
.if marks the beginning of a section of code which is only considered part of the source program being assembled if the argument (which must be an absolute expression) is non-zero. The end of the conditional section of code must be marked by the .endif directive. Optionally, code may be included for the alternative condition by using the .else directive.
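For example, debugging code can be assembled only when a symbol is non-zero (a sketch; the DEBUG symbol, debugmsg label, and call to printf are illustrative):

```asm
        .equ    DEBUG, 1        @ set to 0 to remove the debugging code

        .if DEBUG
        ldr     r0, =debugmsg   @ assembled only when DEBUG is non-zero
        bl      printf          @ print a debugging message
        .endif
```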
.ifdef symbol
Assembles the following section of code if the specified symbol has been defined.
.ifndef symbol
Assembles the following section of code if the specified symbol has not been defined.
.else
Assembles the following section of code only if the condition for the preceding .if or .ifdef was false.
.endif
Marks the end of a block of code that is only assembled conditionally.
.include "file"
This directive provides a way to include supporting files at specified points in the source program. The code from the included file is assembled as if it followed the point of the .include directive. When the end of the included file is reached, assembly of the original file continues. The search paths used can be controlled with the ‘-I’ command line parameter. Quotation marks are required around file. This assembler directive is similar to including header files in C and C++ using the #include compiler directive.
The directives .macro and .endm allow the programmer to define macros that the assembler expands to generate assembly code. The GNU assembler supports simple macros. Some other assemblers have much more powerful macro capabilities.
.macro macname
.macro macname macargs …
Begin the definition of a macro called macname. If the macro definition requires arguments, their names are specified after the macro name, separated by commas or spaces. The programmer can supply a default value for any macro argument by following the name with ‘=deflt’.
The following begins the definition of a macro called reserve_str, with two arguments. The first argument has a default value, but the second does not:

When a macro is called, the argument values can be specified either by position, or by keyword. For example, reserve_str 9,17 is equivalent to reserve_str p2=17,p1=9. After the definition is complete, the macro can be called either as
(with \p1 evaluating to x and \p2 evaluating to y), or as
(with \p1 evaluating as the default, in this case 0, and \p2 evaluating to y). Other examples of valid .macro statements are:


.endm
End the current macro definition.
.exitm
Exit early from the current macro definition. This is usually used only within a .if or .ifdef directive.
This is a pseudo-variable used by the assembler to maintain a count of how many macros it has executed. That number can be accessed with ‘\@’, but only within a macro definition.
The following definition specifies a macro SHIFT that will emit the instruction to shift a given register left by a specified number of bits. If the number of bits specified is negative, then it will emit the instruction to perform a right shift instead of a left shift.

After that definition, the following code:

will generate these instructions:

The meaning of these instructions will be covered in Chapters 3 and 4.
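The SHIFT macro described above might be written as follows (a reconstruction, not necessarily the book's exact listing):

```asm
        .macro  SHIFT   reg, bits
        .if     \bits < 0
        mov     \reg, \reg, lsr #(0 - (\bits))  @ negative count: shift right
        .else
        mov     \reg, \reg, lsl #\bits          @ non-negative count: shift left
        .endif
        .endm

        SHIFT   r0, 4           @ expands to: mov r0, r0, lsl #4
        SHIFT   r1, -2          @ expands to: mov r1, r1, lsr #2
```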
The following definition specifies a macro enum that puts a sequence of numbers into memory by using a recursive macro call to itself:

With that definition, ‘enum 0,5’ is equivalent to this assembly input:
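The enum macro might be defined as follows (a sketch based on the recursive macro example in the GNU assembler manual):

```asm
        .macro  enum first=0, last=5
        .long   \first                  @ emit the current value
        .if     \last - \first
        enum    "(\first + 1)", \last   @ recurse until first equals last
        .endif
        .endm
```

With this definition, ‘enum 0,5’ emits six .long directives with the values 0 through 5.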

There are four elements to assembly syntax: labels, directives, instructions, and comments. Directives are used mainly to define symbols, allocate storage, and control the behavior of the assembler. The most common assembler directives were introduced in this chapter, but there are many other directives available in the GNU assembler. Complete documentation is available online as part of the GNU Binutils package.
Directives are used to declare statically allocated storage, which is equivalent to declaring global static variables in C. In assembly, labels and other symbols are visible only within the file in which they are declared, unless they are explicitly made visible to other files with the .global directive. In C, variables that are declared outside of any function are visible to all files in the program, unless the static keyword is used to make them visible only within the file where they are declared. Thus, both C and assembly support file and global scope for static variables, but with the opposite defaults and different syntax.
Directives can also be used to declare macros. Macros are expanded by the assembler and may generate multiple statements. Careful use of macros can automate some simple tasks, allowing several lines of assembly code to be replaced with a single macro invocation.
2.1 What is the difference between
(a) the .data section and .bss section?
(b) the .ascii and .asciz directives?
(c) the .word and the .long directives?
2.2 What is the purpose of the .align assembler directive? What does “.align 2” do in GNU ARM assembly?
2.3 Assembly language has four main elements. What are they?
2.4 Using the directives presented in this chapter, show three different ways to create a null-terminated string containing the phrase “segmentation fault”.
2.5 What is the total memory, in bytes, allocated for the following variables?

2.6 Identify the directive(s), label(s), comment(s), and instruction(s) in the following code:

2.7 Write assembly code to declare variables equivalent to the following C code:

2.8 Show how to store the following text as a single string in assembly language, while making it readable and keeping each line shorter than 80 characters:
The three goals of the mission are:
1) Keep each line of code under 80 characters,
2) Write readable comments,
3) Learn a valuable skill for readability.
2.9 Insert the minimum number of .align directives necessary in the following code so that all word variables are aligned on word boundaries and all halfword variables are aligned on halfword boundaries, while minimizing the amount of wasted space.

2.10 Re-order the directives in the previous problem so that no .align directives are necessary to ensure proper alignment. How many bytes of storage were wasted by the original ordering of directives, compared to the new one?
2.11 What are the most important directives for selecting a section?
2.12 Why are .ascii and .asciz directives usually followed by an .align directive, but .word directives are not?
2.13 Using the “Hello World” program shown in Listing 2.1 as a template, write a program that will print your name.
2.14 Listing 2.3 shows that the assembler will assign the location 0x00000000 to the main symbol and also to the str symbol. Why does this not cause problems?
This chapter explains how a particular assembly language is related to the architectural design of a particular CPU family. It then gives an overview of the ARM architecture. Next, it describes the ARM register set and data paths, including the Process Status Register, and the flags which are used to control conditional execution. Then it introduces the concept of instructions and operands, and explains immediate data used as an operand. Next it describes the load and store instructions and all of the addressing modes available on the ARM processor. Then it explains the branch and conditional branch instructions. The chapter ends with some examples showing how the branch and link instruction can be used to call functions from the C standard library.
Architecture; Instruction set architecture; Data path; Register; Memory; Load; Store; Branch; Address; Addressing mode; Conditional execution; Function or subroutine call
The part of the computer architecture related to programming is referred to as the instruction set architecture (ISA). The ISA includes the set of registers that the user program can access, and the set of instructions that the processor supports, as well as data paths and processing elements within the processor. The first step in learning a new assembly language is to become familiar with the ISA. For most modern computer systems, data must be loaded in a register before it can be used for any data processing instruction, but there are a limited number of registers. Memory provides a place to store data that is not currently needed. Program instructions are also stored in memory and fetched into the CPU as they are needed. This chapter introduces the ISA for the ARM processor.
The CPU is composed of data storage and computational components connected together by a set of buses. The most important components of the CPU are the registers, where data is stored, and the arithmetic and logic unit (ALU), where arithmetic and logical operations are performed on the data. Some CPUs also have dedicated hardware units for multiplication and/or division. Fig. 3.1 shows the major components of the ARM CPU and the buses that connect the components together. These buses provide pathways for the data to move between the computational and storage components. The organization of the components and buses in a CPU govern what types of operations can be performed.

The set of instructions and addressing modes available on the ARM processor is closely related to the architecture shown in Fig. 3.1. The architecture provides for certain operations to be performed efficiently, and this has a direct relationship to the types of instructions that are supported.
Note that on the ARM, two source registers can be selected for an instruction, using the A and B buses. The data on the B bus is routed through a shifter, and then to the ALU. This allows the second operand of most instructions to be shifted an arbitrary amount before it reaches the ALU. The data on the A bus goes directly to the ALU. Additionally, the A and B buses can provide operands for the multiplier, and the multiplier can provide data for the A and B buses.
Data coming in from memory or an input/output device is fed directly onto the ALU bus. From there, it can be stored in one of the general-purpose registers. Data being written to memory or an input/output device is taken directly from the B bus, which means that store operations can move data from a register, but cannot modify the data on the way to memory or input/output devices.
The address register is a temporary register that is used by the CPU whenever it needs to read or write to memory or I/O devices. It is used every time an instruction is fetched from memory, and is used for all load and store operations. The address register can be loaded from the program counter, for fetching the next instruction. Also the address register can be loaded from the ALU, which allows the processor to support addressing modes where a register is used as a base pointer and an offset is calculated on-the-fly. After its contents are used to access memory or I/O devices, the base address can be incremented and the incremented value can be stored back into a register. This allows the processor to increment the program counter after each instruction, and to implement certain addressing modes where a pointer is automatically incremented after each memory access.
As shown in Fig. 3.2, the ARM processor provides 13 general-purpose registers, named r0 through r12. These registers can each store 32 bits of data. In addition to the 13 general-purpose registers, the ARM has three other special-purpose registers.

The program counter, r15, always contains the address of the next instruction that will be executed. The processor increments this register by four, automatically, after each instruction is fetched from memory. By moving an address into this register, the programmer can cause the processor to fetch the next instruction from the new address. This gives the programmer the ability to jump to any address and begin executing code there.
The link register, r14, is used to hold the return address for subroutines. Certain instructions cause the program counter to be copied to the link register, then the program counter is loaded with a new address. These branch-and-link instructions are briefly covered in Section 3.5 and in more detail in Section 5.4.
The program stack was introduced in Section 1.4. The stack pointer, r13, is used to hold the address where the stack ends. This is commonly referred to as the top of the stack, although on most systems the stack grows downwards and the stack pointer really refers to the bottom of the stack. The address where the stack ends may change when registers are pushed onto the stack, or when temporary local variables (automatic variables) are allocated or deleted. The use of the stack for storing automatic variables is described in Chapter 5. The use of r13 as the stack pointer is a programming convention. Some instructions (eg, branches) implicitly modify the program counter and link registers, but there are no special instructions involving the stack pointer. As far as the hardware is concerned, r13 is exactly the same as registers r0–r12, but all ARM programmers use it for the stack pointer.
Although register r13 is normally used as the stack pointer, it can be used as a general-purpose register if the stack is not used. However, the high-level language compilers always use it as the stack pointer, so using it as a general-purpose register will result in code that cannot inter-operate with code generated using high-level languages. The link register, r14, can also be used as a general-purpose register, but its contents are modified by hardware when a subroutine is called. Using r13 and r14 as general-purpose registers is dangerous and strongly discouraged.
There are also two other registers which may have special purposes. As with the stack pointer, these are programming conventions. There are no special instructions involving these registers. The frame pointer (r11) is used by high-level language compilers to track the current stack frame. This is sometimes useful when running your program under a debugger, and can sometimes help the compiler to generate more efficient code for returning from a subroutine. The GNU C compiler can be instructed to use r11 as a general-purpose register by using the -fomit-frame-pointer command line option. The inter-procedure scratch register r12 is used by the C library when calling functions in dynamically linked libraries. The contents may change, seemingly at random, when certain functions (such as printf) are called.
The final register in the ARM user programming model is the Current Program Status Register (CPSR). This register contains bits that indicate the status of the current program, including information about the results of previous operations. Fig. 3.3 shows the bits in the CPSR. The first four bits, N, Z, C, and V are the condition flags. Most instructions can modify these flags, and later instructions can use the flags to modify their operation. Their meaning is as follows:

Negative: This bit is set to one if the signed result of an operation is negative, and set to zero if the result is positive or zero.
Zero: This bit is set to one if the result of an operation is zero, and set to zero if the result is non-zero.
Carry: This bit is set to one if an add operation results in a carry out of the most significant bit, or if a subtract operation results in a borrow. For shift operations, this flag is set to the last bit shifted out by the shifter.
oVerflow: For addition and subtraction, this flag is set if a signed overflow occurred.
The remaining bits are used by the operating system or for bare-metal programs, and are described in Section 14.1.
The ARM processor supports a relatively small set of instructions grouped into four basic instruction types, or categories. Most instructions have optional modifiers which can be used to change their behavior. For example, many instructions can have modifiers which set or check condition codes in the CPSR. The combination of basic instructions with optional modifiers results in an extremely rich assembly language. The following sections give a brief overview of the features which are common to instructions in each category. The individual instructions are explained later in this chapter and in the following chapter.
As mentioned previously, the CPSR contains four flag bits (bits 28–31), which can be used to control whether or not certain instructions are executed. Most of the data processing instructions have an optional modifier to control whether or not the flag bits are affected when the instruction is executed. For example, the basic instruction for addition is add. When the add instruction is executed, the result is stored in a register, but the flag bits in the CPSR are not affected.
However, the programmer can add the s modifier to the add instruction to create the adds instruction. When it is executed, this instruction will affect the CPSR flag bits. The flag bits can be used by subsequent instructions to control execution and branching. The meaning of the flags depends on the type of instruction that last set the flags. Table 3.1 shows the names and meanings of the four bits depending on the type of instruction that set or cleared them. Most instructions support the s modifier to control setting the flags.
Table 3.1
Flag bits in the CPSR register
| Name | Logical Instruction | Arithmetic Instruction |
| N (Negative) | No meaning | Bit 31 of the result is set. Indicates a negative number in signed operations |
| Z (Zero) | Result is all zeroes | Result of operation was zero |
| C (Carry) | After Shift operation, ‘1’ was left in carry flag | Result was greater than 32 bits |
| V (oVerflow) | No meaning | The signed two’s complement result requires more than 32 bits. Indicates a possible corruption of the result |

Most ARM instructions can have a condition modifier attached. If present, the modifier controls, at run-time, whether or not the instruction is actually executed. These condition modifiers are added to basic instructions to create conditional instructions. Table 3.2 shows the condition modifiers that can be attached to base instructions. For example, to create an instruction that adds only if the CPSR Z flag is set, the programmer would add the eq condition modifier to the basic add instruction to create the addeq instruction.
Table 3.2
ARM condition modifiers
| <cond> | English Meaning |
| al | always (this is the default <cond>) |
| eq | Z set (=) |
| ne | Z clear (≠) |
| ge | N set and V set, or N clear and V clear (≥) |
| lt | N set and V clear, or N clear and V set (<) |
| gt | Z clear, and either N set and V set, or N clear and V clear (>) |
| le | Z set, or N set and V clear, or N clear and V set (≤) |
| hi | C set and Z clear (unsigned >) |
| ls | C clear or Z set (unsigned ≤) |
| hs | C set (unsigned ≥) |
| cs | Alternate name for hs |
| lo | C clear (unsigned <) |
| cc | Alternate name for lo |
| mi | N set (result < 0) |
| pl | N clear (result ≥ 0) |
| vs | V set (overflow) |
| vc | V clear (no overflow) |
Setting and using condition flags are orthogonal operations. This means that they can be used in combination. Using the previous example, the programmer could add the s modifier to create the addeqs instruction, which executes only if the Z bit is set, and updates the CPSR flags only if it executes.
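For example (a sketch; the register contents are illustrative):

```asm
        subs    r0, r0, #1      @ subtract 1 and set the CPSR flags
        addeq   r1, r1, #1      @ executed only if the subs result was zero
        addeqs  r2, r2, #1      @ executed only if Z is set; updates the flags
                                @ only when it actually executes
```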
An immediate value in assembly language is a constant value that is specified by the programmer. Some assembly languages encode the immediate value as part of the instruction. Other assembly languages create a table of immediate values in a literal pool and insert appropriate instructions to access them. ARM assembly language provides both methods.
Immediate values can be specified in decimal, octal, hexadecimal, or binary. Octal values must begin with a zero, hexadecimal values must begin with “0x”, and binary values must begin with “0b”. Any value that does not begin with zero, “0x”, or “0b” will be interpreted as a decimal value.
There are two ways that immediate values can be specified in GNU ARM assembly. The =<immediate|symbol> syntax can be used to specify any immediate 32-bit number, or to specify the 32-bit value of any symbol in the program. Symbols include program labels (such as main) and symbols that are defined using .equ and similar assembler directives. However, this syntax can only be used with load instructions, and not with data processing instructions. This restriction is necessary because of the way the ARM machine instructions are encoded. For data processing instructions, there are a limited number of bits that can be devoted to storing immediate data as part of the instruction.
The #<immediate|symbol> syntax is used to specify immediate data values for data processing instructions. The #<immediate|symbol> syntax has some restrictions. Basically, the assembler must be able to construct the specified value using only eight bits of data, a shift or rotate, and/or a complement. For immediate values that cannot be constructed by shifting or rotating and complementing an 8-bit value, the programmer must use an ldr instruction with the =<immediate|symbol> syntax to specify the value. That method is covered in Section 3.4. Some examples of immediate values are shown in Table 3.3.
Table 3.3
Legal and illegal values for #<immediate|symbol>
| #32 | Ok because it can be stored as an 8-bit value |
| #1021 | Illegal because the number cannot be created from an 8-bit value using shift or rotate and complement |
| #1024 | Ok because it is 1 shifted left 10 bits |
| #0b1011 | Ok because it fits in 8 bits |
| #-1 | Ok because it is the one’s complement of 0 |
| #0xFFFFFFFE | Ok because it is the one’s complement of 1 |
| #0xEFFFFFFF | Ok because it is the one’s complement of 1 shifted left 28 bits |
| #strsize | Ok if the value of strsize can be created from an 8-bit value using shift or rotate and complement |
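Putting the two syntaxes together (a sketch):

```asm
        mov     r0, #1024       @ ok: 1 shifted left 10 bits
        ldr     r1, =1021       @ 1021 is not encodable as an 8-bit value
                                @ with shift/rotate, so ldr with = is used
        ldr     r2, =0x12345678 @ any 32-bit constant, via the literal pool
```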

The ARM processor has a strict separation between instructions that perform computation and those that move data between the CPU and memory. Because of this separation between load/store operations and computational operations, it is a classic example of a load-store architecture. The programmer can transfer bytes (8 bits), half-words (16 bits), and words (32 bits), from memory into a register, or from a register into memory. The programmer can also perform computational operations (such as adding) using two source operands and one register as the destination for the result. All computational instructions assume that the registers already contain the data. Load instructions are used to move data into the registers, while store instructions are used to move data from the registers to memory.
Most of the load/store instructions use an <address> which is one of the six options shown in Table 3.4. The <shift_op> can be any of the shift operations from Table 3.5, and <shift> should be a number between 0 and 31. Although there are really only six addressing modes, there are eleven variations of the assembly language syntax. Four of the variations are simply shorthand notations. One of the variations allows an immediate data value or the address of a label to be loaded into a register, and may result in the assembler generating more than one instruction. The following section describes each addressing mode in detail.
Table 3.4
ARM addressing modes
| Syntax | Name |
| [Rn, #±<offset_12>] | Immediate offset |
| [Rn, ±Rm, <shift_op> #<shift>] | Scaled register offset |
| [Rn, #±<offset_12>]! | Immediate pre-indexed |
| [Rn, ±Rm, <shift_op> #<shift>]! | Scaled register pre-indexed |
| [Rn], #±<offset_12> | Immediate post-indexed |
| [Rn], ±Rm, <shift_op> #<shift> | Scaled register post-indexed |
Table 3.5
ARM shift and rotate operations
| <shift> | Meaning |
| lsl | Logical Shift Left by specified amount |
| lsr | Logical Shift Right by specified amount |
| asr | Arithmetic Shift Right by specified amount |
Immediate offset: [Rn, #±<offset_12>]
The immediate offset (which may be positive or negative) is added to the contents of Rn. The result is used as the address of the item to be loaded or stored. For example, the following line of code:
calculates a memory address by adding 12 to the contents of register r1. It then loads four bytes of data, starting at the calculated memory address, into register r0. Similarly, the line:
subtracts 8 from the contents of r6 and uses that as the address where it stores the contents of r9 in memory.
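A reconstruction of the two instructions described above:

```asm
        ldr     r0, [r1, #12]   @ load word from address r1 + 12 into r0
        str     r9, [r6, #-8]   @ store r9 at address r6 - 8
```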
Register immediate: [Rn]
When using immediate offset mode with an offset of zero, the comma and offset can be omitted. That is, [Rn] is just shorthand notation for [Rn, #0]. This shorthand is referred to as register immediate mode. For example, the following line of code:
uses the contents of register r2 as a memory address and loads four bytes of data, starting at that address, into register r3. Likewise,
copies the contents of r8 to the four bytes of memory starting at the address that is in r0.
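A reconstruction of the two instructions described above:

```asm
        ldr     r3, [r2]        @ load word from the address in r2 into r3
        str     r8, [r0]        @ store r8 at the address in r0
```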
Scaled register offset: [Rn, ±Rm, <shift_op> #<shift>]
Rm is shifted as specified, then added to or subtracted from Rn. The result is used as the address of the item to be loaded or stored. For example,
shifts the contents of r1 left two bits, adds the result to the contents of r2 and uses the sum as an address in memory from which it loads four bytes into r3. Recall that shifting a binary number left by two bits is equivalent to multiplying that number by four. This addressing mode is typically used to access an array, where r2 contains the address of the beginning of the array, and r1 is an integer index. The integer shift amount depends on the size of the objects in the array. To store an item from register r0 into an array of half-words, the following instruction could be used:
where r4 holds the address of the first byte of the array, and r5 holds the integer index for the desired array item.
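A reconstruction of the two instructions described above:

```asm
        ldr     r3, [r2, r1, lsl #2]    @ load word from r2 + (r1 * 4)
        strh    r0, [r4, r5, lsl #1]    @ store half-word at r4 + (r5 * 2)
```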
Register offset: [Rn, ±Rm]
When using scaled register offset mode with a shift amount of zero, the comma and shift specification can be omitted. That is, [Rn, ±Rm] is just shorthand notation for [Rn, ±Rm, lsl #0]. This shorthand is referred to as register offset mode.
Immediate pre-indexed: [Rn, #±<offset_12>]!
The address is computed in the same way as immediate offset mode, but after the load or store, the address that was used is stored in Rn. This mode can be used to step through elements in an array, updating a pointer to the next array element before each element is accessed.
Scaled register pre-indexed: [Rn, ±Rm, <shift_op> #<shift>]!
The address is computed in the same way as scaled register offset mode, but after the load or store, the address that was used is stored in Rn. This mode can be used to step through elements in an array, updating a pointer to the current array element before each access.
Register pre-indexed: [Rn, ±Rm]!
When using scaled register pre-indexed mode with a shift amount of zero, the comma and shift specification can be omitted. That is, [Rn, ±Rm]! is shorthand notation for [Rn, ±Rm, lsl #0]!. This shorthand is referred to as register pre-indexed mode.
Immediate post-indexed: [Rn], #±<offset_12>
Register Rn is used as the address of the value to be loaded or stored. After the value is loaded or stored, the value in Rn is updated by adding the immediate offset, which may be negative or positive. This mode can be used to step through elements in an array, updating a pointer to point at the next array element after each one is accessed.
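For example, a sketch of the immediate pre- and post-indexed forms when stepping through an array of words:

```asm
        ldr     r0, [r1], #4    @ post-indexed: load from address in r1,
                                @ then r1 = r1 + 4
        ldr     r2, [r3, #4]!   @ pre-indexed: r3 = r3 + 4 first,
                                @ then load from the new r3
```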
Scaled register post-indexed: [Rn], ±Rm, <shift_op> #<shift>
Register Rn is used as the address of the value to be loaded or stored. After the value is loaded or stored, the value in Rn is updated by adding or subtracting the contents of Rm shifted as specified. This mode can be used to step through elements in an array, updating a pointer to point at the next array element after each one is accessed.
Register post-indexed: [Rn], ±Rm
When using scaled register post-indexed mode with a shift amount of zero, the comma and shift specification can be omitted. That is, [Rn], ±Rm is shorthand notation for [Rn], ±Rm, lsl #0. This shorthand is referred to as register post-indexed mode.
Load immediate: =<immediate|symbol>
This is really a pseudo-instruction. The assembler will generate a mov instruction if possible. Otherwise it will store the value of immediate or the address of symbol in a literal pool and generate a load instruction, using one of the previous addressing modes, to load the value into a register. This addressing mode can only be used with the ldr instruction.
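For example (a sketch):

```asm
        ldr     r0, =0xFF       @ small value: assembler emits a mov instead
        ldr     r1, =0x12345678 @ value is placed in a literal pool and loaded
        ldr     r2, =main       @ r2 = the address of the main label
```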
The load and store instructions allow the programmer to move data from memory to registers or from registers to memory. The load/store instructions can be grouped into the following types:
• single register,
• multiple register, and
• atomic.
The following sections describe the seven load and store instructions that are available, and all of their variations.
These instructions transfer a single word, half-word, or byte from a register to memory or from memory to a register:
ldr Load Register, and
str Store Register.
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.
• The optional <size> is one of:
b unsigned byte
h unsigned half-word
sb signed byte
sh signed half-word
• The <address> is any valid address specifier described in Section 3.4.1.
ARM has two instructions for loading and storing multiple registers:
ldm Load Multiple Registers, and
stm Store Multiple Registers.
These instructions are used to store registers on the program stack, and for copying blocks of data. The ldm and stm instructions each have four variants, and each variant has two equivalent names. So, although there are only two basic instructions, there are sixteen mnemonics. These are the most complex instructions in the ARM assembly language.
• <variant> is chosen from the following tables:
| Block Copy Method || Stack Type ||
| Variant | Description | Variant | Description |
| ia | Increment After | ea | Empty Ascending |
| ib | Increment Before | fa | Full Ascending |
| da | Decrement After | ed | Empty Descending |
| db | Decrement Before | fd | Full Descending |

• The optional ! specifies that the address register Rd should be modified after the registers are stored.
• An optional trailing ˆ can only be used by operating system code. It causes the transfer to affect user registers instead of operating system registers.
There are two equivalent mnemonics for each load/store multiple instruction. For example, ldmia is exactly the same instruction as ldmfd, and stmdb is exactly the same instruction as stmfd. There are two different names so that the programmer can indicate what the instruction is being used for.
The mnemonics in the Block Copy Method table are used when the programmer is using the instructions to move blocks of data. For instance, the programmer may want to copy eight words from one address in memory to another address. One very efficient way to do that is to:
1. load the address of the first byte of the source into a register,
2. load the address of the first byte of the destination into another register,
3. use ldmia (load multiple increment after) to load eight registers from the source address, then
4. use stmia (store multiple increment after) to store the registers to the destination address.
Assuming source and dest are labeled blocks of data declared elsewhere, the following listing shows the exact instructions needed to move eight words from source to dest:
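One such listing, sketched here with illustrative register choices (r2-r9 as scratch registers):

```asm
        ldr     r0, =source       @ 1. address of the first byte of the source
        ldr     r1, =dest         @ 2. address of the first byte of the destination
        ldmia   r0!, {r2-r9}      @ 3. load eight words from the source address
        stmia   r1!, {r2-r9}      @ 4. store the eight words at the destination
```

The ! after r0 and r1 updates each base register as the transfer proceeds, so after these instructions both registers point just past their respective blocks.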

The mnemonics in the Stack Type table are used when the programmer is performing stack operations. The most common variants are stmfd and ldmfd, which are used for pushing registers onto the program stack and later popping them back off, respectively. In Linux, the C compiler always uses the stmfd and ldmfd versions for accessing the stack. The following code shows how the programmer could save the contents of registers r0-r9 on the stack, use them to perform a block copy, then restore their contents:
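A sketch of that save, copy, and restore sequence (the register ranges are illustrative):

```asm
        stmfd   sp!, {r0-r9}      @ push r0-r9 onto the full descending stack
        ldr     r0, =source       @ address of the source block
        ldr     r1, =dest         @ address of the destination block
        ldmia   r0!, {r2-r9}      @ load eight words from the source
        stmia   r1!, {r2-r9}      @ store the eight words at the destination
        ldmfd   sp!, {r0-r9}      @ pop r0-r9 back off the stack
```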

Note that in the previous example, after the stmfd sp!, { r0-r9 } instruction, sp will contain the address of the last word on the stack, because the optional ! was used to indicate that the register should be updated.
| Name | Effect | Description |
| ldmia and ldmfd | addr ← Rd; for all i ∈ register_list do rᵢ ← memory[addr]; addr ← addr + 4 end for; if ! is present then Rd ← addr end if | Load multiple registers from memory, starting at the address in Rd and increment the address by four bytes after each load. |
| stmia and stmea | addr ← Rd; for all i ∈ register_list do memory[addr] ← rᵢ; addr ← addr + 4 end for; if ! is present then Rd ← addr end if | Store multiple registers in memory, starting at the address in Rd and increment the address by four bytes after each store. |
| ldmib and ldmed | addr ← Rd; for all i ∈ register_list do addr ← addr + 4; rᵢ ← memory[addr] end for; if ! is present then Rd ← addr end if | Load multiple registers from memory, starting at the address in Rd and increment the address by four bytes before each load. |
| stmib and stmfa | addr ← Rd; for all i ∈ register_list do addr ← addr + 4; memory[addr] ← rᵢ end for; if ! is present then Rd ← addr end if | Store multiple registers in memory, starting at the address in Rd and increment the address by four bytes before each store. |
| ldmda and ldmfa | addr ← Rd; for all i ∈ reverse(register_list) do rᵢ ← memory[addr]; addr ← addr − 4 end for; if ! is present then Rd ← addr end if | Load multiple registers from memory, starting at the address in Rd and decrement the address by four bytes after each load. |
| stmda and stmed | addr ← Rd; for all i ∈ reverse(register_list) do memory[addr] ← rᵢ; addr ← addr − 4 end for; if ! is present then Rd ← addr end if | Store multiple registers in memory, starting at the address in Rd and decrement the address by four bytes after each store. |
| ldmdb and ldmea | addr ← Rd; for all i ∈ reverse(register_list) do addr ← addr − 4; rᵢ ← memory[addr] end for; if ! is present then Rd ← addr end if | Load multiple registers from memory, starting at the address in Rd and decrement the address by four bytes before each load. |
| stmdb and stmfd | addr ← Rd; for all i ∈ reverse(register_list) do addr ← addr − 4; memory[addr] ← rᵢ end for; if ! is present then Rd ← addr end if | Store multiple registers in memory, starting at the address in Rd and decrement the address by four bytes before each store. |


Multiprogramming and threading require the ability to set and test values atomically. This instruction, which atomically exchanges the contents of a register with a word or byte in memory, is used by the operating system or threading libraries to guarantee mutual exclusion:
swp Swap.
Note: swp and swpb are deprecated in favor of ldrex and strex, which work on multiprocessor systems as well as uni-processor systems.
• The optional b specifies that a byte should be swapped instead of a word.
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

These instructions are used by the operating system or threading libraries to guarantee mutual exclusion, even on multiprocessor systems:
ldrex Load Register Exclusive, and
strex Store Register Exclusive.
Exclusive load (ldrex) reads data from memory, tagging the memory address at the same time. Exclusive store (strex) stores data to memory, but only if the tag is still valid, and any strex to the tagged address clears the tag. A plain str to the same address may also invalidate the tag (this is implementation defined). The strex instruction writes a result to a status register: zero if the store succeeded, or one if it failed. This allows the programmer to implement semaphores on uni-processor and multiprocessor systems.
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.
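The following sketch shows how a simple spinlock might be acquired with these instructions; it assumes the address of the lock word is already in r1, and the `retry` label is illustrative:

```asm
retry:  ldrex   r0, [r1]          @ exclusive load; tags the lock address
        cmp     r0, #0            @ zero means the lock is free
        bne     retry             @ held by someone else; try again
        mov     r0, #1            @ value meaning "locked"
        strex   r2, r0, [r1]      @ store succeeds only if the tag is still valid
        cmp     r2, #0            @ strex wrote 0 on success, 1 on failure
        bne     retry             @ another processor intervened; retry
```

On a real ARMv7 multiprocessor, production code would also issue a dmb memory barrier after acquiring the lock; that detail is omitted from this sketch.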

Branch instructions allow the programmer to change the address of the next instruction to be executed. They are used to implement loops, if-then structures, subroutines, and other flow control structures. There are two basic branch instructions:
• Branch, and
• Branch and Link (subroutine call).
This instruction is used to perform conditional and unconditional branches in program execution:
It is used for creating loops and if-then-else constructs.
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.
• The target_label can be any label in the current file, or any label that is defined as .global or .globl in any file that is linked in.

The following instruction is used to call subroutines:
The branch and link instruction is identical to the branch instruction, except that it copies the return address (the address of the instruction immediately following the bl) into the link register before performing the branch. This allows the programmer to copy the link register back into the program counter at some later point. This is how subroutines are called, and how subroutines return and resume execution at the instruction after the one that called them.
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.
• The target_address can be any label in the current file, or any label that is defined as .global or .globl in any file that is linked in.

Example 3.1 shows how the bl instruction can be used to call a function from the C standard library to read a single character from standard input. By convention, when a function is called, it will leave its return value in r0. Example 3.2 shows how the bl instruction can be used to call another function from the C standard library to print a message to standard output. By convention, when a function is called, it will expect to find its first argument in r0. There are other rules, which all ARM programmers must follow, regarding which registers are used when passing arguments to functions and procedures. Those rules will be explained fully in Section 5.4.
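The following sketch combines both conventions: getchar() returns its result in r0, and printf() expects its first argument (here, the address of a format string) in r0 and its second in r1. The msg label and format string are made up for illustration:

```asm
        .data
msg:    .asciz  "you typed: %c\n" @ hypothetical format string
        .text
        .global main
main:   stmfd   sp!, {lr}         @ save the return address on the stack
        bl      getchar           @ the character read arrives in r0
        mov     r1, r0            @ printf expects the %c value in r1
        ldr     r0, =msg          @ printf expects the format string address in r0
        bl      printf
        mov     r0, #0            @ return value of main
        ldmfd   sp!, {pc}         @ pop the saved lr directly into pc to return
```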
The assembler provides a small number of pseudo-instructions. From the perspective of the programmer, these instructions are indistinguishable from standard instructions. However, when the assembler encounters a pseudo-instruction, it may substitute a different instruction or generate a short sequence of machine instructions.
This pseudo-instruction loads a register with any 32-bit value:
When this pseudo-instruction is encountered, the assembler first determines whether or not it can substitute a mov Rd,#<immediate> or mvn Rd,#<immediate> instruction. If that is not possible, then it reserves four bytes in a “literal pool” and stores the immediate value there. Then, the pseudo-instruction is translated into an ldr instruction using Immediate Offset addressing mode with the pc as the base register.
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.
• The <immediate> parameter is any valid 32-bit quantity.
Example 3.3 shows how the assembler generates code from the load immediate pseudo-instruction. Line 2 of the example listing just declares two 32-bit words. They cause the next variable to be given a non-zero address for demonstration purposes, and are not used anywhere in the program, but line 3 declares a string of characters in the data section. The string is located at offset 0x00000008 from the beginning of the data section. The linker is responsible for calculating the actual address, when it assigns a location for the data section. Line 6 shows how a register can be loaded with an immediate value using the mov instruction. The next line shows the equivalent using the ldr pseudo-instruction. Note that the assembler generates the same machine instruction (FD5FE0E3) for both lines.
Line 8 shows the ldr pseudo-instruction being used to load a value that cannot be loaded using the mov instruction. The assembler generated a load word instruction using the program counter as the base register, and an offset to the location where the value is stored. The value is actually stored in a literal pool at the end of the text segment. The listing has three lines labeled 11. The first line 11 is an instruction. The remaining lines are the literal pool.
On line 9, the programmer used the ldr pseudo-instruction to request that the address of str be loaded into r4. The assembler created a storage location to hold the address of str, and generated a load word instruction using the program counter as the base register and an offset to the location where the address is stored. The address of str is actually stored in the text segment, on the third line 11.
These pseudo-instructions are used to load the address associated with a label:
adr Load Address, and
adrl Load Address Long.
They are more efficient than the ldr rx,=label instruction, because they are translated into one or two add or subtract operations, and do not require a load from memory. However, the address must be in the same section as the adr or adrl pseudo-instruction, so they cannot be used to load addresses of labels in the .data section.
• The adr pseudo-instruction will be translated into one or two pc-relative add or sub instructions.
• The adrl pseudo-instruction will always be translated into two instructions. The second instruction may be a nop instruction.
• The label must be defined in the same file and section where these pseudo-instructions are used.
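A minimal sketch of their use; the lut label and its contents are illustrative, and must live in the same section as the instructions:

```asm
        .text
lut:    .word   10, 20, 30, 40    @ data placed in the .text section
func:   adr     r0, lut           @ one pc-relative add or sub; no memory access
        adrl    r1, lut           @ always two instructions (second may be a nop)
```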

The ARM Instruction Set Architecture includes 17 registers and four basic instruction types. This chapter explained the instructions used for
• moving data between memory and registers, and
• branching and calling subroutines.
The load and store operations are used to move data between memory and registers. The basic load and store operations, ldr and str, have a very powerful set of addressing modes. To facilitate moving multiple registers to or from memory, the ARM ISA provides the ldm and stm instructions, which each have several variants. The assembler provides two pseudo-instructions for loading addresses and immediate values.
The ARM processor provides only two types of branch instruction. The bl instruction is used to call subroutines (functions). The b instruction can be used to create loops and to create if-then-else constructs. The ability to append a condition to almost any instruction results in a very rich instruction set.
3.1 Which registers hold the stack pointer, return address, and program counter?
3.2 Which is more efficient for loading a constant value, the ldr pseudo-instruction, or the mov instruction? Explain.
3.3 Which two variants of the Store Multiple instruction are used most often, and why?
3.4 The stm and ldm instructions include an optional ‘!’ after the address register. What does it do?
3.5 The following C statement declares an array of four integers, and initializes their values to 7, 3, 21, and 10, in that order.
(a) Write the equivalent in GNU ARM assembly.
(b) Write the ARM assembly instructions to load all four numbers into registers r3, r5, r6, and r9, respectively, using:
i. a single ldm instruction, and
ii. four ldr instructions.
3.6 What is the difference between a memory location and a CPU register?
3.7 How many registers are provided by the ARM Instruction Set Architecture?
3.8 Use ldm and stm to write a short sequence of ARM assembly language to copy 16 words of data from a source address to a destination address. Assume that the source address is already loaded in r0 and the destination address is already loaded in r1. You may use registers r2 through r5 to hold values as needed. Your code is allowed to modify r0 and/or r1.
3.9 Assume that x is an array of integers. Convert the following C statements into ARM assembly language.
(b) x[10] = x[0];
(c) x[9] = x[3];
3.10 Assume that x is an array of integers, and i and j are integers. Convert the following C statements into ARM assembly language.
(b) x[j] = x[i];
(c) x[i] = x[j*2];
3.11 What is the difference between the b instruction and the bl instruction? What is each used for?
3.12 What are the meanings of the following instructions?
(b) ldrlt
(c) bgt
(d) bne
(e) bge
This chapter begins by explaining Operand2, which is used by most ARM data processing instructions to specify one of the source operands for the data processing operation. It explains all of the shift operations and how they can be combined with other data processing operations in a single instruction. It then explains each of the data processing instructions, giving a short example showing how they can be used. Short examples, relating the assembly instructions to C statements, are incorporated throughout the chapter. One of the examples shows how to construct a loop. After the data processing instructions are explained, the chapter covers the special instructions and pseudo-instructions.
Operand2; Data processing; Shift; Loop; Comparison; Data movement; Three address instruction; Two address instruction
The ARM processor has approximately 25 data processing instructions. The exact number depends on the processor version. For example, older versions of the architecture did not have the six multiply instructions, and the Cortex M3 and newer processors have two division instructions. There are also a few special instructions that are used infrequently to perform operations that are not classified as load/store, branch, or data processing.
The data processing instructions operate only on CPU registers, so data must first be moved from memory into a register before processing can be performed. Most of these instructions use two source operands and one destination register. Each instruction performs one basic arithmetical or logical operation. The operations are grouped in the following categories:
• Logical Operations,
• Comparison Operations,
• Data Movement Operations,
• Status Register Operations,
• Multiplication Operations, and
• Division Operations.
Most of the data processing instructions require the programmer to specify two source operands and one destination register for the result. Because three items must be specified for these instructions, they are known as three address instructions. The use of the word address in this case has nothing to do with memory addresses. The term three address instruction comes from earlier processor architectures that allow arithmetic operations to be performed with data that is stored in memory rather than registers. The first source operand specifies a register whose contents will be on the A bus in Fig. 3.1. The second source operand will be on the B bus and is referred to as Operand2. Operand2 can be any one of the following three things:
• a register (r0-r15),
• a register and a shift operation to modify it, or
• a 32-bit immediate value that can be constructed by shifting, rotating, and/or complementing an 8-bit value.
The options for Operand2 allow a great deal of flexibility. Many operations that would require two instructions on most processors can be performed using a single ARM instruction. Table 4.1 shows the mnemonics used for specifying shift operations, which we refer to as <shift_op>.
The lsl operation shifts each bit left by a specified amount n. Zero is shifted into the n least significant bits, and the most significant n bits are lost. The lsr operation shifts each bit right by a specified amount n. Zero is shifted into the n most significant bits, and the least significant n bits are lost. The asr operation shifts each bit right by a specified amount n. The n most significant bits become copies of the sign bit (bit 31), and the least significant n bits are lost. The ror operation rotates each bit right by a specified amount n. The n most significant bits become the least significant n bits. The rrx operation rotates one place to the right, but the CPSR carry flag, C, is included. The carry flag and the register together create a 33-bit quantity to be rotated. The carry flag is rotated into the most significant bit of the register, and the least significant bit of the register is rotated into the carry flag. Table 4.2 shows all of the possible forms for Operand2.
Table 4.2
Formats for Operand2
| #<immediate|symbol> | A 32-bit immediate value that can be constructed from an 8 bit value |
| Rm | Any of the 16 registers r0-r15 |
| Rm, <shift_op> #<shift_imm> | The contents of a register shifted or rotated by an immediate amount between 0 and 31 |
| Rm, <shift_op> Rs | The contents of a register shifted or rotated by an amount specified by the contents of another register |
| Rm, rrx | The contents of a register rotated right by one bit through the carry flag |
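A few illustrative instructions showing each Operand2 form (register choices are arbitrary):

```asm
        add     r0, r1, r2            @ Operand2 is a plain register
        add     r0, r1, r2, lsl #2    @ r0 = r1 + (r2 << 2), i.e. r1 + 4*r2
        add     r0, r1, r2, lsr r3    @ shift amount taken from the low byte of r3
        add     r0, r1, #0xFF00       @ immediate: the 8-bit value 0xFF rotated into place
        mov     r0, r2, rrx           @ rotate right one bit through the carry flag
```

The second form is particularly common for array indexing, since scaling an index by 4 (the size of a word) costs nothing extra.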
These four comparison operations update the CPSR flags, but have no other effect:
cmp Compare,
cmn Compare Negative,
tst Test Bits, and
teq Test Equivalence.
They each perform an arithmetic or logical operation, but the result of the operation is discarded. Only the CPSR condition flags are affected.
• <op> is either cmp, cmn, tst, or teq.
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

Example 4.1 shows how conditional execution and the test instruction can be used together to create an if-then-else structure. Note that in this case, the assembly code is more concise than the C code. That is not generally true.
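A sketch of the idea, not the book's exact Example 4.1: the C statement `if (a & 4) b = a; else b = 0;`, assuming a is held in r0 and b in r1:

```asm
        tst     r0, #4            @ set Z from r0 AND 4; the result is discarded
        movne   r1, r0            @ "then" clause: executes only if the bit was set
        moveq   r1, #0            @ "else" clause: executes only if the bit was clear
```

No branch instructions are needed; conditional execution selects which of the two moves takes effect.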
There are six basic arithmetic operations:
add Add,
adc Add with Carry,
sub Subtract,
sbc Subtract with Carry,
rsb Reverse Subtract, and
rsc Reverse Subtract with Carry.
All of them involve two 32-bit source operands and a destination register.
• <op> is one of add, adc, sub, sbc, rsb, or rsc.
• The optional s specifies whether or not the instruction should affect the bits in the CPSR.
• The optional <cond> can be any of the codes from Table 3.2 on page 59 specifying conditional execution.

Example 4.2 shows a complete program for adding the contents of two statically allocated variables and printing the result. The printf() function expects to find the address of a string in r0. As it prints the string, it finds the %d formatting command, which indicates that the value of an integer variable should be printed. It expects the variable to be stored in r1. Note that the variable sum does not need to be stored in memory. It is stored in r1, where printf() expects to find it.
Example 4.3 shows how the compare, branch, and add instructions can be used to create a loop. There are basically three steps for creating a loop: allocating and initializing the loop variable, testing the loop variable, and modifying the loop variable. In general, any of the registers r0-r12 can be used to hold the loop variable. Section 5.4 introduces some considerations for choosing an appropriate register. For now, it is assumed that r0 is available for use as the loop variable for this example.
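The three steps can be sketched as follows for a loop equivalent to `for (i = 0; i < 10; i++) { ... }`, with r0 as the loop variable (the labels are illustrative):

```asm
        mov     r0, #0            @ step 1: allocate and initialize the loop variable
loop:   cmp     r0, #10           @ step 2: test the loop variable
        bge     done              @ exit when i >= 10
        @ ... loop body goes here ...
        add     r0, r0, #1        @ step 3: modify the loop variable
        b       loop
done:
```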
There are five basic logical operations:
and Bitwise AND,
orr Bitwise OR,
eor Bitwise Exclusive OR,
orn Bitwise OR NOT, and
bic Bit Clear.
All of them involve two source operands and a destination register.
• <op> is either and, eor, orr, orn, or bic.
• The optional s specifies whether or not the instruction should affect the bits in the CPSR.
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

The data movement operations copy data from one register to another:
mov Move,
mvn Move Not, and
movt Move Top.
The movt instruction copies 16 bits of data into the upper 16 bits of the destination register, without affecting the lower 16 bits. It is available on ARMv6T2 and newer processors.
• The optional s specifies whether or not the instruction should affect the bits in the CPSR.
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.
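For instance, movt can be combined with a mov to build a full 32-bit constant; the value below is purely illustrative:

```asm
        mov     r0, #0xEF         @ r0 = 0x000000EF
        movt    r0, #0xDEAD       @ r0 = 0xDEAD00EF; the low 16 bits are untouched
```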

These two instructions perform multiplication using two 32-bit registers to form a 32-bit result:
mul Multiply, and
mla Multiply and Accumulate.
The mla instruction adds a third register to the result of the multiplication.
• The optional s specifies whether or not the instruction should affect the bits in the CPSR.
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

These instructions perform multiplication using two 32-bit registers to form a 64-bit result:
smull Signed Multiply Long,
umull Unsigned Multiply Long,
smlal Signed Multiply and Accumulate Long, and
umlal Unsigned Multiply and Accumulate Long.
The smlal and umlal instructions add a 64-bit quantity to the result of the multiplication.
• <type> must be either s for signed or u for unsigned.
• <op> must be either mul, or mla.
• The optional s specifies whether or not the instruction should affect the bits in the CPSR.
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

Some ARM processors have the following instructions to perform division:
sdiv Signed Divide, and
udiv Unsigned Divide.
The divide operations are available on Cortex M3 and newer ARM processors. The processor used on the Raspberry Pi does not have these instructions. The Raspberry Pi 2 does have them.
• <type> must be either s for signed or u for unsigned.
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.
• The optional s specifies whether or not the instruction should affect the bits in the CPSR.
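A short sketch (valid only on processors that implement the divide instructions, such as the Cortex M3 or the Raspberry Pi 2):

```asm
        mov     r0, #100
        mov     r1, #7
        sdiv    r2, r0, r1        @ r2 = 14; signed division truncates toward zero
        udiv    r3, r0, r1        @ r3 = 14; unsigned division
```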

There are a few instructions that do not fit into any of the previous categories. They are used to request operating system services and access advanced CPU features.
This instruction counts the number of leading zeros in the operand register and stores the result in the destination register:
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.
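A brief sketch of clz in use:

```asm
        mov     r0, #0x00010000   @ bit 16 is the highest set bit
        clz     r1, r0            @ r1 = 15, since bits 31-17 are all zero
```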

These two instructions allow the programmer to access the status bits of the CPSR and SPSR:
mrs Move Status to Register, and
msr Move Register to Status.
The SPSR is covered in Section 14.1.
• The optional <fields> is any combination of:
c control field
x extension field
s status field
f flags field
• The optional <cond> can be any of the codes from Table 3.2 specifying conditional execution.

The following instruction allows a user program to perform a system call to request operating system services:
In Unix and Linux, the system calls are documented in the second section of the online manual. Each system call has a unique id number which is defined in the /usr/include/syscall.h file.
• The <syscall_number> is encoded in the instruction. The operating system may examine it to determine which operating system service is being requested.
• In Linux, <syscall_number> is ignored. The system call number is passed in r7, and up to seven parameters are passed in r0-r6. No Linux system call requires more than seven parameters.
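Following those conventions, a program can terminate itself with the Linux exit system call (number 1 on ARM Linux); this is a sketch, not a complete program:

```asm
        mov     r0, #0            @ first (and only) argument: the exit status
        mov     r7, #1            @ system call number for exit
        swi     0                 @ the encoded number is ignored; r7 selects the call
```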

The ARM processor has an alternate mode where it executes a 16-bit instruction set known as Thumb. This instruction allows the programmer to change the processor mode and branch to Thumb code:
The Thumb instruction set is sometimes more efficient than the full ARM instruction set, and may offer advantages on small systems.

The assembler provides a small number of pseudo-instructions. From the perspective of the programmer, these instructions are indistinguishable from standard instructions. However, when the assembler encounters a pseudo-instruction, it may substitute a different instruction or generate a short sequence of machine instructions.
This pseudo instruction does nothing, but takes one clock cycle to execute.
This is equivalent to a mov r0,r0 instruction.

These pseudo instructions are assembled into mov instructions with an appropriate shift of Operand2:
lsl Logical Shift Left,
lsr Logical Shift Right,
asr Arithmetic Shift Right,
ror Rotate Right, and
rrx Rotate Right with eXtend.
• <op> must be either lsl, lsr, asr, or ror.
• Rs is a register holding the shift amount. Only the least significant byte is used.
• shift must be between 1 and 32.
• If the optional s is specified, then the N and Z flags are updated according to the result, and the C flag is updated to the last bit shifted out.
• The optional <cond> can be any of the codes from Table 3.2 on page 59 specifying conditional execution.
| Name | Effect | Description |
| lsl | each bit moves left n places; zeros fill the low bits | Shift Left |
| lsr | each bit moves right n places; zeros fill the high bits | Shift Right |
| asr | each bit moves right n places; copies of the sign bit fill the high bits | Shift Right with sign extend |
| rrx | rotate right one place through the carry flag | Rotate Right with eXtend |
The rrx operation rotates one place to the right but the CPSR carry flag, C, is included. The carry flag and the register together create a 33-bit quantity to be rotated. The carry flag is rotated into the most significant bit of the register, and the least significant bit of the register is rotated into the carry flag.

This chapter and the previous one introduced the core set of ARM instructions. Most of these instructions were introduced with the very first ARM processors. There are approximately 50 additional instructions and pseudo instructions that were introduced with the ARMv6 and later versions of the architecture, or that only appear in specific versions of the ARM. There are also additional instructions available on systems that have the Vector Floating Point (VFP) coprocessor and/or the NEON extensions. The instructions introduced so far are:
| Name | Page | Operation |
| adc | 83 | Add with Carry |
| add | 83 | Add |
| adr | 75 | Load Address |
| adrl | 75 | Load Address Long |
| and | 85 | Bitwise AND |
| asr | 94 | Arithmetic Shift Right |
| b | 70 | Branch |
| bic | 86 | Bit Clear |
| bl | 71 | Branch and Link |
| bx | 92 | Branch and Exchange |
| clz | 90 | Count Leading Zeros |
| cmn | 81 | Compare Negative |
| cmp | 81 | Compare |
| eor | 85 | Bitwise Exclusive OR |
| ldm | 65 | Load Multiple Registers |
| ldr | 73 | Load Immediate |
| ldr | 64 | Load Register |
| ldrex | 69 | Load Register Exclusive |
| lsl | 94 | Logical Shift Left |
| lsr | 94 | Logical Shift Right |
| mla | 87 | Multiply and Accumulate |
| mov | 86 | Move |
| movt | 86 | Move Top |
| mrs | 91 | Move Status to Register |
| msr | 91 | Move Register to Status |
| mul | 87 | Multiply |
| mvn | 86 | Move Not |
| nop | 93 | No Operation |
| orn | 86 | Bitwise OR NOT |
| orr | 85 | Bitwise OR |
| ror | 94 | Rotate Right |
| rrx | 94 | Rotate Right with eXtend |
| rsb | 83 | Reverse Subtract |
| rsc | 83 | Reverse Subtract with Carry |
| sbc | 83 | Subtract with Carry |
| sdiv | 89 | Signed Divide |
| smlal | 88 | Signed Multiply and Accumulate Long |
| smull | 88 | Signed Multiply Long |
| stm | 65 | Store Multiple Registers |
| str | 64 | Store Register |
| strex | 69 | Store Register Exclusive |
| sub | 83 | Subtract |
| swi | 91 | Software Interrupt |
| swp | 68 | Swap |
| teq | 81 | Test Equivalence |
| tst | 81 | Test Bits |
| udiv | 89 | Unsigned Divide |
| umlal | 88 | Unsigned Multiply and Accumulate Long |
| umull | 88 | Unsigned Multiply Long |

The ARM Instruction Set Architecture includes 17 registers and four basic instruction types. This chapter introduced the instructions used for
• moving data from one register to another,
• performing computational operations with two source operands and one destination register,
• multiplication and division,
• performing comparisons, and
• performing special operations.
Most of the data processing instructions are three address instructions, because they involve two source operands and produce one result. For most instructions, the second source operand can be a register, a rotated or shifted register, or an immediate value. This flexibility results in a relatively powerful assembly language. In addition, almost all instructions can be executed conditionally, which, if used properly, results in very efficient and compact code.
4.1 If r0 initially contains 1, what will it contain after the third instruction in the sequence below?

4.2 What will r0 and r1 contain after each of the following instructions? Give your answers in base 10.

4.3 What is the difference between lsr and asr?
4.4 Write the ARM assembly code to load the numbers stored in num1 and num2, add them together, and store the result in numsum. Use only r0 and r1.
4.5 Given the following variable definitions:

where you do not know the values of x and y, write a short sequence of ARM assembly instructions to load the two numbers, compare them, and move the largest number into register r0.
4.6 Assuming that a is stored in register r0 and b is stored in register r1, show the ARM assembly code that is equivalent to the following C code.

4.7 Without using the mul instruction, give the instructions to multiply r3 by the following constants, leaving the result in r0. You may also use r1 and r2 to hold temporary results, and you do not need to preserve the original contents of r3.
(b) 100
(c) 575
(d) 123
4.8 Assume that r0 holds the least significant 32 bits of a 64-bit integer a, and r1 holds the most significant 32 bits of a. Likewise, r2 holds the least significant 32 bits of a 64-bit integer b, and r3 holds the most significant 32 bits of b. Show the shortest instruction sequences necessary to:
(a) compare a to b, setting the CPSR flags,
(b) shift a left by one bit, storing the result in b,
(c) add b to a, and
(d) subtract b from a.
4.9 Write a loop to count the number of bits in r0 that are set to 1. Use any other registers that are necessary.
4.10 The C standard library provides the open() function, which is documented in the second section of the Linux manual pages. This function is a very small “wrapper” to allow C programmers to access the open() system call. Assembly programmers can access the system call directly. In ARM Linux, the system call number for open() is 5. The values for flag constants used with open() are defined in
Write the ARM assembly instructions and directives necessary to make a Linux system call to open a file named input.txt for reading, without using the C standard library. In other words, write the assembly equivalent to: open("input.txt", O_RDONLY); using the swi instruction.
This chapter first introduces the structured programming concepts and describes the principles of good software design. It then shows how the language elements covered in the previous three chapters are used to create the elements required by structured programming, giving comparative examples of these elements in C and assembly language. It covers programming elements for sequencing, selection, and iteration. Then it covers in greater detail how to access the standard C library functions from assembly language, and how to access assembly language functions from C. It then explains how automatic variables are allocated, and covers writing recursive functions in assembly language. Finally, it explains the implementation of C structs and shows how they can be accessed from assembly language, then covers arrays in the same way.
Structured programming; Sequencing; Selection; Iteration; Loop; Subroutine; Function; Recursion; Struct; Aggregate data; Array
Before IBM released FORTRAN in 1957, almost all programming was done in assembly language. Part of the reason for this is that nobody knew how to design a good high-level language, nor did they know how to write a compiler to generate efficient code. Early attempts at high-level languages resulted in languages that were not well structured, difficult to read, and difficult to debug. The first release of FORTRAN was not a particularly elegant language by today’s standards, but it did generate efficient code.
In the 1960s, a new paradigm for designing high-level languages emerged. This new paradigm emphasized grouping program statements into blocks of code that execute from beginning to end. These basic blocks have only one entry point and one exit point. Control of which basic blocks are executed, and in what order, is accomplished with highly structured flow control statements. The structured program theorem provides the theoretical basis of structured programming. It states that there are three ways of combining basic blocks: sequencing, selection, and iteration. These three mechanisms are sufficient to express any computable function. It has been proven that all programs can be written using only basic blocks, the pre-test loop, and if-then-else structure. Although most high-level languages provide additional statements for the convenience of the programmer, they are just “syntactical sugar.” Other structured programming concepts include well-formed functions and procedures, pass-by-reference and pass-by-value, separate compilation, and information hiding.
These structured programming languages enabled programmers to become much more productive. Well-written programs that adhere to structured programming principles are much easier to write, understand, debug, and maintain. Most successful high-level languages are designed to enforce, or at least facilitate, good programming techniques. This is not generally true for assembly language. The burden of writing well-structured code lies with the programmer, not with the language.
The best assembly programmers rely heavily on structured programming concepts. Failure to do so results in code that contains unnecessary branch instructions and, in the worst cases, results in something called spaghetti code. Consider a code listing where a line has been drawn from each branch instruction to its destination. If the result looks like someone spilled a plate of spaghetti on the page, then the listing is spaghetti code. If a program is spaghetti code, then the flow of control is difficult to follow. Spaghetti code is much more likely to have bugs and is extremely difficult to debug. If the flow of control is too complex for the programmer to follow, then it cannot be adequately debugged. It is the responsibility of the assembly language programmer to write code that uses a block-structured approach.
Adherence to structured programming principles results in code that has a much higher probability of working correctly. Well-written code also has fewer branch statements, so the percentage of data processing statements relative to branch statements is higher. High data processing density results in higher throughput of data. In other words, writing code in a structured manner leads to higher efficiency.
Sequencing simply means executing statements (or instructions) in a linear sequence. When statement n is completed, statement n + 1 will be executed next. Uninterrupted sequences of statements form basic blocks. Basic blocks have exactly one entry point and one exit point. Flow control is used to select which basic block should be executed next.
The first control structure that we will examine is the basic selection construct. It is called selection because it selects one of the two (or possibly more) blocks of code to execute, based on some condition. In its most general form, the condition could be computed in a variety of ways, but most commonly it is the result of some comparison operation or the result of evaluating a Boolean expression.
Most languages support selection in the form of an if-then-else statement. Selection can be implemented very easily in ARM assembly language with a two-stage process:
1. perform an operation that updates the CPSR flags, and
2. use conditional execution to select a block of instructions to execute.
Because the ARM architecture supports conditional execution on almost every instruction, there are two basic ways to implement this control structure: by using conditional execution on all instructions in a block, or by using branch instructions. The conditional execution can be applied directly to instructions following the flag update, or to branch instructions that transfer execution to another location. Listing 5.1 shows a typical if-then-else statement in C.

Listing 5.2 shows the ARM code equivalent to Listing 5.1, using conditional execution. The then and else are written with one instruction each on lines 7 and 8. The then section is written as a conditional instruction with the lt condition attached. The else section is a single instruction with the opposite (ge) condition. Therefore only one of the two instructions will actually execute, depending on the results of the cmp instruction. If there are three or fewer instructions in each block that can be selected, then this is the preferred and most efficient method of writing the bodies of the then and else selections.

Listing 5.3 shows the ARM code equivalent to Listing 5.1, using branch instructions. Note that this method requires a conditional branch, an unconditional branch, and two labels. If there are more than three instructions in either basic block, then this is the preferred and most efficient method of writing the bodies of the then and else selections.

More complex selection structures should be written with care. Listing 5.4 shows a fragment of C code which compares the variables a, b, and c, and sets the variable x to the least of the three values. In C, Boolean expressions use short-circuit evaluation. For example, consider the Boolean AND operator in the expression ((a<b)&&(a<c)). If the first sub-expression evaluates to false, then the truth value of the complete expression can be immediately determined to be false, so the second sub-expression is not evaluated. This usually results in the compiler generating very efficient assembly code. Good programmers can take advantage of short-circuiting by checking array bounds early in a Boolean expression and accessing array elements later in the expression. For example, the expression ((i<15)&&(array[i]<0)) makes sure that the index i is less than 15 before attempting to access the array. If the index is greater than 14, the array access will not take place. This prevents the program from attempting to access the 16th element of an array that has only 15 elements.

Listing 5.5 shows an ARM assembly code fragment which is equivalent to Listing 5.4. In this code fragment, r0 is used to store a temporary value for the variable x, and the value is only stored to memory once at the end of the fragment of code. The outer if-then-else statement is implemented using branch instructions. The first comparison is performed on line 8. If the comparison evaluates to false, then it immediately branches to the else block of the outer if-then-else statement. But if the first comparison evaluates to true, then it performs the second comparison. Again, if that comparison evaluates to false, then it branches to the else block of the outer if-then-else statement. If both comparisons evaluate to true, then it executes the then block of the outer if-then-else statement, and then branches to the statement following the else block.

The if-then-else statement on line 5 of Listing 5.4 is implemented using conditional execution. The comparison is performed on line 13 of Listing 5.5. Lines 14 and 15 contain instructions that are conditionally executed. Since they have complementary conditions, it is guaranteed that one of them will move a value into r0. The comparison on line 13 determines which statement executes.
Note that the number of comparisons performed will always be minimized, and the number of branches has also been minimized. The only way that line 13 can be reached is if one of the first two comparisons evaluates to false. If line 2 is executed, then no matter which sequence of events occurs, the program fragment will always reach line 16 and a value will be stored in x. Thus, the ARM assembly code fragment in Listing 5.5 can be considered to be a block of code with exactly one entry point and one exit point.
When writing nested selection structures, it is important to maintain a block structure, even if the bodies of the blocks consist of only a single instruction. It is often very helpful to write the algorithm in pseudo-code or a high-level language, such as C or Java, before converting it to assembly. Prolific commenting of the code is also strongly encouraged.
Iteration involves the transfer of control from a statement in a sequence to a previous statement in the sequence. The simplest type of iteration is the unconditional loop, also known as the infinite loop. This type of loop may be used in programs or tasks that should continue running indefinitely. Listing 5.6 shows an ARM assembly fragment containing an unconditional loop. Few high-level languages provide a true unconditional loop, but the high-level programmer can achieve a similar effect by using a conditional loop and specifying a condition that always evaluates to true.

A pre-test loop is a loop in which a test is performed before the block of instructions forming the loop body is executed. If the test evaluates to true, then the loop body is executed. The last instruction in the loop body is a branch back to the beginning of the test. If the test evaluates to false, then execution branches to the first instruction following the loop body. All structured programming languages have a pre-test loop construct. For example, in C, the pre-test loop is called a while loop. In assembly, a pre-test loop is constructed very similarly to an if-then statement. The only difference is that it includes an additional branch instruction at the end of the sequence of instructions that form the body. Listing 5.7 shows a pre-test loop in ARM assembly.

In a post-test loop, the test is performed after the loop body is executed. If the test evaluates to true, then execution branches to the first instruction in the loop body. Otherwise, execution continues sequentially. Most structured programming languages have a post-test loop construct. For example, in C, the post-test loop is called a do-while loop. Listing 5.8 shows a post-test loop in ARM assembly. The body of a post-test loop will always be executed at least once.

Many structured programming languages have a for loop construct, which is a type of counting loop. The for loop is not essential, and is only included as a matter of syntactical convenience. In some cases, a for loop is easier to write and understand than an equivalent pre-test or post-test loop. However, with the addition of an if-then construct, any loop can be implemented as a pre-test loop. The following sections show how loops can be converted from one form to another.
Listing 5.9 shows a simple C program with a for loop. The program prints “Hello World” 10 times, appending an integer to the end of each line.

In order to write an equivalent program in assembly, the programmer must first rewrite the for loop as a pre-test loop. Listing 5.10 shows the program rewritten so that it is easier to translate into assembly. Note that the initialization of the loop variable has been moved to its own line before the while statement. Also, the loop variable is modified on the last line of the loop body. This is a straightforward conversion from one type of loop to another type. Listing 5.11 shows a translation of the pre-test loop structure into ARM assembly.


If the programmer can guarantee that the body of a for loop will always execute at least once, then the for loop can be converted to an equivalent post-test loop. This form of loop is more efficient, because the loop control variable is tested one fewer time than in a pre-test loop. Also, a post-test loop requires only one label and one conditional branch instruction, whereas a pre-test loop requires two labels, a conditional branch, and an unconditional branch.
Since the loop in Listing 5.9 always executes the body exactly 10 times, we know that the body will always execute at least once. Therefore, the loop can be converted to a post-test loop. Listing 5.12 shows the program rewritten as a post-test loop so that it is easier to translate into assembly. Note that, as in the previous example, the initialization of the loop variable has been moved to its own line before the do-while loop, and the loop variable is modified on the last line of the loop body. This post-test version will produce the same output as the pre-test version. This is a straightforward conversion from one type of loop to an equivalent type. Listing 5.13 shows a straightforward translation of the post-test loop structure into ARM assembly.


A subroutine is a sequence of instructions to perform a specific task, packaged as a single unit. Depending on the particular programming language, a subroutine may be called a procedure, a function, a routine, a method, a subprogram, or some other name. Some languages, such as Pascal, make a distinction between functions and procedures. A function must return a value and must not alter its input arguments or have any other side effects (such as producing output or changing static or global variables). A procedure returns no value, but may alter the value of its arguments or have other side effects.
Other languages, such as C, make no distinction between procedures and functions. In these languages, functions may be described as pure or impure. A function is pure if:
1. the function always evaluates the same result value when given the same argument value(s), and
2. evaluation of the result does not cause any semantically observable side effect or output.
The first condition implies that the result of the function cannot depend on any hidden information or state that may change as program execution proceeds, or between different executions of the program, nor can it depend on any external input from I/O devices. The result value of a pure function does not depend on anything other than the argument values. If the function returns multiple result values, then these two conditions must apply to all returned values. Otherwise the function is impure. Another way to state this is that impure functions have side effects while pure functions have no side effects.
Assembly language does not impose any distinction between procedures and functions, pure or impure. Although every assembly language will provide a way to call subroutines and return from them, it is up to the programmer to decide how to pass arguments to the subroutines and how to pass return values back to the section of code that called the subroutine. Once again, the expert assembly programmer will use structured programming concepts to write efficient, readable, debuggable, and maintainable code.
Subroutines help programmers to design reliable programs by decomposing a large problem into a set of smaller problems. It is much easier to write and debug a set of small code pieces than it is to work on one large piece of code. Careful use of subroutines will often substantially reduce the cost of developing and maintaining a large program, while increasing its quality and reliability. The advantages of breaking a program into subroutines include:
• enabling reuse of code across multiple programs,
• reducing duplicate code within a program,
• enabling the programming task to be divided between several programmers or teams,
• decomposing a complex programming task into simpler steps that are easier to write, understand, and maintain,
• enabling the programming task to be divided into stages of development, to match various stages of a project, and
• hiding implementation details from users of the subroutine (a programming principle known as information hiding).
There are two minor disadvantages in using subroutines. First, invoking a subroutine (versus using in-line code) imposes overhead. The arguments to the subroutine must be put into some known location where the subroutine can find them. If the subroutine is a function, then the return value must be put into a known location where the caller can find it. Also, a subroutine typically requires some standard entry and exit code to manage the stack and save and restore the return address.
In most languages, the cost of using subroutines is hidden from the programmer. In assembly, however, the programmer is often painfully aware of the cost, since they have to explicitly write the entry and exit code for each subroutine, and must explicitly write the instructions to pass the data into the subroutine. However, the advantages usually outweigh the costs. Assembly programs can get very large and failure to modularize the code by using subroutines will result in code that cannot be understood or debugged, much less maintained and extended.
Subroutines may be defined within a program, or a set of subroutines may be packaged together in a library. Libraries of subroutines may be used by multiple programs, and most languages provide some built-in library functions. The C language has a very large set of functions in the C standard library. All of the functions in the C standard library are available to any program that has been linked with the C standard library. Even assembly programs can make use of this library. Linking is done automatically when gcc is used to assemble the program source. All that the programmer needs to know is the name of the function and how to pass arguments to it.
Listing 5.14 shows a very simple C program which reads an integer from standard input using scanf and prints the integer to standard output using printf. An equivalent program written in ARM assembly is shown in Listing 5.15. These examples show how arguments can be passed to subroutines in C and equivalently in assembly language.


All processor families have their own standard methods, or function calling conventions, which specify how arguments are passed to subroutines and how function values are returned. The function call standard allows programmers to write subroutines and libraries of subroutines that can be called by other programmers. In most cases, the function calling standards are not enforced by hardware, but assembly programmers and compiler writers conform to the standards in order to make their code accessible to other programmers. The basic subroutine calling rules for the ARM processor are simple:
• The first four arguments go in registers r0-r3.
• Any remaining arguments are pushed to the stack.
If the subroutine returns a value, then it is stored in r0 before the function returns to its caller. Calling a subroutine in ARM assembly usually requires several lines of code. The number of lines required depends on how many arguments the subroutine requires and where the data for those arguments are stored. Some variables may already be in the correct register. Others may need to be moved from one register to another. Still others may need to be pushed onto the stack. Careful programming is required to minimize the amount of work that must be done just to move the subroutine arguments into their required locations.
The ARM register set was introduced in Chapter 3. Some registers have special purposes that are dictated by the hardware design. Others have special purposes that are dictated by programming conventions. Programmers follow these conventions so that their subroutines are compatible with each other. These conventions are simply a set of rules for how registers should be used. In ARM assembly, all registers have alternate names which can be used to help remember the rules for using them. Fig. 5.1 shows an expanded view of the ARM registers, including their alternate names and conventional use.

Registers r0-r3 are also known as a1-a4, because they are used for passing arguments to subroutines. Registers r4-r11 are also known as v1-v8, because they are used for holding local variables in a subroutine. As mentioned in Section 3.2, register r11 can also be referred to as fp because it is used by the C compiler to track the stack frame, unless the code is compiled using the -fomit-frame-pointer command line option.
The intra-procedure scratch register, r12, is used by the C library when calling dynamically linked functions. If a subroutine does not call any C library functions, then it can use r12 as another register to store local variables. If a C library function is called, it may change the contents of r12. Therefore, if r12 is being used to store a local variable, it should be saved to another register or to the stack before a C library function is called.
The stack pointer (sp), link register (lr), and program counter (pc), along with the argument registers, are all involved in performing subroutine calls. The calling subroutine must place arguments in the argument registers, and possibly on the stack as well. Placing the arguments in their proper locations is known as marshaling the arguments. After marshaling the arguments, the calling subroutine executes the bl instruction, which modifies the program counter and link register. The bl instruction copies the return address (the address of the instruction following the bl) into the link register, then loads the program counter with the address of the first instruction in the subroutine that is being called. The CPU will then fetch and execute its next instruction from the address in the program counter, which is the first instruction of the subroutine that is being called.
Our first examples of calling a function will involve the printf function from the C standard library. The printf function can be a bit confusing at first, but it is an extremely useful and flexible function for printing formatted output. The printf function examines its first argument to determine how many other arguments have been passed to it. The first argument is a format string, which is a null-terminated ASCII string. The format string may include conversion specifiers, which start with the % character. For each conversion specifier, printf assumes that an argument has been passed in the correct register or location on the stack. The argument is retrieved, converted according to the specified format, and printed. The %d specifier prints the matching argument as an integer in base 10. Other specifiers include %X to print the matching argument as an integer in hexadecimal, %c to print the matching argument as an ASCII character, and %s to print a zero-terminated string. The integer specifiers can include an optional width and zero-padding specification. For example, %8X will print an integer in hexadecimal using 8 characters, padding the number on the left with spaces. The specifier %08X will also print an integer in hexadecimal using 8 characters, but will pad the number on the left with zeros. Similarly, %15d can be used to print an integer in base 10 using spaces to pad the number up to 15 characters, while %015d will print an integer in base 10 using zeros to pad up to 15 characters.
Listing 5.16 shows a call to printf in C. The printf function requires one argument, and can accept more than one. In this case, there is only one argument, the format string. Listing 5.17 shows an equivalent call made in ARM assembly language. The single argument is loaded into r0 in conformance with the ARM subroutine calling convention.


Listing 5.18 shows a call to printf in C having four arguments. The format string is the first argument. The format string contains three conversion specifiers, and is followed by three more arguments. Arguments are matched to conversion specifiers according to their positions. The type of each argument matches the type indicated in the conversion specifier. The first conversion specifier is applied to the second argument, the second conversion specifier is applied to the third argument, and the third conversion specifier is applied to the fourth argument. The %d conversion specifiers indicate that the arguments are to be interpreted as integers and printed in base 10.

Listing 5.19 shows an equivalent call made in ARM assembly language. The arguments are loaded into r0-r3 in conformance with the ARM subroutine calling convention. Note that we assume that formatstr has previously been defined using a .asciz or .string assembler directive or equivalent method. As long as there are four or fewer arguments that must be passed, they can all fit in registers r0-r3 (a.k.a. a1-a4), but when there are more arguments, things become a little more complicated. Any remaining arguments must be passed on the program stack, using the stack pointer r13. Care must be taken to ensure that the arguments are pushed to the stack in the proper order. Also, after the function call, the arguments must be removed from the stack, so that the stack pointer is restored to its original value.

Listing 5.20 shows a call to printf in C having more than four arguments. The format string is the first argument. The format string contains five conversion specifiers, which implies that the format string must be followed by five additional arguments. Arguments are matched to conversion specifiers according to their positions. The type of each argument matches the type indicated in the conversion specifier. The first conversion specifier is applied to the second argument, the second conversion specifier is applied to the third argument, the third conversion specifier is applied to the fourth argument, etc. The %d conversion specifiers indicate that the arguments are to be interpreted as integers and printed in base 10.

Listing 5.21 shows an equivalent call made in ARM assembly language. Since there are six arguments, the last two must be pushed to the program stack. Each of these arguments is loaded into r0 in turn, and then the register pre-indexed addressing mode is used to subtract four bytes from the stack pointer and store the argument at the new top of the stack. Note that the sixth argument is pushed to the stack first, followed by the fifth argument. The remaining arguments are loaded into r0-r3. Note that we assume that formatstr has previously been defined using a .asciz or .string assembler directive.

Listing 5.22 shows how the fifth and sixth arguments can be pushed to the stack using a single stmfd instruction. The sixth argument is loaded into r3 and the fifth argument is loaded into r0, then the stmfd instruction is used to store them on the stack and adjust the stack pointer. A little care must be taken to ensure that the arguments are stored in the correct order on the stack. Remember that the stmfd instruction will always push the lowest-numbered register to the lowest address, and the stack grows downward. Therefore, r3, the sixth argument, will be pushed onto the stack first, making it grow downward by four bytes. Next, r0 is pushed, making the stack grow downward by four more bytes. As in the previous example, the remaining four arguments are loaded into a1-a4.

After the printf function is called, the fifth and sixth arguments must be popped from the stack. If those values are no longer needed, then there is no need to load them into registers. The quickest way to pop them from the stack is to simply adjust the stack pointer back to its original value. In this case, we pushed two arguments onto the stack, using a total of eight bytes. Therefore, all we need to do is add eight to the stack pointer, thereby restoring its original value.
We have looked at the conventions that are followed for calling functions. Now we will examine these same conventions from the point of view of the function being called. Because of the calling conventions, the programmer writing a function can assume that
• the first four arguments are in r0-r3,
• any additional arguments can be accessed with ldr rd,[sp,# offset ],
• the calling function will remove arguments from the stack, if necessary,
• if the function return type is not void, then the return value must be placed in r0 (and possibly r1, r2, and r3), and
• the return address will be in lr.
Also because of the conventions, there are certain registers that can be used freely while others must be preserved or restored so that the calling function can continue operating correctly. Registers which can be used freely are referred to as volatile, and registers which must be preserved or restored before returning are referred to as non-volatile. When writing a subroutine (function),
• registers r0-r3 and r12 are volatile,
• registers r4-r11 and r13 are non-volatile (they can be used, but their contents must be restored to their original value before the function returns),
• register r14 can be used by the function, but its contents must be saved so that the return address can be loaded into r15 when the function returns to its caller,
• if the function calls another function, then it must save register r14 either on the stack or in a non-volatile register before making the call.
Listing 5.23 shows a small C function that simply returns the sum of its six arguments. The ARM assembly version of that function is shown in Listing 5.24. Note that on line 5, the fifth argument is loaded from the stack, and on line 7, the sixth argument is loaded in a similar way, using an offset from the stack pointer. If the calling function has followed the conventions, then the fifth and sixth arguments will be where they are expected to be in relation to the stack pointer.


In block-structured high-level languages, an automatic variable is a variable that is local to a block of code and not declared with static duration. It has a lifetime that lasts only as long as its block is executing. Automatic variables can be stored in one of two ways:
1. the stack is temporarily adjusted to hold the variable, or
2. the variable is held in a register during its entire life.
When writing a subroutine in assembly, it is the responsibility of the programmer to decide what automatic variables are required and where they will be stored. In high-level languages this decision is usually made by the compiler. In some languages, including C, it is possible to request that an automatic variable be held in a register. The compiler will attempt to comply with the request, but it is not guaranteed. Listing 5.25 shows a small function which requests that one of its variables be kept in a register instead of on the stack.

Listing 5.26 shows how the function could be implemented in assembly. Note that the array of integers consumes 80 bytes of storage on the stack, and could not possibly fit into the registers available on the ARM processor. However, the loop control variable can easily be stored in one of the registers for the duration of the function. Also notice that on line 1 the storage for the array is allocated simply by adjusting the stack pointer, and on line 9 the storage is released by restoring the stack pointer to its original contents. It is critical that the stack pointer be restored, no matter how the function returns. Otherwise, the calling function will probably mysteriously fail. For this reason, each function should have exactly one block of instructions for returning. If the function needs to return from some location other than the end, then it should branch to the return block rather than returning directly.

A function that calls itself is said to be recursive. Certain problems are easy to implement recursively, but are more difficult to solve iteratively. A problem exhibits recursive behavior when it can be defined by two properties:
1. a simple base case (or cases), and
2. a set of rules that reduce all other cases toward the base case.
For example, we can define a person’s ancestors recursively as follows:
1. one’s parents are one’s ancestors (base case),
2. the ancestors of one’s ancestors are also one’s ancestors (recursion step).
Recursion is a very powerful concept in programming. Many functions are naturally recursive, and can be expressed very concisely in a recursive way. Numerous mathematical axioms are based upon recursive rules. For example, the formal definition of the natural numbers by the Peano axioms can be formulated as:
1. zero is a natural number (base case), and
2. each natural number has a successor, which is also a natural number (recursion step).
Using one base case and one recursive rule, it is possible to generate the set of all natural numbers. Other recursively defined mathematical objects include functions and sets.
Listing 5.27 shows the C code for a small program which uses recursion to reverse the order of characters in a string. The base case where recursion ends is when there are fewer than two characters remaining to be swapped. The recursive rule is that the reverse of a string can be created by swapping the first and last characters and then reversing the string between them. In short, a string is reversed if:

1. the string has a length of zero or one character, or
2. the first and last characters have been swapped and the remaining characters have been reversed.
In Listing 5.27, line 3 checks for the base case. If the string has not been reversed according to the first rule, then the second rule is applied. Lines 5–7 swap the first and last characters, and line 8 recursively reverses the characters between them.
Listing 5.28 shows how the reverse function can be implemented using recursion in ARM assembly. Line 1 saves the link register to the stack and decrements the stack pointer. Next, storage is allocated for an automatic variable. Lines 3 and 4 test for the base case. If the current case is the base case, then the function simply returns (restoring the stack as it goes). Otherwise, the first and last characters are swapped in lines 5 through 10 and a recursive call is made in lines 11 through 13.

The code in Listing 5.28 can be made a bit more efficient. First, the test for the base case can be performed before anything else is done, as shown in Listing 5.29. Also, the local variable tmp can be stored in a volatile register rather than stored on the stack, because it is only needed for lines 4 through 8. It is not needed after the recursive call, so there is really no need to preserve it on the stack. This means that our function can use half as much stack space and will run much faster. This further refined version is shown in Listing 5.30. This version uses ip (r12) as the tmp variable instead of using the stack.


The previous examples used the concept of an array of characters to access the string that is being reversed. Listing 5.31 shows how this problem can be solved in C using pointers to the first and last characters rather than array indices. This version only has two parameters in the reverse function, and uses pointer dereferencing rather than array indexing to access each character. Other than that difference, it works the same as the original version. Listing 5.32 shows how the reverse function can be implemented efficiently in ARM assembly. This implementation has the same number of instructions as the previous version, but lines 4 through 7 use a different addressing mode. On the ARM processor, the pointer method and the array index method are equally efficient. However, many processors do not have the rich set of addressing modes available on the ARM. On those processors, the pointer method may be significantly more efficient.


An aggregate data item can be referenced as a single entity, yet consists of more than one piece of data. Aggregate data types keep related data together, making the programmer’s job easier. Some examples of aggregate data are arrays, structures or records, and objects. In most programming languages, aggregate data types can be defined to create higher-level structures. Most high-level languages allow aggregates to be composed of basic types as well as other aggregates. Proper use of structured data helps to make programs less complicated and easier to understand and maintain.
In high-level languages, there are several benefits to using aggregates. Aggregates make the relationships between data clear, and allow the programmer to perform operations on blocks of data. Aggregates also make passing parameters to functions simpler and easier to read.
The most common aggregate data type is an array. An array contains zero or more values of the same data type, such as characters, integers, floating point numbers, or fixed point numbers. An array may also contain values of another aggregate data type. Every element in an array must have the same type. Each data item in an array can be accessed by its array index.
Listing 5.33 shows how an array can be allocated and initialized in C. Listing 5.34 shows the equivalent code in ARM assembly. Note that in this case, the scaled register offset addressing mode was used to access each element in the array. This mode is often convenient when the size of each element in the array is an integer power of 2. If that is not the case, then it may be necessary to use a different addressing mode. An example of this will be given in Section 5.5.3.


The second common aggregate data type is implemented as the struct in C or the record in Pascal. It is commonly referred to as a structured data type or a record. This data type can contain multiple fields. The individual fields in the structured data may also be referred to as structured data elements, or simply elements. In most high-level languages, each element of a structured data type may be one of the base types, an array type, or another structured data type. Listing 5.35 shows how a struct can be declared, allocated, and initialized in C. Listing 5.36 shows the equivalent code in ARM assembly.


Care must be taken using assembly to access data structures that were declared in higher level languages such as C and C++. The compiler will typically pad a data structure to ensure that the data fields are aligned for efficiency. On most systems, it is more efficient for the processor to access word-sized data if the data is aligned to a word boundary. Some processors simply cannot load or store a word from an address that is not on a word boundary, and attempting to do so will result in an exception. The assembly programmer must somehow determine the relative address of each field within the higher-level language structure. One way that this can be accomplished in C is by writing a small function which prints out the offsets to each field in the C structure. The offsets can then be used to access the fields of the structure from assembly language. Another method for finding the offsets is to run the program under a debugger and examine the data structure.
It is often useful to create arrays of structured data. For example, a color image may be represented as a two-dimensional array of pixels, where each pixel consists of three integers which specify the amount of red, green, and blue that are present in the pixel. Typically, each of the three values is represented using an unsigned eight bit integer. Image processing software often adds a fourth value, α, specifying the transparency of each pixel.
Listing 5.37 shows how an array of pixels can be allocated and initialized in C. The listing uses the malloc() function from the C standard library to allocate storage for the pixels from the heap (see Section 1.4). Note that the code uses the sizeof operator to determine how many bytes of memory are consumed by a single pixel, then multiplies that by the width and height of the image. Listing 5.38 shows the equivalent code in ARM assembly.


Note that the code in Listing 5.38 is far from optimal. It can be greatly improved by combining the two loops into one loop. This will remove the need for the multiply on line 28 and the addition on line 29, and will simplify the code structure. An additional improvement would be to increment the single loop counter by three on each loop iteration, making it very easy to calculate the pointer for each pixel. Listing 5.39 shows the ARM assembly implementation with these optimizations.

Although the implementation shown in Listing 5.39 is more efficient than the previous version, there are several more improvements that can be made. If we consider that the goal of the code is to allocate some number of bytes and initialize them all to zero, then the code can be written more efficiently. Rather than using three separate store instructions to set 3 bytes to zero on each iteration of the loop, why not use a single store instruction to set four bytes to zero on each iteration? The only problem with this approach is that we must consider the possibility that the array may end in the middle of a word. However, this can be dealt with by using two consecutive loops. The first loop sets one word of the array to zero on each iteration, and the second loop finishes off any remaining bytes. Listing 5.40 shows the results of these additional improvements. This third implementation will run much faster than the previous implementations.

Spaghetti code is the bane of assembly programming, but it can easily be avoided. Although assembly language does not enforce structured programming, it does provide the low-level mechanisms required to write structured programs. The assembly programmer must be aware of, and assiduously practice, proper structured programming techniques. The burden of writing properly structured code blocks, with selection structures and iteration structures, lies with the programmer, and failure to apply structured programming techniques will result in code that is difficult to understand, debug, and maintain.
Subroutines provide a way to split programs into smaller parts, each of which can be written and debugged individually. This allows large projects to be divided among team members. In assembly language, defining and using subroutines is not as easy as in higher level languages. However, the benefits usually outweigh the costs. The C library provides a large number of functions. These can be accessed by an assembly program as long as it is linked with the C standard library.
Assembly provides the mechanisms to access aggregate data types. Arrays can be accessed using various addressing modes on the ARM processor. The pre-indexing and post-indexing modes allow array elements to be accessed using pointers, with the pointers being incremented after each element access. Fields in structured data records can be accessed using immediate offset addressing mode. The rich set of addressing modes available on the ARM processor allows the programmer to use aggregate data types more efficiently than on most processors.
5.1 What does it mean for a register to be volatile? Which ARM registers are considered volatile according to the ARM function calling convention?
5.2 Fully explain the differences between static variables and automatic variables.
5.3 In ARM assembly language, write a function that is equivalent to the following C function.

5.4 What are the two places where an automatic variable can be stored?
5.5 You are writing a function and you decided to use registers r4 and r5 within the function. Your function will not call any other functions; it is self-contained. Modify the following skeleton structure to ensure that r4 and r5 can be used within the function and are restored to comply with the ARM standards, but without unnecessary memory accesses.

5.6 Convert the following C program to ARM assembly, using a post-test loop:

5.7 Write a complete ARM function to shift a 64-bit value left by any given amount between 0 and 63 bits. The function should expect its arguments to be in registers r0, r1, and r2. The lower 32 bits of the value are passed in r0, the upper 32 bits of the value are passed in r1, and the shift amount is passed in r2.
5.8 Write a complete subroutine in ARM assembly that is equivalent to the following C subroutine.

5.9 Write a complete function in ARM assembly that is equivalent to the following C function.


5.10 Write an ARM assembly function to calculate the average of an array of integers, given a pointer to the array and the number of items in the array. Your assembly function must implement the following C function prototype:
Assume that the processor does not support the div instruction, but there is a function available to divide two integers. You do not have to write this function, but you may need to call it. Its C prototype is:
5.11 Write a complete function in ARM assembly that is equivalent to the following C function. Note that a and b must be allocated on the stack, and their addresses must be passed to scanf so that it can place their values into memory.

5.12 The factorial function can be defined as:
The following C program repeatedly reads x from the user and calculates x!. It quits when it reads end-of-file, or when the user enters a negative number or something that is not an integer.
Write this program in ARM assembly.

5.13 For large x, the factorial function is slow. However, a lookup table can be added to the function to improve average performance. This technique is commonly known as memoization or tabling, but is sometimes called dynamic programming. The following C implementation of the factorial function uses memoization. Modify your ARM assembly program from the previous problem to include memoization.


This chapter extends the coverage of structured programming to include abstract data types (ADT). It begins by defining the abstract data type and giving a small example of an ADT that could be used to read, process, and write Netpbm images. The next section introduces an ADT written in C to perform word frequency counts, and shows how performance can be greatly improved by using better algorithms and/or by writing some functions in assembly language. It also shows how a binary tree structure created by C code can be traversed in assembly language. The chapter ends with an ethics module about the Therac-25 cancer treatment device.
Abstract data type; Word frequency count; Binary tree; Index; Sort; Ethics
An abstract data type (ADT) is composed of data and the operations that work on that data. The ADT is one of the cornerstones of structured programming. Proper use of ADTs has many benefits. Most importantly, abstract data types help to support information hiding. A software module hides information by encapsulating the information into a module or other construct which presents an interface. The interface typically consists of the names of data types provided by the ADT and a set of subroutine definitions, or prototypes, for operating on the data types. The implementation of the ADT is hidden from the client code that uses the ADT.
A common use of information hiding is to hide the physical storage layout for data so that if it is changed, the change is restricted to a small subset of the total program. For example, if a three-dimensional point (x,y,z) is represented in a program with three floating point scalar variables, and the representation is later changed to a single array variable of size three, a module designed with information hiding in mind would protect the remainder of the program from such a change.
Information hiding reduces software development risk by shifting the code’s dependency on an uncertain implementation onto a well-defined interface. Clients of the interface perform operations purely through the interface, which does not change. If the implementation changes, the client code does not have to change.
Encapsulating software and data structures behind an interface allows the construction of objects that mimic the behavior and interactions of objects in the real world. For example, a simple digital alarm clock is a real-world object that most people can use and understand. They can understand what the alarm clock does, and how to use it through the provided interface (buttons and display) without needing to understand every part inside of the clock. If the internal circuitry of the clock were to be replaced with a different implementation, people could continue to use it in the same way, provided that the interface did not change.
As with all other structured programming concepts, ADTs can be implemented in assembly language. In fact, most high-level compilers convert structured programming code into assembly during compilation. All that is required is that the programmer define the data structure(s), and the set of operations that can be used on the data. Listing 6.1 gives an example of an ADT interface in C. The type Image is not fully defined in the interface. This prevents client software from accessing the internal structure of the image data type. Therefore, programmers using the ADT can modify images only by using the provided functions. Other structured programming and object-oriented programming languages such as C++, Java, Pascal, and Modula 2 provide similar protection for data structures so that client code can access the data structure only through the provided interface. Note that only the pval definition is exposed, indicating to client programs that the red, green, and blue components of a pixel must be a number between 0 and 255. In C, as with other structured programming languages, the implementation of the subroutines can also be hidden by placing them in separate compilation modules. Those modules will have access to the internal structure of the Image data type.

Assembly language does not have the ability to define a data structure as such, but it does provide the mechanisms needed to specify the location of each field with respect to the beginning of a data structure, as well as the overall size of the data structure. With a little thought and effort, it is possible to implement ADTs in assembly language. Listing 6.2 shows the private implementation of the Image data type, which is included by the C files which implement the Image data type. Listing 6.3 shows how the data structures from the previous listings can be defined in assembly language. With those definitions, any of the functions declared in Listing 6.1 can be written in assembly language.


Counting the frequency of words in written text has several uses. In digital forensics, it can be used to provide evidence as to the author of written communications. Different people have different vocabularies, and use words with differing frequency. Word counts can also be used to classify documents by type. Scientific articles from different fields contain words specific to that field, and historical novels will differ from western novels in word frequency.
Listing 6.4 shows the main function for a simple C program which reads a text file and creates a list of all the words contained in the file, along with their frequency of occurrence. The program has been divided into two parts: the main program, and an ADT which is used to keep track of the words and their frequencies, and to print a table of word frequencies.


The interface for the ADT is shown in Listing 6.5. There are several ways that the ADT could be implemented. Note that the interface given in the header file does not show the internal fields of the word list data type. Thus, any file which includes this header is allowed to declare pointers to wordlist data types, but cannot access or modify any internal fields. The list of words could be stored in an array, a linked list, a binary tree, or some other data structure. The subroutines could be implemented in C or in some other language, including assembly. Listing 6.6 shows an implementation in C using a linked list. Note that the function for printing the word frequency list in numerical order has not been implemented. It will be written in assembly language. Since the program is split into multiple files, it is a good idea to use the make utility to build the executable program. A basic makefile is shown in Listing 6.7.





Suppose we wish to implement one of the functions from Listing 6.6 in ARM assembly language. We would delete the function from the C file, create a new file with the assembly version of the function, and modify the makefile so that the new file is included in the program. The header file and the main program file would not require any changes. The header file provides function prototypes that the C compiler uses to determine how parameters should be passed to the functions. As long as our new assembly function conforms to its C header definition, the program will work correctly.
The linked list is created in alphabetical order, but the wl_print_numerical() function is required to print it sorted in reverse order of number of occurrences. There are several ways in which this could be accomplished, with varying levels of efficiency. The possible approaches include, but are not limited to:
• Re-ordering the linked list using an insertion sort: This approach creates a complete new list by removing each item, one at a time, from the original list, and inserting it into a new list sorted by the number of occurrences rather than by the words themselves. The time complexity of this approach is O(N²), but it requires no additional storage. However, if the list were later needed in alphabetical order, or any more words were to be added, then it would need to be re-sorted into the original order.
• Sorting the linked list using a merge sort algorithm: Merge sort is one of the most efficient sorting algorithms known, and it can be applied naturally to data stored in files and linked lists. The merge sort works as follows:
1. The sub-list size, i, is set to 1.
2. The list is divided into sub-lists, each containing i elements. Each sub-list is assumed to be sorted. (A sub-list of length one is sorted by definition.)
3. The sub-lists are merged together to create a list of sub-lists of size 2i, where each sub-list is sorted.
4. The sub-list size, i, is set to 2i.
5. The process is repeated from step 2 until i ≥ N, where N is the number of items to be sorted.
The time complexity for the merge sort algorithm is O(N log N), which is far more efficient than the insertion sort. This approach would also require no additional storage. However, if the list were later needed in alphabetical order, or any more words were to be added, then it would need to be re-sorted into the original alphabetical order.
• Create an index, and sort the index rather than rebuilding the list: Since the number of elements in the list is known, we can allocate an array of pointers. Each pointer in the array is then initialized to point to one element in the linked list. The array forms an index, and the pointers in the array can be re-sorted into any desired order, using any common sorting method such as bubble sort (O(N²)), in-place insertion sort (O(N²)), quick sort (O(N log N) on average), or others. This approach requires additional storage, but has the advantage that it does not need to modify the original linked list.
There are many other possibilities for re-ordering the list. Regardless of which method is chosen, the main program and the interface (header file) need not be changed. Different implementations of the sorting function can be substituted without affecting any other code.
The wl_print_numerical() function can be implemented in assembly as shown in Listing 6.8. The function operates by re-ordering the linked list using an insertion sort as described above. Listing 6.9 shows the change that must be made to the makefile. Now, when make is run, it compiles the two C files and the assembly file into object files, then links them all together. The C implementation of wl_print_numerical() in list.c must be deleted or commented out, or the linker will emit an error indicating that it found two versions of wl_print_numerical().




The word frequency counter, as previously implemented, takes several minutes to count the frequency of words in the author’s manuscript for this textbook on a Raspberry Pi. Most of the time is spent building the list of words and re-sorting the list in order of word frequency. Most of the time for both of these operations is spent in searching for the word in the list before incrementing its count or inserting it in the list. There are more efficient ways to build ordered lists of data.
Since the code is well modularized using an ADT, the internal mechanism of the list can be modified without affecting the main program. A major improvement can be made by changing the data structure from a linked list to a binary tree. Fig. 6.1 shows an example binary tree storing word frequency counts. The time required to insert into a linked list is O(N), but the time required to insert into a binary tree is O(log N). To give some perspective, the author’s manuscript for this textbook contains about 125,000 words. Since log2(125,000) < 17, we would expect each linked list insertion to require on the order of 125,000/17 ≈ 7400 times as long as a binary tree insertion when processing the author’s manuscript for this textbook. In reality, there is some overhead to the binary tree implementation. Even with the extra overhead, we should see a significant speedup. Listing 6.10 shows the C implementation using a balanced binary tree instead of a linked list.








With the tree implementation, wl_print_numerical() could build a new tree, sorted on the word frequency counts. However, it may be more efficient to build a separate index, and sort the index by word frequency counts. The assembly code will allocate an array of pointers, and set each pointer to one of the nodes in the tree, as shown in Fig. 6.2. Then, it will use a quick sort to sort the pointers into descending order by word frequency count, as shown in Fig. 6.3. This implementation is shown in Listing 6.11.






The tree-based implementation gets most of its speed improvement by using two O(N log N) algorithms to replace O(N²) algorithms. These examples show how a small part of a program can be implemented in assembly language, and how to access C data structures from assembly language. The functions could just as easily have been written in C rather than assembly, without greatly affecting performance. Later chapters will show examples where the assembly implementation does have significantly better performance than the C implementation.
The Therac-25 was a device designed for radiation treatment of cancer. It was produced by Atomic Energy of Canada Limited (AECL), which had previously produced the Therac-6 and Therac-20 units in partnership with CGR of France. It was capable of treating tumors close to the skin surface using electron beam therapy, but could also be configured for Megavolt X-ray therapy to treat deeper tumors. The X-ray therapy required the use of a tungsten radiation shield to limit the area of the body that was exposed to the potentially lethal radiation produced by the device.
The Therac-25 used a double pass accelerator, which provided more power, in a smaller space, at less cost, compared to its predecessors. The second major innovation was that computer control was a central part of the design, rather than an add-on component as in its predecessors. Most of the hardware safety interlocks that were integral to the designs of the Therac-6 and Therac-20 were seen as unnecessary, because the software would perform those functions. Computer control was intended to allow operators to set up the machine more quickly, allowing them to spend more time communicating with patients and to treat more patients per day. It was also seen as a way to reduce production costs by relying on software, rather than hardware, safety interlocks.
There were design issues with both the software and the hardware. Although this machine was built with the goal of saving lives, between 1985 and 1986, three deaths and other injuries were attributed to the hardware and software design of this machine. Death due to radiation exposure is usually slow and painful, and the problem was not identified until the damage had been done.
AECL was required to obtain US Food and Drug Administration (FDA) approval before releasing the Therac-25 to the US market. They obtained approval quickly by declaring “pre-market equivalence,” effectively claiming that the new machine was not significantly different from its predecessors. This practice was common in 1984, but was overly optimistic, considering that most of the safety features had been changed from hardware to software implementations. With FDA approval, AECL made the Therac-25 commercially available and performed a Fault Tree Analysis to evaluate the safety of the device.
Fault Tree Analysis, as its name implies, requires building a tree to describe every possible fault and assigning probabilities to those faults. After building the tree, the probabilities of hazards, such as overdose, can be calculated. Unfortunately, the engineers assumed that the software (much of which was re-used from the previous Therac models) would operate correctly. This turned out not to be the case, because the hardware interlocks present in the previous models had hidden some of the software faults. The analysts did consider some possible computer faults, such as an error being caused by cosmic rays, but assigned extremely low probabilities to those faults. As a result, the assessment was very inaccurate.
When the first overdose was reported to AECL in 1985, they sent an engineer to the site to investigate. They also filed a report with the FDA and the Canadian Radiation Protection Board (CRPB). AECL also notified all users of the fact that there had been a report and recommended that operators should visually confirm hardware settings before each treatment. The AECL engineers were unable to reproduce the fault, but suspected that it was due to the design and placement of a microswitch. They redesigned the microswitch and modified all of the machines that had been deployed. They also retracted their recommendation that operators should visually confirm hardware settings before each treatment.
Later that year, a second incident occurred. In this case, there is no evidence that AECL took any action. In January of 1986, AECL received another incident report. An employee at AECL responded by denying that the Therac-25 was at fault, and stated that no other similar incidents had been reported. Another incident occurred in March of that year. AECL sent an engineer to investigate. The engineer was unable to determine the cause, and suggested that it was due to an electrical problem, which may have caused an electrical shock. An independent engineering firm was called to examine the machine and reported that it was very unlikely that the machine could have delivered an electrical shock to the patient. In April of 1986, another incident was reported. In this case, the AECL engineers, working with the medical physicist at the hospital, were able to reproduce the sequence of events that led to the overdose.
As required by law, AECL filed a report with the FDA. The FDA responded by declaring the Therac-25 defective. AECL was ordered to notify all of the sites where the Therac-25 was in use, investigate the problem, and file a corrective action plan. AECL notified all sites, and recommended removing certain keys from the keyboard on the machines. The FDA responded by requiring them to send another notification with more information about the defect and the consequent hazards. Later in 1986, AECL filed a revised corrective action plan.
Another overdose occurred in January 1987, and was attributed to a different software fault. In February, the FDA and CRPB both ordered that all Therac-25 units be shut down, pending effective and permanent modifications. AECL spent six months developing a new corrective action plan, which included a major overhaul of the software, the addition of mechanical safety interlocks, and other safety-related modifications.
The Therac-25 was controlled by a DEC PDP-11 computer, the most popular minicomputer ever produced. Around 600,000 were produced between 1970 and 1990 and used for a variety of purposes, including embedded systems, education, and general data processing. It was a 16-bit computer and was far less powerful than a Raspberry Pi. The Therac-25 computer was programmed in assembly language by one programmer, and the source code was not documented. Documentation for the hardware components was written in French. After the faults were discovered, a commission concluded that the primary problems with the Therac-25 were attributable to poor software design practices, and not to any one of several specific coding errors. This is probably the best-known case in which poor overall software design and insufficient testing led to loss of life.
The worst problems in the design and engineering of the machine were:
• The code was not subjected to independent review.
• The software design was not considered during the assessment of how the machine could fail or malfunction.
• The operator could ignore malfunctions and cause the machine to proceed with treatment.
• The hardware and software were designed separately and not tested as a complete system until the unit was assembled at the hospitals where it was to be used.
• The design of the earlier Therac-6 and Therac-20 machines included hardware interlocks which would ensure that the X-ray mode could not be activated unless the tungsten radiation shield was in place. The hardware interlock was replaced with a software interlock in the Therac-25.
• Errors were displayed as numeric codes, and there was no indication of the severity of the error condition.
The operator interface consisted of a keyboard and text-mode monitor, which was common in the early 1980s. The interface had a data entry area in the middle of the screen and a command line at the bottom. The operator was required to enter parameters in the data entry area, then move the cursor to the command line to initiate treatment. When the operator moved the cursor to the command line, internal variables were updated and a flag variable was set to indicate that data entry was complete. That flag was cleared when a command was entered on the command line. If the operator moved the cursor back to the data entry area without entering a command, then the flag was not cleared, and any subsequent changes to the data entry area did not affect the internal variables.
A global variable was used to indicate that the magnets were currently being adjusted. This variable was modified by two functions, and did not always contain the correct value. Adjusting the magnets required about eight seconds, and the flag was correct for only a small period at the beginning of this time period.
Due to the errors in the design and implementation of the software, the following sequence of events could result in the machine causing injury to, or even the death of, the patient:
1. The operator mistakenly specified high-power mode during data entry.
2. The operator moved the cursor to the command line area.
3. The operator noticed the mistake, and moved the cursor back to the data entry area without entering a command.
4. The operator corrected the mistake and moved the cursor back to the command line.
5. The entire sequence of steps 2 through 4 occurred within the eight-second window required for adjusting the magnets.
If the above sequence occurred, then the operator screen could indicate that the machine was in low power mode, although it was actually set in high-power mode. During a final check before initiating the beam, the software would find that the magnets were set for high-power mode but the operator setting was for low power mode. It displayed a numeric error code and prevented the machine from starting. The operator could clear the error code by resetting the computer (which only required one key to be pressed on the keyboard). This caused the tungsten shield to withdraw but left the machine in X-ray mode. When the operator entered the command to start the beam, the machine could be in high-power mode without having the tungsten shield in place. X-rays were applied to the unprotected patient.
It took some time for this critical flaw to appear. The failure only occurred when the operator initially made a one-keystroke mistake in entering the prescription data, moved to the command area, and then corrected the mistake within eight seconds. Initially, operators were slow to enter data, and spent a lot of time making sure that the prescription was correct before initiating treatment. As they became more familiar with the machine, they were able to enter data and correct mistakes more quickly. Eventually, operators became familiar enough with the machine that they could enter data, make a correction, and return to the command area within the critical eight-second window. Also, the operators became familiar with the machine reporting numeric error codes without any indication of the severity of the code. The operators were given a table of codes and their meanings. The code reported was “no dose” and indicated “treatment pause.” There is no reason why the operator should consider that to be a serious problem; they had become accustomed to frequent malfunctions that did not have any consequences to the patient.
Although the code was written in assembly language, that fact was not cited as an important factor. The fundamental problems were poor software design and overconfidence. The reuse of code in an application for which it was not initially designed also may have contributed to the system flaws. A proper design using established software design principles, including structured programming and abstract data types, would almost certainly have avoided these fatalities.
The abstract data type is a structured programming concept which contributes to software reliability, eases maintenance, and allows for major revisions to be performed in a safe way. Many high-level languages enforce, or at least facilitate, the use of ADTs. Assembly language does not. However, the ethical assembly language programmer will make the extra effort to write code that conforms to the standards of structured programming and use abstract data types to help ensure safety, reliability, and maintainability.
ADTs also facilitate the implementation of software modules in more than one language. The interface specifies the components of the ADT, but not the implementation. The implementation can be in any language. As long as assembly programmers and compiler authors generate code that conforms to a well-known standard, their code can be linked with code written in other languages.
Poor coding practices and poor design can lead to dire consequences, including loss of life. It is the responsibility of the programmer, regardless of the language used, to make ethical decisions in the design and implementation of software. Above all, the programmer must be aware of the possible consequences of the decisions they make.
6.1 What are the advantages of designing software using abstract data types?
6.2 Why is the internal structure of the Pixel data type hidden from client code in Listing 6.2?
6.3 High-level languages provide mechanisms for information hiding, but assembly does not. Why should the assembly programmer not simply bypass all information hiding and access the internal data structures of any ADT directly?
6.4 The assembly code in wl_print_numerical() accesses the internal structure of the wordlistnode data type. Why is it allowed to do so? Should it be allowed to do so?
6.5 Given the following definitions for a stack ADT:


Write the InitStack() function in ARM assembly language.
6.6 Referring to the previous question, write the Push() function in ARM assembly language.
6.7 Referring to the previous two questions, write the Pop() function in ARM assembly language.
6.8 Referring to the previous three questions, write the Top() function in ARM assembly language.
6.9 Referring to the previous three questions, write the PrintStack() function in ARM assembly language.
6.10 Re-implement all of the previous stack functions using a linked list rather than a static array.
6.11 The “Software Engineering Code of Ethics And Professional Practice” states that a responsible software engineer should “Approve software only if they have well-founded belief that it is safe, meets specifications, passes appropriate tests…” (sub-principle 1.03) and “Ensure adequate testing, debugging, and review of software…on which they work.” (sub-principle 3.10). Unfortunately, defects did make their way into the system.
The software engineering code of ethics also states that a responsible software engineer should “Treat all forms of software maintenance with the same professionalism as new development.”
(a) Explain how the Software Engineering Code of Ethics And Professional Practice was violated by the Therac-25 developers.
(b) How should the engineers and managers at AECL have responded when problems were reported?
(c) What other ethical and non-ethical considerations may have contributed to the deaths and injuries?
Performance Mathematics
This chapter introduces the concept of high performance mathematics. The chapter starts by explaining basic math in bases other than 10. It explains subtraction using complement mathematics. Next it gives efficient algorithms for performing signed and unsigned multiplication in binary. It explains how multiplication by a constant can often be converted into a much more efficient sequence of shift and add or subtract operations, and gives a method for multiplying two arbitrarily large numbers. Next, an efficient algorithm is given for binary division, followed by a technique for converting division by a constant into multiplication by a related constant. The next section introduces an ADT, written in C, which can be used to perform basic mathematical operations on integers of any size. The chapter concludes by showing that the ADT can be made much more efficient by replacing some of the functions with assembly language implementations.
Addition; Subtraction; Complement; Multiplication; Division; Big integer; High performance; Abstract data type
There are some differences between the way calculations are performed in a computer versus the way most of us were taught as children. The first difference is that calculations are performed in binary instead of base ten. Another difference is that the computer is limited to a fixed number of binary digits, which raises the possibility of having a result that is too large to fit in the number of bits available. This occurrence is referred to as overflow. The third difference is that subtraction is performed using complement addition.
Addition in base b is very similar to base ten addition, except that the result of each column is limited to b − 1. For example, binary addition works exactly the same as decimal addition, except that the result of each column is limited to 0 or 1. The following figure shows an addition in base ten and the equivalent addition in base two.

The carry from one column to the next is shown as a small number above the column that it is being carried into. Note that carries from one column to the next are done the same way in both bases. The only difference is that there are more columns in the base two addition because it takes more digits to represent a number in binary than it does in decimal.
Finding the complement was explained in Section 1.3.3. Subtraction can be computed by adding the radix complement of the subtrahend to the minuend. Example 7.1 shows a complement subtraction with a positive result. When x < y, the result will be negative. In the complement method, this means that there will be a ‘1’ in the most significant bit, and in order to convert the result to base ten, we must take the radix complement. Example 7.2 shows complement subtraction with a negative result. Example 7.3 shows several more signed addition and subtraction operations in base ten and binary.
Many processors have hardware multiply instructions. However hardware multipliers require a large number of transistors, and consume significant power. Processors designed for extremely low power consumption or very small size usually do not implement a multiply instruction, or only provide multiply instructions that are limited to a small number of bits. On these systems, the programmer must implement multiplication using basic data processing instructions.
If the multiplier is a power of two, then multiplication can be accomplished with a shift to the left. Consider the 4-bit binary number x = x₃ × 2³ + x₂ × 2² + x₁ × 2¹ + x₀ × 2⁰, where xₙ denotes bit n of x. If x is shifted left by one bit, introducing a zero into the least significant bit, then it becomes
Therefore, a shift of one bit to the left is equivalent to multiplication by two. This argument can be extended to prove that a shift left by n bits is equivalent to multiplication by 2ⁿ.
Most techniques for binary multiplication involve computing a set of partial products and then summing the partial products together. This process is similar to the method taught to primary schoolchildren for conducting long multiplication on base ten integers, but has been modified here for application to binary. The method typically taught in school for multiplying decimal numbers is based on calculating partial products, shifting them to the left and then adding them together. The most difficult part is to obtain the partial products, as that involves multiplying a long number by one base ten digit. The following example shows how the partial products are formed when multiplying 123 by 456.

The first partial product can be written as 123 × 6 × 10⁰ = 738. The second is 123 × 5 × 10¹ = 6150, and the third is 123 × 4 × 10² = 49200. In practice, we usually leave out the trailing zeros. The procedure is the same in binary, but is simpler because the partial product involves multiplying a long number by a single base 2 digit. Since the multiplier is always either zero or one, the partial product is very easy to compute. The product of multiplying any binary number x by a single binary digit is always either 0 or x. Therefore, the multiplication of two binary numbers comes down to shifting the multiplicand left appropriately for each non-zero bit in the multiplier, and then adding the shifted numbers together.
Suppose we wish to multiply two four-bit numbers, 1011 and 1010:

Notice in the previous example that each partial sum is either zero or x shifted by some amount. A slightly quicker way to perform the multiplication is to leave out any partial sum which is zero. Example 7.4 shows the results of multiplying 101₁₀ by 89₁₀ in decimal and binary using this shorter method. For implementation in hardware and software, it is easier to accumulate the partial products, by adding each to a running sum, rather than building a circuit to add multiple binary numbers at once.
Binary multiplication can be implemented as a sequence of shift and add instructions. Given two registers, x and y, and an accumulator register a, the product of x and y can be computed using Algorithm 1. When applying the algorithm, it is important to remember that, in the general case, the result of multiplying an n bit number by an m bit number is (at most) an n + m bit number. For instance 11₂ × 11₂ = 1001₂. Therefore, when applying Algorithm 1, it is necessary to know the number of bits in x and y. Since x is shifted left on each iteration of the loop, the registers used to store x and a must both be at least as large as the number of bits in x plus the number of bits in y.

Assume we wish to multiply two numbers, x = 01101001 and y = 01011010. Applying Algorithm 1 results in the following sequence:
| a | x | y | Next operation |
| 0000000000000000 | 0000000001101001 | 01011010 | shift only |
| 0000000000000000 | 0000000011010010 | 00101101 | add, then shift |
| 0000000011010010 | 0000000110100100 | 00010110 | shift only |
| 0000000011010010 | 0000001101001000 | 00001011 | add, then shift |
| 0000010000011010 | 0000011010010000 | 00000101 | add, then shift |
| 0000101010101010 | 0000110100100000 | 00000010 | shift only |
| 0000101010101010 | 0001101001000000 | 00000001 | add, then shift |
| 0010010011101010 | 0011010010000000 | 00000000 | shift only |
| 105 × 90 = 9450 | |||

To multiply two n bit numbers, you must be able to add two 2n-bit numbers. On the ARM processor, n is usually assumed to be 32, because that is the natural word size for the ARM processor. Adding 64-bit numbers requires two add instructions, and the carry from the least-significant 32 bits must be added to the sum of the most-significant 32 bits. The ARM processor provides a convenient way to perform the add with carry. Assume we have two 64-bit numbers, x and y. We have x in r0, r1 and y in r2, r3, where the high-order words of each number are in the higher-numbered registers, and we want to calculate x = x + y. Listing 7.1 shows a two-instruction sequence for the ARM processor. The first instruction adds the two least-significant words together and sets (or clears) the carry bit and other flags in the CPSR. The second instruction adds the two most-significant words along with the carry bit.

On the ARM processor, the algorithm to multiply two 32-bit unsigned integers is very efficient. Listing 7.2 shows one possible algorithm for multiplying two 32-bit numbers to obtain a 64-bit result. The code is a straightforward implementation of the algorithm, and some modifications can be made to improve efficiency. For example, if we only want a 32-bit result, we do not need to perform 64-bit addition. This significantly simplifies the code, as shown in Listing 7.3.


If x or y is a constant, then a loop is not necessary. The multiplication can be directly translated into a sequence of shift and add operations. This will result in much more efficient code than the general algorithm. If we inspect the constant multiplier, we can usually find a pattern to exploit that will save a few instructions. For example, suppose we want to multiply a variable x by 10₁₀. The multiplier 10₁₀ = 1010₂, so we only need to add x shifted left 1 bit to x shifted left 3 bits as shown below:

Now suppose we want to multiply a number x by 11₁₀. The multiplier 11₁₀ = 1011₂, so we will add x to x shifted left one bit plus x shifted left 3 bits as in the following:

If we wish to multiply a number x by 1000₁₀, we note that 1000₁₀ = 1111101000₂. It looks like we need one shift plus five add/shift operations, or six add/shift operations. With a little thought, the number of operations can be reduced from six to five as shown below:

Applying the basic multiplication algorithm to multiply a number x by 255₁₀ would result in seven add/shift operations, but we can do it with only three operations and use only one register, as shown below:

Most modern systems have assembly language instructions for multiplication, but hardware multiply units require a relatively large number of transistors. For that reason, processors intended for small embedded applications often do not have a multiply instruction. Even when a hardware multiplier is available, on some processors it is often more efficient to use shift, add, and subtract operations when multiplying by a constant. The hardware multiplier units that are available on most ARM processors are very powerful. They can typically perform multiplication with a 32-bit result in as little as one clock cycle. The long multiply instructions take between three and five clock cycles, depending on the size of the operands. Using the multiply instruction on an ARM processor to multiply by a constant usually requires loading the constant into a register before performing the multiply. Therefore, if the multiplication can be performed using three or fewer shift, add, and subtract instructions, then it will be equal to or better than using the multiply instruction.
Consider the two multiplication problems shown in Figs. 7.1 and 7.2. Note that the result of a multiply depends on whether the numbers are interpreted as unsigned numbers or signed numbers. For this reason, most computer CPUs have two different multiply operations for signed and unsigned numbers.


If the CPU provides only an unsigned multiply, then a signed multiply can be accomplished by using the unsigned multiply operation along with a conditional complement. The following procedure can be used to implement signed multiplication.
1. if the multiplier is negative, take the two’s complement,
2. if the multiplicand is negative, take the two’s complement,
3. perform unsigned multiply, and
4. if the multiplier or multiplicand was negative (but not both), then take two’s complement of result.
Example 7.5 demonstrates this method using one negative number.
Consider the method used for multiplying two-digit numbers in base ten, using only the one-digit multiplication tables. Fig. 7.3 shows how a two-digit number a = a₁ × 10¹ + a₀ × 10⁰ is multiplied by another two-digit number b = b₁ × 10¹ + b₀ × 10⁰ to produce a four-digit result using basic multiplication operations which only take one digit from a and one digit from b at each step.

This technique can be used for numbers in any base and for any number of digits. Recall that one hexadecimal digit is equivalent to exactly four binary digits. If a and b are both 8-bit numbers, then they are also 2-digit hexadecimal numbers. In other words 8-bit numbers can be divided into groups of four bits, each representing one digit in base sixteen. Given a multiply operation that is capable of producing an 8-bit result from two 4-bit inputs, the technique shown above can then be used to multiply two 8-bit numbers using only 4-bit multiplication operations.
Carrying this one step further, suppose we are given two 16-bit numbers, but our computer only supports multiplying eight bits at a time and producing a 16-bit result. We can consider each 16-bit number to be a two digit number in base 256, and use the above technique to perform four eight bit multiplies with 16-bit results, then shift and add the 16-bit results to obtain the final 32-bit result. This approach can be extended to implement efficient multiplication of arbitrarily large numbers, using a fixed-sized multiplication operation.
Binary division can be implemented as a sequence of shift and subtract operations. When performing binary division by hand, it is convenient to perform the operation in a manner very similar to the way that decimal division is performed. As shown in Fig. 7.4, the operation is identical, but takes more steps in binary.

If the divisor is a power of two, then division can be accomplished with a shift to the right. Using the same approach as was used in Section 7.2.1, it can be shown that a shift right by n bits is equivalent to division by 2ⁿ. However, care must be taken to ensure that an arithmetic shift is used if the numerator is a signed two’s complement number, and a logical shift is used if the numerator is unsigned.
The algorithm for dividing binary numbers is somewhat more complicated than the algorithm for multiplication. The algorithm consists of two main phases:
1. shift the divisor left until it is greater than the dividend and count the number of shifts, then
2. repeatedly shift the divisor back to the right and subtract whenever possible.
Fig. 7.5 shows the algorithm in more detail. Because of the complexity of the algorithm, division in hardware requires a significant number of transistors. The ARM architecture did not introduce a divide instruction until ARMv7, and even then it was not implemented on all processors. Many ARM systems (including the Raspberry Pi) do not have hardware division. However, the ARM processor instruction set makes it possible to write very efficient code for division.

Before we introduce the ARM code, we will take some time to step through the algorithm using an example. Let us begin by dividing 94 by 7. The result is shown below:

To implement the algorithm, we need three registers, one for the dividend, one for the divisor, and one for a counter. The dividend and divisor are loaded into their registers and the counter is initialized to zero as shown below:
Next, the divisor is shifted left and the counter incremented repeatedly until the divisor is greater than the dividend. This is shown in the following sequence:
Next, we allocate a register for the quotient and initialize it to zero. Then, according to the algorithm, we repeatedly subtract if possible, shift to the right, and decrement the counter. This sequence continues until the counter becomes negative. For our example this results in the following sequence:






When the algorithm terminates, the quotient register contains the result of the division, and the modulus (remainder) is in the dividend register. Thus, one algorithm is used to compute both the quotient and the modulus at the same time. There are variations on this algorithm. For example, one variation is to shift a single bit left in a register, rather than incrementing a count. This variation has the same two phases as the previous algorithm, but counts in powers of two rather than by ones. The following sequence shows what occurs after each iteration of the first loop in the algorithm.
The divisor is greater than the dividend, so the algorithm proceeds to the second phase. In this phase, if the divisor is less than or equal to the dividend, then the power register is added to the quotient and the divisor is subtracted from the dividend. Then, the power and divisor registers are shifted to the right. The process is repeated until the power register is zero. The following sequence shows what the registers will contain at the end of each iteration of the second loop.






As with the previous version, when the algorithm terminates, the quotient register contains the result of the division, and the modulus (remainder) is in the dividend register. Listing 7.4 shows the ARM assembly code to implement this version of the division algorithm for 32-bit numbers, and the counting method for 64-bit numbers.




In general, division is slow. Newer ARM processors provide a hardware divide instruction which requires between two and twelve clock cycles to produce a result, depending on the size of the operands. Older processors must perform division using software, as previously described. In either case, division is by far the slowest of the basic mathematical operations. However, division by a constant c can be converted to a multiply by the reciprocal of c. It is obviously much more efficient to use a multiply instead of a divide wherever possible. Efficient division of a variable by a constant is achieved by applying the following equality:

x ÷ c = x × (1 ÷ c)  (7.1)
The only difficulty is that we have to do it in binary, using only integers. If we modify the right-hand side by multiplying and dividing by some power of two (2ⁿ), we can rewrite Eq. (7.1) as follows:

x ÷ c = x × (2ⁿ ÷ c) × 2⁻ⁿ  (7.2)
Recall that, in binary, multiplying by 2ⁿ is the same as shifting left by n bits, while multiplying by 2⁻ⁿ is done by shifting right by n bits. Therefore, Eq. (7.2) is just Eq. (7.1) with two shift operations added. The two shift operations cancel each other out. Now, let

m = 2ⁿ ÷ c  (7.3)
We can rewrite Eq. (7.2) as:

x ÷ c = (x × m) × 2⁻ⁿ  (7.4)
We now have a method for dividing by a constant c which involves multiplying by a different constant, m, and shifting the result. In order to achieve the best precision, we want to choose n such that m is as large as possible with the number of bits we have available.
Suppose we want efficient code to calculate x ÷ 23 using 8-bit signed integer multiplication. Our first task is to find

m = 2ⁿ ÷ 23
such that 01111111₂ ≥ m ≥ 01000000₂. In other words, we want to find the value of n where the most significant bit of m is zero, and the next most significant bit of m is one. If we choose n = 11, then

m = 2¹¹ ÷ 23 = 2048 ÷ 23 ≈ 89.04
Rounding to the nearest integer gives m = 89. In 8 bits, m is 01011001₂ or 59₁₆. We now have values for m and n, and therefore we can apply Eq. (7.4) to divide any number x by 23. The procedure is simple: calculate y = x × m, then shift y right by 11 bits.
However, there are two more considerations. First, when the dividend is positive, the result for some values of x may be incorrect due to rounding error. It is usually sufficient to increment the reciprocal value by one in order to avoid these errors. In the previous example, the number would be changed from 59₁₆ to 5A₁₆. When implementing this technique for finding the reciprocal, the programmer should always verify that the results are correct for all input values. The second consideration is when the dividend is negative. In that case it is necessary to subtract one from the final result.
For example, to calculate 101₁₀ ÷ 23₁₀ in binary, with eight bits of precision, we first perform the multiplication as follows:

Then shift the result right by 11 bits: 10001100011101₂ shifted right 11₁₀ bits is 100₂ = 4₁₀. If the modulus is required, it can be calculated as 101 mod 23 = 101 − (4 × 23) = 9, which once again requires multiplication by a constant.
In the previous example the shift amount of 11 bits provided the best precision possible. But how was that number chosen? The shift amount, n, can be directly computed as

n = ⌊log₂ c⌋ + p − 1

where p is the desired number of bits of precision. The value of m can then be computed as

m = ⌈2ⁿ ÷ c⌉
For example, to divide by the constant 33, with 16 bits of precision, we compute n as

n = ⌊log₂ 33⌋ + 16 − 1 = 5 + 15 = 20

and then we compute m as

m = ⌈2²⁰ ÷ 33⌉ = ⌈31775.03⌉ = 31776 = 7C20₁₆
Therefore, multiplying a 16-bit number by 7C20₁₆ and then shifting right 20 bits is equivalent to dividing by 33.
Example 7.6 shows how to calculate m and n for division by 193. On the ARM processor, division by a constant can be performed very efficiently. Listing 7.5 shows how division by 193 can be implemented using only a few lines of code. In the listing, the numbers are 32 bits in length, so the constant m is much larger than in the example that was multiplied by hand, but otherwise the method is the same.

On processors without the multiply instruction, we can use the technique of shifting and adding shown previously. If we wish to divide by 23 using 32 bits of precision, we compute the multiplier as

m = ⌈2³⁵ ÷ 23⌉ = 1493901669
That is 01011001000010110010000101100101₂. Note that there are only 13 non-zero bits, and the pattern 1011001 appears three times in the 32-bit multiplier. The multiply can be implemented as 2²⁴(2⁶x + 2⁴x + 2³x + 2⁰x) + 2¹³(2⁶x + 2⁴x + 2³x + 2⁰x) + 2²(2⁶x + 2⁴x + 2³x + 2⁰x) + 2⁰x. So the following code sequence can be used on processors that do not have the multiply instruction:


Section 7.2.5 showed how large numbers can be multiplied by breaking them into smaller numbers and using a series of multiplication operations. There is no similar method for synthesizing a large division operation with an arbitrary number of digits in the dividend and divisor. However, there is a method for dividing a large dividend by a divisor given that the division operation can operate on numbers with at least the same number of digits as in the divisor.
Suppose we wish to perform division of an arbitrarily large dividend by a one digit divisor using a basic division operation that can divide a two digit dividend by a one digit divisor. The operation can be performed in multiple steps as follows:
1. Divide the most significant digit of the dividend by the divisor. The result is the most significant digit of the quotient.
2. Prepend the remainder from the previous division step to the next digit of the dividend, forming a two-digit number, and divide that by the divisor. This produces the next digit of the result.
3. Repeat from step 2 until all digits of the dividend have been processed.
4. Take the final remainder as the modulus.
The following example shows how to divide 6189 by 7 using only 2-digits at a time:
This method can be applied in any base and with any number of digits. The only restriction is that the basic division operation must be capable of dividing a 2n digit number by an n digit number and producing a 2n digit quotient and an n digit remainder. For example, the udiv instruction available on the Cortex-M3 and newer processors is capable of dividing a 32-bit dividend by a 32-bit divisor, producing a 32-bit quotient. The remainder can be calculated by multiplying the quotient by the divisor and subtracting the product from the dividend. Using this division operation it is possible to divide an arbitrarily large number by a 16-bit divisor.
We have seen that, given a divide operation capable of dividing an n digit number by an n digit number, it is possible to divide a dividend with any number of digits by a divisor with at most n ÷ 2 digits. Unfortunately, there is no similar method to deal with an arbitrarily large divisor, or to divide an arbitrarily large dividend by a divisor with more than n ÷ 2 digits. In those cases the division must be performed using a general division algorithm as shown previously.
For some programming tasks, it may be helpful to deal with arbitrarily large integers. For example, the factorial function and Ackermann’s function grow very quickly and will overflow a 32-bit integer for small input values. In this section, we will outline an abstract data type which provides basic operations for arbitrarily large integer values. Listing 7.7 shows the C header for this ADT, and Listing 7.8 shows the C implementation. Listing 7.9 shows a small program that uses the bigint ADT to create a table of x! for all x between 0 and 100.




















The implementation could be made more efficient by writing some of the functions in assembly language. One opportunity for improvement is in the add function, which must calculate the carry from one chunk of bits to the next. In assembly, the programmer has direct access to the carry bit, so carry propagation should be much faster.
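For comparison, a portable C version of multi-chunk addition must reconstruct the carry with comparisons at every chunk, which the ARM adcs instruction provides for free. The following sketch is illustrative (the function name and little-endian chunk layout are assumptions), not the book's bigint_adc:

```c
#include <assert.h>
#include <stdint.h>

/* Add two big numbers stored as little-endian arrays of 32-bit chunks.
 * C provides no direct access to the processor's carry flag, so the carry
 * out of each chunk must be recovered with comparisons. */
uint32_t add_chunks(uint32_t *r, const uint32_t *a, const uint32_t *b, int n)
{
    uint32_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint32_t s = a[i] + carry;
        carry = (s < carry);        /* carry out of a[i] + carry */
        r[i] = s + b[i];
        carry += (r[i] < s);        /* carry out of s + b[i] */
    }
    return carry;                   /* final carry out of the sum */
}
```

In assembly, the loop body collapses to a single adcs per chunk, which is where the speedup comes from.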
When attempting to speed up a C program by converting selected parts of it to assembly language, it is important to first determine where the most significant gains can be made. A profiler, such as gprof, can be used to help identify the sections of code that will matter most. It is also important to make sure that the result is not just highly optimized C code. If the code cannot benefit from some features offered by assembly, then it may not be worth the effort of re-writing in assembly. The code should be re-written from a pure assembly language viewpoint.
It is also important to avoid premature assembly programming. Make sure that the C algorithms and data structures are efficient before moving to assembly. If a better algorithm can give better performance, then assembly may not be required at all. Once the assembly is written, it is more difficult to make major changes to the data structures and algorithms. Assembly language optimization is the final step in optimization, not the first one.
Well-written C code is modularized, with many small functions. This helps readability, promotes code reuse, and may allow the compiler to achieve better optimization. However, each function call has some associated overhead. If optimal performance is the goal, then calling many small functions should be avoided. For instance, if the piece of code to be optimized is in a loop body, then it may be best to write the entire loop in assembly, rather than writing a function and calling it each time through the loop. Writing in assembly is not a guarantee of performance. Spaghetti code is slow. Load/store instructions are slow. Multiplication and division are slow. The secret to good performance is avoiding things that are slow. Good optimization requires rethinking the code to take advantage of assembly language.
The bigint_adc function was re-written in assembly, as shown in Listing 7.10. This function is used internally by several other functions in the bigint ADT to perform addition and subtraction. The profiler indicated that it is used more than any other function. If assembly language can make this function run faster, then it should have a profound effect on the program.




The bigfact main function was executed 50 times on a Raspberry Pi, using the C version of bigint_adc and then with the assembly version. The total time required using the C version was 27.65 seconds, and the program spent 54.0% of its time (14.931 seconds) in the bigint_adc function. The assembly version ran in 15.07 seconds, and the program spent 15.3% of its time (2.306 seconds) in the bigint_adc function. Therefore the assembly version of the function achieved a speedup of 6.47 over the C implementation. Overall, the program achieved a speedup of 1.83 by writing one function in assembly.
Running gprof on the improved program reveals that most of the time is now spent in the bigint_mul function (63.2%) and two functions that it calls: bigint_mul_uint (39.1%) and bigint_shift_left_chunk (21.6%). It seems clear that optimizing those two functions would further improve performance.
Complement mathematics provides a method for performing all basic operations using only the complement, add, and shift operations. Addition and subtraction are fast, but multiplication and division are relatively slow. In particular, division should be avoided whenever possible. The exception to this rule is division by a power of the radix, which can be implemented as a shift. Good assembly programmers replace division by a constant c with multiplication by the reciprocal of c. They also replace the multiply instruction with a series of shifts and add or subtract operations when it makes sense to do so. These optimizations can make a big difference in performance.
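As a sketch of the reciprocal technique, dividing a 32-bit unsigned value by the constant 10 can be replaced by one multiply and one shift. The magic constant below is the ceiling of 2³⁵/10, a well-known value for this divisor; the function name is illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned division by the constant 10, replaced by multiplication by a
 * scaled reciprocal: 0xCCCCCCCD is the ceiling of 2^35 / 10, and shifting
 * the 64-bit product right by 35 bits recovers the exact quotient for all
 * 32-bit inputs. */
uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

On ARM, the 64-bit product comes from a single umull, so the whole division costs a multiply and a shift instead of a library call or a slow divide instruction.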
Writing sections of a program in assembly can result in better performance, but it is not guaranteed. The chance of achieving significant performance improvement is increased if the following rules are used:
1. Only optimize the parts that really matter.
2. Design data structures with assembly in mind.
3. Use efficient algorithms and data structures.
4. Write the assembly code last.
5. Ignore the C version and write good, clean assembly.
6. Reduce function calls wherever it makes sense.
7. Avoid unnecessary memory accesses.
8. Write good code. The compiler will beat poor assembly every time, but good assembly will beat the compiler every time.
Understanding the basic mathematical operations can enable the assembly programmer to work with integers of any arbitrary size with efficiency that cannot be matched by a C compiler. However, it is best to focus the assembly programming on areas where the greatest gains can be made.
7.1 Multiply −90 by 105 using signed 8-bit binary multiplication to form a signed 16-bit result. Show all of your work.
7.2 Multiply 166 by 105 using unsigned 8-bit binary multiplication to form an unsigned 16-bit result. Show all of your work.
7.3 Write a section of ARM assembly code to multiply the value in r1 by 13₁₀ using only shift and add operations.
7.4 The following code will multiply the value in r0 by a constant C. What is C?

7.5 Show the optimally efficient instruction(s) necessary to multiply a number in register r0 by the constant 67₁₀.
7.6 Show how to divide 78₁₀ by 6₁₀ using binary long division.
7.7 Demonstrate the division algorithm using a sequence of tables as shown in Section 7.3.2 to divide 155₁₀ by 11₁₀.
7.8 When dividing by a constant value, why is it desirable to have m as large as possible?
7.9 Modify your program from Exercise 5.13 in Chapter 5 to produce a 64-bit result, rather than a 32-bit result.
7.10 Modify your program from Exercise 5.13 in Chapter 5 to produce a 128-bit result, rather than a 32-bit result. How would you do this in C?
7.11 Write the bigint_shift_left_chunk function from Listing 7.8 in ARM assembly, and measure the performance improvement.
7.12 Write the bigint_mul_uint function in ARM assembly, and measure the performance improvement.
7.13 Write the bigint_mul function in ARM assembly, and measure the performance improvement.
This chapter starts by demonstrating how to convert fractional numbers to radix notation in any base. It then presents a theorem that can be used to determine in which bases a given fraction will terminate rather than repeating. That theorem is then used to explain why some base ten fractional numbers cannot be represented in binary with a finite number of bits. Next fixed-point numbers are introduced. The rules for addition, subtraction, multiplication, and division are given. Division by a constant is explained in terms of fixed-point mathematics. Next, the IEEE floating point formats are explained. The chapter ends with an example showing how fixed-point mathematics can be used to write functions for sine and cosine which give better precision and higher performance than the functions provided by GCC.
Fixed point; Radix point; Non-terminating repeating fraction; S/U notation; Q notation; Floating point; Performance
Chapter 7 introduced methods for performing computation using integers. Although many problems can be solved using only integers, it is often necessary (or at least more convenient) to perform computation using real numbers or even complex numbers. For our purposes, a non-integral number is any number that is not an integer. Many systems are only capable of performing computation using binary integers, and have no hardware support for non-integral calculations. In this chapter, we will examine methods for performing non-integral calculations using only integer operations.
Section 1.3.2 explained how to convert integers in a given base into any other base. We will now extend the methods to convert fractional values. A fractional number can be viewed as consisting of an integer part, a radix point, and a fractional part. In base 10, the radix point is also known as the decimal point. In base 2, it is called the binimal point. For base 16, it is the heximal point, and in base 8 it is an octimal point. The term radix point is used as a general term for a location that divides a number into integer and fractional parts, without specifying the base.
The procedure for converting fractions from a given base b into base ten is very similar to the procedure used for integers. The only difference is that the digit to the left of the radix point is weighted by b⁰ and the exponents become increasingly negative for each digit right of the radix point. The basic procedure is the same for any base b. For example, the value 101.0101₂ can be converted to base ten by expanding it as follows:
Likewise, the hexadecimal fraction 4F2.9A0₁₆ can be converted to base ten by expanding it as follows:
When converting from base ten into another base, the integer and fractional parts are treated separately. The base conversion for the integer part is performed in exactly the same way as in Section 1.3.2, using repeated division by the base b. The fractional part is converted using repeated multiplication. For example, to convert the decimal value 5.6875₁₀ to a binary representation:
1. Convert the integer portion, 5₁₀, into its binary equivalent, 101₂.
2. Multiply the decimal fraction by two. The integer part of the result is the first binary digit to the right of the radix point.
Because x = 0.6875 × 2 = 1.375, the first binary digit to the right of the point is a 1. So far, we have 5.6875₁₀ ≈ 101.1₂
3. Multiply the fractional part of x by 2 once again.
Because x = 0.375 × 2 = 0.75, the second binary digit to the right of the point is a 0. So far, we have 5.6875₁₀ ≈ 101.10₂
4. Multiply the fractional part of x by 2 once again.
Because x = 0.75 × 2 = 1.50, the third binary digit to the right of the point is a 1. So now we have 5.6875₁₀ ≈ 101.101₂
5. Multiply the fractional part of x by 2 once again.
Because x = 0.5 × 2 = 1.00, the fourth binary digit to the right of the point is a 1. So now we have 5.6875₁₀ = 101.1011₂
6. Since the fractional part is now zero, we know that all remaining digits will be zero.
The procedure for obtaining the fractional part can be accomplished easily using a tabular method, as shown below:
| Operation | Integer | Fraction |
| 0.6875 × 2 = 1.375 | 1 | 0.375 |
| 0.375 × 2 = 0.75 | 0 | 0.75 |
| 0.75 × 2 = 1.5 | 1 | 0.5 |
| 0.5 × 2 = 1.0 | 1 | 0.0 |

Putting it all together, 5.6875₁₀ = 101.1011₂. After converting a fraction from base 10 into another base, the result should be verified by converting back into base 10. The results from the previous example can be expanded as follows:
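The repeated-multiplication procedure can be sketched in C. The function name is an illustrative assumption; the fraction must be exactly representable in a double (as 0.6875 is) for the loop to stop at the exact result.

```c
#include <assert.h>

/* Convert the fractional part of a number to base b by repeated
 * multiplication. The integer part of each product is the next digit to
 * the right of the radix point. Returns the number of digits produced. */
int frac_to_base(double frac, int b, int *digits, int max_digits)
{
    int n = 0;
    while (frac != 0.0 && n < max_digits) {
        frac *= b;
        int digit = (int)frac;   /* integer part is the next digit */
        digits[n++] = digit;
        frac -= digit;           /* keep only the fractional part */
    }
    return n;                    /* may hit max_digits for repeating fractions */
}
```

Running it on 0.6875 in base 2 reproduces the digits 1, 0, 1, 1 from the table above; in base 16, 0.234375 produces the digits 3 and 12 (C).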
Converting decimal fractions to base sixteen is accomplished in a very similar manner. To convert 842.234375₁₀ into base 16, we first convert the integer portion by repeatedly dividing by 16 to yield 34A₁₆. We then repeatedly multiply the fractional part, extracting the integer portion of the result each time as shown in the table below:
| Operation | Integer | Fraction |
| 0.234375 × 16 = 3.75 | 3 | 0.75 |
| 0.75 × 16 = 12.0 | 12 | 0.0 |
In the second line, the integer part is 12, which must be replaced with a hexadecimal digit. The hexadecimal digit for 12₁₀ is C, so the fractional part is 3C. Therefore, 842.234375₁₀ = 34A.3C₁₆. The result is verified by converting it back into base 10 as follows:
Converting fractional values between binary, hexadecimal, and octal can be accomplished in the same manner as with integer values. However, care must be taken to align the radix point properly. As with integers, converting from hexadecimal or octal to binary is accomplished by replacing each hex or octal digit with the corresponding binary digits from the appropriate table shown in Fig. 1.3.
For example, to convert 5AC.43B₁₆ to binary, we just replace “5” with “0101,” replace “A” with “1010,” replace “C” with “1100,” replace “4” with “0100,” replace “3” with “0011,” and replace “B” with “1011.” So, using the table, we can immediately see that 5AC.43B₁₆ = 010110101100.010000111011₂. This method works exactly the same way for converting from octal to binary, except that it uses the table on the right side of Fig. 1.3.
Converting fractional numbers from binary to hexadecimal or octal is also very easy when using the tables. The procedure is to split the binary string into groups of bits, working outwards from the radix point, then replace each group with its hexadecimal or octal equivalent. For example, to convert 01110010.1010111₂ to hexadecimal, just divide the number into groups of four bits, starting at the radix point and working outwards in both directions. It may be necessary to pad with zeroes to make a complete group on the left or right, or both. Our example is grouped as follows: |0000|0111|0010.1010|1110|₂. Now each group of four bits is converted to hexadecimal by looking up the corresponding hex digit in the table on the left side of Fig. 1.3. This yields 072.AE₁₆. For octal, the binary number would be grouped as follows: |001|110|010.101|011|100|₂. Now each group of three bits is converted to octal by looking up the corresponding digit in the table on the right side of Fig. 1.3. This yields 162.534₈.
One interesting phenomenon that is often encountered is that fractions which terminate in one base may become non-terminating, repeating fractions in another base. For example, the binary representation of the decimal fraction 1.1₁₀ is a repeating fraction, as shown in Example 8.1. The resulting fractional part from the last step performed is exactly the same as in the second step. Therefore, the sequence will repeat. If we continue, we will repeat the sequence of steps 2–5 forever. Hence, the final binary representation will be 1.000110011001…₂, with the group “0011” repeating.
Because of this phenomenon, it is impossible to exactly represent 1.1₁₀ (and many other fractional quantities) as a binary fraction in a finite number of bits.
The fact that some base 10 fractions cannot be exactly represented in binary has led to many subtle software bugs and round-off errors, when programmers attempt to work with currency (and other quantities) as real-valued numbers. In this section, we explore the idea that the representation problem can be avoided by working in some base other than base 2. If that is the case, then we can simply build hardware (or software) to work in that base, and will be able to represent any fractional value precisely using a finite number of digits. For brevity, we will refer to a binary fractional quantity as a binimal and a decimal fractional quantity as a decimal. We would like to know whether there are more non-terminating decimals than binimals, more non-terminating binimals than decimals, or neither. Since there are an infinite number of non-terminating decimals and an infinite number of non-terminating binimals, we could be tempted to conclude that they are equal. However, that is an oversimplification. If we ask the question differently, we can discover some important information. A better way to ask the question is as follows:
Question: Is the set of terminating decimals a subset of the set of terminating binimals, or vice versa, or neither?
We start by introducing a lemma which can be used to predict whether or not a terminating fraction in one base will terminate in another base. We introduce the notation x|y (read as “x divides y”) to indicate that y can be evenly divided by x.
Answer: The set of terminating binimals is a subset of the set of terminating decimals, but the set of terminating decimals is not a subset of the set of terminating binimals.
Theorem 8.2.1 implies that any binary fraction can be expressed exactly as a decimal fraction, but Theorem 8.2.2 implies that there are decimal fractions which cannot be expressed exactly in binary. Every fraction (when expressed in lowest terms) which has a non-zero power of five in its denominator cannot be represented in binary with a finite number of bits. Another implication is that some fractions cannot be expressed exactly in either binary or decimal. For example, let B = 30 = 2 × 3 × 5. Then any fraction with a denominator of the form 2^k₁ × 3^k₂ × 5^k₃ terminates in base 30. However, if k₂ ≠ 0, then the fraction will terminate in neither base two nor base ten, because three is not a prime factor of ten or two.
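The divisibility condition behind these results can be checked mechanically: a fraction in lowest terms with denominator q terminates in base b exactly when every prime factor of q also divides b. A small C sketch (function names are illustrative assumptions):

```c
#include <assert.h>

static unsigned gcd(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Returns 1 if a fraction with denominator q (in lowest terms) terminates
 * in base b, i.e., if every prime factor of q also divides b. Repeatedly
 * strips the factors q shares with b; anything left over is a prime factor
 * that b lacks. */
int terminates_in_base(unsigned q, unsigned b)
{
    while (q > 1) {
        unsigned g = gcd(q, b);
        if (g == 1)
            return 0;   /* q has a prime factor not present in b */
        q /= g;
    }
    return 1;
}
```

For example, denominator 45 terminates in base 30 but denominator 10 does not terminate in base 2, matching the discussion above.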
Another implication of the theorem is that the more prime factors we have in our base, the more fractions we can express exactly. For instance, the smallest base that has two, three, and five as prime factors is base 30. Using that base, we can exactly express fractions in radix notation that cannot be expressed in base ten or in base two with a finite number of digits. For example, in base 30, the fraction 11/15 will terminate after one division, since 15 = 3¹5¹. To see what the number will look like, let us extend the hexadecimal system of using letters to represent digits beyond 9. So we get this chart for base 30:
Since 11/15 = 22/30, the fraction can be expressed precisely as 0.M₃₀. Likewise, the fraction 13/45 is non-terminating in both base two and base ten, but terminates in base 30. Since 45 = 3²5¹, this number will have two or fewer digits following the radix point. To compute the value, we will have to raise it to higher terms. Using 30² as the denominator gives us 13/45 = 260/900. Now we can convert it to base 30 by repeated division: 260 ÷ 30 = 8 with remainder 20. Since 20 < 30, we cannot divide again. Therefore, 13/45 in base 30 is 0.8K₃₀.
Although base 30 can represent all fractions that can be expressed in bases two and ten, there are still fractions that cannot be represented in base 30. For example, 1/7 has the prime factor seven in its denominator, and therefore will only terminate in bases where seven is a prime factor of the base. The fraction 1/7 will terminate in base 7, base 14, base 21, base 42, and many others, but not in base 30. Since there are an infinite number of primes, no number system is immune from this problem. No matter what base the computer works in, there are fractions that cannot be expressed exactly with a finite number of digits. Therefore, it is incumbent upon programmers and hardware designers to be aware of round-off errors and take appropriate steps to minimize their effects.
For example, there is no reason why the hardware clocks in a computer should work in base ten. They can be manufactured to measure time in base two. Instead of counting seconds in tenths, hundredths or thousandths, they could be calibrated to measure in fourths, eighths, sixteenths, 1024ths, etc. This would eliminate the round-off error problem in keeping track of time.
As shown in the previous section, given a finite number of bits, a computer can only approximately represent non-integral numbers. It is often necessary to accept that limitation and perform computations involving approximate values. With due care and diligence, the results will be accurate within some acceptable error tolerance. One way to deal with real-valued numbers is to simply treat the data as fixed-point numbers. Fixed-point numbers are treated as integers, but the programmer must keep track of the radix point during each operation. We will present a systematic approach to designing fixed-point calculations.
When using fixed-point arithmetic, the programmer needs a convenient way to describe the numbers that are being used. Most languages have standard data types for integers and floating point numbers, but very few have support for fixed-point numbers. Notable exceptions include PL/1 and Ada, which provide support for fixed-point binary and fixed-point decimal numbers. We will focus on fixed-point binary, but the techniques presented can also be applied to fixed-point numbers in any base.
Each fixed-point binary number has three important parameters that describe it:
1. whether the number is signed or unsigned,
2. the position of the radix point in relation to the right side of the sign bit (for signed numbers) or the position of the radix point in relation to the most significant bit (for unsigned numbers), and
3. the number of fractional bits stored.
Unsigned fixed-point numbers will be specified as U(i,f), where i is the position of the radix point in relation to the left side of the most significant bit, and f is the number of bits stored in the fractional part.
For example, U(10,6) indicates that there are six bits of precision in the fractional part of the number, and the radix point is ten bits to the right of the most significant bit stored. The layout for this number is shown graphically as:

where i is an integer bit and f is a fractional bit. Very small numbers with no integer part may have a negative i. For example, U(−8,16) specifies an unsigned number with no integer part, eight leading zero bits which are not actually stored, and 16 bits of fractional precision. The layout for this number is shown graphically as:

Likewise, signed fixed-point numbers will be specified using the following notation: S(i,f), where i is the position of the radix point in relation to the right side of the sign bit, and f is the number of fractional bits stored. As with integer two’s-complement notation, the sign bit is always the leftmost bit stored. For example, S(9,6) indicates that there are six bits in the fractional part of the number, and the radix point is nine bits to the right of the sign bit. The layout for this number is shown graphically as:

where i is an integer bit and f is a fractional bit. Very small numbers with no integer part may have a negative i. For example, S(−7,16) specifies a signed number with no integer part, six leading sign bits which are not actually stored, a sign bit that is stored and 15 bits of fraction. The layout for this number is shown graphically as:

Note that the “hidden” bits in a signed number are assumed to be copies of the sign bit, while the “hidden” bits in an unsigned number are assumed to be zero.
The following figure shows an unsigned fixed-point number with seven bits in the integer part and nine bits in the fractional part. It is a U(7,9) number. Note that the total number of bits is 7 + 9 = 16.

The value of this number in base 10 can be computed by summing the values of each non-zero bit as follows:
Likewise, the following figure shows a signed fixed-point number with nine bits in the integer part and six bits in the fractional part. It is an S(9,6) number. Note that the total number of bits is 9 + 6 + 1 = 16.

The value of this number in base 10 can be computed by summing the values of each non-zero bit as follows:
Note that in the above two examples, the patterns of bits are identical. The value of a number depends upon how it is interpreted. The notation that we have introduced allows us to easily specify exactly how a number is to be interpreted. For signed values, if the first bit is non-zero, then the two’s complement should be taken before the number is evaluated. For example, the following figure shows an S(8,7) number that has a negative value.

The value of this number in base 10 can be computed by taking the two’s complement, summing the values of the non-zero bits, and adding a negative sign to the result. The two’s complement of 1011010101111010 is 0100101010000101 + 1 = 0100101010000110. The value of this number is −(149 + 0.046875) = −149.046875.
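The interpretation rule can be sketched in C: reinterpret the bit pattern as a two's-complement integer, then scale by 2⁻f to place the radix point. The function name is an illustrative assumption; it handles any S(i,f) format stored in 16 bits with i + f + 1 = 16.

```c
#include <assert.h>
#include <stdint.h>

/* Interpret a 16-bit pattern as a signed fixed-point S(i,f) value with
 * i + f + 1 = 16. First recover the two's-complement integer value, then
 * divide by 2^f to place the radix point. */
double s16_value(uint16_t bits, int f)
{
    int32_t v = bits;
    if (v & 0x8000)
        v -= 0x10000;                    /* apply two's-complement sign */
    return (double)v / (double)(1 << f);
}
```

Interpreting the pattern 1011010101111010 (0xB57A) as an S(8,7) number reproduces the value computed above, −149.046875.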
For a final example we will interpret this bit pattern as an S(−5,16). In that format, the layout is:

The value of this number in base ten can be computed by taking the two’s complement, summing the values of the non-zero bits, and adding a negative sign to the result. The two’s complement is:

The value of this number interpreted as an S(−5,16) is −(19078 × 2⁻²⁰) ≈ −0.0181942.
Fixed-point number formats can also be represented using Q notation, which was developed by Texas Instruments. Q notation is equivalent to the S/U format used in this book, except that the integer portion is not always fully specified. In general, Q formats are specified as Qm, n where m is the number of integer bits, and n is the number of fractional bits. If a fixed word size w is being used then m may be omitted, and is assumed to be w − n. For example, a Q10 number has 10 fractional bits, and the number of integer bits is not specified, but is assumed to be the number of bits required to complete a word of data. A Q2,4 number has two integer bits and four fractional bits in a 6-bit word. There are two conflicting conventions for dealing with the sign bit. In one convention, the sign bit is included as part of m, and in the other convention, it is not. When using Q notation, it is important to state which convention is being used. Additionally, a U may be prefixed to indicate an unsigned value. For example UQ8.8 is equivalent to U(8,8), and Q7,9 is equivalent to S(7,9).
Once the decision has been made to use fixed-point calculations, the programmer must make some decisions about the specific representation of each fixed-point variable. The combination of size and radix will affect several properties of the numbers, including:
Precision: the maximum number of non-zero bits representable,
Resolution: the smallest non-zero magnitude representable,
Accuracy: the magnitude of the maximum difference between a true real value and its approximate representation,
Range: the difference between the largest and smallest number that can be represented, and
Dynamic range: the ratio of the maximum absolute value to the minimum positive absolute value representable.
Given a number specified using the notation introduced previously, we can determine its properties. For example, an S(9,6) number has the following properties:
Resolution: R = 2⁻⁶ = 0.015625
Accuracy: A = R ÷ 2 = 2⁻⁷ = 0.0078125
Range: Minimum value is 1000000000.000000 = −512. Maximum value is 0111111111.111111 = 511.984375. Range is G = 511.984375 + 512 = 1023.984375.
Dynamic range: For a signed fixed-point rational representation, S(i,f), the dynamic range is D = 2^(i+f+1).
Therefore, the dynamic range of an S(9,6) is 2¹⁶ = 65536.
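These properties follow directly from i and f. A small C sketch for signed S(i,f) formats (the function names are illustrative assumptions):

```c
#include <assert.h>

/* Properties of a signed S(i,f) fixed-point format (i + f + 1 stored bits). */
double s_resolution(int f)   { return 1.0 / (double)(1 << f); }          /* 2^-f */
double s_min(int i)          { return -(double)(1 << i); }               /* -2^i */
double s_max(int i, int f)   { return (double)(1 << i) - s_resolution(f); }
double s_range(int i, int f) { return s_max(i, f) - s_min(i); }
```

For S(9,6), these give a resolution of 0.015625, a minimum of −512, a maximum of 511.984375, and a range of 1023.984375.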
Being aware of these properties, the programmer can select fixed-point representations that fit the task that they are trying to solve. This allows the programmer to strive for very efficient code by using the smallest fixed-point representation possible, while still guaranteeing that the results of computations will be within some limits for error tolerance.
Fixed-point numbers are actually stored as integers, and all of the integer mathematical operations can be used. However, some care must be taken to track the radix point at each stage of the computation. The advantages of fixed-point calculations are that the operations are very fast and can be performed on any computer, even if it does not have special hardware support for non-integral numbers.
Fixed-point addition and subtraction work exactly like their integer counterparts. Fig. 8.1 gives some examples of fixed-point addition with signed numbers. Note that in each case, the numbers are aligned so that they have the same number of bits in their fractional part. This requirement is the only difference between integer and fixed-point addition. In fact, integer arithmetic is just fixed-point arithmetic with no bits in the fractional part. The arithmetic that was covered in Chapter 7 was fixed-point arithmetic using only S(i,0) and U(i,0) numbers. Now we are simply extending our knowledge to deal with numbers where f≠0. There are some rules which must be followed to ensure that the results are correct. The rules for subtraction are the same as the rules for addition. Since we are using two’s complement math, subtraction is performed using addition.

Suppose we want to add an S(7,8) number to an S(7,4) number. The radix points are at different locations, so we cannot simply add them. Instead, we must shift one of the numbers, changing its format, until the radix points are aligned. The choice of which one to shift depends on what format we desire for the result. If we desire eight bits of fraction in our result, then we would shift the S(7,4) left by four bits, converting it into an S(7,8). With the radix points aligned, we simply use an integer addition operation to add the two numbers. The result will have its radix point in the same location as the two numbers being added.
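A sketch of this alignment in C, storing the fixed-point values in 32-bit integers (the function name is an illustrative assumption):

```c
#include <assert.h>
#include <stdint.h>

/* Add an S(7,8) value to an S(7,4) value, producing an S(7,8) result.
 * The S(7,4) operand is shifted left four bits first, so that both radix
 * points are aligned at eight fractional bits. */
int32_t add_s7f8(int32_t a_s7f8, int32_t b_s7f4)
{
    return a_s7f8 + (b_s7f4 << 4);
}
```

For example, 1.5 in S(7,8) is the integer 384 and 2.25 in S(7,4) is 36; the sum 960 is 3.75 in S(7,8).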
Recall that the result of multiplying an n bit number by an m bit number is an n + m bit number. In the case of fixed-point numbers, the size of the fractional part of the result is the sum of the number of fractional bits of each number, and the total size of the result is the sum of the total number of bits in each number. Consider the following example where two U(5,3) numbers are multiplied together:

The result is a U(10,6) number. The number of bits in the result is the sum of all of the bits of the multiplicand and the multiplier. The number of fractional bits in the result is the sum of the number of fractional bits in the multiplicand and the multiplier. There are three simple rules to predict the resulting format when multiplying any two fixed-point numbers.
Unsigned Multiplication The result of multiplying two unsigned numbers U(i1,f1) and U(i2,f2) is a U(i1 + i2,f1 + f2) number.
Mixed Multiplication The result of multiplying a signed number S(i1,f1) and an unsigned number U(i2,f2) is an S(i1 + i2,f1 + f2) number.
Signed Multiplication The result of multiplying two signed numbers S(i1,f1) and S(i2,f2) is an S(i1 + i2 + 1,f1 + f2) number.
Note that this rule works for integers as well as fixed-point numbers, since integers are really fixed-point numbers with f = 0. If the programmer desires a particular format for the result, then the multiply is followed by an appropriate shift.
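As a sketch of the unsigned rule, multiplying two U(5,3) values in C yields a U(10,6) product; a right shift by three bits converts the product back to three fractional bits. The function name is an illustrative assumption:

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two U(5,3) values. The raw 16-bit product is a U(10,6) number
 * (the integer and fractional bit counts both add). Shifting right three
 * bits renormalizes the product to three fractional bits, discarding the
 * low-order fraction. */
uint16_t mul_u5f3(uint8_t a, uint8_t b)
{
    uint16_t prod_u10f6 = (uint16_t)a * (uint16_t)b; /* U(10,6) */
    return prod_u10f6 >> 3;                          /* back to f = 3 */
}
```

For example, 2.5 is stored as 20 and 3.25 as 26; the renormalized product 65 represents 2.5 × 3.25 = 8.125.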
Listing 8.1 gives some examples of fixed-point multiplication using the ARM multiply instructions. In each case, the result is shifted to produce the desired format. It is the responsibility of the programmer to know what type of fixed-point number is produced after each multiplication and to adjust the result by shifting if necessary.

Derivation of the rule for determining the format of the result of division is more complicated than the one for multiplication. We will first consider only unsigned division of a dividend with format U(i1,f1) by a divisor with format U(i2,f2).
Consider the results of dividing two fixed-point numbers, using integer operations with limited precision. The value of the least significant bit of the dividend N is 2^−f1, and the value of the least significant bit of the divisor D is 2^−f2. In order to perform the division using integer operations, it is necessary to multiply N by 2^f1 and multiply D by 2^f2 so that both numbers are integers. Therefore, the division operation can be written as:

(N × 2^f1) ÷ (D × 2^f2) = (N ÷ D) × 2^(f1 − f2)

Note that no multiplication is actually performed. Instead, the programmer mentally shifts the radix point of the divisor and dividend, then computes the radix point of the result. For example, given two U(5,3) numbers, the division operation is accomplished by converting them both to integers, performing the division, then computing the location of the radix point:
Note that the result is an integer. If the programmer wants to have some fractional bits in the result, then the dividend must be shifted to the left before the division is performed.
If the programmer wants to have fq fractional bits in the quotient, then the amount that the dividend must be shifted can easily be computed as s = fq − f1 + f2.
For example, suppose the programmer wants to divide 01001.011, stored as a U(28,3), by 00011.110, which is also stored as a U(28,3), and wishes to have six fractional bits in the result. The programmer would first shift 01001.011 to the left by s = 6 − 3 + 3 = 6 bits, then perform the division and compute the position of the radix point in the result as shown:

(01001011 << 6) ÷ 00011110 = 10100000 = 10.100000₂ (9.375 ÷ 3.75 = 2.5)

Since the divisor may be between zero and one, the quotient may actually require more integer bits than there are in the dividend. Consider that the largest possible value of the dividend is 2^i1 − 2^(−f1), and the smallest positive value for the divisor is 2^(−f2). Therefore, the maximum quotient is given by:

(2^i1 − 2^(−f1)) ÷ 2^(−f2) = 2^(i1+f2) − 2^(f2−f1)

Taking the limit of the previous equation as f1 grows provides the following bound on how many bits are required in the integer part of the quotient:

2^(i1+f2) − 2^(f2−f1) < 2^(i1+f2)

Therefore, in the worst case, the quotient will require i1 + f2 integer bits. For example, if we divide a U(3,5), a = 111.11111 = 7.96875₁₀, by a U(5,3), b = 00000.001 = 0.125₁₀, we end up with a U(6,2), q = 111111.11 = 63.75₁₀.
The same thought process can be used to determine the results for signed division as well as mixed division between signed and unsigned numbers. The results can be reduced to the following three rules:
Unsigned Division The result of dividing an unsigned fixed-point number U(i1,f1) by an unsigned number U(i2,f2) is a U(i1 + f2,f1 − f2) number.
Mixed Division The result of dividing two fixed-point numbers where one of them is signed and the other is unsigned is an S(i1 + f2,f1 − f2) number.
Signed Division The result of dividing two signed fixed-point numbers is an S(i1 + f2 + 1,f1 − f2) number.
Consider the results when a U(3,3), a = 000.001 = 0.125₁₀, is divided by a U(4,5), b = 1000.00000 = 8.0₁₀. The quotient is q = 0.000001₂ = 0.015625₁₀, which requires six bits in the fractional part. However, if we simply perform the division, then according to the rules shown above, the result will be a U(3 + 5, 3 − 5) = U(8,−2). There is no such thing as a U(8,−2), so the result is meaningless.
When f2 > f1, blindly applying the rules will result in a negative number of fractional bits. To avoid this, the dividend can be shifted left so that it has at least as many fractional bits as the divisor. This leads to the following rule: if f2 > f1, then convert the dividend to an S(i1,x), where x ≥ f2, then apply the appropriate rule. For example, dividing an S(5,2) by a U(3,12) would result in an S(17,−10). But shifting the S(5,2) 16 bits to the left will result in an S(5,18), and dividing that by a U(3,12) will result in an S(17,6).
Recall that integer division produces a result and a remainder. In order to maintain precision, it is necessary to perform the integer division operation in such a way that all of the significant bits are in the result and only insignificant bits are left in the remainder. The easiest way to accomplish this is by shifting the dividend to the left before the division is performed.
To find a rule for determining the shift necessary to maintain full precision in the quotient, consider the worst case. The minimum positive value of the dividend is 2^(−f1), and the largest positive value for the divisor is 2^i2 − 2^(−f2). Therefore, the minimum positive quotient is given by:

2^(−f1) ÷ (2^i2 − 2^(−f2)) > 2^(−f1) ÷ 2^i2 = 2^(−(i2+f1))

Therefore, in the worst case, the quotient will require i2 + f1 fractional bits to maintain precision. However, fewer bits can be reserved if full precision is not required.
Recall that the least significant bit of the quotient will be 2^(−(i2+f1)). Shifting the dividend left by i2 + f2 bits will convert it into a U(i1,i2 + f1 + f2). Using the rule above, when it is divided by a U(i2,f2), the result is a U(i1 + f2,i2 + f1). This is the minimum size which is guaranteed to preserve all bits of precision. The general method for performing fixed-point division while maintaining maximum precision is as follows:
1. shift the dividend left by i2 + f2, then
2. perform integer division.
The result will be a U(i1 + f2,i2 + f1) for unsigned division, or an S(i1 + f2 + 1,i2 + f1) for signed division. The result for mixed division is left as an exercise for the student.
Section 7.3.3 introduced the idea of converting division by a constant into multiplication by the reciprocal of that constant. In that section it was shown that by pre-multiplying the reciprocal by a power of two (a shift operation), then dividing the final result by the same power of two (a shift operation), division by a constant could be performed using only integer operations with a more efficient multiply replacing the (usually) very slow divide.
This section presents an alternate way to achieve the same results, by treating division by an integer constant as an application of fixed-point multiplication. Again, the integer constant divisor is converted into its reciprocal, but this time the process is considered from the viewpoint of fixed-point mathematics. Both methods will achieve exactly the same results, but some people tend to grasp the fixed-point approach better than the purely integer approach.
When writing code to divide by a constant, the programmer must strive to achieve the largest number of significant bits possible, while using the shortest (and most efficient) representation possible. On modern computers, this usually means using 32-bit integers and integer multiply operations which produce 64-bit results. That would be extremely tedious to show in a textbook, so the principles will be demonstrated here using 8-bit integers and an integer multiply which produces a 16-bit result.
Suppose we want efficient code to calculate x ÷ 23 using only 8-bit signed integer multiplication. The reciprocal of 23, in binary, is a repeating fraction with an 11-bit period:

R = 1/23 = 0.00001011001 00001011001 …₂
If we store R as an S(1,11), it would look like this:

Note that in this format, the reciprocal of 23 has five leading zeros. We can store R in eight bits by shifting it left to remove some of the leading zeros. Each shift to the left changes the format of R. After removing the first leading zero bit, we have:

After removing the second leading zero bit, we have:

After removing the third leading zero bit, we have:

Note that the number in the previous format has a “hidden” bit between the radix point and the sign bit. That bit is not actually stored, but is assumed to be identical to the sign bit. Removing the fourth leading zero produces:

The number in the previous format has two “hidden” bits between the radix point and the sign bit. Those bits are not actually stored, but are assumed to be identical to the sign bit. Removing the fifth leading zero produces:

We can only remove five leading zero bits, because removing one more would change the sign bit from 0 to 1, resulting in a completely different number. Note that the final format has three “hidden” bits between the radix point and the sign bit. These bits are all copies of the sign bit. It is an S(−4,8) number because the sign is four bits to the right of the radix point (resulting in the three “hidden” bits). According to the rules of fixed-point multiplication given earlier, an S(7,0) number x multiplied by an S(−4,8) number R will yield an S(4,8) number y. The value y will be (x ÷ 23) × 2³, because we have three “hidden” bits to the right of the radix point. Therefore, x ÷ 23 = y × 2^(−3), indicating that after the multiplication, we must shift the result right by three bits to restore the radix. Since 1/23 is positive, the number R must be increased by one to avoid round-off error. Therefore, we will use R + 1 = 01011010₂ = 90₁₀ in our multiply operation. To calculate y = 101₁₀ ÷ 23₁₀, we can multiply and perform a shift as follows:

Because our task is to implement integer division, everything to the right of the radix point can be immediately discarded, keeping only the upper eight bits as the integer portion of the result. The integer portion, 100011₂, shifted right three bits, is 100₂ = 4₁₀. If the modulus is required, it can be calculated as: 101 − (4 × 23) = 9. Some processors, such as the Motorola HC11, have a special multiply instruction which keeps only the upper half of the result. This method would be especially efficient on such a processor. Listing 8.2 shows how the 8-bit division code would be implemented in ARM assembly. Listing 8.3 shows an alternate implementation which uses shift and add operations rather than a multiply.


The procedure is exactly the same for dividing by a negative constant. Suppose we want efficient code to calculate x ÷ (−50) using 16-bit signed integers. We first convert 1/50 into binary:

1/50 = 0.00000101000111101011100001010001111…₂

The two's complement of 1/50 is:

−1/50 = 1.11111010111000010100011110101110…₂

We can represent −1/50 as the following S(1,21) fixed-point number:

Note that the upper seven bits are all one. We can remove six of those bits and adjust the format as follows. After removing the first leading one, the reciprocal is:

Removing another leading one changes the format to:

On the next step, the format is:

Note that we now have a “hidden” bit between the radix point and the sign bit. The hidden bit is not actually part of the number that we store and use in the computation, but it is assumed to be the same as the sign bit.
After three more leading ones are removed, the format is:

Note that there are four “hidden” bits between the radix point and the sign. Since the reciprocal −1/50 is negative, we do not need to round by adding one to the number R. Therefore, we will use R = 1010111000010101₂ = AE15₁₆ in our multiply operation.
Since we are using 16-bit integer operations, the dividend, x, will be an S(15,0). The product of an S(15,0) and an S(−5,16) will be an S(11,16). We will remove the 16 fractional bits by shifting right. The four “hidden” bits indicate that the result must be shifted an additional four bits to the right, resulting in a total shift of 20 bits. Listing 8.4 shows how the 16-bit division code would be implemented in ARM assembly.

Sometimes we need more range than we can easily get from fixed precision. One approach to solving this problem is to create an aggregate data type that can represent a fractional number by having fields for an exponent, a sign bit, and an integer mantissa. For example, in C, we could represent a fractional number using the data structure shown in Listing 8.5. That data structure, along with some subroutines for addition, subtraction, multiplication and division, would provide the capability to perform arithmetic without explicitly tracking the radix point. The subroutines for the basic arithmetical operations could do that, thereby freeing the programmer to work at a higher level.

The structure shown in Listing 8.5 is a rather inefficient way to represent a fractional number, and may be laid out differently on different machines. The sign only requires one bit, but the sizes allotted to the sign, exponent, and mantissa depend upon the machine and compiler for which the code is built.
The C language includes the notion of bit fields, which allows the programmer to specify exactly how many bits are to be used for each field within a struct. Listing 8.6 shows a C data structure that consumes 32 bits on all machines and architectures. It provides the same fields as the structure in Listing 8.5, but specifies exactly how many bits each field consumes: one bit for the sign, eight bits for the exponent, and 23 bits for the mantissa.

The compiler will compress this data structure into 32 bits, regardless of the natural word size of the machine.
The method of representing fractional numbers as a sign, exponent, and mantissa is very powerful, and the IEEE has set standards for various floating point formats. These formats can be described using bit fields in C, as described above. Many processors have hardware that is specifically designed to perform arithmetic on data in the standard IEEE formats. The following sections describe the most commonly used IEEE-defined formats.
The IEEE standard specifies the bitwise representation for numbers, and specifies parameters for how arithmetic is to be performed. The IEEE standard for numbers includes the possibility of having numbers that cannot be easily represented. For example, any quantity that is greater than the most positive representable value is positive infinity, and any quantity that is less than the most negative representable value is negative infinity. There are special bit patterns to encode these quantities. The programmer or hardware designer is responsible for ensuring that their implementation conforms to the IEEE standards. The following sections describe some of the IEEE standard data formats.
The half-precision format gives a 16-bit encoding for fractional numbers with a small range and low precision. There are situations where this format is adequate. If the computation is being performed on a very small machine, then using this format may result in significantly better performance than could be attained using one of the larger IEEE formats. However, in most situations, the programmer can achieve better performance and/or precision by using a fixed-point representation. The format is as follows:

• The Significand (a.k.a. “Mantissa”) is stored using a sign-magnitude coding, with bit 15 being the sign bit.
• The exponent is an excess-15 number. That is, the number stored is 15 greater than the actual exponent.
• There are 10 bits of significand, but there are 11 bits of significand precision. There is a “hidden” bit, m10, between m9 and e0. When a number is stored in this format, it is shifted until its leftmost non-zero bit is in the hidden bit position, and the hidden bit is not actually stored. The exception to this rule is when the number is zero or very close to zero. The radix point is assumed to be between the hidden bit and the first bit stored. The radix point is then shifted by the exponent.
Table 8.1 shows how to interpret IEEE 754 Half-Precision numbers. The exponents 00000 and 11111 have special meaning. The value 00000 is used to represent zero and numbers very close to zero, and the exponent value 11111 is used to represent infinity and NaN. NaN, which is the abbreviation for not a number, is a value representing an undefined or unrepresentable value. One way to get NaN as a result is to divide infinity by infinity. Another is to divide zero by zero. The NaN value can indicate that there is a bug in the program, or that a calculation must be performed using a different method.
Table 8.1
Format for IEEE 754 half-precision
| Exponent | Significand = 0 | Significand ≠ 0 | Equation |
| 00000 | ±0 | subnormal | (−1)^sign × 2^(−14) × 0.significand |
| 00001 … 11110 | normalized value | normalized value | (−1)^sign × 2^(exp−15) × 1.significand |
| 11111 | ±∞ | NaN | |

Subnormal means that the value is too close to zero to be completely normalized. The minimum strictly positive (subnormal) value is 2^(−24) ≈ 5.96 × 10^(−8). The minimum positive normal value is 2^(−14) ≈ 6.10 × 10^(−5). The maximum exactly representable value is (2 − 2^(−10)) × 2^15 = 65504.

The single precision format provides a 23-bit mantissa and an 8-bit exponent, which is enough to represent a reasonably large range with reasonable precision. This type can be stored in 32 bits, so it is relatively compact. At the time that the IEEE standards were defined, most machines used a 32-bit word, and were optimized for moving and processing data in 32-bit quantities. For many applications this format represents a good trade-off between performance and precision.

The double-precision format was designed to provide enough range and precision for most scientific computing requirements. It provides an 11-bit exponent and a 52-bit mantissa (53 bits of significand precision, counting the hidden bit). When the IEEE 754 standard was introduced, this format was not supported by most hardware. That has changed. Most modern floating point hardware is optimized for the IEEE 754 double-precision standard, and most modern processors are designed to move 64-bit or larger quantities. On modern floating-point hardware, this is the most efficient representation.
However, processing large arrays of double-precision data requires twice as much memory, and twice as much memory bandwidth, as single-precision.

The IEEE 754 Quad-Precision format was designed to provide enough range and precision for very demanding applications. It provides a 15-bit exponent and a 112-bit mantissa (113 bits of significand precision, counting the hidden bit). This format is still not supported by most hardware. The first hardware floating point unit to support this format was the SPARC V8 architecture. As of this writing, the popular Intel x86 family, including the 64-bit versions of the processor, does not have hardware support for the IEEE 754 quad-precision format. On modern high-end processors such as the SPARC, this may be an efficient representation. However, for mid-range processors such as the Intel x86 family and the ARM, this format is definitely out of their league.

Many processors do not have hardware support for floating point. On those processors, all floating point must be accomplished through software. Processors that do support floating point in hardware must have quite sophisticated circuitry to manage the basic operations on data in the IEEE 754 standard formats. Regardless of whether the operations are carried out in software or hardware, the basic arithmetic operations require multiple steps.
The steps required for addition and subtraction of floating point numbers are the same, regardless of the specific format. The steps for adding or subtracting two floating point numbers a and b are as follows:
1. Extract the exponents Ea and Eb.
2. Extract the significands Ma and Mb, and convert them into 2's complement numbers, using the signs Sa and Sb.
3. Shift the significand with the smaller exponent right by |Ea − Eb|.
4. Perform addition (or subtraction) on the significands to get the significand of the result, Mr. Remember that the result may require one more significant bit to avoid overflow.
5. If Mr is negative, then take the 2’s complement and set Sr to 1. Otherwise set Sr to 0.
6. Shift Mr until the leftmost 1 is in the “hidden” bit position, and adjust the exponent by the shift amount to form the new exponent Er.
7. Combine the sign Sr, the exponent Er, and significand Mr to form the result.
The complete algorithm must also provide for correct handling of infinity and NaN.
Multiplication and division of floating point numbers also requires several steps. The steps for multiplication and division of two floating point numbers a and b are as follows:
1. Calculate the sign of the result Sr.
2. Extract the exponents Ea and Eb.
3. Extract the significands Ma and Mb.
4. Multiply (or divide) the significands to form Mr.
5. Add (or subtract) the exponents (in excess-N) to get Er.
6. Shift Mr until the leftmost 1 is in the “hidden” bit position, and add the shift amount to Er.
7. Combine the sign Sr, the exponent Er, and significand Mr to form the result.
The complete algorithm must also provide for correct handling of infinity and NaN.
It has been said, and is commonly accepted, that “you can’t beat the compiler.” The meaning of this statement is that using hand-coded assembly language is futile and/or worthless because the compiler is “smarter” than a human. This statement is a myth, as will now be demonstrated.
There are many mathematical functions that are useful in programming. Two of the most useful functions are sin(x) and cos(x). However, these functions are not always implemented in hardware, particularly for fixed-point representations. If these functions are required for fixed-point computation, then they must be written in software. These two functions have some nice properties that can be exploited. In particular:
• If we have the sin function, then we can calculate cos using the relationship cos(x) = sin(x + π/2). Therefore, we only need to get the sine function working, and then we can implement cosine with only a little extra effort.
• sin(x) is cyclical, so sin(x) = sin(x + 2πk) for any integer k. This means that we can limit the domain of our function to the range [−π,π].
• sin(x) is symmetric, so that sin(−x) = −sin(x). This means that we can further restrict the domain to [0,π].
• After we restrict the domain to [0,π], we notice another symmetry, sin(x) = sin(π − x), and we can further restrict the domain to [0,π/2].
• The range of both functions, sin(x) and cos(x), is [−1,1].
If we exploit all of these properties, then we can write a single shared function to be used by both sine and cosine. We will name this function sinq, and choose the following fixed-point formats:
• sinq will accept x as an S(1,30), and
• sinq will return an S(1,30).
These formats were chosen because S(1,30) is a good format for storing a signed number between zero and π/2, and it is also the optimal format for storing a signed number between negative one and one.
The sine function will map x into the domain accepted by sinq and then call sinq to do the actual work. If the result should be negative, then the sine function will negate it before returning. The cosine function will use the relationship previously mentioned, and call the sine function.
We have now reduced the problem to one of approximating sin(x) within the range [0,π/2]. An approximation to the function sin(x) can be calculated using the Taylor series:

sin(x) = x − x³/3! + x⁵/5! − x⁷/7! + x⁹/9! − x¹¹/11! + x¹³/13! − ⋯

The first few terms of the series should be sufficient to achieve a good approximation. The maximum value possible for the seventh term is (π/2)¹³/13! ≈ 5.7 × 10⁻⁸ ≈ 2⁻²⁴, which indicates that our function should be accurate to at least 25 bits using seven terms. If more accuracy is desired, then additional terms can be added.
The numerators in the first nine terms of the Taylor series approximation are: x, x³, x⁵, x⁷, x⁹, x¹¹, x¹³, x¹⁵, and x¹⁷. Given an S(1,30) format for x, we can predict the format for the numerator of each successive term in the Taylor series. If we simply perform successive multiplies, then we would get the following formats for the powers of x:
| Term | Full-Precision Format | 32-bit Format |
| x | S(1,30) | S(1,30) |
| x³ | S(3,90) | S(3,28) |
| x⁵ | S(5,150) | S(5,26) |
| x⁷ | S(7,210) | S(7,24) |
| x⁹ | S(9,270) | S(9,22) |
| x¹¹ | S(11,330) | S(11,20) |
| x¹³ | S(13,390) | S(13,18) |
The middle column in the table shows how quickly the full-precision formats grow: extended to x¹⁷, the format would be an S(17,510), requiring 528 bits if all of the fractional bits are retained. Dealing with a number at that level of precision would be slow and impractical. We will, of necessity, need to limit the number of bits used. Since the ARM processor provides a multiply instruction involving two 32-bit numbers, we choose to truncate the numerators to 32 bits. The third column in the table indicates the resulting format for each term if precision is limited to 32 bits.
On further consideration of the Taylor series, we notice that each of the above terms will be divided by a constant. Instead of dividing, we can multiply by the reciprocal of the constant. We will create a similar table holding the formats and constants for the factorial terms. With a bit of luck, the division (implemented as multiplication) in each term will result in a reasonable format for each resulting term.
The first term of the Taylor series is x, so we can simply skip the division. The second term is −x³/3! = −x³ × (1/6), and the third term is x⁵/5! = x⁵ × (1/120). We can convert 1/6 to binary as follows:

1/6 = 0.00101010101…₂

Since the pattern repeats, we can conclude that 1/6 = 0.00101010…₂. Since we need a negative number, we take the two's complement, resulting in −1/6 = 11.11010101…₂. Represented as an S(1,30), this would be

11.110101010101010101010101010101₂

Since the first four bits are one, we can remove three bits and store it as:

10101010101010101010101010101010₂

In hexadecimal, this is AAAAAAAA₁₆.
Performing the same operations, we find that 1/120 can be converted to binary as follows:

1/120 = 0.0000001000100010001…₂

Since the fraction in the seventh row is the same as the fraction in the third row, we know that the pattern will repeat forever. Therefore, 1/120 = 0.000000100010001…₂. Since the first six bits to the right of the radix are all zero, we can remove the first five bits. Also adding one to the least significant bit to account for rounding error yields the following S(−6,32):

01000100010001000100010001000101₂

In hexadecimal, the number to be multiplied is 44444445₁₆. Note that since 1/120 is a positive number, the reciprocal was incremented by one to avoid round-off errors. We can apply the same procedure to the remaining terms, resulting in the reciprocal values shown in Table 8.2.
We want to keep as much precision as is reasonably possible for our intermediate calculations. Using 64 bits of precision for all intermediate calculations will give a good trade-off between performance and precision. The integer portion should never require more than two bits, so we choose an S(2,61) as our intermediate representation. If we combine the previous two tables, we can determine what the format of each complete term will be. This is shown in Table 8.2.
Table 8.2
Result formats for each term
| Term | Numerator Value | Numerator Format | Reciprocal Value | Reciprocal Format | Reciprocal Hex | Result Format |
| 1 | x | S(1,30) | — | Extend to 64 bits and shift right | — | S(2,61) |
| 2 | x³ | S(3,28) | −1/3! | S(−2,32) | AAAAAAAA | S(2,61) |
| 3 | x⁵ | S(5,26) | 1/5! | S(−6,32) | 44444444 | S(0,63) |
| 4 | x⁷ | S(7,24) | −1/7! | S(−12,32) | 97F97F97 | S(−4,64) |
| 5 | x⁹ | S(9,22) | 1/9! | S(−18,32) | 5C778E96 | S(−8,64) |
| 6 | x¹¹ | S(11,20) | −1/11! | S(−25,32) | 9466EA60 | S(−13,64) |
| 7 | x¹³ | S(13,18) | 1/13! | S(−32,32) | 5849184F | S(−18,64) |

Note that the formats were truncated to fit in a 64-bit result. We can now see that the formats for the first nine terms of the Taylor series are reasonably similar. They all require exactly 64 bits, and the radix points can be shifted so that they are aligned for addition. In order to make the shifting and adding process easier, we will pre-compute the shift amounts and store them in a look-up table.
Table 8.3 shows the shifts that are necessary to convert each term to an S(2,61) so that it can be added to the running total.
Table 8.3
Shifts required for each term
| Term Number | Original Format | Shift Amount | Resulting Format |
| 1 | S(1,30) | 1 | S(2,61) |
| 2 | S(2,61) | 0 | S(2,61) |
| 3 | S(0,63) | 2 | S(2,61) |
| 4 | S(−4,64) | 6 | S(2,61) |
| 5 | S(−8,64) | 10 | S(2,61) |
| 6 | S(−13,64) | 15 | S(2,61) |
| 7 | S(−18,64) | 20 | S(2,61) |

Note that the seventh term contributes very little to the final 32-bit sum which is stored in the upper 32 bits of the running total. We now have all of the information that we need in order to implement the function. Listing 8.7 shows how the sine and cosine functions can be implemented in ARM assembly using fixed-point computation, and Listing 8.8 shows a main program which prints a table of values and their sines and cosines.






Listing 8.7 computes sin and cos using fixed-point calculations, and Listing 8.8 demonstrates how the sin and cos functions can be used to print a table.

In some situations it can be very advantageous to use fixed-point math. For example, when using an ARMv6 or older processor, there may not be a hardware floating point unit available. Table 8.4 shows the CPU time required for running a program to compute the sine function on 10,000,000 random values, using various implementations of the sine function. In each case, the program main() function was written in C. The only difference in the six implementations was the data type (which could be fixed-point, IEEE single precision, or IEEE double precision), and the sine function that was used. The times shown in the table include only the amount of CPU time actually used in the sine function, and do not include the time required for program startup, storage allocation, random number generation, printing results, or program exit. The six implementations are as follows:
Table 8.4
Performance of sine function with various implementations
| Optimization | Implementation | CPU seconds |
| None | 32-bit Fixed Point Assembly | 3.85 |
| None | 32-bit Fixed Point C | 18.99 |
| None | Single Precision Software Float C | 56.69 |
| None | Double Precision Software Float C | 55.95 |
| None | Single Precision VFP C | 11.60 |
| None | Double Precision VFP C | 11.48 |
| Full | 32-bit Fixed Point Assembly | 3.22 |
| Full | 32-bit Fixed Point C | 5.02 |
| Full | Single Precision Software Float C | 20.53 |
| Full | Double Precision Software Float C | 54.51 |
| Full | Single Precision VFP C | 3.70 |
| Full | Double Precision VFP C | 11.08 |
32-bit Fixed Point Assembly The sine function is computed using the code shown in Listing 8.7.
32-bit Fixed Point C The sine function is computed using exactly the same algorithm as in Listing 8.7, but it is implemented in C rather than Assembly.
Single Precision Software Float C Sine is computed using the floating point sine function which is provided by the GCC C compiler. The code is compiled for an ARMv6 or earlier processor without hardware floating point support. The C code is written to use IEEE single precision floating point numbers.
Double Precision Software Float C Exactly the same as the previous method, but using IEEE double precision instead of single precision.
Single Precision VFP C Sine is computed using the floating point sine function which is provided by the GCC C compiler. The code is compiled for the ARMv6 or later processor using hardware floating point support. The C code is written to use IEEE single precision floating point numbers.
Double Precision VFP C Same as the previous method, but using IEEE double precision instead of single precision.
Each of the six implementations was compiled both with and without compiler optimizations, resulting in a total of 12 test cases. All cases were run on a standard Raspberry Pi model B with the default CPU clock rate.
From Table 8.4, it is clear that the fixed-point implementation written in assembly beats the code generated by the compiler in every case. The closest that the compiler can get is when it can use the VFP hardware floating point unit and the compiler is run with full optimization. Even in that case the fixed-point assembly implementation is almost 15% faster than the single precision floating point implementation, and has 33% more precision (32 bits versus 24 bits). In the worst case, when a VFP hardware unit is not available, the assembly code beats the compiler by a whopping 638% in speed and 33% in precision for single precision floats, and is 1692% faster than double precision floating point at a cost of 41% in precision. Note that even with floating point hardware support, fixed point in assembly is still 3.44 times as fast as the C compiler code.
Similar results could be obtained on any processor architecture, and any reasonably complex mathematical problem. When developing software for small systems, the developer must weigh the costs and benefits of alternative implementations. For battery powered systems, it is important to realize that choices of hardware and software can affect power consumption even more strongly than computing performance. First, the power used by a system which includes a hardware floating point processor will be consistently higher than that of a system without one. Second, the reduction in processing time required for the job is closely related to the reduction in power required. Therefore, for battery operated systems, a fixed-point implementation could greatly extend battery life. The following statements summarize the results from the experiment in this section:
1. A competent assembly programmer can beat the compiler, in some cases by a very large margin.
2. If computational performance is critical, then a well-designed fixed-point implementation will usually outperform even a hardware-accelerated floating point implementation.
3. If there is no hardware support for floating point, then floating point performance is extremely poor, and fixed point will always provide the best performance.
4. If battery life is a consideration, then a fixed-point implementation can have an enormous advantage.
Note also from the table that the assembly language version of the fixed-point sine function beats the identical C version by a wide margin. Section 9.8.2 will demonstrate that a good assembly language programmer who is familiar with the floating point hardware can beat the compiler by an even wider performance margin.
Fixed-point arithmetic is very efficient on modern computers. However it is incumbent upon the programmer to track the radix point at all stages of the computation, and to ensure that a sufficient number of bits are provided on both sides of the radix point. The programmer must ensure that all computations are carried out with the desired level of precision, resolution, accuracy, range, and dynamic range. Failure to do so can have serious consequences.
On February 25, 1991, during the Gulf War, an American Patriot Missile battery in Dhahran, Saudi Arabia, failed to intercept an incoming Iraqi SCUD missile. The SCUD struck an American army barracks, killing 28 soldiers and injuring around 98 other people. The cause was an inaccurate calculation of the time elapsed since the system was last booted.
The hardware clock on the system counted the time in tenths of a second since the last reboot. Current time, in seconds, was calculated by multiplying that number by 1/10. For this calculation, 1/10 was represented as a U(1,23) fixed-point number. Since 1/10 cannot be represented precisely in a fixed number of bits, there was round-off error in the calculations. The small imprecision, when multiplied by a large number, resulted in significant error. The longer the system ran after boot, the larger the error became.
The system determined whether or not it should fire by predicting where the incoming missile would be at a specific time in the future. The time and predicted location were then fed to a second system which was responsible for locking onto the target and firing the Patriot missile. The system would only fire when the missile was at the proper location at the specified time. If the radar did not detect the incoming missile at the correct time and location, then the system would not fire.
At the time of the failure, the Patriot battery had been up for around 100 h. We can estimate the error in the timing calculations by considering how the binary number was stored. The binary representation of 1/10 is 0.000110011001100110011001100…₂. Note that it is a non-terminating, repeating binimal. The 24-bit register in the Patriot could only hold the following set of bits:

0.00011001100110011001100₂

This resulted in an error of 0.00000000000000000000000110011…₂. The error can be computed in base 10 as:

e = 0.1₁₀ − 0.00011001100110011001100₂ = 0.8 × 2⁻²³ ≈ 9.5 × 10⁻⁸
To find out how much error was in the total time calculation, we multiply e by the number of tenths of a second in 100 h. This gives 9.5 × 10⁻⁸ × 100 × 60 × 60 × 10 ≈ 0.34 s. A SCUD missile travels at about 1,676 m/s. Therefore it travels about 570 m in 0.34 s. Because of this, the targeting and firing system was expecting to find the SCUD at a location that was over half a kilometer from where it really was. This was far enough that the incoming SCUD was outside the “range gate” that the Patriot tracked. It did not detect the SCUD at its predicted location, so it could not lock on and fire the Patriot.
This is an example of how a seemingly insignificant error can lead to a major failure. In this case, it led to loss of life and serious injury. Ironically, one factor that contributed to the problem was that part of the code had been modified to provide more accurate timing calculations, while another part had not. This meant that the inaccuracies did not cancel each other. Had both sections of code been re-written, or neither section changed, then the issue probably would not have surfaced.
The Patriot system was originally designed in 1974 to be mobile and to defend against aircraft, which move much more slowly than ballistic missiles. It was expected that the system would be moved often, and therefore the computer would be rebooted frequently. Also, a slow-moving aircraft would be much easier to track, and the error in predicting where it was expected to be would not be significant. The system was modified in 1986 to be capable of shooting down Soviet ballistic missiles. A SCUD missile travels at about twice the speed of the Soviet missiles that the system was re-designed for.
The system was deployed to Saudi Arabia in 1990, and successfully shot down a SCUD missile in January of 1991. In mid-February of 1991, Israeli troops discovered that the system became inaccurate if it was allowed to run for long periods of time. They claimed that the system would become unreliable after 20 hours of operation. The U.S. military did not think the discovery was significant, but on February 16th, a software update was released. Unfortunately, the update could not immediately reach all units because of wartime difficulties in transportation. The Army released a memo on February 21st, stating that the system was not to be run for “very long times,” but did not specify how long a “very long time” would be. The software update reached Dhahran one day after the Patriot Missile system failed to intercept a SCUD missile, resulting in the death of 28 Americans and many more injuries.
Part of the reason this error was not found sooner was that the program was written in assembly language, and had been patched several times over its 15-year life. The code was difficult to understand and maintain, and did not conform to good programming practices. The people who modified the code to handle SCUD missiles were not as familiar with it as they would have been if it had been written more recently, and time was a critical factor. Prolonged testing could itself have caused a disaster by keeping the system out of the hands of soldiers in a time of war. The people at Raytheon had some tough decisions to make. It cannot be said that Raytheon was guilty of negligence or malpractice. The problem was not necessarily the developers, but the fact that the system was modified often and in inconsistent ways, without a complete understanding of its behavior.
Sometimes it is desirable to perform calculations involving non-integral numbers. The two common ways to represent non-integral numbers in a computer are fixed point and floating point. A fixed point representation allows the programmer to perform calculations with non-integral numbers using only integer operations. With fixed point, the programmer must track the radix point throughout the computation. Floating point representations allow the radix point to be tracked automatically, but require much more complex software and/or hardware. Fixed point will usually provide better performance than floating point, but requires more programming skill.
Fractional numbers in radix notation may not terminate in all bases. Numbers which terminate in base two will also terminate in base ten, but the converse is not true. Programmers should avoid counting with fractions which do not terminate in base two, because doing so leads to the accumulation of round-off errors.
8.1 Perform the following base conversions:
(a) Convert 10110.001₂ to base ten.
(b) Convert 11000.0101₂ to base ten.
(c) Convert 10.125₁₀ to binary.
8.2 Complete the following table (assume all values represent positive fixed-point numbers):
8.3 You are working on a problem involving real numbers between −2 and 2 on a computer that has 16-bit integer registers and no hardware floating point support. You decide to use 16-bit fixed-point arithmetic.
(a) What fixed-point format should you use?
(b) Draw a diagram showing the sign, if any, radix point, integer part, and fractional part.
(c) What is the precision, resolution, accuracy, and range of your format?
8.4 What is the resulting type of each of the following fixed-point operations?
(b) S(3,4)÷U(4,20)
8.5 Convert 26.640625₁₀ to a binary U(18,14) representation. Show the ARM assembly code necessary to load that value into register r4.
8.6 For each of the following fractions, indicate whether or not it will terminate in bases 2, 5, 7, and 10.
(b) 
(c) 
(d) 
(e) 
8.7 What is the exact value of the binary number 0011011100011010 when interpreted as an IEEE half-precision number? Give your answer in base ten.
8.8 The “Software Engineering Code of Ethics and Professional Practice” states that a responsible software engineer should “Approve software only if they have well-founded belief that it is safe, meets specifications, passes appropriate tests…” (sub-principle 1.03) and “Ensure adequate testing, debugging, and review of software…on which they work” (sub-principle 3.10).
The software engineering code of ethics also states that a responsible software engineer should “Treat all forms of software maintenance with the same professionalism as new development.”
(a) Explain how the Software Engineering Code of Ethics and Professional Practice was violated by the Patriot Missile system developers.
(b) How should the engineers and managers at Raytheon have responded when they were asked to modify the Patriot Missile System to work outside of its original design parameters?
(c) What other ethical and non-ethical considerations may have contributed to the disaster?
This chapter begins by giving an overview of the ARM Vector Floating Point (VFP) coprocessor and the ARM VFP register set. Next, it gives an overview of the Floating Point Status and Control Register (FPSCR). It then explains RunFast mode, which gives higher performance but is not fully compliant with the IEEE floating point standards. That is followed by an explanation of vector mode, which can give an additional performance boost in some situations. Then, after a short discussion of the register usage rules, it provides a short description of each of the VFP instructions. Next, it presents four implementations of a function to calculate sine using the ARM VFP coprocessor, and shows that they are all significantly faster than the implementation provided by GCC.
Floating point; Vector; IEEE Compliance; Performance
Some ARM processors have dedicated hardware to support floating point operations. For ARMv7 and previous architectures, floating point is provided by an optional Vector Floating Point (VFP) coprocessor. Many newer processors also support the NEON extensions, which are covered in Chapter 10. The remainder of this chapter will explain the VFP coprocessor.
There are four major revisions of the VFP coprocessor:
VFPv1: An obsolete version, which is no longer documented by ARM.
VFPv2: An optional extension to the ARMv5 and ARMv6 processors. VFPv2 has 16 64-bit FPU registers.
VFPv3: An optional extension to the ARMv7 processors. It is backwards compatible with VFPv2, except that it cannot trap floating-point exceptions. VFPv3-D32 has 32 64-bit FPU registers. Some processors have VFPv3-D16, which supports only 16 64-bit FPU registers. VFPv3 adds several new instructions to the VFP instruction set.
VFPv4: Implemented on some Cortex ARMv7 processors. VFPv4 has 32 64-bit FPU registers. It adds both half-precision extensions and multiply-accumulate instructions to the features of VFPv3. Some processors have VFPv4-D16, which supports only 16 64-bit FPU registers.
Fig. 9.1 shows the 16 ARM integer registers, and the additional registers provided by the VFP coprocessor. Banks four through seven are only present on the VFPv3-D32 and VFPv4-D32 versions of the coprocessor. Note that each register in Banks zero through three can be used to store either one 64-bit number or two 32-bit numbers. For example, double precision register d0 may also be referred to as single precision registers s0 and s1. Each 32-bit VFP register can hold an integer or a single precision floating point number. Registers in Banks four through seven cannot be used as single precision registers.

The VFP adds about 23 new instructions to the ARM instruction set. The exact number of VFP instructions depends on the specific version of the VFP coprocessor. Instructions are provided to:
• transfer floating point values between VFP registers,
• transfer floating-point values between the VFP coprocessor registers and main memory,
• transfer 32-bit values between the VFP coprocessor registers and the ARM integer registers,
• perform addition, subtraction, multiplication, and division, involving two source registers and a destination register,
• compute the square root of a value,
• perform combined multiply-accumulate operations,
• perform conversions between various integer, fixed point, and floating point representations, and
• compare floating-point values.
In addition to performing basic operations involving two source registers and one destination register, VFP instructions can also perform operations involving registers arranged as short vectors (arrays) of up to eight single-precision values or four double-precision values. A single instruction can be used to perform operations on all of the elements of such vectors. This feature can substantially accelerate computation on arrays and matrices of floating point data. This type of data is common in graphics and signal processing applications. Vector mode can reduce code size and increase speed of execution by supporting parallel operations and multiple transfers.
The Floating Point Status and Control Register (FPSCR) is similar to the CPSR register. The FPSCR stores status bits from floating point operations in much the same way as the CPSR stores status bits from integer operations. The programmer can also write to certain bits in the FPSCR to control the behavior of the VFP coprocessor. The layout of the FPSCR is shown in Fig. 9.2. The meaning of each field is as follows:

N The Negative flag is set to one by vcmp if Fd < Fm.
Z The Zero flag is set to one by vcmp if Fd = Fm.
C The Carry flag is set to one by vcmp if Fd = Fm, or Fd > Fm, or Fd and Fm are unordered.
V The oVerflow flag is set to one by vcmp if Fd and Fm are unordered.
QC NEON only. The saturation cumulative flag is set to one by saturating instructions if saturation has occurred.
DN Default NaN mode:
0: Disable Default NaN mode. NaN operands propagate through to the output of a floating-point operation.
1: Enable Default NaN mode. Any operation involving one or more NaNs returns the default NaN.
The default single precision NaN is 7FC00000₁₆ and the default double-precision NaN is 7FF8000000000000₁₆. Default NaN mode does not comply with the IEEE 754 standard, but may increase performance. NEON instructions ignore this bit and always use Default NaN mode.
FZ Flush-to-Zero mode:
0: Disable Flush-to-Zero mode.
1: Enable Flush-to-Zero mode.
Flush-to-Zero mode replaces subnormal numbers with 0. This does not comply with the IEEE 754 standard, but may increase performance. NEON instructions ignore this bit and always use Flush-to-Zero mode.
RMODE Sets the rounding mode:
00 Round to Nearest (RN).
01 Round towards Plus infinity (RP).
10 Round towards Minus infinity (RM).
11 Round towards Zero (RZ).
NEON instructions ignore these bits and always use Round to Nearest mode.
STRIDE Sets the stride (distance between items) for vector operations:
00 Stride is 1.
01 Reserved.
10 Reserved.
11 Stride is 2.
LEN Sets the vector length for vector operations:
000 Vector length is 1 (scalar mode).
001 Vector length is 2.
010 Vector length is 3.
011 Vector length is 4.
100 Vector length is 5.
101 Vector length is 6.
110 Vector length is 7.
111 Vector length is 8.
IDE Input Denormal (subnormal) exception Enable:
1: An exception is generated when one or more operand is subnormal.
IXE IneXact exception Enable:
1: An exception is generated when the result contains more significand bits than the destination format can hold, and must be rounded.
UFE UnderFlow exception Enable:
1: An exception is generated when the result is closer to zero than can be represented by the destination format.
OFE OverFlow exception Enable:
1: An exception is generated when the result is farther from zero than can be represented by the destination format.
DZE Division by Zero exception Enable:
1: An exception is generated by divide instructions when the divisor is zero or subnormal.
IOE Invalid Operation exception Enable:
1: An exception is generated when the result is not defined, or cannot be represented. For example, adding positive and negative infinity gives an invalid result.
IDC The Input Subnormal Cumulative flag is set to one when an IDE condition has occurred.
IXC The IneXact Cumulative flag is set to one when an IXE condition has occurred.
UFC The UnderFlow Cumulative flag is set to one when a UFE condition has occurred.
OFC The OverFlow Cumulative flag is set to one when an OFE condition has occurred.
DZC The Division by Zero Cumulative flag is set to one when a DZE condition has occurred.
IOC The Invalid Operation Cumulative flag is set to one when an IOE condition has occurred.
The only VFP instruction that can be used to update the status flags in the FPSCR is vcmp, which is similar to the integer cmp instruction. To use the FPSCR flags to control conditional instructions, including conditional VFP instructions, they must first be moved into the CPSR register. Table 9.1 shows the meanings of the FPSCR flags when they are transferred to the CPSR and used for conditional execution on following instructions. The following rules govern how the bits in the FPSCR may be changed by subroutines:
Table 9.1
Condition code meanings for ARM and VFP
| <cond> | ARM Data Processing Instruction | VFP vcmp Instruction |
| AL | Always | Always |
| EQ | Equal | Equal |
| NE | Not Equal | Not equal, or unordered |
| GE | Signed greater than or equal | Greater than or equal |
| LT | Signed less than | Less than, or unordered |
| GT | Signed greater than | Greater than |
| LE | Signed less than or equal | Less than or equal, or unordered |
| HI | Unsigned higher | Greater than, or unordered |
| LS | Unsigned lower or same | Less than or equal |
| HS | Carry set/unsigned higher or same | Greater than or equal, or unordered |
| CS | Same as HS | Same as HS |
| LO | Carry clear/unsigned lower | Less than |
| CC | Same as LO | Same as LO |
| MI | Negative | Less than |
| PL | Positive or zero | Greater than or equal, or unordered |
| VS | Overflow | Unordered (at least one NaN operand) |
| VC | No overflow | Not unordered |
1. Bits 27–31, 0–4, and 7 do not need to be preserved.
2. Subroutines may modify bits 8–12, 15, and 22–25, but the practice is discouraged. These bits should only be changed by specific support subroutines which change the global state of the program. If they are modified within a subroutine, then their original values must be restored before the function returns or calls another function.
3. Bits 16–18 and bits 20–21 may be changed by a subroutine, but must be set to zero before the function returns or calls another function.
4. All other bits are reserved for future use and must not be modified.
Floating point operations are complex, and there are many special cases, such as dealing with NaNs, infinities, and subnormals. These special cases are a normal part of performing floating point math, but they are relatively infrequent. In order to simplify the hardware, many special situations which occur infrequently are handled by software. When one of these exceptional situations occurs, the VFP hardware sets the appropriate flags in the FPSCR and generates an interrupt. The ARM CPU then executes an interrupt handler to deal with the exceptional situation. When the routine finishes, it returns to the point where the exception occurred and execution resumes just as if the situation had been dealt with by the hardware. This approach is taken by many processor architectures to reduce the complexity, cost, and/or power consumption of the floating point hardware. This approach also allows the programmer to make a trade-off between performance and strict IEEE 754 compliance.
The support code for dealing with VFP exceptions is included in most ARM-based operating systems. Even bare-metal embedded systems can include the VFP support service routines. With the support code enabled, the VFP coprocessor is fully compliant with the IEEE 754 standard. However, using the fully compliant mode does increase the average run-time for floating point code, and increases the size of the operating system kernel or embedded system code.
When all of the VFP exceptions are disabled, Default NaN mode is enabled, and Flush-to-Zero is enabled, the VFP is not fully compliant with the IEEE 754 standard. However, floating point code runs significantly faster. For that reason, the state when bits 8–12 and bit 15 are set to zero while bits 24 and 25 are set to one is referred to as RunFast mode. There is some loss of accuracy for very small values, but the hardware no longer has to check for many of the conditions that may stall the floating point pipeline. This results in fewer stalls and much higher throughput in the hardware, as well as eliminating the necessity to handle exceptions in software. Many other floating point architectures have similar modes, so the GCC developers have found it worthwhile to provide programmers with the option of using them. User applications can be compiled to use this mode with GCC by using the -ffast-math and/or -Ofast options during compilation and linking. The startup code in the C standard library will then set the VFP to RunFast mode before calling the main function.
A VFP vector consists of up to eight single-precision registers, or up to four double-precision registers. All of the registers in a vector must be in the same bank. Also, vectors cannot be stored in Bank 0 or Bank 4. For example, registers s8 through s10 could be treated as a vector of three single-precision values. Registers s14 through s17 cannot be treated as a vector because some of those registers are in Bank 1 and others are in Bank 2. Registers d0 through d3 cannot be treated as a vector because they are in Bank 0.
The LEN field in the FPSCR controls the length of vectors that are used for vector operations. In vector operations, the first register in the vector is given as the operand, and the remaining registers are inferred from the settings of LEN and STRIDE. The STRIDE field allows data to be interleaved. For example, if the stride is set to two, and length is set to four, then the vector starting at s8 would consist of registers s8, s10, s12, and s14, while the vector starting at s9 would consist of registers s9, s11, s13, and s15. If a vector runs off the end of a bank, then the address wraps around to the first register in the bank. For example, if length is set to six and stride is set to one, then the vector starting at s13 would consist of s13, s14, s15, s8, s9, and s10, in that order.
The vector-capable data-processing instructions have one of the following two forms:

Op Fd, Fm
Op Fd, Fn, Fm
where Op is the VFP instruction, Fd is the destination register (or the first register in a vector), Fn is an operand register (or the first register in a vector), and Fm is an operand register (or the first register in a vector). Most data-processing instructions can operate in scalar mode, mixed mode, or vector mode. The mode depends on the LEN bits in the FPSCR, as well as on which register banks contain the destination and operand(s).
• The operation is scalar if the LEN field is set to zero (scalar mode) or the destination operand, Fd, is in Bank 0 or Bank 4. The operation acts on Fm (and Fn if the operation uses two operands) and places the result in Fd.
• The operation is mixed if the LEN field is not set to zero and Fm is in Bank 0 or Bank 4 but Fd is not. If the operation has only one operand, then the operation is applied to Fm and copies of the result are stored into each register in the destination vector. If the operation has two operands, then it is applied with the scalar Fm and each element in the vector starting at Fn, and the result is stored in the vector beginning at Fd.
• The operation is vector if the LEN field is not set to zero and neither Fd nor Fm is in Bank 0 or Bank 4. If the operation has only one operand, then the operation is applied to the vector starting at Fm and the results are placed in the vector starting at Fd. If the operation has two operands, then it is applied with corresponding elements from the vectors starting at Fm and Fn, and the result is stored in the vector beginning at Fd.
As with the integer registers, there are rules for using the VFP registers. These rules are a convention, and following the convention ensures interoperability between code written by different programmers and compilers. Registers s16 through s31 are non-volatile. This implies that d8 through d15 are also non-volatile, since they are really the same registers. The contents of these registers must be preserved across subroutine calls. The remaining registers (s0 through s15, also known as d0 through d7) are volatile. They are used for passing arguments, returning results, and for holding local variables. They do not need to be preserved by subroutines. If registers d16 through d31 are present, then they are also considered volatile.
In addition to the FPSCR, all VFP implementations contain at least two additional system registers. The Floating-point System ID register (FPSID) is a read-only register whose value indicates which VFP implementation is being provided. The contents of the FPSID can be transferred to an ARM integer register, then examined to determine which VFP version is available. There is also a Floating-point Exception register (FPEXC). Two bits of the FPEXC register provide system-level status and control. The remaining bits of this register are defined by the sub-architecture. These additional system registers should not be accessed by user applications.
The VFP provides several instructions for moving data between memory and the VFP registers. There are instructions for loading and storing single and double precision registers, and for moving multiple registers to or from memory. All of the load and store instructions require a memory address to be in one of the ARM integer registers.
The following instructions are used to load or store a single VFP register:
vldr Load VFP Register, and
vstr Store VFP Register.
• <op> may be either ld or st.
• Fd may be any single or double precision register.
• Rn may be any ARM integer register.
• <cond> is an optional condition code.
• <prec> may be either f32 or f64.

These instructions load or store multiple floating-point registers:
vldm Load Multiple VFP Registers, and
vstm Store Multiple VFP Registers.
As with the integer ldm and stm instructions, there are multiple versions for use in moving data and accessing stacks.
• <op> may be either ld or st.
ia Increment address after each transfer.
db Decrement address before each transfer.
• Rn may be any ARM integer register.
• <cond> is an optional condition code.
• <prec> may be either f32 or f64.
• <list> may be any set of contiguous single precision registers, or any set of contiguous double precision registers.
• If mode is db then the ! is required.
• vpop <list> is equivalent to vldmia sp!, <list>.
• vpush <list> is equivalent to vstmdb sp!, <list>.
| Name | Effect | Description |
| vldmia | addr ← Rn; for each Fi in <list>: Fi ← memory[addr], addr ← addr + size; if ! is present then Rn ← addr | Load multiple registers from memory starting at the address in Rn. Increment address after each load. |
| vstmia | addr ← Rn; for each Fi in <list>: memory[addr] ← Fi, addr ← addr + size; if ! is present then Rn ← addr | Store multiple registers in memory starting at the address in Rn. Increment address after each store. |
| vldmdb | addr ← Rn; for each Fi in <list>, highest register first: addr ← addr − size, Fi ← memory[addr]; Rn ← addr | Load multiple registers from memory starting at the address in Rn. Decrement address before each load. |
| vstmdb | addr ← Rn; for each Fi in <list>, highest register first: addr ← addr − size, memory[addr] ← Fi; Rn ← addr | Store multiple registers in memory starting at the address in Rn. Decrement address before each store. |

In the Effect column, size is 4 bytes for single precision registers and 8 bytes for double precision registers.


These operations are vector-capable. For details on how to use vector mode, refer to Section 9.2.2. Instructions are provided to perform the four basic arithmetic functions, plus absolute value, negation, and square root. There are also special forms of the multiply instructions that perform multiply-accumulate.
The unary operations require one source operand and a destination register. The source and destination can be the same register. There are four unary operations:
vcpy Copy VFP Register (equivalent to move),
vabs Absolute Value,
vneg Negate, and
vsqrt Square Root.
• <op> is one of cpy, abs, neg, or sqrt.
• <cond> is an optional condition code.
• <prec> may be either f32 or f64.

The basic mathematical operations require two source operands and one destination. There are five basic mathematical operations:
vadd Add,
vsub Subtract,
vmul Multiply,
vnmul Negate and Multiply, and
vdiv Divide.
• <op> is one of add, sub, mul, nmul, or div.
• <cond> is an optional condition code.
• <prec> may be either f32 or f64.

The compare instruction subtracts the value in Fm from the value in Fd and sets the flags in the FPSCR based on the result. The comparison operation will raise an exception if one of the operands is a signaling NaN. There is also a version of the instruction that will raise an exception if either operand is any type of NaN. The two comparison instructions are:
vcmp Compare, and
vcmpe Compare with Exception.
• If e is present, an exception is raised if either operand is any kind of NaN. Otherwise, an exception is raised only if either operand is a signaling NaN.
• <cond> is an optional condition code.
• <prec> may be either f32 or f64.

With the addition of all of the VFP registers, there are many more possibilities for how data can be moved. There are many more registers, and VFP registers may be 32 or 64 bits wide. This results in several possible combinations for moving data among all of the registers. The VFP instruction set includes instructions for moving data between two VFP registers, between VFP and integer registers, and between the various system registers.
The most basic move instruction involving VFP registers simply moves data between two floating point registers. The instruction is:
vmov Move Between VFP Registers.
• Fd and Fm must be the same size.
• <cond> is an optional condition code.
• <prec> is either f32 or f64.

This version of the move instruction allows 32 bits of data to be moved between an ARM integer register and a floating point register. The instruction is:
vmov Move Between VFP and One ARM Integer Register.
• Rd is an ARM integer register.
• Sd is a VFP single precision register.
• <cond> is an optional condition code.

This version of the move instruction is used to transfer 64 bits of data between ARM integer registers and floating point registers:
vmov Move Between VFP and Two ARM Integer Registers.
• Source and destination must be VFP or integer registers. One of them must be a set of ARM integer registers, and the other must be VFP coprocessor registers. The following table shows the possible choices for sources and destinations.
• Sd and Sd’ must be adjacent, and Sd’ must be the higher-numbered register.
• <cond> is an optional condition code.

There are two instructions which allow the programmer to examine and change bits in the VFP system register(s):
vmrs Move From VFP System Register to ARM Register, and
vmsr Move From ARM Register to VFP System Register.
User programs should only access the FPSCR to check the flags and control vector mode.
• VFPsysreg can be any of the VFP system registers.
• Rd can be APSR_nzcv or any ARM integer register.
• <cond> is an optional condition code.

The ARM VFP provides several instructions for converting between various floating point and integer formats. Some VFP versions also have instructions for converting between fixed point and floating point formats.
These instructions are used to convert integers to single or double precision floating point, or for converting single or double precision to integer:
vcvt Convert Between Floating Point and Integer
vcvtr Convert Floating Point to Integer with Rounding
These instructions always use a single precision register for the integer, but the floating point argument can be single precision or double precision. Some versions of the VFP do not support the double precision versions.
• The optional r makes the operation use the rounding mode specified in the FPSCR. The default is to round toward zero.
• <cond> is an optional condition code.
• The <type> can be either u32 or s32 to specify unsigned or signed integer.
• These instructions can also convert from fixed point to floating point if followed by an appropriate vmul.
| Opcode | Effect | Description |
| vcvt.f64.s32 | Dd ← ToDouble(Sm) | Convert signed integer to double |
| vcvt.f32.s32 | Sd ← ToSingle(Sm) | Convert signed integer to single |
| vcvt.f64.u32 | Dd ← ToDouble(Sm) | Convert unsigned integer to double |
| vcvt.f32.u32 | Sd ← ToSingle(Sm) | Convert unsigned integer to single |
| vcvt.s32.f32 | Sd ← ToInt32(Sm) | Convert single to signed integer |
| vcvt.u32.f32 | Sd ← ToUInt32(Sm) | Convert single to unsigned integer |
| vcvt.s32.f64 | Sd ← ToInt32(Dm) | Convert double to signed integer |
| vcvt.u32.f64 | Sd ← ToUInt32(Dm) | Convert double to unsigned integer |
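The difference between vcvt and vcvtr can be modeled in C. This is a sketch under the assumption that the FPSCR is in its default round-to-nearest mode; the model rounds ties away from zero, whereas the hardware default rounds ties to even, and the function names are mine:

```c
#include <assert.h>

/* Hypothetical models of the two float-to-integer conversions.
   cvt_trunc models vcvt (round toward zero); cvt_round approximates
   vcvtr under the default FPSCR rounding mode. */
int cvt_trunc(float f)
{
    return (int)f;  /* C casts truncate toward zero */
}

int cvt_round(float f)
{
    return (int)(f >= 0.0f ? f + 0.5f : f - 0.5f);  /* round to nearest */
}
```

For example, `cvt_trunc(2.9f)` yields 2, while `cvt_round(2.9f)` yields 3.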

VFPv3 and higher coprocessors have additional instructions used for converting between fixed point and single precision floating point:
vcvt Convert To or From Fixed Point.
• <cond> is an optional condition code.
• <td> specifies the type and size of the fixed point number, and must be one of the following:
s32 signed 32 bit value,
u32 unsigned 32 bit value,
s16 signed 16 bit value, or
u16 unsigned 16 bit value.
• The #fbits operand specifies the number of fraction bits in the fixed point number, and must be less than or equal to the size of the fixed point number indicated by <td>.
| Name | Effect | Description |
| vcvt.s32.f32 | Sd ← Sd × 2^fbits | Convert single precision to 32-bit signed fixed point. |
| vcvt.u32.f32 | Sd ← Sd × 2^fbits | Convert single precision to 32-bit unsigned fixed point. |
| vcvt.s16.f32 | Sd ← Sd × 2^fbits | Convert single precision to 16-bit signed fixed point. |
| vcvt.u16.f32 | Sd ← Sd × 2^fbits | Convert single precision to 16-bit unsigned fixed point. |
| vcvt.f32.s32 | Sd ← Sd ÷ 2^fbits | Convert signed 32-bit fixed point to single precision. |
| vcvt.f32.u32 | Sd ← Sd ÷ 2^fbits | Convert unsigned 32-bit fixed point to single precision. |
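The fixed point conversions amount to scaling by 2^fbits. A minimal C sketch follows; the function names are mine, and the real instruction also saturates out-of-range results:

```c
#include <stdint.h>

/* Model of vcvt between single precision and signed fixed point with
   fbits fraction bits. Scaling by 2^fbits moves the binary point. */
int32_t float_to_fixed(float x, int fbits)
{
    return (int32_t)(x * (float)(1 << fbits));  /* truncates toward zero */
}

float fixed_to_float(int32_t x, int fbits)
{
    return (float)x / (float)(1 << fbits);
}
```

With 8 fraction bits, 1.5 becomes the fixed point value 384 (1.5 × 256), and converting back recovers 1.5 exactly.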

A fixed point implementation of the sine function was discussed in Section 8.7, and shown to be superior to the floating point sine function provided by GCC. Now that we have covered the VFP instructions, we can write an assembly version using floating point which also performs better than the routines provided by GCC.
Listing 9.1 shows a single precision floating point implementation of the sine function, using the ARM VFPv3 instruction set. It works in a similar way to the previous fixed point code. There is a table of constants, each of which is the reciprocal of one of the factorial divisors in the Taylor series for sine. The subroutine calculates the powers of x one-by-one, and multiplies each power by the next constant in the table, summing the results as it goes. Note that the table of constants is shorter than the fixed point version of the code, because there are fewer bits of precision in a single precision floating point number than there are in the fixed point representation that was used previously.
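The algorithm just described can be sketched in C as follows. This is an illustration of the method, not the book's listing; the five-term table and the assumption that the argument has already been range-reduced are mine:

```c
/* Taylor series for sine: x - x^3/3! + x^5/5! - x^7/7! + x^9/9!.
   The table holds the signed reciprocals of the factorial divisors. */
static const float coeff[5] = {
     1.0f,             /*  1/1! */
    -1.0f / 6.0f,      /* -1/3! */
     1.0f / 120.0f,    /*  1/5! */
    -1.0f / 5040.0f,   /* -1/7! */
     1.0f / 362880.0f  /*  1/9! */
};

float taylor_sin(float x)
{
    float x2    = x * x;  /* used to step between successive odd powers */
    float power = x;      /* current odd power of x */
    float sum   = 0.0f;
    for (int i = 0; i < 5; i++) {
        sum += power * coeff[i];  /* add the next term of the series */
        power *= x2;
    }
    return sum;
}
```

The assembly version keeps the coefficient table in memory and the running power and sum in VFP registers.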

Listing 9.2 shows a double precision floating point implementation of the sine function, using the ARM VFPv3 instruction set. Again, there is a table of constants, each of which is the reciprocal of one of the factorial divisors in the Taylor series for sine. The subroutine calculates the powers of x one-by-one, and multiplies each power by the next constant in the table, summing the results as it goes. Note that the table of constants is longer than the fixed point version of the code, because there are more bits of precision in a double precision floating point number than there are in the fixed point representation that was used previously.

The previous implementations are already faster than the implementations provided by GCC. However, it may be possible to gain a little more performance by using VFP vector mode. In the single precision code, there are five terms to be added. Since single precision vectors can have up to eight elements, the code should not require any loop at all.
Listing 9.3 shows a single precision floating point implementation of the sine function, using the ARM VFPv3 instruction set in vector mode. It performs the same operations as the previous implementation, but instead of using a loop, all of the data is pre-loaded into vector banks and then a vector multiply operation is performed. The processor is then returned to scalar mode, and the summation is performed. This implementation is slightly faster than the previous version.

Listing 9.4 shows a double precision floating point implementation of the sine function, using the ARM VFPv3 instruction set in vector mode. It performs the same operations as the previous implementation, but performs the nine multiplications in three groups of three, using vector operations. Also, computing the powers of x is done within the loop, using a vector multiply. In this case, the vector code is significantly faster than the scalar version.



Table 9.2 shows the performance of various implementations of the sine function, with and without compiler optimization. The Single Precision C and Double Precision C implementations are the standard implementations provided by GCC.
Table 9.2
Performance of sine function with various implementations
| Optimization | Implementation | CPU seconds |
| None | Single Precision Scalar Assembly | 2.96 |
| | Single Precision Vector Assembly | 2.63 |
| | Single Precision C | 8.75 |
| | Double Precision Scalar Assembly | 4.59 |
| | Double Precision Vector Assembly | 3.75 |
| | Double Precision C | 9.21 |
| Full | Single Precision Scalar Assembly | 2.16 |
| | Single Precision Vector Assembly | 2.06 |
| | Single Precision C | 2.59 |
| | Double Precision Scalar Assembly | 3.88 |
| | Double Precision Vector Assembly | 3.16 |
| | Double Precision C | 8.49 |
When compiler optimization is not used, the single precision scalar VFP implementation achieves a speedup of about 2.96, and the vector implementation achieves a speedup of about 3.33 compared to the GCC implementation. The double precision scalar VFP implementation achieves a speedup of about 2.01, and the vector implementation achieves a speedup of about 2.46 compared to the GCC implementation.
When the best possible compiler optimization is used (-Ofast), the single precision scalar VFP implementation achieves a speedup of about 1.20, and the vector implementation achieves a speedup of about 1.26 compared to the GCC implementation. The double precision scalar VFP implementation achieves a speedup of about 2.19, and the vector implementation achieves a speedup of about 2.69 compared to the GCC implementation.
In most cases, the assembly versions were significantly faster than the functions provided by GCC. GCC with full optimization using single-precision numbers was competitive, but the assembly language vector implementation still beat it by over 25%. It is clear that writing some functions in assembly can result in large performance gains.
| Name | Page | Operation |
| vabs | 277 | Absolute Value |
| vadd | 278 | Add |
| vcmp | 279 | Compare |
| vcmpe | 279 | Compare with Exception |
| vcpy | 277 | Copy VFP Register |
| vcvt | 283 | Convert Between Floating Point and Integer |
| vcvt | 284 | Convert To or From Fixed Point |
| vcvtr | 283 | Convert Floating Point to Integer with Rounding |
| vdiv | 278 | Divide |
| vldm | 275 | Load Multiple VFP Registers |
| vldr | 274 | Load VFP Register |
| vmov | 280 | Move Between VFP and One ARM Integer Register |
| vmov | 281 | Move Between VFP and Two ARM Integer Registers |
| vmov | 279 | Move Between VFP Registers |
| vmrs | 282 | Move From VFP System Register to ARM Register |
| vmsr | 282 | Move From ARM Register to VFP System Register |
| vmul | 278 | Multiply |
| vneg | 277 | Negate |
| vnmul | 278 | Negate and Multiply |
| vsqrt | 277 | Square Root |
| vstm | 275 | Store Multiple VFP Registers |
| vstr | 274 | Store VFP Register |
| vsub | 278 | Subtract |
The ARM VFP coprocessor adds a great deal of power to the ARM architecture. The register set is expanded to hold up to four times the amount of data that can be held in the ARM integer registers. The additional instructions allow the programmer to deal directly with the most common IEEE 754 formats for floating point numbers. The ability to treat groups of registers as vectors adds a significant performance improvement. Access to the vector features is only possible through assembly language. The GCC compiler is not capable of using these advanced features, which gives the assembly programmer a big advantage when high-performance code is needed.
9.1 How many registers does the VFP coprocessor add to the ARM architecture?
9.2 What is the purpose of the FZ, DN, and IDE, IXE, UFE, OFE, DZE, and IOE bits in the FPSCR? What is it called when FZ and DN are set to one and all of the others are set to zero?
9.3 If a VFP coprocessor is present, how are floating point parameters passed to subroutines? How is a pointer to a floating point value (or array of values) passed to a subroutine?
9.4 Write the following C code in ARM assembly:

9.5 In the previous exercise, the C code contains a subtle bug.
b. Show two ways to fix the code in ARM assembly. Hint: One way is to change the amount of the increment, which will change the number of times that the loop executes.
9.6 The fixed point sine function from the previous chapter was not compared directly to the hand-coded VFP implementation. Based on the information in Tables 9.2 and 8.4, would you expect the fixed point sine function from the previous chapter to beat the hand-coded assembly VFP sine function in this chapter? Why or why not?
9.7 3-D objects are often stored as an array of points, where each point is a vector (array) consisting of four values, x, y, z, and the constant 1.0. Rotation, translation, scaling and other operations are accomplished by multiplying each point by a 4 × 4 transformation matrix. The following C code shows the data types and the transform operation:

Write the equivalent ARM assembly code.
9.8 Optimize the ARM assembly code you wrote in the previous exercise. Use vector mode if possible.
9.9 Since the fourth element of the point is always 1.0, there is no need to actually store it. This will reduce memory requirements by about 25%, and require one fewer multiply. The C code would look something like this:

Write optimal ARM VFP code to implement this function.
9.10 The function in the previous problem would typically be called multiple times to process an array of points, as in the following function:

This could be somewhat inefficient. Re-write this function in assembly so that the transformation of each point is done without resorting to a function call. Make your code as efficient as possible.
This chapter begins with an overview of the NEON extensions and explains the relationship between VFP and NEON. The NEON registers are described, and the syntax for NEON instructions is presented. Next, each of the NEON instructions is explained, with short examples. In some cases, extended examples and figures are provided to help explain the operation of complex instructions. After all of the instructions have been covered, another implementation of sine is presented and compared to the previous implementations and to the GCC sine function. It is shown that NEON gives a significant performance advantage over VFP, and that hand-coded assembly is much faster than the sin function provided by the compiler.
Single instruction multiple data (SIMD); Vector; Vector element; Instruction level parallelism; Lane
The ARM VFP coprocessor has been replaced or augmented by the NEON architecture on ARMv7 and higher systems. NEON extends the VFP instruction set with about 125 instructions and pseudo-instructions to support not only floating point, but also integer and fixed point. NEON also supports Single Instruction, Multiple Data (SIMD) operations. All NEON processors have the full set of 32 double precision VFP registers, but NEON adds the ability to view the register set as 16 128-bit (quadruple-word) registers, named q0 through q15.
A single NEON instruction can operate on up to 128 bits, which may represent multiple integer, fixed point, or floating point numbers. For example, if two of the 128-bit registers each contain eight 16-bit integers, then a single NEON instruction can add all eight integers from one register to the corresponding integers in the other register, resulting in eight simultaneous additions. For certain applications, this SIMD architecture can result in extremely fast and efficient implementations. NEON is particularly useful for handling streaming video and audio, but can also give very good performance on floating point intensive tasks. NEON instructions perform parallel operations on vectors. NEON deprecates the use of VFP vector mode covered in Section 9.2.2. On most NEON systems, using the VFP vector mode will result in an exception, which transfers control to the support code which emulates vector mode in software. This causes a severe performance penalty, so VFP vector mode should not be used on NEON systems.
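As an illustration, the eight simultaneous 16-bit additions described above can be modeled lane-by-lane in scalar C. One NEON vadd.i16 on q registers performs all eight additions at once; the function name here is mine:

```c
#include <stdint.h>

/* Scalar model of a single NEON vadd.i16 q0, q1, q2: each of the
   eight 16-bit lanes is added independently. */
void vadd_i16x8(const int16_t a[8], const int16_t b[8], int16_t out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (int16_t)(a[i] + b[i]);  /* each lane wraps on overflow */
}
```

The key point is that no lane's carry propagates into a neighboring lane.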
Fig. 10.1 shows the ARM integer, VFP, and NEON register set. NEON views each register as containing a vector of 1, 2, 4, 8, or 16 elements, all of the same size and type. Individual elements of each vector can also be accessed as scalars. A scalar can be 8 bits, 16 bits, 32 bits, or 64 bits. The instruction syntax is extended to refer to scalars using an index, x, in a doubleword register. Dm[x] is element x in register Dm. The size of the elements is given as part of the instruction. Instructions that access scalars can access any element in the register bank.

The GCC compiler gives C (and C++) programs direct access to the NEON instructions through the NEON intrinsics. The intrinsics are a large set of functions that are built into the compiler. Most of the intrinsic functions map to a single NEON instruction. There are additional functions provided for typecasting (reinterpreting) NEON vectors, so that the C compiler does not complain about mismatched types. It is usually shorter and more efficient to write the NEON code directly as assembly language functions and link them to the C code. However, doing so requires knowledge of assembly language.
Some instructions require specific register types. Other instructions allow the programmer to choose single word, double word, or quad word registers. If the instruction requires single precision registers, then the registers are specified as Sd for the destination register, Sn for the first operand register, and Sm for the second operand register. If the instruction requires only two registers, then Sn is not used. The lower-case letter is replaced with a valid register number. The register name is not case sensitive, so S10 and s10 are both valid names for single precision register 10.
The syntax of the NEON instructions can be described using a relatively simple notation. The notation consists of the following elements:
{item} Braces around an item indicate that the item is optional. For example, many operations have an optional condition, which is written as {<cond>}.
Ry An ARM integer register. y can be any number in the range 0-15.
Sy A 32-bit or single precision register. y can be any number in the range 0-31.
Dy A 64-bit or double precision register. y can be any number in the range 0-31.
Qy A quad word register. y can be any number in the range 0-15.
Fy A VFP register. F must be either s for a single word register, or d for a double word register. y can be any valid register number.
Ny A NEON or VFP register. N must be either s for a single word register, d for a double word register, or q for a quad word register. y can be any valid register number.
Vy A NEON vector register. V must be replaced with d for a double word register, or q for a quad word register. y can be any valid register number.
Vy[x] A NEON scalar (vector element). The size of the scalar is defined as part of the instruction. V must be replaced with d for a double word register, or q for a quad word register. y can be any valid register number. x specifies which scalar element of Vy is to be used. Valid values for x can be deduced by the size of Vy and the size of the scalars that the instruction uses.
<op> Operation specific part of a general instruction format
<n> An integer usually indicating a specific instruction version
<size> An integer indicating the number of bits used
<cond> ARM condition code from Table 3.2
<type> Many instructions operate on one or more of the following specific data types:
i16 Untyped 16 bits
i32 Untyped 32 bits
i64 Untyped 64 bits
s8 Signed 8-bit integer
s16 Signed 16-bit integer
s32 Signed 32-bit integer
s64 Signed 64-bit integer
u8 Unsigned 8-bit integer
u16 Unsigned 16-bit integer
u32 Unsigned 32-bit integer
u64 Unsigned 64-bit integer
f16 IEEE 754 half precision floating point
f32 IEEE 754 single precision floating point
f64 IEEE 754 double precision floating point
<list> A brace-delimited list of up to four NEON registers, vectors, or scalars. The general form is {Dn,D(n+a),D(n+2a),D(n+3a)} where a is either 1 or 2.
<align> Specifies the memory alignment of structured data for certain load and store operations.
<imm> An immediate value. The required format for immediate values depends on the instruction.
<fbits> Specifies the number of fraction bits in fixed point numbers.
The following function definitions are used in describing the effects of many of the instructions:
The floor function maps a real number, x, to the largest integer that is less than or equal to x.
The saturate function limits the value of x to the highest or lowest value that can be stored in the destination register.
The round function maps a real number, x, to the nearest integer.
The narrow function reduces a 2n bit number to an n bit number, by taking the n least significant bits.
The extend function converts an n bit number to a 2n bit number, performing zero extension if the number is unsigned, or sign extension if the number is signed.
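These helper functions can be made concrete in C. The sketch below specializes them to signed 32-to-16-bit narrowing; the instruction descriptions apply them at whatever element size the instruction specifies:

```c
#include <stdint.h>

/* Concrete versions of the helper functions for n = 16. */
int16_t saturate_s16(int32_t x)      /* clamp to the destination range */
{
    if (x >  32767) return  32767;
    if (x < -32768) return -32768;
    return (int16_t)x;
}

int16_t narrow_s16(int32_t x)        /* keep the 16 least significant bits */
{
    return (int16_t)(x & 0xFFFF);
}

int32_t extend_s32(int16_t x)        /* sign extension of a signed value */
{
    return (int32_t)x;
}
```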
These instructions can be used to perform interleaving of data when structured data is loaded or stored. The data should be properly aligned for best performance. These instructions are very useful for common multimedia data types.
For example, image data is typically stored in arrays of pixels, where each pixel is a small data structure such as the pixel struct shown in Listing 5.37. Since each pixel is three bytes, and a d register is 8 bytes, loading a single pixel into one register would be inefficient. It would be much better to load multiple pixels at once, but pixels will not fit evenly into a single register. It will take three doubleword or quadword registers to hold a whole number of pixels without wasting space, as shown in Fig. 10.2. This is the way data would be loaded using a VFP vldr or vldm instruction. Many image processing operations work best if each color “channel” is processed separately. The NEON load and store vector instructions can be used to split the image data into color channels, where each channel is stored in a different register, as shown in Fig. 10.3.


Other examples of interleaved data include stereo audio, which is two interleaved channels, and surround sound, which may have up to nine interleaved channels. In all of these cases, most processing operations are simplified when the data is separated into non-interleaved channels.
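The deinterleaving that vld3.8 performs on pixel data (Fig. 10.3) can be expressed in scalar C; the NEON instruction does the equivalent of this loop for eight pixels at a time, and the function name here is mine:

```c
#include <stdint.h>

/* Scalar model of deinterleaving an RGB byte stream into separate
   color channel arrays, as vld3.8 does in hardware. */
void deinterleave_rgb(const uint8_t *src, int npixels,
                      uint8_t *r, uint8_t *g, uint8_t *b)
{
    for (int i = 0; i < npixels; i++) {
        r[i] = src[3 * i + 0];
        g[i] = src[3 * i + 1];
        b[i] = src[3 * i + 2];
    }
}
```

The matching store instruction (vst3.8) re-interleaves the three channel registers back into pixel order.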
These instructions are used to load and store structured data across multiple registers:
vld<n> Load Structured Data, and
vst<n> Store Structured Data.
They can be used for interleaving or deinterleaving the data as it is loaded or stored, as shown in Fig. 10.3.
• <op> must be either ld or st.
• <n> must be one of 1, 2, 3, or 4.
• <size> must be one of 8, 16, or 32.
• <list> specifies the list of registers. There are four list formats:
1. {Dd[x]}
2. {Dd[x], D(d+a)[x]}
3. {Dd[x], D(d+a)[x], D(d+2a)[x]}
4. {Dd[x], D(d+a)[x], D(d+2a)[x], D(d+3a)[x]}
where a can be either 1 or 2. Every register in the list must be in the range d0-d31.
• Rn is the ARM register containing the base address. Rn cannot be pc.
• <align> specifies an optional alignment. If <align> is not specified, then standard alignment rules apply.
• The optional ! indicates that Rn is updated after the data is transferred. This is similar to the ldm and stm instructions.
• Rm is an ARM register containing an offset from the base address. If Rm is present, Rn is updated to Rn + Rm after the address is used to access memory. Rm cannot be sp or pc.
Table 10.1 shows all valid combinations of parameters for these instructions. Note that the same vector element (scalar) x must be used in each register. Up to four registers can be specified. If the structure has more than four fields, then these instructions can be used repeatedly to load or store all of the fields.
Table 10.1
Parameter combinations for loading and storing a single structure
| <n> | <size> | <list> | <align> | Alignment |
| 1 | 8 | Dd[x] | Standard only | |
| | 16 | Dd[x] | 16 | 2 byte |
| | 32 | Dd[x] | 32 | 4 byte |
| 2 | 8 | Dd[x], D(d+1)[x] | 16 | 2 byte |
| | 16 | Dd[x], D(d+1)[x] | 32 | 4 byte |
| | | Dd[x], D(d+2)[x] | 32 | 4 byte |
| | 32 | Dd[x], D(d+1)[x] | 64 | 8 byte |
| | | Dd[x], D(d+2)[x] | 64 | 8 byte |
| 3 | 8 | Dd[x], D(d+1)[x], D(d+2)[x] | Standard only | |
| | 16 or 32 | Dd[x], D(d+1)[x], D(d+2)[x] | Standard only | |
| | | Dd[x], D(d+2)[x], D(d+4)[x] | Standard only | |
| 4 | 8 | Dd[x], D(d+1)[x], D(d+2)[x], D(d+3)[x] | 32 | 4 byte |
| | 16 | Dd[x], D(d+1)[x], D(d+2)[x], D(d+3)[x] | 64 | 8 byte |
| | | Dd[x], D(d+2)[x], D(d+4)[x], D(d+6)[x] | 64 | 8 byte |
| | 32 | Dd[x], D(d+1)[x], D(d+2)[x], D(d+3)[x] | 64 or 128 | (<align> ÷ 8) bytes |
| | | Dd[x], D(d+2)[x], D(d+4)[x], D(d+6)[x] | 64 or 128 | (<align> ÷ 8) bytes |

| Name | Effect | Description |
| vld<n> | addr ← Rn; for D ∈ regs(<list>) do D[x] ← memory[addr]; addr ← addr + <size>÷8 end for; if ! is present then Rn ← Rn + (bytes transferred) else if Rm is specified then Rn ← Rn + Rm end if | Load one or more data items into a single lane of one or more registers |
| vst<n> | addr ← Rn; for D ∈ regs(<list>) do memory[addr] ← D[x]; addr ← addr + <size>÷8 end for; if ! is present then Rn ← Rn + (bytes transferred) else if Rm is specified then Rn ← Rn + Rm end if | Store one or more data items from a single lane of one or more registers |


This instruction is used to load multiple copies of structured data across multiple registers:
vld<n> Load Copies of Structured Data.
The data is copied to all lanes. This instruction is useful for initializing vectors for use in later instructions.
• <n> must be one of 1, 2, 3, or 4.
• <size> must be one of 8, 16, or 32.
• <list> specifies the list of registers. There are four list formats:
1. {Dd[]}
2. {Dd[], D(d+a)[]}
3. {Dd[], D(d+a)[], D(d+2a)[]}
4. {Dd[], D(d+a)[], D(d+2a)[], D(d+3a)[]}
where a can be either 1 or 2. Every register in the list must be in the range d0-d31.
• Rn is the ARM register containing the base address. Rn cannot be pc.
• <align> specifies an optional alignment. If <align> is not specified, then standard alignment rules apply.
• The optional ! indicates that Rn is updated after the data is transferred. This is similar to the ldm and stm instructions.
• Rm is an ARM register containing an offset from the base address. If Rm is present, Rn is updated to Rn + Rm after the address is used to access memory. Rm cannot be sp or pc.
Table 10.2 shows all valid combinations of parameters for this instruction. Note that the vector element number is not specified, but the brackets [] must be present. Up to four registers can be specified. If the structure has more than four fields, then this instruction can be repeated to load or store all of the fields.
Table 10.2
Parameter combinations for loading copies of a structure
| <n> | <size> | <list> | <align> | Alignment |
| 1 | 8 | Dd[] | Standard only | |
| | | Dd[], D(d+1)[] | Standard only | |
| | 16 | Dd[] | 16 | 2 byte |
| | | Dd[], D(d+1)[] | 16 | 2 byte |
| | 32 | Dd[] | 32 | 4 byte |
| | | Dd[], D(d+1)[] | 32 | 4 byte |
| 2 | 8 | Dd[], D(d+1)[] | 8 | 1 byte |
| | | Dd[], D(d+2)[] | 8 | 1 byte |
| | 16 | Dd[], D(d+1)[] | 16 | 2 byte |
| | | Dd[], D(d+2)[] | 16 | 2 byte |
| | 32 | Dd[], D(d+1)[] | 32 | 4 byte |
| | | Dd[], D(d+2)[] | 32 | 4 byte |
| 3 | 8, 16, or 32 | Dd[], D(d+1)[], D(d+2)[] | Standard only | |
| | | Dd[], D(d+2)[], D(d+4)[] | Standard only | |
| 4 | 8 | Dd[], D(d+1)[], D(d+2)[], D(d+3)[] | 32 | 4 byte |
| | | Dd[], D(d+2)[], D(d+4)[], D(d+6)[] | 32 | 4 byte |
| | 16 | Dd[], D(d+1)[], D(d+2)[], D(d+3)[] | 64 | 8 byte |
| | | Dd[], D(d+2)[], D(d+4)[], D(d+6)[] | 64 | 8 byte |
| | 32 | Dd[], D(d+1)[], D(d+2)[], D(d+3)[] | 64 or 128 | (<align> ÷ 8) bytes |
| | | Dd[], D(d+2)[], D(d+4)[], D(d+6)[] | 64 or 128 | (<align> ÷ 8) bytes |


These instructions are used to load and store multiple data structures across multiple registers with interleaving or deinterleaving:
vld<n> Load Multiple Structured Data, and
vst<n> Store Multiple Structured Data.
• <op> must be either ld or st.
• <n> must be one of 1, 2, 3, or 4.
• <size> must be one of 8, 16, or 32.
• <list> specifies the list of registers. There are four list formats:
1. {Dd}
2. {Dd, D(d+a)}
3. {Dd, D(d+a), D(d+2a)}
4. {Dd, D(d+a), D(d+2a), D(d+3a)}
where a can be either 1 or 2. Every register in the list must be in the range d0-d31.
• Rn is the ARM register containing the base address. Rn cannot be pc.
• <align> specifies an optional alignment. If <align> is not specified, then standard alignment rules apply.
• The optional ! indicates that Rn is updated after the data is transferred, similar to the ldm and stm instructions.
• Rm is an ARM register containing an offset from the base address. If Rm is present, Rn is updated to Rn + Rm after the address is used to access memory. Rm cannot be sp or pc.
Table 10.3 shows all valid combinations of parameters for these instructions. Note that no scalar is specified; the instructions operate on all of the vector elements. Up to four registers can be specified. If the structure has more than four fields, then these instructions can be repeated to load or store all of the fields.
Table 10.3
Parameter combinations for loading and storing multiple structures
| <n> | <size> | <list> | <align> | Alignment |
| 1 | 8, 16, 32, or 64 | Dd | 64 | 8 bytes |
| | | Dd, D(d+1) | 64 or 128 | (<align> ÷ 8) bytes |
| | | Dd, D(d+1), D(d+2) | 64 | 8 bytes |
| | | Dd, D(d+1), D(d+2), D(d+3) | 64, 128, or 256 | (<align> ÷ 8) bytes |
| 2 | 8, 16, or 32 | Dd, D(d+1) | 64 or 128 | (<align> ÷ 8) bytes |
| | | Dd, D(d+2) | 64 or 128 | (<align> ÷ 8) bytes |
| | | Dd, D(d+1), D(d+2), D(d+3) | 64, 128, or 256 | (<align> ÷ 8) bytes |
| 3 | 8, 16, or 32 | Dd, D(d+1), D(d+2) | 64 | 8 bytes |
| | | Dd, D(d+2), D(d+4) | 64 | 8 bytes |
| 4 | 8, 16, or 32 | Dd, D(d+1), D(d+2), D(d+3) | 64, 128, or 256 | (<align> ÷ 8) bytes |
| | | Dd, D(d+2), D(d+4), D(d+6) | 64, 128, or 256 | (<align> ÷ 8) bytes |

| Name | Effect | Description |
| vld<n> | addr ← Rn; for 0 ≤ x < nlanes do for D ∈ regs(<list>) do D[x] ← memory[addr]; addr ← addr + <size>÷8 end for end for; if ! is present then Rn ← Rn + (bytes transferred) else if Rm is specified then Rn ← Rn + Rm end if | Load multiple structures from memory, deinterleaving the elements into the lanes of one or more registers |
| vst<n> | addr ← Rn; for 0 ≤ x < nlanes do for D ∈ regs(<list>) do memory[addr] ← D[x]; addr ← addr + <size>÷8 end for end for; if ! is present then Rn ← Rn + (bytes transferred) else if Rm is specified then Rn ← Rn + Rm end if | Store multiple structures to memory, interleaving the elements from the lanes of one or more registers |


Because they use the same set of registers, VFP and NEON share some instructions for loading, storing, and moving registers. The shared instructions are vldr, vstr, vldm, vstm, vpop, vpush, vmov, vmrs, and vmsr. These were explained in Chapter 9. NEON extends the vmov instructions to allow specification of NEON scalars and quadwords, and adds the ability to perform one’s complement during a move.
This version of the move instruction allows data to be moved between the NEON registers and the ARM integer registers as 8-bit, 16-bit, or 32-bit NEON scalars:
vmov Move Between NEON and ARM.
• <cond> is an optional condition code.
• <size> must be 8, 16, or 32, and specifies the number of bits that are to be moved.
• The <type> must be u8, u16, u32, s8, s16, s32, or f32, and specifies the number of bits that are to be moved and whether or not the result should be sign-extended in the ARM integer destination register.

NEON extends the VFP vmov instruction to include the ability to move an immediate value, or the one’s complement of an immediate value, to every element of a register. The instructions are:
vmov Move Immediate, and
vmvn Move Immediate NOT.
• <op> must be either mov or mvn.
• <type> must be i8, i16, i32, f32, or i64, and specifies the size of items in the vector.
• V can be s, d, or q.
• <imm> is an immediate value that matches <type>, and is copied to every element in the vector. The following table shows valid formats for imm:
| <type> | vmov | vmvn |
| i8 | 0xXY | 0xXY |
| i16 | 0x00XY | 0xFFXY |
| | 0xXY00 | 0xXYFF |
| i32 | 0x000000XY | 0xFFFFFFXY |
| | 0x0000XY00 | 0xFFFFXYFF |
| | 0x00XY0000 | 0xFFXYFFFF |
| | 0xXY000000 | 0xXYFFFFFF |
| i64 | 0xABCDEFGH | 0xABCDEFGH |
| | Each letter represents a byte, and must be either FF or 00 | |
| f32 | Any number that can be written as ± n × 2^(−r), where n and r are integers, such that 16 ≤ n ≤ 31 and 0 ≤ r ≤ 7 | |


It is sometimes useful to increase or decrease the number of bits per element in a vector. NEON provides these instructions to convert a doubleword vector with elements of size y to a quadword vector with elements of size 2y, or to perform the inverse operation:
vmovl Move and Lengthen,
vmovn Move and Narrow,
vqmovn Saturating Move and Narrow, and
vqmovun Saturating Move and Narrow Unsigned.
• The valid choices for <type> are given in the following table:
| Opcode | Valid Types |
| vmovl | s8, s16, s32, u8, u16, or u32 |
| vmovn | i8, i16, or i32 |
| vqmovn | s8, s16, s32, u8, u16, or u32 |
| vqmovun | s8, s16, or s32 |
• q indicates that the results are saturated.
| Name | Effect | Description |
| vmovl | for 0 ≤ i < n do Qd[i] ← extend(Dm[i]) end for | Sign or zero extends (depending on <type>) each element of a doubleword vector to twice its length |
| v{q}movn | for 0 ≤ i < n do if q is present then Dd[i] ← saturate(Qm[i]) else Dd[i] ← narrow(Qm[i]) end if end for | Copy the least significant half of each element of a quadword vector to the corresponding elements of a doubleword vector. If q is present, then the value is saturated |
| vqmovun | for 0 ≤ i < n do Dd[i] ← saturate(Qm[i]) end for | Copy each element of the operand vector to the corresponding element of the destination vector. The destination element is unsigned, and the value is saturated |
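For example, the behavior of one lane of vqmovun.s16 (signed input, unsigned saturated output) can be modeled in C; the function name here is mine:

```c
#include <stdint.h>

/* Model of one lane of vqmovun.s16: narrow a signed 16-bit value to an
   unsigned 8-bit value, saturating at both ends of the range. */
uint8_t qmovun_lane(int16_t x)
{
    if (x < 0)   return 0;    /* negative values saturate to zero */
    if (x > 255) return 255;  /* values above 255 saturate to the max */
    return (uint8_t)x;
}
```

The hardware applies this operation to all lanes of the operand register in one instruction.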


The duplicate instruction copies a scalar into every element of the destination vector. The scalar can be in a NEON register or an ARM integer register. The instruction is:
vdup Duplicate Scalar.
• <size> must be one of 8, 16 or 32.
• V can be d or q.
• Rm cannot be r15.

This instruction extracts 8-bit elements from two vectors and concatenates them. Fig. 10.4 gives an example of what this instruction does. The instruction is:
vext Vector Extract.

• <size> must be one of 8, 16, 32, or 64.
• V can be d or q.
• <imm> is the number of elements to extract from the bottom of Vm. The remaining elements required to fill Vd are taken from the top of Vn.

This instruction reverses the order of data in a register:
vrev<n> Vector Reverse.
One use of this instruction is for converting data from big-endian to little-endian order, or from little-endian to big-endian order. It could also be useful for swapping data and transforming matrices. Fig. 10.5 shows three examples.

• <size> is either 8, 16, or 32 and indicates the size of the elements to be reversed. <size> must be less than <n>.
• V can be q or d.

This instruction simply swaps two NEON registers:
vswp Vector Swap.
• <type> can be any NEON data type. The assembler ignores the type, but it can be useful to the programmer as extra documentation.
• V can be q or d.

This instruction transposes 2 × 2 matrices:
vtrn Vector Transpose.
Fig. 10.6 shows two examples of this instruction. Larger matrices can be transposed using a divide-and-conquer approach.

• <size> is either 8, 16, or 32 and indicates the size of the elements in the matrix (or matrices).
• V can be q or d.
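The effect of vtrn on one 2 × 2 matrix can be modeled in C: the two registers hold the rows, and transposition swaps the off-diagonal elements. This is a sketch with my own function name; the real instruction does this for every 2 × 2 group of lanes in parallel:

```c
#include <stdint.h>

/* Scalar model of vtrn.32 d0, d1: the two 2-lane registers form a
   2x2 matrix, and the off-diagonal elements are exchanged. */
void vtrn_32x2(int32_t d0[2], int32_t d1[2])
{
    int32_t tmp = d0[1];  /* element (0,1) */
    d0[1] = d1[0];        /* element (1,0) moves to (0,1) */
    d1[0] = tmp;          /* old (0,1) moves to (1,0) */
}
```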

Fig. 10.7 shows how the vtrn instruction can be used to transpose a 3 × 3 matrix. Transposing a 4 × 4 matrix requires the transposition of 13 2 × 2 matrices. However, this instruction can operate on multiple 2 × 2 sub-matrices in parallel, and can group elements into different sized sub-matrices. There is also a very useful swap instruction that can exchange the rows of a matrix. Using the swap and transpose instructions, transposing a 4 × 4 matrix of 16-bit elements can be done with only four instructions, as shown in Fig. 10.8.


The table lookup instructions use indices held in one vector to look up values from a table held in one or more other vectors. The resulting values are stored in the destination vector. The table lookup instructions are:
vtbl Table Lookup, and
vtbx Table Lookup with Extend.
• <list> specifies the list of registers. There are five list formats:
1. {Dn},
2. {Dn, D(n+1)},
3. {Dn, D(n+1), D(n+2)},
4. {Dn, D(n+1), D(n+2), D(n+3)}, or
5. {Qn, Q(n+1)}.
• Dm is the register holding the indices.
• The table can contain up to 32 bytes.
| Name | Effect | Description |
| vtbl | | Use indices in Dm to look up values in a table and store them in Dd. If an index is out of range, zero is stored in the corresponding destination element. |
| vtbx | | Use indices in Dm to look up values in a table and store them in Dd. If an index is out of range, the corresponding destination element is unchanged. |
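The difference between the two lookups is easy to see in a Python sketch (hypothetical helpers; real tables are 8–32 bytes held in D registers):

```python
def vtbl(table, indices):
    """Sketch of vtbl: out-of-range indices produce zero."""
    return [table[i] if i < len(table) else 0 for i in indices]

def vtbx(table, indices, dest):
    """Sketch of vtbx: out-of-range indices leave the destination
    element unchanged."""
    return [table[i] if i < len(table) else dest[j]
            for j, i in enumerate(indices)]

table = [0x10, 0x20, 0x30, 0x40]
print(vtbl(table, [1, 3, 9, 0]))                 # [32, 64, 0, 16]
print(vtbx(table, [1, 3, 9, 0], [7, 7, 7, 7]))   # [32, 64, 7, 16]
```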


These instructions are used to interleave or deinterleave the data from two vectors:
vzip Zip Vectors, and
vuzp Unzip Vectors.
Fig. 10.9 gives an example of the vzip instruction. The vuzp instruction performs the inverse operation.

• <size> is either 8, 16, or 32 and indicates the size of the elements in the vectors.
• V can be q or d.
| Name | Effect | Description |
| vzip | | Interleave the corresponding elements of two vectors. |
| vuzp | | Deinterleave the elements of two vectors; the inverse of vzip. |


When high precision is not required, the IEEE half-precision format can be used to store floating point numbers in memory. This can reduce memory requirements by up to 50%. This can also result in a significant performance improvement, since only half as much data needs to be moved between the CPU and main memory. However, on most processors half-precision data must be converted to single precision before it is used in calculations. NEON provides enhanced versions of the vcvt instruction which support conversion to and from IEEE half precision. There are also versions of vcvt which operate on vectors, and perform integer or fixed-point to floating-point conversions.
This instruction can be used to perform a data conversion between single precision and fixed point on each element in a vector:
The elements in the vector must be 32-bit single precision floating point values or 32-bit integers. Fixed point (or integer) arithmetic operations are up to twice as fast as floating point operations. In some cases it is much more efficient to make this conversion, perform the calculations, then convert the results back to floating point.
• <cond> is an optional condition code.
• <type> must be either s32 or u32.
• The optional <fbits> operand specifies the number of fraction bits for a fixed point number, and must be between 0 and 32. If it is omitted, then it is assumed to be zero.
| Name | Effect | Description |
| vcvt.s32.f32 | | Convert single precision to 32-bit signed fixed point or integer. |
| vcvt.u32.f32 | | Convert single precision to 32-bit unsigned fixed point or integer. |
| vcvt.f32.s32 | | Convert signed 32-bit fixed point or integer to single precision. |
| vcvt.f32.u32 | | Convert unsigned 32-bit fixed point or integer to single precision. |
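A Python sketch of the fixed-point conversion (hypothetical helpers; it assumes the default round-toward-zero behavior of vcvt when converting to integer, and saturation at the 32-bit range):

```python
def f32_to_fixed(x, fbits, signed=True):
    """Sketch of vcvt.s32.f32 / vcvt.u32.f32 with <fbits> fraction
    bits: scale by 2**fbits, truncate toward zero, then saturate."""
    v = int(x * (1 << fbits))                 # int() truncates toward zero
    if signed:
        lo, hi = -2**31, 2**31 - 1
    else:
        lo, hi = 0, 2**32 - 1
    return max(lo, min(hi, v))

def fixed_to_f32(v, fbits):
    """Sketch of vcvt.f32.s32: divide by 2**fbits."""
    return v / (1 << fbits)

print(f32_to_fixed(1.75, 8))    # 448 -- 1.75 in S(23,8) fixed point
print(fixed_to_f32(448, 8))     # 1.75
```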

NEON systems with the half-precision extension provide the following instruction to perform conversion between single precision and half precision floating point formats:
vcvt Convert Between Half and Single.
• The <op> must be either b or t and specifies whether the top or bottom half of the register should be used for the half-precision number.
• <cond> is an optional condition code.
| Name | Effect | Description |
| vcvtb.f16.f32 | | Convert single precision to half precision and store in bottom half of destination. |
| vcvtt.f16.f32 | | Convert single precision to half precision and store in top half of destination. |
| vcvtb.f32.f16 | | Convert half precision number from bottom half of source to single precision. |
| vcvtt.f32.f16 | | Convert half precision number from top half of source to single precision. |
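The half-precision bit patterns involved can be examined from Python, whose struct module supports the IEEE half format directly (hypothetical helper names; only the narrowing/widening is modeled, not the top/bottom register-half placement):

```python
import struct

def f32_to_f16_bits(x):
    """Pack a float into IEEE half precision and return the 16 raw
    bits -- roughly what vcvtb.f16.f32 leaves in one register half."""
    return struct.unpack('<H', struct.pack('<e', x))[0]

def f16_bits_to_f32(bits):
    """Expand 16 half-precision bits back to a float, as
    vcvtb.f32.f16 would."""
    return struct.unpack('<e', struct.pack('<H', bits))[0]

bits = f32_to_f16_bits(1.5)
print(hex(bits))               # 0x3e00 -- 1.5 in half precision
print(f16_bits_to_f32(bits))   # 1.5
```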

NEON adds the ability to perform integer comparisons between vectors. Since there are multiple pairs of items to be compared, the comparison instructions set one element in a result vector for each pair of items. After the comparison operation, each element of the result vector will have every bit set to zero (for false) or one (for true). Note that if the elements of the result vector are interpreted as signed two’s-complement numbers, then the value 0 represents false and the value − 1 represents true.
The following instructions perform comparisons of all of the corresponding elements of two vectors in parallel:
vceq Compare Equal,
vcge Compare Greater Than or Equal,
vcgt Compare Greater Than,
vcle Compare Less Than or Equal, and
vclt Compare Less Than.
The vector compare instructions compare each element of a vector with the corresponding element in a second vector, and sets an element in the destination vector for each comparison. If the comparison is true, then all bits in the result element are set to one. Otherwise, all bits in the result element are set to zero. Note that summing the elements of the result vector (as signed integers) will give the two’s complement of the number of comparisons which were true.
Note: vcle and vclt are actually pseudo-instructions. They are equivalent to vcge and vcgt, respectively, with the operands reversed.
• <op> must be one of eq, ge, gt, le, or lt.
• If <op> is eq, then <type> must be i8, i16, i32, or f32.
• If <op> is not eq and the third operand is #0, then <type> must be s8, s16, s32, or f32.
• If <op> is not eq and the third operand is a register, then <type> must be s8, s16, s32, u8, u16, u32, or f32.
• The result data type is determined from the following table:
• If the third operand is #0, then it is taken to be a vector of the correct size in which every element is zero.
• V can be d or q.
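The all-ones/all-zeros mask behavior can be sketched in Python (hypothetical helper; only vcgt is shown, the other comparisons differ only in the operator):

```python
def vcgt(vn, vm, bits=32):
    """Sketch of vcgt: an all-ones element where vn[i] > vm[i], else
    zero. Read as signed two's complement, all-ones is -1."""
    ones = (1 << bits) - 1
    return [ones if a > b else 0 for a, b in zip(vn, vm)]

mask = vcgt([5, 2, 9, 1], [3, 4, 9, 0])
print([hex(m) for m in mask])   # ['0xffffffff', '0x0', '0x0', '0xffffffff']

# Summing the elements as signed values gives -(count of true results):
signed = [m - (1 << 32) if m >> 31 else m for m in mask]
print(sum(signed))              # -2, so two comparisons were true
```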

The following instructions perform comparisons between the absolute values of all of the corresponding elements of two vectors in parallel:
vacgt Absolute Compare Greater Than, and
vacge Absolute Compare Greater Than or Equal.
The vector absolute compare instruction compares the absolute value of each element of a vector with the absolute value of the corresponding element in a second vector, and sets an element in the destination vector for each comparison. If the comparison is true, then all bits in the result element are set to one. Otherwise, all bits in the result element are set to zero. Note that summing the elements of the result vector (as signed integers) will give the two’s complement of the number of comparisons which were true.
• <op> must be either ge or gt.
• V can be d or q.
• The operand element type must be f32.
• The result element type is i32.

NEON provides the following vector version of the ARM tst instruction:
The vector test bits instruction performs a logical AND operation between each element of a vector and the corresponding element in a second vector. If the result is not zero, then every bit in the corresponding element of the result vector is set to one. Otherwise, every bit in the corresponding element of the result vector is set to zero.
• <size> must be one of 8, 16 or 32
• The result element type is defined by the following table:

NEON adds the ability to perform integer and bitwise logical operations on the VFP register set. Recall that integer operations can also be used on fixed-point data. These operations add a great deal of power to the ARM processor.
NEON includes vector versions of the following five basic logical operations:
vand Bitwise AND,
veor Bitwise Exclusive-OR,
vorr Bitwise OR,
vorn Bitwise Complement and OR, and
vbic Bit Clear.
All of them involve two source operands and a destination register.
• <op> must be one of and, eor, orr, orn, or bic.
• V must be either q or d.
• <type> must be i8, i16, i32, or i64. For these bitwise logical operations, the type does not matter, so the assembler ignores it. However, it can be useful to the programmer as extra documentation.

It is often useful to clear and/or set specific bits in a register. The NEON instruction set provides the following vector versions of the logical OR and bit clear instructions:
vorr Bitwise OR Immediate, and
vbic Bit Clear Immediate.
• <op> must be either orr, or bic.
• V must be either q or d to specify whether the operation involves quadwords or doublewords.
• <type> must be i16 or i32.
• <imm> is a 16-bit or 32-bit immediate value, which is interpreted as a pattern for filling the immediate operand. The following table shows acceptable patterns for <imm>, based on what was chosen for <type>:

NEON provides three instructions which can be used to combine the bits in two registers or to extract specific bits from a register, according to a pattern:
vbit Bitwise Insert if True,
vbif Bitwise Insert if False, and
vbsl Bitwise Select.
• <op> can be bif, bit, or bsl.
• V can be d or q.
• <type> must be i8, i16, i32, or i64, and specifies the size of items in the vectors. Note that for these bitwise logical operations, the type does not matter, so the assembler ignores it. However, it can be useful to the programmer as extra documentation.
| Name | Effect | Description |
| vbit | | Insert each bit from the first operand into the destination if the corresponding bit of the second operand is 1 |
| vbif | | Insert each bit from the first operand into the destination if the corresponding bit of the second operand is 0 |
| vbsl | | Select each bit for the destination from the first operand if the corresponding bit of the destination is 1, or from the second operand if the corresponding bit of the destination is 0 |
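The bitwise select is the most commonly used of the three, typically with a comparison mask as the old destination. A Python sketch (hypothetical helper):

```python
def vbsl(mask, a, b, bits=32):
    """Sketch of vbsl: each result bit comes from a where the mask
    bit is 1 and from b where it is 0 (mask = old destination)."""
    full = (1 << bits) - 1
    return [(m & x) | (~m & full & y) for m, x, y in zip(mask, a, b)]

r = vbsl([0xFF00FF00], [0x12345678], [0xABCDEF01])
print(hex(r[0]))   # 0x12cd5601 -- alternating bytes from each source
```

Combined with a vector compare, this gives a branch-free element-wise conditional: compare to build the mask, then vbsl to merge the two candidate results.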

The NEON shift instructions operate on vectors. Shifts are often used for multiplication and division by powers of two: a left shift by n multiplies by 2ⁿ, and a right shift by n divides by 2ⁿ. The result of a left shift may be too large for the destination element, resulting in overflow. In some cases, it may be useful to round the result of a division rather than truncating it. NEON provides versions of the shift instructions which perform saturation and/or rounding of the result.
These instructions shift each element in a vector left by an immediate value:
vshl Shift Left Immediate,
vqshl Saturating Shift Left Immediate,
vqshlu Saturating Shift Left Immediate Unsigned, and
vshll Shift Left Immediate Long.
Overflow conditions can be avoided by using the saturating version, or by using the long version, in which case the destination is twice the size of the source.
• If u is present, then the results are unsigned.
• The valid choices for <type> are given in the following table:
| Name | Effect | Description |
| vshl | Each element of Vm is shifted left by the immediate value and stored in the corresponding element of Vd. Bits shifted past the end of an element are lost. | |
| vshll | Each element of Vm is shifted left by the immediate value and stored in the corresponding element of Vd. The values are sign or zero extended, depending on <type> | |
| vqshl{u} | Each element of Vm is shifted left by the immediate value and stored in the corresponding element of Vd. If the result of the shift is outside the range of the destination element, then the value is saturated. If u was specified, then the destination is unsigned. Otherwise, it is signed |
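The saturating behavior can be sketched in Python (hypothetical helper; shown for 8-bit elements):

```python
def vqshl_imm(vec, imm, bits=8, signed=True):
    """Sketch of vqshl #imm: shift each element left, then saturate
    to the range of a signed or unsigned <bits>-bit element."""
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    return [max(lo, min(hi, x << imm)) for x in vec]

# 50<<2 = 200 and -50<<2 = -200 both overflow a signed byte, so they
# clamp to the endpoints instead of wrapping around:
print(vqshl_imm([3, 50, -50], 2))   # [12, 127, -128]
```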


These instructions shift each element in a vector, using the least significant byte of the corresponding element of a second vector as the shift amount:
vshl Shift Left or Right by Variable,
vrshl Shift Left or Right by Variable and Round,
vqshl Saturating Shift Left or Right by Variable, and
vqrshl Saturating Shift Left or Right by Variable and Round.
If the shift value is positive, the operation is a left shift. If the shift value is negative, then it is a right shift. A shift value of zero is equivalent to a move. If the operation is a right shift, and r is specified, then the result is rounded rather than truncated. Results are saturated if q is specified.
• If q is present, then the results are saturated.
• If r is present, then right shifted values are rounded rather than truncated.
• V can be d or q.
• <type> must be one of s8, s16, s32, s64, u8, u16, u32, or u64.

These instructions shift each element in a vector right by an immediate value:
vshr Shift Right Immediate,
vrshr Shift Right Immediate and Round,
vshrn Shift Right Immediate and Narrow,
vrshrn Shift Right Immediate Round and Narrow,
vsra Shift Right and Accumulate Immediate, and
vrsra Shift Right Round and Accumulate Immediate.
• If r is present, then right shifted values are rounded rather than truncated.
• <cond> is an optional condition code.
• The valid choices for <type> are given in the following table:
| Name | Effect | Description |
| v{r}shr | | Each element of Vm is shifted right with sign or zero extension by the immediate value and stored in the corresponding element of Vd. Results can be optionally rounded. |
| v{r}shrn | | Each element of Vm is shifted right with zero extension by the immediate value, optionally rounded, then narrowed and stored in the corresponding element of Vd. |
| v{r}sra | | Each element of Vm is shifted right with sign or zero extension by the immediate value and accumulated in the corresponding element of Vd. Results can be optionally rounded. |


These instructions shift each element in a quad word vector right by an immediate value:
vqshrn Saturating Shift Right Immediate,
vqrshrn Saturating Shift Right Immediate Round,
vqshrun Saturating Shift Right Immediate Unsigned, and
vqrshrun Saturating Shift Right Immediate Round Unsigned.
The result is optionally rounded, then saturated, narrowed, and stored in a double word vector.
• If r is present, then right shifted values are rounded rather than truncated.
• If u is present, then the results are unsigned, regardless of the type of elements in Qm.
• The valid choices for <type> are given in the following table:
• <imm> is the amount that elements are to be shifted, and must be between zero and one less than the number of bits in <type>.
| Name | Effect | Description |
| vq{r}shrn | | Each element of Vm is shifted right with sign extension by the immediate value, optionally rounded, then saturated and narrowed, and stored in the corresponding element of Vd. |
| vq{r}shrun | | Each element of Vm is shifted right with zero extension by the immediate value, optionally rounded, then saturated and narrowed, and stored in the corresponding element of Vd. |


These instructions perform bitwise shifting of each element in a vector, then combine the results with the contents of the destination register:
vsli Shift Left and Insert, and
vsri Shift Right and Insert.
Fig. 10.10 provides an example.

• <dir> must be l for a left shift, or r for a right shift.
• <size> must be 8, 16, 32, or 64.
• <imm> is the amount that elements are to be shifted, and must be between zero and <size>− 1 for vsli, or between one and <size> for vsri.

NEON provides several instructions for addition, subtraction, and multiplication, but does not provide a divide instruction. Whenever possible, division should be performed by multiplying by the reciprocal. When dividing by constants, the reciprocal can be calculated in advance, as shown in Chapter 8. For dividing by variables, NEON provides instructions for quickly calculating the reciprocals of all elements in a vector. In most cases, this is faster than using a divide instruction. When division is absolutely unavoidable, the VFP divide instructions can be used.
The following eight instructions perform vector addition and subtraction:
vadd Add
vqadd Saturating Add
vaddl Add Long
vaddw Add Wide
vsub Subtract
vqsub Saturating Subtract
vsubl Subtract Long
vsubw Subtract Wide
The Vector Add (vadd) instruction adds corresponding elements in two vectors and stores the results in the corresponding elements of the destination register. The Vector Subtract (vsub) instruction subtracts elements in one vector from corresponding elements in another vector and stores the results in the corresponding elements of the destination register. Other versions allow mismatched operand and destination sizes, and the saturating versions prevent overflow by limiting the range of the results.
• The valid choices for <type> are given in the following table:
| Name | Effect | Description |
| v<op> | The operation is applied to corresponding elements of Vn and Vm. The results are stored in the corresponding elements of Vd. | |
| vq<op> | The operation is applied to corresponding elements of Vn and Vm. The results are saturated then stored in the corresponding elements of Vd. | |
| v<op>l | The operation is applied to corresponding elements of Dn and Dm. The results are zero or sign extended then stored in the corresponding elements of Qd. | |
| v<op>w | The elements of Vm are sign or zero extended, then the operation is applied with corresponding elements of Vn. The results are stored in the corresponding elements of Vd. |
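The difference between ordinary and saturating addition is easy to demonstrate in Python (hypothetical helper; shown for signed bytes):

```python
def vqadd(vn, vm, bits=8, signed=True):
    """Sketch of vqadd: element-wise add, saturated to the range of
    a signed or unsigned <bits>-bit element instead of wrapping."""
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    return [max(lo, min(hi, a + b)) for a, b in zip(vn, vm)]

# 100+100 = 200 and -100+-100 = -200 both overflow a signed byte;
# plain vadd would wrap them to -56 and 56, vqadd clamps instead:
print(vqadd([100, -100, 10], [100, -100, 10]))   # [127, -128, 20]
```

Saturation is often the right behavior for signal and image data, where wraparound would turn a bright pixel into a dark one.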


These instructions add or subtract the corresponding elements of two vectors, and narrow by taking the most significant half of the result:
vaddhn Add and Narrow
vraddhn Add, Round, and Narrow
vsubhn Subtract and Narrow
vrsubhn Subtract, Round, and Narrow
The results are stored in the corresponding elements of the destination register. Results can be optionally rounded instead of truncated.
• If <r> is specified, then the result is rounded instead of truncated.
• <type> must be either i16, i32, or i64.

These instructions add or subtract corresponding elements from two vectors then shift the result right by one bit:
vhadd Halving Add
vrhadd Halving Add and Round
vhsub Halving Subtract
The results are stored in corresponding elements of the destination vector. If the operation is addition, then the results can be optionally rounded.
• If <r> is specified, then the result is rounded instead of truncated.
• <type> must be either s8, s16, s32, u8, u16, or u32.
| Name | Effect | Description |
| v{r}hadd | | The corresponding elements of Vn and Vm are added together, optionally rounded, then shifted right one bit. Results are stored in the corresponding elements of Vd. |
| vhsub | | The elements of Vm are subtracted from the corresponding elements of Vn. Results are shifted right one bit and stored in the corresponding elements of Vd. |


These instructions add vector elements pairwise:
vpadd Add Pairwise
vpaddl Add Pairwise Long
vpadal Add Pairwise and Accumulate Long
The long versions can be used to prevent overflow.
• <op> must be either add or ada.
• The valid choices for <type> are given in the following table:
| Name | Effect | Description |
| vpadd | | Add adjacent pairs of elements from two vectors and store the results in another vector. |
| vpaddl | | Add adjacent pairs of elements from a vector, extend, and store the results in another vector. |
| vpadal | | Add adjacent pairs of elements from a vector, extend, and accumulate the results in another vector. |
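Pairwise addition sums neighbors rather than corresponding elements, which is how NEON reduces a vector to a single sum in log₂(n) steps. A Python sketch of vpadd (hypothetical helper):

```python
def vpadd(dn, dm):
    """Sketch of vpadd: sum adjacent pairs of dn, then of dm, and
    pack both sets of sums into one destination vector."""
    both = dn + dm
    return [both[i] + both[i + 1] for i in range(0, len(both), 2)]

print(vpadd([1, 2, 3, 4], [5, 6, 7, 8]))   # [3, 7, 11, 15]

# Repeated pairwise addition reduces a vector to its total:
v = [1, 2, 3, 4]
while len(v) > 1:
    v = vpadd(v[:len(v)//2], v[len(v)//2:])
print(v)                                    # [10]
```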


These instructions subtract the elements of one vector from another and store or accumulate the absolute value of the results:
vaba Absolute Difference and Accumulate
vabal Absolute Difference and Accumulate Long
vabd Absolute Difference
vabdl Absolute Difference Long
The long versions can be used to prevent overflow.
• The valid choices for <type> are given in the following table:
| Name | Effect | Description |
| vabd | Subtract corresponding elements and take the absolute value | |
| vaba | Subtract corresponding elements and take the absolute value. Accumulate the results | |
| vabdl | Extend and subtract corresponding elements, then take the absolute value | |
| vabal | Extend and subtract corresponding elements, then take the absolute value. Accumulate the results |


These operations compute the absolute value or negate each element in a vector:
vabs Absolute Value
vneg Negate
vqabs Saturating Absolute Value
vqneg Saturating Negate
The saturating versions can be used to prevent overflow.
• If q is present then results are saturated.
• <op> is either abs or neg.
• The valid choices for <type> are given in the following table:

The following four instructions select the maximum or minimum elements and store the results in the destination vector:
vmax Maximum
vmin Minimum
vpmax Pairwise Maximum
vpmin Pairwise Minimum
• <type> must be one of s8, s16, s32, u8, u16, u32, or f32.
| Name | Effect | Description |
| vmax | | Compare corresponding elements and copy the greater of each pair into the corresponding element of the destination vector |
| vpmax | | Compare adjacent pairs of elements within each source vector and copy the greater of each pair into an element of the destination vector |
| vmin | | Compare corresponding elements and copy the lesser of each pair into the corresponding element of the destination vector |
| vpmin | | Compare adjacent pairs of elements within each source vector and copy the lesser of each pair into an element of the destination vector |


These instructions can be used to count leading sign bits or zeros, or to count the number of bits that are set for each element in a vector:
vcls Count Leading Sign Bits
vclz Count Leading Zero Bits
vcnt Count Set Bits
• <op> is either cls, clz or cnt.
• The valid choices for <type> are given in the following table:
| Name | Effect | Description |
| vcls | | Count the number of consecutive bits that are the same as the sign bit for each element in Vm, and store the counts in the corresponding elements of Vd. |
| vclz | | Count the number of leading zero bits for each element in Vm, and store the counts in the corresponding elements of Vd. |
| vcnt | | Count the number of bits set to one in each element of Vm, and store the counts in the corresponding elements of Vd. |
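The three counts can be sketched in Python (hypothetical helpers; shown for 8-bit elements, with values given as unsigned bit patterns):

```python
def vclz(vec, bits=8):
    """Sketch of vclz: leading zero bits of each element."""
    return [bits - x.bit_length() for x in vec]

def vcnt(vec):
    """Sketch of vcnt: population count of each element."""
    return [bin(x).count('1') for x in vec]

def vcls(vec, bits=8):
    """Sketch of vcls: consecutive bits matching the sign bit,
    counted after (not including) the sign bit itself."""
    out = []
    for x in vec:
        u = x & ((1 << bits) - 1)
        if u >> (bits - 1):               # negative: count leading ones
            u = ~u & ((1 << bits) - 1)
        out.append(bits - 1 - u.bit_length())
    return out

print(vclz([1, 0x80, 0]))      # [7, 0, 8]
print(vcnt([0xFF, 0x0F, 0]))   # [8, 4, 0]
print(vcls([0x01, 0xFE]))      # [6, 6]
```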


There is no vector divide instruction in NEON. Division is accomplished with multiplication by the reciprocals of the divisors. The reciprocals are found by making an initial estimate, then using the Newton-Raphson method to improve the approximation. This can actually be faster than using a hardware divider. NEON supports single precision floating point and unsigned fixed point reciprocal calculation. Fixed point reciprocals provide higher precision. Division using the NEON reciprocal method may not provide the best precision possible. If the best possible precision is required, then the VFP divide instruction should be used.
These instructions are used to multiply the corresponding elements from two vectors:
vmul Multiply
vmla Multiply Accumulate
vmls Multiply Subtract
vmull Multiply Long
vmlal Multiply Accumulate Long
vmlsl Multiply Subtract Long
The long versions can be used to avoid overflow.
• <op> is either mul, mla, or mls.
• The valid choices for <type> are given in the following table:
| Name | Effect | Description |
| vmul | Multiply corresponding elements from two vectors and store the results in a third vector | |
| vmla | Multiply corresponding elements from two vectors and add the results to a third vector | |
| vmls | Multiply corresponding elements from two vectors and subtract the results from a third vector | |
| vmull | Multiply corresponding elements from two vectors and store the double-width results in a third vector | |
| vmlal | Multiply corresponding elements from two vectors and add the double-width results to a third vector | |
| vmlsl | Multiply corresponding elements from two vectors and subtract the double-width results from a third vector |


These instructions are used to multiply each element in a vector by a scalar:
vmul Multiply by Scalar
vmla Multiply Accumulate by Scalar
vmls Multiply Subtract by Scalar
vmull Multiply Long by Scalar
vmlal Multiply Accumulate Long by Scalar
vmlsl Multiply Subtract Long by Scalar
The long versions can be used to avoid overflow.
• <op> is either mul, mla, or mls.
• The valid choices for <type> are given in the following table:
| Opcode | Valid Types |
| vmul | i16, i32, or f32 |
| vmla | i16, i32, or f32 |
| vmls | i16, i32, or f32 |
| vmull | s16, s32, u16, or u32 |
| vmlal | s16, s32, u16, or u32 |
| vmlsl | s16, s32, u16, or u32 |
• x must be valid for the chosen <type>.
| Name | Effect | Description |
| vmul | Multiply each element of a vector by a scalar and store the results in another vector | |
| vmla | Multiply each element of a vector by a scalar and add the results to another vector | |
| vmls | Multiply each element of a vector by a scalar and subtract the results from another vector | |
| vmull | Multiply each element of a vector by a scalar and store the double-width results in another vector | |
| vmlal | Multiply each element of a vector by a scalar and add the double-width results to another vector | |
| vmlsl | Multiply each element of a vector by a scalar and subtract the double-width results from another vector |


A fused multiply accumulate operation does not perform rounding between the multiply and add operations. The two operations are fused into one. NEON provides the following fused multiply accumulate instructions:
vfma Fused Multiply Accumulate
vfnma Fused Negate Multiply Accumulate
vfms Fused Multiply Subtract
vfnms Fused Negate Multiply Subtract
Using the fused multiply accumulate can result in improved speed and accuracy for many computations that involve the accumulation of products.
• <op> is one of vfma, vfnma, vfms, or vfnms.
• <cond> is an optional condition code.
• <prec> may be either f32 or f64.

These instructions perform multiplication, double the results, and perform saturation:
vqdmull Saturating Multiply Double (Low)
vqdmlal Saturating Multiply Double Accumulate (Low)
vqdmlsl Saturating Multiply Double Subtract (Low)
• <op> is either mul, mla, or mls.
• <type> must be either s16 or s32.


These instructions perform multiplication, double the results, perform saturation, and store the high half of the results:
vqdmulh Saturating Multiply Double (High)
vqrdmulh Saturating Multiply Double (High) and Round
| Name | Effect | Description |
| vqdmulh | | Multiply elements (the second operand may be a scalar), double the results, and store the high half in the destination vector with saturation |
| vqrdmulh | | Multiply elements (the second operand may be a scalar), double the results, round, and store the high half in the destination vector with saturation |


These instructions perform the initial estimates of the reciprocal values:
vrecpe Reciprocal Estimate, and
vrsqrte Reciprocal Square Root Estimate.
These work on floating point and unsigned fixed point vectors. The estimates from this instruction are accurate to within about eight bits. If higher accuracy is desired, then the Newton-Raphson method can be used to improve the initial estimates. For more information, see the Reciprocal Step instruction.
• <op> is either recpe or rsqrte.
• <type> must be either u32, or f32.
• If <type> is u32, then the elements are assumed to be U(1,31) fixed point numbers, and the most significant fraction bit (bit 30) must be 1, and the integer part must be zero. The vclz and shift by variable instructions can be used to put the data in the correct format.
• The result elements are always f32.

These instructions are used to perform one Newton-Raphson step for improving the reciprocal estimates:
vrecps Reciprocal Step, and
vrsqrts Reciprocal Square Root Step.
For each element in the vector, the following equation can be used to improve the estimates of the reciprocals:

xₙ₊₁ = xₙ(2 − d·xₙ),

where xₙ is the estimated reciprocal from the previous step, and d is the number for which the reciprocal is desired. This equation converges to 1/d if x₀ is obtained using vrecpe on d. The vrecps instruction computes 2 − d·xₙ, so one additional multiplication is required to complete the update step. The initial estimate x₀ must be obtained using the vrecpe instruction.
For each element in the vector, the following equation can be used to improve the estimates of the reciprocals of the square roots:

xₙ₊₁ = xₙ(3 − d·xₙ²)/2,

where xₙ is the estimated reciprocal square root from the previous step, and d is the number for which the reciprocal square root is desired. This equation converges to 1/√d if x₀ is obtained using vrsqrte on d. The vrsqrts instruction computes (3 − d·xₙ²)/2, so two additional multiplications are required to complete the update step. The initial estimate x₀ must be obtained using the vrsqrte instruction.
• <op> is either recps or rsqrts.
• <type> must be either u32, or f32.
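The refinement loop is easy to verify numerically in Python. This sketch fakes the vrecpe estimate (real hardware uses a lookup table accurate to about eight bits; here the true reciprocal is simply perturbed by a comparable amount), and models vrecps exactly as 2 − d·xₙ:

```python
def vrecpe(d):
    """Stand-in for vrecpe: a rough initial reciprocal estimate.
    The 0.996 factor mimics an error of roughly 2**-8."""
    return (1.0 / d) * 0.996

def vrecps(d, x):
    """vrecps computes 2 - d*x, the Newton-Raphson correction factor."""
    return 2.0 - d * x

d = 7.0
x = vrecpe(d)
for _ in range(2):            # each step squares the relative error
    x = x * vrecps(d, x)      # x_{n+1} = x_n * (2 - d*x_n)
print(abs(x - 1.0 / 7.0) < 1e-10)   # True after two refinement steps
```

Starting from an 8-bit estimate, the relative error goes from about 2⁻⁸ to 2⁻¹⁶ to 2⁻³², so two steps already exhaust single precision.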

The GNU assembler supports five pseudo-instructions for NEON. Two of them are vcle and vclt, which were covered in Section 10.6.1. The other three are explained in the following sections.
This pseudo-instruction loads a constant value into every element of a NEON vector, or into a VFP single-precision or double-precision register:
This pseudo-instruction will use vmov if possible. Otherwise, it will create an entry in the literal pool and use vldr.
• <cond> is an optional condition code.
• <type> must be one of i8, i16, i32, i64, s8, s16, s32, s64, u8, u16, u32, u64, f32, or f64.
• <imm> is a value appropriate for the specified <type>.

It is often useful to clear and/or set specific bits in a register. The following pseudo-instructions can provide bitwise logical operations:
vand Bitwise AND Immediate, and
vorn Bitwise Complement and OR Immediate.
• <op> must be either and or orn.
• V must be either q or d to specify whether the operation involves quadwords or doublewords.
• <type> must be i8, i16, i32, or i64.
• <imm> is a 16-bit or 32-bit immediate value, which is interpreted as a pattern for filling the immediate operand. The following table shows acceptable patterns for <imm>, based on what was chosen for <type>:

The following pseudo-instructions perform comparisons between the absolute values of all of the corresponding elements of two vectors in parallel:
vacle Absolute Compare Less Than or Equal
vaclt Absolute Compare Less Than
The vector absolute compare instruction compares the absolute value of each element of a vector with the absolute value of the corresponding element in a second vector, and sets an element in the destination vector for each comparison. If the comparison is true, then all bits in the result element are set to one. Otherwise, all bits in the result element are set to zero. Note that summing the elements of the result vector (as signed integers) will give the two’s complement of the number of comparisons which were true.
• <op> must be either le or lt.
• V can be d or q.
• The operand element type must be f32.
• The result element type is i32.

In Chapter 9, four versions of the sine function were given. Those implementations used scalar and VFP vector modes for single and double precision, and are already faster than the implementations provided by GCC. However, it may be possible to gain a little more performance by taking advantage of the NEON architecture. All versions of NEON are guaranteed to have a very large register set, and that fact can be used to attain better performance.
Listing 10.1 shows a single precision floating point implementation of the sine function, using the ARM NEON instruction set. It performs the same operations as the previous implementations of the sine function, but performs many of the calculations in parallel. This implementation is slightly faster than the previous version.

Listing 10.2 shows a double precision floating point implementation of the sine function. This code is intended to run on ARMv7 and earlier NEON/VFP systems with the full set of 32 double-precision registers. NEON systems prior to ARMv8 do not have NEON SIMD instructions for double precision operations. This implementation is faster than Listing 9.4 because it uses a large number of registers, does not contain a loop, and is written carefully so that multiple instructions can be at different stages in the pipeline at the same time. This technique of gaining performance is known as loop unrolling.



Table 10.4 compares the implementations from Listings 10.1 and 10.2 with the VFP vector implementations from Chapter 9 and the sine function provided by GCC. Notice that in every case, using vector mode VFP instructions is slower than the scalar VFP version. As mentioned previously, vector mode is deprecated on NEON processors, where it is emulated in software. Although vector mode is supported, using it results in greatly reduced performance, because each vector instruction causes the operating system to take over and substitute a series of scalar floating point operations on-the-fly; a great deal of time is spent in that emulation.
Table 10.4
Performance of sine function with various implementations
| Optimization | Implementation | CPU seconds |
| None | Single Precision VFP scalar Assembly | 1.74 |
| | Single Precision VFP vector Assembly | 27.09 |
| | Single Precision NEON Assembly | 1.32 |
| | Single Precision C | 4.36 |
| | Double Precision VFP scalar Assembly | 2.83 |
| | Double Precision VFP vector Assembly | 106.46 |
| | Double Precision NEON Assembly | 2.24 |
| | Double Precision C | 4.59 |
| Full | Single Precision VFP scalar Assembly | 1.11 |
| | Single Precision VFP vector Assembly | 27.15 |
| | Single Precision NEON Assembly | 0.96 |
| | Single Precision C | 1.69 |
| | Double Precision VFP scalar Assembly | 2.56 |
| | Double Precision VFP vector Assembly | 107.53 |
| | Double Precision NEON Assembly | 2.05 |
| | Double Precision C | 4.27 |
When compiler optimization is not used, the single precision scalar VFP implementation achieves a speedup of about 2.51, and the NEON implementation achieves a speedup of about 3.30 compared to the GCC implementation. The double precision scalar VFP implementation achieves a speedup of about 1.62, and the loop-unrolled NEON implementation achieves a speedup of about 2.05 compared to the GCC implementation.
When the best possible compiler optimization is used (-Ofast), the single precision scalar VFP implementation achieves a speedup of about 1.52, and the NEON implementation achieves a speedup of about 1.76 compared to the GCC implementation. The double precision scalar VFP implementation achieves a speedup of about 1.67, and the loop-unrolled NEON implementation achieves a speedup of about 2.08 compared to the GCC implementation. The single precision NEON version was 1.16 times as fast as the VFP scalar version and the double precision NEON implementation was 1.25 times as fast as the VFP scalar implementation.
Although the VFP versions of the sine function ran without modification on the NEON processor, rewriting them for NEON resulted in significant performance improvement. Performance of the vectorized VFP code running on a NEON processor was abysmal. The take-away lesson is that a programmer can improve performance by writing some functions in assembly that are specifically targeted to run on a specific platform. However, assembly code which improves performance on one platform may actually result in very poor performance on a different platform. To achieve optimal or near-optimal performance, it is important for the programmer to be aware of exactly which hardware platform is being used.
| Name | Page | Operation |
| vaba | 339 | Absolute Difference and Accumulate |
| vabal | 339 | Absolute Difference and Accumulate Long |
| vabd | 339 | Absolute Difference |
| vabdl | 339 | Absolute Difference Long |
| vabs | 340 | Absolute Value |
| vacge | 324 | Absolute Compare Greater Than or Equal |
| vacgt | 324 | Absolute Compare Greater Than |
| vacle | 353 | Absolute Compare Less Than or Equal |
| vaclt | 353 | Absolute Compare Less Than |
| vadd | 335 | Add |
| vaddhn | 336 | Add and Narrow |
| vaddl | 335 | Add Long |
| vaddw | 335 | Add Wide |
| vand | 326 | Bitwise AND |
| vand | 352 | Bitwise AND Immediate |
| vbic | 326 | Bit Clear |
| vbic | 327 | Bit Clear Immediate |
| vbif | 328 | Bitwise Insert if False |
| vbit | 328 | Bitwise Insert |
| vbsl | 328 | Bitwise Select |
| vceq | 323 | Compare Equal |
| vcge | 323 | Compare Greater Than or Equal |
| vcgt | 323 | Compare Greater Than |
| vcle | 323 | Compare Less Than or Equal |
| vcls | 342 | Count Leading Sign Bits |
| vclt | 323 | Compare Less Than |
| vclz | 342 | Count Leading Zero Bits |
| vcnt | 342 | Count Set Bits |
| vcvt | 322 | Convert Between Half and Single |
| vcvt | 321 | Convert Data Format |
| vdup | 312 | Duplicate Scalar |
| veor | 326 | Bitwise Exclusive-OR |
| vext | 313 | Extract Elements |
| vfma | 346 | Fused Multiply Accumulate |
| vfms | 346 | Fused Multiply Subtract |
| vfnma | 346 | Fused Negate Multiply Accumulate |
| vfnms | 346 | Fused Negate Multiply Subtract |
| vhadd | 337 | Halving Add |
| vhsub | 337 | Halving Subtract |
| vld<n> | 305 | Load Copies of Structured Data |
| vld<n> | 307 | Load Multiple Structured Data |
| vld<n> | 303 | Load Structured Data |
| vldr | 351 | Load Constant |
| vmax | 341 | Maximum |
| vmin | 341 | Minimum |
| vmla | 343 | Multiply Accumulate |
| vmla | 345 | Multiply Accumulate by Scalar |
| vmlal | 344 | Multiply Accumulate Long |
| vmlal | 345 | Multiply Accumulate Long by Scalar |
| vmls | 343 | Multiply Subtract |
| vmls | 345 | Multiply Subtract by Scalar |
| vmlsl | 344 | Multiply Subtract Long |
| vmlsl | 345 | Multiply Subtract Long by Scalar |
| vmov | 310 | Move Immediate |
| vmov | 309 | Move Between NEON and ARM |
| vmovl | 311 | Move and Lengthen |
| vmovn | 311 | Move and Narrow |
| vmul | 343 | Multiply |
| vmul | 345 | Multiply by Scalar |
| vmull | 343 | Multiply Long |
| vmull | 345 | Multiply Long by Scalar |
| vmvn | 310 | Move Immediate Negative |
| vneg | 340 | Negate |
| vorn | 326 | Bitwise Complement and OR |
| vorn | 352 | Bitwise Complement and OR Immediate |
| vorr | 326 | Bitwise OR |
| vorr | 327 | Bitwise OR Immediate |
| vpadal | 338 | Add Pairwise and Accumulate Long |
| vpadd | 338 | Add Pairwise |
| vpaddl | 338 | Add Pairwise Long |
| vpmax | 341 | Pairwise Maximum |
| vpmin | 341 | Pairwise Minimum |
| vqabs | 340 | Saturating Absolute Value |
| vqadd | 335 | Saturating Add |
| vqdmlal | 347 | Saturating Multiply Double Accumulate (Low) |
| vqdmlsl | 347 | Saturating Multiply Double Subtract (Low) |
| vqdmulh | 348 | Saturating Multiply Double (High) |
| vqdmull | 347 | Saturating Multiply Double (Low) |
| vqmovn | 311 | Saturating Move and Narrow |
| vqmovun | 311 | Saturating Move and Narrow Unsigned |
| vqneg | 340 | Saturating Negate |
| vqrdmulh | 348 | Saturating Multiply Double (High) and Round |
| vqrshl | 330 | Saturating Shift Left or Right by Variable and Round |
| vqrshrn | 332 | Saturating Shift Right Immediate Round |
| vqrshrun | 333 | Saturating Shift Right Immediate Round Unsigned |
| vqshl | 329 | Saturating Shift Left Immediate |
| vqshl | 330 | Saturating Shift Left or Right by Variable |
| vqshlu | 329 | Saturating Shift Left Immediate Unsigned |
| vqshrn | 332 | Saturating Shift Right Immediate |
| vqshrun | 333 | Saturating Shift Right Immediate Unsigned |
| vqsub | 335 | Saturating Subtract |
| vraddhn | 336 | Add, Round, and Narrow |
| vrecpe | 348 | Reciprocal Estimate |
| vrecps | 349 | Reciprocal Step |
| vrev | 314 | Reverse Elements |
| vrhadd | 337 | Halving Add and Round |
| vrshl | 330 | Shift Left or Right by Variable and Round |
| vrshr | 331 | Shift Right Immediate and Round |
| vrshrn | 331 | Shift Right Immediate Round and Narrow |
| vrsqrte | 348 | Reciprocal Square Root Estimate |
| vrsqrts | 349 | Reciprocal Square Root Step |
| vrsra | 331 | Shift Right Round and Accumulate Immediate |
| vrsubhn | 336 | Subtract, Round, and Narrow |
| vshl | 329 | Shift Left Immediate |
| vshl | 330 | Shift Left or Right by Variable |
| vshll | 329 | Shift Left Immediate Long |
| vshr | 331 | Shift Right Immediate |
| vshrn | 331 | Shift Right Immediate and Narrow |
| vsli | 334 | Shift Left and Insert |
| vsra | 331 | Shift Right and Accumulate Immediate |
| vsri | 334 | Shift Right and Insert |
| vst<n> | 307 | Store Multiple Structured Data |
| vst<n> | 303 | Store Structured Data |
| vsub | 335 | Subtract |
| vsubhn | 336 | Subtract and Narrow |
| vsubl | 335 | Subtract Long |
| vsubw | 335 | Subtract Wide |
| vswp | 315 | Swap Vectors |
| vtbl | 318 | Table Lookup |
| vtbx | 318 | Table Lookup with Extend |
| vtrn | 316 | Transpose Matrix |
| vtst | 325 | Test Bits |
| vuzp | 319 | Unzip Vectors |
| vzip | 319 | Zip Vectors |


NEON can dramatically improve the performance of algorithms that can take advantage of data parallelism. However, compiler support for automatic vectorization using NEON instructions is still immature. NEON intrinsics allow C and C++ programmers to access NEON instructions by making them look like C functions. It is usually just as easy, and more concise, to write NEON assembly code as it is to use the intrinsic functions. A careful assembly language programmer can usually beat the compiler, sometimes by a wide margin. The greatest gains usually come from converting an algorithm to avoid floating point and from taking advantage of data parallelism.
10.1 What is the advantage of using IEEE half-precision? What is the disadvantage?
10.2 NEON achieved relatively modest performance gains on the sine function when compared to VFP.
(b) List some tasks for which NEON could significantly outperform VFP.
10.3 There are some limitations on the size of the structure that can be loaded or stored using the vld<n> and vst<n> instructions. What are the limitations?
10.4 The sine function in Listing 10.2 uses a technique known as “loop unrolling” to achieve higher performance. Name at least three reasons why this code is more efficient than using a loop.
10.5 Reimplement the fixed-point sine function from Listing 8.7 using NEON instructions. Hint: you should not need to use a loop. Compare the performance of your NEON implementation with the performance of the original implementation.
10.6 Reimplement Exercise 9.10 using NEON instructions.
10.7 Fixed point operations may be faster than floating point operations. Modify your code from the previous example so that it uses the following definitions for points and transformation matrices:

Use saturating instructions and/or any other techniques necessary to prevent overflow. Compare the performance of the two implementations.
Accessing Devices
This chapter starts with a high-level explanation of how devices may be accessed in a modern computer system, and then explains that most devices on modern architectures are memory-mapped. Next, it explains how memory mapped devices can be accessed by user processes under Linux, by making use of the mmap system call. Code examples are given, showing how several devices can be mapped into the memory of a user-level program on the Raspberry Pi and pcDuino. Next, the General Purpose I/O devices on both systems are explained, giving the reader the opportunity to compare two different devices that perform almost exactly the same functions.
Device; Memory map; General purpose I/O (GPIO); I/O Pin; Header; Pull-up and pull-down resistor; LED; Switch
As mentioned in Chapter 1, a computer system consists of three main parts: the CPU, memory, and devices. The typical computing system has many devices of various types for performing specific functions. Some devices, such as data caches, are closely coupled to the CPU, and are typically controlled by executing special CPU instructions that can only be accessed in assembly language. However, most of the devices on a typical system are accessed and controlled through the system data bus. These devices appear to the programmer to be ordinary memory locations. The hardware in the system bus decodes the addresses coming from the CPU, and some addresses correspond to devices rather than memory. Fig. 11.1 shows the memory layout for a typical system. The exact locations of the devices and memory are chosen by the system hardware designers. From the programmer’s standpoint, writing data to certain memory addresses results in the data being transferred to a device rather than stored in memory. The programmer must read documentation on the hardware design to determine exactly where the devices are in memory.

There are devices that allow data to be read or written from external sources, devices that can measure time, devices for moving data from one location in memory to another, devices for modifying the addresses of memory regions, and devices for even more esoteric purposes. Some devices are capable of sending signals to the CPU to indicate that they need attention, while others simply wait for the CPU to check on their status.
A modern computer system, such as the Raspberry Pi, has dozens or even hundreds of devices. Programmers write device driver software for each device. A device driver provides a few standard function calls for each device, so that it can be used easily. The specific set of functions depends on the type of device and the design of the operating system. Operating system designers strive to define a small set of device types, and to define a standard software interface for each type in order to make devices interchangeable.
Devices are typically controlled by writing specific values to the device’s internal device registers. For the ARM processor, access to most device registers is accomplished using the load and store instructions. Each device is assigned a base address in memory. This address corresponds with the first register inside the device. The device may also have other registers that are accessible at some pre-defined offset address from the base address. Some registers are read-only, some are write-only, and some are read-write. To use the device, the programmer must read from, and write appropriate data to, the correct device registers. For every device, there is a programmer’s model and documentation explaining what each register in the device does. Some devices are well designed, easy to use, and well documented. Some devices are not, and the programmer must work harder to write software to use them.
Linux is a powerful, multiuser, multitasking operating system. The Linux kernel manages all of the devices and protects them from direct access by user programs. User programs are intended to access devices by making system calls. The kernel accesses the devices on behalf of the user programs, ensuring that an errant user program cannot misuse the devices and other resources on the system. Attempting to directly access the registers in any device will result in an exception. The kernel will take over and kill the offending process.
However, our programs will need direct access to the device registers. Linux allows user programs to gain direct access through the mmap() system call. Listing 11.1 shows how four devices can be mapped into the memory space of a user program on a Raspberry Pi. In most cases, the user program will need administrator privileges in order to perform the mapping. The operating system does not usually give permission for ordinary users to access devices directly. However, Linux does provide the ability to change permissions on /dev/mem, or for user programs to run with elevated privileges.






Listing 11.2 shows how four devices can be mapped into the memory space of a user program on a pcDuino. The devices are equivalent to the devices mapped in Listing 11.1. Some of the devices are described in the following sections of this chapter. The pcDuino devices and Raspberry Pi devices operate differently, but provide similar functionality. Note that most of the code is the same for both listings. The only real differences between Listings 11.1 and 11.2 are the names of the devices and their hardware addresses.





One type of device, commonly found on embedded systems, is the General Purpose I/O (GPIO) device. Although there are many variations on this device provided by different manufacturers, they all provide similar capabilities. The device provides a set of input and/or output bits, which allow signals to be transferred to or from the outside world. Each bit of input or output in a GPIO device is generally referred to as a pin, and a group of pins is referred to as a GPIO port. Ports commonly support 8 bits of input or output, but some devices have 16 or 32 bit ports. Some GPIO devices support multiple ports, and some systems have multiple GPIO devices in them.
A system with a GPIO device usually has some type of connector or wires that allow external inputs or outputs to be connected to the system. For example, the IBM PC had a type of GPIO device originally intended for communication with a parallel printer. On that platform, the GPIO device is commonly referred to as the parallel printer port.
Some GPIO devices, such as the one on the IBM PC, are arranged as sets of pins that can be switched as a group to either input or output. In many modern GPIO devices, each pin can be individually configured to accept or source different input and output voltages. On some devices, the amount of drive current available can be configured. Some include the ability to configure built-in pull-up and/or pull-down resistors. On most older GPIO devices, the input and output voltages are typically limited to the supply voltage of the GPIO device, and the device may be damaged by greater voltages. Newer GPIO devices generally can tolerate 5 V on inputs, regardless of the supply voltage of the device.
GPIO devices are very common in systems that are intended to be used for embedded applications. For most GPIO devices:
• individual pins or groups of pins can be configured,
• pins can be configured to be input or output,
• pins can be disabled so that they are neither input nor output,
• input values can be read by the CPU (typically high=1, low=0),
• output values can be read or written by the CPU, and
• input pins can be configured to generate interrupt requests.
Some GPIO devices may also have more advanced features, such as the ability to use Direct Memory Access (DMA) to send data without requiring the CPU to move each byte or word. Fig. 11.2 shows two common ways to use GPIO pins. Fig. 11.2A shows a GPIO pin that has been configured for input, and connected to a push-button switch. When the switch is open, the pull-up resistor pulls the voltage on the pin to a high state. When the switch is closed, the pin is pulled to a low state and some current flows through the pull-up resistor to ground. Typically, the pull-up resistor would be around 10 kΩ. The specific value is not critical, but it must be high enough to limit the current to a small amount when the switch is closed. Fig. 11.2B shows a GPIO pin that is configured as an output and is being used to drive an LED. When a 1 is output on the pin, it is at the same voltage as Vcc (the power supply voltage), and no current flows. The LED is off. When a 0 is output on the pin, current is drawn through the resistor and the LED, and through the pin to ground. This causes the LED to be illuminated. Selection of the resistor is not critical, but it must be small enough to light the LED without allowing enough current to destroy either the LED or the GPIO circuitry. This is typically around 1 kΩ. Note that, in general, GPIO pins can sink more current than they can source, so it is most common to connect LEDs and other devices in the way shown.

The Broadcom BCM2835 system-on-chip contains 54 GPIO pins that are split into two banks. The GPIO pins are named using the following format: GPIOx, where x is a number between 0 and 53. The GPIO pins are highly configurable. Each pin can be used for general purpose I/O, or can be configured to serve up to six pre-defined alternate functions. Configuring a GPIO pin for an alternate function usually allows some other device within the BCM2835 to use the pin. For example, GPIO4 can be used
• to send the signal generated by General Purpose Clock 0 to external devices,
• to send bit one of the Secondary Address Bus to external devices, or
• to receive JTAG data for programming the firmware of the device.
The last eight GPIO pins, GPIO46–GPIO53, have no alternate functions and are used only for GPIO.
In addition to the alternate function, all GPIO pins can be configured individually as input or output. When configured as input, a pin can also be configured to detect when the signal changes, and to send an interrupt to the ARM CPU. Each input pin also has internal pull-up and pull-down resistors, which can be enabled or disabled by the programmer.
The GPIO pins on the BCM2835 SOC are very flexible and quite complex, but they are well designed and not difficult to program once the programmer understands how the pins operate and what the various registers do. There are 41 registers that control the GPIO pins. The base address for the GPIO device is 20200000₁₆. The 41 registers and their offsets from the base address are shown in Table 11.1.
Table 11.1
Raspberry Pi GPIO register map
| Offset | Name | Description | Size | R/W |
| 00₁₆ | GPFSEL0 | GPIO Function Select 0 | 32 | R/W |
| 04₁₆ | GPFSEL1 | GPIO Function Select 1 | 32 | R/W |
| 08₁₆ | GPFSEL2 | GPIO Function Select 2 | 32 | R/W |
| 0C₁₆ | GPFSEL3 | GPIO Function Select 3 | 32 | R/W |
| 10₁₆ | GPFSEL4 | GPIO Function Select 4 | 32 | R/W |
| 14₁₆ | GPFSEL5 | GPIO Function Select 5 | 32 | R/W |
| 1C₁₆ | GPSET0 | GPIO Pin Output Set 0 | 32 | W |
| 20₁₆ | GPSET1 | GPIO Pin Output Set 1 | 32 | W |
| 28₁₆ | GPCLR0 | GPIO Pin Output Clear 0 | 32 | W |
| 2C₁₆ | GPCLR1 | GPIO Pin Output Clear 1 | 32 | W |
| 34₁₆ | GPLEV0 | GPIO Pin Level 0 | 32 | R |
| 38₁₆ | GPLEV1 | GPIO Pin Level 1 | 32 | R |
| 40₁₆ | GPEDS0 | GPIO Pin Event Detect Status 0 | 32 | R/W |
| 44₁₆ | GPEDS1 | GPIO Pin Event Detect Status 1 | 32 | R/W |
| 4C₁₆ | GPREN0 | GPIO Pin Rising Edge Detect Enable 0 | 32 | R/W |
| 50₁₆ | GPREN1 | GPIO Pin Rising Edge Detect Enable 1 | 32 | R/W |
| 58₁₆ | GPFEN0 | GPIO Pin Falling Edge Detect Enable 0 | 32 | R/W |
| 5C₁₆ | GPFEN1 | GPIO Pin Falling Edge Detect Enable 1 | 32 | R/W |
| 64₁₆ | GPHEN0 | GPIO Pin High Detect Enable 0 | 32 | R/W |
| 68₁₆ | GPHEN1 | GPIO Pin High Detect Enable 1 | 32 | R/W |
| 70₁₆ | GPLEN0 | GPIO Pin Low Detect Enable 0 | 32 | R/W |
| 74₁₆ | GPLEN1 | GPIO Pin Low Detect Enable 1 | 32 | R/W |
| 7C₁₆ | GPAREN0 | GPIO Pin Async. Rising Edge Detect 0 | 32 | R/W |
| 80₁₆ | GPAREN1 | GPIO Pin Async. Rising Edge Detect 1 | 32 | R/W |
| 88₁₆ | GPAFEN0 | GPIO Pin Async. Falling Edge Detect 0 | 32 | R/W |
| 8C₁₆ | GPAFEN1 | GPIO Pin Async. Falling Edge Detect 1 | 32 | R/W |
| 94₁₆ | GPPUD | GPIO Pin Pull-up/down Enable | 32 | R/W |
| 98₁₆ | GPPUDCLK0 | GPIO Pin Pull-up/down Enable Clock 0 | 32 | R/W |
| 9C₁₆ | GPPUDCLK1 | GPIO Pin Pull-up/down Enable Clock 1 | 32 | R/W |

The first six 32-bit registers in the device select the function for each of the 54 GPIO pins. The function of each pin is controlled by a group of three bits in one of these registers. The mapping is very regular. Bits 0–2 of GPFSEL0 control the function of GPIO pin 0, bits 3–5 of GPFSEL0 control the function of GPIO pin 1, and so on, up to bits 27–29 of GPFSEL0, which control the function of GPIO pin 9. The next pin, pin 10, is controlled by bits 0–2 of GPFSEL1. The pins are assigned in sequence through the remaining bits, until bits 27–29 of GPFSEL1, which control GPIO pin 19. The remaining four GPFSEL registers control the remaining GPIO pins in the same pattern. Note that bits 30 and 31 of all of the GPFSEL registers are not used, and most of the bits in GPFSEL5 are not assigned to any pin. The meaning of each combination of the three bits is shown in Table 11.2. Note that the encoding is not as simple as one might expect.
Table 11.2
GPIO pin function select bits
| MSB-LSB | Function |
| 000 | Pin is an input |
| 001 | Pin is an output |
| 100 | Pin performs alternate function 0 |
| 101 | Pin performs alternate function 1 |
| 110 | Pin performs alternate function 2 |
| 111 | Pin performs alternate function 3 |
| 011 | Pin performs alternate function 4 |
| 010 | Pin performs alternate function 5 |
The procedure for setting the function of a GPIO pin is as follows:
• Determine which GPIOFSEL register controls the desired pin.
• Determine which bits of the GPIOFSEL register are used.
• Determine what the bit pattern should be.
• Read the GPIOFSEL register.
• Clear the correct bits using the bic instruction.
• Set them to the correct pattern using the orr instruction.
For example, Listing 11.3 shows the sequence of code which would be used to set GPIO pin 26 to alternate function 1.

To use a GPIO pin for output, the function select bits for that pin must be set to 001. Once that is done, the output can be driven high or low by using the GPSET and GPCLR registers. GPIO pin 0 is set to a high output by writing a 1 to bit 0 of GPSET0, and it is set to low output by writing a 1 to bit 0 of GPCLR0. GPIO pin 1 is similarly controlled by bit 1 in GPSET0 and GPCLR0. Each of the GPIO pins numbered 0 through 31 is assigned one bit in GPSET0 and one bit in GPCLR0. GPIO pin 32 is assigned to bit 0 of GPSET1 and GPCLR1, GPIO pin 33 is assigned to bit 1 of GPSET1 and GPCLR1, and so on. Since there are only 54 GPIO pins, bits 22–31 of GPSET1 and GPCLR1 are not used. The programmer can set or clear several outputs simultaneously by writing the appropriate bits in the GPSET and GPCLR registers.
To use a GPIO pin for input, the function select bits for that pin must be set to 000. Once that is done, the input can be read at any time by reading the appropriate GPLEV register and examining the bit that corresponds with the input pin. GPIO pin 0 is read as bit 0 of GPLEV0, and GPIO pin 1 is similarly read as bit 1 of GPLEV0. Each of the GPIO pins numbered 0 through 31 is assigned one bit in GPLEV0. GPIO pin 32 is assigned to bit 0 of GPLEV1, GPIO pin 33 is assigned to bit 1 of GPLEV1, and so on. Since there are only 54 GPIO pins, bits 22–31 of GPLEV1 are not used. The programmer can read the status of several inputs simultaneously by reading one of the GPLEV registers and examining the bits corresponding to the appropriate pins.
Input pins can be configured with internal pull-up or pull-down resistors, which can simplify the design of the system. For instance, Fig. 11.2A shows a push-button switch connected to an input with an external pull-up resistor. That resistor is unnecessary if the internal pull-up for that pin is enabled.
Enabling the pull-up or pull-down is a two-step process. The first step is to configure the type of change to be made, and the second step is to apply that change to the selected pin(s). The first step is accomplished by writing to the GPPUD register. The valid binary control codes are shown in Table 11.3.
Table 11.3
GPPUD control codes
| Code | Function |
| 00 | Disable pull-up and pull-down |
| 01 | Enable pull-down |
| 10 | Enable pull-up |
Once the GPPUD register is configured, the selected operation can be performed on multiple pins by writing to one or both of the GPPUDCLK registers. GPIO pins are assigned to bits in these two registers in the same way as the pins are assigned in the GPLEV, GPSET, and GPCLR registers. Writing 1 to bit 0 of GPPUDCLK0 will configure the pull-up or pull-down for GPIO pin 0, according to the control code that is currently in the GPPUD register.
The GPEDS registers are used for detecting events that have occurred on the GPIO pins. For instance, a pin may have transitioned from low to high, and back to low. If the CPU does not read the GPLEV register often enough, then such an event could be missed. The GPEDS registers can be configured to capture such events so that the CPU can detect that they occurred.
GPIO pins are assigned to bits in these two registers in the same way as the pins are assigned in the GPLEV, GPSET, and GPCLR registers. If bit 0 of GPEDS0 is set, then an event has occurred on GPIO pin 0. Writing a 1 to that bit will clear it and allow the event detector to detect another event. Each pin can be configured to detect specific types of events by writing to the GPREN, GPHEN, GPLEN, GPAREN, and GPAFEN registers. For more information, refer to the BCM2835 ARM Peripherals manual.
The Raspberry Pi provides access to several of the 54 GPIO pins through the expansion header. The expansion header is a group of physical pins located in the corner of the Raspberry Pi board. Fig. 11.3 shows where the header is located on the Raspberry Pi. Wires can be connected to these pins and then the GPIO device can be programmed to send and/or receive digital information. Fig. 11.4 shows which signals are attached to the various pins. Some of the pins are used to provide power and ground to the external devices.


Table 11.4 shows some useful alternate functions available on each pin of the Raspberry Pi expansion header. Many of the alternate functions available on these pins are not really useful. Those functions have been left out of the table. The most useful alternate functions are probably GPIO 14 and 15, which can be used for serial communication, and GPIO 18, which can be used for pulse width modulation. Pulse width modulation is covered in Section 12.2, and serial communication is covered in Section 13.2. The Serial Peripheral Interface (SPI) functions could also be useful for connecting the Raspberry Pi to other devices which support SPI. Also, the SDA and SCL functions could be used to communicate with I2C devices.
The AllWinner A10/A20 system-on-chip contains 175 GPIO pins, which are arranged in nine ports. Each port is identified by a letter between “A” and “I.” The ports are part of the PIO device, which is mapped at address 01C20800₁₆. The GPIO pins are named using the following format: PNx, where N is a letter between “A” and “I” indicating the port, and x is a number indicating a pin on the given port. The assignment of pins to ports is somewhat irregular, as shown in Table 11.5. Some ports have as many as 28 physical pins, while others have as few as six. However, the layout of the registers in the device is very regular. Given any port and pin combination, finding the correct registers and sets of bits within them is very straightforward.
Table 11.5
Number of pins available on each of the AllWinner A10/A20 PIO ports
| Port | Pins |
| A | 18 |
| B | 24 |
| C | 25 |
| D | 28 |
| E | 12 |
| F | 6 |
| G | 12 |
| H | 28 |
| I | 22 |
Each of the nine ports is controlled by a set of nine registers, for a total of 81 registers. There are seven additional registers that can be used to configure pins as interrupt sources. Interrupt processing is explained in Section 14.2. All of the port and interrupt registers together make a total of 88 registers for the GPIO device. The complete register map, with the offset of each register from the device base address, is shown in Table 11.6.
Table 11.6
Registers in the AllWinner GPIO device
| Offset | Name | Description |
| 000₁₆ | PA_CFG0 | Function select for Port A, Pins 0–7 |
| 004₁₆ | PA_CFG1 | Function select for Port A, Pins 8–15 |
| 008₁₆ | PA_CFG2 | Function select for Port A, Pins 16–17 |
| 00C₁₆ | PA_CFG3 | Not used |
| 010₁₆ | PA_DAT | Port A Data Register |
| 014₁₆ | PA_DRV0 | Port A Multi-driving, Pins 0–15 |
| 018₁₆ | PA_DRV1 | Port A Multi-driving, Pins 16–17 |
| 01C₁₆ | PA_PULL0 | Port A Pull-Up/-Down, Pins 0–15 |
| 020₁₆ | PA_PULL1 | Port A Pull-Up/-Down, Pins 16–17 |
| 024₁₆ | PB_CFG0 | Function select for Port B, Pins 0–7 |
| 028₁₆ | PB_CFG1 | Function select for Port B, Pins 8–15 |
| 02C₁₆ | PB_CFG2 | Function select for Port B, Pins 16–23 |
| 030₁₆ | PB_CFG3 | Not used |
| 034₁₆ | PB_DAT | Port B Data Register |
| 038₁₆ | PB_DRV0 | Port B Multi-driving, Pins 0–15 |
| 03C₁₆ | PB_DRV1 | Port B Multi-driving, Pins 16–23 |
| 040₁₆ | PB_PULL0 | Port B Pull-Up/-Down, Pins 0–15 |
| 044₁₆ | PB_PULL1 | Port B Pull-Up/-Down, Pins 16–23 |
| 048₁₆ | PC_CFG0 | Function select for Port C, Pins 0–7 |
| 04C₁₆ | PC_CFG1 | Function select for Port C, Pins 8–15 |
| 050₁₆ | PC_CFG2 | Function select for Port C, Pins 16–23 |
| 054₁₆ | PC_CFG3 | Function select for Port C, Pin 24 |
| 058₁₆ | PC_DAT | Port C Data Register |
| 05C₁₆ | PC_DRV0 | Port C Multi-driving, Pins 0–15 |
| 060₁₆ | PC_DRV1 | Port C Multi-driving, Pins 16–23 |
| 064₁₆ | PC_PULL0 | Port C Pull-Up/-Down, Pins 0–15 |
| 068₁₆ | PC_PULL1 | Port C Pull-Up/-Down, Pins 16–23 |
| 06C₁₆ | PD_CFG0 | Function select for Port D, Pins 0–7 |
| 070₁₆ | PD_CFG1 | Function select for Port D, Pins 8–15 |
| 074₁₆ | PD_CFG2 | Function select for Port D, Pins 16–23 |
| 078₁₆ | PD_CFG3 | Function select for Port D, Pins 24–27 |
| 07C₁₆ | PD_DAT | Port D Data Register |
| 080₁₆ | PD_DRV0 | Port D Multi-driving, Pins 0–15 |
| 084₁₆ | PD_DRV1 | Port D Multi-driving, Pins 16–27 |
| 088₁₆ | PD_PULL0 | Port D Pull-Up/-Down, Pins 0–15 |
| 08C₁₆ | PD_PULL1 | Port D Pull-Up/-Down, Pins 16–27 |
| 090₁₆ | PE_CFG0 | Function select for Port E, Pins 0–7 |
| 094₁₆ | PE_CFG1 | Function select for Port E, Pins 8–11 |
| 098₁₆ | PE_CFG2 | Not used |
| 09C₁₆ | PE_CFG3 | Not used |
| 0A0₁₆ | PE_DAT | Port E Data Register |
| 0A4₁₆ | PE_DRV0 | Port E Multi-driving, Pins 0–11 |
| 0A8₁₆ | PE_DRV1 | Not used |
| 0AC₁₆ | PE_PULL0 | Port E Pull-Up/-Down, Pins 0–11 |
| 0B0₁₆ | PE_PULL1 | Not used |
| 0B4₁₆ | PF_CFG0 | Function select for Port F, Pins 0–5 |
| 0B8₁₆ | PF_CFG1 | Not used |
| 0BC₁₆ | PF_CFG2 | Not used |
| 0C0₁₆ | PF_CFG3 | Not used |
| 0C4₁₆ | PF_DAT | Port F Data Register |
| 0C8₁₆ | PF_DRV0 | Port F Multi-driving, Pins 0–5 |
| 0CC₁₆ | PF_DRV1 | Not used |
| 0D0₁₆ | PF_PULL0 | Port F Pull-Up/-Down, Pins 0–5 |
| 0D4₁₆ | PF_PULL1 | Not used |
| 0D8₁₆ | PG_CFG0 | Function select for Port G, Pins 0–7 |
| 0DC₁₆ | PG_CFG1 | Function select for Port G, Pins 8–11 |
| 0E0₁₆ | PG_CFG2 | Not used |
| 0E4₁₆ | PG_CFG3 | Not used |
| 0E8₁₆ | PG_DAT | Port G Data Register |
| 0EC₁₆ | PG_DRV0 | Port G Multi-driving, Pins 0–11 |
| 0F0₁₆ | PG_DRV1 | Not used |
| 0F4₁₆ | PG_PULL0 | Port G Pull-Up/-Down, Pins 0–11 |
| 0F8₁₆ | PG_PULL1 | Not used |
| 0FC₁₆ | PH_CFG0 | Function select for Port H, Pins 0–7 |
| 100₁₆ | PH_CFG1 | Function select for Port H, Pins 8–15 |
| 104₁₆ | PH_CFG2 | Function select for Port H, Pins 16–23 |
| 108₁₆ | PH_CFG3 | Function select for Port H, Pins 24–27 |
| 10C₁₆ | PH_DAT | Port H Data Register |
| 110₁₆ | PH_DRV0 | Port H Multi-driving, Pins 0–15 |
| 114₁₆ | PH_DRV1 | Port H Multi-driving, Pins 16–27 |
| 118₁₆ | PH_PULL0 | Port H Pull-Up/-Down, Pins 0–15 |
| 11C₁₆ | PH_PULL1 | Port H Pull-Up/-Down, Pins 16–27 |
| 120₁₆ | PI_CFG0 | Function select for Port I, Pins 0–7 |
| 124₁₆ | PI_CFG1 | Function select for Port I, Pins 8–15 |
| 128₁₆ | PI_CFG2 | Function select for Port I, Pins 16–21 |
| 12C₁₆ | PI_CFG3 | Not used |
| 130₁₆ | PI_DAT | Port I Data Register |
| 134₁₆ | PI_DRV0 | Port I Multi-driving, Pins 0–15 |
| 138₁₆ | PI_DRV1 | Port I Multi-driving, Pins 16–21 |
| 13C₁₆ | PI_PULL0 | Port I Pull-Up/-Down, Pins 0–15 |
| 140₁₆ | PI_PULL1 | Port I Pull-Up/-Down, Pins 16–21 |
| 200₁₆ | PIO_INT_CFG0 | PIO Interrupt Configure Register 0 |
| 204₁₆ | PIO_INT_CFG1 | PIO Interrupt Configure Register 1 |
| 208₁₆ | PIO_INT_CFG2 | PIO Interrupt Configure Register 2 |
| 20C₁₆ | PIO_INT_CFG3 | PIO Interrupt Configure Register 3 |
| 210₁₆ | PIO_INT_CTL | PIO Interrupt Control Register |
| 214₁₆ | PIO_INT_STATUS | PIO Interrupt Status Register |
| 218₁₆ | PIO_INT_DEB | PIO Interrupt Debounce Register |


The GPIO pins are highly configurable. Each pin can be used for general purpose I/O, or can be configured to serve one of up to six pre-defined alternate functions. Configuring a GPIO pin for an alternate function usually allows some other device within the A10/A20 SOC to use the pin. For example, PB2 (pin 2 of port B) can be used for general purpose I/O, or can be used to output the signal from a Pulse Width Modulator (PWM) device (explained in Section 12.2). Each input pin also has internal pull-up and pull-down resistors, which can be enabled or disabled by the programmer.
The first four registers for each port are used to configure the functions for each of the pins. The function of each pin is controlled by three bits in one of the four configuration registers. Pins 0–7 are controlled using configuration register 0. Pins 8–15 are controlled by configuration register 1, and so on. The assignment of pins to control bits is shown in Fig. 11.5. Note that eight pins are controlled by each register, and there is an unused bit between each group of three bits.
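Although the listings in this book are in ARM assembly, the arithmetic for locating a pin's configuration field can be sketched in C. The helper names below are hypothetical, not from any listing; they simply encode the layout just described (eight pins per register, four bits per pin).

```c
#include <stdint.h>

/* Each CFGn register holds eight pins, four bits per pin
   (three function-select bits plus one unused bit). */

/* Index (0-3) of the CFG register that controls a pin. */
static inline uint32_t cfg_index(uint32_t pin) {
    return pin / 8;
}

/* Bit position of the pin's 3-bit function field within that register. */
static inline uint32_t cfg_shift(uint32_t pin) {
    return (pin % 8) * 4;
}
```

For instance, pin 10 of a port falls in configuration register 1, with its function field starting at bit 8.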

Each GPIO pin can be configured by writing a 3-bit code to the appropriate location in the correct port configuration register. The meaning of each possible code is shown in Table 11.7. For example, to configure port A, pin 10 (PA10) for output, the 3-bit code 001 must be written to bits 8–10 of the PA_CFG1 register, without changing any other bit in the register. Listing 11.4 shows how this operation can be accomplished.
Table 11.7
Allwinner A10/A20 GPIO pin function select bits
| MSB-LSB | Function |
| 000 | Pin is an input |
| 001 | Pin is an output |
| 010 | Pin performs alternate function 0 |
| 011 | Pin performs alternate function 1 |
| 100 | Pin performs alternate function 2 |
| 101 | Pin performs alternate function 3 |
| 110 | Pin performs alternate function 4 |
| 111 | Pin performs alternate function 5 |
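As a companion to the assembly in Listing 11.4, the read-modify-write sequence can be sketched in C. The function name is hypothetical, and the register is simulated with an ordinary pointer; on real hardware it would point at the memory-mapped CFG register.

```c
#include <stdint.h>

/* Write a 3-bit function-select code for one pin into a CFG register,
   leaving all other bits unchanged (read-modify-write). */
static void set_pin_function(volatile uint32_t *cfg_reg,
                             uint32_t pin, uint32_t code) {
    uint32_t shift = (pin % 8) * 4;   /* 4 bits per pin, 8 pins/register */
    uint32_t val = *cfg_reg;
    val &= ~(7u << shift);            /* clear the 3-bit field */
    val |= (code & 7u) << shift;      /* insert the new code   */
    *cfg_reg = val;
}
```

Calling `set_pin_function` with pin 10 and code 001 modifies only bits 8–10, exactly as the PA10 example above requires.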

An output pin can be set to a high state by setting the corresponding bit in the correct port data register. Likewise the pin can be set to a low state by clearing its corresponding bit. Care must be taken to avoid changing any other bits in the port data register. Listing 11.5 shows how this operation can be accomplished for setting a port to output a high state. To set the port output to a low state, the orr instruction would be replaced with a bic instruction.
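The orr/bic pair of Listing 11.5 has a direct C counterpart, sketched below with a hypothetical helper name and a simulated data register.

```c
#include <stdint.h>

/* Drive one GPIO output pin high or low by read-modify-write of the
   port data register; the C equivalent of the orr/bic instructions. */
static void write_pin(volatile uint32_t *dat_reg, uint32_t pin, int high) {
    if (high)
        *dat_reg |= (1u << pin);    /* orr: set the pin's bit   */
    else
        *dat_reg &= ~(1u << pin);   /* bic: clear the pin's bit */
}
```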

To determine the current state of an output pin or read an input pin, the programmer can read the contents of the correct port data register and use bitwise logical operations to isolate the appropriate bit. For example, to read the state of pin 14 of port I (PI14), the programmer would read the PI_DAT register and mask all bits except bit 14. Listing 11.6 shows how this operation can be accomplished. Another method would be to use the tst instruction, rather than the ands instruction, to set the CPSR flags.
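In C, the masking operation of Listing 11.6 reduces to a shift and a bitwise AND. The helper name is hypothetical; the value passed in stands for the contents of a port data register such as PI_DAT.

```c
#include <stdint.h>

/* Return the state (0 or 1) of one pin, given the value read from a
   port data register, by masking off all other bits. */
static int read_pin(uint32_t dat_value, uint32_t pin) {
    return (int)((dat_value >> pin) & 1u);
}
```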

Input pins can be configured with internal pull-up or pull-down resistors. This can simplify the design of the system. For instance, Fig. 11.2a shows a push-button switch connected to an input with an external pull-up resistor. That resistor is unnecessary if the internal pull-up for that pin is enabled. Each pin is assigned two bits in one of the port pull-up/-down registers. The pull-up and pull-down resistors for pin 0 on port B are controlled using bits 0 and 1 of the PB_PULL0 register. Likewise, the pull-up and pull-down resistors for pin 19 of port C are controlled using bits 6 and 7 of the PC_PULL1 register. Table 11.8 shows the bit patterns used to configure the pull-up and pull-down resistors for a pin.
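The mapping from pin number to PULL register and bit position follows the same pattern as the configuration registers, but with two bits per pin and sixteen pins per register. These hypothetical C helpers encode that arithmetic:

```c
#include <stdint.h>

/* Each pin has a 2-bit pull-up/-down field; sixteen pins fit in each
   32-bit PULL register. */
static inline uint32_t pull_index(uint32_t pin) { return pin / 16; }
static inline uint32_t pull_shift(uint32_t pin) { return (pin % 16) * 2; }
```

For pin 19 these give PULL register 1 and bit position 6, matching the PC_PULL1 example above.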
When configured as an input, most of the pins on the pcDuino can be configured to generate an interrupt, which notifies the CPU that an event has occurred. Configuration of interrupts is beyond the scope of this chapter. It is accomplished using the PIO_INT registers.
The pcDuino provides access to several of the 175 GPIO pins through the expansion headers. Fig. 11.6 shows where the headers are located on the pcDuino. Wires can be plugged into the holes in these headers and then the GPIO device can be programmed to send and/or receive digital and/or analog signals. The physical layout of the pcDuino header makes it compatible with a wide range of expansion modules designed for the Arduino family of microcontroller boards.

Some of the header holes can provide power and ground to the external devices. Analog signals can be read into the pcDuino using the ADC header connections. Fig. 11.7 shows the pcDuino names for the signals that are available on the headers. Table 11.9 shows how the pcDuino header signal names are mapped to the actual port pins on the AllWinner A10/A20 chip. It also shows the most useful alternate functions available on each of the pins. Many alternate functions are omitted from the table because they are rarely useful in this context. Note that the pcDuino and the Raspberry Pi both provide pins to perform PWM, UART communications, and SPI.

Table 11.9
pcDuino GPIO pins and function select code assignments.
| Function Select Code Assignment | ||||||
| pcDuino Pin Name | Port | Pin | 010 | 011 | 100 | 110 |
| UART-Rx(GPIO0) | I | 19 | UART2_RX | EINT31 | ||
| UART-Tx(GPIO1) | I | 18 | UART2_TX | EINT30 | ||
| GPIO3(GPIO2) | H | 7 | UART5_RX | EINT7 | ||
| PWM0(GPIO3) | H | 6 | UART5_TX | EINT6 | ||
| GPIO4 | H | 8 | EINT8 | |||
| PWM1(GPIO5) | B | 2 | PWM0 | |||
| PWM2(GPIO6) | I | 3 | PWM1 | |||
| GPIO7 | H | 9 | EINT9 | |||
| GPIO8 | H | 10 | EINT10 | |||
| PWM3(GPIO9) | H | 5 | EINT5 | |||
| SPI_CS(GPIO10) | I | 10 | SPI0_CS0 | UART5_TX | EINT22 | |
| SPI_MOSI(GPIO11) | I | 12 | SPI0_MOSI | UART6_TX | CLK_OUT_A | EINT24 |
| SPI_MISO(GPIO12) | I | 13 | SPI0_MISO | UART6_RX | CLK_OUT_B | EINT25 |
| SPI_CLK(GPIO13) | I | 11 | SPI0_CLK | UART5_RX | EINT23 | |

All input and output are accomplished by using devices. There are many types of devices, and each device has its own set of registers which are used to control it. The programmer must understand the operation of the device and the use of each register in order to use the device at a low level. Computer system manufacturers can usually provide documentation with the information necessary for low-level programming. The quality of that documentation varies greatly, and a general understanding of the various types of devices can help in deciphering poor or incomplete documentation.
There are two major tasks where programming devices at the register level is required: operating system drivers and very small embedded systems. Operating systems provide an abstract view of each device and this allows programmers to use them more easily. However, someone must write that driver, and that person must have intimate knowledge of the device. On very small systems, there may not be a driver available. In that case, the device must be accessed directly. Even when an operating system provides a driver, it is sometimes necessary or desirable for the programmer to access the device directly. For example, some devices may provide modes of operation or capabilities that are not supported by the operating system driver. Linux provides a mechanism which allows the programmer to map a physical device into the program’s memory space, thereby gaining access to the raw device registers.
11.1 Explain the relationships and differences between device registers, memory locations, and CPU registers.
11.2 Why is it necessary to map the device into user program memory before accessing it under Linux? Would this step be necessary under all operating systems or in the case where there is no operating system and our code is running on the “bare metal?”
11.3 What is the purpose of a GPIO device?
11.4 The Raspberry Pi and the pcDuino have very different GPIO devices.
(a) Are they functionally equivalent?
(b) Are they equally programmer-friendly?
(c) If you have answered no to either of the previous questions, then what are the differences?
11.5 Draw a circuit diagram showing how to connect:
(a) a pushbutton switch to GPIO 23 and an LED to GPIO 27 on the Raspberry Pi, and
(b) a pushbutton switch to GPIO12 and an LED to GPIO13 on the pcDuino.
11.6 Assuming the systems are wired according to the previous exercise, write two functions. One function must initialize the GPIO pins, and the other function must read the state of the switch and turn the LED on if the button is pressed, and off if the button is not pressed. Write the two functions for
(a) a Raspberry Pi, and
(b) a pcDuino.
11.7 Write the code necessary to route the output from PWM0 to GPIO 18 on a Raspberry Pi.
11.8 Write the code necessary to route the output from PWM0 to GPIO 5 on a PcDuino.
This chapter begins by explaining pulse density and pulse width modulation in general terms. It then introduces and describes the PWM device on the Raspberry Pi. Following that, it covers the pcDuino PWM device. This gives the reader another opportunity to see two different devices which both perform essentially the same functions.
Pulse width modulation; Pulse density modulation; Digital to analog; Low pass filter
The GPIO device provides a method for sending digital signals to external devices. This can be useful to control devices that have basically two states: on and off. In some situations, it is useful to have the ability to turn a device on at varying levels. For instance, it could be useful to control a motor at any required speed, or control the brightness of a light source. One way that this can be accomplished is through pulse modulation.
The basic idea is that the computer sends a stream of pulses to the device. The device acts as a low-pass filter, which averages the digital pulses into an analog voltage. By varying the percentage of time that the pulses are high, versus low, the computer can control how much average energy is sent to the device. The percentage of time that the pulses are high versus low is known as the duty cycle. Varying the duty cycle is referred to as modulation. There are two major types of pulse modulation: pulse density modulation (PDM) and pulse width modulation (PWM). Most pulse modulation devices are configured in three steps as follows:
1. The base frequency of the clock that drives the PWM device is configured. This step is usually optional.
2. The mode of operation for the pulse modulation device is configured by writing to one or more configuration registers in the pulse modulation device.
3. The cycle time is set by writing a “range” value into a register in the pulse modulation device. This value is usually set as a multiple of the base clock cycle time.
Once the device is configured, the duty cycle can be changed easily by writing to one or more registers in the pulse modulation device.
With PDM, also known as pulse frequency modulation (PFM), the duration of the positive pulses does not change, but the time between them (the pulse density) is modulated. When using PDM devices, the programmer typically sets the device cycle time tc in a register, then uses another register to specify the number of pulses d that are to be sent during a device cycle. The number of pulses is typically referred to as the duty cycle and must be chosen such that 0 ≤ d ≤ tc. For instance, if tc = 1024, then the device cycle time is 1024 times the cycle time of the clock that drives the device. If d = 512, then the device will send 512 pulses, evenly spaced, during the device cycle. Each pulse will have the same duration as the base clock. The device will continue to output this pulse pattern until d is changed.
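The "evenly spaced pulses" behavior can be modeled in a few lines of C. This is an illustrative model, not the hardware's actual circuit: it uses a Bresenham-style integer accumulator, which is just one way of spacing d pulses evenly over tc ticks.

```c
#include <stdint.h>

/* Model of PDM pulse placement: emit d pulses, as evenly spaced as
   possible, across one device cycle of tc base-clock ticks (tc > 0).
   Returns the number of pulses emitted, which always equals d. */
static uint32_t pdm_cycle(uint32_t tc, uint32_t d) {
    uint32_t pulses = 0;
    for (uint32_t i = 0; i < tc; i++) {
        /* A pulse occurs on tick i whenever the running fraction
           i*d/tc advances past the next integer. */
        if ((i + 1) * (uint64_t)d / tc > i * (uint64_t)d / tc)
            pulses++;
    }
    return pulses;
}
```

With tc = 1024 and d = 512, the model emits a pulse on every other tick, exactly the pattern described above.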
Fig. 12.1 shows a signal that is being sent using PDM, and the resulting set of pulses. Each pulse transfers a fixed amount of energy to the device. When the pulses arrive at the device, they are effectively filtered using a low pass filter. The resulting received signal is also shown. Notice that the received signal has a delay, or phase shift, caused by the low-pass filtering. This approach is suitable for controlling certain types of devices, such as lights and speakers.

However, when driving such devices directly with the digital pulses, care must be taken that the minimum frequency of pulses remains above the threshold that can be detected by human senses. For instance, when driving a speaker, the minimum pulse frequency must be high enough that the individual pulses cannot be distinguished by the human ear. This minimum frequency is around 40 kHz. Likewise, when driving an LED directly, the minimum frequency must be high enough that the eye cannot detect the individual pulses; otherwise they will be seen as flicker. That minimum frequency is around 70 Hz. To reduce or alleviate this problem, designers may add a low-pass filter between the PWM device and the device that is being driven.
In PWM, the frequency of the pulses remains fixed, but the duration of the positive pulse (the pulse width) is modulated. When using PWM devices, the programmer typically sets the device cycle time tc in a register, then uses another register to specify the number of base clock cycles, d, for which the output should be high. The percentage

(d / tc) × 100%

is typically referred to as the duty cycle, and d must be chosen such that 0 ≤ d ≤ tc. For instance, if tc = 1024, then the device cycle time is 1024 times the cycle time of the clock that drives the device. If d = 512, then the device will output a high signal for 512 clock cycles, then output a low signal for 512 clock cycles. It will continue to repeat this pattern of pulses until d is changed.
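The duty-cycle percentage can be computed directly from d and tc; a one-line C sketch (hypothetical helper name):

```c
/* Duty cycle as a percentage of the device cycle: (d / tc) * 100.
   With tc = 1024 and d = 512 this gives 50%. */
static double duty_cycle_pct(double d, double tc) {
    return d / tc * 100.0;
}
```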
Fig. 12.2 shows a signal that is being sent using PWM. The pulses are also shown. Each pulse transfers some energy to the device. The width of each pulse determines how much energy is transferred. When the pulses arrive at the device, they are effectively filtered using a low-pass filter. The resulting received signal is shown by the dashed line. As with PDM, the received signal has a delay, or phase shift, caused by the low-pass filtering.

One advantage of PWM over PDM is that the digital circuit is not as complex. Another advantage of PWM over PDM is that the frequency of the pulses does not vary, so it is easier for the programmer to set the base frequency high enough that the individual pulses cannot be detected by human senses. Also, when driving motors it is usually necessary to match the pulse frequency to the size and type of motor. Mismatching the frequency can cause loss of efficiency as well as overheating of the motor and drive electronics. In severe cases, this can cause premature failure of the motor and/or drive electronics. With PWM, it is easier for the programmer to control the base frequency, and thereby avoid those problems.
The Broadcom BCM2835 system-on-chip includes a device that can create two PWM signals. One of the signals (PWM0) can be routed through GPIO pin 18 (alternate function 5), where it is available on the Raspberry Pi expansion header at pin 12. PWM0 can also be routed through GPIO pin 40, where it is sent through a low-pass filter and then to the Raspberry Pi audio output port as the right stereo channel. The other signal (PWM1) can be routed through GPIO pin 45. From there, it is sent through a low-pass filter, and then to the Raspberry Pi audio output port as the left stereo channel. Thus both PWM channels are accessible, but PWM1 is only accessible through the audio output port after it has been low-pass filtered, while the raw PWM0 signal is available on expansion header pin 12.
There are three modes of operation for the BCM2835 PWM device:
1. PDM mode,
2. PWM mode, and
3. serial transmission mode.
The following paragraphs explain how the device can be used in basic PWM mode, which is the simplest and most straightforward mode for this device. Information on how to use the PDM and serial transmission modes, the FIFO, and DMA is available in the BCM2835 ARM Peripherals manual.
The base address of the PWM device is 2020C000₁₆ and it contains eight registers. Table 12.1 shows the offset, name, and a short description for each of the registers. The mode of operation is selected for each channel independently by writing the appropriate bits in the PWMCTL register. The base clock frequency is controlled by the clock manager device, which is explained in Section 13.1. By default, the system startup code sets the base clock for the PWM device to 100 MHz.
Table 12.1
Raspberry Pi PWM register map
| Offset | Name | Description | Size | R/W |
| 00₁₆ | PWMCTL | PWM Control | 32 | R/W |
| 04₁₆ | PWMSTA | PWM FIFO Status | 32 | R/W |
| 08₁₆ | PWMDMAC | PWM DMA Configuration | 32 | R/W |
| 10₁₆ | PWMRNG1 | PWM Channel 1 Range | 32 | R/W |
| 14₁₆ | PWMDAT1 | PWM Channel 1 Data | 32 | R/W |
| 18₁₆ | PWMFIF1 | PWM FIFO Input | 32 | R/W |
| 20₁₆ | PWMRNG2 | PWM Channel 2 Range | 32 | R/W |
| 24₁₆ | PWMDAT2 | PWM Channel 2 Data | 32 | R/W |

Table 12.2 shows the names and short descriptions of the bits in the PWMCTL register. There are 8 bits used for controlling channel 1 and 8 bits for controlling channel 2. PWENn is the master enable bit for channel n. Setting that bit to 0 disables the PWM channel, while setting it to 1 enables the channel. MODEn is used to select whether the channel is in serial transmission mode or in the PDM/PWM mode. If MODEn is set to 0, then MSENn is used to choose whether channel n is in PDM mode or PWM mode. If MODEn is set to 1, then RPTLn, SBITn, USEFn, and CLRFn are used to manage the operation of the FIFO for channel n. POLAn is used to enable or disable inversion of the output signal for channel n.
Table 12.2
Raspberry Pi PWM control register bits

The PWMRNGn registers are used to define the base period for the corresponding channel. In PDM mode, evenly distributed pulses are sent within a period of length defined by this register, and the number of pulses sent during the base period is controlled by writing to the corresponding PWMDATn register. In PWM mode, the PWMRNGn register defines the base frequency for the pulses, and the duty cycle is controlled by writing to the corresponding PWMDATn register. Example 12.1 gives an overview of the steps needed to configure PWM0 for use in PWM mode.
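The steps of Example 12.1 can be sketched in C, operating on a simulated register block rather than the memory-mapped device. The register offsets come from Table 12.1; the PWEN1 (bit 0) and MSEN1 (bit 7) bit positions are assumptions taken from the BCM2835 ARM Peripherals manual and should be checked against Table 12.2.

```c
#include <stdint.h>

/* Word offsets within the PWM register block (Table 12.1). */
#define PWMCTL   (0x00 / 4)
#define PWMRNG1  (0x10 / 4)
#define PWMDAT1  (0x14 / 4)

/* Assumed PWMCTL bit positions; verify against Table 12.2. */
#define PWEN1 (1u << 0)   /* channel 1 enable           */
#define MSEN1 (1u << 7)   /* channel 1 mark/space (PWM) */

/* Configure channel 1 (PWM0) for PWM mode with the given range
   (device cycle time tc) and data (high time d). */
static void pwm0_setup(volatile uint32_t *pwm,
                       uint32_t range, uint32_t duty) {
    pwm[PWMCTL] &= ~PWEN1;          /* disable while configuring */
    pwm[PWMRNG1] = range;           /* device cycle time tc      */
    pwm[PWMDAT1] = duty;            /* high time d               */
    pwm[PWMCTL] |= MSEN1 | PWEN1;   /* PWM mode, channel enabled */
}
```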
The AllWinner A10/A20 SOCs have a hardware PWM device which is capable of generating two PWM signals. The PWM device is driven by the OSC24M signal, which is generated by the Clock Control Unit (CCU) in the AllWinner SOC. This base clock runs at 24 MHz by default, and changing the base frequency could affect many other devices in the system. The base clock can be divided by one of 11 predefined values using a prescaler built into the PWM device. Each of the two channels has its own prescaler. Table 12.3 shows the possible settings for the prescalers.
Table 12.3
Prescaler bits in the pcDuino PWM device
| Value | Effect |
| 0000 | Base clock is divided by 120 |
| 0001 | Base clock is divided by 180 |
| 0010 | Base clock is divided by 240 |
| 0011 | Base clock is divided by 360 |
| 0100 | Base clock is divided by 480 |
| 0101, 0110, 0111 | Not used |
| 1000 | Base clock is divided by 1200 |
| 1001 | Base clock is divided by 2400 |
| 1010 | Base clock is divided by 3600 |
| 1011 | Base clock is divided by 4800 |
| 1100 | Base clock is divided by 7200 |
| 1101, 1110 | Not used |
| 1111 | Base clock is divided by 1 |
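Table 12.3 is easy to capture as a lookup table in software; this hypothetical C helper maps the 4-bit prescaler code to its divisor:

```c
#include <stdint.h>

/* Map the 4-bit prescaler code to its divisor, per Table 12.3.
   Returns 0 for the "not used" codes. */
static uint32_t prescale_divisor(uint32_t code) {
    static const uint32_t divisors[16] = {
        120, 180, 240, 360, 480, 0, 0, 0,
        1200, 2400, 3600, 4800, 7200, 0, 0, 1
    };
    return divisors[code & 0xFu];
}
```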
There are two modes of operation for the PWM device. In the first mode, the device operates like a standard PWM device as described in Section 12.2. In the second mode, it sends a single pulse and then waits until it is triggered again by the CPU. In this mode, the device acts as a monostable multivibrator, also known as a one-shot multivibrator, or simply a one-shot. The duration of the pulse is controlled using the prescaler and the period register.
The PWM device is mapped at address 01C20C00₁₆. Table 12.4 shows the registers and their offsets from the base address. All of the device configuration is done through a single control register, which can also be read in order to determine the status of the device. The bits in the control register are shown in Table 12.5.
Table 12.4
pcDuino PWM register map
| Offset | Name | Description |
| 200₁₆ | PWMCTL | PWM Control |
| 204₁₆ | PWM_CH0_PERIOD | PWM Channel 0 Period |
| 208₁₆ | PWM_CH1_PERIOD | PWM Channel 1 Period |
Table 12.5
pcDuino PWM control register bits
| Bit | Name | Description | Values |
| 3–0 | CH0_PRESCAL | Channel 0 Prescale | These bits must be set before PWM Channel 0 clock is enabled. See Table 12.3. |
| 4 | CH0_EN | Channel 0 Enable | 0: Channel disabled; 1: Channel enabled |
| 5 | CH0_ACT_STA | Channel 0 Polarity | 0: Channel is active low; 1: Channel is active high |
| 6 | SCLK_CH0_GATING | Channel 0 Clock | 0: Clock disabled; 1: Clock enabled |
| 7 | CH0_PUL_START | Start pulse | If configured for pulse mode, writing a 1 causes the PWM device to emit a single pulse. |
| 8 | PWM0_BYPASS | Bypass PWM | 0: Output PWM device signal; 1: Output base clock |
| 9 | SCLK_CH0_MODE | Select Mode | 0: PWM mode; 1: Pulse mode |
| 14–10 | Not Used | | |
| 18–15 | CH1_PRESCAL | Channel 1 Prescale | These bits must be set before PWM Channel 1 clock is enabled. See Table 12.3. |
| 19 | CH1_EN | Channel 1 Enable | 0: Channel disabled; 1: Channel enabled |
| 20 | CH1_ACT_STA | Channel 1 Polarity | 0: Channel is active low; 1: Channel is active high |
| 21 | SCLK_CH1_GATING | Channel 1 Clock | 0: Clock disabled; 1: Clock enabled |
| 22 | CH1_PUL_START | Start pulse | If configured for pulse mode, writing a 1 causes the PWM device to emit a single pulse. |
| 23 | PWM1_BYPASS | Bypass PWM | 0: Output PWM device signal; 1: Output base clock |
| 24 | SCLK_CH1_MODE | Select Mode | 0: PWM mode; 1: Pulse mode |
| 27–25 | Not Used | | |
| 28 | PWM0_RDY | CH0 Period Ready | 0: PWM0 Period register is ready; 1: PWM0 Period register is busy |
| 29 | PWM1_RDY | CH1 Period Ready | 0: PWM1 Period register is ready; 1: PWM1 Period register is busy |
| 31–30 | Not Used | | |

Before enabling a PWM channel, the period register for that channel should be initialized. The two period registers are each organized as two 16-bit numbers. The upper 16 bits control the total number of clock cycles in one period. In other words, they control the base frequency of the PWM signal. The PWM frequency is calculated as

f = OSC24M / (PSC × N)

where OSC24M is the frequency of the base clock (the default is 24 MHz), PSC is the prescale value set in the channel prescale bits in the PWM control register, and N is the value stored in the upper 16 bits of the channel period register.
The lower 16 bits of the channel period register control the duty cycle. The duty cycle (expressed as % of full on) can be calculated as

duty cycle = (D / N) × 100%

where N is the value stored in the upper 16 bits of the channel period register, and D is the value stored in the lower 16 bits of the channel period register. Note that the condition D ≤ N must always remain true. If the programmer allows D to become greater than N, the results are unpredictable.
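The frequency and duty-cycle calculations can be sketched as two small C helpers (hypothetical names), following the relationships described above:

```c
/* PWM output frequency for the AllWinner device: f = OSC24M / (PSC * N),
   where PSC is the prescale divisor and N the upper 16 bits of the
   channel period register. */
static double pwm_freq_hz(double osc24m, double psc, double n) {
    return osc24m / (psc * n);
}

/* Duty cycle as a percentage: (D / N) * 100, where D is the lower
   16 bits of the channel period register (D <= N). */
static double pwm_duty_pct(double d, double n) {
    return d / n * 100.0;
}
```

For example, with the default 24 MHz base clock, a prescale divisor of 120, and N = 1000, the output frequency is 200 Hz.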
The procedure for configuring the AllWinner A10/A20 PWM device is as follows:
1. Disable the desired channel:
(a) Read the PWM control register into x.
(b) Clear all of the bits in x for the desired PWM channel.
(c) Write x back to the PWM control register
2. Initialize the period register for the desired channel.
(a) Calculate the desired value for N.
(b) Let D = 0.
(c) Let y = N × 2¹⁶ + D.
(d) Write y to the desired channel period register.
3. Set the prescaler.
(a) Select the four-bit code for the desired divisor from Table 12.3.
(b) Set the prescaler code bits in x.
(c) Write x back to the PWM control register.
4. Enable the PWM device.
(a) Set the appropriate bits in x to enable the desired channel, select the polarity, and enable the clock.
(b) Write x to the PWM control register.
Once the control register is configured, the duty cycle can be controlled by calculating a new value for D and then writing y = N × 2¹⁶ + D to the desired channel period register.
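The four-step procedure above can be sketched in C for channel 0, operating on a simulated register block. The offsets are from Table 12.4 and the bit positions from Table 12.5; the function name is hypothetical.

```c
#include <stdint.h>

/* Word offsets within the device block (Table 12.4). */
#define PWMCTL_IDX     (0x200 / 4)
#define CH0_PERIOD_IDX (0x204 / 4)

/* Channel 0 bits in the control register (Table 12.5). */
#define CH0_EN          (1u << 4)
#define CH0_ACT_STA     (1u << 5)
#define SCLK_CH0_GATING (1u << 6)
#define CH0_BITS        0x3FFu      /* all channel 0 bits (9-0) */

/* Configure channel 0: prescale code (Table 12.3), period N, duty D. */
static void pwm_ch0_setup(volatile uint32_t *pwm, uint32_t prescale,
                          uint32_t n, uint32_t d) {
    uint32_t x = pwm[PWMCTL_IDX];
    x &= ~CH0_BITS;                           /* 1. disable channel 0  */
    pwm[PWMCTL_IDX] = x;
    pwm[CH0_PERIOD_IDX] = (n << 16) | d;      /* 2. y = N * 2^16 + D   */
    x |= prescale & 0xFu;                     /* 3. set prescaler code */
    pwm[PWMCTL_IDX] = x;
    x |= CH0_EN | CH0_ACT_STA | SCLK_CH0_GATING;  /* 4. enable         */
    pwm[PWMCTL_IDX] = x;
}
```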
Pulse modulation is a group of methods for generating analog signals using digital equipment, and is commonly used in control systems to regulate the power sent to motors and other devices. Pulse modulation techniques can have very low power loss compared to other methods of controlling analog devices, and the circuitry required is relatively simple.
The cycle frequency must be programmed to match the application. Typically, 10 Hz is adequate for controlling an electric heating element, while 120 Hz would be more appropriate for controlling an incandescent light bulb. Large electric motors may be controlled with a cycle frequency as low as 100 Hz, while smaller motors may need frequencies around 10,000 Hz. It can take some experimentation to find the best frequency for any given application.
12.1 Write ARM assembly programs to configure PWM0 and the GPIO device to send a signal out on Raspberry Pi header pin 12 with:
(a) period of 1 ms and duty cycle of 25%, and
(b) frequency of 150 Hz and duty cycle of 63%.
12.2 Write ARM assembly programs to configure PWM0 and the GPIO device to send a signal out on the pcDuino PWM1/GPIO5 pin with:
(a) period of 1 ms and duty cycle of 25%, and
(b) frequency of 150 Hz and duty cycle of 63%.
This chapter briefly describes some of the devices which are present in most modern computer systems. It then describes in detail the clock management devices on the Raspberry Pi and the pcDuino. Next, it gives an explanation of asynchronous serial communications, and explains how there is some tolerance for mismatch between the clock rates of the transmitter and receiver. It then explains the Universal Asynchronous Receiver/Transmitter (UART) device. Next, it covers in detail the UART devices present on the Raspberry Pi and the pcDuino. Once again, the reader has the opportunity to compare two different devices which perform almost precisely the same functions.
Universal asynchronous receiver/transmitter (UART); Clock manager; Serial communications; RS232
There are some classes of devices that are found in almost every system, including the smallest embedded systems. Such common devices include hardware for managing the clock signals sent to other devices, and serial communications (typically RS232). Most mid-sized or large systems also include devices for managing virtual memory, managing the cache, driving a display, interfacing with keyboard and mouse, accessing disk and other storage devices, and networking. Small embedded systems may have devices for converting analog signals to digital and vice versa, pulse width modulation, and other purposes. Some systems, such as the Raspberry Pi and pcDuino, have all or most of the devices of large systems, as well as most of the devices found on embedded systems. In this chapter, we look at two devices found on almost every system.
Very simple computer systems can be driven by a single clock. Most devices, including the CPU, are designed as state machines. The clock device sends a square-wave signal at a fixed frequency to all devices that need it. The clock signal tells the devices when to transition to the next state. Without the clock signal, none of the devices would do anything.
More complex computers may contain devices which need to run at different rates. This requires the system to have separate clock signals for each device (or group of devices). System designers often solve this problem by adding a clock manager device to the system. This device allows the programmer to configure the clock signals that are sent to the other devices in the system. Fig. 13.1 shows a typical system. The clock manager, just like any other device, is configured by the CPU writing data to its registers using the system bus.

The BCM2835 system-on-chip contains an ARM CPU and several devices. Some of the devices need their own clock to drive their operation at the correct frequency. Some devices, such as serial communications receivers and transmitters, need configurable clocks so that the programmer has control over the speed of the device. To provide this flexibility and allow the programmer to have control over the clocks for each device, the BCM2835 includes a clock manager device, which can be used to configure the clock signals driving the other devices in the system.
The Raspberry Pi has a 19.2 MHz oscillator which can be used as a base frequency for any of the clocks. The BCM2835 also has three phase-locked-loop circuits that boost the oscillator to higher frequencies. Table 13.1 shows the frequencies that are available from various sources. Each device clock can be driven by one of the PLLs, the external 19.2 MHz oscillator, a signal from the HDMI port, or either of two test/debug inputs.
Table 13.1
Clock sources available for the clocks provided by the clock manager
| Number | Name | Frequency | Note |
| 0 | GND | 0 Hz | Clock is stopped |
| 1 | oscillator | 19.2 MHz | |
| 2 | testdebug0 | Unknown | Used for system testing |
| 3 | testdebug1 | Unknown | Used for system testing |
| 4 | PLLA | 650 MHz | May not be available |
| 5 | PLLC | 200 MHz | May not be available |
| 6 | PLLD | 500 MHz | |
| 7 | HDMI auxiliary | Unknown | |
| 8–15 | GND | 0 Hz | Clock is stopped |

Among the clocks controlled by the clock manager device are the core clock (CM_VPU), the system timer clock (PM_TIME) which controls the speed of the system timer, the GPIO clocks which are documented in the Raspberry Pi peripheral documentation, the pulse modulator device clocks, and the serial communications clocks. It is generally not a good idea to modify the settings of any of the clocks without good reason.
The base address of the clock manager device is 20101000₁₆. Some of the clock manager registers are shown in Table 13.2. Each clock is managed by two registers: a control register and a divisor. The control register is used to enable or disable a clock, to select which source oscillator drives the clock, and to select an optional multistage noise shaping (MASH) filter level. MASH filtering is useful for reducing the perceived noise when a clock is being used to generate an audio signal. In most cases, MASH filtering should not be used.
Table 13.2
Some registers in the clock manager device
| Offset | Name | Description |
| 070₁₆ | CM_GP0_CTL | GPIO Clock 0 (GPCLK0) Control |
| 074₁₆ | CM_GP0_DIV | GPIO Clock 0 (GPCLK0) Divisor |
| 078₁₆ | CM_GP1_CTL | GPIO Clock 1 (GPCLK1) Control |
| 07c₁₆ | CM_GP1_DIV | GPIO Clock 1 (GPCLK1) Divisor |
| 080₁₆ | CM_GP2_CTL | GPIO Clock 2 (GPCLK2) Control |
| 084₁₆ | CM_GP2_DIV | GPIO Clock 2 (GPCLK2) Divisor |
| 098₁₆ | CM_PCM_CTL | Pulse Code Modulator Clock (PCM_CLK) Control |
| 09c₁₆ | CM_PCM_DIV | Pulse Code Modulator Clock (PCM_CLK) Divisor |
| 0a0₁₆ | CM_PWM_CTL | Pulse Modulator Device Clock (PWM_CLK) Control |
| 0a4₁₆ | CM_PWM_DIV | Pulse Modulator Device Clock (PWM_CLK) Divisor |
| 0f0₁₆ | CM_UART_CTL | Serial Communications Clock (UART_CLK) Control |
| 0f4₁₆ | CM_UART_DIV | Serial Communications Clock (UART_CLK) Divisor |
Table 13.3 shows the meaning of the bits in the control registers for each of the clocks, and Table 13.4 shows the fields in the clock manager divisor registers. The procedure for configuring one of the clocks is:
Table 13.3
Bit fields in the clock manager control registers
| Bit | Name | Description |
| 3–0 | SRC | Clock source chosen from Table 13.1 |
| 4 | ENAB | Writing a 0 causes the clock to shut down. The clock will not stop immediately. The BUSY bit will be 1 while the clock is shutting down. When the BUSY bit becomes 0, the clock has stopped and it is safe to reconfigure it. Writing a 1 to this bit causes the clock to start |
| 5 | KILL | Writing a 1 to this bit will stop and reset the clock. This does not shut down the clock cleanly, and could cause a glitch in the clock output |
| 6 | - | Unused |
| 7 | BUSY | A 1 in this bit indicates that the clock is running |
| 8 | FLIP | Writing a 1 to this bit will invert the clock output. Do not change this bit while the clock is running |
| 10–9 | MASH | Controls how the clock source is divided. 00: integer division, 01: 1-stage MASH division, 10: 2-stage MASH division, 11: 3-stage MASH division. Do not change this while the clock is running |
| 23–11 | – | Unused |
| 31–24 | PASSWD | This field must be set to 5A₁₆ every time the clock control register is written to |

Table 13.4
Bit fields in the clock manager divisor registers
| Bit | Name | Description |
| 11–0 | DIVF | Fractional part of divisor. Do not change this while the clock is running |
| 23–12 | DIVI | Integer part of divisor. Do not change this while the clock is running |
| 31–24 | PASSWD | This field must be set to 5A₁₆ every time the clock divisor register is written to |
1. Read the desired clock control register.
2. Clear bit 4 in the word that was read, then OR it with 5A000000₁₆ and store the result back to the desired clock control register.
3. Repeatedly read the desired clock control register, until bit 7 becomes 0.
4. Calculate the divisor required and store it into the desired clock divisor register.
5. Create a word to configure and start the clock. Begin with 5A000000₁₆, and set bits 3–0 to select the desired clock source. Set bits 10–9 to select the type of division, and set bit 4 to 1 to enable the clock.
6. Store the control word into the desired clock control register.
Selection of the divisor depends on which clock source is used, what type of division is selected, and the desired output frequency of the clock being configured. For example, to set the PWM clock to 100 kHz, the 19.2 MHz oscillator can be used. Dividing that clock by 192 will produce a 100 kHz clock. To accomplish this, it is necessary to stop the PWM clock as described, store the value 5A0C0000₁₆ in the PWM clock divisor register, and then start the clock by writing 5A000011₁₆ into the PWM clock control register.
The AllWinner A10/A20 SOCs have a relatively simple clock manager, which is referred to as the Clock Control Unit (CCU). All of the clock signals in the system are derived from two crystal oscillators: the main oscillator, which runs at 24 MHz, and the real-time-clock oscillator, which runs at 32,768 Hz. The real-time-clock oscillator is used only to provide a signal to the real-time-clock device.
The main clock oscillator drives many of the devices in the system, but there are seven phase-locked-loop circuits in the CCU which provide signals for devices that need clocks faster or slower than 24 MHz. Table 13.5 shows which devices are driven by some of these clock signals.
Table 13.5
Clock signals in the AllWinner A10/A20 SOC
| Clock Domain | Modules | Frequency | Description |
| OSC24M | Most modules | 24 MHz | Main clock |
| CPU32_clk | CPU | 2 kHz–1.2 GHz | Drives CPU |
| AHB_clk | AHB devices | 8 kHz–276 MHz | Drives some devices |
| APB_clk | Peripheral bus | 500 Hz–138 MHz | Drives some devices |
| SDRAM_clk | SDRAM | 0 Hz–400 MHz | Drives SDRAM memory |
| USB_clk | USB | 480 MHz | Drives USB devices |

There are basically two methods for transferring data between two digital devices: parallel and serial. Parallel connections use multiple wires to carry several bits at one time, typically including extra wires to carry timing information. Parallel communications are used for transferring large amounts of data over very short distances. However, this approach becomes very expensive when data must be transferred more than a few meters. Serial, on the other hand, uses a single wire to transfer the data bits one at a time. When compared to parallel transfer, the speed of serial transfer typically suffers. However, because it uses significantly fewer wires, the distance may be greatly extended, reliability improved, and cost vastly reduced.
One of the oldest and most common devices for communications between computers and peripheral devices is the Universal Asynchronous Receiver/Transmitter, or UART. The word “universal” indicates that the device is highly configurable and flexible. UARTs allow a receiver and transmitter to communicate without a synchronizing signal.
The logic signal produced by the digital UART typically oscillates between zero volts for a low level and five volts for a high level, and the amount of current that the UART can supply is limited. For transmitting the data over long distances, the signals may go through a level-shifting or amplification stage. The circuit used to accomplish this is typically called a line driver. This circuit boosts the signal provided by the UART and also protects the delicate digital outputs from short circuits and signal spikes. Various standards, such as RS-232, RS-422, and RS-485 define the voltages that the line driver uses. For example, the RS-232 standard specifies that valid signals are in the range of + 3 to + 15 V, or − 3 to − 15 V. The standards also specify the maximum time that is allowable when shifting from a high signal to a low signal and vice versa, the amount of current that the device must be capable of sourcing and sinking, and other relevant design criteria.
The UART transmits data by sending each bit sequentially. The receiving UART re-assembles the bits into the original data. Fig. 13.2 shows how the transmitting UART converts a byte of data into a serial signal, and how the receiving UART samples the signal to recover the original data. Serialization of the transmitted data and reassembly of the received data are accomplished using shift registers. The receiver and transmitter each have their own clocks, and are configured so that the clocks run at the same speed (or close to the same speed). In the figure, the receiver's clock is running slightly slower than the transmitter's clock, but the data are still received correctly.

To transfer a group of bits, called a data frame, the transmitter first sends a start bit. Most UARTs can be configured to transfer between four and eight data bits in each group. The transmitting and receiving UARTs must be configured to use the same number of data bits. After each group of data bits, the transmitter will return the signal to the low state and keep it there for some minimum period. This period is usually the time that it would take to send two bits of data, and is referred to as two stop bits. The stop bits allow the receiver some time to process the received byte and prepare for the next start bit. Fig. 13.2A shows what a typical RS-232 signal would look like when transferring the value 56₁₆ (the ASCII "V" character). The UART enters the idle state only if there is not another byte immediately ready to send. If the transmitter has another byte to send, then the start bit can begin at the end of the second stop bit.
Note that it is impossible to ensure that the receiver and transmitter have clocks which are running at exactly the same speed, unless they use the same clock signal. Fig. 13.2B shows how the receiver can reassemble the original data, even with a slightly different clock rate. When the start bit is detected by the receiver, it prepares to receive the data bits, which will be sent by the transmitter at an expected rate (within some tolerance). The receive circuitry of most UARTs is driven by a clock that runs 16 times as fast as the baud rate. The receive circuitry uses its faster clock to latch each bit in the middle of its expected time period. In Fig. 13.2B, the receiver clock is running slower than the transmitter clock. By the end of the data frame, the sample time is very far from the center of the bit, but the correct value is received. If the clocks differed by much more, or if more than eight data bits were sent, then it is very likely that incorrect data would be received. Thus, as long as their clocks are synchronized within some tolerance (which depends on the number of data bits and the baud rate), the data will be received correctly.
The RS-232 standard allows point-to-point communication between two devices for limited distances. With the RS-232 standard, simple one-way communications can be accomplished using only two wires: one to carry the serial bits, and another to provide a common ground. For bi-directional communication, three wires are required. In addition, the RS-232 standard specifies optional hand-shaking signals, which the UARTs can use to signal their readiness to transmit or receive data. The RS-422 and RS-485 standards allow multiple devices to be connected using only two wires.
The first UART device to enjoy widespread use was the 8250. The original version had 12 registers for configuration, sending, and receiving data. The most important registers are the ones that allow the programmer to set the transmit and receive bit rates, or baud. One baud is one bit per second. The baud is set by storing a 16-bit divisor in two of the registers in the UART. The chip is driven by an external clock, and the divisor is used to reduce the frequency of the external clock to a frequency that is appropriate for serial communication. For example, if the external clock runs at 1 MHz, and the required baud is 1200, then the divisor must be 1,000,000 ÷ 1,200 ≈ 833. Note that the divisor can only be an integer, so the device cannot achieve exactly 1200 baud; a divisor of 833 actually yields 1,000,000 ÷ 833 ≈ 1200.48 baud. However, as explained previously, the sending and receiving devices do not have to agree precisely on the baud. During the transmission and reception of a byte, 1200.48 baud is close enough that the bits will be received correctly even if the other end is running slightly below 1200 baud. In the 8250, there was only one 8-bit register for sending data and only one 8-bit register for receiving data. The UART could send an interrupt to the CPU after each byte was transmitted or received. When receiving, the CPU had to respond to the interrupt very quickly. If the current byte was not read quickly enough by the CPU, it would be overwritten by the subsequent incoming byte. When transmitting, the CPU needed to respond quickly to interrupts to provide the next byte to be sent, or the transmission rate would suffer.
The next generation of UART device was the 16550A. This device is the model for most UART devices today. It features 16-byte input and output buffers and the ability to trigger interrupts when a buffer is partially full or partially empty. This allows the CPU to move several bytes of data at a time and results in much lower CPU overhead and much higher data transmission and reception rates. The 16550A also supports much higher baud rates than the 8250.
The BCM2835 system-on-chip provides two UART devices: UART0 and UART1. UART1 is an auxiliary "mini UART" and is not recommended for use as a UART. UART0 is a PL011 UART, which is based on the industry-standard 16550A UART. The major differences are that the PL011 allows greater flexibility in configuring the interrupt trigger levels, the registers appear in different locations, and the locations of bits in some of the registers are different. So, although it operates very much like a 16550A, software written for a 16550A cannot be used without modification. The transmit and receive lines can be routed through GPIO pin 14 and GPIO pin 15, respectively. UART0 has 18 registers, starting at its base address of 20201000₁₆. Table 13.6 shows the name, location, and a brief description for each of the registers.
Table 13.6
Raspberry Pi UART0 register map
| Offset | Name | Description |
| 00₁₆ | UART_DR | Data Register |
| 04₁₆ | UART_RSRECR | Receive Status Register/Error Clear Register |
| 18₁₆ | UART_FR | Flag Register |
| 20₁₆ | UART_ILPR | Not in use |
| 24₁₆ | UART_IBRD | Integer Baud Rate Divisor |
| 28₁₆ | UART_FBRD | Fractional Baud Rate Divisor |
| 2c₁₆ | UART_LCRH | Line Control Register |
| 30₁₆ | UART_CR | Control Register |
| 34₁₆ | UART_IFLS | Interrupt FIFO Level Select Register |
| 38₁₆ | UART_IMSC | Interrupt Mask Set Clear Register |
| 3c₁₆ | UART_RIS | Raw Interrupt Status Register |
| 40₁₆ | UART_MIS | Masked Interrupt Status Register |
| 44₁₆ | UART_ICR | Interrupt Clear Register |
| 48₁₆ | UART_DMACR | DMA Control Register |
| 80₁₆ | UART_ITCR | Test Control Register |
| 84₁₆ | UART_ITIP | Integration Test Input Register |
| 88₁₆ | UART_ITOP | Integration Test Output Register |
| 8c₁₆ | UART_TDR | Test Data Register |
UART_DR: The UART Data Register is used to send and receive data, one byte at a time. Writing to this register adds a byte to the transmit FIFO. Although the register is 32 bits, only the 8 least significant bits are used for transmission, and the 12 least significant bits are used for reception. If the FIFO is empty, then the UART will begin transmitting the byte immediately. If the FIFO is full, then the last byte in the FIFO will be overwritten with the new byte that is written to the Data Register. When this register is read, it returns the byte at the top of the receive FIFO, along with four additional status bits that indicate whether any errors were encountered. Table 13.7 specifies the names and use of the bits in the UART Data Register.
Table 13.7
Raspberry Pi UART data register
| Bit | Name | Description | Values |
| 7–0 | DATA | Data | Read: Received data byte. Write: Data byte to transmit |
| 8 | FE | Framing error | 1: The received character did not have a valid stop bit |
| 9 | PE | Parity error | 1: The received character did not have the correct parity, as set in the EPS and SPS bits of the Line Control Register (UART_LCRH) |
| 10 | BE | Break error | 1: A break condition was detected. The data input line was held low for longer than the time it would take to receive a complete byte, including the start and stop bits |
| 11 | OE | Overrun error | 1: Data was not read quickly enough, and one or more bytes were overwritten in the input buffer |
| 31–12 | - | Not used | Write as zero, read as don't care |

UART_RSRECR: The UART Receive Status Register/Error Clear Register is used to check the status of the byte most recently read from the UART Data Register, and to check for overrun conditions at any time. The status information for overrun is set immediately when an overrun condition occurs. The Receive Status Register/Error Clear Register provides the same four status bits as the Data Register (but in bits 3–0 rather than bits 11–8). The received data character must be read first from the Data Register, before reading the error status associated with that data character from the RSRECR register. Since the Data Register also contains these 4 bits, this register may not be required, depending on how the software is written. Table 13.8 describes the bits in this register.
Table 13.8
Raspberry Pi UART receive status register/error clear register
| Bit | Name | Description | Values |
| 0 | FE | Framing error | 1: The received character did not have a valid stop bit |
| 1 | PE | Parity error | 1: The received character did not have the correct parity, as set in the EPS and SPS bits of the Line Control Register (UART_LCRH) |
| 2 | BE | Break error | 1: A break condition was detected. The data input line was held low for longer than the time it would take to receive a complete byte, including the start and stop bits |
| 3 | OE | Overrun error | 1: Data was not read quickly enough, and one or more bytes were overwritten in the input buffer |
| 31–4 | - | Not used | Write as zero, read as don't care |

UART_FR: The UART Flag Register can be read to determine the status of the UART. The bits in this register are used mainly when sending and receiving data using the FIFOs. When several bytes need to be sent, the TXFF flag should be checked to ensure that the transmit FIFO is not full before each byte is written to the data register. When receiving data, the RXFE bit can be used to determine whether or not there is more data to be read from the FIFO. Table 13.9 describes the flags in this register.
Table 13.9
Raspberry Pi UART flags register bits

UART_ILPR: This is the IrDA register, which is supported by some PL011 UARTs. IrDA stands for the Infrared Data Association, which is a group of companies that cooperate to provide specifications for a complete set of protocols for wireless infrared communications. The name “IrDA” also refers to that set of protocols. IrDA is not implemented on the Raspberry Pi UART. Writing to this register has no effect and reading returns 0.
UART_IBRD and UART_FBRD: UART_FBRD is the fractional part of the baud rate divisor value, and UART_IBRD is the integer part. The baud rate divisor is calculated as follows:

BAUDDIV = UARTCLK ÷ (16 × Baud Rate)  (13.1)

where UARTCLK is the frequency of the UART_CLK that is configured in the Clock Manager device. The default value is 3 MHz. BAUDDIV is stored in two registers. UART_IBRD holds the integer part and UART_FBRD holds the fractional part. Thus BAUDDIV should be calculated as a U(16,6) fixed point number. The contents of the UART_IBRD and UART_FBRD registers may be written at any time, but the change will not have any effect until transmission or reception of the current character is complete. Table 13.10 shows the arrangement of the integer baud rate divisor register, and Table 13.11 shows the arrangement of the fractional baud rate divisor register.
Table 13.10
Raspberry Pi UART integer baud rate divisor
| Bit | Name | Description | Values |
| 15–0 | IBRD | Integer Baud Rate Divisor | See Eq. (13.1) |
| 31–16 | - | Not used | Write as zero, read as don't care |

Table 13.11
Raspberry Pi UART fractional baud rate divisor
| Bit | Name | Description | Values |
| 5–0 | FBRD | Fractional Baud Rate Divisor | See Eq. (13.1) |
| 31–6 | - | Not used | Write as zero, read as don't care |

UART_LCRH: UART_LCRH is the line control register. It is used to configure the communication parameters. This register must not be changed while the UART is enabled; first disable the UART by writing zero to bit 0 of UART_CR, then wait until the BUSY flag in UART_FR is clear. Table 13.12 shows the layout of the line control register.
Table 13.12
Raspberry Pi UART line control register bits

UART_CR: The UART Control Register is used for configuring, enabling, and disabling the UART. Table 13.13 shows the layout of the control register. To enable transmission, the TXE bit and UARTEN bit must be set to 1. To enable reception, the RXE bit and UARTEN bit must be set to 1. In general, the following steps should be used to configure or re-configure the UART:
Table 13.13
Raspberry Pi UART control register bits
| Bit | Name | Description | Values |
| 0 | UARTEN | UART Enable | 0: UART disabled. 1: UART enabled |
| 1 | SIREN | Not used | Write as zero, read as don't care |
| 2 | SIRLP | Not used | Write as zero, read as don't care |
| 6–3 | - | Not used | Write as zero, read as don't care |
| 7 | LBE | Loopback Enable | 0: Loopback disabled. 1: Loopback enabled. Transmitted data is also fed back to the receiver |
| 8 | TXE | Transmit enable | 0: Transmitter disabled. 1: Transmitter enabled |
| 9 | RXE | Receive enable | 0: Receiver disabled. 1: Receiver enabled |
| 10 | DTR | Not used | Write as zero, read as don't care |
| 11 | RTS | Complement of nUARTRTS | |
| 12 | OUT1 | Not used | Write as zero, read as don't care |
| 13 | OUT2 | Not used | Write as zero, read as don't care |
| 14 | RTSEN | RTS Enable | 0: Hardware RTS disabled. 1: Hardware RTS enabled |
| 15 | CTSEN | CTS Enable | 0: Hardware CTS disabled. 1: Hardware CTS enabled |
| 31–16 | - | Not used | Write as zero, read as don't care |

(a) Disable the UART by writing 0 to the UARTEN bit of the Control Register.
(b) Wait for the end of transmission or reception of the current character.
(c) Flush the transmit FIFO by setting the FEN bit to 0 in the Line Control Register.
(d) Reprogram the Control Register.
(e) Enable the UART.
Interrupt Control: The UART can signal the CPU by asserting an interrupt when certain conditions occur. This will be covered in more detail in Chapter 14. For now, it is enough to know that there are five additional registers which are used to configure and use the interrupt mechanism.
UART_IFLS defines the FIFO level that triggers the assertion of the interrupt signal. One interrupt is generated when the FIFO reaches the specified level. The CPU must clear the interrupt before another can be generated.
UART_IMSC is the interrupt mask set/clear register. It is used to enable or disable specific interrupts. This register determines which of the possible interrupt conditions are allowed to generate an interrupt to the CPU.
UART_RIS is the raw interrupt status register. It can be read to obtain the raw status of the interrupt conditions before any masking is performed.
UART_MIS is the masked interrupt status register. It contains the masked status of the interrupts. This is the register that the operating system should use to determine the cause of a UART interrupt.
UART_ICR is the interrupt clear register. Writing to it clears the interrupt conditions. The operating system should use this register to clear interrupts before returning from the interrupt service routine.
UART_DMACR: The DMA control register is used to configure the UART to access memory directly, so that the CPU does not have to move each byte of data to or from the UART. DMA will be explained in more detail in Chapter 14.
Additional Registers: The remaining registers, UART_ITCR, UART_ITIP, and UART_ITOP, are either unimplemented or are used for testing the UART. These registers should not be used.
Listing 13.1 shows four basic functions for initializing the UART, changing the baud rate, sending a character, and receiving a character using UART0 on the Raspberry Pi. Note that a large part of the code simply defines the location and offset for all of the registers (and bits) that can be used to control the UART.





The AllWinner A10/A20 SOC includes eight UART devices. They are all fully compatible with the 16550A UART, and also provide some enhancements. All of them provide transmit (TX) and receive (RX) signals. UART0 has the full set of RS232 signals, including RTS, CTS, DTR, DSR, DCD, and RING. UART1 has the RTS and CTS signals. The remaining six UARTs only provide the TX and RX signals. They can all be configured for serial IrDA. Table 13.14 shows the base address for each of the eight UART devices.
Table 13.14
pcDuino UART addresses
| Name | Address |
| UART0 | 0x01C28000 |
| UART1 | 0x01C28400 |
| UART2 | 0x01C28800 |
| UART3 | 0x01C28C00 |
| UART4 | 0x01C29000 |
| UART5 | 0x01C29400 |
| UART6 | 0x01C29800 |
| UART7 | 0x01C29C00 |
When the 16550 UART was designed, 8-bit processors were common, and most of them provided only 16 address bits. Memory was typically limited to 64 kB, and every byte of address space was important. Because of these considerations, the designers of the 16550 decided to limit the number of addresses used to 8, and to only use eight bits of data per address. There are 10 registers in the 16550 UART, but some of them share the same address. For example, there are three registers mapped to an offset address of zero, two registers mapped at offset four, and two registers mapped at offset eight. Bit seven in the Line Control Register is used to determine which of the registers is active for a given address.
Because they are meant to be fully backwards-compatible with the 16550, the AllWinner A10/A20 SOC UART devices also use only 8 bits for each register, and the first 12 registers correspond exactly with the 16550 UART. The only differences are that the pcDuino uses word addresses rather than byte addresses, and they provide four additional registers that are used for IrDA mode. Table 13.15 shows the arrangement of the registers in each of the 8 UARTs on the pcDuino. The following sections will explain the registers.
Table 13.15
pcDuino UART register offsets
| Register Name | Offset | Description |
| UART_RBR | 0x00 | UART Receive Buffer Register |
| UART_THR | 0x00 | UART Transmit Holding Register |
| UART_DLL | 0x00 | UART Divisor Latch Low Register |
| UART_DLH | 0x04 | UART Divisor Latch High Register |
| UART_IER | 0x04 | UART Interrupt Enable Register |
| UART_IIR | 0x08 | UART Interrupt Identity Register |
| UART_FCR | 0x08 | UART FIFO Control Register |
| UART_LCR | 0x0C | UART Line Control Register |
| UART_MCR | 0x10 | UART Modem Control Register |
| UART_LSR | 0x14 | UART Line Status Register |
| UART_MSR | 0x18 | UART Modem Status Register |
| UART_SCH | 0x1C | UART Scratch Register |
| UART_USR | 0x7C | UART Status Register |
| UART_TFL | 0x80 | UART Transmit FIFO Level |
| UART_RFL | 0x84 | UART Receive FIFO Level |
| UART_HALT | 0xA4 | UART Halt TX Register |
The baud rate is set using a 16-bit Baud Rate Divisor, according to the following equation:

BAUDDIV = sclk ÷ (16 × Baud Rate)

where sclk is the frequency of the UART serial clock, which is configured by the Clock Manager device. The default frequency of the clock is 24 MHz. BAUDDIV is stored in two registers. UART_DLL holds the least significant 8 bits, and UART_DLH holds the most significant 8 bits. Thus BAUDDIV should be calculated as a 16-bit unsigned integer. Note that for high baud rates, it may not be possible to get exactly the rate desired. For example, a baud rate of 115200 would require a divisor of 24,000,000 ÷ (16 × 115,200) ≈ 13.02. Since the baud rate divisor can only be given as an integer, the divisor must be 13, so the true baud rate will be 24,000,000 ÷ (16 × 13) ≈ 115,385 baud, or about 0.16% faster than desired. Although slightly fast, it is well within the tolerance for RS232 communication.
UART_RBR: The UART Receive Buffer Register is used to receive data, 1 byte at a time. If the receive FIFO is enabled, then as the UART receives data, it places the data into a receive FIFO. Reading from this address removes 1 byte from the receive FIFO. If the FIFO becomes full and another data byte arrives, then the new data are lost and an overrun error occurs. Table 13.16 shows the layout of the receive buffer register.
Table 13.16
pcDuino UART receive buffer register
| Bit | Name | Description | Values |
| 7–0 | RBR | Data | Read only: One byte of received data. Bit 7 of LCR must be zero. |
| 31–8 | Unused |

UART_THR: Writing to the Transmit Holding Register will cause that byte to be transmitted by the UART. If the transmit FIFO is enabled, then the byte will be added to the end of the transmit FIFO. If the FIFO is empty, then the UART will begin transmitting the byte immediately. If the FIFO is full, then the new data byte will be lost. Table 13.17 shows the layout of the transmit holding register.
Table 13.17
pcDuino UART transmit holding register
| Bit | Name | Description | Values |
| 7–0 | THR | Data | Write only: One byte of data to transmit. Bit 7 of LCR must be zero. |
| 31–8 | Unused |

UART_DLL: The UART Divisor Latch Low register is used to set the least significant byte of the baud rate divisor. When bit 7 of the Line Control Register is set to one, writing to this address will access the DLL register. If bit 7 of the Line Control Register is set to zero, then writing to this address will access the transmit holding register. Table 13.18 shows the layout of the UART_DLL register.
Table 13.18
pcDuino UART divisor latch low register
| Bit | Name | Description | Values |
| 7–0 | DLL | Data | Write only: Least significant eight bits of the Baud Rate Divisor. Bit 7 of LCR must be one. |
| 31–8 | Unused |

UART_DLH: The UART Divisor Latch High register is used to set the most significant byte of the baud rate divisor. When bit 7 of the Line Control Register is set to one, writing to this address will access the DLH register. If bit 7 of the Line Control Register is set to zero, then writing to this address will access the Interrupt Enable Register rather than the Divisor Latch High register. Table 13.19 shows the layout of the UART_DLH register.
If the two Divisor Latch Registers (DLL and DLH) are set to zero, the baud clock is disabled and no serial communications occur. DLH should be set before DLL, and at least eight clock cycles of the UART clock should be allowed to pass before data are transmitted or received.
Table 13.19
pcDuino UART divisor latch high register
| Bit | Name | Description | Values |
| 7–0 | DLH | Data | Write only: Most significant eight bits of the Baud Rate Divisor. Bit 7 of LCR must be one. |
| 31–8 | Unused |

UART_FCR: The UART FIFO Control Register is used to enable or disable the receive and transmit FIFOs (buffers), flush their contents, set the level at which the transmit and receive FIFOs trigger an interrupt, and control Direct Memory Access (DMA). Table 13.20 shows the layout of the UART_FCR register.
Table 13.20
pcDuino UART FIFO control register

UART_LCR: The Line Control Register is used to control the parity, number of data bits, and number of stop bits for the serial port. Bit 7 also controls which registers are mapped at offsets 0, 4, and 8 from the device base address. Table 13.21 shows the layout of the UART_LCR register.
Table 13.21
pcDuino UART line control register

UART_LSR: The Line Status Register is used to read status information from the UART. Table 13.22 shows the layout of the UART_LSR register.
Table 13.22
pcDuino UART line status register
| Bit | Name | Description |
| 0 | DR | When the Data Ready bit is set to 1, it indicates that at least one byte is ready to be read from the receive FIFO or RBR. |
| 1 | OE | When the Overrun Error bit is set to 1, it indicates that an overrun error occurred for the byte at the top of the receive FIFO. |
| 2 | PE | When the Parity Error bit is set to 1, it indicates that a parity error occurred for the byte at the top of the receive FIFO. |
| 3 | FE | When the Framing Error bit is set to 1, it indicates that a framing error occurred for the byte at the top of the receive FIFO. |
| 4 | BI | When the Break Interrupt bit is set to 1, it indicates that a break has been received. |
| 5 | THRE | When the Transmit Holding Register Empty bit is 1, it indicates that there are no bytes waiting to be transmitted, but there may be a byte currently being transmitted. |
| 6 | TEMT | When the Transmitter Empty bit is 1, it indicates that there are no bytes waiting to be transmitted and no byte currently being transmitted. |
| 7 | FIFOERR | When this bit is 1, an error has occurred (PE, FE, or BI) in the receive FIFO. This bit is cleared when the Line Status Register is read. |
| 31–8 | Unused |
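The LSR bits in Table 13.22 translate directly into C bit masks. The following sketch defines them and two predicates that driver code typically needs: whether a byte is ready to read, and whether any receive error is pending. The names are illustrative.

```c
#include <stdint.h>

/* Bit masks for the Line Status Register (Table 13.22). */
#define LSR_DR      (1u << 0)  /* Data Ready */
#define LSR_OE      (1u << 1)  /* Overrun Error */
#define LSR_PE      (1u << 2)  /* Parity Error */
#define LSR_FE      (1u << 3)  /* Framing Error */
#define LSR_BI      (1u << 4)  /* Break Interrupt */
#define LSR_THRE    (1u << 5)  /* Transmit Holding Register Empty */
#define LSR_TEMT    (1u << 6)  /* Transmitter Empty */
#define LSR_FIFOERR (1u << 7)  /* Error somewhere in the receive FIFO */

/* At least one received byte is waiting to be read. */
static int lsr_data_ready(uint32_t lsr)
{
    return (lsr & LSR_DR) != 0;
}

/* Any receive error condition for the byte at the top of the FIFO. */
static int lsr_rx_error(uint32_t lsr)
{
    return (lsr & (LSR_OE | LSR_PE | LSR_FE | LSR_BI)) != 0;
}
```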
UART_USR: The UART Status Register is used to read information about the status of the transmit and receive FIFOs, and the current state of the receiver and transmitter. Table 13.23 shows the layout of the UART_USR register. This register contains essentially the same information as the status register in the Raspberry Pi UART.
Table 13.23
pcDuino UART status register
| Bit | Name | Description |
| 0 | BUSY | When the Busy bit is 1, it indicates that the UART is currently busy. When it is 0, the UART is idle or inactive. |
| 1 | TFNF | When the Transmit FIFO Not Full bit is 1, it indicates that at least one more byte can be safely written to the Transmit FIFO. |
| 2 | TFE | When the Transmit FIFO Empty bit is 1, it indicates that there are no bytes remaining in the transmit FIFO. |
| 3 | RFNE | When the Receive FIFO Not Empty bit is 1, it indicates that at least one more byte is waiting to be read from the receive FIFO. |
| 4 | RFF | When the Receive FIFO Full bit is 1, it indicates that there is no more room in the receive FIFO. If data is not read before the next character is received, an overrun error will occur. |
| 31–5 | Unused |
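A typical use of the USR bits in Table 13.23 is a polled transmit routine: busy-wait until TFNF indicates room in the transmit FIFO, then write the byte. In this sketch the register accesses are passed in as function pointers so the logic can be exercised off-target; on real hardware they would be volatile reads and writes at fixed offsets from the UART base address. All names are illustrative.

```c
#include <stdint.h>

/* Bit masks for the UART Status Register (Table 13.23). */
#define USR_BUSY (1u << 0)  /* UART is busy */
#define USR_TFNF (1u << 1)  /* Transmit FIFO Not Full */
#define USR_TFE  (1u << 2)  /* Transmit FIFO Empty */
#define USR_RFNE (1u << 3)  /* Receive FIFO Not Empty */
#define USR_RFF  (1u << 4)  /* Receive FIFO Full */

typedef uint32_t (*reg_read_fn)(void);
typedef void (*reg_write_fn)(uint32_t);

/* Polled transmit: wait for room in the FIFO, then write one byte. */
static void uart_put_byte(reg_read_fn read_usr, reg_write_fn write_thr,
                          uint8_t byte)
{
    while ((read_usr() & USR_TFNF) == 0)
        ;               /* spin until the transmit FIFO has room */
    write_thr(byte);
}
```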
UART_TFL: The UART Transmit FIFO Level register allows the programmer to determine exactly how many bytes are currently in the transmit FIFO. Table 13.24 shows the layout of the UART_TFL register.
Table 13.24
pcDuino UART transmit FIFO level register
| Bit | Name | Description |
| 6–0 | TFL | The Transmit FIFO level field contains an integer which indicates the number of bytes currently in the transmit FIFO. |
| 31–7 | Unused |
UART_RFL: The UART Receive FIFO Level register allows the programmer to determine exactly how many bytes are currently in the receive FIFO. Table 13.25 shows the layout of the UART_RFL register.
Table 13.25
pcDuino UART receive FIFO level register
| Bit | Name | Description |
| 6–0 | RFL | The Receive FIFO level field contains an integer which indicates the number of bytes currently in the receive FIFO. |
| 31–7 | Unused |
UART_HALT: The UART transmit halt register is used to halt the UART so that it can be reconfigured. After the configuration is performed, it is then used to signal the UART to restart with the new settings. It can also be used to invert the receive and transmit polarity. Table 13.26 shows the layout of the UART_HALT register.
Table 13.26
pcDuino UART transmit halt register
| Bit | Name | Description |
| 0 | Unused | |
| 1 | CHCFG_AT_BUSY | Setting this bit to 1 causes the UART to allow changing the Line Control Register (except the DLAB bit) and allows setting the baud rate even when the UART is busy. When this bit is set to 0, changes can only occur when the BUSY bit in the UART Status Register is 0. |
| 2 | CHANGE_UPDATE | After writing 1 to CHCFG_AT_BUSY and performing the configuration, 1 should be written to this bit to signal that the UART should re-start with the new configuration. This bit will stay at 1 while the new configuration is loaded, and go back to 0 when the re-start is complete. |
| 3 | Unused | |
| 4 | SIR_TX_INVERT | This bit allows the polarity of the transmitter to be inverted. 1: Polarity inverted |
| 5 | SIR_RX_INVERT | This bit allows the polarity of the receiver to be inverted. 1: Polarity inverted |
| 31–6 | Unused |

Interrupt Control: The UART can signal the CPU by asserting an interrupt when certain conditions occur. This will be covered in more detail in Chapter 14. For now, it is enough to know that there are five additional registers which are used to configure and use the interrupt mechanism.
UART_IFLS defines the FIFO level that triggers the assertion of the interrupt signal. One interrupt is generated when the FIFO reaches the specified level. The CPU must clear the interrupt before another can be generated.
UART_IER is the interrupt enable register. It is used to enable or disable the generation of interrupts for specific conditions.
UART_IIR is the Interrupt Identity Register. When an interrupt occurs, the CPU can read this register to determine what caused the interrupt.
Additional Registers: There are several additional registers which are not needed for basic use of the UART.
UART_MCR is the Modem Control Register. It is used to configure the port for IrDA mode, enable Automatic Flow Control, and manage the RS-232 RTS and DTR hardware handshaking signals for the ports in which they are implemented. The default configuration disables these extra features.
UART_MSR is the Modem Status Register, which is used to read the state of the RS-232 modem control and status lines on ports that implement them. This register can be ignored unless a telephone modem is being used on the port.
UART_SCH is the Scratch Register. It provides eight bits of storage for temporary data values. In the days of 8- and 16-bit computers, when the 16550 UART was designed, this extra byte of storage was useful.
Most modern computer systems have some type of Universal Asynchronous Receiver/Transmitter. These are serial communications devices, and are meant to provide communications with other systems using RS-232 (most commonly) or some other standard serial protocol. Modern systems often have a large number of other devices as well. Each device may need its own clock source to drive it at the correct frequency for its operation. The clock sources for all of the devices are often controlled by yet another device: the clock manager.
Although two systems may have different UARTs, these devices perform the same basic functions. The specifics about how they are programmed will vary from one system to another. However, there is always enough similarity between devices of the same class that a programmer who is familiar with one specific device can easily learn to program another similar device. The more experience a programmer has, the less time it takes to learn how to control a new device.
13.1 Write a function for setting the PWM clock on the Raspberry Pi to 2 MHz.
13.2 The UART_GET_BYTE function in Listing 13.1 contains skeleton code for handling errors, but does not actually do anything when errors occur. Describe at least two ways that the errors could be handled.
13.3 Listing 13.1 provides four functions for managing the UART on the Raspberry Pi. Write equivalent functions for the pcDuino UART.
This chapter starts by describing the extra responsibilities that the programmer must assume when writing code to run without an operating system (bare metal). It then explains privileged and user modes and describes all of the privileged modes available on the ARM processor. Next, it gives an overview of exception processing, and provides example code for setting up the vector table stubs for exception handling functions on the ARM processor. Next, it describes the boot processes on the Raspberry Pi and the pcDuino. After that, it shows how to write a basic bare metal program, without any exception processing. The chapter finishes by showing a more efficient version of the bare metal program using an interrupt.
Bare metal; Exception; Vector table; Exception handler; Sleep mode; User mode; Privileged mode; Startup code; Linker script; Boot loader; Interrupt
The previous chapters assumed that the software would be running in user mode under an operating system. Sometimes, it is necessary to write assembly code to run on “bare metal,” which simply means: without an operating system. For example, when we write an operating system kernel, it must run on bare metal and a significant part of the code (especially during the boot process) must be written in assembly language. Coding on bare metal is useful to deeply understand how the hardware works and what happens in the lowest levels of an operating system. There are some significant differences between code that is meant to run under an operating system and code that is meant to run on bare metal.
The operating system takes care of many details for the programmer. For instance, it sets up the stack, text, and data sections, initializes static variables, provides an interface to input and output devices, and gives the programmer an abstracted view of the machine. When accessing data on a disk drive, the programmer uses the file abstraction. The underlying hardware only knows about blocks of data. The operating system provides the data structures and operations which allow the programmer to think of data in terms of files and streams of bytes. A user program may be scattered in physical memory, but the hardware memory management unit, managed by the operating system, allows the programmer to view memory as a simple memory map (such as shown in Fig. 1.7). The programmer uses system calls to access the abstractions provided by the operating system. On bare metal, there are no abstractions, unless the programmer creates them.
However, there are some software packages to help bare-metal programmers. For example, Newlib is a C standard library intended for use in bare-metal programs. Its major features are that:
• it implements the hardware-independent parts of the standard C library,
• for I/O, it relies on only a few low-level functions that must be implemented specifically for the target hardware, and
• many target machines are already supported in the Newlib source code.
To support a new machine, the programmer only has to write a few low-level functions in C and/or Assembly to initialize the system and perform low-level I/O on the target hardware.
Many early computers were not capable of protecting the operating system from user programs. That problem was solved mostly by building CPUs that support multiple “levels of privilege” for running programs. Almost all modern CPUs have the ability to operate in at least two modes:
User mode is the mode that normal user programs use when running under an operating system, and
Privileged mode is reserved for operating system code. There are operations that can be performed in privileged mode which cannot be performed in user mode.
The ARM processor provides six privileged modes and one user mode. Five of the privileged modes have their own stack pointer (r13) and link register (r14). When the processor mode is changed, the corresponding link register and stack pointer become active, “replacing” the user stack pointer and link register.
In any of the six privileged modes, the link registers and stack pointers of the other modes can be accessed. The privileged mode stack pointers and link registers are not accessible from user mode. One of the privileged modes, FIQ, has five additional registers which become active when the processor enters FIQ mode. These registers “replace” registers r8 through r12. Additionally, five of the privileged modes have a Saved Program Status Register (SPSR). When entering those privileged modes, the CPSR is copied into the corresponding SPSR. This allows the CPSR to be restored to its original contents when the privileged code returns to the previously active mode. The full register set for all modes is shown in Table 14.1. Registers r0 through r7 and the program counter are shared by all modes. Some processors have an additional monitor mode, provided by the ARM Security Extensions (TrustZone).
Table 14.1
The ARM user and system registers
| usr | svc | abt | und | irq | fiq |
| sys | |||||
| r0 | |||||
| r1 | |||||
| r2 | |||||
| r3 | |||||
| r4 | |||||
| r5 | |||||
| r6 | |||||
| r7 | |||||
| r8 | r8_fiq | ||||
| r9 | r9_fiq | ||||
| r10 | r10_fiq | ||||
| r11 (fp) | r11_fiq | ||||
| r12 (ip) | r12_fiq | ||||
| r13 (sp) | r13_svc | r13_abt | r13_und | r13_irq | r13_fiq |
| r14 (lr) | r14_svc | r14_abt | r14_und | r14_irq | r14_fiq |
| r15 (pc) | |||||
| CPSR | CPSR | CPSR | CPSR | CPSR | CPSR |
| SPSR_svc | SPSR_abt | SPSR_und | SPSR_irq | SPSR_fiq |

All of the bits of the Program Status Register (PSR) are shown in Fig. 14.1. The processor mode is selected by writing a bit pattern into the mode bits (M[4:0]) of the PSR. The bit pattern assignment for each processor mode is shown in Table 14.2. Not all combinations of the mode bits define a valid processor mode. An illegal value programmed into M[4:0] causes the processor to enter an unrecoverable state. If this occurs, a hardware reset must be used to re-start the processor. Programs running in user mode cannot modify these bits directly. User programs can only change the processor mode by executing the software interrupt (swi) instruction (also known as the svc instruction), which automatically gives control to privileged code in the operating system. The hardware is carefully designed so that the user program cannot run its own code in privileged mode.

Table 14.2
Mode bits in the PSR
| M[4:0] | Mode | Name | Register Set |
| 10000 | usr | User | R0-R14, CPSR, PC |
| 10001 | fiq | Fast Interrupt | R0-R7, R8_fiq-R14_fiq, CPSR, SPSR_fiq, PC |
| 10010 | irq | Interrupt Request | R0-R12, R13_irq, R14_irq, CPSR, SPSR_irq, PC |
| 10011 | svc | Supervisor | R0-R12, R13_svc, R14_svc, CPSR, SPSR_svc, PC |
| 10111 | abt | Abort | R0-R12, R13_abt, R14_abt, CPSR, SPSR_abt, PC |
| 11011 | und | Undefined Instruction | R0-R12, R13_und, R14_und, CPSR, SPSR_und, PC |
| 11111 | sys | System | R0-R14, CPSR, PC |
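The M[4:0] encodings in Table 14.2 are often captured as constants in systems code. The sketch below defines them in C, together with a helper that extracts the mode field from a CPSR value; the names are illustrative.

```c
#include <stdint.h>

/* Mode encodings for M[4:0] in the PSR (Table 14.2). */
#define PSR_MODE_MASK 0x1Fu
#define MODE_USR 0x10u
#define MODE_FIQ 0x11u
#define MODE_IRQ 0x12u
#define MODE_SVC 0x13u
#define MODE_ABT 0x17u
#define MODE_UND 0x1Bu
#define MODE_SYS 0x1Fu

/* Extract the current mode from a CPSR value. */
static uint32_t psr_mode(uint32_t cpsr)
{
    return cpsr & PSR_MODE_MASK;
}

/* Every valid mode except usr is privileged. */
static int is_privileged(uint32_t cpsr)
{
    return psr_mode(cpsr) != MODE_USR;
}
```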

The swi instruction does not really cause an interrupt, but the hardware and operating system handle it in a very similar way. The software interrupt is used by user programs to request that the operating system perform some task on their behalf. Another general class of interrupt is the “hardware interrupt.” This class of interrupt may occur at any time and is used by hardware devices to signal that they require service. Another type of interrupt may be generated within the CPU when certain conditions arise, such as attempting to execute an unknown instruction. These are generally known as “exceptions” to distinguish them from hardware interrupts. On the ARM processor, there are three bits in the CPSR which affect interrupt processing:
I: when set to one, normal hardware interrupts are disabled,
F: when set to one, fast hardware interrupts are disabled, and
A: (only on ARMv6 and later processors) when set to one, imprecise aborts are disabled (this is an abort on a memory write that has been held in a write buffer in the processor and not written to memory until later, perhaps after another abort).
Programs running in user mode cannot modify these bits. Therefore, the operating system gains control of the CPU whenever an interrupt occurs and the user program cannot disable interrupts and continue to run. Most operating systems use a hardware timer to generate periodic interrupts, thus they are able to regain control of the CPU every few milliseconds.
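The three mask bits described above sit at fixed positions in the CPSR: F is bit 6, I is bit 7, and A (ARMv6 and later) is bit 8. Because they are mask bits, a 1 means the corresponding interrupts are disabled. A minimal sketch, with illustrative names:

```c
#include <stdint.h>

/* Interrupt mask bits in the CPSR. A set bit disables that class. */
#define PSR_F_BIT (1u << 6)  /* fast interrupts (FIQ) disabled */
#define PSR_I_BIT (1u << 7)  /* normal interrupts (IRQ) disabled */
#define PSR_A_BIT (1u << 8)  /* imprecise aborts disabled (ARMv6+) */

static int irq_enabled(uint32_t cpsr) { return (cpsr & PSR_I_BIT) == 0; }
static int fiq_enabled(uint32_t cpsr) { return (cpsr & PSR_F_BIT) == 0; }
```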
Most of the privileged modes are entered automatically by the hardware when certain exceptional circumstances occur. For example, when a hardware device needs attention, it can signal the processor by causing an interrupt. When this occurs, the processor immediately enters IRQ mode and begins executing the IRQ exception handler function. Some devices can cause a fast interrupt, which causes the processor to immediately enter FIQ mode and begin executing the FIQ exception handler function. There are six possible exceptions that can occur, each one corresponding to one of the six privileged modes. Each exception must be handled by a dedicated function, with one additional function required to handle CPU reset events. The first instruction of each of these seven exception handlers is stored in a vector table at a known location in memory (usually address 0). When an exception occurs, the CPU automatically loads the appropriate instruction from the vector table and executes it. Table 14.3 shows the address, exception type, and the mode that the processor will be in, for each entry in the ARM vector table. The vector table usually contains branch instructions. Each branch instruction will jump to the correct function for handling a specific exception type. Listing 14.1 shows a short section of assembly code which provides definitions for the ARM CPU modes.
Table 14.3
ARM vector table
| Address | Exception | Mode |
| 0x00000000 | Reset | svc |
| 0x00000004 | Undefined Instruction | und |
| 0x00000008 | Software Interrupt | svc |
| 0x0000000C | Prefetch Abort | abt |
| 0x00000010 | Data Abort | abt |
| 0x00000014 | Reserved | |
| 0x00000018 | Interrupt Request | irq |
| 0x0000001C | Fast Interrupt Request | fiq |

Many bare-metal programs consist of a single thread of execution running in user mode to perform some task. This main program is occasionally interrupted by the occurrence of some exception. The exception is processed, and then control returns to the main thread. Fig. 14.2 shows the sequence of events when an exception occurs in such a system. The main program typically would be running with the CPU in user mode. When the exception occurs, the CPU executes the corresponding instruction in the vector table, which branches to the exception handler. The exception handler must save any registers that it is going to use, execute the code required to handle the exception, then restore the registers. When it returns to the user mode process, everything will be as it was before the exception occurred. The user mode program continues executing as if the exception never occurred.

More complex systems may have multiple tasks, threads of execution, or user processes running concurrently. In a single-processor system, only one task, thread, or user process can actually be executing at any given instant, but when an exception occurs, the exception handler may change the currently active task, thread, or user process. This is the basis for all modern multiprocessing systems. Fig. 14.3 shows how an exception may be processed on such a system. It is common on multi-processing systems for a timer device to be used to generate periodic interrupts, which allows the currently active task, thread, or user process to be changed at a fixed frequency.

When any exception occurs, it causes the ARM CPU hardware to perform a very well-defined sequence of actions:
1. The CPSR is copied into the SPSR for the mode corresponding to the type of exception that has occurred.
2. The CPSR mode bits are changed, switching the CPU into the appropriate privileged mode.
3. The banked registers for the new mode become active.
4. The I bit of the CPSR is set, which disables normal hardware interrupts.
5. If the exception was an FIQ, or if a reset has occurred, then the F bit is also set, disabling fast interrupts.
6. The program counter is copied to the link register for the new mode.
7. The program counter is loaded with the address in the vector table corresponding with the exception that has occurred.
8. The processor then fetches the next instruction using the program counter as usual. However, the program counter has been set so that it loads an instruction from the vector table.
The instruction in the vector table should cause the CPU to branch to a function which handles the exception. At the end of that function, the program counter must be loaded with the address of the instruction where the exception occurred, and the SPSR must be copied back into the CPSR. That will cause the processor to branch back to where it was when the exception occurred, and return to the mode that it was in at that time.
Listing 14.2 shows in detail how the vector table is initialized. The vector table contains eight identical instructions. These instructions load the program counter, which causes a branch. In each case, the program counter is loaded with a value at the memory location that is 32 bytes past the corresponding load instruction. The encoded offset is 24 because the program counter will have advanced 8 bytes by the time the load instruction is executed. The addresses of the exception handlers have been stored in a second table that begins 32 bytes after the first load instruction. Thus, each instruction in the vector table loads a unique address into the program counter. Note that one of the slots in the vector table is not used and is reserved by ARM for future use. That slot is treated like all of the others, but it will never be used on any current ARM processor.
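The pipeline arithmetic behind the vector table can be checked with a few lines of C. An `ldr pc, [pc, #24]` at address A reads its value from A + 8 + 24, because the PC reads as A + 8 when the instruction executes; so every slot pulls its handler address from a table entry 32 bytes further on. The names below are illustrative.

```c
#include <stdint.h>

#define PC_READ_AHEAD 8u   /* PC is 8 bytes ahead of the executing instruction */
#define LDR_PC_OFFSET 24u  /* offset encoded in each ldr pc, [pc, #24] */

/* Address of the table entry read by the load instruction at vector_addr. */
static uint32_t handler_table_entry(uint32_t vector_addr)
{
    return vector_addr + PC_READ_AHEAD + LDR_PC_OFFSET;
}
```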

Listing 14.3 shows the stub functions for each of the exception handlers.


Note that the return sequence depends on the type of exception. For some exceptions, the return address must be adjusted. This is because the program counter may have been advanced past the instruction where the exception occurred. These stub functions simply return the processor to the mode and location at which the exception occurred. To be useful, they will need to be extended significantly. Note that these functions all return using a data processing instruction with the optional s specified and with the program counter as the destination register. This special form of data processing instruction indicates that the SPSR should be copied into the CPSR at the same time that the program counter is loaded with the return address. Thus, the function returns to the point where the exception occurred, and the processor switches back into the mode that it was in when the exception occurred.
A special form of the ldm instruction can also be used to return from an exception processing function. In order to use that method, the exception handler should start by adjusting the link register (depending on the type of exception) and then pushing it onto the stack. The handler should also push any other registers that it will need to use. At the end of the function, an ldmfd is used to restore the registers, but instead of restoring the link register, it loads the program counter. A caret (^) is also appended to the instruction. Listing 14.4 shows the skeleton for an exception handler function using this method.

In order to create a bare-metal program, we must understand what the processor does when power is first applied or after a reset. The ARM CPU begins to execute code at a predetermined address. Depending on the configuration of the ARM processor, the program counter starts either at address 0 or 0xFFFF0000. In order for the system to work, the startup code must be at the correct address when the system starts up.
On the Raspberry Pi, when power is first applied, the ARM CPU is disabled and the graphics processing unit (GPU) is enabled. The GPU runs a program that is stored in ROM. That program, called the first stage boot loader, reads the second stage boot loader from a file named bootcode.bin on the SD card. That program enables the SDRAM, and then loads the third stage boot loader, start.elf. At this point, some basic hardware configuration is performed, and then the kernel is loaded to address 0x8000 from the kernel.img file on the SD card. Once the kernel image file is loaded, a “b #0x8000” instruction is placed at address 0, and the ARM CPU is enabled. The ARM CPU executes the branch instruction at address 0, and immediately jumps to the kernel code at address 0x8000.
To run a bare-metal program on the Raspberry Pi, it is only necessary to build an executable image and store it as kernel.img on the SD card. Then, the boot process will load the bare-metal program instead of the Linux kernel image. Care must be taken to ensure that the linker prepares the program to run at address 0x8000 and places the first executable instruction at the beginning of the image file. It is also important to make a copy of the original kernel image so that it can be restored (using another computer). If the original kernel image is lost, then there will be no way to boot Linux until it is replaced.
The pcDuino uses u-boot, which is a highly configurable open-source boot loader. The boot loader is configured to attempt booting from the SD card. If a bootable SD card is detected, then it is used. Otherwise, the pcDuino boots from its internal NAND flash. In either case, u-boot finds the Linux kernel image file, named uImage, loads it at address 0x40008000, and then jumps to that location. The easiest way to run bare-metal code on the pcDuino is to create a duplicate of the operating system on an SD card, then replace the uImage file with another executable image. Care must be taken to ensure that the linker prepares the program to run at address 0x40008000 and places the first executable instruction at the beginning of the image file. If the SD card is inserted, then the bare-metal code will be loaded. Otherwise, it will boot normally from the NAND flash memory.
A bare-metal program should be divided into several files. Some of the code may be written in assembly, and other parts in C or some other language. The initial startup code, and the entry and exit from exception handlers, must be written in assembly. However, it may be much more productive to write the main program and the remainder of the exception handlers as C functions and have the assembly code call them.
Other than the code being loaded at different addresses, there is very little difference between getting bare-metal code running on the Raspberry Pi and the pcDuino. For either platform, the bare-metal program must include some start-up code. The startup code will:
• initialize the stack pointers for all of the modes,
• set up interrupt and exception handling,
• initialize the .bss section,
• configure the CPU and critical systems (optional),
• set up memory management (optional),
• set up process and/or thread management (optional),
• initialize devices (optional), and
• call the main function.
The startup code requires some knowledge of the target platform, and must be at least partly written in assembly language. Listing 14.5 shows a function named _start which sets up the stacks, initializes the .bss section, calls a function to set up the vector table, then calls the main function:



The first task for the startup code is to ensure that the stack pointer for each processor mode is initialized. When an exception or interrupt occurs, the processor will automatically change into the appropriate mode and begin executing an exception handler, using the stack pointer for that mode. Hardware interrupts can be disabled, but some exceptions cannot be disabled. In order to guarantee correct operation, a stack must be set up for each processor mode, and an exception handler must be provided. The exception handler does not actually have to do anything.
On the Raspberry Pi, memory is mapped to begin at address 0, and all models have at least 256 MB of memory. Therefore, it is safe to assume that the last valid memory address is 0x0FFFFFFF. If each mode is given 4 kB of stack space, then all of the stacks together will consume 32 kB, and the initial stack addresses can be easily calculated. Since the C compiler uses a full descending stack, the initial stack pointers can be assigned addresses 0x10000000, 0x0FFFF000, 0x0FFFE000, etc.
For the pcDuino, there is a small amount of memory mapped at address 0, but most of the available memory is in the region between 0x40000000 and 0xBFFFFFFF. The pcDuino has at least 1 GB of memory. One possible way to assign the stack locations is: 0x50000000, 0x4FFFF000, 0x4FFFE000, etc. This assignment of addresses will make it easy to write one piece of code to set up the stacks for either the Raspberry Pi or the pcDuino.
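The stack assignment scheme described above is simple arithmetic: full descending stacks, 4 kB each, carved downward from a platform-specific top address (0x10000000 on the Raspberry Pi, 0x50000000 on the pcDuino). A sketch, with illustrative names:

```c
#include <stdint.h>

#define MODE_STACK_SIZE 0x1000u  /* 4 kB per processor mode */

/* Initial stack pointer for the Nth mode (0 = first mode set up),
   counting downward from the platform's chosen top of stack space. */
static uint32_t mode_stack_top(uint32_t top_of_stacks, unsigned mode_index)
{
    return top_of_stacks - mode_index * MODE_STACK_SIZE;
}
```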
After initializing the stacks, the startup code must set all bytes in the .bss section to zero. Recall that the .bss section is used to hold data that are initialized to zero, but the program file does not actually contain all of the zeros. Programs running under an operating system can rely on the C standard library to initialize the .bss section. A bare-metal program that is not linked with a C library must set all of the bytes in the .bss section to zero itself.
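The clearing loop itself is short. In a real bare-metal program the boundaries would be linker-provided symbols marking the start and end of the .bss section (the exact symbol names depend on the linker script); here they are ordinary pointers so the loop can be exercised on a host. A minimal sketch:

```c
#include <stdint.h>

/* Zero every byte in [start, end). On target, start and end would be the
   addresses of the linker symbols delimiting the .bss section. */
static void zero_bss(uint8_t *start, uint8_t *end)
{
    while (start < end)
        *start++ = 0;
}
```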
The final part of this bare-metal program is the main function. Listing 14.6 shows a very simple main program which reads from three GPIO pins which have pushbuttons connected to them, and controls three other pins that have LEDs connected to them. When a button is pressed the LED associated with it is illuminated. The only real difference between the pcDuino and Raspberry Pi versions of this program is in the functions which drive the GPIO device. Therefore, those functions have been removed from the main program file. This makes the main program portable; it can run on the pcDuino or the Raspberry Pi. It could also run on any other ARM system, with the addition of another file to implement the mappings and functions for using the GPIO device for that system.

When compiling the program, it is necessary to perform a few extra steps to ensure that the program is ready to be loaded and run by the boot code. The last step in compiling a program is to link all of the object files together, possibly also including some object files from system libraries. A linker script is a file that tells the linker which sections to include in the output file, as well as which order to put them in, what type of file is to be produced, and what is to be the address of the first instruction. The default linker script used by GCC creates an ELF executable file, which includes startup code from the C library and also includes information which tells the loader where the various sections reside in memory. The default linker script creates a file that can be loaded by the operating system kernel, but which cannot be executed on bare metal.
For a bare-metal program, the linker must be configured to link the program so that the first instruction of the startup function is given the correct address in memory. This address depends on how the boot loader will load and execute the program. On the Raspberry Pi this address is 0x8000, and on the pcDuino this address is 0x40008000. The linker will automatically adjust any other addresses as it links the code together. The most efficient way to accomplish this is by providing a custom linker script to be used instead of the default system script. Additionally, either the linker must be instructed to create a flat binary file, rather than an ELF executable file, or a separate program (objcopy) must be used to convert the ELF executable into a flat binary file.
Listing 14.7 is an example of a linker script that can be used to create a bare-metal program. The first line is just a comment. The second line specifies the name of the function where the program begins execution. In this case, it specifies that a function named _start is where the program will begin execution. Next, the file specifies the sections that the output file will contain. For each output section, it lists the input sections that are to be used.

The first output section is the .text section, and it is composed of any sections whose names end in .text.boot followed by any sections whose names end in .text. In Listing 14.5, the _start function was placed in the .text.boot section, and it is the only thing in that section. Therefore the linker will put the _start function at the very beginning of the program. The remaining text sections will be appended, and then the remaining sections, in the order that they appear. After the sections are concatenated together, the linker will make a pass through the resulting file, correcting the addresses of branch and load instructions as necessary so that the program will execute correctly.
Compiling a program that consists of multiple source files, a custom linker script, and special commands to create an executable image can become tedious. The make utility was created specifically to help in this situation. Listing 14.8 shows a make script that can be used to combine all of the elements of the program together and produce a uImage file for the pcDuino and a kernel.img file for the Raspberry Pi. Listing 14.9 shows how the program can be built by typing “make” at the command line.


The main program shown in Listing 14.6 is extremely wasteful because it runs the CPU in a loop, repeatedly checking the status of the GPIO pins. It uses far more CPU time (and electrical power) than is necessary. In reality, the pins are unlikely to change state very often, and it is sufficient to check them a few times per second. It only takes a few nanoseconds to check the input pins and set the output pins, so the CPU only needs to be running for a few nanoseconds at a time, a few times per second.
A much more efficient implementation would set up a timer to send interrupts at a fixed frequency. Then the main loop can check the buttons, set the outputs, and put the CPU to sleep. Listing 14.10 shows the main program, modified to put the processor to sleep after each iteration of the main loop. The only difference between this main function and the one in Listing 14.6 is the addition of a wfi instruction at line 43. The new implementation will consume far less electrical power and allow the CPU to run cooler, thereby extending its life. However, some additional work must be performed in order to set up the timer and interrupt system before the main function is called.
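The shape of such a main loop can be sketched as follows; the GPIO helper function names and the pin numbers are placeholders standing in for those used in Listing 14.6:

```gas
main:   push    {r4, lr}
loop:   mov     r0, #0              @ read the state of input pin 0
        bl      GPIO_get_pin        @ (helper function assumed)
        mov     r1, r0              @ copy its state to output pin 1
        mov     r0, #1
        bl      GPIO_set_pin        @ (helper function assumed)
        wfi                         @ sleep until the next interrupt
        b       loop                @ one iteration per timer tick
```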

Some changes must be made to the startup code in Listing 14.5 so that, after setting up the vector table, it calls a function to initialize the interrupt controller and then calls another function to set up the timer. Listing 14.11 shows the modified startup function.
Lines 50 through 57 have been added to initialize the interrupt controller, enable the timer, and change the CPU into user mode before calling main. Of course, the hardware timers and interrupt controllers on the pcDuino and Raspberry Pi are very different.
The pcDuino has an ARM Generic Interrupt Controller (GIC-400) device to manage interrupts. The GIC device can handle a large number of interrupts. Each one is a separate input signal to the GIC. The GIC hardware prioritizes each input, and assigns each one a unique integer identifier. When the CPU receives an interrupt, it simply reads the GIC to determine which hardware device signaled the interrupt, calls the function which handles that device, then writes to one of the GIC registers to indicate that the interrupt has been processed. Listing 14.12 provides a few basic functions for managing this device.
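The acknowledge/handle/complete sequence just described can be sketched as follows. The symbol GICC_BASE and the function dispatch_handler are placeholders, since the actual CPU-interface base address comes from the SoC memory map; the register offsets are those defined by the GIC architecture:

```gas
irq_dispatch:
        push    {r4, r5, lr}
        ldr     r4, =GICC_BASE      @ GIC CPU interface base (assumed)
        ldr     r5, [r4, #0x0c]     @ GICC_IAR: acknowledge, read interrupt ID
        mov     r0, r5
        bl      dispatch_handler    @ call the handler for this ID (assumed)
        str     r5, [r4, #0x10]     @ GICC_EOIR: mark the interrupt handled
        pop     {r4, r5, lr}
        bx      lr
```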




The Raspberry Pi has a much simpler interrupt controller. It can enable and disable interrupt sources, and it requires that the programmer read up to three registers to determine the source of an interrupt. For our purposes, we only need to manage the ARM timer interrupt. Listing 14.13 provides a few basic functions for using this device to enable the timer interrupt. Extending these functions to provide functionality equal to the GIC would not be very difficult, but would take some time. It would be necessary to set up a mapping from the interrupt bits in the interrupt controller registers to integer values, so that each interrupt source has a unique identifier. Then the functions could be written to use those identifiers. The result would be a software implementation providing capabilities equivalent to the GIC.
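Such a mapping could be sketched in C as shown below. The register grouping (two pending registers plus a basic pending register) follows the Raspberry Pi's interrupt controller, but the identifier numbering is an invented convention for illustration, not something defined by the hardware:

```c
#include <stdint.h>

/* Map Raspberry Pi pending-register bits to a single integer ID,
 * mimicking the GIC's acknowledge register.  IDs 0-31 come from
 * pending register 1, IDs 32-63 from pending register 2, and IDs
 * 64 and up from the basic pending register.  Returns -1 if no
 * interrupt is pending.  The numbering is an assumption made for
 * this sketch. */
int irq_pending_to_id(uint32_t pend1, uint32_t pend2, uint32_t basic)
{
    for (int i = 0; i < 32; i++)
        if (pend1 & (1u << i)) return i;
    for (int i = 0; i < 32; i++)
        if (pend2 & (1u << i)) return 32 + i;
    for (int i = 0; i < 8; i++)
        if (basic & (1u << i)) return 64 + i;
    return -1;
}
```

With identifiers assigned this way, the dispatch code can use the same handler-table lookup on both platforms.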


Note that although the devices are very different internally, they perform basically the same function. With the addition of a software driver layer, implemented in Listings 14.12 and 14.13, the devices become interchangeable, and other parts of the bare-metal program do not have to be changed when porting from one platform to the other.


The pcDuino provides several timers that could be used; Timer0 was chosen arbitrarily. Listing 14.14 provides a few basic functions for managing this device.


The Raspberry Pi also provides several timers that could be used, but the ARM timer is the easiest to configure. Listing 14.15 provides a few basic functions for managing this device.
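The essential setup steps can be sketched as follows. The base address and the control bits are taken from the BCM2835 peripheral documentation, but the countdown value is an arbitrary choice and would be tuned to the desired interrupt frequency:

```gas
        .equ    ARM_TIMER, 0x2000B400   @ BCM2835 ARM timer registers (assumed)
timer_init:
        ldr     r0, =ARM_TIMER
        ldr     r1, =0x00010000
        str     r1, [r0, #0x00]         @ load register: countdown start value
        mov     r1, #0xA2               @ 32-bit counter, IRQ enable, timer enable
        str     r1, [r0, #0x08]         @ control register
        bx      lr
```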


The final step in writing the bare-metal code to operate in an interrupt-driven fashion is to modify the IRQ handler from Listing 14.3. Listing 14.16 shows a new version of the IRQ exception handler which checks and clears the timer interrupt, then returns to the location and CPU mode that were current when the interrupt occurred. This code works for both platforms.
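The overall structure of such a handler can be sketched as follows; timer_check_clear stands in for the platform-specific code that acknowledges and clears the timer interrupt:

```gas
irq_handler:
        sub     lr, lr, #4          @ adjust return address (IRQ entry offset)
        push    {r0-r3, r12, lr}    @ save the caller-saved registers
        bl      timer_check_clear   @ ack/clear the timer interrupt (assumed)
        pop     {r0-r3, r12, lr}
        movs    pc, lr              @ return, restoring CPSR from SPSR
```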

Finally, the make file must be modified to include the new source code that was added to the program. Listing 14.17 shows the modified make script. The only change is that two extra object files have been added. When make is run, those files will be compiled and linked with the program. Listing 14.9 shows how the program can be built by typing “make” at the command line.

Since its introduction in 1985 as the Acorn RISC Machine, the ARM processor has gone through many changes. Throughout the years, ARM processors have always maintained a good balance of simplicity, performance, and efficiency. Although originally intended as a desktop processor, the ARM architecture has been more successful than any other architecture for use in embedded applications. That is at least partially because of good choices made by its original designers. The architectural decisions resulted in a processor that provides relatively high computing power with a relatively small number of transistors. This design also results in relatively low power consumption.
Today, there are almost 20 major versions of the ARM architecture, targeted at everything from smart sensors to desktops and servers, and sales of ARM-based processors outnumber those of all other processor architectures combined. Historically, ARM has given numbers to the various versions of the architecture. With ARMv7, they introduced a simpler scheme for describing different versions of the processor, dividing their processor families into three major profiles:
ARMv7-A: Applications processors are capable of running a full, multiuser, virtual memory, multiprocessing operating system.
ARMv7-R: Real-time processors are for embedded systems that may need powerful processors, cache, and/or large amounts of memory.
ARMv7-M: Microcontroller processors only execute Thumb instructions and are intended for use in very small cost-sensitive embedded systems. They provide low cost, low power, and small size, and may not have hardware floating point or other high-performance features.
In 2014, ARM introduced the ARMv8 architecture. This is the first radical change in the architecture's history. The new architecture extends the register set to thirty-one 64-bit general purpose registers, and has a completely new instruction set. Compatibility with ARMv7 and earlier code is supported by switching the processor into 32-bit mode, so that it executes the 32-bit ARM instruction set. This is somewhat similar to the way that the Thumb instructions are supported on 32-bit ARM cores, except that the switch between 32-bit and 64-bit execution can only be made when the processor takes an exception into privileged mode or drops back to unprivileged mode.
Writing bare-metal programs can be a daunting task. However, that task can be made easier by writing and testing code under an operating system before attempting to run it bare metal. There are some functions which cannot be tested in this way. In those cases, it is best to keep those functions as simple as possible. Once the program works on bare metal, extra capabilities can be added.
Interrupt-driven processing is the basis for all modern operating systems. The system timer allows the O/S to take control periodically and select a different process to run on the CPU. Interrupts allow hardware devices to do their jobs independently and signal the CPU when they need service. The ability to restrict user access to devices and certain processor features provides the basis for a secure and robust system.
14.1 What are the advantages of a CPU which supports user mode and privileged mode over a CPU which does not?
14.2 What are the six privileged modes supported by the ARM architecture?
14.3 The interrupt handling mechanism is somewhat complex and requires significant programming effort to use. Why is it preferred over simply having the processor poll I/O devices?
14.4 Where does program control transfer to when a hardware interrupt occurs?
14.5 What is the purpose of the Undefined Instruction exception? How can it be used to allow an older processor to run programs that have new instructions? What other uses does it have?
14.6 What is an swi instruction? What is its use in operating systems? What is the key difference between an swi instruction and an interrupt?
14.7 Which of the following operations should be allowed only in privileged mode? Briefly explain your decision for each one.
(a) Execute an swi instruction.
(b) Disable all interrupts.
(c) Read the time-of-day clock.
(d) Receive a packet of data from the network.
(e) Shutdown the computer.
14.8 The main program in Listing 14.10 has two different methods to put the processor to sleep waiting for an interrupt. One method is for the Raspberry Pi, while the other is for the pcDuino. In order to compile the code, the correct lines must be uncommented and the unneeded lines must be commented out or removed. Explain two ways to change the code so that exactly the same main program can be used on both systems.
14.9 The programs in this chapter assumed the existence of libraries of functions for controlling the GPIO pins on the Raspberry Pi and the pcDuino. Both libraries provide the same high-level functions, but one operates on the Raspberry Pi GPIO device and the other operates on the pcDuino GPIO device. The C prototypes for the functions are: int GPIO_get_pin(int pin), void GPIO_set_pin(int pin,int state), GPIO_dir_input (int pin), and GPIO_dir_output (int pin). Write these libraries in ARM assembly language for both platforms.
14.10 Write an interrupt-driven program to read characters from the serial port on either the Raspberry Pi or the pcDuino. The UART on either system can be configured to send an interrupt when a character is received.
When a character is received through the UART and an interrupt occurs, the character should be echoed by transmitting it back to the sender. The character should also be stored in a buffer. If the character received is a newline (“\n”), or if the buffer becomes full, then the contents of the buffer should be transmitted through the UART. Then the buffer should be cleared and prepared to receive more characters.