Chapter 10

The ARM NEON Extensions

Abstract

This chapter begins with an overview of the NEON extensions and explains the relationship between VFP and NEON. The NEON registers and the syntax of NEON instructions are then described. Next, each of the NEON instructions is explained, with short examples. In some cases, extended examples and figures are provided to help explain the operation of complex instructions. After all of the instructions have been explained, another implementation of sine is presented and compared with previous implementations and with the GCC sine function. It is shown that NEON gives a significant performance advantage over VFP, and that hand-coded assembly is much faster than the sin function provided by the compiler.

Keywords

Single instruction multiple data (SIMD); Vector; Vector element; Instruction level parallelism; Lane

The ARM VFP coprocessor has been replaced or augmented by the NEON architecture on ARMv7 and higher systems. NEON extends the VFP instruction set with about 125 instructions and pseudo-instructions to support not only floating point, but also integer and fixed point. NEON also supports Single Instruction, Multiple Data (SIMD) operations. All NEON processors have the full set of 32 double precision VFP registers, but NEON adds the ability to view the register set as 16 128-bit (quadruple-word) registers, named q0 through q15.

A single NEON instruction can operate on up to 128 bits, which may represent multiple integer, fixed point, or floating point numbers. For example, if two of the 128-bit registers each contain eight 16-bit integers, then a single NEON instruction can add all eight integers from one register to the corresponding integers in the other register, resulting in eight simultaneous additions. For certain applications, this SIMD architecture can result in extremely fast and efficient implementations. NEON is particularly useful for handling streaming video and audio, but can also give very good performance on floating point intensive tasks. NEON instructions perform parallel operations on vectors. NEON deprecates the use of VFP vector mode, covered in Section 9.2.2. On most NEON systems, using the VFP vector mode will result in an exception, which transfers control to support code that emulates vector mode in software. This causes a severe performance penalty, so VFP vector mode should not be used on NEON systems.
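The eight simultaneous additions described above can be modelled in portable C. The following is only a sketch of the semantics, not NEON code: the function name add_8x16 is hypothetical, the arrays stand in for 128-bit registers, and the loop performs one lane at a time what a single instruction such as vadd.i16 q0, q1, q2 performs in one step. Each lane is assumed to wrap modulo 2^16 on overflow, as an ordinary (non-saturating) integer add does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: a 128-bit register viewed as eight 16-bit lanes.
   Adds corresponding lanes of a and b and writes the results to sum.
   Every lane is independent; no carry crosses a lane boundary. */
static void add_8x16(const uint16_t a[8], const uint16_t b[8], uint16_t sum[8])
{
    assert(a != NULL && b != NULL && sum != NULL);
    for (size_t i = 0; i < 8; i++) {
        /* The cast discards any carry out of bit 15 (wrap-around). */
        sum[i] = (uint16_t)(a[i] + b[i]);
    }
}
```

Because no lane depends on any other, the hardware is free to compute all eight sums at once, which is where the speed advantage comes from.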

Fig. 10.1 shows the ARM integer, VFP, and NEON register set. NEON views each register as containing a vector of 1, 2, 4, 8, or 16 elements, all of the same size and type. Individual elements of each vector can also be accessed as scalars. A scalar can be 8 bits, 16 bits, 32 bits, or 64 bits. The instruction syntax is extended to refer to scalars using an index, x, in a doubleword register. Dm[x] is element x in register Dm. The size of the elements is given as part of the instruction. Instructions that access scalars can access any element in the register bank.
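The scalar notation Dm[x] can be illustrated with a small C model. This is a sketch under stated assumptions: get_scalar is a hypothetical helper, the doubleword register is represented as a plain 64-bit integer, and element 0 is taken to occupy the least significant bits of the register.

```c
#include <assert.h>
#include <stdint.h>

/* Returns element x of the 64-bit register value d, where each element
   is size_bits wide (8, 16, 32, or 64). Element 0 is assumed to occupy
   the least significant bits. */
static uint64_t get_scalar(uint64_t d, unsigned size_bits, unsigned x)
{
    assert(size_bits == 8 || size_bits == 16 ||
           size_bits == 32 || size_bits == 64);
    assert(x < 64 / size_bits);  /* valid indices depend on element size */
    if (size_bits == 64) {
        return d;                /* a single element fills the register */
    }
    uint64_t mask = (UINT64_C(1) << size_bits) - 1;
    return (d >> (x * size_bits)) & mask;
}
```

With d holding 0x0807060504030201, the same register yields 0x01 as 8-bit element 0, 0x0403 as 16-bit element 1, and 0x08070605 as 32-bit element 1. This is why the element size must be stated as part of the instruction.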

Figure 10.1 ARM integer and NEON user program registers.

10.1 NEON Intrinsics

The GCC compiler gives C (and C++) programs direct access to the NEON instructions through the NEON intrinsics. The intrinsics are a large set of functions that are built into the compiler. Most of the intrinsic functions map to one NEON instruction. Additional functions are provided for typecasting (reinterpreting) NEON vectors, so that the C compiler does not complain about mismatched types. It is usually shorter and more efficient to write the NEON code directly as assembly language functions and link them to the C code. However, that approach requires a working knowledge of assembly language.

10.2 Instruction Syntax

Some instructions require specific register types. Other instructions allow the programmer to choose single word, double word, or quad word registers. If the instruction requires single precision registers, then the registers are specified as Sd for the destination register, Sn for the first operand register, and Sm for the second operand register. If the instruction requires only two registers, then Sn is not used. The lower-case letter is replaced with a valid register number. The register name is not case sensitive, so S10 and s10 are both valid names for single precision register 10.

The syntax of the NEON instructions can be described using a relatively simple notation. The notation consists of the following elements:

{item} Braces around an item indicate that the item is optional. For example, many operations have an optional condition, which is written as {<cond>}.

Ry An ARM integer register. y can be any number in the range 0–15.

Sy A 32-bit or single precision register. y can be any number in the range 0–31.

Dy A 64-bit or double precision register. y can be any number in the range 0–31.

Qy A quad word register. y can be any number in the range 0–15.

Fy A VFP register. F must be either s for a single word register, or d for a double word register. y can be any valid register number.

Ny A NEON or VFP register. N must be either s for a single word register, d for a double word register, or q for a quad word register. y can be any valid register number.

Vy A NEON vector register. V must be replaced with d for a double word register, or q for a quad word register. y can be any valid register number.

Vy[x] A NEON scalar (vector element). The size of the scalar is defined as part of the instruction. V must be replaced with d for a double word register, or q for a quad word register. y can be any valid register number. x specifies which scalar element of Vy is to be used. Valid values for x can be deduced by the size of Vy and the size of the scalars that the instruction uses.

<op> Operation specific part of a general instruction format

<n> An integer usually indicating a specific instruction version

<size> An integer indicating the number of bits used

<cond> ARM condition code from Table 3.2

<type> Many instructions operate on one or more of the following specific data types:

i8 Untyped 8 bits

i16 Untyped 16 bits

i32 Untyped 32 bits

i64 Untyped 64 bits

s8 Signed 8-bit integer

s16 Signed 16-bit integer

s32 Signed 32-bit integer

s64 Signed 64-bit integer

u8 Unsigned 8-bit integer

u16 Unsigned 16-bit integer

u32 Unsigned 32-bit integer

u64 Unsigned 64-bit integer

f16 IEEE 754 half precision floating point

f32 IEEE 754 single precision floating point

f64 IEEE 754 double precision floating point

<list> A brace-delimited list of up to four NEON registers, vectors, or scalars. The general form is {Dn,D(n+a),D(n+2a),D(n+3a)} where a is either 1 or 2.

<align> Specifies the memory alignment of structured data for certain load and store operations.

<imm> An immediate value. The required format for immediate values depends on the instruction.

<fbits> Specifies the number of fraction bits in fixed point numbers.

The following function definitions are used in describing the effects of many of the instructions:

$⌊x⌋$ The floor function maps a real number, x, to the largest integer that is less than or equal to x.

The saturate function limits the value of x to the highest or lowest value that can be stored in the destination register.

$‖x‖$ The round function maps a real number, x, to the nearest integer.

$≻ x ≺$ The narrow function reduces a 2n bit number to an n bit number, by taking the n least significant bits.

$≺ x ≻$ The extend function converts an n bit number to a 2n bit number, performing zero extension if the number is unsigned, or sign extension if the number is signed.
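The saturate, narrow, and extend functions can be made concrete with a short C sketch. The function names and the choice of 64-bit carrier types are assumptions made for illustration only; the code models the definitions given above rather than any particular NEON instruction.

```c
#include <assert.h>
#include <stdint.h>

/* saturate: clamp x to the range of a signed destination of `bits` bits. */
static int64_t saturate_signed(int64_t x, unsigned bits)
{
    assert(bits >= 2 && bits <= 63);
    int64_t max = (INT64_C(1) << (bits - 1)) - 1;
    int64_t min = -max - 1;
    if (x > max) return max;
    if (x < min) return min;
    return x;
}

/* narrow: keep the n least significant bits of a 2n-bit number. */
static uint64_t narrow(uint64_t x, unsigned n)
{
    assert(n >= 1 && n <= 32);
    return x & ((UINT64_C(1) << n) - 1);
}

/* extend, unsigned: zero extension of an n-bit number. */
static uint64_t extend_unsigned(uint64_t x, unsigned n)
{
    assert(n >= 1 && n <= 32);
    return x & ((UINT64_C(1) << n) - 1);
}

/* extend, signed: sign extension of an n-bit number. Flipping the sign
   bit and then subtracting its weight replicates it into the upper bits. */
static int64_t extend_signed(uint64_t x, unsigned n)
{
    assert(n >= 1 && n <= 32);
    uint64_t sign = UINT64_C(1) << (n - 1);
    x &= (UINT64_C(1) << n) - 1;
    return (int64_t)(x ^ sign) - (int64_t)sign;
}
```

For example, saturating 300 to 8 signed bits gives 127, narrowing 0x1234 to 8 bits gives 0x34, and extending the 8-bit pattern 0xFF gives 255 when unsigned but -1 when signed.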

10.3 Load and Store Instructions

These instructions can be used to perform interleaving of data when structured data is loaded or stored. The data should be properly aligned for best performance. These instructions are very useful for common multimedia data types.

For example, image data is typically stored in arrays of pixels, where each pixel is a small data structure such as the pixel struct shown in Listing 5.37. Since each pixel is three bytes, and a d register is 8 bytes, loading a single pixel into one register would be inefficient. It would be much better to load multiple pixels at once, but a whole number of pixels will not exactly fill a register. It takes three doubleword or quadword registers to hold a whole number of pixels without wasting space, as shown in Fig. 10.2. This is the way the data would be loaded using a VFP vldr or vldm instruction. Many image processing operations work best if each color “channel” is processed separately. The NEON load and store vector instructions can be used to split the image data into color channels, where each channel is stored in a different register, as shown in Fig. 10.3.

Figure 10.2 Pixel data interleaved in three doubleword registers.

Figure 10.3 Pixel data de-interleaved in three doubleword registers.

Other examples of interleaved data include stereo audio, which is two interleaved channels, and surround sound, which may have up to nine interleaved channels. In all of these cases, most processing operations are simplified when the data is separated into non-interleaved channels.
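The de-interleaving of Fig. 10.3 can be sketched in portable C. This is a model of the data movement only, under stated assumptions: deinterleave_rgb8 is a hypothetical function, the three output arrays stand in for three doubleword registers, and the bytes of each pixel are assumed to be stored in the order red, green, blue. A structured load such as vld3.8 {d0, d1, d2}, [r0] performs roughly this rearrangement in a single instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 8, CHANNELS = 3 };

/* Splits eight interleaved 3-byte pixels (24 bytes at src) into three
   8-byte channel vectors, one per color. The R, G, B byte order is an
   assumption made for this example. */
static void deinterleave_rgb8(const uint8_t *src,
                              uint8_t red[LANES],
                              uint8_t green[LANES],
                              uint8_t blue[LANES])
{
    assert(src != NULL);
    for (size_t i = 0; i < LANES; i++) {
        red[i]   = src[CHANNELS * i + 0];
        green[i] = src[CHANNELS * i + 1];
        blue[i]  = src[CHANNELS * i + 2];
    }
}
```

Once the channels are separated, an operation such as adjusting only the red values can be applied to one register without masking out the other two colors.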

10.3.1 Load or Store Single Structure Using One Lane

These instructions are used to load and store structured data across multiple registers:

vld<n> Load Structured Data, and

vst<n> Store Structured Data.

They can be used for interleaving or deinterleaving the data as it is loaded or stored, as shown in Fig. 10.3.

Syntax

v<op><n>.<size> <list>,[Rn{:<align>}]{!}

v<op><n>.<size> <list>,[Rn{:<align>}],Rm

• <op> must be either ld or st.

• <n> must be one of 1, 2, 3, or 4.

• <size> must be one of 8, 16, or 32.

• <list> specifies the list of registers. There are four list formats:

1. {Dd[x]}

2. {Dd[x], D(d+a)[x]}

3. {Dd[x], D(d+a)[x], D(d+2a)[x]}

4. {Dd[x], D(d+a)[x], D(d+2a)[x], D(d+3a)[x]}

where a can be either 1 or 2. Every register in the list must be in the range d0-d31.

• Rn is the ARM register containing the base address. Rn cannot be pc.

• <align> specifies an optional alignment. If <align> is not specified, then standard alignment rules apply.

• The optional ! indicates that Rn is updated after the data is transferred. This is similar to the ldm and stm instructions.

• Rm is an ARM register containing an offset from the base address. If Rm is present, Rn is updated to Rn + Rm after the address is used to access memory. Rm cannot be sp or pc.

Table 10.1 shows all valid combinations of parameters for these instructions. Note that the same vector element (scalar) x must be used in each register. Up to four registers can be specified. If the structure has more than four fields, then these instructions can be used repeatedly to load or store all of the fields.

Table 10.1

Parameter combinations for loading and storing a single structure

<n>	<size>	<list>	<align>	Alignment
1	8	Dd[x]		Standard only
1	16	Dd[x]	16	2 byte
1	32	Dd[x]	32	4 byte
2	8	Dd[x], D(d+1)[x]	16	2 byte
2	16	Dd[x], D(d+1)[x]	32	4 byte
2	16	Dd[x], D(d+2)[x]	32	4 byte
2	32	Dd[x], D(d+1)[x]	64	8 byte
2	32	Dd[x], D(d+2)[x]	64	8 byte
3	8	Dd[x], D(d+1)[x], D(d+2)[x]		Standard only
3	16 or 32	Dd[x], D(d+1)[x], D(d+2)[x]		Standard only
3	16 or 32	Dd[x], D(d+2)[x], D(d+4)[x]		Standard only
4	8	Dd[x], D(d+1)[x], D(d+2)[x], D(d+3)[x]	32	4 byte
4	16	Dd[x], D(d+1)[x], D(d+2)[x], D(d+3)[x]	64	8 byte
4	16	Dd[x], D(d+2)[x], D(d+4)[x], D(d+6)[x]	64	8 byte
4	32	Dd[x], D(d+1)[x], D(d+2)[x], D(d+3)[x]	64 or 128	(<align> ÷ 8) bytes
4	32	Dd[x], D(d+2)[x], D(d+4)[x], D(d+6)[x]	64 or 128	(<align> ÷ 8) bytes
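The rules in Table 10.1 are regular enough to be captured in a short checking function. The sketch below is one possible encoding, not part of any assembler: the function names are hypothetical, a is the register spacing (1 or 2), and align is given in bits with 0 standing for an omitted <align>, in which case standard alignment rules apply.

```c
#include <assert.h>
#include <stdbool.h>

/* Index of the last register in the list {Dd, D(d+a), ...} of n registers. */
static unsigned last_register(unsigned d, unsigned n, unsigned a)
{
    assert(n >= 1 && n <= 4);
    return d + (n - 1) * a;
}

/* Returns true if the combination appears in Table 10.1.
   align == 0 stands for an omitted <align> (standard alignment). */
static bool single_lane_params_valid(unsigned n, unsigned size, unsigned d,
                                     unsigned a, unsigned align)
{
    if (n < 1 || n > 4) return false;
    if (size != 8 && size != 16 && size != 32) return false;
    if (a != 1 && a != 2) return false;
    /* The table lists double spacing only for 16- and 32-bit elements,
       and only when more than one register is named. */
    if (a == 2 && (n == 1 || size == 8)) return false;
    if (last_register(d, n, a) > 31) return false;   /* d0-d31 only */
    if (align == 0) return true;
    if (n == 3) return false;                        /* standard only */
    if (n == 4 && size == 32) return align == 64 || align == 128;
    /* Otherwise the one permitted value is the size of the whole
       structure in bits; a single byte has no alignment option. */
    return align == n * size && align >= 16;
}
```

Apart from the three-register case and the extra 64-bit option for four 32-bit elements, the permitted <align> value is simply the total size of the structure being transferred.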

Operations

Name: vld<n>

Description: Load one or more data items into a single lane of one or more registers.

Effect:

$tmp \leftarrow Rn$
$incr \leftarrow$ (<size> ÷ 8)
for D ∈ regs(<list>) do
    $D[x] \leftarrow Mem[tmp]$
    $tmp \leftarrow tmp + incr$
end for
if ! is present then
    $Rn \leftarrow tmp$
else
    if Rm is specified then
        $Rn \leftarrow Rn + Rm$
    end if
end if

Name: vst<n>

Description: Store one or more data items from a single lane of one or more registers.

Effect:

$tmp \leftarrow Rn$
$incr \leftarrow$ (<size> ÷ 8)
for D ∈ regs(<list>) do
    $Mem[tmp] \leftarrow D[x]$
    $tmp \leftarrow tmp + incr$
end for
if ! is present then
    $Rn \leftarrow tmp$
else
    if Rm is specified then
        $Rn \leftarrow Rn + Rm$
    end if
end if
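The effect of vld<n> on a single lane can also be expressed as a small C model. This is a sketch under several assumptions: the type neon_regs, the enum writeback, and the function load_lane are hypothetical; memory is a byte array indexed by an offset that plays the role of Rn; bytes are copied in memory order (little-endian lanes); and the Rm form sets Rn to Rn + Rm, as stated in the syntax description. The function returns the new value of Rn.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum writeback { WB_NONE, WB_INCREMENT, WB_REGISTER };

/* Hypothetical model: 32 doubleword registers stored as raw bytes. */
typedef struct {
    uint8_t d[32][8];
} neon_regs;

/* Loads `count` consecutive elements of size_bits each from mem[rn...]
   into lane x of registers D(first), D(first+spacing), and so on. */
static size_t load_lane(neon_regs *regs, const uint8_t *mem, size_t rn,
                        unsigned first, unsigned count, unsigned spacing,
                        unsigned size_bits, unsigned x,
                        enum writeback wb, size_t rm)
{
    size_t incr = size_bits / 8;
    size_t tmp = rn;

    assert(size_bits == 8 || size_bits == 16 || size_bits == 32);
    assert(count >= 1 && count <= 4);
    assert(first + (count - 1) * spacing <= 31);
    assert((x + 1) * incr <= 8);          /* lane must lie inside D */

    for (unsigned i = 0; i < count; i++) {
        memcpy(&regs->d[first + i * spacing][x * incr], &mem[tmp], incr);
        tmp += incr;
    }
    if (wb == WB_INCREMENT) return tmp;     /* the "!" form */
    if (wb == WB_REGISTER)  return rn + rm; /* the Rm form  */
    return rn;                              /* no update    */
}
```

In this model, a call with first = 0, count = 3, spacing = 1, size_bits = 8, and x = 2 corresponds to vld3.8 {d0[2], d1[2], d2[2]}, [r0]!: three consecutive bytes are placed in lane 2 of d0, d1, and d2, and the base address advances by three.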


vst<n>