Chapter 10

The ARM NEON Extensions

Abstract

This chapter begins with an overview of the NEON extensions and explains the relationship between VFP and NEON. The NEON registers and the syntax for NEON instructions are then described. Next, each of the NEON instructions is explained, with short examples. In some cases, extended examples and figures are provided to help explain the operation of complex instructions. After all of the instructions have been covered, another implementation of sine is presented and compared with the previous implementations and with the GCC sine function. It is shown that NEON gives a significant performance advantage over VFP, and that hand-coded assembly is much faster than the sin function provided by the compiler.

Keywords

Single instruction multiple data (SIMD); Vector; Vector element; Instruction level parallelism; Lane

The ARM VFP coprocessor has been replaced or augmented by the NEON architecture on ARMv7 and higher systems. NEON extends the VFP instruction set with about 125 instructions and pseudo-instructions to support not only floating point, but also integer and fixed point. NEON also supports Single Instruction, Multiple Data (SIMD) operations. All NEON processors have the full set of 32 double precision VFP registers, but NEON adds the ability to view the register set as 16 128-bit (quadruple-word) registers, named q0 through q15.

A single NEON instruction can operate on up to 128 bits, which may represent multiple integer, fixed point, or floating point numbers. For example, if two of the 128-bit registers each contain eight 16-bit integers, then a single NEON instruction can add all eight integers from one register to the corresponding integers in the other register, resulting in eight simultaneous additions. For certain applications, this SIMD architecture can result in extremely fast and efficient implementations. NEON is particularly useful at handling streaming video and audio, but also can give very good performance on floating point intensive tasks. NEON instructions perform parallel operations on vectors. NEON deprecates the use of VFP vector mode covered in Section 9.2.2. On most NEON systems, using the VFP vector mode will result in an exception, which transfers control to the support code which emulates vector mode in software. This causes a severe performance penalty, so VFP vector mode should not be used on NEON systems.
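The eight simultaneous additions described above can be modeled in a few lines of Python. This is only a sketch of the semantics of a lane-wise integer add such as vadd.i16 on quadword registers; the function name vadd_i16 and the use of Python lists to stand for registers are assumptions made for illustration.

```python
def vadd_i16(a, b):
    """Add corresponding 16-bit lanes; each sum wraps modulo 2**16."""
    assert len(a) == len(b) == 8      # eight 16-bit lanes fill 128 bits
    return [(x + y) & 0xFFFF for x, y in zip(a, b)]

q0 = [1, 2, 3, 4, 5, 6, 7, 0xFFFF]
q1 = [10, 20, 30, 40, 50, 60, 70, 1]
print(vadd_i16(q0, q1))   # the last lane wraps around to 0
```

Each lane is independent: the overflow in the last lane does not carry into its neighbor.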

Fig. 10.1 shows the ARM integer, VFP, and NEON register set. NEON views each register as containing a vector of 1, 2, 4, 8, or 16 elements, all of the same size and type. Individual elements of each vector can also be accessed as scalars. A scalar can be 8 bits, 16 bits, 32 bits, or 64 bits. The instruction syntax is extended to refer to scalars using an index, x, in a doubleword register. Dm[x] is element x in register Dm. The size of the elements is given as part of the instruction. Instructions that access scalars can access any element in the register bank.

Figure 10.1 ARM integer and NEON user program registers.

10.1 NEON Intrinsics

The GCC compiler gives C (and C++) programs direct access to the NEON instructions through the NEON intrinsics. The intrinsics are a large set of functions that are built into the compiler. Most of the intrinsic functions map to one NEON instruction. There are additional functions provided for typecasting (reinterpreting) NEON vectors, so that the C compiler does not complain about mismatched types. It is usually shorter and more efficient to write the NEON code directly as assembly language functions and link them to the C code. However, doing so requires knowledge of assembly language.

10.2 Instruction Syntax

Some instructions require specific register types. Other instructions allow the programmer to choose single word, double word, or quad word registers. If the instruction requires single precision registers, then the registers are specified as Sd for the destination register, Sn for the first operand register, and Sm for the second operand register. If the instruction requires only two registers, then Sn is not used. The lower-case letter is replaced with a valid register number. The register name is not case sensitive, so S10 and s10 are both valid names for single precision register 10.

The syntax of the NEON instructions can be described using a relatively simple notation. The notation consists of the following elements:

{item} Braces around an item indicate that the item is optional. For example, many operations have an optional condition, which is written as {<cond>}.

Ry An ARM integer register. y can be any number in the range 0-15.

Sy A 32-bit or single precision register. y can be any number in the range 0-31.

Dy A 64-bit or double precision register. y can be any number in the range 0-31.

Qy A quad word register. y can be any number in the range 0-15.

Fy A VFP register. F must be either s for a single word register, or d for a double word register. y can be any valid register number.

Ny A NEON or VFP register. N must be either s for a single word register, d for a double word register, or q for a quad word register. y can be any valid register number.

Vy A NEON vector register. V must be replaced with d for a double word register, or q for a quad word register. y can be any valid register number.

Vy[x] A NEON scalar (vector element). The size of the scalar is defined as part of the instruction. V must be replaced with d for a double word register, or q for a quad word register. y can be any valid register number. x specifies which scalar element of Vy is to be used. Valid values for x can be deduced by the size of Vy and the size of the scalars that the instruction uses.

<op> Operation specific part of a general instruction format

<n> An integer usually indicating a specific instruction version

<size> An integer indicating the number of bits used

<cond> ARM condition code from Table 3.2

<type> Many instructions operate on one or more of the following specific data types:

i8 Untyped 8 bits

i16 Untyped 16 bits

i32 Untyped 32 bits

i64 Untyped 64 bits

s8 Signed 8-bit integer

s16 Signed 16-bit integer

s32 Signed 32-bit integer

s64 Signed 64-bit integer

u8 Unsigned 8-bit integer

u16 Unsigned 16-bit integer

u32 Unsigned 32-bit integer

u64 Unsigned 64-bit integer

f16 IEEE 754 half precision floating point

f32 IEEE 754 single precision floating point

f64 IEEE 754 double precision floating point

<list> A brace-delimited list of up to four NEON registers, vectors, or scalars. The general form is {Dn,D(n+a),D(n+2a),D(n+3a)} where a is either 1 or 2.

<align> Specifies the memory alignment of structured data for certain load and store operations.

<imm> An immediate value. The required format for immediate values depends on the instruction.

<fbits> Specifies the number of fraction bits in fixed point numbers.

The following function definitions are used in describing the effects of many of the instructions:

$\lfloor x \rfloor$ The floor function maps a real number, x, to the next smallest integer.

saturate(x) The saturate function limits the value of x to the highest or lowest value that can be stored in the destination register.

round(x) The round function maps a real number, x, to the nearest integer.

narrow(x) The narrow function reduces a 2n bit number to an n bit number, by taking the n least significant bits.

extend(x) The extend function converts an n bit number to a 2n bit number, performing zero extension if the number is unsigned, or sign extension if the number is signed.

10.3 Load and Store Instructions

The NEON load and store instructions can interleave or de-interleave data when structured data is loaded or stored. The data should be properly aligned for best performance. These instructions are very useful for common multimedia data types.

For example, image data is typically stored in arrays of pixels, where each pixel is a small data structure such as the pixel struct shown in Listing 5.37. Since each pixel is three bytes, and a d register is 8 bytes, loading a single pixel into one register would be inefficient. It would be much better to load multiple pixels at once, but a whole number of pixels will not exactly fill a register. It takes three doubleword or quadword registers to hold a whole number of pixels without wasting space, as shown in Fig. 10.2. This is how the data would be arranged if it were loaded using a VFP vldr or vldm instruction. Many image processing operations work best if each color “channel” is processed separately. The NEON load and store vector instructions can be used to split the image data into color channels, where each channel is stored in a different register, as shown in Fig. 10.3.

Figure 10.2 Pixel data interleaved in three doubleword registers.

Figure 10.3 Pixel data de-interleaved in three doubleword registers.

Other examples of interleaved data include stereo audio, which is two interleaved channels, and surround sound, which may have up to nine interleaved channels. In all of these cases, most processing operations are simplified when the data is separated into non-interleaved channels.
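The separation of three-byte pixels into channels can be sketched in Python. The function below models what a three-register structured load of 8-bit elements does with eight pixels; the name vld3_8 and the representation of registers as lists are assumptions made for this illustration, not part of any real library.

```python
def vld3_8(mem, addr=0, lanes=8):
    """Model a de-interleaving load of `lanes` three-byte structures.

    Returns three lists, one per register: field 0 of every structure
    goes to the first register, field 1 to the second, field 2 to the third.
    """
    regs = ([], [], [])
    for x in range(lanes):              # one structure (pixel) per lane
        for f in range(3):              # one field (channel) per register
            regs[f].append(mem[addr + 3 * x + f])
    return regs

# Eight pixels stored as R, G, B, R, G, B, ... (24 bytes in total)
pixels = bytes(v for i in range(8) for v in (i, 100 + i, 200 + i))
red, green, blue = vld3_8(pixels)
print(red, green, blue)
```

After the load, each list holds one color channel, so an operation on the red list touches only red values.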

10.3.1 Load or Store Single Structure Using One Lane

These instructions are used to load and store structured data across multiple registers:

vld<n> Load Structured Data, and

vst<n> Store Structured Data.

They can be used for interleaving or deinterleaving the data as it is loaded or stored, as shown in Fig. 10.3.

Syntax

 v<op><n>.<size> <list>,[Rn{:<align>}]{!}

 v<op><n>.<size> <list>,[Rn{:<align>}],Rm

 <op> must be either ld or st.

 <n> must be one of 1, 2, 3, or 4.

 <size> must be one of 8, 16, or 32.

 <list> specifies the list of registers. There are four list formats:

1. {Dd[x]}

2. {Dd[x], D(d+a)[x]}

3. {Dd[x], D(d+a)[x], D(d+2a)[x]}

4. {Dd[x], D(d+a)[x], D(d+2a)[x], D(d+3a)[x]}

where a can be either 1 or 2. Every register in the list must be in the range d0-d31.

 Rn is the ARM register containing the base address. Rn cannot be pc.

 <align> specifies an optional alignment. If <align> is not specified, then standard alignment rules apply.

 The optional ! indicates that Rn is updated after the data is transferred. This is similar to the ldm and stm instructions.

 Rm is an ARM register containing an offset from the base address. If Rm is present, Rn is updated to Rn + Rm after the address is used to access memory. Rm cannot be sp or pc.

Table 10.1 shows all valid combinations of parameters for these instructions. Note that the same vector element (scalar) x must be used in each register. Up to four registers can be specified. If the structure has more than four fields, then these instructions can be used repeatedly to load or store all of the fields.

Table 10.1

Parameter combinations for loading and storing a single structure

| <n> | <size> | <list> | <align> | Alignment |
|---|---|---|---|---|
| 1 | 8 | Dd[x] | - | Standard only |
| 1 | 16 | Dd[x] | 16 | 2 byte |
| 1 | 32 | Dd[x] | 32 | 4 byte |
| 2 | 8 | Dd[x], D(d+1)[x] | 16 | 2 byte |
| 2 | 16 | Dd[x], D(d+1)[x] | 32 | 4 byte |
| 2 | 16 | Dd[x], D(d+2)[x] | 32 | 4 byte |
| 2 | 32 | Dd[x], D(d+1)[x] | 64 | 8 byte |
| 2 | 32 | Dd[x], D(d+2)[x] | 64 | 8 byte |
| 3 | 8 | Dd[x], D(d+1)[x], D(d+2)[x] | - | Standard only |
| 3 | 16 or 32 | Dd[x], D(d+1)[x], D(d+2)[x] | - | Standard only |
| 3 | 16 or 32 | Dd[x], D(d+2)[x], D(d+4)[x] | - | Standard only |
| 4 | 8 | Dd[x], D(d+1)[x], D(d+2)[x], D(d+3)[x] | 32 | 4 byte |
| 4 | 16 | Dd[x], D(d+1)[x], D(d+2)[x], D(d+3)[x] | 64 | 8 byte |
| 4 | 16 | Dd[x], D(d+2)[x], D(d+4)[x], D(d+6)[x] | 64 | 8 byte |
| 4 | 32 | Dd[x], D(d+1)[x], D(d+2)[x], D(d+3)[x] | 64 or 128 | (<align> ÷ 8) bytes |
| 4 | 32 | Dd[x], D(d+2)[x], D(d+4)[x], D(d+6)[x] | 64 or 128 | (<align> ÷ 8) bytes |

Operations

Name: vld<n>

Effect:

 tmp ← Rn
 incr ← <size> ÷ 8
 for each register D in <list> do
   D[x] ← Mem[tmp]
   tmp ← tmp + incr
 end for
 if ! is present then
   Rn ← tmp
 else
   if Rm is specified then
     Rn ← Rn + Rm
   end if
 end if

Description: Load one or more data items into a single lane of one or more registers.

Name: vst<n>

Effect:

 tmp ← Rn
 incr ← <size> ÷ 8
 for each register D in <list> do
   Mem[tmp] ← D[x]
   tmp ← tmp + incr
 end for
 if ! is present then
   Rn ← tmp
 else
   if Rm is specified then
     Rn ← Rn + Rm
   end if
 end if

Description: Store one or more data items from a single lane of one or more registers.

Examples
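The single-lane load can be modeled in Python as follows. This is a minimal sketch under stated assumptions: registers are represented as lists of lane values, memory is a little-endian bytes object, and the function name vld_lane is hypothetical. The call shown corresponds to a two-register load of 16-bit elements into lane 1, with writeback of the base address.

```python
def vld_lane(regs, x, mem, addr, size):
    """Load one element per register into lane x; other lanes are unchanged.

    regs: list of registers, each a list of lane values
    size: element size in bits
    Returns the address following the last element transferred.
    """
    step = size // 8
    for d in regs:
        d[x] = int.from_bytes(mem[addr:addr + step], "little")
        addr += step
    return addr

d0 = [0, 0, 0, 0]                       # four 16-bit lanes
d1 = [0, 0, 0, 0]
mem = bytes([0x34, 0x12, 0x78, 0x56])   # two little-endian halfwords
end = vld_lane([d0, d1], 1, mem, 0, 16)
print([hex(v) for v in d0], [hex(v) for v in d1], end)
```

Only lane 1 of each register changes, which is what distinguishes this form from the forms that load every lane.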


10.3.2 Load Copies of a Structure to All Lanes

This instruction is used to load multiple copies of structured data across multiple registers:

vld<n> Load Copies of Structured Data.

The data is copied to all lanes. This instruction is useful for initializing vectors for use in later instructions.

Syntax

 vld<n>.<size> <list>,[Rn{:<align>}]{!}

 vld<n>.<size> <list>,[Rn{:<align>}],Rm

 <n> must be one of 1, 2, 3, or 4.

 <size> must be one of 8, 16, or 32.

 <list> specifies the list of registers. There are four list formats:

1. {Dd[]}

2. {Dd[], D(d+a)[]}

3. {Dd[], D(d+a)[], D(d+2a)[]}

4. {Dd[], D(d+a)[], D(d+2a)[], D(d+3a)[]}

where a can be either 1 or 2. Every register in the list must be in the range d0-d31.

 Rn is the ARM register containing the base address. Rn cannot be pc.

 <align> specifies an optional alignment. If <align> is not specified, then standard alignment rules apply.

 The optional ! indicates that Rn is updated after the data is transferred. This is similar to the ldm and stm instructions.

 Rm is an ARM register containing an offset from the base address. If Rm is present, Rn is updated to Rn + Rm after the address is used to access memory. Rm cannot be sp or pc.

Table 10.2 shows all valid combinations of parameters for this instruction. Note that the vector element number is not specified, but the brackets [] must be present. Up to four registers can be specified. If the structure has more than four fields, then this instruction can be repeated to load all of the fields.

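Table 10.2

Parameter combinations for loading copies of a structure to all lanes

| <n> | <size> | <list> | <align> | Alignment |
|---|---|---|---|---|
| 1 | 8 | Dd[] | - | Standard only |
| 1 | 8 | Dd[], D(d+1)[] | - | Standard only |
| 1 | 16 | Dd[] | 16 | 2 byte |
| 1 | 16 | Dd[], D(d+1)[] | 16 | 2 byte |
| 1 | 32 | Dd[] | 32 | 4 byte |
| 1 | 32 | Dd[], D(d+1)[] | 32 | 4 byte |
| 2 | 8 | Dd[], D(d+1)[] | 8 | 1 byte |
| 2 | 8 | Dd[], D(d+2)[] | 8 | 1 byte |
| 2 | 16 | Dd[], D(d+1)[] | 16 | 2 byte |
| 2 | 16 | Dd[], D(d+2)[] | 16 | 2 byte |
| 2 | 32 | Dd[], D(d+1)[] | 32 | 4 byte |
| 2 | 32 | Dd[], D(d+2)[] | 32 | 4 byte |
| 3 | 8, 16, or 32 | Dd[], D(d+1)[], D(d+2)[] | - | Standard only |
| 3 | 8, 16, or 32 | Dd[], D(d+2)[], D(d+4)[] | - | Standard only |
| 4 | 8 | Dd[], D(d+1)[], D(d+2)[], D(d+3)[] | 32 | 4 byte |
| 4 | 8 | Dd[], D(d+2)[], D(d+4)[], D(d+6)[] | 32 | 4 byte |
| 4 | 16 | Dd[], D(d+1)[], D(d+2)[], D(d+3)[] | 64 | 8 byte |
| 4 | 16 | Dd[], D(d+2)[], D(d+4)[], D(d+6)[] | 64 | 8 byte |
| 4 | 32 | Dd[], D(d+1)[], D(d+2)[], D(d+3)[] | 64 or 128 | (<align> ÷ 8) bytes |
| 4 | 32 | Dd[], D(d+2)[], D(d+4)[], D(d+6)[] | 64 or 128 | (<align> ÷ 8) bytes |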

Operations

Name: vld<n>

Effect:

 tmp ← Rn
 incr ← <size> ÷ 8
 nlanes ← 64 ÷ <size>
 for each register D in <list> do
   for 0 ≤ x < nlanes do
     D[x] ← Mem[tmp]
   end for
   tmp ← tmp + incr
 end for
 if ! is present then
   Rn ← tmp
 else
   if Rm is specified then
     Rn ← Rn + Rm
   end if
 end if

Description: Load one data item for each register in the list, and copy it into all lanes of that register.

Examples


10.3.3 Load or Store Multiple Structures

These instructions are used to load and store multiple data structures across multiple registers with interleaving or deinterleaving:

vld<n> Load Multiple Structured Data, and

vst<n> Store Multiple Structured Data.

Syntax

 v<op><n>.<size> <list>,[Rn{:<align>}]{!}

 v<op><n>.<size> <list>,[Rn{:<align>}],Rm

 <op> must be either ld or st.

 <n> must be one of 1, 2, 3, or 4.

 <size> must be one of 8, 16, or 32.

 <list> specifies the list of registers. There are four list formats:

1. {Dd}

2. {Dd, D(d+a)}

3. {Dd, D(d+a), D(d+2a)}

4. {Dd, D(d+a), D(d+2a), D(d+3a)}

where a can be either 1 or 2. Every register in the list must be in the range d0-d31.

 Rn is the ARM register containing the base address. Rn cannot be pc.

 <align> specifies an optional alignment. If <align> is not specified, then standard alignment rules apply.

 The optional ! indicates that Rn is updated after the data is transferred, similar to the ldm and stm instructions.

 Rm is an ARM register containing an offset from the base address. If Rm is present, Rn is updated to Rn + Rm after the address is used to access memory. Rm cannot be sp or pc.

Table 10.3 shows all valid combinations of parameters for these instructions. Note that no scalar index is specified: the instructions operate on all elements of each vector. Up to four registers can be specified. If the structure has more than four fields, then these instructions can be repeated to load or store all of the fields.

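Table 10.3

Parameter combinations for loading and storing multiple structures

| <n> | <size> | <list> | <align> | Alignment |
|---|---|---|---|---|
| 1 | 8, 16, 32, or 64 | Dd | 64 | 8 bytes |
| 1 | 8, 16, 32, or 64 | Dd, D(d+1) | 64 or 128 | (<align> ÷ 8) bytes |
| 1 | 8, 16, 32, or 64 | Dd, D(d+1), D(d+2) | 64 | 8 bytes |
| 1 | 8, 16, 32, or 64 | Dd, D(d+1), D(d+2), D(d+3) | 64, 128, or 256 | (<align> ÷ 8) bytes |
| 2 | 8, 16, or 32 | Dd, D(d+1) | 64 or 128 | (<align> ÷ 8) bytes |
| 2 | 8, 16, or 32 | Dd, D(d+2) | 64 or 128 | (<align> ÷ 8) bytes |
| 2 | 8, 16, or 32 | Dd, D(d+1), D(d+2), D(d+3) | 64, 128, or 256 | (<align> ÷ 8) bytes |
| 3 | 8, 16, or 32 | Dd, D(d+1), D(d+2) | 64 | 8 bytes |
| 3 | 8, 16, or 32 | Dd, D(d+2), D(d+4) | 64 | 8 bytes |
| 4 | 8, 16, or 32 | Dd, D(d+1), D(d+2), D(d+3) | 64, 128, or 256 | (<align> ÷ 8) bytes |
| 4 | 8, 16, or 32 | Dd, D(d+2), D(d+4), D(d+6) | 64, 128, or 256 | (<align> ÷ 8) bytes |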

Operations

Name: vld<n>

Effect:

 tmp ← Rn
 incr ← <size> ÷ 8
 nlanes ← 64 ÷ <size>
 for 0 ≤ x < nlanes do
   for each register D in <list> do
     D[x] ← Mem[tmp]
     tmp ← tmp + incr
   end for
 end for
 if ! is present then
   Rn ← tmp
 else
   if Rm is specified then
     Rn ← Rn + Rm
   end if
 end if

Description: Load multiple structures, placing consecutive fields of each structure into the same lane of consecutive registers (de-interleaving).

Name: vst<n>

Effect:

 tmp ← Rn
 incr ← <size> ÷ 8
 nlanes ← 64 ÷ <size>
 for 0 ≤ x < nlanes do
   for each register D in <list> do
     Mem[tmp] ← D[x]
     tmp ← tmp + incr
   end for
 end for
 if ! is present then
   Rn ← tmp
 else
   if Rm is specified then
     Rn ← Rn + Rm
   end if
 end if

Description: Store multiple structures, taking consecutive fields of each structure from the same lane of consecutive registers (interleaving).

Examples


10.4 Data Movement Instructions

Because they use the same set of registers, VFP and NEON share some instructions for loading, storing, and moving registers. The shared instructions are vldr, vstr, vldm, vstm, vpop, vpush, vmov, vmrs, and vmsr. These were explained in Chapter 9. NEON extends the vmov instructions to allow specification of NEON scalars and quadwords, and adds the ability to perform one’s complement during a move.

10.4.1 Moving Between NEON Scalar and Integer Register

This version of the move instruction allows data to be moved between the NEON registers and the ARM integer registers as 8-bit, 16-bit, or 32-bit NEON scalars:

vmov Move Between NEON and ARM.

Syntax

 vmov{<cond>}.<size> Dn[x],Rd

 vmov{<cond>}.<type> Rd,Dn[x]

 <cond> is an optional condition code.

 <size> must be 8, 16, or 32, and specifies the number of bits that are to be moved.

 The <type> must be u8, u16, u32, s8, s16, s32, or f32, and specifies the number of bits that are to be moved and whether or not the result should be sign-extended in the ARM integer destination register.

Operations

| Name | Effect | Description |
|---|---|---|
| vmov Dn[x],Rd | Dn[x] ← Rd | Move the least significant <size> bits of Rd to NEON scalar Dn[x]. |
| vmov Rd,Dn[x] | Rd ← Dn[x] | Move NEON scalar Dn[x] to Rd, sign-extending or zero-extending it as specified by <type>. |

Examples
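The effect of the signed and unsigned types when moving a scalar to an ARM register can be modeled in Python. In this sketch a doubleword register is represented as a 64-bit integer, and the function name vmov_to_arm is hypothetical.

```python
def vmov_to_arm(d, x, type_):
    """Model vmov.<type> Rd, Dn[x]: return the 32-bit value placed in Rd.

    d     : contents of the doubleword register as a 64-bit integer
    x     : scalar index
    type_ : "u8", "u16", "u32", "s8", "s16", "s32", or "f32"
    """
    signed = type_[0] == "s"
    size = int(type_[1:])
    lane = (d >> (x * size)) & ((1 << size) - 1)
    if signed and lane >> (size - 1):      # negative: sign-extend
        lane -= 1 << size
    return lane & 0xFFFFFFFF

d0 = 0x12345678000080FF
print(hex(vmov_to_arm(d0, 0, "u8")))    # 0xff
print(hex(vmov_to_arm(d0, 0, "s8")))    # 0xffffffff
print(hex(vmov_to_arm(d0, 0, "s16")))   # 0xffff80ff
print(hex(vmov_to_arm(d0, 1, "u32")))   # 0x12345678
```

The same 8 bits give two different results in Rd, depending only on whether the type is signed.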


10.4.2 Move Immediate Data

NEON extends the VFP vmov instruction to include the ability to move an immediate value, or the one’s complement of an immediate value, to every element of a register. The instructions are:

vmov Move Immediate, and

vmvn Move Immediate NOT.

Syntax

 v<op>.<type> Vd, #<imm>

 <op> must be either mov or mvn.

 <type> must be i8, i16, i32, f32, or i64, and specifies the size of items in the vector.

 V can be s, d, or q.

 <imm> is an immediate value that matches <type>, and is copied to every element in the vector. The following table shows valid formats for <imm>:

| <type> | vmov | vmvn |
|---|---|---|
| i8 | 0xXY | 0xXY |
| i16 | 0x00XY | 0xFFXY |
| i16 | 0xXY00 | 0xXYFF |
| i32 | 0x000000XY | 0xFFFFFFXY |
| i32 | 0x0000XY00 | 0xFFFFXYFF |
| i32 | 0x00XY0000 | 0xFFXYFFFF |
| i32 | 0xXY000000 | 0xXYFFFFFF |
| i64 | 0xABCDEFGH | 0xABCDEFGH |

For i64, each letter represents a byte, and must be either FF or 00.

For f32, the immediate can be any number that can be written as $\pm n \times 2^{-r}$, where n and r are integers, such that 16 ≤ n ≤ 31 and 0 ≤ r ≤ 7.

Operations

| Name | Effect | Description |
|---|---|---|
| vmov | Vd[] ← immed | Copy immediate value to all elements of Vd. |
| vmvn | Vd[] ← ¬immed | Copy one’s complement of immediate value to all elements of Vd. |

Examples
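The rule for f32 immediates is easier to apply with a small check. The following Python sketch tests whether a value has the form ± n × 2^(−r) with 16 ≤ n ≤ 31 and 0 ≤ r ≤ 7; the function name is an assumption made for illustration.

```python
def is_vmov_f32_immediate(value):
    """True if value == +/- n * 2**(-r) with 16 <= n <= 31 and 0 <= r <= 7."""
    return any(abs(value) == n * 2.0 ** -r
               for n in range(16, 32) for r in range(8))

for v in (1.0, 0.5, 31.0, 0.125, 0.0, 0.1, 32.0):
    print(v, is_vmov_f32_immediate(v))
```

For example, 1.0 is accepted because it equals 16 × 2^(−4), while 0.0 and 0.1 cannot be written in the required form. Under this rule the smallest magnitude is 0.125 and the largest is 31.0.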


10.4.3 Change Size of Elements in a Vector

It is sometimes useful to increase or decrease the number of bits per element in a vector. NEON provides these instructions to convert a doubleword vector with elements of size y to a quadword vector with size 2y, or to perform the inverse operation:

vmovl Move and Lengthen,

vmovn Move and Narrow,

vqmovn Saturating Move and Narrow, and

vqmovun Saturating Move and Narrow Unsigned.

Syntax

 vmovl.<type> Qd, Dm

 v{q}movn.<type> Dd, Qm

 vqmovun.<type> Dd, Qm

 The valid choices for <type> are given in the following table:

| Opcode | Valid Types |
|---|---|
| vmovl | s8, s16, s32, u8, u16, or u32 |
| vmovn | i8, i16, or i32 |
| vqmovn | s8, s16, s32, u8, u16, or u32 |
| vqmovun | s8, s16, or s32 |

 q indicates that the results are saturated.

Operations

Name: vmovl

Effect:

 for 0 ≤ i < (64 ÷ size) do
   Qd[i] ← extend(Dm[i])
 end for

Description: Sign or zero extend (depending on <type>) each element of a doubleword vector to twice its length.

Name: v{q}movn

Effect:

 for 0 ≤ i < (64 ÷ size) do
   if q is present then
     Dd[i] ← saturate(Qm[i])
   else
     Dd[i] ← narrow(Qm[i])
   end if
 end for

Description: Copy the least significant half of each element of a quadword vector to the corresponding element of a doubleword vector. If q is present, then the value is saturated.

Name: vqmovun

Effect:

 for 0 ≤ i < (64 ÷ size) do
   Dd[i] ← saturate(Qm[i]), using the unsigned range of the destination
 end for

Description: Copy each element of the operand vector to the corresponding element of the destination vector. The destination element is unsigned, and the value is saturated.

Examples
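The difference between plain narrowing and saturating narrowing can be seen in a short Python model. The function names follow the instruction names, but the code is only a sketch: vector elements are Python integers, and the bits parameter (the width of each destination element) is an assumption of this model rather than assembler syntax.

```python
def saturate(v, bits, signed):
    """Clamp v to the range of a signed or unsigned `bits`-bit integer."""
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    return max(lo, min(hi, v))

def vmovn(q, bits):
    """Narrow by keeping the `bits` least significant bits of each element."""
    return [v & ((1 << bits) - 1) for v in q]

def vqmovn(q, bits, signed=True):
    """Narrow with saturation to the signed (or unsigned) destination range."""
    return [saturate(v, bits, signed) for v in q]

def vqmovun(q, bits):
    """Narrow signed elements with saturation to the unsigned range."""
    return [saturate(v, bits, False) for v in q]

q = [300, -300, 100, -1]          # signed 16-bit source elements
print(vmovn(q, 8))                # [44, 212, 100, 255]
print(vqmovn(q, 8))               # [127, -128, 100, -1]
print(vqmovun(q, 8))              # [255, 0, 100, 0]
```

Plain narrowing silently discards the upper bits, so 300 becomes 44. The saturating forms clamp out-of-range values to the nearest representable value instead.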


10.4.4 Duplicate Scalar

The duplicate instruction copies a scalar into every element of the destination vector. The scalar can be in a NEON register or an ARM integer register. The instruction is:

vdup Duplicate Scalar.

Syntax

 vdup.<size> Vd, Rm

 vdup.<size> Vd, Dm[x]

 <size> must be one of 8, 16 or 32.

 V can be d or q.

 Rm cannot be r15.

Operations

| Name | Effect | Description |
|---|---|---|
| vdup.<size> | Vd[] ← Rm | Copy <size> least significant bits of Rm to all elements of Vd. |
| vdup.<size> | Vd[] ← Dm[x] | Copy element x of Dm to all elements of Vd. |

Examples


10.4.5 Extract Elements

This instruction extracts 8-bit elements from two vectors and concatenates them. Fig. 10.4 gives an example of what this instruction does. The instruction is:

Figure 10.4 Example of vext.8 d12,d4,d9,#5.

vext Extract Elements.

Syntax

 vext.<size> Vd, Vn, Vm, #<imm>

 <size> must be one of 8, 16, 32, or 64.

 V can be d or q.

 <imm> is the number of elements to extract from the bottom of Vm. The remaining elements required to fill Vd are taken from the top of Vn.

Operation

Name: vext

Effect:

 if V is double then
   size ← 8
 else
   size ← 16
 end if
 for 0 ≤ i < imm do
   Vd[i + size − imm] ← Vm[i]
 end for
 for imm ≤ i < size do
   Vd[i − imm] ← Vn[i]
 end for

Description: Concatenate the top of the first operand to the bottom of the second operand. The pseudocode is written for 8-bit elements, so size is the number of bytes in the vector.

Examples
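Because the result is simply the top of Vn followed by the bottom of Vm, the operation reduces to two list slices in Python. This is a sketch of the semantics only; vectors are lists with element 0 first, and the register contents used here are made up for illustration.

```python
def vext(vn, vm, imm):
    """Top (len - imm) elements of vn, followed by the bottom imm elements of vm."""
    return vn[imm:] + vm[:imm]

d4 = [0, 1, 2, 3, 4, 5, 6, 7]
d9 = [8, 9, 10, 11, 12, 13, 14, 15]
d12 = vext(d4, d9, 5)       # models vext.8 d12,d4,d9,#5
print(d12)                  # [5, 6, 7, 8, 9, 10, 11, 12]
```

Viewed another way, the instruction extracts a window of one vector's length from the concatenation of the two operands, starting at element imm.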


10.4.6 Reverse Elements

This instruction reverses the order of data in a register:

vrev Reverse Elements.

One use of this instruction is for converting data from big-endian to little-endian order, or from little-endian to big-endian order. It could also be useful for swapping data and transforming matrices. Fig. 10.5 shows three examples.

Figure 10.5 Examples of the vrev instruction. (A) vrev16.8 d3,d4; (B) vrev32.16 d8,d9; (C) vrev32.8 d5,d7.

Syntax

 vrev<n>.<size> Vd, Vm

 <n> can be 16, 32, or 64.

 <size> is either 8, 16, or 32 and indicates the size of the elements to be reversed. <size> must be less than <n>.

 V can be q or d.

Operation

Name: vrev

Effect:

 n ← number of groups in the vector
 g ← number of elements in each group
 for 0 ≤ i < n do
   for 0 ≤ j < g do
     Vd[i × g + j] ← Vm[i × g + (g − j − 1)]
   end for
 end for

Description: Reverse the order of elements of <size> bits within every element of <n> bits.

Examples
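The grouping performed by vrev can be modeled in Python by reversing fixed-size slices of a list. This is a sketch of the semantics; the function signature is an assumption, with n and size playing the roles of <n> and <size> in the syntax above.

```python
def vrev(v, n, size):
    """Reverse the elements of `size` bits within each group of `n` bits."""
    g = n // size                        # elements per group
    out = []
    for i in range(0, len(v), g):
        out.extend(reversed(v[i:i + g]))
    return out

print(vrev(list(range(8)), 16, 8))   # vrev16.8:  [1, 0, 3, 2, 5, 4, 7, 6]
print(vrev([0, 1, 2, 3], 32, 16))    # vrev32.16: [1, 0, 3, 2]
print(vrev(list(range(8)), 32, 8))   # vrev32.8:  [3, 2, 1, 0, 7, 6, 5, 4]
```

The third case is the byte swap needed to change the endianness of each 32-bit word in the vector.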


10.4.7 Swap Vectors

This instruction simply swaps two NEON registers:

vswp Swap Vectors.

Syntax

 vswp{.<type>} Vd, Vm

 <type> can be any NEON data type. The assembler ignores the type, but it can be useful to the programmer as extra documentation.

 V can be q or d.

Operation

| Name | Effect | Description |
|---|---|---|
| vswp | Vd ← Vm; Vm ← Vd | Swap registers |

Examples


10.4.8 Transpose Matrix

This instruction transposes 2 × 2 matrices:

vtrn Transpose Matrix.

Fig. 10.6 shows two examples of this instruction. Larger matrices can be transposed using a divide-and-conquer approach.

Figure 10.6 Examples of the vtrn instruction. (A) vtrn.8 d14,d15; (B) vtrn.32 d31,d15.

Syntax

 vtrn.<size> Vd, Vm

 <size> is either 8, 16, or 32 and indicates the size of the elements in the matrix (or matrices).

 V can be q or d.

Operation

Name: vtrn

Effect:

 n ← number of elements
 for 0 ≤ i < n by 2 do
   tmp ← Vm[i]
   Vm[i] ← Vd[i + 1]
   Vd[i + 1] ← tmp
 end for

Description: Treat two vectors as an array of 2 × 2 matrices and transpose them.

Examples


Fig. 10.7 shows how the vtrn instruction can be used to transpose a 3 × 3 matrix. Transposing a 4 × 4 matrix requires the transposition of 13 2 × 2 matrices. However, this instruction can operate on multiple 2 × 2 sub-matrices in parallel, and can group elements into different sized sub-matrices. There is also a very useful swap instruction that can exchange the rows of a matrix. Using the swap and transpose instructions, transposing a 4 × 4 matrix of 32-bit elements can be done with only four instructions, as shown in Fig. 10.8.

Figure 10.7 Transpose of a 3 × 3 matrix.
Figure 10.8 Transpose of a 4 × 4 matrix of 32-bit numbers.
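A four-instruction transpose of a 4 × 4 matrix of 32-bit elements can be checked with a Python model. The sketch below assumes that the rows are held in q0 through q3, so that d1 is the upper half of q0, d4 is the lower half of q2, d3 is the upper half of q1, and d6 is the lower half of q3. The particular instruction sequence in the comments is one plausible choice and is not taken from the figure.

```python
def vtrn(d, m):
    """Model vtrn: exchange d[i + 1] with m[i] for every even i (in place)."""
    for i in range(0, len(d), 2):
        d[i + 1], m[i] = m[i], d[i + 1]

def transpose4x4(rows):
    """Transpose four rows of four 32-bit elements using two vtrn and two vswp."""
    q = [list(r) for r in rows]                   # q[0]..q[3] model q0..q3
    vtrn(q[0], q[1])                              # vtrn.32 q0, q1
    vtrn(q[2], q[3])                              # vtrn.32 q2, q3
    q[0][2:], q[2][:2] = q[2][:2], q[0][2:]       # vswp d1, d4
    q[1][2:], q[3][:2] = q[3][:2], q[1][2:]       # vswp d3, d6
    return q

m = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
for row in transpose4x4(m):
    print(row)
```

The two vtrn steps transpose the four 2 × 2 blocks in place, and the two swaps then exchange the off-diagonal blocks.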

10.4.9 Table Lookup

The table lookup instructions use indices held in one vector to lookup values from a table held in one or more other vectors. The resulting values are stored in the destination vector. The table lookup instructions are:

vtbl Table Lookup, and

vtbx Table Lookup with Extend.

Syntax

 v<op>.8 Dd, <list>, Dm

 <op> is one of tbl or tbx

 <list> specifies the list of registers. There are five list formats:

1. {Dn},

2. {Dn, D(n+1)},

3. {Dn, D(n+1), D(n+2)},

4. {Dn, D(n+1), D(n+2), D(n+3)}, or

5. {Qn, Q(n+1)}.

 Dm is the register holding the indices.

 The table can contain up to 32 bytes.

Operations

Name: vtbl

Effect:

 Minr ← first register
 Maxr ← last register
 for 0 ≤ i < 8 do
   r ← Minr + ⌊Dm[i] ÷ 8⌋
   if r > Maxr then
     Dd[i] ← 0
   else
     e ← Dm[i] mod 8
     Dd[i] ← Dr[e]
   end if
 end for

Description: Use indices Dm to look up values in a table and store them in Dd. If the index is out of range, zero is stored in the corresponding destination.

Name: vtbx

Effect:

 Minr ← first register
 Maxr ← last register
 for 0 ≤ i < 8 do
   r ← Minr + ⌊Dm[i] ÷ 8⌋
   if r ≤ Maxr then
     e ← Dm[i] mod 8
     Dd[i] ← Dr[e]
   end if
 end for

Description: Use indices Dm to look up values in a table and store them in Dd. If the index is out of range, the corresponding destination is unchanged.

Examples
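The only difference between the two lookup instructions is what happens to an out-of-range index, which the following Python model makes explicit. In this sketch the table registers are flattened into a single list of bytes; the function signatures are assumptions made for illustration.

```python
def vtbl(table, idx):
    """Look up each index in table; out-of-range indices produce zero."""
    return [table[i] if i < len(table) else 0 for i in idx]

def vtbx(dd, table, idx):
    """Look up each index in table; out-of-range indices keep the old value of dd."""
    return [table[i] if i < len(table) else old for old, i in zip(dd, idx)]

table = list(range(100, 116))            # a 16-byte table in two d registers
idx = [0, 15, 16, 3, 255, 8, 1, 2]       # indices 16 and 255 are out of range
print(vtbl(table, idx))                  # [100, 115, 0, 103, 0, 108, 101, 102]
print(vtbx([9] * 8, table, idx))         # [100, 115, 9, 103, 9, 108, 101, 102]
```

Because vtbx leaves unmatched elements alone, tables larger than 32 bytes can be handled by a vtbl followed by one or more vtbx instructions, each using a different part of the table and suitably adjusted indices.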


10.4.10 Zip or Unzip Vectors

These instructions are used to interleave or deinterleave the data from two vectors:

vzip Zip Vectors, and

vuzp Unzip Vectors.

Fig. 10.9 gives an example of the vzip instruction. The vuzp instruction performs the inverse operation.

Figure 10.9 Example of vzip.8 d9,d4.

Syntax

 v<op>.<size> Vd, Vm

 <op> is either zip or uzp.

 <size> is either 8, 16, or 32 and indicates the size of the elements in the vectors.

 V can be q or d.

Operations

Name: vzip

Effect:

 n ← number of elements in each vector
 for 0 ≤ i < n do
   tmp[2 × i] ← Vd[i]
   tmp[2 × i + 1] ← Vm[i]
 end for
 Vd ← tmp[0 … n − 1]
 Vm ← tmp[n … 2n − 1]

Description: Interleave data from two vectors. tmp is a vector of suitable size.

Name: vuzp

Effect:

 n ← number of elements in each vector
 tmp ← Vd followed by Vm
 for 0 ≤ i < n do
   Vd[i] ← tmp[2 × i]
   Vm[i] ← tmp[2 × i + 1]
 end for

Description: Deinterleave data from two vectors. tmp is a vector of suitable size.

Examples
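The zip and unzip operations can be modeled in Python with list slicing. This is a sketch of the semantics, with vectors as lists; each function returns the new contents of Vd and Vm as a pair instead of modifying its arguments.

```python
def vzip(d, m):
    """Interleave d and m; return (new Vd, new Vm)."""
    z = [e for pair in zip(d, m) for e in pair]   # d0, m0, d1, m1, ...
    return z[:len(d)], z[len(d):]

def vuzp(d, m):
    """Deinterleave d followed by m; return (new Vd, new Vm)."""
    z = d + m
    return z[0::2], z[1::2]                       # even elements, odd elements

d, m = vzip([0, 1, 2, 3], [10, 11, 12, 13])
print(d, m)            # [0, 10, 1, 11] [2, 12, 3, 13]
print(vuzp(d, m))      # ([0, 1, 2, 3], [10, 11, 12, 13])
```

Applying vuzp to the result of vzip restores the original vectors, which confirms that the two operations are inverses.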


10.5 Data Conversion

When high precision is not required, the IEEE half-precision format can be used to store floating point numbers in memory. This can reduce memory requirements by up to 50%. It can also result in a significant performance improvement, since only half as much data needs to be moved between the CPU and main memory. However, on most processors half-precision data must be converted to single precision before it is used in calculations. NEON provides enhanced versions of the vcvt instruction which support conversion to and from IEEE half precision. There are also versions of vcvt which operate on vectors, and perform integer or fixed-point to floating-point conversions.

10.5.1 Convert Between Fixed Point and Single-Precision

This instruction can be used to perform a data conversion between single precision and fixed point on each element in a vector:

vcvt Convert Data Format.

Each element in the vector must be either a 32-bit single precision floating point number or a 32-bit integer. Fixed point (or integer) arithmetic operations are up to twice as fast as floating point operations. In some cases it is much more efficient to make this conversion, perform the calculations, then convert the results back to floating point.

Syntax

 vcvt{<cond>}.<type>.f32 Sd, Sm{, #<fbits>}

 vcvt{<cond>}.f32.<type> Sd, Sm{, #<fbits>}

 <cond> is an optional condition code.

 <type> must be either s32 or u32.

 The optional <fbits> operand specifies the number of fraction bits for a fixed point number, and must be between 0 and 32. If it is omitted, then it is assumed to be zero.

Operations

| Name | Effect | Description |
|---|---|---|
| vcvt.s32.f32 | Fd[] ← fixed(Fm[]) | Convert single precision to 32-bit signed fixed point or integer. |
| vcvt.u32.f32 | Fd[] ← ufixed(Fm[]) | Convert single precision to 32-bit unsigned fixed point or integer. |
| vcvt.f32.s32 | Fd[] ← single(Fm[]) | Convert signed 32-bit fixed point or integer to single precision. |
| vcvt.f32.u32 | Fd[] ← single(Fm[]) | Convert unsigned 32-bit fixed point or integer to single precision. |

Examples
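The role of <fbits> in these conversions can be illustrated with a Python model. The sketch assumes that the conversion to fixed point truncates toward zero and saturates at the limits of the 32-bit destination; the function names are hypothetical, and the conversion back to floating point is computed in Python's double precision, not single precision.

```python
def f32_to_fixed(x, fbits=0, signed=True):
    """Convert x to 32-bit fixed point with fbits fraction bits."""
    if signed:
        lo, hi = -(1 << 31), (1 << 31) - 1
    else:
        lo, hi = 0, (1 << 32) - 1
    return max(lo, min(hi, int(x * (1 << fbits))))   # int() truncates toward zero

def fixed_to_f32(v, fbits=0):
    """Convert a fixed point value with fbits fraction bits to floating point."""
    return v / (1 << fbits)

print(hex(f32_to_fixed(1.75, 16)))          # 0x1c000
print(fixed_to_f32(0x1C000, 16))            # 1.75
print(f32_to_fixed(-0.5))                   # 0
print(f32_to_fixed(3e9))                    # 2147483647 (saturated)
```

With 16 fraction bits, the value 1.75 is stored as the integer 1.75 × 2^16, so ordinary integer addition and subtraction can be applied to it directly.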


10.5.2 Convert Between Half-Precision and Single-Precision

NEON systems with the half-precision extension provide the following instruction to perform conversion between single precision and half precision floating point formats:

vcvt Convert Between Half and Single.

Syntax

 vcvt<op>{<cond>}.f16.f32 Sd, Sm

 vcvt<op>{<cond>}.f32.f16 Sd, Sm

 The <op> must be either b or t and specifies whether the bottom or top half of the register should be used for the half-precision number.

 <cond> is an optional condition code.

Operations

| Name | Effect | Description |
|---|---|---|
| vcvtb.f16.f32 | Sd ← half(Sm) | Convert single precision to half precision and store in bottom half of destination. |
| vcvtt.f16.f32 | Sd ← half(Sm) | Convert single precision to half precision and store in top half of destination. |
| vcvtb.f32.f16 | Sd ← single(Sm) | Convert half precision number from bottom half of source to single precision. |
| vcvtt.f32.f16 | Sd ← single(Sm) | Convert half precision number from top half of source to single precision. |

Examples
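Python's struct module supports the IEEE half-precision format (format character e), so the packing of two half-precision numbers into one 32-bit register can be modeled directly. This is a sketch only: the function names are hypothetical, a single precision register is represented as a 32-bit integer, and the model assumes that the half of the destination that is not written keeps its previous contents.

```python
import struct

def f32_to_f16_bits(x):
    """Return the 16-bit IEEE half-precision encoding of x."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def f16_bits_to_f32(h):
    """Return the value of a 16-bit IEEE half-precision encoding."""
    return struct.unpack("<e", struct.pack("<H", h))[0]

def vcvtb_f16_f32(sd, x):
    """Store half(x) in the bottom 16 bits of the 32-bit register value sd."""
    return (sd & 0xFFFF0000) | f32_to_f16_bits(x)

def vcvtt_f16_f32(sd, x):
    """Store half(x) in the top 16 bits of the 32-bit register value sd."""
    return (sd & 0x0000FFFF) | (f32_to_f16_bits(x) << 16)

s0 = vcvtb_f16_f32(0, 1.0)
s0 = vcvtt_f16_f32(s0, -2.0)
print(hex(s0))                                                # 0xc0003c00
print(f16_bits_to_f32(s0 & 0xFFFF), f16_bits_to_f32(s0 >> 16))
```

Two half-precision values share one single precision register, which is where the 50% saving in storage comes from.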


10.6 Comparison Operations

NEON adds the ability to perform integer comparisons between vectors. Since there are multiple pairs of items to be compared, the comparison instructions set one element in a result vector for each pair of items. After the comparison operation, each element of the result vector will have every bit set to zero (for false) or one (for true). Note that if the elements of the result vector are interpreted as signed two’s-complement numbers, then the value 0 represents false and the value − 1 represents true.

10.6.1 Vector Compare

The following instructions perform comparisons of all of the corresponding elements of two vectors in parallel:

vceq Compare Equal,

vcge Compare Greater Than or Equal,

vcgt Compare Greater Than,

vcle Compare Less Than or Equal, and

vclt Compare Less Than.

The vector compare instructions compare each element of a vector with the corresponding element in a second vector, and set an element in the destination vector for each comparison. If the comparison is true, then all bits in the result element are set to one. Otherwise, all bits in the result element are set to zero. Note that summing the elements of the result vector (as signed integers) will give the two’s complement (that is, the negation) of the number of comparisons which were true.

Note: when both source operands are registers, vcle and vclt are actually pseudo-instructions. vcle is equivalent to vcge with the operands reversed, and vclt is equivalent to vcgt with the operands reversed.

Syntax

 vc<op>.<type> Vd, Vn, Vm

 vc<op>.<type> Vd, Vn, #0

 <op> must be one of eq, ge, gt, le, or lt.

 If <op> is eq, then <type> must be i8, i16, i32, or f32.

 If <op> is not eq and the third operand is #0, then <type> must be s8, s16, s32, or f32.

 If <op> is not eq and the third operand is a register, then <type> must be s8, s16, s32, u8, u16, u32, or f32.

 The result data type is determined from the following table:

| Operand Type | Result Type |
|---|---|
| i32, s32, u32, or f32 | i32 |
| i16, s16, or u16 | i16 |
| i8, s8, or u8 | i8 |

 If the third operand is #0, then it is taken to be a vector of the correct size in which every element is zero.

 V can be d or q.

Operations

Name: vc<op>

Effect:

 for i ∈ vector_length do
  if Fn[i] <op> Rop[i] then
   $Fd[i] \leftarrow 111\ldots1$
  else
   $Fd[i] \leftarrow 000\ldots0$
  end if
 end for

Description: Compare each scalar in Fn to the corresponding scalar in the third operand Rop, which is either Fm or a vector of zeros. Set the corresponding scalar in Fd to all ones if <op> is true, and all zeros if <op> is not true.

Examples

f10-26-9780128036983
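The mask-producing behavior described above can be modelled lane by lane in Python. This is a sketch, not NEON code: vectors are plain lists of integers, and the names `vc`, `vcle`, `vclt`, and `count_true` are hypothetical helpers introduced for illustration.

```python
TESTS = {
    "eq": lambda a, b: a == b,
    "ge": lambda a, b: a >= b,
    "gt": lambda a, b: a > b,
}

def vc(op, vn, vm, bits=32):
    """Return an all-ones lane where vn[i] <op> vm[i] holds, else zero."""
    ones = (1 << bits) - 1
    return [ones if TESTS[op](a, b) else 0 for a, b in zip(vn, vm)]

def vcle(vn, vm, bits=32):
    # Pseudo-instruction: vcge with the operands reversed.
    return vc("ge", vm, vn, bits)

def vclt(vn, vm, bits=32):
    # Pseudo-instruction: vcgt with the operands reversed.
    return vc("gt", vm, vn, bits)

def as_signed(x, bits=32):
    """Reinterpret an unsigned lane pattern as two's complement."""
    return x - (1 << bits) if x >> (bits - 1) else x

def count_true(mask_vector, bits=32):
    # Each true lane is -1 when read as signed, so negate the sum.
    return -sum(as_signed(x, bits) for x in mask_vector)
```

Because a true lane is all ones, the result can be used directly as a bit mask for the bitwise selection instructions in Section 10.7.3.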

10.6.2 Vector Absolute Compare

The following instructions perform comparisons between the absolute values of all of the corresponding elements of two vectors in parallel:

vacgt Absolute Compare Greater Than, and

vacge Absolute Compare Greater Than or Equal.

The vector absolute compare instruction compares the absolute value of each element of a vector with the absolute value of the corresponding element in a second vector, and sets an element in the destination vector for each comparison. If the comparison is true, then all bits in the result element are set to one. Otherwise, all bits in the result element are set to zero. Note that summing the elements of the result vector (as signed integers) will give the two’s complement of the number of comparisons which were true.

Syntax

 vac<op>.f32 Vd, Vn, Vm

 <op> must be either ge or gt.

 V can be d or q.

 The operand element type must be f32.

 The result element type is i32.

Operations

Name: vac<op>

Effect:

 for i ∈ vector_length do
  if |Fn[i]| <op> |Fm[i]| then
   $Fd[i] \leftarrow 111\ldots1$
  else
   $Fd[i] \leftarrow 000\ldots0$
  end if
 end for

Description: Compare the absolute value of each scalar in Fn to the absolute value of the corresponding scalar in Fm. If the comparison is true, then set all bits in the corresponding scalar in Fd to one. Otherwise set all bits in the corresponding scalar in Fd to zero.

Examples

f10-27-9780128036983
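A minimal Python sketch of the absolute compare follows, assuming vectors are lists of floats and results are 32-bit lane patterns. The function name `vac` is hypothetical.

```python
def vac(op, vn, vm):
    """Model vacge/vacgt on f32 lanes: compare |vn[i]| with |vm[i]|."""
    ones = 0xFFFFFFFF
    if op == "ge":
        return [ones if abs(a) >= abs(b) else 0 for a, b in zip(vn, vm)]
    if op == "gt":
        return [ones if abs(a) > abs(b) else 0 for a, b in zip(vn, vm)]
    raise ValueError("op must be 'ge' or 'gt'")
```

A typical use is a range check such as |x| > limit on every lane at once, without first computing the absolute values separately.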

10.6.3 Vector Test Bits

NEON provides the following vector version of the ARM tst instruction:

vtst Test Bits.

The vector test bits instruction performs a logical AND operation between each element of a vector and the corresponding element in a second vector. If the result is not zero, then every bit in the corresponding element of the result vector is set to one. Otherwise, every bit in the corresponding element of the result vector is set to zero.

Syntax

 vtst.<size> Vd, Vn, Vm

 V can be d or q.

 <size> must be one of 8, 16, or 32.

 The result element type is defined by the following table:

| <size> | Result Type |
|---|---|
| 32 | i32 |
| 16 | i16 |
| 8 | i8 |

Operations

Name: vtst

Effect:

 for i ∈ vector_length do
  if (Fm[i] ∧ Fn[i]) ≠ 0 then
   $Fd[i] \leftarrow 111\ldots1$
  else
   $Fd[i] \leftarrow 000\ldots0$
  end if
 end for

Description: Perform logical AND between each scalar in Fn and the corresponding scalar in Fm. Set the corresponding scalar in Fd to all ones if the result is not zero, and all zeros otherwise.

Examples

f10-28-9780128036983
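The test-bits operation can be sketched in Python as follows. The function name `vtst` mirrors the mnemonic but is only a model: lanes are unsigned integers in a list, and `size` is the lane width in bits.

```python
def vtst(vn, vm, size):
    """Set a lane to all ones if vn[i] AND vm[i] is non-zero, else to zero."""
    ones = (1 << size) - 1
    return [ones if (a & b & ones) != 0 else 0 for a, b in zip(vn, vm)]
```

For instance, testing every lane against a vector filled with the constant 0x01 marks the lanes that hold odd numbers.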

10.7 Bitwise Logical Operations

NEON adds the ability to perform integer and bitwise logical operations on the VFP register set. Recall that integer operations can also be used on fixed-point data. These operations add a great deal of power to the ARM processor.

10.7.1 Bitwise Logical Operations

NEON includes vector versions of the following five basic logical operations:

vand Bitwise AND,

veor Bitwise Exclusive-OR,

vorr Bitwise OR,

vorn Bitwise Complement and OR, and

vbic Bit Clear.

All of them involve two source operands and a destination register.

Syntax

 v<op>{.<type>} Vd, Vn, Vm

 <op> must be one of and, eor, orr, orn, or bic.

 V must be either q or d.

 <type> must be i8, i16, i32, or i64. For these bitwise logical operations, the type does not affect the result.

Operations

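| Name | Effect | Description |
|---|---|---|
| vand | $Vd \leftarrow Vn \land Vm$ | Logical AND |
| veor | $Vd \leftarrow Vn \oplus Vm$ | Exclusive OR |
| vorr | $Vd \leftarrow Vn \lor Vm$ | Logical OR |
| vorn | $Vd \leftarrow Vn \lor \lnot Vm$ | Logical OR with the complement of the second operand |
| vbic | $Vd \leftarrow Vn \land \lnot Vm$ | Bit Clear |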

Examples

f10-29-9780128036983
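As a sketch of the five operations, the Python model below treats a doubleword register as a 64-bit unsigned integer. The function names follow the mnemonics but are otherwise hypothetical, and the 64-bit width is an assumption chosen for brevity (a quadword register would use a 128-bit mask).

```python
MASK64 = (1 << 64) - 1  # model of one doubleword (d) register

def vand(n, m):
    return n & m

def veor(n, m):
    return n ^ m

def vorr(n, m):
    return n | m

def vorn(n, m):
    # OR the first operand with the complement of the second.
    return (n | ~m) & MASK64

def vbic(n, m):
    # Clear the bits of the first operand that are set in the second.
    return n & ~m & MASK64
```

Since these operate on bits rather than lanes, the element type given in the assembly source has no influence on the result.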

10.7.2 Bitwise Logical Operations with Immediate Data

It is often useful to clear and/or set specific bits in a register. The NEON instruction set provides the following vector versions of the logical OR and bit clear instructions:

vorr Bitwise OR Immediate, and

vbic Bit Clear Immediate.

Syntax

 v<op>.<type> Vd, #<imm>

 <op> must be either orr or bic.

 V must be either q or d to specify whether the operation involves quadwords or doublewords.

 <type> must be i16 or i32.

 <imm> is a 16-bit or 32-bit immediate value, which is interpreted as a pattern for filling the immediate operand. The following table shows acceptable patterns for <imm>, based on what was chosen for <type>:

<type>
| i16 | i32 |
|---|---|
| 0x00XY | 0x000000XY |
| 0xXY00 | 0x0000XY00 |
| | 0x00XY0000 |
| | 0xXY000000 |

Operations

| Name | Effect | Description |
|---|---|---|
| vorr | $Vd \leftarrow Vd \lor imm{:}imm$ | Logical OR |
| vbic | $Vd \leftarrow Vd \land \lnot(imm{:}imm)$ | Bit Clear |

Examples

f10-30-9780128036983
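The way the immediate pattern fills the whole operand can be illustrated with a short Python model. It assumes a 64-bit doubleword register held as an unsigned integer; `replicate`, `vorr_imm`, and `vbic_imm` are hypothetical names, and the sketch does not check that the immediate is one of the permitted patterns.

```python
def replicate(imm, elem_bits, reg_bits=64):
    """Copy the immediate into every element of the register."""
    value = 0
    for lane in range(reg_bits // elem_bits):
        value |= imm << (lane * elem_bits)
    return value

def vorr_imm(vd, imm, elem_bits, reg_bits=64):
    # Set, in every element, the bits that are set in the immediate.
    return vd | replicate(imm, elem_bits, reg_bits)

def vbic_imm(vd, imm, elem_bits, reg_bits=64):
    # Clear, in every element, the bits that are set in the immediate.
    mask = (1 << reg_bits) - 1
    return vd & ~replicate(imm, elem_bits, reg_bits) & mask
```

With an i32 pattern of 0xFF000000, for example, vbic clears the top byte of each 32-bit element.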

10.7.3 Bitwise Insertion and Selection

NEON provides three instructions which can be used to combine the bits in two registers or to extract specific bits from a register, according to a pattern:

vbit Bitwise Insert,

vbif Bitwise Insert if False, and

vbsl Bitwise Select.

Syntax

 v<op>{.<type>} Vd, Vn, Vm

 <op> can be bif, bit, or bsl.

 V can be d or q.

 The <type> must be i8, i16, i32, or i64, and specifies the size of items in the vectors. Note that for these bitwise logical operations, the type does not matter, so the assembler ignores it. However, it can be useful to the programmer as extra documentation.

Operations

| Name | Effect | Description |
|---|---|---|
| vbit | $Fd \leftarrow (Fd \land \lnot Fm) \lor (Fn \land Fm)$ | Insert each bit from the first operand into the destination if the corresponding bit of the second operand is 1. |
| vbif | $Fd \leftarrow (Fd \land Fm) \lor (Fn \land \lnot Fm)$ | Insert each bit from the first operand into the destination if the corresponding bit of the second operand is 0. |
| vbsl | $Fd \leftarrow (Fd \land Fn) \lor (\lnot Fd \land Fm)$ | Select each bit for the destination from the first operand if the corresponding bit of the destination is 1, or from the second operand if the corresponding bit of the destination is 0. |

Examples

f10-31-9780128036983
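The three formulas in the table translate directly into Python. This is a sketch that models a doubleword register as a 64-bit unsigned integer; the function names are hypothetical, and each returns the new value of the destination.

```python
MASK64 = (1 << 64) - 1  # model of one doubleword (d) register

def vbit(d, n, m):
    # Take bits from n where m is 1; keep bits of d where m is 0.
    return ((d & ~m) | (n & m)) & MASK64

def vbif(d, n, m):
    # Take bits from n where m is 0; keep bits of d where m is 1.
    return ((d & m) | (n & ~m)) & MASK64

def vbsl(d, n, m):
    # The old destination is the selector: 1 picks n, 0 picks m.
    return ((d & n) | (~d & m)) & MASK64
```

Loading the destination with a comparison mask before vbsl gives a branch-free per-lane choice between two vectors.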

10.8 Shift Instructions

The NEON shift instructions operate on vectors. Shifts are often used for multiplication and division by powers of two. The result of a left shift may be too large for the destination element, resulting in overflow. A shift right by n bits is equivalent to division by 2^n. In some cases, it may be useful to round the result of a division, rather than truncating. NEON provides versions of the shift instruction which perform saturation and/or rounding of the result.

10.8.1 Shift Left by Immediate

These instructions shift each element in a vector left by an immediate value:

vshl Shift Left Immediate,

vqshl Saturating Shift Left Immediate,

vqshlu Saturating Shift Left Immediate Unsigned, and

vshll Shift Left Immediate Long.

Overflow conditions can be avoided by using the saturating version, or by using the long version, in which case the destination is twice the size of the source.

Syntax

 vshl.<type> Vd, Vm, #<imm>

 vqshl{u}.<type> Vd, Vm, #<imm>

 vshll.<type> Qd, Dm, #<imm>

 If u is present, then the results are unsigned.

 The valid choices for <type> are given in the following table:

| Opcode | Valid Types |
|---|---|
| vshl | i8, i16, i32, i64, s8, s16, or s32 |
| vqshl | s8, s16, s32, s64, u8, u16, u32, or u64 |
| vqshlu | s8, s16, s32, or s64 |
| vshll | u8, u16, u32, u64, s8, s16, or s32 |

Operations

| Name | Effect | Description |
|---|---|---|
| vshl | $Vd[] \leftarrow Vm[] \ll imm$ | Each element of Vm is shifted left by the immediate value and stored in the corresponding element of Vd. Bits shifted past the end of an element are lost. |
| vshll | $Qd[] \leftarrow Dm[] \ll imm$ | Each element of Dm is sign or zero extended, depending on <type>, shifted left by the immediate value, and stored in the corresponding element of Qd. |
| vqshl{u} | $Vd[] \leftarrow \mathrm{saturate}(Vm[] \ll imm)$ | Each element of Vm is shifted left by the immediate value and stored in the corresponding element of Vd. If the result of the shift is outside the range of the destination element, then the value is saturated. If u was specified, then the destination is unsigned. Otherwise, it is signed. |

Examples

f10-34-9780128036983
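The difference between wrapping, saturating, and widening left shifts is easiest to see on concrete lanes. The Python sketch below is a model only: lanes are signed integers in lists, `bits` is the lane width, and all function names are hypothetical.

```python
def wrap(x, bits):
    """Keep the low bits and reinterpret them as a signed lane."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x

def saturate(x, bits, signed=True):
    """Clamp x to the range of a signed or unsigned lane."""
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    return max(lo, min(hi, x))

def vshl(vm, imm, bits):
    # Bits shifted past the end of the lane are lost.
    return [wrap(x << imm, bits) for x in vm]

def vqshl(vm, imm, bits):
    # Signed input, signed saturated result.
    return [saturate(x << imm, bits, signed=True) for x in vm]

def vqshlu(vm, imm, bits):
    # Signed input, unsigned saturated result.
    return [saturate(x << imm, bits, signed=False) for x in vm]

def vshll(dm, imm):
    # The destination lane is twice as wide, so nothing is lost here.
    return [x << imm for x in dm]
```

With 8-bit lanes holding 100 and −3 and a shift of one, the plain shift wraps 200 around to −56, the saturating shift clamps it to 127, and the long shift keeps the full value 200.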

10.8.2 Shift Left or Right by Variable

These instructions shift each element in a vector, using the least significant byte of the corresponding element of a second vector as the shift amount:

vshl Shift Left or Right by Variable,

vrshl Shift Left or Right by Variable and Round,

vqshl Saturating Shift Left or Right by Variable, and

vqrshl Saturating Shift Left or Right by Variable and Round.

If the shift value is positive, the operation is a left shift. If the shift value is negative, then it is a right shift. A shift value of zero is equivalent to a move. If the operation is a right shift, and r is specified, then the result is rounded rather than truncated. Results are saturated if q is specified.

Syntax

 v{q}{r}shl.<type> Vd, Vn, Vm

 If q is present, then the results are saturated.

 If r is present, then right shifted values are rounded rather than truncated.

 V can be d or q.

 <type> must be one of s8, s16, s32, s64, u8, u16, u32, or u64.

Operations

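Name: v{q}{r}shl

Effect:

 if q is present then
  if r is present then
   $Vd[] \leftarrow \mathrm{saturate}(\mathrm{round}(Vn[] \ll Vm[]))$
  else
   $Vd[] \leftarrow \mathrm{saturate}(Vn[] \ll Vm[])$
  end if
 else
  if r is present then
   $Vd[] \leftarrow \mathrm{round}(Vn[] \ll Vm[])$
  else
   $Vd[] \leftarrow Vn[] \ll Vm[]$
  end if
 end if

Description: Each element of Vn is shifted by the signed amount held in the least significant byte of the corresponding element of Vm (left if positive, right if negative) and stored in the corresponding element of Vd. Right-shifted results are rounded if r is present, and results are saturated if q is present. Otherwise, bits shifted past the end of an element are lost.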

Examples

f10-35-9780128036983
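The rules for the variable shift (sign of the shift amount, rounding, saturation) can be sketched in Python. This model assumes signed lanes; `vshl_var` and its keyword arguments are hypothetical names standing in for the q and r letters of the mnemonic.

```python
def shift_amount(v):
    """Read the least significant byte of v as a signed shift count."""
    b = v & 0xFF
    return b - 256 if b >= 128 else b

def vshl_var(vn, vm, bits, rounding=False, saturating=False):
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    out = []
    for x, s in zip(vn, vm):
        s = shift_amount(s)
        if s >= 0:
            y = x << s                        # positive count: shift left
        elif rounding:
            y = (x + (1 << (-s - 1))) >> -s   # add half, then shift right
        else:
            y = x >> -s                       # truncating right shift
        if saturating:
            y = max(lo, min(hi, y))
        else:
            y &= (1 << bits) - 1              # wrap to the lane width
            if y >> (bits - 1):
                y -= 1 << bits
        out.append(y)
    return out
```

Because every lane has its own count, one instruction can scale different elements by different powers of two.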

10.8.3 Shift Right by Immediate

These instructions shift each element in a vector right by an immediate value:

vshr Shift Right Immediate,

vrshr Shift Right Immediate and Round,

vshrn Shift Right Immediate and Narrow,

vrshrn Shift Right Immediate Round and Narrow,

vsra Shift Right and Accumulate Immediate, and

vrsra Shift Right Round and Accumulate Immediate.

Syntax

 v{r}shr{<cond>}.<type> Vd, Vm, #<imm>

 v{r}shrn{<cond>}.<type> Vd, Vm, #<imm>

 v{r}sra{<cond>}.<type> Vd, Vm, #<imm>

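 V can be d or q for v{r}shr and v{r}sra. For v{r}shrn, the destination is a doubleword register and the source is a quadword register.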

 If r is present, then right shifted values are rounded rather than truncated.

 <cond> is an optional condition code.

 The valid choices for <type> are given in the following table:

| Opcode | Valid Types |
|---|---|
| v{r}shr | u8, u16, u32, u64, s8, s16, s32, or s64 |
| v{r}shrn | i16, i32, or i64 |
| v{r}sra | u8, u16, u32, u64, s8, s16, s32, or s64 |

Operations

Name: v{r}shr

Effect:

 if r is present then
  $Vd[] \leftarrow \mathrm{round}(Vm[] \gg imm)$
 else
  $Vd[] \leftarrow Vm[] \gg imm$
 end if

Description: Each element of Vm is shifted right with sign or zero extension by the immediate value and stored in the corresponding element of Vd. Results can optionally be rounded.

Name: v{r}shrn

Effect:

 if r is present then
  $Vd[] \leftarrow \mathrm{narrow}(\mathrm{round}(Vm[] \gg imm))$
 else
  $Vd[] \leftarrow \mathrm{narrow}(Vm[] \gg imm)$
 end if

Description: Each element of Vm is shifted right with zero extension by the immediate value, optionally rounded, then narrowed and stored in the corresponding element of Vd.

Name: v{r}sra

Effect:

 if r is present then
  $Vd[] \leftarrow Vd[] + \mathrm{round}(Vm[] \gg imm)$
 else
  $Vd[] \leftarrow Vd[] + (Vm[] \gg imm)$
 end if

Description: Each element of Vm is shifted right with sign or zero extension by the immediate value and accumulated in the corresponding element of Vd. Results can optionally be rounded.

Examples

f10-36-9780128036983
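A short Python model makes the rounding, narrowing, and accumulating variants concrete. It is a sketch under stated assumptions: lanes are Python integers, rounding adds half of the discarded range before shifting, and overflow of the accumulator in `vsra` is not modelled. The function names are hypothetical.

```python
def shr(x, imm, rounding):
    """Shift right by imm (imm >= 1), optionally rounding to nearest."""
    if rounding:
        x += 1 << (imm - 1)
    return x >> imm

def vshr(vm, imm, rounding=False):
    return [shr(x, imm, rounding) for x in vm]

def vshrn(qm, imm, bits, rounding=False):
    # Keep only the low half of each shifted lane (bits is the source width).
    mask = (1 << (bits // 2)) - 1
    return [shr(x, imm, rounding) & mask for x in qm]

def vsra(vd, vm, imm, rounding=False):
    # Accumulate the shifted value into the destination lane.
    return [d + shr(x, imm, rounding) for d, x in zip(vd, vm)]
```

Shifting 16-bit lanes right by eight and narrowing, for example, extracts the high byte of each lane.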

10.8.4 Saturating Shift Right by Immediate

These instructions shift each element in a quad word vector right by an immediate value:

vqshrn Saturating Shift Right Immediate,

vqrshrn Saturating Shift Right Immediate Round,

vqshrun Saturating Shift Right Immediate Unsigned, and

vqrshrun Saturating Shift Right Immediate Round Unsigned.

The result is optionally rounded, then saturated, narrowed, and stored in a double word vector.

Syntax

 vq{r}shr{u}n.<type> Dd, Qm, #<imm>

 If r is present, then right shifted values are rounded rather than truncated.

 If u is present, then the results are unsigned, regardless of the type of elements in Qm.

 The valid choices for <type> are given in the following table:

| Opcode | Valid Types |
|---|---|
| vq{r}shrn | u16, u32, u64, s16, s32, or s64 |
| vq{r}shrun | s16, s32, or s64 |

 <imm> is the amount that elements are to be shifted, and must be between zero and one less than the number of bits in <type>.

Operations

Name: vq{r}shrn

Effect:

 if r is present then
  $Dd[] \leftarrow \mathrm{saturate}(\mathrm{round}(Qm[] \gg imm))$
 else
  $Dd[] \leftarrow \mathrm{saturate}(Qm[] \gg imm)$
 end if

Description: Each element of Qm is shifted right with sign or zero extension, depending on <type>, by the immediate value, optionally rounded, then saturated to the range of the narrower element and stored in the corresponding element of Dd.

Name: vq{r}shrun

Effect:

 if r is present then
  $Dd[] \leftarrow \mathrm{saturate}(\mathrm{round}(Qm[] \gg imm))$
 else
  $Dd[] \leftarrow \mathrm{saturate}(Qm[] \gg imm)$
 end if

Description: Each signed element of Qm is shifted right by the immediate value, optionally rounded, then saturated to the unsigned range of the narrower element and stored in the corresponding element of Dd.

Examples

f10-37-9780128036983
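The following Python sketch models the signed forms of the saturating narrowing shifts. The names `vqshrn` and `vqshrun` and the `rounding` flag are hypothetical, lanes are signed Python integers, and `bits` is the width of the source lanes.

```python
def _shift(x, imm, rounding):
    # Optionally add half of the discarded range, then shift right.
    if rounding:
        x += 1 << (imm - 1)
    return x >> imm

def vqshrn(qm, imm, bits, rounding=False):
    """Signed source lanes, result saturated to a signed half-width lane."""
    half = bits // 2
    lo, hi = -(1 << (half - 1)), (1 << (half - 1)) - 1
    return [max(lo, min(hi, _shift(x, imm, rounding))) for x in qm]

def vqshrun(qm, imm, bits, rounding=False):
    """Signed source lanes, result saturated to an unsigned half-width lane."""
    hi = (1 << (bits // 2)) - 1
    return [max(0, min(hi, _shift(x, imm, rounding))) for x in qm]
```

The unsigned form is convenient when signed intermediate results, such as filtered pixel values, must be clamped to an unsigned byte range.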

10.8.5 Shift and Insert

These instructions perform bitwise shifting of each element in a vector, then combine the results with the contents of the destination register:

vsli Shift Left and Insert,

vsri Shift Right and Insert.

Fig. 10.10 provides an example.

Figure 10.10 Effects of vsli.32 d4,d9,#6.

Syntax

 vs<dir>i.<size> Vd, Vm, #<imm>

 <dir> must be l for a left shift, or r for a right shift.

 <size> must be 8, 16, 32, or 64.

 <imm> is the amount that elements are to be shifted, and must be between zero and <size>− 1 for vsli, or between one and <size> for vsri.

Operations

Name: vsli

Effect:

 $mask \leftarrow (1 \ll imm) - 1$
 $Vd[] \leftarrow (mask \land Vd[]) \lor (Vm[] \ll imm)$

Description: Each element of Vm is shifted left and combined with the lower <imm> bits of the corresponding element of Vd.

Name: vsri

Effect:

 $mask \leftarrow \lnot((1 \ll (size - imm)) - 1)$
 $Vd[] \leftarrow (mask \land Vd[]) \lor (Vm[] \gg imm)$

Description: Each element of Vm is shifted right and combined with the upper <imm> bits of the corresponding element of Vd.

Examples

f10-38-9780128036983
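The masking step in shift-and-insert can be sketched in Python. This is a model, not NEON code: lanes are unsigned integers in lists, `size` is the lane width in bits, and the function names are hypothetical.

```python
def vsli(vd, vm, imm, size):
    """Shift vm left by imm; keep the low imm bits of vd."""
    full = (1 << size) - 1
    keep = (1 << imm) - 1                 # low imm bits of the destination
    return [(d & keep) | ((m << imm) & full) for d, m in zip(vd, vm)]

def vsri(vd, vm, imm, size):
    """Shift vm right by imm; keep the high imm bits of vd."""
    full = (1 << size) - 1
    keep = full & ~(full >> imm)          # high imm bits of the destination
    return [(d & keep) | ((m & full) >> imm) for d, m in zip(vd, vm)]
```

These instructions are handy for packing bit fields, for example merging separate color channels into a packed pixel format.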

10.9 Arithmetic Instructions

NEON provides several instructions for addition, subtraction, and multiplication, but does not provide a divide instruction. Whenever possible, division should be performed by multiplying by the reciprocal. When dividing by constants, the reciprocal can be calculated in advance, as shown in Chapter 8. For dividing by variables, NEON provides instructions for quickly calculating the reciprocals for all elements in a vector. In most cases, this is faster than using a divide instruction. When division is absolutely unavoidable, the VFP divide instructions can be used.

10.9.1 Vector Add and Subtract

The following eight instructions perform vector addition and subtraction:

vadd Add

vqadd Saturating Add

vaddl Add Long

vaddw Add Wide

vsub Subtract

vqsub Saturating Subtract

vsubl Subtract Long

vsubw Subtract Wide

The Vector Add (vadd) instruction adds corresponding elements in two vectors and stores the results in the corresponding elements of the destination register. The Vector Subtract (vsub) instruction subtracts elements in one vector from corresponding elements in another vector and stores the results in the corresponding elements of the destination register. Other versions allow mismatched operand and destination sizes, and the saturating versions prevent overflow by limiting the range of the results.

Syntax

 v{q}<op>.<type> Vd, Vn, Vm

 v<op>l.<type> Qd, Dn, Dm

 v<op>w.<type> Qd, Qn, Dm

 <op> is either add or sub.

 The valid choices for <type> are given in the following table:

| Opcode | Valid Types |
|---|---|
| v<op> | i8, i16, i32, i64, or f32 |
| vq<op> | s8, s16, s32, s64, u8, u16, u32, or u64 |
| v<op>l | s8, s16, s32, u8, u16, or u32 |
| v<op>w | s8, s16, s32, u8, u16, or u32 |

Operations

| Name | Effect | Description |
|---|---|---|
| v<op> | $Vd[] \leftarrow Vn[] \,\langle op \rangle\, Vm[]$ | The operation is applied to corresponding elements of Vn and Vm. The results are stored in the corresponding elements of Vd. |
| vq<op> | $Vd[] \leftarrow \mathrm{saturate}(Vn[] \,\langle op \rangle\, Vm[])$ | The operation is applied to corresponding elements of Vn and Vm. The results are saturated, then stored in the corresponding elements of Vd. |
| v<op>l | $Qd[] \leftarrow Dn[] \,\langle op \rangle\, Dm[]$ | The elements of Dn and Dm are sign or zero extended, then the operation is applied to corresponding elements. The results are stored in the corresponding elements of Qd. |
| v<op>w | $Qd[] \leftarrow Qn[] \,\langle op \rangle\, Dm[]$ | The elements of Dm are sign or zero extended, then the operation is applied with corresponding elements of Qn. The results are stored in the corresponding elements of Qd. |

Examples

f10-39-9780128036983
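The difference between the wrapping, saturating, and long forms is illustrated by the Python sketch below. It assumes unsigned lanes held in lists, with `bits` as the lane width; the function names are hypothetical, and overflow in the already wide lanes of `vaddw` is not modelled.

```python
def _wrap(x, bits):
    return x & ((1 << bits) - 1)

def _sat_u(x, bits):
    return max(0, min((1 << bits) - 1, x))

def vadd(vn, vm, bits):
    return [_wrap(a + b, bits) for a, b in zip(vn, vm)]

def vsub(vn, vm, bits):
    return [_wrap(a - b, bits) for a, b in zip(vn, vm)]

def vqadd_u(vn, vm, bits):
    # Unsigned saturating add: results stick at the maximum lane value.
    return [_sat_u(a + b, bits) for a, b in zip(vn, vm)]

def vqsub_u(vn, vm, bits):
    # Unsigned saturating subtract: results stick at zero.
    return [_sat_u(a - b, bits) for a, b in zip(vn, vm)]

def vaddl(dn, dm):
    # Both operands are widened first, so the sum always fits.
    return [a + b for a, b in zip(dn, dm)]

def vaddw(qn, dm):
    # Wide lanes plus widened narrow lanes.
    return [a + b for a, b in zip(qn, dm)]
```

Adding the 8-bit values 200 and 100 gives 44 with vadd, 255 with the saturating form, and the exact sum 300 with the long form.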

10.9.2 Vector Add and Subtract with Narrowing

These instructions add or subtract the corresponding elements of two vectors, and narrow by taking the most significant half of the result:

vaddhn Add and Narrow

vraddhn Add, Round, and Narrow

vsubhn Subtract and Narrow

vrsubhn Subtract, Round, and Narrow

The results are stored in the corresponding elements of the destination register. Results can be optionally rounded instead of truncated.

Syntax

 v{r}<op>hn.<type> Dd, Qn, Qm

 <op> is either add or sub.

 If r is present, then the result is rounded instead of truncated.

 <type> must be one of i16, i32, or i64.

Operations

Name: v{r}<op>hn

Effect:

 $shift \leftarrow size \div 2$
 if r is present then
  $x \leftarrow (Vn[] \,\langle op \rangle\, Vm[]) + (1 \ll (shift - 1))$
  $Vd[] \leftarrow x \gg shift$
 else
  $x \leftarrow Vn[] \,\langle op \rangle\, Vm[]$
  $Vd[] \leftarrow x \gg shift$
 end if

Description: The operation is applied to corresponding elements of Vn and Vm. The results are optionally rounded, then narrowed by taking the most significant half, and stored in the corresponding elements of Vd.

Examples

f10-40-9780128036983
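Taking the most significant half of a sum can be sketched in Python as follows. The model assumes unsigned lane patterns in lists, with `size` as the width of the source lanes; `vaddhn` and its `rounding` flag are hypothetical names.

```python
def vaddhn(qn, qm, size, rounding=False):
    """Add lanes, optionally round, and keep the high half of each sum."""
    shift = size // 2
    mask = (1 << size) - 1
    out = []
    for a, b in zip(qn, qm):
        x = a + b
        if rounding:
            x += 1 << (shift - 1)   # add half of the discarded range
        out.append((x & mask) >> shift)
    return out
```

The carry out of the full-width sum is discarded before the high half is taken, as the third test case below shows.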

10.9.3 Add or Subtract and Divide by Two

These instructions add or subtract corresponding elements from two vectors then shift the result right by one bit:

vhadd Halving Add

vrhadd Halving Add and Round

vhsub Halving Subtract

The results are stored in corresponding elements of the destination vector. If the operation is addition, then the results can be optionally rounded.

Syntax

 v{r}hadd.<type> Vd, Vn, Vm

 vhsub.<type> Vd, Vn, Vm

 If r is present, then the result is rounded instead of truncated.

 <type> must be one of s8, s16, s32, u8, u16, or u32.

Operations

Name: v{r}hadd

Effect:

 if r is present then
  $Vd[] \leftarrow (Vn[] + Vm[] + 1) \gg 1$
 else
  $Vd[] \leftarrow (Vn[] + Vm[]) \gg 1$
 end if

Description: The corresponding elements of Vn and Vm are added together, optionally rounded, then shifted right one bit. Results are stored in the corresponding elements of Vd.

Name: vhsub

Effect:

 $Vd[] \leftarrow (Vn[] - Vm[]) \gg 1$

Description: The elements of Vm are subtracted from the corresponding elements of Vn. Results are shifted right one bit and stored in the corresponding elements of Vd.

Examples

f10-41-9780128036983
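The halving operations compute an average without losing the carry bit, which the Python sketch below models with unbounded integers. The function names are hypothetical, and lanes are plain integers in lists.

```python
def vhadd(vn, vm, rounding=False):
    """Average of each pair; the intermediate sum is not truncated."""
    extra = 1 if rounding else 0
    return [(a + b + extra) >> 1 for a, b in zip(vn, vm)]

def vhsub(vn, vm):
    """Half of the difference vn[i] - vm[i], using an arithmetic shift."""
    return [(a - b) >> 1 for a, b in zip(vn, vm)]
```

Averaging two 8-bit lanes that both hold 255 yields 255, which would not be the case if the sum were first truncated to eight bits.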

10.9.4 Add Elements Pairwise

These instructions add vector elements pairwise:

vpadd Add Pairwise

vpaddl Add Pairwise Long

vpadal Add Pairwise and Accumulate Long

The long versions can be used to prevent overflow.

Syntax

 vpadd.<type> Dd, Dn, Dm

 vp<op>l.<type> Vd, Vm

 <op> must be either add or ada.

 The valid choices for <type> are given in the following table:

| Opcode | Valid Types |
|---|---|
| vpadd | i8, i16, i32, or f32 |
| vp<op>l | s8, s16, s32, u8, u16, or u32 |

Operations

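Name: vpadd

Effect:

 n ← # of elements
 for 0 ≤ i < (n ÷ 2) do
  $Dd[i] \leftarrow Dn[2i] + Dn[2i+1]$
 end for
 for (n ÷ 2) ≤ i < n do
  $j \leftarrow i - (n \div 2)$
  $Dd[i] \leftarrow Dm[2j] + Dm[2j+1]$
 end for

Description: Add adjacent elements of two vectors pairwise and store the results in another vector. The sums from the first operand fill the lower half of the destination, and the sums from the second operand fill the upper half.

Name: vpaddl

Effect:

 n ← # of elements in Vm
 for 0 ≤ i < (n ÷ 2) do
  $Vd[i] \leftarrow Vm[2i] + Vm[2i+1]$
 end for

Description: Add adjacent elements of a vector pairwise and store the results in another vector, whose elements are twice as wide.

Name: vpadal

Effect:

 n ← # of elements in Vm
 for 0 ≤ i < (n ÷ 2) do
  $Vd[i] \leftarrow Vd[i] + Vm[2i] + Vm[2i+1]$
 end for

Description: Add adjacent elements of a vector pairwise and accumulate the results in another vector, whose elements are twice as wide.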

Examples

f10-43-9780128036983
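The pairwise operations can be modelled in Python with list slicing. This is a sketch with hypothetical function names; lanes are Python integers, so the wrap-around that vpadd would exhibit on overflow is not modelled.

```python
def _pairs(v):
    """Sum adjacent elements: [v0+v1, v2+v3, ...]."""
    return [v[i] + v[i + 1] for i in range(0, len(v), 2)]

def vpadd(dn, dm):
    # Sums from the first operand fill the low half, the second the high half.
    return _pairs(dn) + _pairs(dm)

def vpaddl(vm):
    # Half as many result lanes, each twice as wide.
    return _pairs(vm)

def vpadal(vd, vm):
    # Accumulate the pairwise sums into the wide destination lanes.
    return [d + s for d, s in zip(vd, _pairs(vm))]

def horizontal_sum(d):
    """Reduce a 4-lane vector to its total with two vpadd steps."""
    step = vpadd(d, d)
    return vpadd(step, step)[0]
```

Repeated pairwise addition is a common way to reduce a vector to a single total, as `horizontal_sum` shows for four lanes.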

10.9.5 Absolute Difference

These instructions subtract the elements of one vector from another and store or accumulate the absolute value of the results:

vaba Absolute Difference and Accumulate

vabal Absolute Difference and Accumulate Long

vabd Absolute Difference

vabdl Absolute Difference Long

The long versions can be used to prevent overflow.

Syntax

v<op>.<type> Vd, Vn, Vm

v<op>l.<type> Qd, Dn, Dm

 <op> is either aba or abd.

 The valid choices for <type> are given in the following table:

| Opcode | Valid Types |
|---|---|
| vabd | s8, s16, s32, u8, u16, u32, or f32 |
| vaba | s8, s16, s32, u8, u16, or u32 |
| vabdl | s8, s16, s32, u8, u16, or u32 |
| vabal | s8, s16, s32, u8, u16, or u32 |

Operations

| Name | Effect | Description |
|---|---|---|
| vabd | $Vd[] \leftarrow \lvert Vn[] - Vm[] \rvert$ | Subtract corresponding elements and take the absolute value. |
| vaba | $Vd[] \leftarrow Vd[] + \lvert Vn[] - Vm[] \rvert$ | Subtract corresponding elements and take the absolute value. Accumulate the results. |
| vabdl | $Qd[] \leftarrow \lvert Dn[] - Dm[] \rvert$ | Extend and subtract corresponding elements, then take the absolute value. |
| vabal | $Qd[] \leftarrow Qd[] + \lvert Dn[] - Dm[] \rvert$ | Extend and subtract corresponding elements, then take the absolute value. Accumulate the results. |

Examples

f10-45-9780128036983
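A sum of absolute differences is a typical use of these instructions, and the Python sketch below models it. The function names are hypothetical, lanes are Python integers, and the accumulator is assumed to be wide enough that no overflow occurs (the situation the long forms are meant to guarantee).

```python
def vabd(vn, vm):
    """Absolute difference of corresponding lanes."""
    return [abs(a - b) for a, b in zip(vn, vm)]

def vaba(vd, vn, vm):
    """Absolute difference accumulated into the destination lanes."""
    return [d + abs(a - b) for d, a, b in zip(vd, vn, vm)]

def sum_abs_diff(rows_a, rows_b):
    """Accumulate |a - b| over several rows, one vaba-style step per row."""
    acc = [0] * len(rows_a[0])
    for ra, rb in zip(rows_a, rows_b):
        acc = vaba(acc, ra, rb)
    return sum(acc)
```

Block-matching code in video encoders commonly relies on this kind of accumulation.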

10.9.6 Absolute Value and Negate

These operations compute the absolute value or negate each element in a vector:

vabs Absolute Value

vneg Negate

vqabs Saturating Absolute Value

vqneg Saturating Negate

The saturating versions can be used to prevent overflow.

Syntax

 v{q}<op>.<type> Vd, Vm

 If q is present then results are saturated.

 <op> is either abs or neg.

 The valid choices for <type> are given in the following table:

| Opcode | Valid Types |
|---|---|
| vabs | s8, s16, s32, or f32 |
| vneg | s8, s16, s32, or f32 |
| vqabs | s8, s16, or s32 |
| vqneg | s8, s16, or s32 |

Operations

Name: v{q}abs

Effect:

 if q is present then
  $Vd[] \leftarrow \mathrm{saturate}(\lvert Vm[] \rvert)$
 else
  $Vd[] \leftarrow \lvert Vm[] \rvert$
 end if

Description: Copy the absolute value of each element of Vm to the corresponding element of Vd, optionally saturating the result.

Name: v{q}neg

Effect:

 if q is present then
  $Vd[] \leftarrow \mathrm{saturate}(-Vm[])$
 else
  $Vd[] \leftarrow -Vm[]$
 end if

Description: Copy the negation of each element of Vm to the corresponding element of Vd, optionally saturating the result.

Examples

f10-46-9780128036983
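The overflow that the saturating versions prevent occurs only for the most negative value, which has no positive counterpart in two's complement. The Python sketch below models this for signed integer lanes; the function names are hypothetical and `bits` is the lane width.

```python
def _wrap(x, bits):
    """Keep the low bits and reinterpret them as a signed lane."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x

def _sat(x, bits):
    return max(-(1 << (bits - 1)), min((1 << (bits - 1)) - 1, x))

def vabs(vm, bits):
    return [_wrap(abs(x), bits) for x in vm]

def vqabs(vm, bits):
    return [_sat(abs(x), bits) for x in vm]

def vneg(vm, bits):
    return [_wrap(-x, bits) for x in vm]

def vqneg(vm, bits):
    return [_sat(-x, bits) for x in vm]
```

For 8-bit lanes, the non-saturating forms leave −128 unchanged, while the saturating forms produce 127.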

10.9.7 Get Maximum or Minimum Elements

The following four instructions select the maximum or minimum elements and store the results in the destination vector:

vmax Maximum

vmin Minimum

vpmax Pairwise Maximum

vpmin Pairwise Minimum

Syntax

 v<op>.<type> Vd, Vn, Vm

 vp<op>.<type> Dd, Dn, Dm

 <op> is either max or min.

 <type> must be one of s8, s16, s32, u8, u16, u32, or f32.

Operations

Name: vmax

Effect:

 n ← # of elements
 for 0 ≤ i < n do
  if Vn[i] > Vm[i] then
   $Vd[i] \leftarrow Vn[i]$
  else
   $Vd[i] \leftarrow Vm[i]$
  end if
 end for

Description: Compare corresponding elements and copy the greater of each pair into the corresponding element in the destination vector.

Name: vpmax

Effect:

 n ← # of elements
 for 0 ≤ i < (n ÷ 2) do
  if Dn[2i] > Dn[2i + 1] then
   $Dd[i] \leftarrow Dn[2i]$
  else
   $Dd[i] \leftarrow Dn[2i+1]$
  end if
 end for
 for 0 ≤ i < (n ÷ 2) do
  if Dm[2i] > Dm[2i + 1] then
   $Dd[i+(n \div 2)] \leftarrow Dm[2i]$
  else
   $Dd[i+(n \div 2)] \leftarrow Dm[2i+1]$
  end if
 end for

Description: Compare adjacent elements pairwise and copy the greater of each pair into an element in the destination vector. Pairs from the first operand fill the lower half of the destination, and pairs from the second operand fill the upper half.

Name: vmin

Effect:

 n ← # of elements
 for 0 ≤ i < n do
  if Vn[i] < Vm[i] then
   $Vd[i] \leftarrow Vn[i]$
  else
   $Vd[i] \leftarrow Vm[i]$
  end if
 end for

Description: Compare corresponding elements and copy the lesser of each pair into the corresponding element in the destination vector.

Name: vpmin

Effect:

 n ← # of elements
 for 0 ≤ i < (n ÷ 2) do
  if Dn[2i] < Dn[2i + 1] then
   $Dd[i] \leftarrow Dn[2i]$
  else
   $Dd[i] \leftarrow Dn[2i+1]$
  end if
 end for
 for 0 ≤ i < (n ÷ 2) do
  if Dm[2i] < Dm[2i + 1] then
   $Dd[i+(n \div 2)] \leftarrow Dm[2i]$
  else
   $Dd[i+(n \div 2)] \leftarrow Dm[2i+1]$
  end if
 end for

Description: Compare adjacent elements pairwise and copy the lesser of each pair into an element in the destination vector. Pairs from the first operand fill the lower half of the destination, and pairs from the second operand fill the upper half.

Examples

f10-47-9780128036983
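The element-wise and pairwise forms can be contrasted with a short Python model. It is a sketch with hypothetical function names; vectors are lists with an even number of elements.

```python
def vmax(vn, vm):
    """Greater of each pair of corresponding lanes."""
    return [max(a, b) for a, b in zip(vn, vm)]

def vmin(vn, vm):
    """Lesser of each pair of corresponding lanes."""
    return [min(a, b) for a, b in zip(vn, vm)]

def vpmax(dn, dm):
    """Greater of each adjacent pair; dn fills the low half, dm the high half."""
    both = [dn, dm]
    return [max(v[i], v[i + 1]) for v in both for i in range(0, len(v), 2)]

def vpmin(dn, dm):
    """Lesser of each adjacent pair; dn fills the low half, dm the high half."""
    both = [dn, dm]
    return [min(v[i], v[i + 1]) for v in both for i in range(0, len(v), 2)]

def horizontal_max(d):
    """Reduce a 4-lane vector to its largest element with two vpmax steps."""
    step = vpmax(d, d)
    return vpmax(step, step)[0]
```

As with pairwise addition, applying the pairwise form repeatedly reduces a whole vector to a single maximum or minimum.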