Daniel Kusswurm, Modern X86 Assembly Language Programming, https://doi.org/10.1007/978-1-4842-4063-2_12

12. Advanced Vector Extensions 512

Daniel Kusswurm, Geneva, IL, USA

In the previous eight chapters, you learned about the scalar floating-point, packed floating-point, and packed integer capabilities of AVX and AVX2. In this chapter, you’ll learn about Advanced Vector Extensions 512 (AVX-512). AVX-512 is undoubtedly the largest and perhaps the most consequential extension of the x86 platform to date. It doubles the number of available SIMD registers and broadens the width of each register from 256 to 512 bits. AVX-512 also extends the instruction syntax of AVX and AVX2 to support additional capabilities not available in the earlier extensions, including conditional execution and merging, embedded broadcasts, and instruction-level rounding control for floating-point operations.

The content of this chapter is organized as follows. The first section presents a brief overview of AVX-512, which includes information about AVX-512’s various instruction set extensions. This is followed by an examination of the AVX-512 execution environment, including its register sets, data types, instruction syntaxes, and enhanced computational features. The chapter concludes with a synopsis of the AVX-512 instruction set extensions that are included in recently marketed processors for server and workstation platforms.

AVX-512 Overview

Unlike AVX and AVX2, AVX-512 is not a distinct instruction set extension. Rather, it’s a congruous collection of interrelated instruction set extensions. An x86 processor is AVX-512 conforming if it supports the AVX512F (or foundation) instruction set extension. An AVX-512 conforming processor may optionally support additional AVX-512 instruction set extensions and these vary according to the processor’s target market segment (e.g., high-performance computing, server, desktop, mobile, etc.). Table 12-1 lists the AVX-512 instruction set extensions that are currently available in some Intel processors. This table also includes the AVX-512 instruction set extensions that Intel has announced for inclusion in future processors. As of the writing of this text, AMD does not market any processors that support AVX-512.

Table 12-1. Overview of AVX-512 Instruction Set Extensions

CPUID Flag	Description
AVX512F	Foundation instructions
AVX512ER	Exponential and reciprocal instructions
AVX512PF	Prefetch instructions
AVX512CD	Conflict detect instructions
AVX512DQ	Doubleword and quadword instructions
AVX512BW	Byte and word instructions
AVX512VL	128-bit and 256-bit vector instructions
AVX512_IFMA	Integer fused-multiply-add
AVX512_VBMI	Additional vector byte instructions
AVX512_4FMAPS	Packed single-precision FMA (4 iterations)
AVX512_4VNNI	Vector neural network instructions (4 iterations)
AVX512_VPOPCNTDQ	vpopcnt[d|q] instructions
AVX512_VNNI	Vector neural network instructions
AVX512_VBMI2	New vector byte, word, doubleword, and quadword instructions
AVX512_BITALG	vpopcnt[b|w] and vpshufbitqmb instructions

The discussions in this chapter and the source code examples of Chapters 13 and 14 primarily focus on the AVX-512 instruction set extensions that are incorporated in Intel’s Skylake Server microarchitecture, which was launched during 2017. This microarchitecture is used in Intel’s Xeon Scalable (servers), Xeon W (workstations), and Core i7-7800X and i9-7900X series (high-end desktop) CPUs. Processors based on the Skylake Server microarchitecture contain the following AVX-512 instruction set extensions: AVX512F, AVX512CD, AVX512BW, AVX512DQ, and AVX512VL. Future mainstream processors from both AMD and Intel are expected to include these same AVX-512 extensions. Chapter 16 explains how to use the cpuid instruction to detect the AVX-512 instruction set extensions that are shown in Table 12-1.
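As a rough illustration of what such a detection step decodes, the following Python sketch tests feature bits in a register value. The bit positions are those documented by Intel for CPUID leaf 07H (sub-leaf 0), register EBX; they are not taken from this chapter. The names AVX512_EBX_BITS, decode_avx512_flags, and example_ebx are hypothetical, and the EBX value is constructed by hand rather than read from a processor. A complete check must also confirm that the operating system supports the ZMM and opmask register state, which this sketch omits.

```python
# Bit positions in CPUID.(EAX=07H, ECX=0):EBX for several AVX-512 feature flags
# (assumption: taken from Intel's CPUID documentation, not from this chapter).
AVX512_EBX_BITS = {
    "AVX512F": 16,
    "AVX512DQ": 17,
    "AVX512_IFMA": 21,
    "AVX512PF": 26,
    "AVX512ER": 27,
    "AVX512CD": 28,
    "AVX512BW": 30,
    "AVX512VL": 31,
}


def decode_avx512_flags(ebx):
    """Return the set of AVX-512 flags whose bit is set in the given EBX value."""
    return {name for name, bit in AVX512_EBX_BITS.items() if (ebx >> bit) & 1}


# Hypothetical EBX value with the five Skylake Server extensions set.
example_ebx = sum(
    1 << AVX512_EBX_BITS[name]
    for name in ("AVX512F", "AVX512CD", "AVX512BW", "AVX512DQ", "AVX512VL")
)

print(sorted(decode_avx512_flags(example_ebx)))
```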

AVX-512 Execution Environment

AVX-512 augments the execution environment of the x86 platform with the addition of new registers and data types. It also extends the assembly language instruction syntax of AVX and AVX2 to support enhanced operations such as conditional execution and merging, embedded broadcasts, and instruction-level rounding control. This section discusses these enhancements in greater detail.

Register Sets

Figure 12-1 illustrates the AVX-512 register sets. AVX-512 extends the width of each AVX SIMD register from 256 bits to 512 bits. The 512-bit wide registers are known as the ZMM register set. AVX-512 conforming processors include 32 ZMM registers named ZMM0–ZMM31. The YMM and XMM register sets are aliased to the low-order 256 bits and 128 bits of each ZMM register, respectively. AVX-512 processors also include eight new opmask registers named K0–K7. These registers are primarily used as predicate masks to perform conditional executions and merging operations. They can also be employed as destination operands for instructions that generate vector mask results. You’ll learn more about these registers later in this chapter.

Figure 12-1. AVX-512 register sets

Data Types

Similar to the YMM and XMM registers, software functions can use the ZMM registers to carry out SIMD operations using packed integer or packed floating-point operands. Table 12-2 shows the maximum number of elements that a ZMM register can hold for each supported data type. This table also shows the maximum number of elements that a YMM and XMM register can hold for comparison purposes.

Table 12-2. Maximum Number of Elements for AVX-512 Register Operands

Data Type	ZMM	YMM	XMM
Integer byte	64	32	16
Integer word	32	16	8
Integer doubleword	16	8	4
Integer quadword	8	4	2
Single-precision floating-point	16	8	4
Double-precision floating-point	8	4	2

The alignment requirements for 512-bit wide operands in memory are similar to other x86 SIMD operands. Except for instructions that explicitly specify an aligned operand (e.g., vmovdqa[32|64], vmovap[d|s], etc.), proper alignment of a 512-bit wide operand in memory is not mandatory. However, 512-bit wide operands should always be aligned on a 64-byte boundary whenever possible to avoid processing delays that can occur if the processor is forced to access an unaligned operand in memory. AVX-512 instructions that access 256-bit or 128-bit wide operands in memory should also ensure that these types of operands are properly aligned on their respective natural boundaries.
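The alignment rule above reduces to simple address arithmetic. The following Python sketch, with the hypothetical helper names is_aligned and align_up, checks whether an address sits on a given boundary and rounds an address up to the next boundary. It assumes the boundary is a power of two, which holds for the 16-, 32-, and 64-byte cases discussed here.

```python
def is_aligned(address, boundary):
    """Return True if address is a multiple of boundary (in bytes)."""
    return address % boundary == 0


def align_up(address, boundary):
    """Round address up to the next multiple of boundary (a power of two)."""
    return (address + boundary - 1) & ~(boundary - 1)


# A 512-bit operand wants a 64-byte boundary; a 256-bit operand wants 32 bytes.
print(is_aligned(0x1000, 64))      # True
print(is_aligned(0x1010, 64))      # False
print(hex(align_up(0x1010, 64)))   # 0x1040
```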

Instruction Syntax

AVX-512 extends the instruction syntax of AVX and AVX2. Most AVX-512 instructions can use the same three-operand instruction syntax as AVX and AVX2 instructions, which consists of two non-destructive source operands and one destination operand. AVX-512 instructions can also exploit several new optional operands. These operands facilitate conditional executions and merging, embedded broadcast operations, and floating-point rounding control. The next few sections discuss AVX-512’s optional instruction operands in greater detail.

Conditional Execution and Merging

Most AVX-512 instructions support conditional execution and merging. A conditional execution and merge operation uses the bits of an opmask register as a predicate mask to control instruction execution and destination operand updates on a per-element basis. Figure 12-2 illustrates this concept in greater detail. In this figure, registers ZMM0, ZMM1, and ZMM2 each contain 16 single-precision floating-point values. The 16 low-order bits of opmask register K1 constitute the predicate mask. When an opmask register is used in this manner, each bit controls how the result for the corresponding element position in the destination operand is calculated and updated.

Figure 12-2 also shows the outcome of three distinct executions of the vaddps instruction using the same initial values. The first example instruction, vaddps zmm2,zmm0,zmm1, performs a packed single-precision floating-point add of the elements in ZMM0 and ZMM1 and saves the resultant sums in register ZMM2. Execution of this instruction is no different than an AVX vaddps instruction that uses XMM or YMM register operands. The next example instruction, vaddps zmm2{k1},zmm0,zmm1, illustrates how the bits of opmask register K1 are used to conditionally add and update the destination operand on a per-element basis. More specifically, an element sum is calculated and saved in the destination operand only if the corresponding bit position of the opmask register is set to one; otherwise, the destination operand element position remains unchanged. This is called merge masking. The final example instruction in Figure 12-2, vaddps zmm2{k1}{z},zmm0,zmm1, is similar to the previous instruction. The extra {z} operand instructs the processor to perform zero masking instead of merge masking. Zero masking sets a destination operand element to zero if its corresponding bit position in the opmask register is set to zero; otherwise, the sum is calculated and saved.

Figure 12-2. Execution examples of the vaddps instruction using no masking, merge masking, and zero masking
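The three forms of vaddps can be mimicked with a short scalar model. The Python sketch below is an illustration of the per-element rule only, not a description of how the processor implements it. The function name vaddps_model and its parameters are hypothetical, and four elements are used instead of 16 to keep the output short.

```python
def vaddps_model(dst, a, b, k=None, zero=False):
    """Scalar model of vaddps dst{k}{z},a,b on equal-length lists.

    k=None models the unmasked form; zero=True models the {z} operand.
    """
    result = []
    for i, (old, x, y) in enumerate(zip(dst, a, b)):
        if k is None or (k >> i) & 1:
            result.append(x + y)      # mask bit is 1: compute and store the sum
        elif zero:
            result.append(0.0)        # zero masking: element is cleared
        else:
            result.append(old)        # merge masking: element is left unchanged
    return result


dst = [-1.0, -1.0, -1.0, -1.0]
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
k1 = 0b0101

print(vaddps_model(dst, a, b))                    # [11.0, 22.0, 33.0, 44.0]
print(vaddps_model(dst, a, b, k=k1))              # [11.0, -1.0, 33.0, -1.0]
print(vaddps_model(dst, a, b, k=k1, zero=True))   # [11.0, 0.0, 33.0, 0.0]
```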

At this point a few words about the opmask registers are warranted. The eight opmask registers are somewhat like the general-purpose registers. On processors that support AVX-512, each opmask register is 64 bits wide. However, when employed as a predicate mask, only the low-order bits are used during instruction execution. The exact number of used low-order bits varies depending on the number of vector elements. In Figure 12-2, bits 0–15 of opmask register K1 form the predicate mask since the vaddps instruction employs ZMM register operands that contain 16 single-precision floating-point values.
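The number of predicate bits follows directly from the vector width and the element width. The following Python sketch, using the hypothetical helpers predicate_bits and effective_mask, computes that count and discards the unused high-order bits of a 64-bit opmask value.

```python
def predicate_bits(vector_bits, element_bits):
    """Number of low-order opmask bits used for a given vector and element width."""
    return vector_bits // element_bits


def effective_mask(k, vector_bits, element_bits):
    """Keep only the opmask bits that take part in the operation."""
    n = predicate_bits(vector_bits, element_bits)
    return k & ((1 << n) - 1)


print(predicate_bits(512, 32))                # 16 (packed single-precision in a ZMM register)
print(predicate_bits(512, 8))                 # 64 (packed bytes in a ZMM register)
print(hex(effective_mask(0x12345, 512, 32)))  # 0x2345
```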

AVX-512 includes several new instructions that can be used to read values from and write values to an opmask register and perform Boolean operations. You’ll learn about these instructions later in this chapter. An opmask register can also be used as a destination operand with instructions that generate a vector mask result such as vcmpp[d|s] and vpcmp[b|w|d|q]. The source code examples in Chapters 13 and 14 illustrate how to use these instructions with an opmask register. AVX-512 instructions can use opmask registers K1–K7 as a predicate mask. Opmask register K0 cannot be employed as a predicate mask operand, but it can be used in any instruction that requires a source or destination operand opmask register. If an AVX-512 instruction attempts to use K0 as a predicate mask, the processor substitutes an implicit operand of all 1s, which disables all conditional execution and masking operations.
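Two of the points above can be sketched in Python: a compare that writes one result bit per element into a mask, and the special treatment of K0 when it appears as a predicate. The functions vpcmpd_lt_model and predicate_value are hypothetical names for a scalar model; the first mimics a signed doubleword compare with a less-than predicate.

```python
def vpcmpd_lt_model(a, b):
    """Scalar model of a signed less-than compare that yields a vector mask."""
    mask = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if x < y:
            mask |= 1 << i            # bit i records the result for element i
    return mask


def predicate_value(kregs, index, num_elements):
    """Return the predicate an instruction would see for opmask operand K<index>."""
    all_ones = (1 << num_elements) - 1
    if index == 0:
        return all_ones               # K0 as a predicate means no masking
    return kregs[index] & all_ones


kregs = [0] * 8
kregs[1] = vpcmpd_lt_model([1, 5, 3, 7], [2, 4, 6, 0])

print(bin(kregs[1]))                          # 0b101
print(hex(predicate_value(kregs, 0, 16)))     # 0xffff
print(hex(predicate_value(kregs, 1, 16)))     # 0x5
```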

Embedded Broadcast

Many AVX-512 instructions can carry out a SIMD computation using an embedded broadcast operand. An embedded broadcast operand is a memory-based scalar value that is replicated N times into a temporary packed value, where N represents the number of vector elements referenced by the instruction. This temporary packed value is then used as an operand in a SIMD calculation.

Figure 12-3 contains two example instruction sequences that illustrate broadcast operations. The first example uses the vbroadcastss instruction to load the single-precision floating-point constant 2.0 into each element position of ZMM1. The ensuing vmulps zmm2,zmm0,zmm1 instruction multiplies each value in ZMM0 by 2.0 and saves the results to ZMM2. The second example instruction in Figure 12-3, vmulps zmm2,zmm0,real4 bcst [rax], carries out this same operation using an embedded broadcast operand. The text real4 bcst is a MASM directive that instructs the assembler to treat the memory location pointed to by register RAX as an embedded broadcast operand.

Figure 12-3. Packed single-precision floating-point multiplication using the vbroadcastss and vmulps instructions versus a vmulps instruction with an embedded broadcast operand

AVX-512 supports embedded broadcast operations using 32-bit and 64-bit wide elements. Embedded broadcasts cannot be performed using 8-bit and 16-bit wide elements.
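The effect of an embedded broadcast operand can be modeled in a few lines of Python. In this sketch the hypothetical function broadcast builds the temporary packed value, and vmulps_bcst_model uses it as the second source operand of a packed multiply. It describes only the arithmetic result, not how the processor forms the temporary value.

```python
def broadcast(scalar, n):
    """Replicate a scalar n times into a temporary packed value."""
    return [scalar] * n


def vmulps_bcst_model(a, scalar):
    """Scalar model of a packed multiply whose second source is an embedded broadcast."""
    temp = broadcast(scalar, len(a))
    return [x * y for x, y in zip(a, temp)]


zmm0 = [float(i) for i in range(16)]   # 16 single-precision elements
zmm2 = vmulps_bcst_model(zmm0, 2.0)    # every element of zmm0 multiplied by 2.0
print(zmm2[:4])                        # [0.0, 2.0, 4.0, 6.0]
```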

Instruction-Level Rounding

The final AVX-512 instruction syntax enhancement involves instruction-level rounding control for floating-point operations. In Chapter 5, you learned how to use the vldmxcsr and vstmxcsr instructions to change the processor’s global rounding mode for floating-point operations (see example Ch05_06). AVX-512 allows some instructions to specify a floating-point rounding mode operand that overrides the current rounding mode in MXCSR.RC. Table 12-3 shows the supported rounding mode operands, which are also called static rounding modes. The -sae suffix that’s appended to each static rounding mode operand string is an acronym for suppress all exceptions. This suffix serves as a reminder that floating-point exceptions are always masked whenever a static rounding mode operand is specified; MXCSR flag updates are also disabled.

Table 12-3. AVX-512 Instruction-Level Static Rounding Mode Operands

Rounding Mode Operand	Description
{rn-sae}	Round to nearest
{rd-sae}	Round down (toward −∞)
{ru-sae}	Round up (toward +∞)
{rz-sae}	Round toward zero (truncate)

Static rounding mode operands can be used with many (but not all) AVX-512 instructions that perform floating-point operations using 512-bit wide packed operands; 256-bit and 128-bit wide packed operands are not supported. Static rounding mode operands can also be used with instructions that perform scalar floating-point operations. In both use cases, all instruction operands must be registers. For example, the instructions vmulps zmm2,zmm0,zmm1 {rz-sae} and vmulss xmm2,xmm0,xmm1 {rz-sae} are valid, whereas vmulps zmm2,zmm0,zmmword ptr [rax] {rz-sae} and vmulss xmm2,xmm0,real4 ptr [rax] {rz-sae} are invalid. Some AVX-512 floating-point instructions do not support the specification of a static rounding mode operand, but these instructions still can use the operand {sae} to suppress all exceptions.
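To see how the four static rounding modes differ, it can help to round the same values to an integer under each mode. The following Python sketch is only an analogy: it uses the decimal module on decimal strings, whereas the processor rounds binary floating-point values. The names STATIC_ROUNDING and round_to_integer are hypothetical, and round to nearest is modeled as round half to even.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_FLOOR, ROUND_CEILING, ROUND_DOWN

# Decimal-module analogues of the static rounding mode operands in Table 12-3.
STATIC_ROUNDING = {
    "rn-sae": ROUND_HALF_EVEN,   # round to nearest (ties to even)
    "rd-sae": ROUND_FLOOR,       # round down (toward negative infinity)
    "ru-sae": ROUND_CEILING,     # round up (toward positive infinity)
    "rz-sae": ROUND_DOWN,        # round toward zero (truncate)
}


def round_to_integer(value, mode):
    """Round a decimal string to an integer using the named rounding mode."""
    return int(Decimal(value).quantize(Decimal(1), rounding=STATIC_ROUNDING[mode]))


for mode in STATIC_ROUNDING:
    print(mode, round_to_integer("2.7", mode), round_to_integer("-2.7", mode))
```

The positive and negative inputs are both needed to tell the modes apart: rounding down and rounding toward zero agree for 2.7 but differ for -2.7.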

Instruction Set Overview

This section presents an overview of the following AVX-512 instruction set extensions: AVX512F, AVX512CD, AVX512BW, and AVX512DQ. It also includes a summary of the opmask register instructions. The tables in this section only include instructions that are new to AVX-512. They do not include instructions that are a simple promotion of an existing AVX or AVX2 instruction. Most of the instructions in these tables can be used with 512-bit wide operands; 256-bit and 128-bit wide operands can be used on processors that support AVX512VL.

AVX512F

Table 12-4 lists the AVX512F instructions. As mentioned in the overview section of this chapter, all AVX-512 conforming processors must minimally support the instructions that are included in this table.

Table 12-4. AVX512F Instruction Set Overview

Mnemonic	Description
valign[d|q]	Align doubleword or quadword vectors
vblendmp[d|s]	Blend floating-point vectors using opmask control
vbroadcastf[32x4|64x4]	Broadcast floating-point tuples
vbroadcasti[32x4|64x4]	Broadcast integer tuples
vcompressp[d|s]	Store sparse packed floating-point values
vcvtp[d|s]2udq	Convert packed floating-point to packed unsigned doubleword integers
vcvts[d|s]2usi	Convert scalar floating-point to unsigned doubleword integer
vcvttp[d|s]2udq	Convert packed floating-point to packed unsigned doubleword integers with truncation
vcvtts[d|s]2usi	Convert scalar floating-point to unsigned doubleword integer with truncation
vcvtudq2p[d|s]	Convert packed unsigned doubleword integers to packed floating-point
vcvtusi2s[d|s]	Convert unsigned doubleword integer to floating-point
vexpandp[d|s]	Load sparse packed floating-point values
vextractf[32x4|64x4]	Extract packed floating-point values
vextracti[32x4|64x4]	Extract packed integer values
vfixupimmp[d|s]	Fix up special packed floating-point values
vfixupimms[d|s]	Fix up special scalar floating-point values
vgetexpp[d|s]	Convert exponents of packed floating-point values
vgetexps[d|s]	Convert exponents of scalar floating-point values
vgetmantp[d|s]	Get normalized mantissas from packed floating-point values
vgetmants[d|s]	Get normalized mantissa from scalar floating-point value
vinsertf[32x4|64x4]	Insert packed floating-point values
vinserti[32x4|64x4]	Insert packed integer values
vmovdqa[32|64]	Move aligned packed integers
vmovdqu[32|64]	Move unaligned packed integers
vpblendm[d|q]	Blend packed integers using opmask control
vpbroadcast[d|q]	Broadcast integer from general-purpose register
vpcmp[d|q]	Compare packed signed integers
vpcmpu[d|q]	Compare packed unsigned integers
vpcompress[d|q]	Store sparse packed integers
vpermi2[d|q|ps|pd]	Permute from two tables overwriting the index
vpermt2[d|q|ps|pd]	Permute from two tables overwriting one table
vpexpand[d|q]	Load sparse packed integers
vpmax[s|u]q	Calculate packed quadword maximums
vpmin[s|u]q	Calculate packed quadword minimums
vpmov[db|sdb|usdb]	Down convert packed doublewords to packed bytes
vpmov[dw|sdw|usdw]	Down convert packed doublewords to packed words
vpmov[qb|sqb|usqb]	Down convert packed quadwords to packed bytes
vpmov[qd|sqd|usqd]	Down convert packed quadwords to packed doublewords
vpmov[qw|sqw|usqw]	Down convert packed quadwords to packed words
vprol[d|q]	Rotate left packed integers using constant count
vprolv[d|q]	Rotate left packed integers using variable counts
vpror[d|q]	Rotate right packed integers using constant count
vprorv[d|q]	Rotate right packed integers using variable counts
vpscatterd[d|q]	Scatter packed integers using doubleword indices
vpscatterq[d|q]	Scatter packed integers using quadword indices
vpsraq	Shift right arithmetic packed quadword integers using constant count
vpsravq	Shift right arithmetic packed quadword integers using variable counts
vpternlog[d|q]	Bitwise ternary logic
vptestm[d|q]	Packed integer bitwise AND and set mask
vptestnm[d|q]	Packed integer bitwise NAND and set mask
vrcp14p[d|s]	Compute approximate reciprocals of packed floating-point values
vrcp14s[d|s]	Compute approximate reciprocal of scalar floating-point value
vrndscalep[d|s]	Round packed floating-point values to number of fractional bits
vrndscales[d|s]	Round scalar floating-point value to number of fractional bits
vrsqrt14p[d|s]	Compute approximate reciprocals of packed floating-point square roots
vrsqrt14s[d|s]	Compute approximate reciprocal of scalar floating-point square root
vscalefp[d|s]	Scale packed floating-point values
vscalefs[d|s]	Scale scalar floating-point value
vscatterdp[d|s]	Scatter packed floating-point values using doubleword indices
vscatterqp[d|s]	Scatter packed floating-point values using quadword indices
vshuff[32x4|64x2]	Shuffle packed floating-point values
vshufi[32x4|64x2]	Shuffle packed integer values
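The compress and expand entries in this table are easier to follow with a small model. The Python sketch below mimics the zero-masking, register-destination behavior of a compress operation (such as vpcompressd) and an expand operation (such as vpexpandd). The function names compress_model and expand_model are hypothetical, eight elements are used for brevity, and the merge-masking and memory-operand forms are not modeled.

```python
def compress_model(src, k):
    """Pack the elements of src whose mask bit is 1 into the low positions; zero the rest."""
    selected = [x for i, x in enumerate(src) if (k >> i) & 1]
    return selected + [0] * (len(src) - len(selected))


def expand_model(src, k):
    """Place consecutive low elements of src at positions whose mask bit is 1; zero the rest."""
    out = []
    next_src = 0
    for i in range(len(src)):
        if (k >> i) & 1:
            out.append(src[next_src])   # consume the next contiguous source element
            next_src += 1
        else:
            out.append(0)
    return out


values = [10, 20, 30, 40, 50, 60, 70, 80]
k1 = 0b10100110                          # selects element positions 1, 2, 5, and 7

packed = compress_model(values, k1)
print(packed)                            # [20, 30, 60, 80, 0, 0, 0, 0]
print(expand_model(packed, k1))          # [0, 20, 30, 0, 0, 60, 0, 80]
```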

AVX512CD

Table 12-5 lists the AVX512CD instructions. These instructions are frequently used to detect and mitigate data dependencies that can occur when performing sparse array calculations or scatter operations. They can also be used with other AVX-512 instructions to perform ordinary computations.

Table 12-5. AVX512CD Instruction Set Overview

Mnemonic	Description
vpbroadcastm[b2q|w2d]	Broadcast mask to vector register
vpconflict[d|q]	Detect conflicts within packed integers
vplzcnt[d|q]	Count number of leading zeros in packed integers
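The following Python sketch gives a scalar model of two instructions from this table. For conflict detection, each result element is a bit vector in which bit j is set when an earlier element j holds the same value as the current element. The function names vpconflictd_model and vplzcntd_model are hypothetical, and the inputs are treated as 32-bit unsigned values.

```python
def vpconflictd_model(src):
    """For each element, set bit j if an earlier element j has the same value."""
    out = []
    for i, x in enumerate(src):
        bits = 0
        for j in range(i):
            if src[j] == x:
                bits |= 1 << j
        out.append(bits)
    return out


def vplzcntd_model(src):
    """Count the leading zero bits of each 32-bit element."""
    return [32 - (x & 0xFFFFFFFF).bit_length() for x in src]


indices = [7, 3, 7, 7, 3]
print(vpconflictd_model(indices))                      # [0, 0, 1, 5, 2]
print(vplzcntd_model([0, 1, 0x80000000, 0x0000FFFF]))  # [32, 31, 0, 16]
```

A nonzero conflict result marks an element whose index duplicates an earlier one, which is the situation a scatter-based update loop has to handle separately.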

AVX512BW

Table 12-6 lists the AVX512BW instructions. These instructions carry out their operations using packed byte and word operands.

Table 12-6. AVX512BW Instruction Set Overview

Mnemonic	Description
vdbpsadbw	Double block packed sum-absolute-differences using unsigned bytes
vmovdqu[8|16]	Move unaligned packed integers
vpblendm[b|w]	Blend packed integers using opmask control
vpbroadcast[b|w]	Broadcast integer from general-purpose register
vpcmp[b|w]	Compare packed signed integers
vpcmpu[b|w]	Compare packed unsigned integers
vpermw	Permute packed words
vpermi2w	Permute word integers from two tables overwriting the index
vpermt2w	Permute word integers from two tables overwriting one table
vpmov[b|w]2m	Convert vector register to mask register
vpmovm2[b|w]	Convert mask register to vector register
vpmov[wb|swb|uswb]	Down convert packed words to packed bytes
vpsllvw	Packed word shift left logical using variable bit counts
vpsravw	Packed word shift right arithmetic using variable bit counts
vpsrlvw	Packed word shift right logical using variable bit counts
vptestm[b|w]	Packed integer bitwise AND and set mask
vptestnm[b|w]	Packed integer bitwise NAND and set mask
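The conversions between mask registers and vector registers can be modeled as follows. In this Python sketch, the mask-to-vector direction sets every bit of an element when its mask bit is 1, and the vector-to-mask direction copies the most significant bit of each element into the mask. The function names vpmovm2b_model and vpmovb2m_model are hypothetical, and only byte elements are shown.

```python
def vpmovm2b_model(k, num_elements):
    """Expand each mask bit into a byte of all 1s or all 0s."""
    return [0xFF if (k >> i) & 1 else 0x00 for i in range(num_elements)]


def vpmovb2m_model(src):
    """Collect the most significant bit of each byte element into a mask."""
    mask = 0
    for i, byte in enumerate(src):
        if byte & 0x80:
            mask |= 1 << i
    return mask


print(vpmovm2b_model(0b1001, 4))                      # [255, 0, 0, 255]
print(bin(vpmovb2m_model([0x80, 0x7F, 0xFF, 0x00])))  # 0b101
```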

AVX512DQ

Table 12-7 lists the AVX512DQ instructions. These instructions carry out their operations using packed doubleword and quadword operands. AVX512DQ also includes instructions that perform conversions between packed floating-point and integer quadwords.

Table 12-7. AVX512DQ Instruction Set Overview

Mnemonic	Description
vcvtp[d|s]2qq	Convert packed floating-point to signed quadword integers
vcvtp[d|s]2uqq	Convert packed floating-point to unsigned quadword integers
vcvttp[d|s]2qq	Convert packed floating-point to signed quadword integers with truncation
vcvttp[d|s]2uqq	Convert packed floating-point to unsigned quadword integers with truncation
vcvtuqq2p[d|s]	Convert packed unsigned quadword integers to floating-point
vextractf64x2	Extract packed double-precision floating-point values
vextracti64x2	Extract packed quadword values
vfpclass[pd|ps]	Test packed floating-point class
vfpclass[sd|ss]	Test scalar floating-point class
vinsertf64x2	Insert packed double-precision floating-point values
vinserti64x2	Insert packed quadword values
vpmov[d|q]2m	Convert vector register to mask register
vpmovm2[d|q]	Convert mask register to vector register
vpmullq	Multiply packed quadword integers and store low result
vrangep[d|s]	Range restriction calculation for packed floating-point
vranges[d|s]	Range restriction calculation for scalar floating-point
vreducep[d|s]	Perform reduction on packed floating-point values
vreduces[d|s]	Perform reduction on scalar floating-point values
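The phrase "store low result" in the vpmullq entry means that each 128-bit product is truncated to its low-order 64 bits. The following Python sketch models that behavior with the hypothetical function vpmullq_model; elements are treated as 64-bit bit patterns, so the same low-order result applies to signed and unsigned interpretations.

```python
MASK64 = (1 << 64) - 1


def vpmullq_model(a, b):
    """Multiply corresponding quadwords and keep the low 64 bits of each product."""
    return [(x * y) & MASK64 for x, y in zip(a, b)]


print(vpmullq_model([3, 1 << 63, MASK64], [5, 2, MASK64]))   # [15, 0, 1]
```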

Opmask Registers

Table 12-8 lists the opmask register instructions. The word versions of these instructions require AVX512F except for kaddw and ktestw, which require AVX512DQ. The doubleword and quadword versions of the opmask register instructions require AVX512BW; the byte versions require AVX512DQ.

Table 12-8. Opmask Register Instruction Set Overview

Mnemonic	Description
kadd[b|w|d|q]	Add mask values
kand[b|w|d|q]	Bitwise AND
kandn[b|w|d|q]	Bitwise AND NOT
kmov[b|w|d|q]	Move value to/from opmask register
knot[b|w|d|q]	Bitwise NOT
kor[b|w|d|q]	Bitwise inclusive OR
kortest[b|w|d|q]	Bitwise inclusive OR; update RFLAGS.ZF and RFLAGS.CF
kshiftl[b|w|d|q]	Shift left
kshiftr[b|w|d|q]	Shift right
ktest[b|w|d|q]	Bitwise AND and ANDN; update RFLAGS.ZF and RFLAGS.CF
kunpck[bw|wd|dq]	Unpack
kxnor[b|w|d|q]	Bitwise exclusive NOR
kxor[b|w|d|q]	Bitwise exclusive OR
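Several of the word-sized opmask instructions can be modeled with ordinary integer operations restricted to 16 bits. In the Python sketch below, the function names mirror the mnemonics but are hypothetical helpers, not an assembler interface. The model of kortestw returns the pair (ZF, CF): ZF is set when the OR of the two masks is zero, and CF is set when it is all 1s.

```python
MASK16 = 0xFFFF


def kandw(a, b):
    return a & b & MASK16


def kandnw(a, b):
    return ~a & b & MASK16            # (NOT a) AND b


def knotw(a):
    return ~a & MASK16


def korw(a, b):
    return (a | b) & MASK16


def kxorw(a, b):
    return (a ^ b) & MASK16


def kxnorw(a, b):
    return ~(a ^ b) & MASK16


def kortestw(a, b):
    """Return (ZF, CF) for the OR of two 16-bit masks."""
    result = (a | b) & MASK16
    return (result == 0, result == MASK16)


print(hex(kandnw(0x00FF, 0x0FF0)))    # 0xf00
print(hex(kxnorw(0xAAAA, 0xAAAA)))    # 0xffff
print(kortestw(0xFF00, 0x00FF))       # (False, True)
```

The second line reflects a common idiom: an exclusive NOR of a register with itself yields a mask of all 1s regardless of the register's previous contents.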

Summary

Here are the key learning points for Chapter 12:

All AVX-512 conforming processors support the AVX512F instruction set extension. Inclusion of additional AVX-512 instruction set extensions varies depending on the processor’s target market.
The AVX-512 register set includes 32 512-bit wide registers named ZMM0–ZMM31. The low-order 256 and 128 bits of each ZMM register are aliased to registers YMM0–YMM31 and XMM0–XMM31, respectively.
The AVX-512 register set also includes eight opmask registers named K0–K7. Opmask registers K1–K7 can be used to perform instruction-level conditional executions with merge masking or zero masking.
Many AVX-512 instructions that require a packed operand of constant values can use an embedded broadcast operand instead of a separate broadcast instruction.
A static rounding mode operand can be specified with many AVX-512 instructions that perform floating-point operations using 512-bit wide packed or scalar floating-point register operands.
