© Daniel Kusswurm 2018
Daniel KusswurmModern X86 Assembly Language Programminghttps://doi.org/10.1007/978-1-4842-4063-2_12

12. Advanced Vector Extensions 512

Daniel Kusswurm1 
(1)
Geneva, IL, USA
 

In the previous eight chapters, you learned about the scalar floating-point, packed floating-point, and packed integer capabilities of AVX and AVX2. In this chapter, you’ll learn about Advance Vector Extensions 512 (AVX-512). AVX-512 is undoubtedly the largest and perhaps the most consequential extension of the x86 platform to date. It doubles the number of available SIMD registers and broadens the width of each register from 256 to 512 bits. AVX-512 also extends the instruction syntax of AVX and AVX2 to support additional capabilities not available in the earlier extensions, including conditional execution and merging, embedded broadcasts, and instruction-level rounding control for floating-point operations.

The content of this chapter is organized as follows. The first section presents a brief overview of AVX-512, which includes information about AVX-512’s various instruction set extensions. This is followed by an examination of the AVX-512 execution environment, including its register sets, data types, instruction syntaxes, and enhanced computational features. The chapter concludes with a synopsis of the AVX-512 instruction set extensions that are included in recently marketed processors for server and workstation platforms.

AVX-512 Overview

Unlike AVX and AVX2, AVX-512 is not a distinct instruction set extension. Rather, it’s a congruous collection of interrelated instruction set extensions. An x86 processor is AVX-512 conforming if it supports the AVX512F (or foundation) instruction set extension. An AVX-512 conforming processor may optionally support additional AVX-512 instruction set extensions and these vary according to the processor’s target market segment (e.g., high-performance computing, server, desktop, mobile, etc.). Table 12-1 lists the AVX-512 instruction set extensions that are currently available in some Intel processors. This table also includes the AVX-512 instruction set extensions that Intel has announced for inclusion in future processors. As of the writing of this text, AMD does not market any processors that support AVX-512.
Table 12-1.

Overview of AVX-512 Instruction Set Extensions

CPUID Flag

Description

AVX512F

Foundation instructions

AVX512ER

Exponential and reciprocal instructions

AVX512PF

Prefetch instructions

AVX512CD

Conflict detect instructions

AVX512DQ

Doubleword and quadword instructions

AVX512BW

Byte and word instructions

AVX512VL

128-bit and 256-bit vector instructions

AVX512_IFMA

Integer fused-multiply-add

AVX512_VBMI

Additional vector byte instructions

AVX512_4FMAPS

Packed single-precision FMA (4 iterations)

AVX512_4VNNI

Vector neural network instructions (4 iterations)

AVX512_VPOPCNTDQ

vpopcnt[d|q] instructions

AVX512_VNNI

Vector neural net instructions

AVX512_VBMI2

New vector byte, word, doubleword, and quadword instructions

AVX512_BITALG

vpopcnt[b|w] and vpshufbitqmb instructions

The discussions in this chapter and the source code examples of Chapters 13 and 14 primarily focus on the AVX-512 instruction set extensions that are incorporated in Intel’s Skylake Server microarchitecture, which was launched during 2017. This microarchitecture is used in Intel’s Xeon Scalable (servers), Xeon W (workstations), and Core i7-7800X and i9-7900X series (high-end desktop) CPUs. Processors based on the Skylake Server microarchitecture contain the following AVX-512 instruction set extensions: AVX512F, AVX512CD, AVX512BW, AVX512DQ, and AVX512VL. Future mainstream processors from both AMD and Intel are expected to include these same AVX-512 extensions. Chapter 16 explains how to use the cupid instruction to detect the AVX-512 instructions set extensions that are shown in Table 12-1.

AVX-512 Execution Environment

AVX-512 augments the execution environment of the x86 platform with the addition of new registers and data types. It also extends the assembly language instruction syntax of AVX and AVX2 to support enhanced operations such as conditional executions and merging, embedded broadcasts, and instruction level rounding control. This section discusses these enhancements in greater detail.

Register Sets

Figure 12-1 illustrates the AVX-512 register sets. AVX-512 extends the width of each AVX SIMD register from 256 bits to 512 bits. The 512-bit wide registers are known as the ZMM register set. AVX-512 conforming processors include 32 ZMM registers named ZMM0–ZMM31. The YMM and XMM register sets are aliased to the low-order 256 bits and 128 bits of each ZMM register, respectively. AVX-512 processors also include eight new opmask registers named K0–K7. These registers are primarily used as predicate masks to perform conditional executions and merging operations. They can also be employed as destination operands for instructions that generate vector mask results. You’ll learn more about these registers later in this chapter.
../images/326959_2_En_12_Chapter/326959_2_En_12_Fig1_HTML.jpg
Figure 12-1.

AVX-512 register sets

Data Types

Similar to the YMM and XMM registers, software functions can use the ZMM registers to carry out SIMD operations using packed integer or packed floating-point operands. Table 12-2 shows the maximum number of elements that a ZMM register can hold for each supported data type. This table also shows the maximum number of elements that a YMM and XMM register can hold for comparison purposes.
Table 12-2.

Maximum Number of Elements for AVX-512 Register Operands

Data Type

ZMM

YMM

XMM

Integer byte

64

32

16

Integer word

32

16

8

Integer doubleword

16

8

4

Integer quadword

8

4

2

Single-precision floating-point

16

8

4

Double-precision floating-point

8

4

2

The alignment requirements for 512-bit wide operands in memory are similar to other x86 SIMD operands. Except for instructions that explicitly specify an aligned operand (e.g., vmovdqa[32|64], vmovap[d|s], etc.), proper alignment of a 512-bit wide operand in memory is not mandatory. However, 512-bit wide operands should always be aligned on a 64-byte boundary whenever possible to avoid processing delays that can occur if the processor is forced to access an unaligned operand in memory. AVX-512 instructions that access 256-bit or 128-bit wide operands in memory should also ensure that these types of operands are properly aligned on their respective natural boundaries.

Instruction Syntax

AVX-512 extends the instruction syntax of AVX and AVX2. Most AVX-512 instructions can use the same three-operand instruction syntax as AVX and AVX2 instructions, which consists of two non-destructive source operands and one destination operand. AVX-512 instructions can also exploit several new optional operands. These operands facilitate conditional executions and merging, embedded broadcast operations, and floating-point rounding control. The next few sections discuss AVX-512’s optional instruction operands in greater detail.

Conditional Execution and Merging

Most AVX-512 instructions support conditional execution and merging. A conditional execution and merge operation uses the bits of an opmask register as a predicate mask to control instruction execution and destination operand updates on a per-element basis. Figure 12-2 illustrates this concept in greater detail. In this figure, registers ZMM0, ZMM1, and ZMM2 each contain 16 single-precision floating-point values. The 16 low-order bits of opmask register K1 constitute the predicate mask. When an opmask register is used in this manner, each bit controls how the result of corresponding element position in the destination operand is calculated and updated.

Figure 12-2 also shows the outcome of three distinct executions of the vaddps instruction using the same initial values. The first example instruction, vaddps zmm2,zmm0,zmm1, performs a packed single-precision floating-point add of the elements in ZMM0 and ZMM1 and saves the resultant sums in register ZMM2. Execution of this instruction is no different than an AVX vaddps instruction that uses XMM or YMM register operands. The next example instruction, vaddps zmm2{k1},zmm0,zmm1, illustrates how the bits of opmask register K1 are used to conditionally add and update the destination operand on a per-element basis. More specifically, an element sum is calculated and saved in the destination operand only if the corresponding bit position of the opmask register is set to one; otherwise, the destination operand element position remains unchanged. This is called merge masking. The final example instruction in Figure 12-2, vaddps zmm2{k1}{z},zmm0,zmm1, is similar to the previous instruction. The extra {z} operand instructs the processor to perform zero masking instead of merge masking. Zero masking sets a destination operand element to zero if its corresponding bit position in the opmask register is set to zero; otherwise, the sum is calculated and saved.
../images/326959_2_En_12_Chapter/326959_2_En_12_Fig2_HTML.jpg
Figure 12-2.

Execution examples of the vaddps instruction using no masking, merge masking, and zero masking

At this point a few words about the opmask registers are warranted. The eight opmask registers are somewhat like the general-purpose registers. On processors that support AVX-512, each opmask register is 64-bits wide. However, when employed as a predicate mask, only the low-order bits are used during instruction execution. The exact number of used low-order bits varies depending on the number of vector elements. In Figure 12-2, bits 0–15 of opmask register K1 form the predicate mask since the vaddps instruction employs ZMM register operands that contain 16 single-precision floating-point values.

AVX-512 includes several new instructions that can be used to read values from and write values to an opmask register and perform Boolean operations. You’ll learn about these instructions later in this chapter. An opmask register can also be used as destination operand with instructions that generate a vector mask result such as vcmpp[d|s] and vpcmp[b|w|d|q]. The source code examples in Chapters 13 and 14 illustrate how to use these instructions with an opmask register. AVX-512 instructions can use opmask registers K1–K7 as a predicate mask. Opmask register K0 cannot be employed as a predicate mask operand but it can be used in any instruction that requires a source or destination operand opmask register. If an AVX-512 instruction attempts to use K0 as a predicate mask, the processor substitutes an implicit operand of all 1s, which disables all conditional execution and masking operations.

Embedded Broadcast

Many AVX-512 instructions can carry out a SIMD computation using an embedded broadcast operand. An embedded broadcast operand is a memory-based scalar value that is replicated N times into a temporary packed value, where N represents the number of vector elements referenced by the instruction. This temporary packed value is then used as an operand in a SIMD calculation.

Figure 12-3 contains two example instruction sequences that illustrate broadcast operations. The first example uses the vbroadcastss instruction to load the single-precision floating-point constant 2.0 into each element position of ZMM1. The ensuing vmulps zmm2,zmm0,zmm1 instruction multiplies each value in ZMM0 by 2.0 and saves the results to ZMM2. The second example instruction in Figure 12-3, vmulps zmm2,zmm0,real4 bcst [rax], carries out this same operation using an embedded broadcast operand. The text real4 bcst is a MASM directive that instructs the assembler to treat the memory location pointed to by register RAX as an embedded broadcast operand.
../images/326959_2_En_12_Chapter/326959_2_En_12_Fig3_HTML.jpg
Figure 12-3.

Packed single-precision floating-point multiplication using the vbroadcastss and vmulps instructions versus a vmulps instruction with an embedded broadcast operand

AVX-512 supports embedded broadcast operations using 32-bit and 64-bit wide elements. Embedded broadcasts cannot be performed using 8-bit and 16-bit wide elements.

Instruction Level Rounding

The final AVX-512 instruction syntax enhancement involves instruction-level rounding control for floating-point operations. In Chapter 5, you learned how to use the vldmxcsr and vstmxcsr instructions to change the processor’s global rounding mode for floating-point operations (see example Ch05_06). AVX-512 allows some instructions to specify a floating-point rounding mode operand that overrides the current rounding mode in MXCSR.RC. Table 12-3 shows the supported rounding mode operands, which are also called static rounding modes. The -sae suffix that’s appended to each static rounding mode operand string is an acronym for suppress all exceptions. This suffix serves as a reminder that floating-point exceptions are always masked whenever a static rounding mode operand is specified; MXCSR flag updates are also disabled.
Table 12-3.

AVX-512 Instruction-Level Static Rounding Mode Operands

Rounding Mode Operand

Description

{rn-sae}

Round to nearest

{rd-sae}

Round down (toward −∞)

{ru-sae}

Round up (toward +∞)

{rz-sae}

Round toward zero (truncate)

Static rounding mode operands can be used with many (but not all) AVX-512 instructions that perform floating-point operations using 512-bit wide packed operands; 256-bit and 128-bit wide packed operands are not supported. Static rounding mode operands can also be used with instructions that perform scalar floating-point operations. In both use cases, all instruction operands must be registers. For example, the instructions vmulps zmm2,zmm0,zmm1 {rz-sae} and vmulss xmm2,xmm0,xmm1 {rz-sae} are valid, whereas vmulps zmm2,zmm0,zmmword ptr [rax] {rz-sae} and vmulss xmm2,xmm0,real4 ptr [rax] {rz-sae} are invalid. Some AVX-512 floating-point instructions do not support the specification of a static rounding mode operand, but these instructions still can use the operand {sae} to suppress all exceptions.

Instruction Set Overview

This section presents an overview of the following AVX-512 instruction set extensions: AVX512F, AVX512CD, AVX512BW, and AVX512DQ. It also includes a summary of the opmask register instructions. The tables in this section only include instructions that are new to AVX-512. They do not include instructions that are a simple promotion of an existing AVX or AVX2 instruction. Most of the instructions in these tables can be used with 512-bit wide operands; 256-bit and 128-bit wide operands can be used on processors that support AVX512VL.

AVX512F

Table 12-4 lists the AVX512F instructions. As mentioned in the overview section of this chapter, all AVX-512 conforming processors must minimally support the instructions that are included in this table.
Table 12-4.

AVX512F Instruction Set Overview

Mnemonic

Description

valign[d|q]

Align doubleword | quadword vectors

vblendmp[d|s]

Blend floating-point vectors using opmask control

vbroadcastf[32x4|64x4]

Broadcast floating-point tuples

vbroadcasti[32x4|64x4]

Broadcast integer tuples

vcompressp[d|s]

Store sparse packed floating-point values

vcvtp[d|s]2udq

Convert packed floating-point to packed unsigned doubleword integers

vcvts[d|s]2usi

Convert scalar floating-point to unsigned doubleword integer

vcvttp[d|s]2udq

Convert packed floating-point to packed unsigned doubleword integers with truncation

vcvtts[d|s]2usi

Convert scalar floating-point to unsigned doubleword integer with truncation

vcvtudq2p[d|s]

Convert packed unsigned doubleword integers to packed floating-point

vcvtusi2s[d|s]

Convert unsigned doubleword integer to floating-point

vexpandp[d|s]

Load sparse packed floating-point values

vextractf[32x4|64x4]

Extract packed floating-point values

vextracti[32x4|64x4]

Extract packed integer values

vfixupimmp[d|s]

Fix up special packed floating-point values

vfixupimms[d|s]

Fix up special scalar floating-point values

vgetexpp[d|s]

Convert exponents of packed floating-point values

vgetexps[d|s]

Convert exponents of scalar floating-point values

vgetmantp[d|s]

Get normalized mantissas from packed floating-point values

vgetmants[d|s]

Get normalized mantissas from scalar floating-point value

vinsertf[32x4|64x4]

Insert packed floating-point values

vinserti[32x4|64x4]

Insert packed integer values

vmovdqa[32|64]

Move aligned packed integers

vmovdqu[32|64]

Move unaligned packed integers

vpblendm[d|q]

Blend packed integers using opmask control

vpbroadcast[d|q]

Broadcast integer from general-purpose register

vpcmp[d|q]

Compare packed signed integers

vpcmpu[d|q]

Compare packed unsigned integers

vpcompress[d|q]

Store sparse packed integers

vpermi2[d|q|ps|pd]

Permute from two tables overwriting the index

vpermt2[d|q|ps|pd]

Permute from two tables overwriting one table

vpmov[db|sdb|usdb]

Down convert packed doublewords to packed bytes

vpexpand[d|q]

Load sparse packed integers

vpmax[s|u]q

Calculated packed quadword maximums

vpmin[s|u]q

Calculate packed quadword minimums

vpmov[db|sdb|usdb]

Down convert packed doublewords to packed bytes

vpmov[dw|sdw|usdw]

Down convert packed doublewords to packed words

vpmov[qb|sqb|usqb]

Down convert packed quadwords to packed bytes

vpmov[qd|sqd|usqd]

Down convert packed quadwords to packed doublewords

vpmov[qw|sqw|usqw]

Down convert packed quadwords to packed words

vprol[d|q]

Rotate left packed integers using constant count

vprolv[d|q]

Rotate left pack integers using variable counts

vpror[d|q]

Rotate right packed integers using constant count

vprorv[d|q]

Rotate right packed integers using variable counts

vpscatterd[d|q]

Scatter packed integers using doubleword indices

vpscatterq[d|q]

Scatter packed integers using quadword indices

vpsraq

Shift right arithmetic packed quadword integers using constant count

vpsravq

Shift right arithmetic packed quadword integers using variable counts

vpternlog[d|q]

Bitwise ternary logic

vptestm[d|q]

Packed integer bitwise AND and set mask

vptestnm[d|q]

Packed integer bitwise NAND and set mask

vrcp14p[d|s]

Compute approximate reciprocals of packed floating-point values

vrcp14s[d|s]

Compute approximate reciprocals of scalar floating-point value

vreducep[d|s]

Perform reduction transformation on packed floating-point values

vreduces[d|s]

Perform reduction transformation on scalar floating-point value

vrndscalep[d|s]

Round packed floating-point values to number of fractional bits

vrndscales[d|s]

Round floating-point value to number of fractional bits

vrsqrt14p[d|s]

Compute approximate reciprocals of packed floating-point square roots

vrsqrt14s[d|s]

Compute approximate reciprocals of scalar floating-point square root

vscalefp[d|s]

Scale packed floating-point values

vscalefs[d|s]

Scale scalar floating-point value

vscatterdp[d|s]

Scatter packed floating-point values using doubleword indices

vscatterqp[d|s]

Scatter packed floating-point values using quadword indices

vshuff[32x4|64x2]

Shuffle packed floating-point values

vshufi[32x4|64x2]

Shuffle packed integer values

AVX512CD

Table 12-5 lists the AVX512CD instructions. These instructions are frequently used to detect and mitigate data dependencies that can occur when performing sparse array calculations or scatter operations. They can also be used with other AVX-512 instructions to perform ordinary computations.
Table 12-5.

AVX512CD Instruction Set Overview

Mnemonic

Description

vpbroadcastm[b2q|w2d]

Broadcast mask to vector register

vpconflict[d|q]

Detect conflicts within packed integers

vplzcnt[d|q]

Count number of leading zeros in packed integers

AVX512BW

Table 12-6 lists the AVX512BW instructions. These instructions carry out their operations using packed byte and word operands.
Table 12-6.

AVX512BW Instruction Set Overview

Mnemonic

Description

vdbpsadbw

Double block packed sum-absolute-differences using unsigned bytes

vmovdq[u8|u16]

Move unaligned packed integers

vpblendm[b|w]

Blend packed integers using opmask control

vpbroadcast[b|w]

Broadcast integer from general-purpose register

vpcmp[b|w]

Compare packed signed integers

vpcmpu[b|w]

Compare packed unsigned integers

vpermw

Permute packed words

vpermi2w

Permute word integers from two tables overwriting the index

vpermt2w

Permute word integers from two tables overwriting one table

vpmov[b|w]2m

Convert vector register to mask register

vpmovm2[b|w]

Convert mask register to vector register

vpmovw[b|sb|usb]

Down convert packed words to packed bytes

vpsllvw

Packed word shift left logical using variable bit counts

vpsravw

Packed word shift right arithmetic using variable bit counts

vpsrlvw

Packed word shift right logical using variable bit counts

vptestm[b|w]

Packed integer bitwise AND and set mask

vptestnm[b|w]

Packed integer bitwise NAND and set mask

AVX512DQ

Table 12-7 lists the AVX512DQ instructions. These instructions carry out their operations using packed doubleword and quadword operands. AVX512DQ also includes instructions that perform conversions between packed floating-point and integer quadwords.
Table 12-7.

AVX512DQ Instruction Set Overview

Mnemonic

Description

vcvtp[d|s]2qq

Convert packed floating-point to signed quadword integers

vcvtp[d|s]2uqq

Convert packed floating-point to unsigned quadword integers

vcvttp[d|s]2qq

Convert packed floating-point to signed quadword integers with truncation

vcvttp[d|s]2uqq

Convert packed floating-point to unsigned quadword integers with truncation

vcvtuqq2p[d|s]

Convert packed unsigned quadword integers to floating-point

vextractf64x2

Extract packed double-precision floating-point values

vextracti64x2

Extract packed quadword values

vfpclass[pd|ps]

Test packed floating-point class

vfpclass[sd|ss]

Test scalar floating-point class

vinsertf64x2

Insert packed double-precision floating-point values

vinserti64x2

Insert packed quadword values

vpmov[d|q]2m

Convert vector register to mask register

vpmovm2[d|q]

Convert mask register to vector register

vpmullq

Multiply packed quadword integers and store low result

vrangep[d|s]

Range restriction calculation for packed floating-point

vranges[d|s]

Range restriction calculation for scalar floating-point

vreducep[d|s]

Perform reduction on packed floating-point values

vreduces[d|s]

Perform reduction on scalar floating-point values

Opmask Registers

Table 12-8 lists the opmask register instructions. The word versions of these instructions require AVX512F except for kaddw and ktestw, which require AVX512DQ. The doubleword and quadword versions of the opmask register instructions require AVX512BW; the byte versions require AVX512DQ.
Table 12-8.

Opmask Register Instruction Set Overview

Mnemonic

Description

kadd[b|w|d|q]

Add mask values

kand[b|w|d|q]

Bitwise AND

kandn[b|w|d|q]

Bitwise AND NOT

kmov[b|w|d|q]

Move value to/from opmask register

knot[b|w|d|q]

Bitwise NOT

kor[b|w|d|q]

Bitwise inclusive OR

kortest[b|w|d|q]

Bitwise inclusive OR; update RFLAGS.ZF and RFLAGS.CF

kshiftl[b|w|d|q]

Shift left

kshiftr[b|w|d|q]

Shift right

ktest[b|w|d|q]

Bitwise AND and ANDN; update RFLAGS.ZF and RFLAGS.CF

kunpck[bw|wd|dq]

Unpack

kxnor[b|w|d|q]

Bitwise exclusive NOR

kxor[b|w|d|q]

Bitwise exclusive OR

Summary

Here are the key learning points for Chapter 12:
  • All AVX-512 conforming processors support the AVX512F instruction set extension. Inclusion of additional AVX-512 instruction set extensions varies depending on the processor’s target market.

  • The AVX-512 register set includes 32 512-bit wide registers named ZMM0–ZMM31. The low-order 256 and 128 bits are aliased to registers YMM0–YMM31 and XMM0–XMM31, respectively.

  • The AVX-512 register set also includes eight opmask registers named K0–K7. Opmask registers K1–K7 can be used to perform instruction-level conditional executions with merge masking or zero masking.

  • Many AVX-512 instructions that require a packed operand of constant values can use an embedded broadcast operand instead of a separate broadcast instruction.

  • A static rounding mode operand can be specified with many AVX-512 instructions that perform floating-point operations using 512-bit wide packed or scalar floating-point register operands.