
11. AVX2 Programming – Extended Instructions


In this chapter, you learn how to use some of the instruction set extensions that were introduced in Chapter 8. The first section contains a couple of source code examples that exemplify use of the scalar and packed fused-multiply-add (FMA) instructions. The second section covers instructions that involve the general-purpose registers. This section includes source code examples that explain flagless multiplication and bit shifting. It also surveys some of the enhanced bit-manipulation instructions. The final section discusses the instructions that perform half-precision floating-point conversions.

The source code examples in the first two sections of this chapter execute correctly on most processors from AMD and Intel that support AVX2. The half-precision floating-point source code example works on AMD and Intel processors that support AVX and the F16C instruction set extension. As a reminder, you should never assume that a specific instruction set extension is available based on whether the processor supports AVX or AVX2. Production code should always test for a specific instruction set extension using the cpuid instruction. You learn how to do this in Chapter 16.
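As a preview of that discussion, the following minimal sketch shows the general idea using the Visual C++ intrinsics __cpuid and __cpuidex from <intrin.h>. The bit positions in the comments are the documented cpuid feature bits; a production-quality check (covered in Chapter 16) also confirms operating system support for the YMM register state. The function name is illustrative and not part of this chapter's source code examples.
#include <intrin.h>
// Hedged feature-detection sketch (Visual C++). Returns true if the processor
// reports the instruction set extensions used by this chapter's examples.
inline bool CpuSupportsChapter11Extensions(void)
{
  int leaf1[4], leaf7[4], leaf81[4];
  __cpuid(leaf1, 1);                          // leaf 01H
  __cpuidex(leaf7, 7, 0);                     // leaf 07H, subleaf 0
  __cpuid(leaf81, 0x80000001);                // extended leaf 80000001H
  bool fma   = (leaf1[2] & (1 << 12)) != 0;   // CPUID.01H:ECX.FMA[bit 12]
  bool f16c  = (leaf1[2] & (1 << 29)) != 0;   // CPUID.01H:ECX.F16C[bit 29]
  bool avx2  = (leaf7[1] & (1 << 5)) != 0;    // CPUID.07H:EBX.AVX2[bit 5]
  bool bmi1  = (leaf7[1] & (1 << 3)) != 0;    // CPUID.07H:EBX.BMI1[bit 3]
  bool bmi2  = (leaf7[1] & (1 << 8)) != 0;    // CPUID.07H:EBX.BMI2[bit 8]
  bool lzcnt = (leaf81[2] & (1 << 5)) != 0;   // CPUID.80000001H:ECX.LZCNT[bit 5]
  return fma && avx2 && bmi1 && bmi2 && f16c && lzcnt;
}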

FMA Programming

An FMA calculation performs a floating-point multiplication followed by a floating-point addition using a single rounding operation. Chapter 8 introduced FMA operations and discussed their particulars in greater detail. In this section, you learn how to use FMA instructions to implement discrete convolution functions. The section begins with a brief overview of convolution mathematics. The purpose of this overview is to explain just enough theory to understand the source code examples. This is followed by a source code example that implements a practical discrete convolution function using scalar FMA instructions. The section concludes with a source code example that exploits packed FMA instructions to accelerate the performance of a function that performs discrete convolutions.

Convolutions

Convolution is a mathematical operation that blends an input signal with a response signal to produce an output signal. Formally, the convolution of input signal f and response signal g is defined as follows:
$$ h(t) = \int_{-\infty}^{\infty} f\left(t-\tau\right)\, g\left(\tau\right)\, d\tau $$

where h represents the output signal. The notation f ∗ g is commonly used to denote the convolution of signals (or functions) f and g.

Convolutions are used extensively in a wide variety of scientific and engineering applications. Many signal processing and image processing techniques are based on convolution theory. In these domains, discrete arrays of sampled data points represent the input, response, and output signals. A discrete convolution can be calculated using the following equation:
$$ h[i] = \sum_{k=-M}^{M} f\left[i-k\right]\, g\left[k\right] $$

where i = 0,1,…,N − 1 and M = ⌊Ng/2⌋. In the preceding equations, N denotes the number of elements in both the input and output signal arrays and Ng symbolizes the size of the response signal array. All of the explanations and source code examples in this section assume that Ng is an odd integer greater than or equal to three. If you examine the discrete convolution equation carefully, you will notice that each element in output signal array h is computed using a relatively uncomplicated sum-of-products calculation that encompasses elements of input signal array f and response signal array g. These types of calculations are easy to implement using FMA instructions.
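Expressed directly in C++, the discrete convolution equation reduces to a short sum-of-products loop. The following sketch is illustrative only (the function name and parameters are not part of the chapter's source code examples); it assumes the input signal array f has already been padded with M extra elements on each side, a detail that is discussed shortly.
// Minimal C++ sketch of the discrete convolution equation. Assumes f points
// to a buffer padded with M = ng / 2 extra elements on each side.
void ConvolveSketch(float* h, const float* f, int n, const float* g, int ng)
{
  int m = ng / 2;                   // M = floor(Ng / 2)
  f += m;                           // skip the leading pad elements
  for (int i = 0; i < n; i++)
  {
    float sum = 0.0f;
    for (int k = -m; k <= m; k++)
      sum += f[i - k] * g[k + m];   // kernel index offset by M
    h[i] = sum;
  }
}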

In digital signal processing, many applications use smoothing operators to reduce the amount of noise that’s present in a raw signal. For example, the top plot in Figure 11-1 shows a raw data signal that contains a fair amount of noise. The bottom plot in Figure 11-1 shows the same signal following the application of a smoothing operator. In this instance, the smoothing operator convolved the original raw signal with a set of discrete coefficients that approximate a Gaussian (or low-pass) filter. These coefficients correspond to the response signal array g that’s incorporated in the discrete convolution equation. The response signal array is often called a convolution kernel or convolution mask.
Figure 11-1. Raw data signal (top plot) and its smoothed counterpart (bottom plot)

The discrete convolution equation can be implemented in source code using a couple of nested for loops. During each outer loop iteration, the convolution kernel center point g[0] is superimposed over the current input signal array element f[i]. The inner loop calculates the intermediate products, as shown in Figure 11-2. These intermediate products are then summed and saved to output signal array element h[i], which is also shown in Figure 11-2. The FMA source code examples that are presented in this section implement convolution functions using the techniques described in this paragraph.
Figure 11-2. Application of a smoothing operator with an input signal element

The purpose of the preceding overview was to provide just enough background math to understand the source code examples. Numerous tomes have been published that explain convolution and signal processing theory in significantly greater detail. Appendix A contains a list of introductory references that you can consult for additional information about convolution and signal processing theory.

Scalar FMA

Source code example Ch11_01 explains how to implement a one-dimensional discrete convolution function using scalar FMA instructions. It also elucidates the performance benefits of convolution functions that use fixed-size versus variable-size convolution kernels. Listing 11-1 shows the source code for example Ch11_01.
//------------------------------------------------
//        Ch11_01.h
//------------------------------------------------
#pragma once
// Ch11_01_Misc.cpp
extern void CreateSignal(float* x, int n, int kernel_size, unsigned int seed);
extern void PadSignal(float* x2, int n2, const float* x1, int n1, int ks2);
extern unsigned int g_RngSeedVal;
// Ch11_01.cpp
extern bool Convolve1Cpp(float* y, const float* x, int num_pts, const float* kernel, int kernel_size);
extern bool Convolve1Ks5Cpp(float* y, const float* x, int num_pts, const float* kernel, int kernel_size);
// Ch11_01_.asm
extern "C" bool Convolve1_(float* y, const float* x, int num_pts, const float* kernel, int kernel_size);
extern "C" bool Convolve1Ks5_(float* y, const float* x, int num_pts, const float* kernel, int kernel_size);
// Ch11_01_BM.cpp
extern void Convolve1_BM(void);
//------------------------------------------------
//        Ch11_01_Misc.cpp
//------------------------------------------------
#include "stdafx.h"
#include <iostream>
#include <random>
#define _USE_MATH_DEFINES
#include <math.h>
#include "Ch11_01.h"
using namespace std;
void CreateSignal(float* x, int n, int kernel_size, unsigned int seed)
{
  const float degtorad = (float)(M_PI / 180.0);
  const float t_start = 0;
  const float t_step = 0.002f;
  const int m = 3;
  const float amp[m] {1.0f, 0.80f, 1.20f};
  const float freq[m] {5.0f, 10.0f, 15.0f};
  const float phase[m] {0.0f, 45.0f, 90.0f};
  const int ks2 = kernel_size / 2;
  uniform_int_distribution<> ui_dist {0, 500};
  default_random_engine rng {seed};
  float t = t_start;
  for (int i = 0; i < n; i++, t += t_step)
  {
    float x_val = 0;
    for (int j = 0; j < m; j++)
    {
      float omega = 2.0f * (float)M_PI * freq[j];
      float x_temp1 = amp[j] * sin(omega * t + phase[j] * degtorad);
      int rand_val = ui_dist(rng);
      float noise = (float)((rand_val) - 250) / 10.0f;
      float x_temp2 = x_temp1 + x_temp1 * noise / 100.0f;
      x_val += x_temp2;
    }
    x[i] = x_val;
  }
}
extern void PadSignal(float* x2, int n2, const float* x1, int n1, int ks2)
{
  if (n2 != n1 + ks2 * 2)
    throw runtime_error("PadSignal - invalid size argument");
  for (int i = 0; i < n1; i++)
    x2[i + ks2] = x1[i];
  for (int i = 0; i < ks2; i++)
  {
    x2[i] = x1[ks2 - i - 1];
    x2[n1 + ks2 + i] = x1[n1 - i - 1];
  }
}
//------------------------------------------------
//        Ch11_01.cpp
//------------------------------------------------
#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <memory>
#include <fstream>
#include <stdexcept>
#include "Ch11_01.h"
using namespace std;
extern "C" const int c_NumPtsMin = 32;
extern "C" const int c_NumPtsMax = 16 * 1024 * 1024;
extern "C" const int c_KernelSizeMin = 3;
extern "C" const int c_KernelSizeMax = 15;
unsigned int g_RngSeedVal = 97;
void Convolve1(void)
{
  const int n1 = 512;
  const float kernel[] { 0.0625f, 0.25f, 0.375f, 0.25f, 0.0625f };
  const int ks = sizeof(kernel) / sizeof(float);
  const int ks2 = ks / 2;
  const int n2 = n1 + ks2 * 2;
  // Create signal array
  unique_ptr<float[]> x1_up {new float[n1]};
  unique_ptr<float[]> x2_up {new float[n2]};
  float* x1 = x1_up.get();
  float* x2 = x2_up.get();
  CreateSignal(x1, n1, ks, g_RngSeedVal);
  PadSignal(x2, n2, x1, n1, ks2);
  // Perform convolutions
  const int num_pts = n1;
  unique_ptr<float[]> y1_up {new float[num_pts]};
  unique_ptr<float[]> y2_up {new float[num_pts]};
  unique_ptr<float[]> y3_up {new float[num_pts]};
  unique_ptr<float[]> y4_up {new float[num_pts]};
  float* y1 = y1_up.get();
  float* y2 = y2_up.get();
  float* y3 = y3_up.get();
  float* y4 = y4_up.get();
  bool rc1 = Convolve1Cpp(y1, x2, num_pts, kernel, ks);
  bool rc2 = Convolve1_(y2, x2, num_pts, kernel, ks);
  bool rc3 = Convolve1Ks5Cpp(y3, x2, num_pts, kernel, ks);
  bool rc4 = Convolve1Ks5_(y4, x2, num_pts, kernel, ks);
  cout << "Results for Convolve1\n";
  cout << " rc1 = " << boolalpha << rc1 << '\n';
  cout << " rc2 = " << boolalpha << rc2 << '\n';
  cout << " rc3 = " << boolalpha << rc3 << '\n';
  cout << " rc4 = " << boolalpha << rc4 << '\n';
  if (!rc1 || !rc2 || !rc3 || !rc4)
    return;
  // Save data
  const char* fn = "Ch11_01_Convolve1Results.csv";
  ofstream ofs(fn);
  if (ofs.bad())
    cout << "File create error - " << fn << '\n';
  else
  {
    const char* delim = ", ";
    ofs << fixed << setprecision(7);
    ofs << "i, x1, y1, y2, y3, y4\n";
    for (int i = 0; i < num_pts; i++)
    {
      ofs << setw(5) << i << delim;
      ofs << setw(10) << x1[i] << delim;
      ofs << setw(10) << y1[i] << delim;
      ofs << setw(10) << y2[i] << delim;
      ofs << setw(10) << y3[i] << delim;
      ofs << setw(10) << y4[i] << '\n';
    }
    ofs.close();
    cout << "\nConvolution results saved to file " << fn << '\n';
  }
}
bool Convolve1Cpp(float* y, const float* x, int num_pts, const float* kernel, int kernel_size)
{
  int ks2 = kernel_size / 2;
  if ((kernel_size & 1) == 0)
    return false;
  if (kernel_size < c_KernelSizeMin || kernel_size > c_KernelSizeMax)
    return false;
  if (num_pts < c_NumPtsMin || num_pts > c_NumPtsMax)
    return false;
  x += ks2;  // x points to first signal point
  for (int i = 0; i < num_pts; i++)
  {
    float sum = 0;
    for (int k = -ks2; k <= ks2; k++)
    {
      float x_val = x[i - k];
      float kernel_val = kernel[k + ks2];
      sum += kernel_val * x_val;
    }
    y[i] = sum;
  }
  return true;
}
bool Convolve1Ks5Cpp(float* y, const float* x, int num_pts, const float* kernel, int kernel_size)
{
  int ks2 = kernel_size / 2;
  if (kernel_size != 5)
    return false;
  if (num_pts < c_NumPtsMin || num_pts > c_NumPtsMax)
    return false;
  x += ks2;  // x points to first signal point
  for (int i = 0; i < num_pts; i++)
  {
    float sum = 0;
    int j = i + ks2;
    sum += x[j] * kernel[0];
    sum += x[j - 1] * kernel[1];
    sum += x[j - 2] * kernel[2];
    sum += x[j - 3] * kernel[3];
    sum += x[j - 4] * kernel[4];
    y[i] = sum;
  }
  return true;
}
int main()
{
  int ret_val = 1;
  try
  {
    Convolve1();
    Convolve1_BM();
    ret_val = 0;
  }
  catch (runtime_error& rte)
  {
    cout << "run_time exception has occurred\n";
    cout << rte.what() << '\n';
  }
  catch (...)
  {
    cout << "Unexpected exception has occurred\n";
  }
  return ret_val;
}
;-------------------------------------------------
;        Ch11_01_.asm
;-------------------------------------------------
    include <MacrosX86-64-AVX.asmh>
    extern c_NumPtsMin:dword
    extern c_NumPtsMax:dword
    extern c_KernelSizeMin:dword
    extern c_KernelSizeMax:dword
; extern "C" bool Convolve1_(float* y, const float* x, int num_pts, const float* kernel, int kernel_size)
    .code
Convolve1_ proc frame
    _CreateFrame CV_,0,0,rbx,rsi
    _EndProlog
; Verify argument values
    xor eax,eax             ;set error code (rax is also loop index var)
    mov r10d,dword ptr [rbp+CV_OffsetStackArgs]
    test r10d,1
    jz Done               ;jump if kernel_size is even
    cmp r10d,[c_KernelSizeMin]
    jl Done               ;jump if kernel_size too small
    cmp r10d,[c_KernelSizeMax]
    jg Done               ;jump if kernel_size too big
    cmp r8d,[c_NumPtsMin]
    jl Done               ;jump if num_pts too small
    cmp r8d,[c_NumPtsMax]
    jg Done               ;jump if num_pts too big
; Perform required initializations
    mov r8d,r8d             ;r8 = num_pts
    shr r10d,1             ;ks2 = ks / 2
    lea rdx,[rdx+r10*4]         ;rdx = x + ks2 (first data point)
; Perform convolution
LP1:  vxorps xmm5,xmm5,xmm5        ;sum = 0.0;
    mov r11,r10
    neg r11               ;k = -ks2
LP2:  mov rbx,rax
    sub rbx,r11             ;rbx = i - k
    vmovss xmm0,real4 ptr [rdx+rbx*4]  ;xmm0 = x[i - k]
    mov rsi,r11
    add rsi,r10             ;rsi = k + ks2
    vfmadd231ss xmm5,xmm0,[r9+rsi*4]  ;sum += x[i - k] * kernel[k + ks2]
    add r11,1              ;k++
    cmp r11,r10
    jle LP2               ;jump if k <= ks2
    vmovss real4 ptr [rcx+rax*4],xmm5  ;y[i] = sum
    add rax,1              ;i += 1
    cmp rax,r8
    jl LP1               ;jump if i < num_pts
    mov eax,1              ;set success return code
Done:  vzeroupper
    _DeleteFrame rbx,rsi
    ret
Convolve1_ endp
; extern "C" bool Convolve1Ks5_(float* y, const float* x, int num_pts, const float* kernel, int kernel_size)
Convolve1Ks5_ proc
; Verify argument values
    xor eax,eax             ;set error code (rax is also loop index var)
    cmp dword ptr [rsp+40],5
    jne Done              ;jump if kernel_size is not 5
    cmp r8d,[c_NumPtsMin]
    jl Done               ;jump if num_pts too small
    cmp r8d,[c_NumPtsMax]
    jg Done               ;jump if num_pts too big
; Perform required initializations
    mov r8d,r8d             ;r8 = num_pts
    add rdx,8              ;x += 2
; Perform convolution
@@:   vxorps xmm4,xmm4,xmm4          ;initialize sum vars
    vxorps xmm5,xmm5,xmm5
    mov r11,rax
    add r11,2                ;j = i + ks2
    vmovss xmm0,real4 ptr [rdx+r11*4]    ;xmm0 = x[j]
    vfmadd231ss xmm4,xmm0,[r9]       ;xmm4 += x[j] * kernel[0]
    vmovss xmm1,real4 ptr [rdx+r11*4-4]   ;xmm1 = x[j - 1]
    vfmadd231ss xmm5,xmm1,[r9+4]      ;xmm5 += x[j - 1] * kernel[1]
    vmovss xmm0,real4 ptr [rdx+r11*4-8]   ;xmm0 = x[j - 2]
    vfmadd231ss xmm4,xmm0,[r9+8]      ;xmm4 += x[j - 2] * kernel[2]
    vmovss xmm1,real4 ptr [rdx+r11*4-12]  ;xmm1 = x[j - 3]
    vfmadd231ss xmm5,xmm1,[r9+12]      ;xmm5 += x[j - 3] * kernel[3]
    vmovss xmm0,real4 ptr [rdx+r11*4-16]  ;xmm0 = x[j - 4]
    vfmadd231ss xmm4,xmm0,[r9+16]      ;xmm4 += x[j - 4] * kernel[4]
    vaddps xmm4,xmm4,xmm5
    vmovss real4 ptr [rcx+rax*4],xmm4    ;save y[i]
    inc rax                 ;i += 1
    cmp rax,r8
    jl @B                  ;jump if i < num_pts
    mov eax,1                ;set success return code
Done:  vzeroupper
    ret
Convolve1Ks5_ endp
    end
Listing 11-1.

Example Ch11_01

The C++ code in Listing 11-1 begins with the header file Ch11_01.h, which contains the requisite function declarations for this example. The source code for function CreateSignal is next. This function constructs a synthetic input signal for test purposes. The synthetic input signal consists of three separate sinusoidal waveforms that are summed. Each waveform includes a small amount of random noise. The input signal generated by CreateSignal is the same signal that’s shown in the top plot of Figure 11-1.

When performing convolutions, it is often necessary to pad the input signal array with extra elements to avoid invalid memory accesses when the center point of the convolution kernel is superimposed over input signal array elements located near the beginning and end of the array. The function PadSignal creates a padded copy of input signal array x1 by reflecting the edge elements of x1 and saving these elements along with the original input signal array elements in x2. Figure 11-3 shows an example of a padded input signal array that’s compatible with a five-element convolution kernel. Note that n2, the size of the padded buffer, must equal n1 + ks2 * 2, where n1 represents the number of input signal array elements in x1 and ks2 corresponds to floor(kernel_size / 2).
Figure 11-3. Padded input signal array following execution of PadSignal using a five-element convolution kernel

The C++ function Convolve1 contains code that exercises several different implementations of the discrete convolution algorithm. Near the top of this function is a single-precision floating-point array named kernel, which contains the convolution kernel coefficients. The coefficients in kernel represent a discrete approximation of a Gaussian (or low-pass) filter. When convolved with an input signal array, these coefficients reduce the amount of noise that’s present in a signal, as shown in the bottom plot of Figure 11-1. The padded input signal array x2 is created next using the previously described functions CreateSignal and PadSignal. Note that the C++ template class unique_ptr<> is used to manage storage space for both x2 and the unpadded array x1.

Following the generation of the input signal array x2, Convolve1 allocates storage space for four output signal arrays. The functions that implement different variations of the convolution algorithms are then called. The first two functions, Convolve1Cpp and Convolve1_, contain C++ and assembly language code that carry out their convolutions using the nested for loop technique described earlier in this section. The functions Convolve1Ks5Cpp and Convolve1Ks5_ are optimized for convolution kernels containing five elements. Real-world signal processing software regularly employs convolution functions that are optimized for specific kernel sizes since they’re often significantly faster, as you will soon see.

The function Convolve1Cpp begins its execution by validating argument values kernel_size and num_pts. The next statement, x += ks2, adjusts the input signal array pointer so that it points to the first true input signal array element. Recall that the input signal array x is padded with extra values to ensure correct processing when the convolution kernel is superimposed over the first two and last two input signal elements. Following the pointer x adjustment is the actual code that performs the convolution. The nested for loops implement the discrete convolution equation that was described earlier in this section. Note that the index value used for kernel is offset by ks2 to account for the negative indices of the inner loop. Following function Convolve1Cpp is the function Convolve1Ks5Cpp, which uses explicit C++ statements to calculate convolution sum-of-products instead of a for loop.

The functions Convolve1_ and Convolve1Ks5_ are the assembly language counterparts of Convolve1Cpp and Convolve1Ks5Cpp, respectively. After its prolog, Convolve1_ validates argument values kernel_size and num_pts. This is followed by an initialization code block that begins by loading num_pts into R8. The ensuing shr r10d,1 instruction loads ks2 into register R10D. The final initialization code block instruction, lea rdx,[rdx+r10*4], loads register RDX with the address of the first input signal array element in x.

Similar to the C++ code, Convolve1_ uses two nested for loops to perform the convolution. The outer loop, which is labeled LP1, starts with a vxorps xmm5,xmm5,xmm5 instruction that sets sum equal to 0.0. The ensuing mov r11,r10 and neg r11 instructions set inner loop index counter k (R11) to -ks2. The label LP2 marks the start of the inner loop. The mov rbx,rax and sub rbx,r11 instructions calculate the index, or i – k, of the next element in x. This is followed by a vmovss xmm0,real4 ptr [rdx+rbx*4] instruction that loads x[i - k] into XMM0. Next, the mov rsi,r11 and add rsi,r10 instructions calculate k + ks2. The subsequent vfmadd231ss xmm5,xmm0,[r9+rsi*4] instruction calculates sum += x[i - k] * kernel[k + ks2]. As discussed in Chapter 8, the FMA instruction vfmadd231ss carries out its operations using a single rounding operation. Depending on the algorithm, the use of a vfmadd231ss instruction instead of an equivalent sequence of vmulss and vaddss instructions may result in slightly faster execution times.

Following the execution of the vfmadd231ss instruction, the add r11,1 instruction computes k++ and the inner loop repeats until k > ks2 is true. Subsequent to the completion of the inner loop, the vmovss real4 ptr [rcx+rax*4],xmm5 instruction saves the current sum-of-products result to y[i]. The add rax,1 instruction updates index counter i, and the outer loop LP1 repeats until all of the input signal data points have been processed.

The assembly language function Convolve1Ks5_ is size-optimized for convolution kernels that contain five elements. This function replaces the inner loop that was used in Convolve1_ with five explicit vfmadd231ss instructions. Note that this FMA instruction sequence employs two separate registers, XMM4 and XMM5, for the intermediate sums. Most Intel processors that support AVX2 and FMA can execute two scalar FMA instructions simultaneously, which accelerates the algorithm’s overall performance. Using a single sum register here would create a performance-degrading data dependency since each vfmadd231ss instruction would need to finish its operation before the next one could begin. You’ll learn more about instruction-level data dependencies and FMA execution units in Chapter 15. Here is the output for source code example Ch11_01:
Results for Convolve1
 rc1 = true
 rc2 = true
 rc3 = true
 rc4 = true
Convolution results saved to file Ch11_01_Convolve1Results.csv
Running benchmark function Convolve1_BM - please wait
Benchmark times save to file Ch11_01_Convolve1_BM_CHROMIUM.csv
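Before turning to the timing measurements, it may help to see the dual-accumulator idea used by Convolve1Ks5_ restated in C++. This is only a hedged sketch (the function name is illustrative and not part of the example's source code); as in Convolve1Ks5Cpp, x is assumed to point ks2 (= 2) elements into the padded input buffer.
// Dual-accumulator sketch: splitting the five products across two independent
// sums shortens the dependency chain between consecutive FMA operations.
void Convolve1Ks5TwoSums(float* y, const float* x, int num_pts, const float* kernel)
{
  for (int i = 0; i < num_pts; i++)
  {
    int j = i + 2;
    float sum_a = x[j] * kernel[0] + x[j - 2] * kernel[2] + x[j - 4] * kernel[4];
    float sum_b = x[j - 1] * kernel[1] + x[j - 3] * kernel[3];
    y[i] = sum_a + sum_b;
  }
}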
Table 11-1 contains mean execution times for the convolution functions presented in this section. As alluded to earlier, the size-optimized convolution functions Convolve1Ks5Cpp and Convolve1Ks5_ are considerably faster than their size-independent counterparts. Note that the performance of the C++ function Convolve1Cpp is somewhat better than its assembly language equivalent Convolve1_. The reason for this is that the Visual C++ compiler generated code that partially unrolled the inner loop and replaced it with a series of sequential scalar single-precision floating-point multiply and add instructions. I could have easily implemented the same optimization technique in the function Convolve1_, but doing so would improve performance by only a few percentage points. In order to achieve maximum FMA performance, an assembly language convolution function must use packed FMA instructions instead of scalar ones. You’ll see an example of this in the next section.
Table 11-1. Mean Execution Times (Microseconds) for Convolution Functions Using Five-Element Convolution Kernel (2,000,000 Signal Points)

CPU         Convolve1Cpp   Convolve1_   Convolve1Ks5Cpp   Convolve1Ks5_
i7-4790S    6148           6844         2926              2841
i9-7900X    5607           6072         2808              2587
i7-8700K    5149           5576         2539              2394

Packed FMA

The convolution functions discussed in the previous section are adequate for small signal arrays. In many real-world applications, however, convolutions are performed using signal arrays that contain thousands or millions of data points. For large signal arrays, the basic convolution algorithm can be adapted to use packed FMA instead of scalar FMA instructions to carry out the required calculations. Listing 11-2 shows the source code for example Ch11_02, which illustrates how to implement the discrete convolution equation using packed FMA instructions.
//------------------------------------------------
//        Ch11_02.cpp
//------------------------------------------------
#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <fstream>
#include "Ch11_02.h"
#include "AlignedMem.h"
using namespace std;
extern "C" const int c_NumPtsMin = 32;
extern "C" const int c_NumPtsMax = 16 * 1024 * 1024;
extern "C" const int c_KernelSizeMin = 3;
extern "C" const int c_KernelSizeMax = 15;
unsigned int g_RngSeedVal = 97;
void Convolve2(void)
{
  const int n1 = 512;
  const float kernel[] { 0.0625f, 0.25f, 0.375f, 0.25f, 0.0625f };
  const int ks = sizeof(kernel) / sizeof(float);
  const int ks2 = ks / 2;
  const int n2 = n1 + ks2 * 2;
  const unsigned int alignment = 32;
  // Create signal array
  AlignedArray<float> x1_aa(n1, alignment);
  AlignedArray<float> x2_aa(n2, alignment);
  float* x1 = x1_aa.Data();
  float* x2 = x2_aa.Data();
  CreateSignal(x1, n1, ks, g_RngSeedVal);
  PadSignal(x2, n2, x1, n1, ks2);
  // Perform convolutions
  AlignedArray<float> y5_aa(n1, alignment);
  AlignedArray<float> y6_aa(n1, alignment);
  AlignedArray<float> y7_aa(n1, alignment);
  float* y5 = y5_aa.Data();
  float* y6 = y6_aa.Data();
  float* y7 = y7_aa.Data();
  bool rc5 = Convolve2_(y5, x2, n1, kernel, ks);
  bool rc6 = Convolve2Ks5_(y6, x2, n1, kernel, ks);
  bool rc7 = Convolve2Ks5Test_(y7, x2, n1, kernel, ks);
  cout << "Results for Convolve2\n";
  cout << " rc5 = " << boolalpha << rc5 << '\n';
  cout << " rc6 = " << boolalpha << rc6 << '\n';
  cout << " rc7 = " << boolalpha << rc7 << '\n';
  if (!rc5 || !rc6 || !rc7)
    return;
  // Save data
  const char* fn = "Ch11_02_Convolve2Results.csv";
  ofstream ofs(fn);
  if (ofs.bad())
    cout << "File create error - " << fn << '\n';
  else
  {
    const char* delim = ", ";
    ofs << fixed << setprecision(7);
    ofs << "i, x1, y5, y6, y7\n";
    for (int i = 0; i < n1; i++)
    {
      ofs << setw(5) << i << delim;
      ofs << setw(10) << x1[i] << delim;
      ofs << setw(10) << y5[i] << delim;
      ofs << setw(10) << y6[i] << delim;
      ofs << setw(10) << y7[i];
      if (y6[i] != y7[i])
        ofs << delim << '*';
      ofs << '\n';
    }
    ofs.close();
    cout << "\nResults data saved to file " << fn << '\n';
  }
}
int main()
{
  int ret_val = 1;
  try
  {
    Convolve2();
    Convolve2_BM();
    ret_val = 0;
  }
  catch (runtime_error& rte)
  {
    cout << "run_time exception has occurred\n";
    cout << rte.what() << '\n';
  }
  catch (...)
  {
    cout << "Unexpected exception has occurred\n";
  }
  return ret_val;
}
;-------------------------------------------------
;        Ch11_02_.asm
;-------------------------------------------------
    include <MacrosX86-64-AVX.asmh>
    extern c_NumPtsMin:dword
    extern c_NumPtsMax:dword
    extern c_KernelSizeMin:dword
    extern c_KernelSizeMax:dword
; extern bool Convolve2_(float* y, const float* x, int num_pts, const float* kernel, int kernel_size)
    .code
Convolve2_ proc frame
    _CreateFrame CV2_,0,0,rbx
    _EndProlog
; Validate argument values
    xor eax,eax             ;set error code
    mov r10d,dword ptr [rbp+CV2_OffsetStackArgs]
    test r10d,1
    jz Done               ;kernel_size is even
    cmp r10d,[c_KernelSizeMin]
    jl Done               ;kernel_size too small
    cmp r10d,[c_KernelSizeMax]
    jg Done               ;kernel_size too big
    cmp r8d,[c_NumPtsMin]
    jl Done               ;num_pts too small
    cmp r8d,[c_NumPtsMax]
    jg Done               ;num_pts too big
    test r8d,7
    jnz Done              ;num_pts not even multiple of 8
    test rcx,1fh
    jnz Done              ;y is not properly aligned
; Initialize convolution loop variables
    shr r10d,1             ;r10 = kernel_size / 2 (ks2)
    lea rdx,[rdx+r10*4]         ;rdx = x + ks2 (first data point)
    xor ebx,ebx             ;i = 0
; Perform convolution
LP1:  vxorps ymm0,ymm0,ymm0        ;packed sum = 0.0;
    mov r11,r10             ;r11 = ks2
    neg r11               ;k = -ks2
LP2:  mov rax,rbx               ;rax = i
    sub rax,r11               ;rax = i - k
    vmovups ymm1,ymmword ptr [rdx+rax*4]  ;load x[i - k]:x[i - k + 7]
    mov rax,r11
    add rax,r10               ;rax = k + ks2
    vbroadcastss ymm2,real4 ptr [r9+rax*4] ;ymm2 = kernel[k + ks2]
    vfmadd231ps ymm0,ymm1,ymm2       ;ymm0 += x[i-k]:x[i-k+7] * kernel[k+ks2]
    add r11,1                ;k += 1
    cmp r11,r10
    jle LP2                 ;repeat until k > ks2
    vmovaps ymmword ptr [rcx+rbx*4],ymm0  ;save y[i]:y[i + 7]
    add rbx,8              ;i += 8
    cmp rbx,r8
    jl LP1               ;repeat until done
    mov eax,1              ;set success return code
Done:  vzeroupper
    _DeleteFrame rbx
    ret
Convolve2_ endp
; extern bool Convolve2Ks5_(float* y, const float* x, int num_pts, const float* kernel, int kernel_size)
Convolve2Ks5_ proc frame
    _CreateFrame CKS5_,0,48
    _SaveXmmRegs xmm6,xmm7,xmm8
    _EndProlog
; Validate argument values
    xor eax,eax             ;set error code (rax is also loop index var)
    cmp dword ptr [rbp+CKS5_OffsetStackArgs],5
    jne Done              ;jump if kernel_size is not 5
    cmp r8d,[c_NumPtsMin]
    jl Done               ;jump if num_pts too small
    cmp r8d,[c_NumPtsMax]
    jg Done               ;jump if num_pts too big
    test r8d,7
    jnz Done              ;num_pts not even multiple of 8
    test rcx,1fh
    jnz Done              ;y is not properly aligned
; Perform required initializations
    vbroadcastss ymm4,real4 ptr [r9]    ;kernel[0]
    vbroadcastss ymm5,real4 ptr [r9+4]   ;kernel[1]
    vbroadcastss ymm6,real4 ptr [r9+8]   ;kernel[2]
    vbroadcastss ymm7,real4 ptr [r9+12]   ;kernel[3]
    vbroadcastss ymm8,real4 ptr [r9+16]   ;kernel[4]
    mov r8d,r8d               ;r8 = num_pts
    add rdx,8                ;x += 2
; Perform convolution
@@:   vxorps ymm2,ymm2,ymm2          ;initialize sum vars
    vxorps ymm3,ymm3,ymm3
    mov r11,rax
    add r11,2                ;j = i + ks2
    vmovups ymm0,ymmword ptr [rdx+r11*4]  ;ymm0 = x[j]:x[j + 7]
    vfmadd231ps ymm2,ymm0,ymm4       ;ymm2 += x[j]:x[j + 7] * kernel[0]
    vmovups ymm1,ymmword ptr [rdx+r11*4-4] ;ymm1 = x[j - 1]:x[j + 6]
    vfmadd231ps ymm3,ymm1,ymm5       ;ymm3 += x[j - 1]:x[j + 6] * kernel[1]
    vmovups ymm0,ymmword ptr [rdx+r11*4-8] ;ymm0 = x[j - 2]:x[j + 5]
    vfmadd231ps ymm2,ymm0,ymm6       ;ymm2 += x[j - 2]:x[j + 5] * kernel[2]
    vmovups ymm1,ymmword ptr [rdx+r11*4-12] ;ymm1 = x[j - 3]:x[j + 4]
    vfmadd231ps ymm3,ymm1,ymm7       ;ymm3 += x[j - 3]:x[j + 4] * kernel[3]
    vmovups ymm0,ymmword ptr [rdx+r11*4-16] ;ymm0 = x[j - 4]:x[j + 3]
    vfmadd231ps ymm2,ymm0,ymm8       ;ymm2 += x[j - 4]:x[j + 3] * kernel[4]
    vaddps ymm0,ymm2,ymm3          ;final values
    vmovaps ymmword ptr [rcx+rax*4],ymm0  ;save y[i]:y[i + 7]
    add rax,8                ;i += 8
    cmp rax,r8
    jl @B                  ;jump if i < num_pts
    mov eax,1                ;set success return code
Done:  vzeroupper
    _RestoreXmmRegs xmm6,xmm7,xmm8
    _DeleteFrame
    ret
Convolve2Ks5_ endp
    end
Listing 11-2.

Example Ch11_02

The convolution functions in source code example Ch11_01 used single-precision floating-point signal arrays and convolution kernels. Recall that a 256-bit wide YMM register can hold eight single-precision floating-point values, which means that a SIMD implementation of the convolution algorithm can carry out eight FMA calculations simultaneously. Figure 11-4 contains two graphics that illustrate a five-element convolution kernel along with an arbitrary segment of an input signal array. Below the graphics are the equations that can be used to convolve the eight input signal points f[i]:f[i+7] with a five-element convolution kernel. These equations are a simple expansion of the discrete convolution equation that was discussed earlier in this section. Note that each column of the SIMD convolution equation set includes a single kernel value and eight consecutive elements from the input signal array. This means that a SIMD convolution function can be easily implemented using data broadcast, packed move, and packed FMA instructions, as you’ll soon see.
Figure 11-4. SIMD convolution equations for five-element convolution kernel
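For readers more comfortable with C++ intrinsics than assembly language, the following hedged sketch expresses the same broadcast-and-FMA scheme that Convolve2_ uses. The function name and the use of unaligned stores are choices made for this sketch only (the assembly code performs an aligned store); it assumes num_pts is a multiple of 8, x points to the padded input buffer, and the compiler is targeting AVX2 and FMA.
#include <immintrin.h>
// SIMD convolution sketch: broadcast one kernel coefficient, multiply it by
// eight consecutive input points, and accumulate with a packed FMA.
void Convolve2Sketch(float* y, const float* x, int num_pts, const float* kernel, int kernel_size)
{
  int ks2 = kernel_size / 2;
  x += ks2;                                                // first true signal element
  for (int i = 0; i < num_pts; i += 8)
  {
    __m256 sums = _mm256_setzero_ps();
    for (int k = -ks2; k <= ks2; k++)
    {
      __m256 x_vals = _mm256_loadu_ps(&x[i - k]);            // x[i-k]:x[i-k+7]
      __m256 k_val = _mm256_broadcast_ss(&kernel[k + ks2]);  // kernel[k+ks2] in all lanes
      sums = _mm256_fmadd_ps(x_vals, k_val, sums);           // packed FMA
    }
    _mm256_storeu_ps(&y[i], sums);                         // y[i]:y[i+7]
  }
}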

The C++ function Convolve2 is located near the top of Listing 11-2. This function creates and initializes the padded input signal array x2 using the same CreateSignal and PadSignal functions that were employed in example Ch11_01. It also uses the C++ template class AlignedArray to allocate storage for the output signal arrays y5, y6, and y7. In this example, the output signal arrays must be properly aligned since the assembly language functions use the vmovaps instruction to save calculated results. Proper alignment of the input signal arrays x1 and x2 is optional and used here for consistency. Following the allocation of the signal arrays, Convolve2 invokes the assembly language functions Convolve2_, Convolve2Ks5_, and Convolve2Ks5Test_; it then saves the output signal array data to a CSV file.
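The 32-byte alignment requirement is easy to check on the C++ side as well. The helper below is a minimal sketch (not part of the example's source code) of the same test that the assembly language functions perform with test rcx,1fh.
#include <cstdint>
// Returns true if p is aligned on a 32-byte boundary.
inline bool IsAligned32(const void* p)
{
  return (reinterpret_cast<std::uintptr_t>(p) & 0x1F) == 0;
}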

The function Convolve2_ begins its execution by validating argument values kernel_size and num_pts. It also verifies that num_pts is an even multiple of 8 and that output signal array y is properly aligned. The convolution code block of Convolve2_ employs the same nested for loop construct that was used in example Ch11_01. Each outer loop LP1 iteration begins with a vxorps ymm0,ymm0,ymm0 instruction that initializes eight single-precision floating-point sum values to 0.0. The ensuing mov r11,r10 and neg r11 instructions initialize k with the value -ks2. The inner loop starts by calculating and saving i - k in register RAX. The subsequent vmovups ymm1,ymmword ptr [rdx+rax*4] instruction loads input signal array elements x[i - k]:x[i - k + 7] into register YMM1. This is followed by a vbroadcastss ymm2,real4 ptr [r9+rax*4] instruction that broadcasts kernel[k + ks2] to each single-precision floating-point element position in YMM2. The vfmadd231ps ymm0,ymm1,ymm2 instruction multiplies each input signal array element in YMM1 by kernel[k + ks2] and adds this intermediate result to the packed sums in register YMM0. The inner loop repeats until k > ks2 is true. Following the completion of the inner loop, the vmovaps ymmword ptr [rcx+rbx*4],ymm0 instruction saves the eight convolution results to output signal array elements y[i]:y[i + 7]. Outer loop index counter i is then updated and the loop repeats until all input signal elements have been processed.

The assembly language function Convolve2Ks5_ is optimized for five-element convolution kernels. Following the requisite argument validations, a series of vbroadcastss instructions load kernel coefficients kernel[0] to kernel[4] into registers YMM4–YMM8, respectively. The two vxorps instructions located at the top of the processing loop initialize the intermediate packed sums to 0.0. The array index j = i + ks2 is then calculated and saved in register R11. The ensuing vmovups ymm0,ymmword ptr [rdx+r11*4] instruction loads input signal array elements x[j]:x[j + 7] into register YMM0. This is followed by a vfmadd231ps ymm2,ymm0,ymm4 instruction that multiplies each input signal array element in YMM0 by kernel[0]; it then adds these values to the intermediate packed sums in YMM2. Convolve2Ks5_ then uses four additional sets of vmovups and vfmadd231ps instructions to compute results using coefficients kernel[1] to kernel[4]. Similar to the function Convolve1Ks5_ in source code example Ch11_01, Convolve2Ks5_ also uses two YMM registers for its intermediate FMA sums, which facilitates simultaneous execution of two vfmadd231ps instructions on processors with dual 256-bit wide FMA execution units. Following the FMA operations, a vaddps ymm0,ymm2,ymm3 instruction combines the two sets of intermediate sums; a vmovaps ymmword ptr [rcx+rax*4],ymm0 instruction then saves the eight output signal array elements to y[i]:y[i + 7].

The assembly language code for example Ch11_02 also includes a function named Convolve2Ks5Test_. This function replaces all occurrences of the instruction vfmadd231ps with an equivalent sequence of vmulps and vaddps instructions for benchmarking and value comparison purposes, which are discussed shortly. The source code for Convolve2Ks5Test_ is not shown in Listing 11-2 but is included with the chapter download package. Here are the results for source code example Ch11_02.
Results for Convolve2
 rc5 = true
 rc6 = true
 rc7 = true
Results data saved to file Ch11_02_Convolve2Results.csv
Running benchmark function Convolve2_BM - please wait
Benchmark times save to file Ch11_02_Convolve2_BM_CHROMIUM.csv
Table 11-2 shows the benchmark timing measurements for the SIMD implementations of the convolution functions. As expected, the SIMD versions are significantly faster than their non-SIMD counterparts (see Table 11-1). The mean execution times for the five-element convolution kernel functions Convolve2Ks5_ and Convolve2Ks5Test_ are essentially the same.
Table 11-2. Mean Execution Times (Microseconds) for SIMD Convolution Functions Using Five-Element Convolution Kernel (2,000,000 Signal Points)

CPU         Convolve2_   Convolve2Ks5_   Convolve2Ks5Test_
i7-4790S    1244         1067            1074
i9-7900X    956          719             709
i7-8700K    859          595             597

It is not uncommon for small value discrepancies to occur between a function that uses FMA instructions and an equivalent function that uses distinct multiply and add instructions. This is confirmed by comparing the output results of functions Convolve2Ks5_ and Convolve2Ks5Test_. Table 11-3 shows a few examples of value discrepancies from the output file Ch11_02_Convolve2Results.csv. In a real-world application, the magnitudes of these value discrepancies would most likely be inconsequential. However, the potential for value discrepancies is something that you should always keep in mind if you’re developing production code that includes both FMA and non-FMA versions of the same function, especially for functions that perform numerous FMA operations.
Table 11-3. Examples of Value Discrepancies Using FMA and Non-FMA Instruction Sequences

Index   x[]          Convolve2Ks5_   Convolve2Ks5Test_
33       1.3856432    1.1940877       1.1940879
108      1.3655651    1.4466031       1.4466029
180     -2.8778596   -2.7348523      -2.7348526
277     -1.7654022   -2.0587211      -2.0587208
403      2.0683382    2.0299273       2.0299270
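The discrepancies in Table 11-3 are a direct consequence of the single rounding step that an FMA performs. The following hedged C++ sketch shows the same effect at the scalar level using std::fma from <cmath>; note that some compiler optimization settings are permitted to contract a * b + c into an FMA on their own, in which case the two results would match.
#include <cmath>
#include <cstdio>
// fused rounds once; separate rounds the product and then the sum.
void CompareFmaRounding(float a, float b, float c)
{
  float fused = std::fma(a, b, c);
  float separate = a * b + c;
  if (fused != separate)
    std::printf("fused = %.9g  separate = %.9g\n", fused, separate);
}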

General-Purpose Register Instructions

As discussed in Chapter 8, several general-purpose register instruction set extensions have been added to the x86 platform in recent years (see Tables 8-2 and 8-5). In this section, you learn how to use some of these instructions. The first source code example illustrates how to use the flagless multiplication and shift instructions. A flagless instruction executes its operation without modifying any of the status flags in RFLAGS, which can be faster than an equivalent flag-based instruction depending on the specific use case. The second source code example demonstrates several advanced bit-manipulation instructions. The source code examples of this section require a processor that supports the BMI1, BMI2, and LZCNT instruction set extensions.

Flagless Multiplication and Shifts

Listing 11-3 shows the source code for example Ch11_03. This example demonstrates how to use the flagless unsigned integer multiplication instruction mulx. It also explains how to use the flagless shift instructions sarx, shlx, and shrx.
//------------------------------------------------
//        Ch11_03.cpp
//------------------------------------------------
#include "stdafx.h"
#include <cstdint>
#include <iostream>
#include <iomanip>
#include <sstream>
using namespace std;
extern "C" uint64_t GprMulx_(uint32_t a, uint32_t b, uint64_t flags[2]);
extern "C" void GprShiftx_(uint32_t x, uint32_t count, uint32_t results[3], uint64_t flags[4]);
string ToString(uint64_t flags)
{
  ostringstream oss;
  oss << "OF=" << ((flags & (1ULL << 11)) ? '1' : '0') << ' ';
  oss << "SF=" << ((flags & (1ULL << 7)) ? '1' : '0') << ' ';
  oss << "ZF=" << ((flags & (1ULL << 6)) ? '1' : '0') << ' ';
  oss << "PF=" << ((flags & (1ULL << 2)) ? '1' : '0') << ' ';
  oss << "CF=" << ((flags & (1ULL << 0)) ? '1' : '0') << ' ';
  return oss.str();
}
void GprMulx(void)
{
  const int n = 3;
  uint32_t a[n] = {64, 3200, 100000000};
  uint32_t b[n] = {1001, 12, 250000000};
  cout << "\nResults for AvxGprMulx\n";
  for (int i = 0; i < n; i++)
  {
    uint64_t flags[2];
    uint64_t c = GprMulx_(a[i], b[i], flags);
    cout << "\nTest case " << i << '\n';
    cout << " a: " << a[i] << " b: " << b[i] << " c: " << c << '\n';
    cout << setfill ('0') << hex;
    cout << " status flags before mulx: " << ToString(flags[0]) << '\n';
    cout << " status flags after mulx: " << ToString(flags[1]) << '\n';
    cout << setfill (' ') << dec;
  }
}
void GprShiftx(void)
{
  const int n = 4;
  uint32_t x[n] = { 0x00000008, 0x80000080, 0x00000040, 0xfffffc10 };
  uint32_t count[n] = { 2, 5, 3, 4 };
  cout << "\nResults for AvxGprShiftx\n";
  for (int i = 0; i < n; i++)
  {
    uint32_t results[3];
    uint64_t flags[4];
    GprShiftx_(x[i], count[i], results, flags);
    cout << setfill(' ') << dec;
    cout << "\nTest case " << i << '\n';
    cout << setfill('0') << hex << " x:  0x" << setw(8) << x[i] << " (";
    cout << setfill(' ') << dec << x[i] << ") count: " << count[i] << '\n';
    cout << setfill('0') << hex << " sarx: 0x" << setw(8) << results[0] << " (";
    cout << setfill(' ') << dec << results[0] << ")\n";
    cout << setfill('0') << hex << " shlx: 0x" << setw(8) << results[1] << " (";
    cout << setfill(' ') << dec << results[1] << ")\n";
    cout << setfill('0') << hex << " shrx: 0x" << setw(8) << results[2] << " (";
    cout << setfill(' ') << dec << results[2] << ")\n";
    cout << " status flags before shifts: " << ToString(flags[0]) << '\n';
    cout << " status flags after sarx:  " << ToString(flags[1]) << '\n';
    cout << " status flags after shlx:  " << ToString(flags[2]) << '\n';
    cout << " status flags after shrx:  " << ToString(flags[3]) << '\n';
  }
}
int main()
{
  GprMulx();
  GprShiftx();
  return 0;
}
;-------------------------------------------------
;        Ch11_03_.asm
;-------------------------------------------------
; extern "C" uint64_t GprMulx_(uint32_t a, uint32_t b, uint64_t flags[2]);
;
; Requires   BMI2
    .code
GprMulx_ proc
; Save copy of status flags before mulx
    pushfq
    pop rax
    mov qword ptr [r8],rax       ;save original status flags
; Perform flagless multiplication. The mulx instruction below computes
; the product of explicit source operand ecx (a) and implicit source
; operand edx (b). The 64-bit result is saved to the register pair r11d:r10d.
    mulx r11d,r10d,ecx         ;r11d:r10d = a * b
; Save copy of status flags after mulx
    pushfq
    pop rax
    mov qword ptr [r8+8],rax      ;save post mulx status flags
; Move 64-bit result to rax
    mov eax,r10d
    shl r11,32
    or rax,r11
    ret
GprMulx_ endp
; extern "C" void GprShiftx_(uint32_t x, uint32_t count, uint32_t results[3], uint64_t flags[4])
;
; Requires   BMI2
GprShiftx_ proc
; Save copy of status flags before shifts
    pushfq
    pop rax
    mov qword ptr [r9],rax       ;save original status flags
; Load argument values and perform shifts. Note that each shift
; instruction requires three operands: DesOp, SrcOp, and CountOp.
    sarx eax,ecx,edx          ;shift arithmetic right
    mov dword ptr [r8],eax
    pushfq
    pop rax
    mov qword ptr [r9+8],rax
    shlx eax,ecx,edx          ;shift logical left
    mov dword ptr [r8+4],eax
    pushfq
    pop rax
    mov qword ptr [r9+16],rax
    shrx eax,ecx,edx          ;shift logical right
    mov dword ptr [r8+8],eax
    pushfq
    pop rax
    mov qword ptr [r9+24],rax
    ret
GprShiftx_ endp
    end
Listing 11-3.

Example Ch11_03

The C++ code for example Ch11_03 contains two functions named GprMulx and GprShiftx. These functions initialize test cases that demonstrate flagless multiplication and shift operations, respectively. Note that both GprMulx and GprShiftx define an array of type uint64_t named flags. This array is used to show the contents of the status flags in RFLAGS before and after the execution of each flagless instruction. The remaining code in both GprMulx and GprShiftx formats and streams results to cout.

The assembly language function GprMulx begins its execution by saving a copy of RFLAGS. The ensuing mulx r11d,r10d,ecx instruction performs a 32-bit unsigned integer multiplication using implicit source operand EDX (argument value b) with explicit source operand ECX (argument value a). The 64-bit product is then saved in register pair R11D:R10D. Following the execution of the mulx instruction, the contents of RFLAGS are saved again for comparison purposes. The mulx instruction also supports flagless multiplications using 64-bit operands. When used with 64-bit operands, register RDX is employed as the implicit operand and two 64-bit general-purpose registers must be used for the destination operands.
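At the C++ level, the arithmetic that mulx carries out is simply a widening unsigned multiplication, as the following sketch shows. Whether a compiler actually emits mulx for this expression depends on its code generation settings, and the function name is illustrative only.
#include <cstdint>
// 32 x 32 -> 64-bit unsigned product, the same result that mulx produces.
inline uint64_t WideningMul(uint32_t a, uint32_t b)
{
  return static_cast<uint64_t>(a) * static_cast<uint64_t>(b);
}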

The function GprShiftx_ includes examples of the sarx, shlx, and shrx instructions using 32-bit wide operands. These instructions use a three-operand syntax similar to AVX instructions. The first source operand is shifted by the count value that’s specified in the second source operand. The result is then saved to the destination operand. The flagless shift instructions can also be used with 64-bit wide operands; 8- and 16-bit wide operands are not supported. Here is the output for source code example Ch11_03:
Results for AvxGprMulx
Test case 0
 a: 64 b: 1001 c: 64064
 status flags before mulx: OF=0 SF=0 ZF=1 PF=1 CF=0
 status flags after mulx: OF=0 SF=0 ZF=1 PF=1 CF=0
Test case 1
 a: 3200 b: 12 c: 38400
 status flags before mulx: OF=0 SF=1 ZF=0 PF=0 CF=1
 status flags after mulx: OF=0 SF=1 ZF=0 PF=0 CF=1
Test case 2
 a: 100000000 b: 250000000 c: 25000000000000000
 status flags before mulx: OF=0 SF=1 ZF=0 PF=1 CF=1
 status flags after mulx: OF=0 SF=1 ZF=0 PF=1 CF=1
Results for AvxGprShiftx
Test case 0
 x:  0x00000008 (8) count: 2
 sarx: 0x00000002 (2)
 shlx: 0x00000020 (32)
 shrx: 0x00000002 (2)
 status flags before shifts: OF=0 SF=0 ZF=1 PF=1 CF=0
 status flags after sarx:  OF=0 SF=0 ZF=1 PF=1 CF=0
 status flags after shlx:  OF=0 SF=0 ZF=1 PF=1 CF=0
 status flags after shrx:  OF=0 SF=0 ZF=1 PF=1 CF=0
Test case 1
 x:  0x80000080 (2147483776) count: 5
 sarx: 0xfc000004 (4227858436)
 shlx: 0x00001000 (4096)
 shrx: 0x04000004 (67108868)
 status flags before shifts: OF=0 SF=1 ZF=0 PF=0 CF=1
 status flags after sarx:  OF=0 SF=1 ZF=0 PF=0 CF=1
 status flags after shlx:  OF=0 SF=1 ZF=0 PF=0 CF=1
 status flags after shrx:  OF=0 SF=1 ZF=0 PF=0 CF=1
Test case 2
 x:  0x00000040 (64) count: 3
 sarx: 0x00000008 (8)
 shlx: 0x00000200 (512)
 shrx: 0x00000008 (8)
 status flags before shifts: OF=0 SF=1 ZF=0 PF=0 CF=1
 status flags after sarx:  OF=0 SF=1 ZF=0 PF=0 CF=1
 status flags after shlx:  OF=0 SF=1 ZF=0 PF=0 CF=1
 status flags after shrx:  OF=0 SF=1 ZF=0 PF=0 CF=1
Test case 3
 x:  0xfffffc10 (4294966288) count: 4
 sarx: 0xffffffc1 (4294967233)
 shlx: 0xffffc100 (4294951168)
 shrx: 0x0fffffc1 (268435393)
 status flags before shifts: OF=0 SF=1 ZF=0 PF=1 CF=1
 status flags after sarx:  OF=0 SF=1 ZF=0 PF=1 CF=1
 status flags after shlx:  OF=0 SF=1 ZF=0 PF=1 CF=1
 status flags after shrx:  OF=0 SF=1 ZF=0 PF=1 CF=1

Enhanced Bit Manipulation

Most of the instructions included in the BMI1 and BMI2 instruction set extensions are geared toward improving the performance of specific algorithms, such as data encryption and decryption. They also can be employed to simplify bit-manipulation operations in more mundane algorithms. Source code example Ch11_04 includes three simple assembly language functions that demonstrate use of the enhanced bit-manipulation instructions lzcnt, tzcnt, bextr, and andn. Listing 11-4 shows the C++ and assembly language source code for this example.
//------------------------------------------------
//        Ch11_04.cpp
//------------------------------------------------
#include "stdafx.h"
#include <cstdint>
#include <iostream>
#include <iomanip>
using namespace std;
extern "C" void GprCountZeroBits_(uint32_t x, uint32_t* lzcnt, uint32_t* tzcnt);
extern "C" uint32_t GprBextr_(uint32_t x, uint8_t start, uint8_t length);
extern "C" uint32_t GprAndNot_(uint32_t x, uint32_t y);
void GprCountZeroBits(void)
{
  const int n = 5;
  uint32_t x[n] = { 0x01000008, 0x00008000, 0x08000000, 0x00000001, 0 };
  cout << "\nResults for AvxGprCountZeroBits\n";
  for (int i = 0; i < n; i++)
  {
    uint32_t lzcnt, tzcnt;
    GprCountZeroBits_(x[i], &lzcnt, &tzcnt);
    cout << setfill('0') << hex;
    cout << "x: 0x" << setw(8) << x[i] << " ";
    cout << setfill(' ') << dec;
    cout << "lzcnt: " << setw(3) << lzcnt << " ";
    cout << "tzcnt: " << setw(3) << tzcnt << '\n';
  }
}
void GprExtractBitField(void)
{
  const int n = 3;
  uint32_t x[n] = { 0x12345678, 0x80808080, 0xfedcba98 };
  uint8_t start[n] = { 4, 7, 24 };
  uint8_t len[n] = { 16, 9, 8 };
  cout << "\nResults for GprExtractBitField\n";
  for (int i = 0; i < n; i++)
  {
    uint32_t bextr = GprBextr_(x[i], start[i], len[i]);
    cout << setfill('0') << hex;
    cout << "x: 0x" << setw(8) << x[i] << " ";
    cout << setfill(' ') << dec;
    cout << "start: " << setw(3) << (uint32_t)start[i] << " ";
    cout << "len:  " << setw(3) << (uint32_t)len[i] << " ";
    cout << setfill('0') << hex;
    cout << "bextr: 0x" << setw(8) << bextr << '\n';
  }
}
void GprAndNot(void)
{
  const int n = 3;
  uint32_t x[n] = { 0xf000000f, 0xff00ff00, 0xaaaaaaaa };
  uint32_t y[n] = { 0x12345678, 0x12345678, 0xffaa5500 };
  cout << "\nResults for GprAndNot\n";
  for (int i = 0; i < n; i++)
  {
    uint32_t andn = GprAndNot_(x[i], y[i]);
    cout << setfill('0') << hex;
    cout << "x: 0x" << setw(8) << x[i] << " ";
    cout << "y: 0x" << setw(8) << y[i] << " ";
    cout << "andn: 0x" << setw(8) << andn << '\n';
  }
}
int main()
{
  GprCountZeroBits();
  GprExtractBitField();
  GprAndNot();
  return 0;
}
;-------------------------------------------------
;        Ch11_04_.asm
;-------------------------------------------------
; extern "C" void GprCountZeroBits_(uint32_t x, uint32_t* lzcnt, uint32_t* tzcnt);
;
; Requires:   BMI1, LZCNT
    .code
GprCountZeroBits_ proc
    lzcnt eax,ecx            ;count leading zeros
    mov dword ptr [rdx],eax       ;save result
    tzcnt eax,ecx            ;count trailing zeros
    mov dword ptr [r8],eax       ;save result
    ret
GprCountZeroBits_ endp
; extern "C" uint32_t GprBextr_(uint32_t x, uint8_t start, uint8_t length);
;
; Requires:   BMI1
GprBextr_ proc
    mov al,r8b
    mov ah,al              ;ah = length
    mov al,dl              ;al = start
    bextr eax,ecx,eax          ;eax = extracted bit field (from x)
    ret
GprBextr_ endp
; extern "C" uint32_t GprAndNot_(uint32_t x, uint32_t y);
;
; Requires:   BMI1
GprAndNot_ proc
    andn eax,ecx,edx          ;eax = ∼x & y
    ret
GprAndNot_ endp
    end
Listing 11-4.

Example Ch11_04

The C++ code in Listing 11-4 contains three short functions that set up test cases for the assembly language functions. The first function, GprCountZeroBits, initializes a test array that’s used to demonstrate the lzcnt (Count the Number of Leading Zero Bits) and tzcnt (Count the Number of Trailing Zero Bits) instructions. The second function, GprExtractBitField, prepares test data for the bextr (Bit Field Extract) instruction. The final C++ function in Listing 11-4 is named GprAndNot. This function loads test arrays with data that’s used to illuminate execution of the andn (Bitwise AND NOT) instruction.

The first assembly language function is named GprCountZeroBits_. This function uses the lzcnt and tzcnt instructions to count the number of leading and trailing zero bits in their respective 32-bit wide source operands. The calculated bit counts are then saved to the specified destination operands. The next function, GprBextr_, exercises the bextr instruction. This instruction’s first source operand contains the data from which the bit field will be extracted. Bits 7:0 and 15:8 of the second source operand specify the field’s starting bit position and length, respectively. Finally, the function GprAndNot_ shows how to use the andn instruction. This instruction computes DesOp = ∼SrcOp1 & SrcOp2 and is often employed to simplify Boolean masking operations. Here are the results for source code example Ch11_04.
Results for AvxGprCountZeroBits
x: 0x01000008 lzcnt:  7 tzcnt:  3
x: 0x00008000 lzcnt: 16 tzcnt: 15
x: 0x08000000 lzcnt:  4 tzcnt: 27
x: 0x00000001 lzcnt: 31 tzcnt:  0
x: 0x00000000 lzcnt: 32 tzcnt: 32
Results for GprExtractBitField
x: 0x12345678 start:  4 len:  16 bextr: 0x00004567
x: 0x80808080 start:  7 len:   9 bextr: 0x00000101
x: 0xfedcba98 start: 24 len:   8 bextr: 0x000000fe
Results for GprAndNot
x: 0xf000000f y: 0x12345678 andn: 0x02345670
x: 0xff00ff00 y: 0x12345678 andn: 0x00340078
x: 0xaaaaaaaa y: 0xffaa5500 andn: 0x55005500

The BMI1 and BMI2 instruction set extensions also include other enhanced bit-manipulation instructions that can be used to implement specific algorithms or carry out specialized operations. These instructions are listed in Table 8-5.
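Compiler intrinsics are also available for these instructions. The sketch below shows counterparts of this example's assembly language functions using the intrinsic names listed in Intel's intrinsics guide (_lzcnt_u32, _tzcnt_u32, _bextr_u32, and _andn_u32); header requirements and exact availability vary by compiler, so treat this as a sketch rather than a portable reference.
#include <cstdint>
#include <immintrin.h>
inline void CountZeroBits(uint32_t x, uint32_t* lz, uint32_t* tz)
{
  *lz = _lzcnt_u32(x);                  // leading zero count (LZCNT)
  *tz = _tzcnt_u32(x);                  // trailing zero count (BMI1)
}
inline uint32_t ExtractBitField(uint32_t x, uint32_t start, uint32_t len)
{
  return _bextr_u32(x, start, len);     // bit field extract (BMI1)
}
inline uint32_t AndNot(uint32_t x, uint32_t y)
{
  return _andn_u32(x, y);               // ~x & y (BMI1)
}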

Half-Precision Floating-Point Conversions

The final source code example of this chapter, Ch11_05, exemplifies use of the half-precision conversion instructions vcvtps2ph and vcvtph2ps. Listing 11-5 shows the source code for this example. If your understanding of the half-precision floating-point data type is incomplete, you may want to review the content in Chapter 8 before perusing this section’s source code and elucidations.
//------------------------------------------------
//        Ch11_05.cpp
//------------------------------------------------
#include "stdafx.h"
#include <cstdint>
#include <string>
#include <iostream>
#include <iomanip>
using namespace std;
extern "C" void SingleToHalfPrecision_(uint16_t x_hp[8], float x_sp[8], int rc);
extern "C" void HalfToSinglePrecision_(float x_sp[8], uint16_t x_hp[8]);
int main()
{
  float x[8];
  x[0] = 4.125f;
  x[1] = 32.9f;
  x[2] = 56.3333f;
  x[3] = -68.6667f;
  x[4] = 42000.5f;
  x[5] = 75600.0f;
  x[6] = -6002.125f;
  x[7] = 170.0625f;
  uint16_t x_hp[8];
  float rn[8], rd[8], ru[8], rz[8];
  SingleToHalfPrecision_(x_hp, x, 0);
  HalfToSinglePrecision_(rn, x_hp);
  SingleToHalfPrecision_(x_hp, x, 1);
  HalfToSinglePrecision_(rd, x_hp);
  SingleToHalfPrecision_(x_hp, x, 2);
  HalfToSinglePrecision_(ru, x_hp);
  SingleToHalfPrecision_(x_hp, x, 3);
  HalfToSinglePrecision_(rz, x_hp);
  unsigned int w = 15;
  string line(76, '-');
  cout << fixed << setprecision(4);
  cout << setw(w) << "x";
  cout << setw(w) << "RoundNearest";
  cout << setw(w) << "RoundDown";
  cout << setw(w) << "RoundUp";
  cout << setw(w) << "RoundZero";
  cout << '\n' << line << '\n';
  for (int i = 0; i < 8; i++)
  {
    cout << setw(w) << x[i];
    cout << setw(w) << rn[i];
    cout << setw(w) << rd[i];
    cout << setw(w) << ru[i];
    cout << setw(w) << rz[i];
    cout << '\n';
  }
  return 0;
}
;-------------------------------------------------
;        Ch11_05_.asm
;-------------------------------------------------
; extern "C" void SingleToHalfPrecision_(uint16_t x_hp[8], float x_sp[8], int rc);
    .code
SingleToHalfPrecision_ proc
; Convert packed single-precision to packed half-precision
    vmovups ymm0,ymmword ptr [rdx]     ;ymm0 = 8 SPFP values
    cmp r8d,0
    jne @F
    vcvtps2ph xmm1,ymm0,0          ;round to nearest
    jmp SaveResult
@@:   cmp r8d,1
    jne @F
    vcvtps2ph xmm1,ymm0,1          ;round down
    jmp SaveResult
@@:   cmp r8d,2
    jne @F
    vcvtps2ph xmm1,ymm0,2          ;round up
    jmp SaveResult
@@:   cmp r8d,3
    jne @F
    vcvtps2ph xmm1,ymm0,3          ;truncate
    jmp SaveResult
@@:   vcvtps2ph xmm1,ymm0,4          ;use MXCSR.RC
SaveResult:
    vmovdqu xmmword ptr [rcx],xmm1     ;save 8 HPFP values
    vzeroupper
    ret
SingleToHalfPrecision_ endp
; extern "C" void HalfToSinglePrecision_(float x_sp[8], uint16_t x_hp[8]);
HalfToSinglePrecision_ proc
; Convert packed half-precision to packed single-precision
    vcvtph2ps ymm0,xmmword ptr [rdx]
    vmovups ymmword ptr [rcx],ymm0     ;save 8 SPFP values
    vzeroupper
    ret
HalfToSinglePrecision_ endp
    end
Listing 11-5.

Example Ch11_05

The C++ function main starts its execution by loading single-precision floating-point test values into array x. It then exercises the half-precision floating-point conversion functions SingleToHalfPrecision_ and HalfToSinglePrecision_. Note that the function SingleToHalfPrecision_ requires a third argument, which specifies the rounding mode to use when converting floating-point values from single-precision to half-precision. Also note that an array of type uint16_t is used to store the half-precision floating-point results since C++ does not natively support a half-precision floating-point data type.
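The same conversions can be performed from C++ with the F16C intrinsics _mm256_cvtps_ph and _mm256_cvtph_ps. The sketch below is illustrative only: the rounding-mode argument of _mm256_cvtps_ph must be a compile-time constant, so round-to-nearest is hardcoded here, and the function names are not part of the example's source code.
#include <cstdint>
#include <immintrin.h>
// Convert eight single-precision values to half-precision (round to nearest).
inline void SingleToHalfRn(uint16_t x_hp[8], const float x_sp[8])
{
  __m256 sp = _mm256_loadu_ps(x_sp);
  __m128i hp = _mm256_cvtps_ph(sp, _MM_FROUND_TO_NEAREST_INT);    // vcvtps2ph
  _mm_storeu_si128(reinterpret_cast<__m128i*>(x_hp), hp);
}
// Convert eight half-precision values back to single-precision.
inline void HalfToSingle(float x_sp[8], const uint16_t x_hp[8])
{
  __m128i hp = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x_hp));
  _mm256_storeu_ps(x_sp, _mm256_cvtph_ps(hp));                    // vcvtph2ps
}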

The assembly language function SingleToHalfPrecision_ uses the vcvtps2ph instruction to convert the eight single-precision floating-point values in array x_sp to half-precision floating-point. This instruction requires an immediate operand that specifies the rounding mode to use during the type conversion. Table 11-4 shows the rounding mode options for the vcvtps2ph instruction.
Table 11-4. Rounding Mode Options for the vcvtps2ph Instruction

Immediate Operand Bits   Value     Description
1:0                      00b       Round to nearest
                         01b       Round down (toward -∞)
                         10b       Round up (toward +∞)
                         11b       Round toward zero (truncate)
2                        0         Use rounding mode specified in immediate operand bits 1:0
                         1         Use rounding mode specified in MXCSR.RC
7:3                      Ignored   Not used

The function HalfToSinglePrecision_ uses the vcvtph2ps instruction to convert eight half-precision floating-point values to single-precision floating-point. The output for source code example Ch11_05 follows this paragraph. Note the value differences between the various rounding modes. Also note that when the rounding mode RoundNearest or RoundUp is used, the value 75600.0f is converted to inf (or infinity) since this quantity exceeds 65504.0, the largest finite half-precision floating-point value.
              x   RoundNearest      RoundDown        RoundUp      RoundZero
----------------------------------------------------------------------------
         4.1250         4.1250         4.1250         4.1250         4.1250
        32.9000        32.9063        32.8750        32.9063        32.8750
        56.3333        56.3438        56.3125        56.3438        56.3125
       -68.6667       -68.6875       -68.6875       -68.6250       -68.6250
     42000.5000     42016.0000     41984.0000     42016.0000     41984.0000
     75600.0000            inf     65504.0000            inf     65504.0000
     -6002.1250     -6004.0000     -6004.0000     -6000.0000     -6000.0000
       170.0625       170.0000       170.0000       170.1250       170.0000

Summary

Here are the key learning points from Chapter 11:
  • FMA instructions are often employed to implement numerically oriented algorithms such as discrete convolutions, which are used extensively in a wide variety of problem domains, including signal processing and image processing.

  • Assembly language functions that employ sequences of consecutive FMA instructions should use multiple XMM or YMM registers to store intermediate sums of products. Using multiple registers helps avoid data dependencies that can prevent the processor from executing multiple FMA instructions simultaneously.

  • Small value discrepancies commonly occur between functions that implement the same algorithm or operation using FMA and non-FMA instruction sequences. The significance of these discrepancies depends on the application.

  • Assembly language functions can use the mulx, sarx, shlx, and shrx instructions to carry out flagless unsigned integer multiplication and shifts. These instructions may yield slightly faster performance than their flag-based counterparts in algorithms that perform consecutive multiplication and shift operations.

  • Assembly language functions can use the lzcnt, tzcnt, bextr, and andn instructions to perform advanced bit-manipulation operations.

  • Assembly language functions can use the vcvtps2ph and vcvtph2ps instructions to perform conversions between single-precision and half-precision floating-point values.