10.9.8 Count Bits

These instructions can be used to count leading sign bits or zeros, or to count the number of bits that are set for each element in a vector:

vcls Count Leading Sign Bits

vclz Count Leading Zero Bits

vcnt Count Set Bits

Syntax

 v<op>.<type> Vd, Vm

 <op> is either cls, clz or cnt.

 The valid choices for <type> are given in the following table:

Opcode  Valid Types
vcls    s8, s16, or s32
vclz    u8, u16, or u32
vcnt    i8

Operations

Name  Effect  Description

vcls

 n ← # of elements
 for 0 ≤ i < n do
  Vd[i] ← leading_sign_bits(Vm[i])
 end for

Count the number of consecutive bits that are the same as the sign bit for each element in Vm, and store the counts in the corresponding elements of Vd.

vclz

 n ← # of elements
 for 0 ≤ i < n do
  Vd[i] ← leading_zero_bits(Vm[i])
 end for

Count the number of leading zero bits for each element in Vm, and store the counts in the corresponding elements of Vd.

vcnt

 n ← # of elements
 for 0 ≤ i < n do
  Vd[i] ← count_one_bits(Vm[i])
 end for

Count the number of bits in each element of Vm that are set to one, and store the counts in the corresponding elements of Vd.
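To make the three counting rules concrete, here is a scalar Python model of the per-element semantics for 8-bit elements. The helper names are invented for illustration; the real instructions apply this to every element of a vector at once.

```python
def cls8(x):
    """vcls semantics for one 8-bit element: count consecutive bits
    below the sign bit that are equal to the sign bit."""
    sign = (x >> 7) & 1
    count = 0
    for bit in range(6, -1, -1):        # bits 6..0, below the sign bit
        if (x >> bit) & 1 == sign:
            count += 1
        else:
            break
    return count

def clz8(x):
    """vclz semantics for one 8-bit element: count leading zero bits."""
    count = 0
    for bit in range(7, -1, -1):
        if (x >> bit) & 1 == 0:
            count += 1
        else:
            break
    return count

def cnt8(x):
    """vcnt semantics for one 8-bit element: count set bits."""
    return bin(x & 0xFF).count("1")
```

Applying one of these helpers to each element of a list models the effect of the corresponding vector instruction on a whole vector.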


Examples

[Figure 10.48]

10.10 Multiplication and Division

There is no vector divide instruction in NEON. Division is accomplished with multiplication by the reciprocals of the divisors. The reciprocals are found by making an initial estimate, then using the Newton-Raphson method to improve the approximation. This can actually be faster than using a hardware divider. NEON supports single precision floating point and unsigned fixed point reciprocal calculation. Fixed point reciprocals provide higher precision. Division using the NEON reciprocal method may not provide the best precision possible. If the best possible precision is required, then the VFP divide instruction should be used.

10.10.1 Multiply

These instructions are used to multiply the corresponding elements from two vectors:

vmul Multiply

vmla Multiply Accumulate

vmls Multiply Subtract

vmull Multiply Long

vmlal Multiply Accumulate Long

vmlsl Multiply Subtract Long

The long versions can be used to avoid overflow.

Syntax

 v<op>.<type> Vd, Vn, Vm

 v<op>l.<type> Qd, Dn, Dm

 <op> is either mul, mla, or mls.

 The valid choices for <type> are given in the following table:

Opcode  Valid Types
vmul    p8, i8, i16, or i32
vmla    i8, i16, or i32
vmls    i8, i16, or i32
vmull   p8, s8, s16, s32, u8, u16, or u32
vmlal   s8, s16, s32, u8, u16, or u32
vmlsl   s8, s16, s32, u8, u16, or u32

Operations

Name  Effect  Description

vmul

 Vd[] ← Vn[] × Vm[]

Multiply corresponding elements from two vectors and store the results in a third vector.

vmla

 Vd[] ← Vd[] + (Vn[] × Vm[])

Multiply corresponding elements from two vectors and add the results to a third vector.

vmls

 Vd[] ← Vd[] − (Vn[] × Vm[])

Multiply corresponding elements from two vectors and subtract the results from a third vector.

vmull

 Qd[] ← Dn[] × Dm[]

Multiply corresponding elements from two vectors and store the widened results in a third vector.

vmlal

 Qd[] ← Qd[] + (Dn[] × Dm[])

Multiply corresponding elements from two vectors and add the widened results to a third vector.

vmlsl

 Qd[] ← Qd[] − (Dn[] × Dm[])

Multiply corresponding elements from two vectors and subtract the widened results from a third vector.
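The wrap-around behavior that the long forms avoid can be illustrated with a scalar Python model. The helper names are invented, and unsigned modular arithmetic is shown for simplicity:

```python
MASK16 = 0xFFFF

def vmul_i16(vn, vm):
    # elementwise multiply; the result wraps modulo 2**16, as a 16-bit
    # element does when the product is too large
    return [(a * b) & MASK16 for a, b in zip(vn, vm)]

def vmla_i16(vd, vn, vm):
    # elementwise multiply-accumulate, also wrapping modulo 2**16
    return [(d + a * b) & MASK16 for d, a, b in zip(vd, vn, vm)]

def vmull_u16(vn, vm):
    # "long" form: 16-bit inputs produce full 32-bit products,
    # so the product of two 16-bit values cannot overflow
    return [a * b for a, b in zip(vn, vm)]
```

For example, 300 × 300 = 90000 does not fit in 16 bits, so the non-long form wraps to 90000 − 65536 = 24464, while the long form keeps the full product.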


Examples

[Figure 10.49]

10.10.2 Multiply by Scalar

These instructions are used to multiply each element in a vector by a scalar:

vmul Multiply by Scalar

vmla Multiply Accumulate by Scalar

vmls Multiply Subtract by Scalar

vmull Multiply Long by Scalar

vmlal Multiply Accumulate Long by Scalar

vmlsl Multiply Subtract Long by Scalar

The long versions can be used to avoid overflow.

Syntax

v<op>.<type> Vd, Vn, Dm[x]

v<op>l.<type> Qd, Dn, Dm[x]

 <op> is either mul, mla, or mls.

 The valid choices for <type> are given in the following table:

Opcode  Valid Types
vmul    i16, i32, or f32
vmla    i16, i32, or f32
vmls    i16, i32, or f32
vmull   s16, s32, u16, or u32
vmlal   s16, s32, u16, or u32
vmlsl   s16, s32, u16, or u32

 x must be valid for the chosen <type>.

Operations

Name  Effect  Description

vmul

 Vd[] ← Vn[] × Dm[x]

Multiply each element in Vn by the scalar Dm[x] and store the results in Vd.

vmla

 Vd[] ← Vd[] + (Vn[] × Dm[x])

Multiply each element in Vn by the scalar Dm[x] and add the results to Vd.

vmls

 Vd[] ← Vd[] − (Vn[] × Dm[x])

Multiply each element in Vn by the scalar Dm[x] and subtract the results from Vd.

vmull

 Qd[] ← Dn[] × Dm[x]

Multiply each element in Dn by the scalar Dm[x] and store the widened results in Qd.

vmlal

 Qd[] ← Qd[] + (Dn[] × Dm[x])

Multiply each element in Dn by the scalar Dm[x] and add the widened results to Qd.

vmlsl

 Qd[] ← Qd[] − (Dn[] × Dm[x])

Multiply each element in Dn by the scalar Dm[x] and subtract the widened results from Qd.
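A scalar Python sketch of the by-scalar form (the helper name is invented): every element of the first operand is multiplied by one selected lane of the second.

```python
def vmul_lane(vn, dm, x):
    # every element of vn is multiplied by the single lane dm[x]
    scalar = dm[x]
    return [a * scalar for a in vn]
```

This is the common building block for scaling a whole vector by one coefficient without first duplicating the coefficient into a vector.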


Examples

[Figure 10.50]

10.10.3 Fused Multiply Accumulate

A fused multiply accumulate operation does not perform rounding between the multiply and add operations. The two operations are fused into one. NEON provides the following fused multiply accumulate instructions:

vfma Fused Multiply Accumulate

vfnma Fused Negate Multiply Accumulate

vfms Fused Multiply Subtract

vfnms Fused Negate Multiply Subtract

Using the fused multiply accumulate can result in improved speed and accuracy for many computations that involve the accumulation of products.

Syntax

 <op>{<cond>}.<prec> Fd, Fn, Fm

<op> is one of vfma, vfnma, vfms, or vfnms.

<cond> is an optional condition code.

<prec> may be either f32 or f64.

Operations

Name  Effect  Description

vfma

 Fd ← Fd + (Fn × Fm)

Multiply and accumulate.

vfnma

 Fd ← −Fd − (Fn × Fm)

Negate, multiply, and accumulate.

vfms

 Fd ← Fd − (Fn × Fm)

Multiply and subtract.

vfnms

 Fd ← −Fd + (Fn × Fm)

Negate, multiply, and subtract.
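The difference a single rounding makes can be demonstrated in Python by using exact rational arithmetic to model the fused operation. The operand values below are chosen so that the lowest bit of the product survives only in the fused result; this illustrates the fusion concept itself, not the NEON instructions.

```python
from fractions import Fraction

# a*b is exactly 1 + 2**-26 + 2**-54; the final 2**-54 term is lost
# when the product is rounded to a double before the add.
a = b = 1.0 + 2.0 ** -27
c = -(1.0 + 2.0 ** -26)

separate = a * b + c                                     # product rounded first
fused = float(Fraction(a) * Fraction(b) + Fraction(c))   # one rounding at the end
```

The separate multiply-then-add yields 0.0, while the fused computation recovers the tiny residual 2⁻⁵⁴, exactly the kind of low-order information a fused multiply accumulate preserves.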

Examples

[Figure 10.51]

10.10.4 Saturating Multiply and Double (Low)

These instructions perform multiplication, double the results, and perform saturation:

vqdmull Saturating Multiply Double (Low)

vqdmlal Saturating Multiply Double Accumulate (Low)

vqdmlsl Saturating Multiply Double Subtract (Low)

Syntax

 vqd<op>l.<type> Qd, Dn, Dm

 vqd<op>l.<type> Qd, Dn, Dm[x]

 <op> is either mul, mla, or mls.

 <type> must be either s16 or s32.

Operations

Name  Effect  Description

vqdmull

 if second operand is scalar then
  Qd[] ← sat(2 × (Dn[] × Dm[x]))
 else
  Qd[] ← sat(2 × (Dn[] × Dm[]))
 end if

Multiply elements, double the results, and store in the destination vector with saturation.

vqdmlal

 if second operand is scalar then
  Qd[] ← sat(Qd[] + 2 × (Dn[] × Dm[x]))
 else
  Qd[] ← sat(Qd[] + 2 × (Dn[] × Dm[]))
 end if

Multiply elements, double the results, and add to the destination vector with saturation.

vqdmlsl

 if second operand is scalar then
  Qd[] ← sat(Qd[] − 2 × (Dn[] × Dm[x]))
 else
  Qd[] ← sat(Qd[] − 2 × (Dn[] × Dm[]))
 end if

Multiply elements, double the results, and subtract from the destination vector with saturation.
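A scalar Python model of the vqdmull.s16 semantics (the helper names are invented) shows why saturation is needed: doubling the product of two minimum-value 16-bit operands overflows even the widened 32-bit result.

```python
INT32_MAX, INT32_MIN = 2**31 - 1, -2**31

def sat32(x):
    # clamp to the signed 32-bit range (the "q" in vqdmull)
    return max(INT32_MIN, min(INT32_MAX, x))

def vqdmull_s16(dn, dm):
    # multiply 16-bit elements, double the product, saturate to 32 bits
    return [sat32(2 * a * b) for a, b in zip(dn, dm)]
```

With both inputs at −32768, the doubled product is 2³¹, one past the largest signed 32-bit value, so the result clamps to 2³¹ − 1; every other input pair fits without clamping.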


Examples

[Figure 10.52]

10.10.5 Saturating Multiply and Double (High)

These instructions perform multiplication, double the results, perform saturation, and store the high half of the results:

vqdmulh Saturating Multiply Double (High)

vqrdmulh Saturating Multiply Double (High) and Round

Syntax

 vq{r}dmulh.<type> Vd, Vn, Vm

 vq{r}dmulh.<type> Vd, Vn, Dm[x]

 <type> must be either s16 or s32.

Operations

Name  Effect  Description

vqdmulh

 n ← size of <type>
 if second operand is scalar then
  Vd[] ← sat((2 × Vn[] × Dm[x]) ÷ 2^n)
 else
  Vd[] ← sat((2 × Vn[] × Vm[]) ÷ 2^n)
 end if

Multiply elements, double the results, and store the high half in the destination vector with saturation.

vqrdmulh

 n ← size of <type>
 if second operand is scalar then
  Vd[] ← sat((2 × Vn[] × Dm[x] + 2^(n−1)) ÷ 2^n)
 else
  Vd[] ← sat((2 × Vn[] × Vm[] + 2^(n−1)) ÷ 2^n)
 end if

Multiply elements, double the results, round, and store the high half in the destination vector with saturation.
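The s16 forms can be modeled in Python as follows (hypothetical helper names). Keeping the high half of the doubled product is the usual Q15 fractional multiply; the rounding variant adds half of the discarded range before the shift:

```python
def sat16(x):
    # clamp to the signed 16-bit range
    return max(-2**15, min(2**15 - 1, x))

def vqdmulh_s16(vn, vm):
    # doubled product, keep the high 16 bits (arithmetic shift), saturate
    return [sat16((2 * a * b) >> 16) for a, b in zip(vn, vm)]

def vqrdmulh_s16(vn, vm):
    # same, but add 2**15 before the shift to round to nearest
    return [sat16((2 * a * b + (1 << 15)) >> 16) for a, b in zip(vn, vm)]
```

Interpreting the elements as Q15 fractions, 0.5 × 0.5 (0x4000 × 0x4000) gives 0.25 (0x2000), and the one overflow case −1.0 × −1.0 saturates to the largest positive value.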


Examples

[Figure 10.53]

10.10.6 Estimate Reciprocals

These instructions perform the initial estimates of the reciprocal values:

vrecpe Reciprocal Estimate

vrsqrte Reciprocal Square Root Estimate

These work on floating point and unsigned fixed point vectors. The estimates from these instructions are accurate to within about eight bits. If higher accuracy is desired, then the Newton-Raphson method can be used to improve the initial estimates. For more information, see the Reciprocal Step instructions.

Syntax

 v<op>.<type> Vd, Vm

 <op> is either recpe or rsqrte.

 <type> must be either u32 or f32.

 If <type> is u32, then the elements are assumed to be U(1,31) fixed point numbers: the integer part (bit 31) must be zero, and the most significant fraction bit (bit 30) must be one. The vclz and shift by variable instructions can be used to put the data in the correct format.

 The result elements are always f32.
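A Python sketch of the normalization step described above, assuming the caller tracks the applied shift so it can be undone after the reciprocal is computed. The name `to_u1_31` is invented, and the count computed here is what vclz would return:

```python
def to_u1_31(x):
    """Normalize a nonzero unsigned 32-bit integer into U(1,31) form:
    integer bit (bit 31) zero, leading fraction bit (bit 30) one.
    Returns (fixed, shift): the raw bit pattern and the left-shift that
    was applied (negative means the value was shifted right)."""
    assert 0 < x < 2**32
    clz = 32 - x.bit_length()      # what vclz would compute
    shift = clz - 1                # moves the leading one to bit 30
    fixed = (x << shift) if shift >= 0 else (x >> -shift)
    return fixed, shift
```

For example, x = 3 has 30 leading zeros, so it is shifted left by 29 places, leaving its leading one in bit 30 as the format requires.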

Operations

Name  Effect  Description

vrecpe

 n ← # of elements
 for 0 ≤ i < n do
  Vd[i] ← ≈(1 ÷ Vm[i])
 end for

Find an approximate reciprocal of each element in a vector.

vrsqrte

 n ← # of elements
 for 0 ≤ i < n do
  Vd[i] ← ≈(1 ÷ √Vm[i])
 end for

Find an approximate reciprocal square root of each element in a vector.


Examples

[Figure 10.54]

10.10.7 Reciprocal Step

These instructions are used to perform one Newton-Raphson step for improving the reciprocal estimates:

vrecps Reciprocal Step

vrsqrts Reciprocal Square Root Step

For each element in the vector, the following equation can be used to improve the estimate of the reciprocal:

x_{n+1} = x_n(2 − d·x_n),

where x_n is the estimated reciprocal from the previous step, and d is the number whose reciprocal is desired. This equation converges to 1/d if x_0 is obtained using vrecpe on d. The vrecps instruction computes

2 − d·x_n,

so one additional multiplication is required to complete the update step. The initial estimate x_0 must be obtained using the vrecpe instruction.

For each element in the vector, the following equation can be used to improve the estimate of the reciprocal of the square root:

x_{n+1} = x_n(3 − d·x_n²) ÷ 2,

where x_n is the estimated reciprocal square root from the previous step, and d is the number whose reciprocal square root is desired. This equation converges to 1/√d if x_0 is obtained using vrsqrte on d. The vrsqrts instruction computes

(3 − d·x_n²) ÷ 2,

so two additional multiplications are required to complete the update step. The initial estimate x_0 must be obtained using the vrsqrte instruction.
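The convergence of both update formulas can be checked numerically with a short Python model. The crude initial guesses below simply stand in for vrecpe and vrsqrte results:

```python
def recip_step(d, x):
    # one Newton-Raphson update for 1/d, split the way NEON splits it:
    # vrecps supplies (2 - d*x); one extra multiply by x finishes the step
    return x * (2.0 - d * x)

def rsqrt_step(d, x):
    # one update for 1/sqrt(d): vrsqrts supplies (3 - d*x*x)/2;
    # two extra multiplies finish the step
    return x * (3.0 - d * x * x) / 2.0

d = 7.0
x = 0.14            # crude estimate of 1/7, standing in for vrecpe
for _ in range(3):
    x = recip_step(d, x)

y = 0.37            # crude estimate of 1/sqrt(7), standing in for vrsqrte
for _ in range(3):
    y = rsqrt_step(d, y)
```

Because convergence is quadratic, three steps take these two-decimal-digit guesses to nearly full double precision; with the roughly eight-bit estimates the hardware provides, two steps are typically enough for single precision.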

Syntax

 v<op>.<type> Vd, Vn, Vm

 <op> is either recps or rsqrts.

 <type> must be either u32 or f32.

Operations

Name  Effect  Description

vrecps

 n ← # of elements
 for 0 ≤ i < n do
  Vd[i] ← 2 − (Vn[i] × Vm[i])
 end for

Perform most of the Newton-Raphson reciprocal improvement step.

vrsqrts

 n ← # of elements
 for 0 ≤ i < n do
  Vd[i] ← (3 − (Vn[i] × Vm[i])) ÷ 2
 end for

Perform most of the Newton-Raphson reciprocal square root improvement step.


Examples

[Figure 10.55]

10.11 Pseudo-Instructions

The GNU assembler supports five pseudo-instructions for NEON. Two of them are vcle and vclt, which were covered in Section 10.6.1. The other three are explained in the following sections.

10.11.1 Load Constant

This pseudo-instruction loads a constant value into every element of a NEON vector, or into a VFP single-precision or double-precision register:

vldr Load Constant.

This pseudo-instruction will use vmov if possible. Otherwise, it will create an entry in the literal pool and use vldr.

Syntax

 vldr{<cond>}.<type> Vd, =<imm>

 <cond> is an optional condition code.

 <type> must be one of i8, i16, i32, i64, s8, s16, s32, s64, u8, u16, u32, u64, f32, or f64.

 <imm> is a value appropriate for the specified <type>.

Operations

Name  Effect  Description

vldr

 Vd ← <imm>

Load a constant.


Examples

[Figure 10.56]

10.11.2 Bitwise Logical Operations with Immediate Data

It is often useful to clear and/or set specific bits in a register. The following pseudo-instructions can provide bitwise logical operations:

vand Bitwise AND Immediate

vorn Bitwise Complement and OR Immediate

Syntax

 v<op>.<type> Vd, #<imm>

 <op> must be either and or orn.

 V must be either q or d to specify whether the operation involves quadwords or doublewords.

 <type> must be i8, i16, i32, or i64.

 <imm> is a 16-bit or 32-bit immediate value, which is interpreted as a pattern for filling the immediate operand. The following table shows acceptable patterns for <imm>, based on what was chosen for <type>:

<type>
i8, i16    i32, i64
0xFFXY     0xFFFFFFXY
0xXYFF     0xFFFFXYFF
           0xFFXYFFFF
           0xXYFFFFFF


Operations

Name  Effect  Description

vand

 Vd ← Vd ∧ (imm:imm)

Bitwise AND with the replicated immediate.

vorn

 Vd ← Vd ∨ ¬(imm:imm)

Bitwise OR with the complement of the replicated immediate.
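A Python model of the two pseudo-instructions per 32-bit element (hypothetical helper names): vand keeps only the bits set in the replicated pattern, while vorn sets every bit that is clear in it.

```python
MASK32 = 0xFFFFFFFF

def vand_imm(vd, imm):
    # vand Vd, #imm: keep only the bits that are set in the pattern
    return [x & imm for x in vd]

def vorn_imm(vd, imm):
    # vorn Vd, #imm: set every bit that is clear in the pattern
    return [x | (~imm & MASK32) for x in vd]
```

For example, the pattern 0xFFFFFF00 (the 0xFFFFFFXY form with XY = 00) clears the low byte of each element under vand, and sets that same byte under vorn.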

Examples

[Figure 10.57]

10.11.3 Vector Absolute Compare

The following pseudo-instructions perform comparisons between the absolute values of all of the corresponding elements of two vectors in parallel:

vacle Absolute Compare Less Than or Equal

vaclt Absolute Compare Less Than

The vector absolute compare instruction compares the absolute value of each element of a vector with the absolute value of the corresponding element in a second vector, and sets an element in the destination vector for each comparison. If the comparison is true, then all bits in the result element are set to one. Otherwise, all bits in the result element are set to zero. Note that summing the elements of the result vector (as signed integers) will give the two’s complement of the number of comparisons which were true.

Syntax

 vac<op>.f32 Vd, Vn, Vm

 <op> must be either le or lt.

 V can be d or q.

 The operand element type must be f32.

 The result element type is i32.

Operations

NameEffectDescription
vac<op>

for ivector_length do

 if |Fm[i]|<op> |Fn[i]|

then

 Fd[i]111si89_e

else

 Fd[i]000si90_e

 end if

end for

Compare each scalar in Fn to the corresponding scalar in Fm. If the comparison is true, then set all bits in the corresponding scalar in Fd to one. Otherwise set all bits in the corresponding scalar in Fd to zero.
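A Python model of the lane-mask behavior, including the summing trick mentioned above (the helper name is invented; an all-ones lane reads back as the signed value −1):

```python
def vacle_f32(vn, vm):
    # lane mask: all ones (signed value -1) where |Vn[i]| <= |Vm[i]|,
    # otherwise all zeros
    return [-1 if abs(a) <= abs(b) else 0 for a, b in zip(vn, vm)]

masks = vacle_f32([1.0, -5.0, 2.0], [-2.0, 4.0, 2.0])
true_lanes = -sum(masks)    # summing the signed masks negates the true count
```

Here two of the three comparisons hold, so the masks sum to −2, the two's complement of the number of true lanes, just as the text describes.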


Examples

[Figure 10.58]

10.12 Performance Mathematics: A Final Look at Sine

In Chapter 9, four versions of the sine function were given, using scalar and VFP vector modes for single precision and double precision. Those implementations are already faster than the ones provided by GCC. However, it may be possible to gain a little more performance by taking advantage of the NEON architecture. All versions of NEON are guaranteed to have a very large register set, and that fact can be used to attain better performance.

10.12.1 Single Precision

Listing 10.1 shows a single precision floating point implementation of the sine function, using the ARM NEON instruction set. It performs the same operations as the previous implementations of the sine function, but performs many of the calculations in parallel. This implementation is slightly faster than the previous version.

Listing 10.1 NEON implementation of the sin x function using single precision.

10.12.2 Double Precision

Listing 10.2 shows a double precision floating point implementation of the sine function. This code is intended to run on ARMv7 and earlier NEON/VFP systems with the full set of 32 double-precision registers. NEON systems prior to ARMv8 do not have NEON SIMD instructions for double precision operations. This implementation is faster than Listing 9.4 because it uses a large number of registers, does not contain a loop, and is written carefully so that multiple instructions can be at different stages in the pipeline at the same time. This technique of gaining performance is known as loop unrolling.

Listing 10.2 NEON implementation of the sin x function using double precision.

10.12.3 Performance Comparison

Table 10.4 compares the implementations from Listings 10.1 and 10.2 with the VFP vector implementations from Chapter 9 and the sine function provided by GCC. Notice that in every case, using vector mode VFP instructions is slower than the scalar VFP version. As mentioned previously, vector mode is deprecated on NEON processors and is emulated in software: each vector instruction causes the operating system to take over and substitute a series of scalar floating point operations on-the-fly. Although vector mode is supported, the great deal of time spent in this software emulation results in severely reduced performance.

Table 10.4

Performance of sine function with various implementations

Optimization  Implementation                        CPU seconds
None          Single Precision VFP scalar Assembly  1.74
              Single Precision VFP vector Assembly  27.09
              Single Precision NEON Assembly        1.32
              Single Precision C                    4.36
              Double Precision VFP scalar Assembly  2.83
              Double Precision VFP vector Assembly  106.46
              Double Precision NEON Assembly        2.24
              Double Precision C                    4.59
Full          Single Precision VFP scalar Assembly  1.11
              Single Precision VFP vector Assembly  27.15
              Single Precision NEON Assembly        0.96
              Single Precision C                    1.69
              Double Precision VFP scalar Assembly  2.56
              Double Precision VFP vector Assembly  107.53
              Double Precision NEON Assembly        2.05
              Double Precision C                    4.27

When compiler optimization is not used, the single precision scalar VFP implementation achieves a speedup of about 2.51, and the NEON implementation achieves a speedup of about 3.30 compared to the GCC implementation. The double precision scalar VFP implementation achieves a speedup of about 1.62, and the loop-unrolled NEON implementation achieves a speedup of about 2.05 compared to the GCC implementation.

When the best possible compiler optimization is used (-Ofast), the single precision scalar VFP implementation achieves a speedup of about 1.52, and the NEON implementation achieves a speedup of about 1.76 compared to the GCC implementation. The double precision scalar VFP implementation achieves a speedup of about 1.67, and the loop-unrolled NEON implementation achieves a speedup of about 2.08 compared to the GCC implementation. The single precision NEON version was 1.16 times as fast as the VFP scalar version and the double precision NEON implementation was 1.25 times as fast as the VFP scalar implementation.
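The quoted speedups follow directly from the timings in Table 10.4; a quick Python check (the dictionary layout is just for illustration):

```python
# CPU seconds from Table 10.4 (no optimization vs. -Ofast)
times = {
    ("none", "sp"): {"c": 4.36, "vfp_scalar": 1.74, "neon": 1.32},
    ("none", "dp"): {"c": 4.59, "vfp_scalar": 2.83, "neon": 2.24},
    ("full", "sp"): {"c": 1.69, "vfp_scalar": 1.11, "neon": 0.96},
    ("full", "dp"): {"c": 4.27, "vfp_scalar": 2.56, "neon": 2.05},
}

def speedup(baseline, improved):
    # ratio of CPU seconds, rounded the way the text quotes it
    return round(baseline / improved, 2)
```

For example, the unoptimized single precision NEON speedup over C is 4.36 ÷ 1.32 ≈ 3.30, matching the figure quoted above.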

Although the VFP versions of the sine function ran without modification on the NEON processor, re-writing them for NEON resulted in significant performance improvement. Performance of the vectorized VFP code running on a NEON processor was abysmal. The take-away lesson is that a programmer can improve performance by writing some functions in assembly that are targeted at a specific platform. However, assembly code that improves performance on one platform may result in very poor performance on a different platform. To achieve optimal or near-optimal performance, it is important for the programmer to know exactly which hardware platform the code will run on.

10.13 Alphabetized List of NEON Instructions

Name      Page  Operation
vaba      339   Absolute Difference and Accumulate
vabal     339   Absolute Difference and Accumulate Long
vabd      339   Absolute Difference
vabdl     339   Absolute Difference Long
vabs      340   Absolute Value
vacge     324   Absolute Compare Greater Than or Equal
vacgt     324   Absolute Compare Greater Than
vacle     353   Absolute Compare Less Than or Equal
vaclt     353   Absolute Compare Less Than
vadd      335   Add
vaddhn    336   Add and Narrow
vaddl     335   Add Long
vaddw     335   Add Wide
vand      326   Bitwise AND
vand      352   Bitwise AND Immediate
vbic      326   Bit Clear
vbic      327   Bit Clear Immediate
vbif      328   Bitwise Insert if False
vbit      328   Bitwise Insert
vbsl      328   Bitwise Select
vceq      323   Compare Equal
vcge      323   Compare Greater Than or Equal
vcgt      323   Compare Greater Than
vcle      323   Compare Less Than or Equal
vcls      342   Count Leading Sign Bits
vclt      323   Compare Less Than
vclz      342   Count Leading Zero Bits
vcnt      342   Count Set Bits
vcvt      322   Convert Between Half and Single
vcvt      321   Convert Data Format
vdup      312   Duplicate Scalar
veor      326   Bitwise Exclusive-OR
vext      313   Extract Elements
vfma      346   Fused Multiply Accumulate
vfms      346   Fused Multiply Subtract
vfnma     346   Fused Negate Multiply Accumulate
vfnms     346   Fused Negate Multiply Subtract
vhadd     337   Halving Add
vhsub     337   Halving Subtract
vld<n>    305   Load Copies of Structured Data
vld<n>    307   Load Multiple Structured Data
vld<n>    303   Load Structured Data
vldr      351   Load Constant
vmax      341   Maximum
vmin      341   Minimum
vmla      343   Multiply Accumulate
vmla      345   Multiply Accumulate by Scalar
vmlal     344   Multiply Accumulate Long
vmlal     345   Multiply Accumulate Long by Scalar
vmls      343   Multiply Subtract
vmls      345   Multiply Subtract by Scalar
vmlsl     344   Multiply Subtract Long
vmlsl     345   Multiply Subtract Long by Scalar
vmov      310   Move Immediate
vmov      309   Move Between NEON and ARM
vmovl     311   Move and Lengthen
vmovn     311   Move and Narrow
vmul      343   Multiply
vmul      345   Multiply by Scalar
vmull     343   Multiply Long
vmull     345   Multiply Long by Scalar
vmvn      310   Move Immediate Negative
vneg      340   Negate
vorn      326   Bitwise Complement and OR
vorn      352   Bitwise Complement and OR Immediate
vorr      326   Bitwise OR
vorr      327   Bitwise OR Immediate
vpadal    338   Add Pairwise and Accumulate Long
vpadd     338   Add Pairwise
vpaddl    338   Add Pairwise Long
vpmax     341   Pairwise Maximum
vpmin     341   Pairwise Minimum
vqabs     340   Saturating Absolute Value
vqadd     335   Saturating Add
vqdmlal   347   Saturating Multiply Double Accumulate (Low)
vqdmlsl   347   Saturating Multiply Double Subtract (Low)
vqdmulh   348   Saturating Multiply Double (High)
vqdmull   347   Saturating Multiply Double (Low)
vqmovn    311   Saturating Move and Narrow
vqmovun   311   Saturating Move and Narrow Unsigned
vqneg     340   Saturating Negate
vqrdmulh  348   Saturating Multiply Double (High) and Round
vqrshl    330   Saturating Shift Left or Right by Variable and Round
vqrshrn   332   Saturating Shift Right Immediate Round
vqrshrun  333   Saturating Shift Right Immediate Round Unsigned
vqshl     329   Saturating Shift Left Immediate
vqshl     330   Saturating Shift Left or Right by Variable
vqshlu    329   Saturating Shift Left Immediate Unsigned
vqshrn    332   Saturating Shift Right Immediate
vqshrun   333   Saturating Shift Right Immediate Unsigned
vqsub     335   Saturating Subtract
vraddhn   336   Add, Round, and Narrow
vrecpe    348   Reciprocal Estimate
vrecps    349   Reciprocal Step
vrev      314   Reverse Elements
vrhadd    337   Halving Add and Round
vrshl     330   Shift Left or Right by Variable and Round
vrshr     331   Shift Right Immediate and Round
vrshrn    331   Shift Right Immediate Round and Narrow
vrsqrte   348   Reciprocal Square Root Estimate
vrsqrts   349   Reciprocal Square Root Step
vrsra     331   Shift Right Round and Accumulate Immediate
vrsubhn   336   Subtract, Round, and Narrow
vshl      329   Shift Left Immediate
vshl      330   Shift Left or Right by Variable
vshll     329   Shift Left Immediate Long
vshr      331   Shift Right Immediate
vshrn     331   Shift Right Immediate and Narrow
vsli      334   Shift Left and Insert
vsra      331   Shift Right and Accumulate Immediate
vsri      334   Shift Right and Insert
vst<n>    307   Store Multiple Structured Data
vst<n>    303   Store Structured Data
vsub      335   Subtract
vsubhn    336   Subtract and Narrow
vsubl     335   Subtract Long
vsubw     335   Subtract Wide
vswp      315   Swap Vectors
vtbl      318   Table Lookup
vtbx      318   Table Lookup with Extend
vtrn      316   Transpose Matrix
vtst      325   Test Bits
vuzp      319   Unzip Vectors
vzip      319   Zip Vectors


10.14 Chapter Summary

NEON can dramatically improve the performance of algorithms that exhibit data parallelism. However, compiler support for automatic vectorization using NEON instructions is still immature. NEON intrinsics allow C and C++ programmers to access NEON instructions by making them look like C functions. It is usually just as easy, and more concise, to write NEON assembly code as it is to use the intrinsic functions. A careful assembly language programmer can usually beat the compiler, sometimes by a wide margin. The greatest gains usually come from converting an algorithm to avoid floating point and from taking advantage of data parallelism.

Exercises

10.1 What is the advantage of using IEEE half-precision? What is the disadvantage?

10.2 NEON achieved relatively modest performance gains on the sine function, when compared to VFP.

(a) Why?

(b) List some tasks for which NEON could significantly outperform VFP.

10.3 There are some limitations on the size of the structure that can be loaded or stored using the vld<n> and vst<n> instructions. What are the limitations?

10.4 The sine function in Listing 10.2 uses a technique known as “loop unrolling” to achieve higher performance. Name at least three reasons why this code is more efficient than using a loop.

10.5 Reimplement the fixed-point sine function from Listing 8.7 using NEON instructions. Hint: you should not need to use a loop. Compare the performance of your NEON implementation with the performance of the original implementation.

10.6 Reimplement Exercise 9.10 using NEON instructions.

10.7 Fixed point operations may be faster than floating point operations. Modify your code from the previous exercise so that it uses the following definitions for points and transformation matrices:

[Figure 10.61]

Use saturating instructions and/or any other techniques necessary to prevent overflow. Compare the performance of the two implementations.