Mastering Assembly Programming

First of all, we need to make a small correction to the dates, in which months are specified by their number. What we actually need is the number of days from January 1 to the first day of the given month. The easiest and fastest way to obtain it is a small table with 12 entries, one per month, each containing the number of days between January 1 and the first day of that month. The table is called monthtab and is located in the data section.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;
; Entry point
;
;-----------------------------------------------------
_start:
   mov ecx, 20                     ; Length of biorhythm data to    
                                   ; produce

   mov eax, [bmonth]               ; Load birth month
   dec eax                         ; Decrement it in order to address   
                                   ; 0-based array

   mov eax, [monthtab + eax * 4]   ; Replace month with number of days
                                   ; since New Year
   mov [bmonth], eax               ; Store it back

   mov eax, [cmonth]               ; Do the same for current month
   dec eax
   mov eax, [monthtab + eax * 4]
   mov [cmonth], eax

   xor eax, eax                ; Reset EAX as we will use it as counter

The preceding code illustrates this very fix being applied:

We read the month number from the birth date
Decrement it as the table we are using is in fact a 0-based array of values
Replace the original month number with the value read from the table

By the way, the addressing mode used when reading a value from the table is a variation of the scale/index/base/displacement scheme in which no base register is used. As we can see, monthtab is the displacement, the eax register holds the index, and 4 is the scale (the size of a table entry in bytes).
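The lookup described above can be modeled in a few lines of Python. This is only a sketch, not code from the program: the contents of MONTHTAB are an assumption derived from the month lengths of a common (non-leap) year (the text only confirms the value 120 for May), and the helper names are hypothetical.

```python
# Assumed contents of monthtab: days elapsed between January 1 and the
# first day of each month in a common (non-leap) year.
MONTHTAB = [0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334]

DWORD_SIZE = 4  # each table entry is a 32-bit double word


def effective_address(displacement, index, scale):
    """Address formed by [displacement + index * scale] (no base register)."""
    return displacement + index * scale


def days_before_month(month):
    """Model of: dec eax / mov eax, [monthtab + eax * 4]."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    index = month - 1  # dec eax: 1-based month number -> 0-based index
    # Byte offset of the entry relative to the start of the table.
    offset = effective_address(0, index, DWORD_SIZE)
    return MONTHTAB[offset // DWORD_SIZE]


print(days_before_month(5))  # May -> 120
```

The scale of 4 is what turns an entry index into a byte offset. With a scale of 1, the read would land in the middle of a table entry for every index that is not a multiple of 4.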

The day/month/year of the two dates are specifically pre-arranged in memory to fit properly in the XMM registers and to ease calculations. It may seem that the first line of the following code loads only the value of cday into XMM0, but, in fact, the instruction loads an xmmword (a 128-bit value) starting at the address of cday, meaning that it loads four double words into XMM0:

bits 96 - 127	bits 64 - 95	bits 32 - 63	bits 0 - 31
`byear`	`bday`	`cyear`	`cday`
1979	16	2017	9

Data representation in the XMM0 register

Similarly, the second movaps loads XMM1 register with four double words starting at address of cmonth:

bits 96 - 127	bits 64 - 95	bits 32 - 63	bits 0 - 31
0	`bmonth`	0	`cmonth`
0	0	0	120

Data representation in the XMM1 register

As we can see, placing the two tables directly one above the other and thinking of them as XMM registers 0 and 1, we have cmonth/cday and bmonth/bday loaded to the same double words in both XMM0 and XMM1. We will see why such an arrangement of the data was so important in a few moments.

The movaps instruction can only move data between two XMM registers or between an XMM register and a 16-byte-aligned memory location. Use movups to access unaligned memory locations.
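To make the effect of a 128-bit load more tangible, here is a small Python model of it. It is a sketch under stated assumptions: the byte string stands in for the data section, the order cday, cyear, bday, byear is taken from the XMM0 table above, and the function names are hypothetical.

```python
import struct

# Four consecutive 32-bit integers as they would sit in memory, lowest
# address first: cday, cyear, bday, byear.
memory = struct.pack("<4i", 9, 2017, 16, 1979)


def load_xmmword(buf, offset=0):
    """Model of a 128-bit load: four dword lanes, lane 0 = bits 0-31."""
    return list(struct.unpack_from("<4i", buf, offset))


def is_aligned_16(address):
    """True when an address satisfies the alignment movaps requires."""
    return address % 16 == 0


xmm0 = load_xmmword(memory)
print(xmm0)  # [9, 2017, 16, 1979]
```

Note that the list prints lane 0 (bits 0 - 31) first, whereas the tables in the text show the highest lane in the leftmost column.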

In the last two lines of the following code fragment, we convert the values we have just loaded from double words to single precision float numbers:

   movaps xmm0, xword[cday]    ; Load the day/year parts of both dates
   movaps xmm1, xword[cmonth]  ; Load number of days since Jan 1st for both dates
   cvtdq2ps xmm0, xmm0         ; Convert loaded values to single precision floats
   cvtdq2ps xmm1, xmm1
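The conversion performed by cvtdq2ps can be sketched in Python by rounding each integer to the nearest single-precision value. This is a model only: it assumes the default round-to-nearest mode, and the function names are hypothetical. It also shows a caveat that does not affect our small date values: single precision has a 24-bit significand, so large integers may not survive the conversion exactly.

```python
import struct


def to_single(value):
    """Round a number to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", float(value)))[0]


def cvtdq2ps(lanes):
    """Model of cvtdq2ps: convert four signed 32-bit integers to floats."""
    return [to_single(v) for v in lanes]


# The date components are small, so they convert exactly.
print(cvtdq2ps([9, 2017, 16, 1979]))  # [9.0, 2017.0, 16.0, 1979.0]

# 2**24 + 1 has no exact single-precision representation.
print(cvtdq2ps([16777217, 0, 0, 0])[0])  # 16777216.0
```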

We have still not finished converting the dates into numbers of days: the years are still, well, years, and the day of the month and the number of days since January 1st are still stored separately for both dates. All we have to do before summing the days for each date is multiply each year by 365.25 (where 0.25 compensates for leap years). However, parts of XMM registers cannot be accessed separately the way parts of general purpose registers can (there is no analog of AX, AH, or AL within EAX, for example). We can, however, manipulate parts of XMM registers by using special instructions. In the first line of the following code fragment, we load the lower 64 bits of the XMM2 register with two float values stored at dpy (days per year). These values are 1.0 and 365.25. What does 1.0 have to do with it, you may ask? The answer is shown in the following table:

bits 96 - 127	bits 64 - 95	bits 32 - 63	bits 0 - 31	register name
1979.0	16.0	2017.0	9.0	XMM0
0.0	0.0	0.0	120.0	XMM1
0.0	0.0	365.25	1.0	XMM2

Content of XMM0 - XMM2 registers

Packed operations on XMM registers (packed means that the operation is applied to more than one value at once) are, most of the time, performed in columns. Thus, in order to multiply 2017.0 by 365.25, the two values have to sit in the same column of the registers being multiplied. However, we must not forget about 1979.0 either, and the easiest way to multiply both 2017.0 and 1979.0 by 365.25 with a single instruction is to copy the content of the lower half of the XMM2 register to its upper half with the movlhps instruction.

   movq xmm2, qword[dpy]  ; Load days per year into lower half of XMM2
   movlhps xmm2, xmm2     ; Duplicate it to the upper half

After these instructions the content of the XMM0 - XMM2 registers should look like this:

bits 96 - 127	bits 64 - 95	bits 32 - 63	bits 0 - 31	register name
1979.0	16.0	2017.0	9.0	XMM0
0.0	0.0	0.0	120.0	XMM1
365.25	1.0	365.25	1.0	XMM2

Content of XMM0 - XMM2 registers after movlhps
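The two instructions above can be modeled with Python lists of four lanes, lane 0 (bits 0 - 31) first. This is an illustrative sketch, not part of the program, and the function names are hypothetical.

```python
def movq_load(low, high):
    """Model of movq xmm, qword[mem]: fill lanes 0-1, zero lanes 2-3."""
    return [low, high, 0.0, 0.0]


def movlhps(dst, src):
    """Model of movlhps dst, src: copy the low two lanes of src into the
    high two lanes of dst; the low lanes of dst stay unchanged."""
    return [dst[0], dst[1], src[0], src[1]]


xmm2 = movq_load(1.0, 365.25)  # dpy holds 1.0 followed by 365.25
xmm2 = movlhps(xmm2, xmm2)     # duplicate the low half into the high half
print(xmm2)  # [1.0, 365.25, 1.0, 365.25]
```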

Use the pinsrb/pinsrd/pinsrq instructions to insert individual bytes/double words/quad words into an XMM register when needed. We do not use them in our code because its purpose is to demonstrate horizontal operations.

Now we are safe to proceed with multiplication and summation:

addps xmm1, xmm0      ; Summation of day of the month with days since January 1st
mulps xmm2, xmm1      ; Multiplication of years by days per year
haddps xmm2, xmm2     ; Final summation of days for both dates
hsubps xmm2, xmm2     ; Subtraction of birth date from current date

The preceding code first calculates, on the first line, the total number of days from January 1 up to the day of the month for both dates. On the second line, at last, it multiplies the years of both dates by the number of days per year. This line also explains why the days per year value was accompanied by 1.0: as we are multiplying XMM1 by XMM2 and we do not want to lose the previously calculated number of days, we simply multiply that number by 1.0.

At this moment the content of the three XMM registers should be like this:

bits 96 - 127	bits 64 - 95	bits 32 - 63	bits 0 - 31	register name
1979.0	16.0	2017.0	9.0	XMM0
1979.0	16.0	2017.0	129.0	XMM1
722829.75	16.0	736709.25	129.0	XMM2

Content of XMM0 - XMM2 registers after addition of days and multiplication of the corresponding parts of the XMM2 and XMM1 registers
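The column-wise behavior of addps and mulps can be checked with a short Python model. It is a sketch only: registers are represented as lists of four lanes with lane 0 (bits 0 - 31) first, the starting values are those from the tables above, and Python floats are double precision, which makes no difference here because every value involved is exactly representable in single precision.

```python
def addps(dst, src):
    """Model of addps dst, src: lane-by-lane (vertical) addition."""
    return [d + s for d, s in zip(dst, src)]


def mulps(dst, src):
    """Model of mulps dst, src: lane-by-lane (vertical) multiplication."""
    return [d * s for d, s in zip(dst, src)]


xmm0 = [9.0, 2017.0, 16.0, 1979.0]   # cday, cyear, bday, byear
xmm1 = [120.0, 0.0, 0.0, 0.0]        # days since January 1 for both dates
xmm2 = [1.0, 365.25, 1.0, 365.25]    # multipliers after movlhps

xmm1 = addps(xmm1, xmm0)
xmm2 = mulps(xmm2, xmm1)
print(xmm1)  # [129.0, 2017.0, 16.0, 1979.0]
print(xmm2)  # [129.0, 736709.25, 16.0, 722829.75]
```

The lanes multiplied by 1.0 pass through unchanged, which is exactly the role of the 1.0 stored next to 365.25.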

There are two remaining operations to perform:

Finalize calculation of the total number of days for each date
Subtract the earlier date from the later one

By this time, all of the values that we need to use in our calculations are stored in a single register, XMM2. Luckily, SSE3 introduced two important instructions:

haddps: Horizontal addition of single-precision values
Adds the single-precision floating-point values in the first and second dwords, and in the third and fourth dwords, of the destination operand and stores the two sums in the first and second dwords of the destination, respectively. The third and fourth dwords of the destination receive the corresponding two sums computed from the source operand. When both operands are the same register, as in our code, the third dword therefore ends up equal to the first and the fourth equal to the second.
hsubps: Horizontal subtraction of single-precision values
Subtracts the second dword of the destination operand from its first dword and the fourth dword from its third, and stores the two differences in the first and second dwords of the destination, respectively. The third and fourth dwords receive the corresponding two differences computed from the source operand, so with identical operands they repeat the first and second dwords.
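The two horizontal instructions can be modeled in Python as follows. This is a sketch with hypothetical function names; registers are lists of four lanes with lane 0 (bits 0 - 31) first, and the starting value of XMM2 is the one it holds after the multiplication.

```python
def haddps(dst, src):
    """Model of haddps dst, src: pairwise sums of dst fill lanes 0-1,
    pairwise sums of src fill lanes 2-3."""
    return [dst[0] + dst[1], dst[2] + dst[3],
            src[0] + src[1], src[2] + src[3]]


def hsubps(dst, src):
    """Model of hsubps dst, src: pairwise differences (even lane minus
    odd lane) of dst fill lanes 0-1, those of src fill lanes 2-3."""
    return [dst[0] - dst[1], dst[2] - dst[3],
            src[0] - src[1], src[2] - src[3]]


xmm2 = [129.0, 736709.25, 16.0, 722829.75]
xmm2 = haddps(xmm2, xmm2)
print(xmm2)  # [736838.25, 722845.75, 736838.25, 722845.75]
xmm2 = hsubps(xmm2, xmm2)
print(xmm2)  # [13992.5, 13992.5, 13992.5, 13992.5]
```

After haddps, lane 0 holds the total for the current date and lane 1 the total for the birth date, so the even-minus-odd subtraction of hsubps yields current minus birth in every lane.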

Upon completion of hsubps instruction, the content of the registers should be:

bits 96 - 127	bits 64 - 95	bits 32 - 63	bits 0 - 31	register name
1979.0	16.0	2017.0	9.0	XMM0
1979.0	16.0	2017.0	129.0	XMM1
13992.5	13992.5	13992.5	13992.5	XMM2

Content of XMM0 - XMM2 registers after addition and later subtraction of values

As we can see, the XMM2 register contains the number of days between the two dates (the date of birth and the current date) minus 1, as the day of birth itself is not included (this problem will be solved in the calculation loop).

movd xmm3, [dpy]       ; Load 1.0 into the lower double word of XMM3
movlhps xmm3, xmm3     ; Duplicate it to the third double word of XMM3
movsldup xmm3, xmm3    ; Duplicate it to the second and fourth double words of XMM3

The preceding three lines set up the step value for our forecast by loading the double word stored at dpy, which is 1.0, and propagating this value throughout the XMM3 register. We will be adding XMM3 to XMM2 for each new day of the forecast.

The following three lines are logically similar to the previous three; they set all four single precision floats of the XMM4 register to 2*PI:

movd xmm4, [pi_2]
movlhps xmm4, xmm4
movsldup xmm4, xmm4
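Both three-instruction sequences above broadcast a single value to all four lanes. The Python model below traces how the register fills up step by step. It is an illustrative sketch with hypothetical function names, using lists of four lanes with lane 0 (bits 0 - 31) first.

```python
def movd_load(value):
    """Model of movd xmm, [mem]: fill lane 0 and zero the other lanes."""
    return [value, 0.0, 0.0, 0.0]


def movlhps(dst, src):
    """Model of movlhps dst, src: low half of src goes to high half of dst."""
    return [dst[0], dst[1], src[0], src[1]]


def movsldup(src):
    """Model of movsldup dst, src: duplicate the even-numbered lanes
    (lane 0 into lane 1, lane 2 into lane 3)."""
    return [src[0], src[0], src[2], src[2]]


def broadcast(value):
    """Replay the movd / movlhps / movsldup sequence on one value."""
    reg = movd_load(value)    # [v, 0, 0, 0]
    reg = movlhps(reg, reg)   # [v, 0, v, 0]
    return movsldup(reg)      # [v, v, v, v]


print(broadcast(1.0))  # [1.0, 1.0, 1.0, 1.0]
```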

And the last step before entering the calculation loop: we load XMM1 with the lengths of the biorhythmic cycles and set the eax register to point to the location in memory where we are going to store our output data (the forecast). Given the arrangement of data in the data section, the fourth single of the XMM1 register will be loaded with 2*PI, but, since the fourth single is not going to be used in our calculations, we simply leave it as is. We could, however, zero it out by executing the pinsrd xmm1, eax, 3 instruction while eax still holds zero, that is, after the movaps but before eax is loaded with the address of the output buffer. The code as written does not do so:

movaps xmm1, xword[T]
lea eax, [output]

At last we have all the data set up and ready for actual calculation of biorhythmic values for a given range of dates. The registers XMM0 to XMM4 should now have the following values:

bits 96 - 127	bits 64 - 95	bits 32 - 63	bits 0 - 31	register name
1979.0	16.0	2017.0	9.0	XMM0
6.2831802	33.0	28.0	23.0	XMM1
13992.5	13992.5	13992.5	13992.5	XMM2
1.0	1.0	1.0	1.0	XMM3
6.2831802	6.2831802	6.2831802	6.2831802	XMM4
