When we use gcc to compile combine3 with command line option 02, we get code with substantially better CPE performance than with 01 We achieve performance comparable to that for combine4, except for the case of integer sum, but even it improves significantly On examining the assembly code generated...

This assembly code demonstrates a clever optimization opportunity detected by gcc It is worth studyi ...

When we use gcc to compile combine3 with command-line option -02, we get code with substantially better

Question:

When we use gcc to compile combine3 with command-line option -02, we get code with substantially better CPE performance than with -01:

Function Page Method combine3 549 combine3. 549 combine4 551 Compiled -01 Compiled -02 Accumulate in

We achieve performance comparable to that for combine4, except for the case of integer sum, but even it improves significantly. On examining the assembly code generated by the compiler, we find an interesting variant for the inner loop:

1 2 3 4 5 6 1 vmulsd (%rdx), %xmm0, %xmm0 addq $8, %rdx cmpq %rax, %rdx Compare to data+length Store product

We see that, besides some reordering of instructions, the only difference is that the more optimized version does not contain the vmovsd implementing the read from the location designated by dest (line 2).

A. How does the role of register %xmm0 differ in these two loops?

B. Will the more optimized version faithfully implement the C code of combine3, including when there is memory aliasing between dest and the vector data?

C. Either explain why this optimization preserves the desired behavior, or give an example where it would produce different results than the less optimized code.

Fantastic news! We've Found the answer you've been seeking!