Question: When we use gcc to compile combine3 with command-line option -02, we get code with substantially better CPE performance than with -01: We achieve performance

When we use gcc to compile combine3 with command-line option -02, we get code with substantially better CPE performance than with -01:

Function Page Method combine3 549 combine3. 549 combine4 551 Compiled -01 Compiled -02 Accumulate in

We achieve performance comparable to that for combine4, except for the case of integer sum, but even it improves significantly. On examining the assembly code generated by the compiler, we find an interesting variant for the inner loop:

1 2 3 4 5 6 1 vmulsd (%rdx), %xmm0, %xmm0 addq $8, %rdx cmpq %rax, %rdx Compare to data+length Store product

We see that, besides some reordering of instructions, the only difference is that the more optimized version does not contain the vmovsd implementing the read from the location designated by dest (line 2).

A. How does the role of register %xmm0 differ in these two loops?

B. Will the more optimized version faithfully implement the C code of combine3, including when there is memory aliasing between dest and the vector data?

C. Either explain why this optimization preserves the desired behavior, or give an example where it would produce different results than the less optimized code.

Function Page Method combine3 549 combine3. 549 combine4 551 Compiled -01 Compiled -02 Accumulate in temporary Integer + 7.17 1.60 1.27 9.02 3.01 3.01 Floating point 9.02 3.01 3.01 11.03 5.01 5.01

Step by Step Solution

3.34 Rating (160 Votes )

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock

This assembly code demonstrates a clever optimization opportunity detected by gcc It is worth studyi... View full answer

blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Computer Systems A Programmers Perspective Questions!