
Question:

We saw that our measurements of the prefix-sum function psum1 (Figure 5.1) yield a CPE of 9.00 on a machine where the basic operation to be performed, floating-point addition, has a latency of just 3 clock cycles. Let us try to understand why our function performs so poorly.

The following is the assembly code for the inner loop of the function:

    Inner loop of psum1
    a in %rdi, i in %rax, cnt in %rdx
    .L5:
      vmovss -4(%rsi,%rax,4), %xmm0
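To relate the assembly to source code, here is a minimal C sketch of what a prefix-sum routine like psum1 might look like. It is an assumption, since Figure 5.1 is not reproduced here: the parameter names a and cnt follow the register comments above, and p is a hypothetical name for the destination array that appears to be addressed through %rsi. The -4 displacement in the vmovss instruction is consistent with reading the float at p[i-1].

```c
/* Sketch only: assumed shape of psum1; the real Figure 5.1 may differ. */
void psum1(const float a[], float p[], long cnt)
{
    if (cnt <= 0)
        return;

    p[0] = a[0];
    for (long i = 1; i < cnt; i++) {
        /* Reads p[i-1], the value the previous iteration just stored,
           then adds a[i] and stores the result at p[i]. */
        p[i] = p[i - 1] + a[i];
    }
}
```

Calling psum1 on the array {1, 2, 3, 4} should leave {1, 3, 6, 10} in p.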

Perform an analysis similar to those shown for combine3 (Figure 5.14) and for write_read (Figure 5.36) to diagram the data dependencies created by this loop, and hence the critical path that forms as the computation proceeds. Explain why the CPE is so high.
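One way to probe the dependency structure is to compare the loop against a variant that carries the running sum in a local variable instead of reloading it from memory. The sketch below is an illustration under assumptions, not the book's code: the name psum1_acc and the variable acc are hypothetical. If the original loop's critical path runs through storing p[i], loading it back as p[i-1] in the next iteration, and then the floating-point add, this variant would be expected to leave only the add on the loop-carried chain. Whether a given compiler actually keeps acc in a register should be checked in the generated assembly.

```c
/* Hypothetical variant: same results as a prefix sum over a[0..cnt-1],
   but the running sum stays in a local variable. */
void psum1_acc(const float a[], float p[], long cnt)
{
    if (cnt <= 0)
        return;

    float acc = a[0];   /* running sum, never reloaded from p */
    p[0] = acc;
    for (long i = 1; i < cnt; i++) {
        acc = acc + a[i];   /* loop-carried dependency: one add */
        p[i] = acc;         /* store is not read back by later iterations */
    }
}
```

Measuring the CPE of both versions on the same machine would show how much of the 9.00 cycles is attributable to the store-then-load round trip rather than to the addition itself.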

Figure 5.1