Question: HW4-1 (43 points) Suppose we wish to write a procedure that computes the inner product of two vectors

HW4-1 (43 points) Suppose we wish to write a procedure that computes the inner product of two vectors \( u \) and \( v \). An abstract version of the function has a CPE of 14-18 with x86-64 for different types of integer and floating-point data. Doing the same sort of transformations as in the text to get from the program combine1 to the more efficient combine4, we get the following code:
```
typedef float data_t;
#include "vec.h"

/* Inner product. Accumulate in a temporary. */
void inner4(vec_ptr u, vec_ptr v, data_t *dest) {
    long i;
    long length = vec_length(u);
    data_t *udata = get_vec_start(u);
    data_t *vdata = get_vec_start(v);
    data_t sum = (data_t)0;

    for (i = 0; i < length; i++) {
        sum = sum + udata[i] * vdata[i];
    }
    *dest = sum;
}
```
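To try inner4 outside the course environment, something like the following minimal sketch could be used. The contents of vec.h are not shown in the problem, so the vec_rec struct and the two accessor functions below are hypothetical stand-ins, and inner4_of_arrays is an illustrative helper that is not part of the assignment. data_t is set to double here only to match the assembly listing.

```c
#include <assert.h>

typedef double data_t;

/* Hypothetical stand-in for vec.h: a length plus a pointer to the data. */
typedef struct {
    long len;
    data_t *data;
} vec_rec, *vec_ptr;

static long vec_length(vec_ptr v) { return v->len; }
static data_t *get_vec_start(vec_ptr v) { return v->data; }

/* inner4 as in the problem statement: accumulate in a local temporary. */
static void inner4(vec_ptr u, vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(u);
    data_t *udata = get_vec_start(u);
    data_t *vdata = get_vec_start(v);
    data_t sum = (data_t)0;

    for (i = 0; i < length; i++) {
        sum = sum + udata[i] * vdata[i];
    }
    *dest = sum;
}

/* Hypothetical helper: wrap two plain arrays of length n and call inner4. */
static data_t inner4_of_arrays(data_t *a, data_t *b, long n)
{
    vec_rec u = { n, a };
    vec_rec v = { n, b };
    data_t result;

    assert(n >= 0);
    inner4(&u, &v, &result);
    return result;
}
```

Calling inner4_of_arrays on the arrays {1, 2, 3, 4} and {5, 6, 7, 8} with n = 4 computes 1·5 + 2·6 + 3·7 + 4·8.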
Our measurements show that this function has a CPE of 1.50 for integer data and 3.00 for floating-point data. For data type double, the x86-64 assembly code for the inner loop (produced on our virtual machine with flags -O2, -mavx2, and -S) is as follows:
```
# Inner loop of inner4. data_t = double, OP = *
# udata in %rbp, vdata in %rax, sum in %xmm0, i in %rcx, limit in %rbx
.L15:                                   # loop:
  vmovsd 0(%rbp,%rcx,8), %xmm1          # Get udata[i]
  vmulsd (%rax,%rcx,8), %xmm1, %xmm1    # Multiply by vdata[i]
  vaddsd %xmm1, %xmm0, %xmm0            # Add to sum
  addq   $1, %rcx                       # Increment i
  cmpq   %rbx, %rcx                     # Compare i:limit
  jl     .L15                           # If <, goto loop
```
The new details of floating-point assembly code are pretty fully captured by just looking at Figures 3.45, 3.46, and 3.49 with their captions.
Assume that the functional units have the latencies and issue times given in Figure 5.12 (and in the course notes).
A. Diagram how this instruction sequence would be decoded into operations, and show how the data dependencies between them would create a critical path of operations. This process of diagramming is illustrated in Figures 5.13 (dpb-sequential.pptx), 5.14 (dpb-flow.pptx and dpb-flow-abstract.pptx), and 5.15 (dpb-flow-multiple.pptx); you can draw just a diagram in the style of 5.14(a), but do add identification of where the critical path is. (25 points.)
B. For data type double, what lower bound on the CPE is determined by the critical path? Give a numerical value and an explanation. (6 points.)
C. Assuming similar instruction sequences for the integer code as well, what lower bound on the CPE is determined by the critical path for integer data? Give a numerical value and an explanation. (6 points.)
D. Explain how the floating-point version can have a CPE of 3.00 even though the multiplication operation requires 5 cycles. (6 points.)
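As a hedged sketch of the reasoning behind parts B through D: in the loop above, the only value carried from one iteration to the next through the floating-point units is sum. The load and the multiply for element \( i \) do not depend on sum, so they can be issued ahead of time and overlapped with earlier additions. Writing \( L_{\text{add}} \) for the latency of the addition on that chain (a symbol introduced here, not in the problem), the time for \( n \) elements and the resulting bound are roughly:

\[
T(n) \approx n \cdot L_{\text{add}} + \text{constant}
\qquad\Longrightarrow\qquad
\text{CPE} \ge L_{\text{add}}
\]

Assuming Figure 5.12 lists a latency of 3 cycles for floating-point addition and 1 cycle for integer addition (values not reproduced in this problem statement, so check them against the figure), this gives \( \text{CPE} \ge 3.00 \) for double and \( \text{CPE} \ge 1.00 \) for integer data. The 5-cycle multiplication does not appear in the bound because it sits off the loop-carried chain and is pipelined.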
HW4-2 (27 points)
A. Write a version of the inner product procedure described in the previous problem that uses five-way loop unrolling (\(5\times 1\); no parallelism). (15 points.)
For x86-64, our measurements of the unrolled version give a CPE of 1.07 for integer data but still 3.01 for floating-point data.
B. Explain why any version of any inner product procedure (even with parallelism) cannot achieve a CPE less than 1.00. (6 points.)
C. Explain why the performance for floating-point data did not improve with loop unrolling. (6 points.)
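One possible shape for the \(5\times 1\) unrolled procedure in part A is sketched below. It is an illustration under stated assumptions, not the official solution: the vec_rec struct and accessors are hypothetical stand-ins for vec.h, and the names inner5x1 and inner5x1_of_arrays are made up for this example. The five products are added into a single accumulator, and C's left-to-right evaluation of + keeps the additions in one sequential chain, so no parallelism is introduced.

```c
#include <assert.h>

typedef double data_t;

/* Hypothetical stand-in for vec.h: a length plus a pointer to the data. */
typedef struct {
    long len;
    data_t *data;
} vec_rec, *vec_ptr;

static long vec_length(vec_ptr v) { return v->len; }
static data_t *get_vec_start(vec_ptr v) { return v->data; }

/* Inner product with 5x1 unrolling: five elements per iteration,
   one accumulator. */
static void inner5x1(vec_ptr u, vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(u);
    long limit = length - 4;   /* last index at which a full group of 5 fits */
    data_t *udata = get_vec_start(u);
    data_t *vdata = get_vec_start(v);
    data_t sum = (data_t)0;

    /* Main loop: process five elements at a time. */
    for (i = 0; i < limit; i += 5) {
        sum = sum + udata[i]     * vdata[i]
                  + udata[i + 1] * vdata[i + 1]
                  + udata[i + 2] * vdata[i + 2]
                  + udata[i + 3] * vdata[i + 3]
                  + udata[i + 4] * vdata[i + 4];
    }

    /* Cleanup loop: finish the remaining 0 to 4 elements. */
    for (; i < length; i++) {
        sum = sum + udata[i] * vdata[i];
    }
    *dest = sum;
}

/* Hypothetical helper: wrap two plain arrays of length n and call inner5x1. */
static data_t inner5x1_of_arrays(data_t *a, data_t *b, long n)
{
    vec_rec u = { n, a };
    vec_rec v = { n, b };
    data_t result;

    assert(n >= 0);
    inner5x1(&u, &v, &result);
    return result;
}
```

Because every addition still feeds the next one through sum, the unrolling reduces loop overhead (the increment, compare, and branch) but leaves the chain of additions the same length.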