Question: HW 4-1 (43 points)
Suppose we wish to write a procedure that computes the inner product of two vectors u and v. An abstract version of the function has a CPE of … with x86-64 for different types of integer and floating-point data. Applying the same sort of transformations the text uses to get from the program combine1 to the more efficient combine4, we get the following code:
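typedef float data_t;
#include "vec.h"

/* Inner product. Accumulate in temporary */
void inner4(vec_ptr u, vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(u);
    data_t *udata = get_vec_start(u);
    data_t *vdata = get_vec_start(v);
    data_t sum = (data_t) 0;

    /* Accumulate the products in a local variable; store once at the end */
    for (i = 0; i < length; i++) {
        sum = sum + udata[i] * vdata[i];
    }
    *dest = sum;
}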
Our measurements show that this function has a CPE of … for integer data and … for floating-point data. For data type double, the x86-64 assembly code for the inner loop, produced on our virtual machine with the flags -mavx and -S, is as follows:
# Inner loop of inner4. data_t = double, OP = *
# udata in %rbp, vdata in %rax, sum in %xmm0
# i in %rcx, limit in %rbx
.L15:                                     # loop:
    vmovsd  0(%rbp,%rcx,8), %xmm1         # Get udata[i]
    vmulsd  (%rax,%rcx,8), %xmm1, %xmm1   # Multiply by vdata[i]
    vaddsd  %xmm1, %xmm0, %xmm0           # Add to sum
    addq    $1, %rcx                      # Increment i
    cmpq    %rbx, %rcx                    # Compare i:limit
    jl      .L15                          # If <, goto loop
The new details of floating-point assembly code are captured fairly fully by Figures … and … together with their captions.
Assume that the functional units have the latencies and issue times given in Figure … and in the course notes.
A. Diagram how this instruction sequence would be decoded into operations, and show how the data dependencies between them would create a critical path of operations. This diagramming process is illustrated in the figures dpbsequential.pptx, dpbflow.pptx, dpbflowabstract.pptx, and dpbflowmultiple.pptx. You can draw just one diagram in the style of …, but do identify where the critical path is. (… points)
B. For data type double, what lower bound on the CPE is determined by the critical path? Give a numerical value and an explanation. (… points)
C. Assuming similar instruction sequences for the integer code as well, what lower bound on the CPE is determined by the critical path for integer data? Give a numerical value and an explanation. (… points)
D. Explain how the floating-point version can have a CPE of … even though the multiplication operation requires … cycles. (… points)
HW … (… points)
A. Write a version of the inner product procedure described in the previous problem that uses five-way loop unrolling (5 × 1; no parallelism). (… points)
For x86-64, our measurements of the unrolled version give a CPE of … for integer data but still … for floating-point data.
B. Explain why any version of any inner product procedure, even with parallelism, cannot achieve a CPE less than …. (… points)
C. Explain why the performance for floating-point data did not improve with loop unrolling. (… points)