Question: I Will upvote if solved completely 3.14 [25/25/25] In this exercise, we look at how software techniques can extract instruction-level parallelism (ILP) in a common
![I Will upvote if solved completely 3.14 [25/25/25] In this exercise, we](https://dsd5zvtm8ll6.cloudfront.net/si.experts.images/questions/2024/09/66f3d5849afdd_94066f3d584226af.jpg)
3.14 [25/25/25] In this exercise, we look at how software techniques can extract instruction-level parallelism (ILP) in a common vector loop. The following loop is the so-called DAXPY loop (double-precision aX plus Y) and is the central operation in Gaussian elimination. The following code implements the DAXPY operation, Y = aX + Y, for a vector length 100. Initially, R1 is set to the base address of array X and R2 is set to the base address of Y:

         addi   x4,x1,#800    ; x1 = upper bound for X
    foo: fld    F2,0(x1)      ; (F2) = X(i)
         fmul.d F4,F2,F0      ; (F4) = a*X(i)
         fld    F6,0(x2)      ; (F6) = Y(i)
         fadd.d F6,F4,F6      ; (F6) = a*X(i) + Y(i)
         fsd    F6,0(x2)      ; Y(i) = a*X(i) + Y(i)
         addi   x1,x1,#8      ; increment X index
         addi   x2,x2,#8      ; increment Y index
         sltu   x3,x1,x4      ; test: continue loop?
         bnez   x3,foo        ; loop if needed

Assume the functional unit latencies as shown in the following table. Assume a one-cycle delayed branch that resolves in the ID stage. Assume that results are fully bypassed.

| Instruction producing result | Instruction using result | Latency in clock cycles |
|---|---|---|
| FP multiply | FP ALU op | 6 |
| FP add | FP ALU op | 4 |
| FP multiply | FP store | 5 |
| FP add | FP store | 4 |
| Integer operations and all loads | Any | 2 |
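For readers who want to check what the loop computes before reasoning about its schedule, here is a minimal Python sketch of the DAXPY operation described above. It is only a reference model, not part of the exercise: the function name `daxpy`, the constants, and the test data are hypothetical, and Python lists stand in for the memory addressed through x1 and x2. The iteration count is derived from the 800-byte upper bound and the 8-byte double-precision element size.

```python
BYTES_PER_DOUBLE = 8       # size of one double-precision element
UPPER_BOUND_OFFSET = 800   # offset added to the X base address in the exercise

# Number of loop iterations implied by the bound and the element size.
iterations = UPPER_BOUND_OFFSET // BYTES_PER_DOUBLE


def daxpy(a, x, y):
    """Update y in place so that y[i] = a * x[i] + y[i], then return y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(x)):
        # One pass corresponds to one trip through the assembly loop body:
        # load X(i), multiply by a, load Y(i), add, store Y(i).
        y[i] = a * x[i] + y[i]
    return y


# Hypothetical test data (not from the exercise).
x = [float(i) for i in range(iterations)]
y = [1.0] * iterations
daxpy(2.0, x, y)
print(iterations, y[0], y[1], y[99])  # prints: 100 1.0 3.0 199.0
```

Each element of Y depends only on the same element of X and Y, so iterations are independent of one another. That independence is what makes the loop a candidate for the software ILP techniques the exercise asks about.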
