Question:
(1) Assume the outcome of the branch instruction is correctly predicted.
(2) Assume there is one integer ALU for address calculation and another integer ALU for branches and all other integer operations.
(3) If the first instruction in an issue packet is a branch instruction, only this branch instruction can be issued in this cycle.
(4) Up to two instructions can be committed per cycle.
(5) There are two CDBs.
(6) For load/store, EX is for address calculation.
(7) Only show the first two iterations and ignore the addi instruction before the loop.
(8) The functional units (FUs) are pipelined, with the latencies described in the table below.

FU type         Cycles in EX   Number of FUs   Number of reservation stations
Integer         1              2               (not given)
FP adder        12             1               3
FP multiplier   18             1               (not given)

Q1-a) With speculation; there are twelve Reorder Buffer (ROB) entries.
Q1-b) Without speculation;

3.14 [25/25/25] In this exercise, we look at how software techniques can extract instruction-level parallelism (ILP) in a common vector loop. The following loop is the so-called DAXPY loop (double-precision aX plus Y) and is the central operation in Gaussian elimination. The following code implements the DAXPY operation, Y = aX + Y, for a vector length 100. Initially, R1 is set to the base address of array X and R2 is set to the base address of Y:
(From Case Studies and Exercises by Jason D. Bakos and Robert P. Colwell.)

      addi   x4,x1,#800    ; x4 = upper bound for X
foo:  fld    F2,0(x1)      ; (F2) = X(i)
      fmul.d F4,F2,F0      ; (F4) = a*X(i)
      fld    F6,0(x2)      ; (F6) = Y(i)
      fadd.d F6,F4,F6      ; (F6) = a*X(i) + Y(i)
      fsd    F6,0(x2)      ; Y(i) = a*X(i) + Y(i)
      addi   x1,x1,#8      ; increment X index
      addi   x2,x2,#8      ; increment Y index
      sltu   x3,x1,x4      ; test: continue loop?
      bnez   x3,foo        ; loop if needed

Assume the functional unit latencies as shown in the following table. Assume a one-cycle delayed branch that resolves in the ID stage. Assume that results are fully bypassed.

Instruction producing result       Instruction using result   Latency in clock cycles
FP multiply                        FP ALU op                  6
FP add                             FP ALU op                  4
FP multiply                        FP store                   (not given)
FP add                             FP store                   4
Integer operations and all loads   Any                        (not given)
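To make clear what one pass through the loop at foo computes, here is a minimal Python sketch that mirrors the assembly instruction by instruction. It is only an illustration: the function name daxpy_loop, the use of Python lists in place of memory, and byte offsets starting at 0 for both arrays are assumptions, not part of the exercise. It models what the loop computes, not the pipeline timing.

```python
ELEM_BYTES = 8   # size of one double-precision element
N = 100          # vector length stated in the exercise


def daxpy_loop(a, xs, ys):
    """Update ys in place to a*xs[i] + ys[i]; return the iteration count.

    Hypothetical model: xs and ys stand in for memory, and x1/x2 are byte
    offsets from an assumed base address of 0 for each array.
    """
    x1 = 0                        # offset of X(i)
    x2 = 0                        # offset of Y(i)
    x4 = x1 + N * ELEM_BYTES      # addi x4,x1,#800  (upper bound for X)
    iterations = 0
    while True:
        f2 = xs[x1 // ELEM_BYTES]       # fld    F2,0(x1)
        f4 = f2 * a                     # fmul.d F4,F2,F0  (F0 holds a)
        f6 = ys[x2 // ELEM_BYTES]       # fld    F6,0(x2)
        f6 = f4 + f6                    # fadd.d F6,F4,F6
        ys[x2 // ELEM_BYTES] = f6       # fsd    F6,0(x2)
        x1 += ELEM_BYTES                # addi   x1,x1,#8
        x2 += ELEM_BYTES                # addi   x2,x2,#8
        x3 = 1 if x1 < x4 else 0        # sltu   x3,x1,x4
        iterations += 1
        if x3 == 0:                     # bnez   x3,foo falls through
            break
    return iterations


if __name__ == "__main__":
    xs = [float(i) for i in range(N)]
    ys = [1.0] * N
    print(daxpy_loop(2.0, xs, ys), ys[0], ys[N - 1])
```

With a = 2.0, X(i) = i and Y(i) = 1.0, the loop runs 100 times (800 bytes at 8 bytes per element) and leaves Y(0) = 1.0 and Y(99) = 199.0. The dependence chain fld, fmul.d, fadd.d, fsd visible in the comments is the one the latency table applies to.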
Step by Step Solution
Answer, Q1-a) With correct speculation, stalls are avoided and results are computed early, which improves performance. Speculation accu...
