(1) Assume the outcome of branch instruction is correctly predicted. (2) Assume there is an integer...


Question:

Transcribed Image Text:

(1) Assume the outcome of the branch instruction is correctly predicted.
(2) Assume there is one integer ALU for address calculation and another integer ALU for branches and all other integer operations.
(3) If the first instruction in an issue packet is a branch instruction, only this branch instruction can be issued in this cycle.
(4) Up to two instructions can be committed per cycle.
(5) There are two CDBs.
(6) For load/store, EX is for address calculation.
(7) Only show the first two iterations and ignore the addi instruction before the loop.
(8) The functional units (FUs) are pipelined, with the latencies described in the table below.

FU type         Cycles in EX   Number of FUs   Number of reservation stations
Integer         1              2               (missing)
FP adder        12             1               3
FP multiplier   18             1               (missing)

Q1-a) With speculation; there are twelve reorder buffer (ROB) entries.
Q1-b) Without speculation;

3.14 [25/25/25] <3.2, 3.7> In this exercise, we look at how software techniques can extract instruction-level parallelism (ILP) in a common vector loop. The following loop is the so-called DAXPY loop (double-precision aX plus Y) and is the central operation in Gaussian elimination. The following code implements the DAXPY operation, Y = aX + Y, for a vector length of 100. Initially, R1 (x1 in the code) is set to the base address of array X and R2 (x2 in the code) is set to the base address of Y:

      addi   x4,x1,#800   ; x4 = upper bound for X
foo:  fld    F2,0(x1)     ; (F2) = X(i)
      fmul.d F4,F2,F0     ; (F4) = aX(i)
      fld    F6,0(x2)     ; (F6) = Y(i)
      fadd.d F6,F4,F6     ; (F6) = aX(i) + Y(i)
      fsd    F6,0(x2)     ; Y(i) = aX(i) + Y(i)
      addi   x1,x1,#8     ; increment X index
      addi   x2,x2,#8     ; increment Y index
      sltu   x3,x1,x4     ; test: continue loop?
      bnez   x3,foo       ; loop if needed

Assume the functional unit latencies as shown in the following table. Assume a one-cycle delayed branch that resolves in the ID stage. Assume that results are fully bypassed.

Instruction producing result       Instruction using result   Latency in clock cycles
FP multiply                        FP ALU op                  6
FP add                             FP ALU op                  4
FP multiply                        FP store                   (missing)
FP add                             FP store                   4
Integer operations and all loads   Any                        (missing)

Source: Case Studies and Exercises by Jason D. Bakos and Robert P. Colwell, p. 275.
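As a reference for what the loop computes, here is a minimal Python sketch that mirrors the assembly one instruction at a time. It is an illustration only, not part of the exercise: the function name daxpy_loop is hypothetical, the arrays are modeled as Python lists, and the base addresses of X and Y are assumed to be byte offset 0. It says nothing about pipeline timing; it only makes the data flow and the iteration count explicit.

```python
def daxpy_loop(a, x, y):
    """Compute Y = a*X + Y in place, walking byte offsets like the assembly loop.

    Assumptions (not from the exercise): x and y are equal-length, non-empty
    lists of floats, and both arrays start at byte offset 0.
    Returns the number of loop iterations executed.
    """
    WORD = 8                       # bytes per double, matches the #8 increments
    x1 = 0                         # byte offset into X
    x2 = 0                         # byte offset into Y
    x4 = x1 + WORD * len(x)        # addi x4,x1,#800 when len(x) == 100
    iterations = 0
    while True:                    # the loop body runs before the test, as at label foo
        f2 = x[x1 // WORD]         # fld    F2,0(x1)
        f4 = f2 * a                # fmul.d F4,F2,F0  (F0 holds the scalar a)
        f6 = y[x2 // WORD]         # fld    F6,0(x2)
        f6 = f4 + f6               # fadd.d F6,F4,F6
        y[x2 // WORD] = f6         # fsd    F6,0(x2)
        x1 += WORD                 # addi   x1,x1,#8
        x2 += WORD                 # addi   x2,x2,#8
        iterations += 1
        x3 = 1 if x1 < x4 else 0   # sltu   x3,x1,x4
        if x3 == 0:                # bnez   x3,foo falls through when x3 == 0
            break
    return iterations
```

The comments show the dependence chain that drives the schedule: fmul.d needs the first fld, fadd.d needs both fmul.d and the second fld, and fsd needs fadd.d. The two addi instructions and the sltu/bnez pair depend only on integer registers.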


Posted Date: Nov 22, 2020 01:23 AM
