Question: Consider the superpipelined CPU design shown below, in which instruction fetch takes two cycles (IF1 and IF2), data loads and stores take two cycles (ME1 and ME2), and execution takes two cycles (EX1 and EX2). Branches are handled in EX2 and are always predicted untaken by hardware.
Consider the following program, which searches an area of memory and counts the number of times a memory word is equal to a key word:
SEARCH:  LW   R5, 0(R3)
         SUB  R6, R5, R2
         BNEZ R6, NOMATCH
         ADDI R1, R1, #1
NOMATCH: ADDI R3, R3, #4
         BNE  R4, R3, SEARCH
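As a reading aid, the loop can be mirrored in a few lines of Python. This is only a sketch of the program's behaviour, not part of the original question. The register roles are inferred from the listing (R3 as the current byte address, R4 as the end address, R2 as the key, R1 as the match counter), and the names `count_matches`, `memory`, `start`, `end`, `key` and `initial_count` are hypothetical. Memory is modelled as a dict from byte address to word value, and the initial value of R1 is assumed to be 0.

```python
def count_matches(memory, start, end, key, initial_count=0):
    """Mirror of the SEARCH loop; register roles are inferred, not given."""
    r1 = initial_count   # R1: match counter (initial value is an assumption)
    r3 = start           # R3: byte address of the word under test
    while True:
        r5 = memory[r3]  # LW   R5, 0(R3)
        r6 = r5 - key    # SUB  R6, R5, R2
        if r6 == 0:      # BNEZ R6, NOMATCH skips the increment when R6 != 0
            r1 += 1      # ADDI R1, R1, #1
        r3 += 4          # NOMATCH: ADDI R3, R3, #4
        if r3 == end:    # BNE  R4, R3, SEARCH falls through when R4 == R3
            break
    return r1


# Four words at byte addresses 0, 4, 8, 12; two of them equal the key 7.
words = {0: 7, 4: 3, 8: 7, 12: 1}
print(count_matches(words, start=0, end=16, key=7))  # prints 2
```

Like the assembly, the sketch tests the exit condition at the bottom of the loop, so the body always runs at least once.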
Branches are always predicted untaken and are taken in EX2 if needed. Hardware support for branches is included in all cases. Consider several possible pipeline interlock designs for data hazards and answer the following questions for each loop iteration except the last.
a) Assume first that the pipeline has no forwarding unit and no hazard detection unit. Values are not even forwarded inside the register file. Re-write the code by inserting NOOPs wherever needed so that the code will execute correctly.
b) Assume no forwarding at all, but a hazard detection unit that stalls instructions in ID to avoid hazards. How many clocks does it take to execute one iteration of the loop (1) on a match and (2) on no match?
c) Assume full forwarding and a hazard detection unit that stalls instructions in ID to avoid hazards. How many clocks does it take to execute one iteration of the loop (1) on a match and (2) on no match?
d) Identify basic blocks (using instruction numbers). Is it possible to save cycles by local optimizations? Why?
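Parts a) through d) all depend on two properties of the loop: which instructions read a register written by an earlier instruction (RAW dependences), and where the basic blocks begin. The Python sketch below derives both. It assumes the instructions are numbered 1 to 6 in program order, a numbering the listing itself does not carry. The tuple encoding and the names `PROGRAM`, `raw_dependences` and `basic_blocks` are hypothetical. The sketch reports no cycle counts, since those depend on the pipeline figure.

```python
# (number, label, opcode, destination register, source registers, branch target)
PROGRAM = [
    (1, "SEARCH",  "LW",   "R5", ("R3",),      None),
    (2, None,      "SUB",  "R6", ("R5", "R2"), None),
    (3, None,      "BNEZ", None, ("R6",),      "NOMATCH"),
    (4, None,      "ADDI", "R1", ("R1",),      None),
    (5, "NOMATCH", "ADDI", "R3", ("R3",),      None),
    (6, None,      "BNE",  None, ("R4", "R3"), "SEARCH"),
]


def raw_dependences(program):
    """(producer, consumer, register) triples within one pass of the listing."""
    last_writer = {}
    deps = []
    for num, _label, _op, dest, srcs, _target in program:
        for reg in srcs:
            if reg in last_writer:
                deps.append((last_writer[reg], num, reg))
        if dest is not None:
            last_writer[dest] = num
    return deps


def basic_blocks(program):
    """Leader rule: first instruction, branch targets, instructions after branches."""
    labels = {label: num for num, label, *_ in program if label}
    leaders = {program[0][0]}
    for i, (_num, _label, _op, _dest, _srcs, target) in enumerate(program):
        if target is not None:
            leaders.add(labels[target])
            if i + 1 < len(program):
                leaders.add(program[i + 1][0])
    ordered = sorted(leaders)
    last = program[-1][0]
    return [
        (start, ordered[k + 1] - 1 if k + 1 < len(ordered) else last)
        for k, start in enumerate(ordered)
    ]


print(raw_dependences(PROGRAM))  # [(1, 2, 'R5'), (2, 3, 'R6'), (5, 6, 'R3')]
print(basic_blocks(PROGRAM))     # [(1, 3), (4, 4), (5, 6)]
```

The dependence scan covers a single pass through the listing, so it leaves out the loop-carried dependences on R3 and R1 from one iteration to the next. Each reported producer and consumer pair is adjacent and sits inside one basic block, which suggests that reordering within a block has little room to separate them.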
