Question: Use the following code fragment:In this exercise, we look at how software techniques can extract instruction - level parallelism ( ILP ) in a common
Use the following code fragment:In this exercise, we look at how software techniques can extract instructionlevel parallelism
ILP in a common vector loop. The following loop is the socalled DAXPY loop double
precision a plus and is the central operation in Gaussian elimination. The following code
implements the DAXPY operation, and are arrays with elements
Initially, is set to the base address of array and is set to the base address of
contains to represent bytes of arrays and
Assume the functional unit latencies as shown in the following table. Assume a onecycle
delayed branch that resolves in the ID stage. Assume that results are fully bypassed data
forwarding
a Assume a singleissue pipeline. Reorder code as necessary to minimize stalls. Remember to
use the latencies given in the table above. How many cycles are needed to complete one iteration?
b Unroll the loop as many times as necessary to schedule it without any stalls, collapsing the loop
overhead instructions. How many times must the loop be unrolled? Show the instruction schedule.
What is the execution time per element of the result or per iteration time
Assume that the initial value of is loop is repeated times
a Show the timing of this instruction sequence for the stage RISC pipeline without any
forwarding or bypassing hardware but assuming that a register read and a write in the same clock
cycle for example, when an instruction writes back result to a register in cycle another
instruction read the register in the same cycle n Assume that if branch instruction causes stalls
if the branch is taken and zero cycle if not taken. Show the flow for one iteration and compute the
number of cycles needed to complete one iteration, then compute total number of cycles needed
to complete all iterations.
b Show the timing of this instruction sequence for the stage RISC pipeline with full forwarding
and bypassing hardware. Remember that you need a stall after load if the next instruction needs
the value read from memory. Assume that if branch instruction causes stalls if the branch is taken
and zero cycle if not taken. Show the flow for one iteration and compute the number of cycles
needed to complete one iteration, then compute total number of cycles needed to complete all
iterations.
c Highperformance processors have very deep pipelinesmore than stages. For this problem,
imagine that you have a stage pipeline in which every stage of the stage pipeline has been
split in two that is we have two Instruction Fetch stages, say IF IF two decode, D D etc
The only catch is that, for data forwarding, data can be forwarded from the end of the second
execute or second memory stage. a pair of stages to the beginning of the two stages where they are
needed. So data are forwarded from the output of the second execute stage to theinput of the first
execute stage, still causing a Icycle delay. Show the timing of this instruction sequence for the
stage RISC pipeline with full forwarding and bypassing hardware. Assume branch causes
stalls if the branch is taken and zero if the branch is not taken. How many cycles does this loop
take to complete one iteration, and how many cycles to complete all iterations?
Step by Step Solution
There are 3 Steps involved in it
1 Expert Approved Answer
Step: 1 Unlock
Question Has Been Solved by an Expert!
Get step-by-step solutions from verified subject matter experts
Step: 2 Unlock
Step: 3 Unlock
