Question:

In this exercise, we look at how software techniques can extract instruction-level parallelism
(ILP) in a common vector loop. The following loop is the so-called DAXPY loop
(double-precision a*X plus Y) and is the central operation in Gaussian elimination. The following code
implements the DAXPY operation (X and Y are arrays with 100 elements).
Initially, x1 is set to the base address of array X and x2 is set to the base address of array Y; x4
contains 800 to represent the 100 × 8 bytes occupied by each of the arrays X and Y.
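The assembly listing for the loop is not reproduced here, so as a rough reference, the DAXPY operation described above can be sketched in Python. The function name `daxpy` and the byte-offset loop are illustrative assumptions that mimic how a loop bound of 800 bytes would walk 100 double-precision elements; this is not the exercise's code.

```python
ELEMENT_SIZE = 8  # bytes per double-precision value


def daxpy(a, x, y):
    """Compute y[i] = a * x[i] + y[i] in place, stepping by byte offset.

    The byte-offset loop is an assumption that mirrors an assembly loop
    whose bound register holds 100 * 8 = 800 for 100-element arrays.
    """
    limit = len(x) * ELEMENT_SIZE  # 800 when len(x) == 100
    offset = 0
    while offset != limit:
        i = offset // ELEMENT_SIZE
        y[i] = a * x[i] + y[i]
        offset += ELEMENT_SIZE
    return y


result = daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(result)
```

Each pass of the `while` loop corresponds to one iteration of the assembly loop: two loads, a multiply, an add, a store, and the pointer and branch overhead.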
Assume the functional unit latencies as shown in the following table. Assume a one-cycle
delayed branch that resolves in the ID stage. Assume that results are fully bypassed (data
forwarding).
a. Assume a single-issue pipeline. Reorder code as necessary to minimize stalls. Remember to
use the latencies given in the table above. How many cycles are needed to complete one iteration?
b. Unroll the loop as many times as necessary to schedule it without any stalls, collapsing the loop
overhead instructions. How many times must the loop be unrolled? Show the instruction schedule.
What is the execution time per element of the result (or per iteration time)?
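To make the idea of unrolling in part b concrete, here is a minimal Python sketch. The unroll factor of 4 is an arbitrary illustration and not the answer to the question, since the required factor depends on the latency table, which is not reproduced here. The names `daxpy_unrolled4` and `cycles_per_element` are hypothetical.

```python
def daxpy_unrolled4(a, x, y):
    """DAXPY with the loop body replicated four times per iteration.

    The index update and loop test happen once per four elements, which
    is what "collapsing the loop overhead instructions" refers to.
    Assumes len(x) is a multiple of 4 (true for 100 elements).
    """
    n = len(x)
    assert n % 4 == 0, "this sketch has no clean-up loop"
    for i in range(0, n, 4):
        y[i] = a * x[i] + y[i]
        y[i + 1] = a * x[i + 1] + y[i + 1]
        y[i + 2] = a * x[i + 2] + y[i + 2]
        y[i + 3] = a * x[i + 3] + y[i + 3]
    return y


def cycles_per_element(cycles_per_unrolled_iteration, unroll_factor):
    """Execution time per result element for an unrolled schedule."""
    return cycles_per_unrolled_iteration / unroll_factor


unrolled = daxpy_unrolled4(2.0, [1.0] * 100, [3.0] * 100)
```

Once a stall-free schedule is found, the per-element time is the schedule length divided by the number of elements it produces, which is what `cycles_per_element` computes.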

Assume that the initial value of register 5 is the value of register 2 plus 396 (so the loop is repeated 99 times).
a. Show the timing of this instruction sequence for the 5-stage RISC pipeline without any
forwarding or bypassing hardware, but assuming that a register can be written and read in the same
clock cycle (for example, when an instruction writes its result back to a register in cycle n, another
instruction can read that register in the same cycle n). Assume that the branch instruction causes
2 stall cycles if the branch is taken and none if it is not taken. Show the flow for one iteration and
compute the number of cycles needed to complete one iteration, then compute the total number of
cycles needed to complete all 99 iterations.
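The timing rule in part a can be sketched as a small scheduler. The loop body below is a hypothetical stand-in, because the exercise's code fragment is not reproduced here, and the model is an assumption: each instruction spends one cycle per stage (IF, ID, EX, MEM, WB), a stalled instruction waits in ID, and a consumer may be in ID no earlier than the cycle in which its producer is in WB.

```python
# Hypothetical loop body (NOT the fragment from the exercise):
# (text, destination register or None, source registers).
PROGRAM = [
    ("ld   x1, 0(x2)",  "x1", ("x2",)),
    ("addi x1, x1, 1",  "x1", ("x1",)),
    ("sd   x1, 0(x2)",  None, ("x1", "x2")),
    ("addi x2, x2, 4",  "x2", ("x2",)),
    ("sub  x4, x3, x2", "x4", ("x3", "x2")),
    ("bnez x4, Loop",   None, ("x4",)),
]


def id_cycles_no_forwarding(program):
    """Return the cycle in which each instruction completes ID.

    Without forwarding, a consumer reads its operands in ID, and the
    register file lets it read in the same cycle the producer writes
    back, so consumer ID >= producer WB = producer ID + 3.
    """
    id_cycles = []
    last_writer = {}  # register -> index of most recent writer
    for index, (_, dest, srcs) in enumerate(program):
        # The first instruction is in IF in cycle 1, so ID in cycle 2.
        earliest = 2 if index == 0 else id_cycles[-1] + 1
        for reg in srcs:
            if reg in last_writer:
                producer_wb = id_cycles[last_writer[reg]] + 3
                earliest = max(earliest, producer_wb)
        id_cycles.append(earliest)
        if dest is not None:
            last_writer[dest] = index
    return id_cycles


ids = id_cycles_no_forwarding(PROGRAM)
last_wb = ids[-1] + 3  # cycle in which the branch leaves WB
for (text, _, _), cycle in zip(PROGRAM, ids):
    print(f"{text:18s} ID in cycle {cycle}")
```

For this stand-in fragment, the gaps between consecutive ID cycles show where stall cycles appear in the pipeline diagram. The same function can be applied to the actual fragment once its registers are filled in.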
b. Show the timing of this instruction sequence for the 5-stage RISC pipeline with full forwarding
and bypassing hardware. Remember that a stall is needed after a load if the next instruction uses
the value read from memory. Assume that the branch instruction causes 2 stall cycles if the branch is
taken and none if it is not taken. Show the flow for one iteration and compute the number of cycles
needed to complete one iteration, then compute the total number of cycles needed to complete all 99
iterations.
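With full forwarding, the only data stall in this model is the load-use stall, so the cycle count reduces to simple bookkeeping. The sketch below uses a hypothetical loop body, not the exercise's fragment, and assumes that no data hazard carries across iterations, that the final not-taken branch costs nothing, and that the last instruction needs 4 more cycles to drain the 5-stage pipeline.

```python
# Hypothetical loop body (NOT the fragment from the exercise):
# (text, destination register or None, source registers, is_load).
PROGRAM = [
    ("ld   x1, 0(x2)",  "x1", ("x2",),      True),
    ("addi x1, x1, 1",  "x1", ("x1",),      False),
    ("sd   x1, 0(x2)",  None, ("x1", "x2"), False),
    ("addi x2, x2, 4",  "x2", ("x2",),      False),
    ("sub  x4, x3, x2", "x4", ("x3", "x2"), False),
    ("bnez x4, Loop",   None, ("x4",),      False),
]


def load_use_stalls(program):
    """Count one stall for each load directly followed by a consumer."""
    stalls = 0
    for (_, prev_dest, _, prev_is_load), (_, _, srcs, _) in zip(program, program[1:]):
        if prev_is_load and prev_dest in srcs:
            stalls += 1
    return stalls


def total_cycles(n_instr, data_stalls, iterations, taken_penalty, stages=5):
    """Total cycles from the first IF to the last WB.

    Every iteration occupies n_instr + data_stalls issue slots, every
    taken branch (all but the last) adds taken_penalty slots, and the
    final instruction needs stages - 1 further cycles to finish.
    """
    per_iteration = n_instr + data_stalls
    return iterations * per_iteration + (iterations - 1) * taken_penalty + (stages - 1)


stalls = load_use_stalls(PROGRAM)
total = total_cycles(len(PROGRAM), stalls, iterations=99, taken_penalty=2)
print(stalls, total)
```

The totals are specific to the stand-in fragment and the stated assumptions. With the actual code, only `PROGRAM` needs to change.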
c. High-performance processors have very deep pipelines, with more than 15 stages. For this problem,
imagine that you have a 10-stage pipeline in which every stage of the 5-stage pipeline has been
split in two (that is, there are two instruction fetch stages, IF1 and IF2, two decode stages, D1 and D2, etc.).
The only catch is that, for data forwarding, data can be forwarded only from the end of a pair of
stages (the second execute or second memory stage) to the beginning of the pair of stages where they are
needed. So, data are forwarded from the output of the second execute stage to the input of the first
execute stage, still causing a 1-cycle delay. Show the timing of this instruction sequence for the
10-stage RISC pipeline with full forwarding and bypassing hardware. Assume that the branch causes 4
stall cycles if it is taken and none if it is not taken. How many cycles does this loop
take to complete one iteration, and how many cycles to complete all 99 iterations?
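One way to reason about the 10-stage variant is to track the cycle in which each instruction enters the first execute stage. The sketch below is a reading of the forwarding rule stated above, not a verified solution: a dependent instruction must enter E1 at least 2 cycles after an ALU producer does (the result leaves E2) and at least 4 cycles after a load does (the result leaves M2). The loop body and the stage names E1, E2, M1, M2, W1, W2 are hypothetical, and stores and branches are conservatively assumed to need their operands at E1.

```python
# Hypothetical loop body (NOT the fragment from the exercise):
# (text, destination register or None, source registers, is_load).
PROGRAM = [
    ("ld   x1, 0(x2)",  "x1", ("x2",),      True),
    ("addi x1, x1, 1",  "x1", ("x1",),      False),
    ("sd   x1, 0(x2)",  None, ("x1", "x2"), False),
    ("addi x2, x2, 4",  "x2", ("x2",),      False),
    ("sub  x4, x3, x2", "x4", ("x3", "x2"), False),
    ("bnez x4, Loop",   None, ("x4",),      False),
]


def e1_cycles(program, alu_gap=2, load_gap=4, first_e1=5):
    """Return the cycle in which each instruction enters E1.

    first_e1=5 corresponds to IF1 in cycle 1, followed by IF2, D1, D2.
    alu_gap and load_gap are the assumed minimum E1-to-E1 distances
    between a producer and a dependent consumer.
    """
    e1 = []
    last_writer = {}  # register -> index of most recent writer
    for index, (_, dest, srcs, _) in enumerate(program):
        earliest = first_e1 if index == 0 else e1[-1] + 1
        for reg in srcs:
            if reg in last_writer:
                producer = last_writer[reg]
                gap = load_gap if program[producer][3] else alu_gap
                earliest = max(earliest, e1[producer] + gap)
        e1.append(earliest)
        if dest is not None:
            last_writer[dest] = index
    return e1


schedule = e1_cycles(PROGRAM)
stall_cycles = (schedule[-1] - schedule[0] + 1) - len(PROGRAM)
last_w2 = schedule[-1] + 5  # E1 -> E2, M1, M2, W1, W2
print(schedule, stall_cycles, last_w2)
```

Compared with the 5-stage pipeline, back-to-back dependent ALU instructions now cost a stall and a load-use dependence costs three, which is why deeper pipelines are more sensitive to instruction ordering.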