Question: Q1: In this exercise, we look at how software techniques can extract instruction-level parallelism (ILP) in a common vector loop. The following loop is the so-called DAXPY loop (double-precision aX plus Y) and is the central operation in Gaussian elimination. The following code implements the DAXPY operation, Y = aX + Y. Initially, F4 holds the constant a, R1 is set to the base address of array X, and R2 is set to the base address of array Y:

    foo: L.D     F6, 0(R1)      ; load X(i) to Reg(F6)
         MUL.D   F2, F6, F4     ; Reg(F2) = a*X(i)
         L.D     F8, 0(R2)      ; Reg(F8) = Y(i)
         ADD.D   F8, F2, F8     ; Reg(F8) = a*X(i)+Y(i)
         S.D     F8, 0(R2)      ; store Reg(F8) to Y(i)
         DADDIU  R1, R1, #8     ; increase X index
         DADDIU  R2, R2, #8     ; increase Y index
         DSLTU   R5, R1, R3     ; test: continue loop?
         BNEZ    R5, foo        ; loop if needed

The table below shows the number of intervening clock cycles needed to avoid a stall. Assume that results are fully bypassed.

    Instruction producing result    Instruction using result
    FP multiply                     FP store
    FP ALU op                       FP store
    FP multiply                     FP ALU op
    FP ALU op                       FP ALU op
    Load                            Store
    Load                            Other than store
    Integer ALU op                  Branch
    Integer ALU op                  Integer ALU op

    Latency in clock cycles (values recoverable from the source): 4, 0, 0
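To pin down what the loop at foo computes, here is a minimal Python sketch of the same operation, followed by a four-way unrolled variant. Loop unrolling is one common software technique for exposing ILP, but the question text above does not say which technique it goes on to ask about, so the unrolled version is an illustration only. The function names daxpy and daxpy_unrolled4 are hypothetical, and the sketch assumes that R3 holds the address just past the end of X, which the question does not state.

```python
def daxpy(a, x, y):
    """Update y in place so that y[i] = a*x[i] + y[i]; mirrors the loop at foo."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)  # stands in for the end address assumed to be in R3
    i = 0       # stands in for the offsets held in R1 and R2, counted in elements
    # The assembly tests at the bottom of the loop; a top test is used here
    # so that empty arrays are handled as well.
    while i < n:
        f6 = x[i]        # L.D   F6, 0(R1)
        f2 = a * f6      # MUL.D F2, F6, F4
        f8 = y[i]        # L.D   F8, 0(R2)
        f8 = f2 + f8     # ADD.D F8, F2, F8
        y[i] = f8        # S.D   F8, 0(R2)
        i += 1           # DADDIU R1/R2 by 8 bytes = one double
    return y


def daxpy_unrolled4(a, x, y):
    """Same result as daxpy, with the body unrolled four times."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    limit = n - n % 4  # largest multiple of 4 that fits in n
    i = 0
    while i < limit:
        # Four independent multiplies, then four independent adds: no
        # element depends on another, so a scheduler is free to overlap them.
        p0 = a * x[i]
        p1 = a * x[i + 1]
        p2 = a * x[i + 2]
        p3 = a * x[i + 3]
        y[i] = p0 + y[i]
        y[i + 1] = p1 + y[i + 1]
        y[i + 2] = p2 + y[i + 2]
        y[i + 3] = p3 + y[i + 3]
        i += 4  # one index update and one branch per four elements
    while i < n:  # cleanup loop for the remaining 0 to 3 elements
        y[i] = a * x[i] + y[i]
        i += 1
    return y
```

Both functions perform the same multiply and add on each element in the same order, so they return identical results. The unrolled form only reduces the loop overhead per element and groups independent operations together.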
