Question: [10 Marks) To increase clock frequency, we divide the data path into the five standard pipeline stages. As in Ben's design, we still use a

 [10 Marks) To increase clock frequency, we divide the data path
into the five standard pipeline stages. As in Ben's design, we still

[10 Marks) To increase clock frequency, we divide the data path into the five standard pipeline stages. As in Ben's design, we still use a Princeton architecture design with a single memory. The pipeline is shown below in a stylized fashion, with the same memory drawn once in the I-Fetch stage and once in the Memory stage. Both stages use the same two-bank interleaved memory described in the handout, but each stage is connected to a different port of that memory. Assume the pipeline is fully bypassed as shown in lecture, and that branches are resolved in the decode stage. (Submit handwritten answer on sheet in a PDF, each pdf for each Question as well as ROUGH WORK) Consider again the first MIPS code segment, which performs one load per iteration: Ox108: 0x10C: 0x110: 0x114: 0x118: Ox11c: loop: LW 14,0 (13) MUL r2, r2, 14 ADDI 13, 13, 4 ADDI ri, r1, -1 BNEZ ri, loop SUB 0x41 we ns2 addr rdata rdih WS wdrdz GPRS Memory we addr rdata Memory wdata Imm Ext 1-Fetch Decode, Reg. Fetch Execute Memory Write-back (F) (D) (E) (M) (W) a) Assuming we stall for bank conflicts, what is the average number of clock cycles needed per iteration during the loop's steady-state execution? Does it depend upon the initial value of r3, the address of the first element of the array in memory? (04) b) Reorder the instructions in the code segment to reduce stalls without changing the functionality of the code, while still performing only one load per iteration. Indicate how many fewer stalls per iteration your revised loop incurs on the 5-stage pipeline? (06) [10 Marks) To increase clock frequency, we divide the data path into the five standard pipeline stages. As in Ben's design, we still use a Princeton architecture design with a single memory. The pipeline is shown below in a stylized fashion, with the same memory drawn once in the I-Fetch stage and once in the Memory stage. Both stages use the same two-bank interleaved memory described in the handout, but each stage is connected to a different port of that memory. Assume the pipeline is fully bypassed as shown in lecture, and that branches are resolved in the decode stage. (Submit handwritten answer on sheet in a PDF, each pdf for each Question as well as ROUGH WORK) Consider again the first MIPS code segment, which performs one load per iteration: Ox108: 0x10C: 0x110: 0x114: 0x118: Ox11c: loop: LW 14,0 (13) MUL r2, r2, 14 ADDI 13, 13, 4 ADDI ri, r1, -1 BNEZ ri, loop SUB 0x41 we ns2 addr rdata rdih WS wdrdz GPRS Memory we addr rdata Memory wdata Imm Ext 1-Fetch Decode, Reg. Fetch Execute Memory Write-back (F) (D) (E) (M) (W) a) Assuming we stall for bank conflicts, what is the average number of clock cycles needed per iteration during the loop's steady-state execution? Does it depend upon the initial value of r3, the address of the first element of the array in memory? (04) b) Reorder the instructions in the code segment to reduce stalls without changing the functionality of the code, while still performing only one load per iteration. Indicate how many fewer stalls per iteration your revised loop incurs on the 5-stage pipeline? (06)

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!