Question: We want to study several instruction-level parallelism techniques. We are given the following benchmark program, assuming R1 is initialized to 0, and R6, R7, R8, R9 and F10 contain constant nonzero values:
Loop: LD F12, 0(R6)
DIVD F14, F12, F10
LD F16, 0(R7)
ADDD F16, F14, F16
LD F17, 0(R8)
MULTD F18, F17, F16
SD 0(R9), F18
ADDI R6, R6, #4
ADDI R7, R7, #4
ADDI R8, R8, #4
ADDI R9, R9, #4
ADDI R1, R1, #1
SUBI R2, R1, #1000
BNEQZ R2, Loop
Assuming a single scalar architecture, the available hardware resources and their respective latencies are given below:
| FU type | # FUs | # EX cycles |
|---|---|---|
| integer | 2 | 1 |
| branch | 1 | 1 |
| load | 3 | 2 |
| store | 2 | 2 |
| FP adder | 2 | 7 |
| FP multiplier | 1 | 5 |
| FP divider | 1 | 24 |
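As a rough illustration of how the table feeds into the analysis, the following Python sketch encodes the execution latencies and walks the read-after-write (RAW) dependence chain of one loop iteration. The names `LATENCY`, `BODY` and `dataflow_finish_times` are hypothetical. The model assumes ideal dataflow: it counts execution cycles only and ignores issue, write-back and functional-unit contention, so it gives a lower bound on one iteration, not an actual schedule.

```python
# Execution latency (cycles) per functional-unit type, copied from the table.
LATENCY = {"integer": 1, "branch": 1, "load": 2, "store": 2,
           "fp_add": 7, "fp_mul": 5, "fp_div": 24}

# One loop iteration: (opcode, FU type, destination, source registers).
# Stores and branches write no register, so their destination is None.
BODY = [
    ("LD",    "load",    "F12", ["R6"]),
    ("DIVD",  "fp_div",  "F14", ["F12", "F10"]),
    ("LD",    "load",    "F16", ["R7"]),
    ("ADDD",  "fp_add",  "F16", ["F14", "F16"]),
    ("LD",    "load",    "F17", ["R8"]),
    ("MULTD", "fp_mul",  "F18", ["F17", "F16"]),
    ("SD",    "store",   None,  ["R9", "F18"]),
    ("ADDI",  "integer", "R6",  ["R6"]),
    ("ADDI",  "integer", "R7",  ["R7"]),
    ("ADDI",  "integer", "R8",  ["R8"]),
    ("ADDI",  "integer", "R9",  ["R9"]),
    ("ADDI",  "integer", "R1",  ["R1"]),
    ("SUBI",  "integer", "R2",  ["R1"]),
    ("BNEQZ", "branch",  None,  ["R2"]),
]

def dataflow_finish_times(body, latency):
    """Return (opcode, finish cycle) pairs, limited only by RAW dependences."""
    ready = {}    # register -> cycle at which its most recent value is available
    finish = []
    for op, fu, dest, srcs in body:
        # An instruction may start once all of its source values exist.
        start = max((ready.get(s, 0) for s in srcs), default=0)
        done = start + latency[fu]
        finish.append((op, done))
        if dest is not None:
            ready[dest] = done
    return finish

if __name__ == "__main__":
    for op, done in dataflow_finish_times(BODY, LATENCY):
        print(f"{op:6s} finishes at cycle {done}")
```

Under these assumptions the longest chain is LD, DIVD, ADDD, MULTD, SD (2 + 24 + 7 + 5 + 2 = 40 cycles), and the single 24-cycle divider dominates it.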
a) Draw the hardware organization needed to implement dynamic scheduling with the Tomasulo algorithm. Do you expect an improved execution time compared to T1, T2 and T3? (Hint: do not perform any computations; answer from a theoretical point of view.)
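The organization usually drawn for Tomasulo's algorithm has an instruction queue, reservation stations in front of each functional-unit group, load and store buffers, a register file with a per-register status tag, and a common data bus (CDB) that broadcasts results. The Python sketch below models only the issue step and a CDB broadcast, to show how tags replace register names. The class and field names (`ReservationStation`, `Vj`, `Qj`, `reg_status`) follow common textbook notation but are assumptions here, as are the station names and register values; the question does not specify the number of reservation stations per unit.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    Vj: Optional[float] = None   # operand values, once available
    Vk: Optional[float] = None
    Qj: Optional[str] = None     # tags of the stations that will produce them
    Qk: Optional[str] = None

class Tomasulo:
    def __init__(self, station_names, regs):
        self.stations = {n: ReservationStation(n) for n in station_names}
        self.regs = dict(regs)   # register -> current value
        self.reg_status = {}     # register -> tag of the pending producer

    def _read(self, reg):
        """Return (value, None) if the register is ready, else (None, tag)."""
        tag = self.reg_status.get(reg)
        return (None, tag) if tag else (self.regs[reg], None)

    def issue(self, station, op, dest, src_j, src_k):
        rs = self.stations[station]
        if rs.busy:
            return False         # structural hazard: the instruction stalls
        rs.busy, rs.op = True, op
        # Read sources before renaming dest, so "ADDD F16, F14, F16" sees old F16.
        rs.Vj, rs.Qj = self._read(src_j)
        rs.Vk, rs.Qk = self._read(src_k)
        self.reg_status[dest] = station
        return True

    def broadcast(self, tag, value):
        """CDB: deliver a result to every waiting station and register."""
        for rs in self.stations.values():
            if rs.Qj == tag:
                rs.Vj, rs.Qj = value, None
            if rs.Qk == tag:
                rs.Vk, rs.Qk = value, None
        for reg, pending in list(self.reg_status.items()):
            if pending == tag:
                self.regs[reg] = value
                del self.reg_status[reg]
        self.stations[tag] = ReservationStation(tag)   # free the station

if __name__ == "__main__":
    cpu = Tomasulo(["Div1", "Add1"], {"F10": 2.0, "F12": 8.0, "F14": 0.0, "F16": 1.0})
    cpu.issue("Div1", "DIVD", "F14", "F12", "F10")
    cpu.issue("Add1", "ADDD", "F16", "F14", "F16")   # waits on tag Div1
    cpu.broadcast("Div1", 8.0 / 2.0)
    print(cpu.stations["Add1"])
```

In this sketch the ADDD issues while the DIVD is still pending and simply records the tag `Div1`; later instructions that do not depend on the divide are free to issue and execute in the meantime, which is the property the theoretical comparison in part a) turns on.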
