This part of our case study will focus on the amount of instruction-level parallelism available to the

Question:

This part of our case study will focus on the amount of instruction-level parallelism available to the run time hardware scheduler under the most favorable execution scenarios (the ideal case). (Later, we will consider less ideal scenarios for the run time hardware scheduler as well as the amount of parallelism available to a compiler scheduler.) For the ideal scenario, assume that the hash table is initially empty. Suppose there are 1024 new data elements, whose values are numbered sequentially from 0 to 1023, so that each goes in its own bucket (this reduces the problem to a matter of updating known array locations!). Figure 3.15 shows the hash table contents after the first three elements have been inserted, according to this "ideal case." Since the value of element[i] is simply i in this ideal case, each element is inserted into its own bucket.
For the purposes of this case study, assume that each line of code in Figure 3.14 takes one execution cycle (its dependence height is 1) and, for the purposes of computing ILP, takes one instruction. These (unrealistic) assumptions are made to greatly simplify bookkeeping in solving the following exercises. And while statements execute on each iteration of their respective loops, to test if the loop should continue. In this ideal case, most of the dependences in the code sequence are relaxed and a high degree of ILP is therefore readily available. We will later examine a general case, in which the realistic dependences in the code segment reduce the amount of parallelism available.
Further suppose that the code is executed on an "ideal" processor with infinite issue width, unlimited renaming, "omniscient" knowledge of memory access disambiguation, branch prediction, and so on, so that the execution of instructions is limited only by data dependence. Consider the following in this context:
a. Describe the data (true, anti, and output) and control dependences that govern the parallelism of this code segment, as seen by a run time hardware scheduler. Indicate only the actual dependences (i.e., ignore dependences between stores and loads that access different addresses, even if a compiler or processor would not realistically determine this). Draw the dynamic dependence graph for six consecutive iterations of the outer loop (for insertion of six elements), under the ideal case. In this dynamic dependence graph, we are identifying data dependences between dynamic instances of instructions: each static instruction in the original program has multiple dynamic instances due to loop execution. The following definitions may help you find the dependences related to each instruction:
• Data true dependence: On the results of which previous instructions does each instruction immediately depend?
• Data anti dependence: Which instructions subsequently write locations read by the instruction?
• Data output dependence: Which instructions subsequently write locations written by the instruction?
• Control dependence: On what previous decisions does the execution of a particular instruction depend (in what case will it be reached)?
b. Assuming the ideal case just described, and using the dynamic dependence graph you just constructed, how many instructions are executed, and in how many cycles?
c. What is the average level of ILP available during the execution of the for loop?
d. In part (c) we considered the maximum parallelism achievable by a run-time hardware scheduler using the code as written. How could a compiler increase the available parallelism, assuming that the compiler knows that it is dealing with the ideal case. Think about what is the primary constraint that prevents executing more iterations at once in the ideal case. How can the loop be restructured to relax that constraint?
e. For simplicity, assume that only variables i, hash_index, ptrCurr, and ptrUpdate need to occupy registers. Assuming general renaming, how many registers are necessary to achieve the maximum achievable parallelism in part (b)?
f. Assume that in your answer to part (a) there are 7 instructions in each iteration. Now, assuming a consistent steady-state schedule of the instructions in the example and an issue rate of 3 instructions per cycle, how is execution time affected?
g. Finally, calculate the minimal instruction window size needed to achieve the maximal level of parallelism?
Fantastic news! We've Found the answer you've been seeking!

Step by Step Answer:

Related Book For  book-img-for-question

Computer Architecture A Quantitative Approach

ISBN: 978-0123704900

4th edition

Authors: John L. Hennessy, David A. Patterson

Question Posted: