
Question:

Let us now consider less favorable scenarios for extraction of instruction-level parallelism by a run-time hardware scheduler in the hash table code in Figure 3.14 (the general case). Suppose that there is no longer a guarantee that each bucket will receive exactly one item. Let us reevaluate our assessment of the parallelism available, given the more realistic situation, which adds some additional, important dependences.
Recall that in the ideal case, the relatively serial inner loop was not in play, and the outer loop provided ample parallelism. In the general case, the inner while loop is in play: it can iterate one or more times. Keep in mind that the inner while loop has only a limited amount of instruction-level parallelism. First, each iteration of the while loop depends on the result of the previous iteration. Second, within each iteration, only a small number of instructions are executed.
The outer loop, by contrast, is quite parallel. As long as two elements are hashed into different buckets, they can be inserted in parallel. Even when they hash to the same bucket, they can still proceed in parallel as long as some form of memory disambiguation enforces the correctness of the loads and stores performed on behalf of each element.
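Figure 3.14 itself is not reproduced here. To make the loop structure concrete, the following is a minimal C sketch of the kind of chained hash-table insertion the question describes; the overall shape and the names ptrUpdate and ptrCurr follow the question's wording, but the exact statements, the hash function, and the array names are assumptions rather than the textbook's code.

    #include <stdlib.h>

    #define NUM_ELEMENTS 1024
    #define NUM_BUCKETS  1024

    typedef struct node {
        int          value;
        struct node *next;
    } node_t;

    node_t *bucket[NUM_BUCKETS];   /* head pointer of each bucket's chain */
    int     element[NUM_ELEMENTS]; /* data elements to be inserted */

    void insert_all(void)
    {
        for (int i = 0; i < NUM_ELEMENTS; i++) {            /* outer for loop */
            int      index     = element[i] % NUM_BUCKETS;  /* assumed hash */
            node_t **ptrUpdate = &bucket[index];
            node_t  *ptrCurr   = bucket[index];

            /* Inner while loop: walk the bucket's chain looking for the
               insertion point.  Each iteration loads ptrCurr->next, which
               depends on the pointer produced by the previous iteration --
               the serial dependence described above. */
            while (ptrCurr != NULL && ptrCurr->value < element[i]) {
                ptrUpdate = &ptrCurr->next;
                ptrCurr   = ptrCurr->next;
            }

            /* Link in the new node.  The store through ptrUpdate is what a
               run-time disambiguation mechanism must order against loads
               through ptrCurr issued for later elements. */
            node_t *newNode = malloc(sizeof *newNode);
            newNode->value  = element[i];
            newNode->next   = ptrCurr;
            *ptrUpdate      = newNode;
        }
    }

In this sketch, when a bucket is empty the while loop body is skipped entirely; when the chain is non-empty, each additional node adds one serial iteration, which is exactly the intermediate behavior the following questions explore.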
In reality, the data element values will likely be randomly distributed. Although we aim to give the reader insight into more realistic execution scenarios, we will begin with some regular but non-ideal data value patterns that are amenable to systematic analysis. These patterns offer intermediate steps toward understanding the amount of instruction-level parallelism under the most general case of random data values.
a. Draw a dynamic dependence graph for the hash table code in Figure 3.14 when the values of the 1024 data elements to be inserted are 0, 1, 1024, 1025, 2048, 2049, 3072, 3073, . . . . Describe the new dependences across iterations of the for loop when the while loop iterates one or more times. Pay special attention to the fact that the inner while loop can now iterate one or more times; the number of instructions in an outer for loop iteration will therefore likely vary from iteration to iteration. For the purpose of determining dependences between loads and stores, assume a dynamic memory disambiguation mechanism that cannot resolve dependences between two memory accesses based on different base pointer registers. For example, the run-time hardware cannot disambiguate between a store based on ptrUpdate and a load based on ptrCurr. (A small sketch illustrating this value pattern under an assumed modulo hash appears after part (i).)
b. Assuming the dynamic dependence graph you derived in part (a), how many instructions will be executed?
c. Assuming the dynamic dependence graph you derived in part (a) and an unlimited amount of hardware resources, how many clock cycles will it take to execute all the instructions you calculated in part (b)?
d. How much instruction-level parallelism is available in the dynamic dependence graph you derived in part (a)?
e. Using the same assumptions about the run-time memory disambiguation mechanism as in part (a), identify a sequence of data elements that causes the worst-case scenario for how these new dependences limit the parallelism available.
f. Now assume the worst-case sequence from part (e) and explain the potential effect of a perfect run-time memory disambiguation mechanism (i.e., one that tracks all outstanding stores and allows all non-conflicting loads to proceed). Derive the number of clock cycles required to execute all the instructions in the dynamic dependence graph. On the basis of what you have learned so far, consider a couple of qualitative questions: What is the effect of allowing loads to issue speculatively, before prior store addresses are known? How does such speculation affect the significance of memory latency in this code?
g. Continuing with the same assumptions as in part (f), calculate the number of instructions executed.
h. Continuing with the same assumptions as in part (f), calculate the amount of instruction-level parallelism available to the run-time hardware.
i. In part (h), what is the effect of limited instruction window sizes on the level of instruction-level parallelism?
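As a small aid for part (a), the following sketch shows how the stated value pattern maps onto buckets if one assumes 1024 buckets and a simple modulo hash (the actual hash function in Figure 3.14 may differ). It only illustrates the mapping of the pattern, not the dependence analysis the part asks for.

    #include <stdio.h>

    #define NUM_BUCKETS  1024
    #define NUM_ELEMENTS 1024

    int main(void)
    {
        /* The part (a) sequence: 0, 1, 1024, 1025, 2048, 2049, 3072, 3073, ... */
        for (int i = 0; i < NUM_ELEMENTS; i++) {
            int value = (i / 2) * 1024 + (i % 2);
            int index = value % NUM_BUCKETS;       /* assumed modulo hash */
            if (i < 8)
                printf("element %4d -> bucket %d\n", value, index);
        }
        /* Under this assumed hash, only a couple of buckets ever receive
           elements, so their chains grow as insertions proceed and the
           inner while loop iterates more times for later elements. */
        return 0;
    }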

Related Book:

Computer Architecture: A Quantitative Approach, 4th edition, by John L. Hennessy and David A. Patterson. ISBN: 978-0123704900.
