Question:

Assume you are redesigning a hardware prefetcher for the unblocked matrix transposition code as in Exercise 5.7. However, in this case we evaluate a simple two-stream sequential prefetcher. If level 2 access slots are available, this prefetcher fetches up to 4 sequential blocks after a miss and places them in a stream buffer. Stream buffers that have empty slots obtain access to the level 2 cache on a round-robin basis. On a level 1 miss, the stream buffer that has least recently supplied data on a miss is flushed and reused for the new miss stream.
a. In the steady state of the inner loop, what is the performance (in cycles per iteration) when using a simple two-stream sequential prefetcher, assuming performance is limited by prefetching?
b. What percentage of prefetches are useful given the level 2 cache parameters?
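
The prefetcher policy described in the question can be sketched as a small behavioral model, which may help when tracing the inner loop by hand. This is only an illustration under stated assumptions, not the reference solution: the class and method names (`StreamBuffer`, `TwoStreamPrefetcher`, `l1_miss`, `l2_slot`) are hypothetical, block addresses are plain integers, allocating a buffer is assumed not to count as "supplying data", and a buffer is assumed to keep refilling freed slots. The level 2 latency and bandwidth from Exercise 5.7 are not modeled; the caller decides when a level 2 slot is free.

```python
from collections import deque

DEPTH = 4  # blocks held per stream buffer, as stated in the question


class StreamBuffer:
    def __init__(self):
        self.blocks = deque()    # prefetched block addresses, oldest first
        self.next_block = None   # next sequential block to request from L2
        self.last_supplied = -1  # time this buffer last supplied data on a miss

    def restart(self, miss_block):
        # Flush the buffer and start a new stream after the missing block.
        self.blocks.clear()
        self.next_block = miss_block + 1


class TwoStreamPrefetcher:
    def __init__(self, num_streams=2):
        self.buffers = [StreamBuffer() for _ in range(num_streams)]
        self.rr = 0      # round-robin pointer for L2 access
        self.time = 0    # counts L1 misses, used only for LRU ordering
        self.issued = 0  # prefetches sent to L2
        self.useful = 0  # prefetched blocks that later served an L1 miss

    def l1_miss(self, block):
        """Handle an L1 miss; return True if a stream buffer supplied it."""
        self.time += 1
        for buf in self.buffers:
            if block in buf.blocks:
                # Drop entries up to and including the requested block.
                while buf.blocks.popleft() != block:
                    pass
                buf.last_supplied = self.time
                self.useful += 1
                return True
        # No buffer hit: flush the one that least recently supplied data.
        # Assumption: allocation itself does not update last_supplied.
        victim = min(self.buffers, key=lambda b: b.last_supplied)
        victim.restart(block)
        return False

    def l2_slot(self):
        """Spend one free L2 access slot; return True if a prefetch was issued."""
        n = len(self.buffers)
        for i in range(n):
            idx = (self.rr + i) % n
            buf = self.buffers[idx]
            if buf.next_block is not None and len(buf.blocks) < DEPTH:
                buf.blocks.append(buf.next_block)
                buf.next_block += 1
                self.issued += 1
                self.rr = (idx + 1) % n  # next buffer gets priority next time
                return True
        return False
```

To use the sketch, feed it the sequence of level 1 miss block addresses produced by the transposition loop, calling `l2_slot()` between misses as often as the level 2 parameters of Exercise 5.7 allow; the ratio `useful / issued` then gives the fraction of useful prefetches for that trace under these assumptions.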