Assume you are redesigning a hardware prefetcher for the unblocked matrix transposition code as in Exercise 5.7.

Question:

Assume you are redesigning a hardware prefetcher for the unblocked matrix transposition code as in Exercise 5.7. However, in this case we evaluate a simple two-stream sequential prefetcher. If there are level 2 access slots available, this prefetcher will fetch up to 4 sequential blocks after a miss and place them in a stream buffer. Stream buffers that have empty slots obtain access to the level 2 cache on a round-robin basis. On a level 1 miss, the stream buffer that has least recently supplied data on a miss is flushed and reused for the new miss stream.
a. In the steady state of the inner loop, what is the performance (in cycles per iteration) when using a simple two-stream sequential prefetcher assuming performance is limited by prefetching?
b. What percentage of prefetches are useful given the level 2 cache parameters?
Fantastic news! We've Found the answer you've been seeking!

Step by Step Answer:

Related Book For  book-img-for-question

Computer Architecture A Quantitative Approach

ISBN: 978-0123704900

4th edition

Authors: John L. Hennessy, David A. Patterson

Question Posted: