Assume you are redesigning a hardware prefetcher for the unblocked matrix transposition code above. The simplest type

Question:

Assume you are redesigning a hardware prefetcher for the unblocked matrix transposition code above. The simplest type of hardware prefetcher only prefetches sequential cache blocks after a miss. More complicated "nonunit stride" hardware prefetchers can analyze a miss reference stream, and detect and prefetch nonunit strides. In contrast, software prefetching can determine nonunit strides as easily as it can determine unit strides. Assume prefetches write directly into the cache and no pollution (overwriting data that needs to be used before the data that is prefetched). In the steady state of the inner loop, what is the performance (in cycles per iteration) when using an ideal nonunit stride prefetcher?
Fantastic news! We've Found the answer you've been seeking!

Step by Step Answer:

Related Book For  book-img-for-question

Computer Architecture A Quantitative Approach

ISBN: 978-0123704900

4th edition

Authors: John L. Hennessy, David A. Patterson

Question Posted: