Question:

For the simple implementation given above, this execution order would be nonideal for the input matrix. However, applying a loop interchange optimization would create a nonideal order for the output matrix. Because loop interchange is not sufficient to improve its performance, it must be blocked instead.
a. What block size should be used to completely fill the data cache with one input and output block?
b. How do the relative number of misses of the blocked and unblocked versions compare if the level 1 cache is direct mapped?
c. Write code to perform a transpose with a block size parameter B that uses B × B blocks.
Step by Step Answer:
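
For part (c), one possible sketch of a blocked transpose is shown below. It is an illustration under stated assumptions, not the textbook's reference solution: the element type `double`, the row-major layout, the out-of-place form, and the names `transpose_blocked` and `min_size` are all assumptions introduced here. The two outer loops step over B × B tiles, and the two inner loops copy one tile, so only one input tile and one output tile are being touched at a time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the smaller of two sizes. */
static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/*
 * Blocked out-of-place transpose: out[j][i] = in[i][j].
 * Assumptions (not from the exercise text): elements are double,
 * both matrices are stored row-major, in is rows x cols and
 * out is cols x rows. B is the tile edge length and must be > 0.
 */
void transpose_blocked(const double *in, double *out,
                       size_t rows, size_t cols, size_t B)
{
    assert(B > 0);
    for (size_t ii = 0; ii < rows; ii += B) {          /* tile row start */
        size_t i_end = min_size(ii + B, rows);         /* clip edge tiles */
        for (size_t jj = 0; jj < cols; jj += B) {      /* tile column start */
            size_t j_end = min_size(jj + B, cols);
            /* Copy one tile (at most B x B elements). */
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    out[j * rows + i] = in[i * cols + j];
        }
    }
}
```

For part (a), the same picture gives the sizing rule: one input tile plus one output tile hold 2 × B × B elements, so B is the largest value for which 2 × B × B × (element size in bytes) does not exceed the data cache capacity. The cache parameters come from the exercise setup referred to as "given above", which is not reproduced here.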

Related Book:

Computer Architecture A Quantitative Approach

ISBN: 978-0123704900

4th edition

Authors: John L. Hennessy, David A. Patterson
