Question: For the simple implementation given above, this execution order would be nonideal for the input matrix. However, applying a loop interchange optimization would create a nonideal order for the output matrix. Because loop interchange is not sufficient to improve its performance, it must be blocked instead.
a. What block size should be used to completely fill the data cache with one input and output block?
b. How does the relative number of misses compare between the blocked and unblocked versions if the level 1 cache is direct mapped?
c. Write code to perform a transpose with a block size parameter B that uses B × B blocks.
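
For part (c), one possible shape of a blocked transpose is sketched below. It is an illustration under stated assumptions, not the expert answer from this page: the matrices are assumed to be row-major arrays of 8-byte doubles, the function name transpose_blocked and its parameters are hypothetical, and the choice of B is left to the caller because the cache parameters are defined in the part of the exercise not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smaller of two sizes, used to clip edge blocks. */
static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/*
 * Transpose the rows x cols row-major matrix `in` into the
 * cols x rows row-major matrix `out`, one B x B block at a time.
 * Assumes `in` and `out` do not overlap and that B > 0.
 */
void transpose_blocked(const double *in, double *out,
                       size_t rows, size_t cols, size_t B)
{
    assert(B > 0);

    /* Outer two loops walk over the top-left corner of each block. */
    for (size_t ii = 0; ii < rows; ii += B) {
        for (size_t jj = 0; jj < cols; jj += B) {
            /* Clip the block when the dimensions are not multiples of B. */
            size_t i_end = min_size(ii + B, rows);
            size_t j_end = min_size(jj + B, cols);

            /* Inner two loops transpose a single block. */
            for (size_t i = ii; i < i_end; i++) {
                for (size_t j = jj; j < j_end; j++) {
                    out[j * rows + i] = in[i * cols + j];
                }
            }
        }
    }
}
```

Within one block the inner loops touch at most B rows of the input and B rows of the output, so both working sets can stay in the cache if B is chosen to fit. Clipping the block bounds keeps the sketch correct when the matrix dimensions are not multiples of B.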

Step by Step Solution

a. Each element is 8 bytes. The input and output blocks split th...