Question:
The transpose of a matrix interchanges its rows and columns; this is illustrated below:
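\[
\left[\begin{matrix}
A_{11} & A_{12} & A_{13} & A_{14} \\
A_{21} & A_{22} & A_{23} & A_{24} \\
A_{31} & A_{32} & A_{33} & A_{34} \\
A_{41} & A_{42} & A_{43} & A_{44}
\end{matrix}\right]
\Longrightarrow
\left[\begin{matrix}
A_{11} & A_{21} & A_{31} & A_{41} \\
A_{12} & A_{22} & A_{32} & A_{42} \\
A_{13} & A_{23} & A_{33} & A_{43} \\
A_{14} & A_{24} & A_{34} & A_{44}
\end{matrix}\right]
\]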
Here is a simple C loop to show the transpose:
for (i = 0; i < N; i++) {        /* N is a placeholder for the matrix dimension */
    for (j = 0; j < N; j++) {
        output[j][i] = input[i][j];
    }
}
Assume that both the input and output matrices are stored in row major order (row major order means that the row index changes fastest).
Assume that you are executing a × double-precision transpose on a processor with a KB fully associative (don't worry about cache conflicts; the cache has just one set), least recently used (LRU) replacement L1 data cache with byte blocks.
Assume that L1 cache misses or prefetches require cycles and always hit in the L2 cache, and that the L2 cache can process a request every two processor cycles.
Assume that each iteration of the inner loop above requires four cycles if the data are present in the L1 cache.
Assume that the cache has a write-allocate, fetch-on-write policy for write misses.
Unrealistically, assume that writing back dirty cache blocks requires cycles.
For the simple implementation given above, this execution order would be nonideal for the input matrix; however, applying a loop interchange optimization would create a nonideal order for the output matrix. Because loop interchange is not sufficient to improve its performance, it must be blocked instead.
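The blocking idea can be sketched as follows. This is a minimal illustration, not the textbook's reference solution: the function name `transpose_blocked`, the flat indexing `i * n + j`, and the `tile` parameter are all assumptions introduced here. The matrix is walked one `tile` x `tile` sub-block at a time, so the cache blocks touched in both matrices can be reused before they are evicted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: transpose an n x n matrix of doubles stored in one
   contiguous array, processing one tile x tile sub-block at a time. */
void transpose_blocked(size_t n, size_t tile,
                       const double *input, double *output)
{
    assert(tile > 0);  /* a zero tile size would never advance the outer loops */

    for (size_t ii = 0; ii < n; ii += tile) {
        for (size_t jj = 0; jj < n; jj += tile) {
            /* Clamp the tile at the matrix edge when tile does not divide n. */
            size_t i_end = (ii + tile < n) ? ii + tile : n;
            size_t j_end = (jj + tile < n) ? jj + tile : n;

            for (size_t i = ii; i < i_end; i++) {
                for (size_t j = jj; j < j_end; j++) {
                    output[j * n + i] = input[i * n + j];
                }
            }
        }
    }
}
```

A call such as `transpose_blocked(n, 8, input, output)` would use 8 x 8 tiles; the right tile size depends on how many matrix elements fit in one cache block.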
What should be the minimum size of the cache to take advantage of blocked execution?
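One way to reason about this, offered as a sketch rather than the verified answer: a tile pays off only if every cache block fetched for it is fully used before being evicted. If a cache block holds `line_bytes / elem_bytes` elements, a square tile of that width touches that many blocks in the input and the same number in the output, so the cache must hold twice that many blocks. The function name `min_cache_bytes` and its parameters are hypothetical, introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smallest cache (in bytes) that holds one square tile
   of the input and one of the output, where the tile width equals the number
   of elements per cache block. Assumes a fully associative LRU cache. */
size_t min_cache_bytes(size_t line_bytes, size_t elem_bytes)
{
    assert(elem_bytes > 0 && line_bytes % elem_bytes == 0);

    size_t elems_per_line = line_bytes / elem_bytes;  /* tile width in elements */
    size_t lines_needed = 2 * elems_per_line;         /* input tile + output tile */
    return lines_needed * line_bytes;
}
```

As an illustrative assumption only (not a value taken from the problem statement), `min_cache_bytes(64, 8)` models 64-byte blocks holding 8-byte doubles: 8 elements per block, 16 blocks in total, which comes to 1024 bytes.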
