Question: onsider the following code, which sums the elements of a product of two matrices: register int i , k; / * i , k are

onsider the following code, which sums the elements of a product of two matrices:
register int i, k; /* i, k are in the processor registers */
register float sum, a[8][8], b[8][8];
for(i =0; i <8; i++){/*1*/
for(k =0; k <8; k++){/*2*/
sum += a[i][k]* b[k][i]; /*3*/
}
}
Assume the following:
- There is a perfect instruction cache; i.e., do not worry about the time for any instruction accesses.
- Both int and float are of size 4 bytes.
- Assume that only the accesses to the arrays a and b generate accesses to the data cache. The rest of
the variables are all allocated in registers.
- Assume a fully associative, LRU data cache with 8 lines, where each line is 32 bytes.
- Initially, the data cache is empty.
- The arrays a and b are stored in row order.
- To keep things simple, we will assume that statements in the above code are executed sequentially.
Lines (1), and (2) take 10 cycles for each invocation. Line (3) takes 10 cycles plus an additional 20
cycles per data cache miss to wait for the data. That is, if both array accesses in line (3) miss, it takes
a total of 50 cycles.
- Assume that the arrays a and b both start at cache line boundaries.
(a) How many accesses to arrays a and b will result in cache misses? Explain your answer.
(b)
Now assume there is a data prefetch instruction with the format prefetch (array[index1][index2]). This
prefetches the entire block containing the word array[index1][index2] into the data cache.
It takes 1 cycle for the processor to execute this instruction and send it to the data cache. The processor
can then go ahead and execute subsequent instructions. If the prefetched data is not in the cache, it takes
20 cycles for the data to get loaded into the cache. Add prefetch instructions to minimize the execution
time. Do not transform the code in any other way. How many cache misses for accessing a and b at line
(3) in your modified code?
(Hint: since line 1,2, and 3 each takes 10 cycles when there is no cache miss, you can consider using
them to hide the 20-cycle latency for the data to get loaded into the cache when prefetching. If you insert
the prefetch instructions appropriately, the cache misses can be totally eliminated.

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Programming Questions!