Consider the following code, which sums the elements of a product of two matrices:

    register int i, k;                      /* i, k are in the processor registers */
    register float sum, a[8][8], b[8][8];

    for (i = 0; i < 8; i++) {               /* 1 */
        for (k = 0; k < 8; k++) {           /* 2 */
            sum += a[i][k] * b[k][i];       /* 3 */
        }
    }

Assume the following:

- There is a perfect instruction cache; i.e., do not worry about the time for any instruction accesses.
- Both int and float are of size 4 bytes.
- Assume that only the accesses to the arrays a and b generate accesses to the data cache. The rest of the variables are all allocated in registers.
- Assume a fully associative, LRU data cache with 8 lines, where each line is 32 bytes.
- Initially, the data cache is empty.
- The arrays a and b are stored in row order.
- To keep things simple, we will assume that statements in the above code are executed sequentially. Lines (1) and (2) take 10 cycles for each invocation. Line (3) takes 10 cycles plus an additional 20 cycles per data cache miss to wait for the data. That is, if both array accesses in line (3) miss, it takes a total of 50 cycles.
- Assume that the arrays a and b both start at cache line boundaries.

(a) How many accesses to arrays a and b will result in cache misses? Explain your answer.

(b) Now assume there is a data prefetch instruction with the format prefetch(array[index1][index2]). This prefetches the entire block containing the word array[index1][index2] into the data cache. It takes 1 cycle for the processor to execute this instruction and send it to the data cache. The processor can then go ahead and execute subsequent instructions. If the prefetched data is not in the cache, it takes 20 cycles for the data to get loaded into the cache. Add prefetch instructions to minimize the execution time. Do not transform the code in any other way. How many cache misses occur for accessing a and b at line (3) in your modified code?

(Hint: since lines 1, 2, and 3 each take 10 cycles when there is no cache miss, you can consider using them to hide the 20-cycle latency for the data to get loaded into the cache when prefetching. If you insert the prefetch instructions appropriately, the cache misses can be totally eliminated.)
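One way to check an answer to part (a) is to replay the access stream through a small model of the cache. The sketch below is only that: the names `touch`, `line_of`, and `count_misses` are made up for illustration, array a is assumed to start at byte address 0 with b placed directly after it, and a[i][k] is assumed to be read before b[k][i] within line (3).

```c
#include <assert.h>
#include <string.h>

enum { N = 8, CACHE_LINES = 8, LINE_BYTES = 32, ELEM_BYTES = 4 };

/* Fully associative LRU cache: tags[0] is the most recently used line. */
static int tags[CACHE_LINES];
static int used;                  /* number of valid lines */

/* Touch one memory line; return 1 on a miss, 0 on a hit. */
static int touch(int line)
{
    assert(line >= 0);
    int pos = used;               /* "not found" sentinel */
    for (int j = 0; j < used; j++)
        if (tags[j] == line) { pos = j; break; }

    int miss = (pos == used);
    if (miss) {
        if (used < CACHE_LINES)
            used++;               /* fill an empty line */
        pos = used - 1;           /* otherwise overwrite the LRU slot */
    }
    /* Slide the more recent entries down, then put this line at MRU. */
    memmove(&tags[1], &tags[0], (size_t)pos * sizeof tags[0]);
    tags[0] = line;
    return miss;
}

/* Cache line number of element [row][col] of an array starting at base. */
static int line_of(int base, int row, int col)
{
    return (base + (row * N + col) * ELEM_BYTES) / LINE_BYTES;
}

/* Replay the loop nest and count data cache misses on a and b. */
int count_misses(void)
{
    const int a_base = 0;                    /* assumed placement */
    const int b_base = N * N * ELEM_BYTES;   /* b follows a, line aligned */
    int misses = 0;

    used = 0;                                /* cache starts empty */
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            misses += touch(line_of(a_base, i, k));  /* a[i][k] */
            misses += touch(line_of(b_base, k, i));  /* b[k][i] */
        }
    return misses;
}
```

Under these assumptions `count_misses()` returns 72. Each row of a or b is exactly one 32-byte line, so every outer iteration touches one line of a and eight lines of b; nine distinct lines cycling through an 8-line LRU cache means the single a line misses once and every b access misses.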


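For part (b), a candidate prefetch placement can be tested with a timing model before trusting it. The sketch below tries one possible placement and is not presented as the intended solution: it prefetches a[0][0] and b[0][0] before the loops, prefetches row (k + 1) % 8 of b just before line (3), and prefetches row (i + 1) % 8 of a after the inner loop. The wraparound with `%`, the helper names, the choice to let a prefetch hit refresh the LRU order, and the choice not to charge the final failing loop tests are all assumptions of this sketch.

```c
#include <assert.h>

enum { N = 8, CACHE_LINES = 8, STMT_CYCLES = 10,
       MISS_PENALTY = 20, PREFETCH_CYCLES = 1 };

struct line { int tag; long stamp; long ready; };

static struct line cache[CACHE_LINES];
static long now;        /* current cycle */
static long lru_clock;  /* increases on every touch, used for LRU order */
static int misses;      /* line (3) accesses that had to wait for data */

/* Return the line holding tag, filling the LRU victim if it is absent. */
static struct line *find_or_fill(int tag, long ready_if_filled)
{
    assert(tag >= 0);
    struct line *victim = &cache[0];
    for (int j = 0; j < CACHE_LINES; j++) {
        if (cache[j].tag == tag) {
            cache[j].stamp = ++lru_clock;
            return &cache[j];
        }
        if (cache[j].stamp < victim->stamp)
            victim = &cache[j];
    }
    victim->tag = tag;
    victim->stamp = ++lru_clock;
    victim->ready = ready_if_filled;
    return victim;
}

/* One cycle to issue; the data arrives 20 cycles later if it was absent. */
static void prefetch(int tag)
{
    now += PREFETCH_CYCLES;
    find_or_fill(tag, now + MISS_PENALTY);
}

/* Demand access at line (3); returns the stall in cycles (0 on a hit). */
static long demand(int tag)
{
    struct line *l = find_or_fill(tag, now + MISS_PENALTY);
    if (l->ready > now) {
        misses++;
        return l->ready - now;
    }
    return 0;
}

static int row_a(int i) { return i; }      /* one cache line per row of a */
static int row_b(int k) { return N + k; }  /* rows of b follow those of a */

/* Run the loop nest; returns total cycles and stores the miss count. */
long run(int use_prefetch, int *miss_count)
{
    for (int j = 0; j < CACHE_LINES; j++)
        cache[j] = (struct line){ -1, 0, 0 };   /* empty cache */
    now = 0; lru_clock = 0; misses = 0;

    if (use_prefetch) { prefetch(row_a(0)); prefetch(row_b(0)); }
    for (int i = 0; i < N; i++) {
        now += STMT_CYCLES;                           /* line (1) */
        for (int k = 0; k < N; k++) {
            now += STMT_CYCLES;                       /* line (2) */
            if (use_prefetch) prefetch(row_b((k + 1) % N));
            now += demand(row_a(i));                  /* a[i][k] */
            now += demand(row_b(k));                  /* b[k][i] */
            now += STMT_CYCLES;                       /* line (3) */
        }
        if (use_prefetch) prefetch(row_a((i + 1) % N));
    }
    *miss_count = misses;
    return now;
}
```

Calling `run(0, &m)` reproduces the 72 misses of the unmodified code, while `run(1, &m)` reports 0 misses for this placement. The reason is that each prefetch is followed by at least one execution of line (3) and one of line (2), or by lines (1) and (2), before its data is needed, which covers the 20-cycle load.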

