Question: 4. (12 pts). Matrix Multiplication Cache Performance R10 running sum R12 A element value R3 A element address R4 A row base (fixed in inner


4. (12 pts). Matrix Multiplication Cache Performance R10 running sum R12 A element value R3 A element address R4 A row base (fixed in inner loop) R5 A column offset (variable in inner loop) R16 B element value R7 B element address R8 Brow base (variable in inner loop) R9 B column offset (fixed in inner loop) R0 = 0; A_base, B_base, C_base = base addresses of three matrices ; main: Outer2: addi addi addi Outerl: addi Inner: sub addi addi addi addi add add R4, RO, 400 R4, R4, -40 R9, RO, 40 R9, R9, -4 R10, R10, R10 R5, RO, 40 R8, RO, 400 R5, R5, -4 R8, R8, -40 R3, R4, R5 R7, R8, R9 R12, A base (R3) R16, B_base (R7) R12, R12, R16 R10, R10, R12 R5, Inner R3, R4, R9 C_base (R3), R10 R9, Outeri R4, Outer2 ; set to last column, R4 = R0 + 400 ; decrement A's row base ; set to last element of row ; decrement B's column offset ; clear sum i set to last element of row ; set to last column ; decrement A's column offset ; decrement B's row base ; form A's element address, R3 = R4 + R5 ; form B's element address ; load A's element into R12, R3 is offset ; load B's element into R16, R7 is offset ; compute product ; sum products ; loop across all elements ; compute result address ; store result (R10) to c matrix, R3=offset ; loop across all B's columns ; loop across all A's rows lw mul add bnez add SW bnez bnez Consider the 10 by 10 matrix multiplication algorithm used in the MIPS code above. Two 10 by 10 matrices A and B are multiplied leaving the result in C. The system running the application employs a data cache which is initially empty. The data cache has the following properties: cache organization the number of sets ..........1 The number of lines/set ..... 1024 The number of words/line ....4 T cache ..................... ..... 20 ns Tumain ...................... 100 ns per line write update policy ......... copy-back write allocation policy ..... insert in cache replacement policy .......... (a) For this application, what is the expected number of misses of each type? compulsory: capacity: conflict: (b) What is the hit rate of the cache? (show work) (c) What is the average data access time (T_effective) in ns? (show work) (d) If the cycle time of a pipelined system is 20 ns, how many stalls are introduced for each miss? (show work) 4. (12 pts). Matrix Multiplication Cache Performance R10 running sum R12 A element value R3 A element address R4 A row base (fixed in inner loop) R5 A column offset (variable in inner loop) R16 B element value R7 B element address R8 Brow base (variable in inner loop) R9 B column offset (fixed in inner loop) R0 = 0; A_base, B_base, C_base = base addresses of three matrices ; main: Outer2: addi addi addi Outerl: addi Inner: sub addi addi addi addi add add R4, RO, 400 R4, R4, -40 R9, RO, 40 R9, R9, -4 R10, R10, R10 R5, RO, 40 R8, RO, 400 R5, R5, -4 R8, R8, -40 R3, R4, R5 R7, R8, R9 R12, A base (R3) R16, B_base (R7) R12, R12, R16 R10, R10, R12 R5, Inner R3, R4, R9 C_base (R3), R10 R9, Outeri R4, Outer2 ; set to last column, R4 = R0 + 400 ; decrement A's row base ; set to last element of row ; decrement B's column offset ; clear sum i set to last element of row ; set to last column ; decrement A's column offset ; decrement B's row base ; form A's element address, R3 = R4 + R5 ; form B's element address ; load A's element into R12, R3 is offset ; load B's element into R16, R7 is offset ; compute product ; sum products ; loop across all elements ; compute result address ; store result (R10) to c matrix, R3=offset ; loop across all B's columns ; loop across all A's rows lw mul add bnez add SW bnez bnez Consider the 10 by 10 matrix multiplication algorithm used in the MIPS code above. Two 10 by 10 matrices A and B are multiplied leaving the result in C. The system running the application employs a data cache which is initially empty. The data cache has the following properties: cache organization the number of sets ..........1 The number of lines/set ..... 1024 The number of words/line ....4 T cache ..................... ..... 20 ns Tumain ...................... 100 ns per line write update policy ......... copy-back write allocation policy ..... insert in cache replacement policy .......... (a) For this application, what is the expected number of misses of each type? compulsory: capacity: conflict: (b) What is the hit rate of the cache? (show work) (c) What is the average data access time (T_effective) in ns? (show work) (d) If the cycle time of a pipelined system is 20 ns, how many stalls are introduced for each miss? (show work)
Step by Step Solution
There are 3 Steps involved in it
Get step-by-step solutions from verified subject matter experts
