

Question: 3. [Matrix Multiplication] Matrix multiplication is a key operation supported in hardware by AI/DNN accelerator DSAs such as Google TPU

3. [Matrix Multiplication] Matrix multiplication is a key operation supported in hardware by AI/DNN accelerator DSAs such as Google TPU and Tesla Dojo. So, it's worth analyzing the matrix multiplication calculation itself. One common way to depict matrix multiplication is with the following triply nested loop:

    float a[M][K], b[K][N], c[M][N];
    // M, N, and K are constants.
    for (int i = 0; i < M; ++i)
      for (int j = 0; j < N; ++j)
        for (int k = 0; k < K; ++k)
          c[i][j] += a[i][k] * b[k][j];

a) Suppose that M = 3, N = 4, and K = 5, so that the dimensions are relatively prime. Write out the order of accesses to memory locations in each of the three matrices A, B, and C (you might start with two-dimensional indices, then translate those to memory addresses or offsets from the start of each matrix). For which matrices are the elements accessed sequentially? Which are not? Assume row-major (C-language) memory ordering.
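One way to explore part (a) is to record the row-major offsets that the original loop touches. The following is a minimal sketch, not a reference solution; the helper names `offset` and `trace_original` are hypothetical and only the loop structure comes from the question.

```c
#include <stddef.h>

enum { M = 3, N = 4, K = 5 };  /* dimensions given in part (a) */

/* Row-major offset of element [row][col] in a matrix with `cols` columns. */
static size_t offset(size_t row, size_t col, size_t cols) {
    return row * cols + col;
}

/* Record the offsets touched in A, B, and C by the original i-j-k loop.
   Each output array must hold M*N*K entries. Returns the access count. */
size_t trace_original(size_t *a_off, size_t *b_off, size_t *c_off) {
    size_t n = 0;
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < N; ++j)
            for (size_t k = 0; k < K; ++k, ++n) {
                a_off[n] = offset(i, k, K);  /* a[i][k]: stride 1 in k */
                b_off[n] = offset(k, j, N);  /* b[k][j]: stride N in k */
                c_off[n] = offset(i, j, N);  /* c[i][j]: fixed during the k loop */
            }
    return n;
}
```

For the first pass of the k loop (i = 0, j = 0) this records offsets 0, 1, 2, 3, 4 for A, offsets 0, 4, 8, 12, 16 for B, and offset 0 five times for C. So A is read sequentially within a row (each row is re-read N times), B is read with a stride of N elements, and C stays on one element per k loop before advancing to the next offset.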

b) Suppose that you transpose matrix B, swapping its indices so that they are B[N][K] instead. So, now the innermost statement of the loop looks like:

    c[i][j] += a[i][k] * b[j][k];

Now, for which matrices are the elements accessed sequentially?
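To make the effect of the transposition in part (b) concrete, here is a small sketch under stated assumptions: the function names `transpose_b` and `matmul_transposed` are hypothetical, and the caller is assumed to zero `c` first, since the loop accumulates with `+=` as in the question.

```c
enum { M = 3, N = 4, K = 5 };  /* same dimensions as part (a) */

/* Build the transposed copy bt[N][K] from b[K][N]. */
void transpose_b(float b[K][N], float bt[N][K]) {
    for (int k = 0; k < K; ++k)
        for (int j = 0; j < N; ++j)
            bt[j][k] = b[k][j];
}

/* Same i-j-k loop as the original, but reading the transposed matrix.
   In the k loop, a[i][k] and bt[j][k] both advance by one element,
   so both operands are read with stride 1. */
void matmul_transposed(float a[M][K], float bt[N][K], float c[M][N]) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < K; ++k)
                c[i][j] += a[i][k] * bt[j][k];
}
```

With this layout the offset of `bt[j][k]` is `j * K + k`, so the inner loop walks a contiguous row of the transposed matrix instead of striding by N elements.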

c) The innermost (k-indexed) loop of our original routine performs a dot-product operation. Suppose that you are given a hardware unit that can perform an 8-element dot-product more efficiently than the raw C code, behaving effectively like this C function:

    void hardware_dot(float *accumulator, const float *a_slice, const float *b_slice) {
      float total = 0.;
      for (int k = 0; k < 8; ++k) {
        total += a_slice[k] * b_slice[k];
      }
      *accumulator += total;
    }

How would you rewrite the routine with the transposed B matrix from part (b) to use this function?
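One possible shape for part (c) is sketched below. It assumes K is a multiple of 8 (K = 16 is chosen here purely for illustration, it is not given in the question), uses the transposed B layout from part (b), and includes a software stand-in for `hardware_dot` copied from the question. The name `matmul_dot` is hypothetical.

```c
enum { M = 3, N = 4, K = 16 };  /* assumption: K is a multiple of 8 */

/* Software stand-in for the 8-element dot-product unit from the question. */
void hardware_dot(float *accumulator, const float *a_slice, const float *b_slice) {
    float total = 0.f;
    for (int k = 0; k < 8; ++k) {
        total += a_slice[k] * b_slice[k];
    }
    *accumulator += total;
}

/* The k loop now advances 8 elements at a time. Each call consumes 8
   contiguous elements from row i of a and from row j of the transposed
   matrix bt, and accumulates the partial dot product into c[i][j]. */
void matmul_dot(float a[M][K], float bt[N][K], float c[M][N]) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < K; k += 8)
                hardware_dot(&c[i][j], &a[i][k], &bt[j][k]);
}
```

The transposed layout matters here because the unit takes two pointers to contiguous slices; with the original B[K][N] layout the eight values b[k][j] through b[k+7][j] would be N elements apart. If K were not a multiple of 8, the tail would need separate handling, for example by zero-padding both operands.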

d) Suppose that instead, you are given a hardware unit that performs an 8-element saxpy operation, which behaves like this C function:

    void hardware_saxpy(float *accumulator, float a, const float *input) {
      for (int k = 0; k < 8; ++k) {
        accumulator[k] += a * input[k];
      }
    }

Write another routine that uses the saxpy primitive to deliver equivalent results to the original loop, without the transposed memory ordering for the B matrix.
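For part (d), one approach is to reorder the loops to i-k-j so that each saxpy call scales 8 contiguous elements of a row of B by a single element of A and adds them into 8 contiguous elements of a row of C. The sketch below assumes N is a multiple of 8 (N = 16 is an illustrative choice, not taken from the question) and includes a software stand-in for `hardware_saxpy`; the name `matmul_saxpy` is hypothetical.

```c
enum { M = 3, N = 16, K = 5 };  /* assumption: N is a multiple of 8 */

/* Software stand-in for the 8-element saxpy unit from the question. */
void hardware_saxpy(float *accumulator, float a, const float *input) {
    for (int k = 0; k < 8; ++k) {
        accumulator[k] += a * input[k];
    }
}

/* Loop order i-k-j: for a fixed a[i][k], update c[i][j..j+7] from
   b[k][j..j+7]. B keeps its original b[K][N] row-major layout, and both
   the B slice and the C slice are contiguous in memory. */
void matmul_saxpy(float a[M][K], float b[K][N], float c[M][N]) {
    for (int i = 0; i < M; ++i)
        for (int k = 0; k < K; ++k)
            for (int j = 0; j < N; j += 8)
                hardware_saxpy(&c[i][j], a[i][k], &b[k][j]);
}
```

Each c[i][j] still receives its K products in increasing k order, the same order as the original loop, so the results should match the original routine. As with the other sketches, the caller is assumed to zero `c` beforehand.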

