Question: 3 . [ Matrix Multiplication ] Matrix multiplication is a key operation supported in hardware by AI / DNN accelerator DSAs such as Google TPU

3.[Matrix Multiplication] Matrix multiplication is a key operation supported in hardware by AI/DNN accelerator DSAs such as Google TPU and Tesla Dojo. So, its worth analyzing the matrix multiplication calculation itself. One common way to depict matrix multiplication is with the following triply nested loop:
float a[M][K], b[K][N], c[M][N];
// M, N, and K are constants.
for (int i =0; i < M; ++i)
for (int j =0; j < N; ++j)
for (int k =0; k < K; ++k)
c[i][j]+= a[i][k]* b[k][j];
a) Suppose that M=3, N=4, and K=5, so that each of the dimensions are relatively prime. Write out the order of accesses to memory locations in each of the three matrices A, B, and C (you might start with two-dimensional indices, then translate those to memory addresses or offsets from the start of each matrix). For which matrices are the elements accessed sequentially? Which are not? Assume row-major (C-language) memory ordering.
b) Suppose that you transpose matrix B, swapping its indices so that they are B[N][K] instead. So, now the innermost statement of the loop looks like:
c[i][j]+= a[i][k]* b[j][k];
Now, for which matrices are the elements accessed sequentially?
c) The innermost (k-indexed) loop of our original routine performs a dot-product operation. Suppose that you are a given a hardware unit that can perform an 8-element dot-product more efficiently than the raw C code, behaving effectively like this C function:
void hardware_dot(float *accumulator,
const float *a_slice, const float*b_slice){
float total =0.;
for (int k =0; k <8; ++k){
total += a_slice[k]* b_slice[k];
}
*accumulator += total;
}
How would you rewrite the routine with the transposed B matrix from part
(c) to use this function?
d) Suppose that instead, you are given a hardware unit that performs an 8-element saxpy operation, which behaves like this C function:
void hardware_saxpy(float *accumulator,
float a, const float *input){
for (int k =0; k <8; ++k){
accumulator[k]+= a * input[k];
}
}
Write another routine that uses the saxpy primitive to deliver equivalent results to the original loop, without the transposed memory ordering for the B matrix

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Programming Questions!