Question: How would you solve the following problem? - - - - - - - Design a CUDA program to perform matrix multiplication C = A
How would you solve the following problem?
Design a CUDA program to perform matrix multiplication CA xx B The size of each
matrix is K xxK Each element is byte. The matrices are initially stored in the global
memory of GPU. The GPU has one streaming multiprocessor SM with K CUDA cores.
Each CUDA core runs at GHz and can perform one floating point operation in each clock
cycle. The peak bandwidth between the GPU and the global memory is GBs and the
cache of GPU has been disabled.
Assuming the GPU has an onchip buffer shared memory which can store xx xx
elements and has a peak access bandwidth of TBs
Write the pseudo code for an optimized CUDA program using block matrix mul
tiplication approach including both kernel function and host function.
Derive a lower bound on the computation time, the local data access time in SM
and the global data access time of your optimized kernel function. Explain.
Step by Step Solution
There are 3 Steps involved in it
1 Expert Approved Answer
Step: 1 Unlock
Question Has Been Solved by an Expert!
Get step-by-step solutions from verified subject matter experts
Step: 2 Unlock
Step: 3 Unlock
