Question: How would you solve the following problem?

Design a CUDA program to perform matrix multiplication C = A × B. The size of each
matrix is 1K × 1K. Each element is 1 byte. The matrices are initially stored in the global
memory of the GPU. The GPU has one streaming multiprocessor (SM) with 1K CUDA cores.
Each CUDA core runs at 1 GHz and can perform one floating-point operation in each clock
cycle. The peak bandwidth between the GPU and the global memory is 100 GB/s, and the
cache of the GPU has been disabled.
Assume the GPU has an on-chip buffer (shared memory) that can store 3 × 32 × 32
elements and has a peak access bandwidth of 1 TB/s.
Write the pseudocode for an optimized CUDA program using the block matrix
multiplication approach, including both the kernel function and the host function.
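
One possible shape for such a program is sketched below. It is a minimal illustration, not a reference solution. The names `matmul_tiled`, `matmul_host`, and `elem_t` are hypothetical. The sketch assumes the 1-byte elements are unsigned integers, that each result is truncated to 1 byte when stored, and that 1K means 1024. Each 32 × 32 thread block computes one 32 × 32 tile of C, staging one tile of A and one tile of B in shared memory per step.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N    1024   // matrix dimension (1K x 1K, from the problem; 1K = 1024 assumed)
#define TILE 32     // tile edge; the shared memory holds 3 x 32 x 32 elements

typedef unsigned char elem_t;  // assumption: 1-byte elements are unsigned integers

// Kernel: each block computes one TILE x TILE tile of C, one element per thread.
__global__ void matmul_tiled(const elem_t *A, const elem_t *B, elem_t *C) {
    __shared__ elem_t As[TILE][TILE];  // current tile of A
    __shared__ elem_t Bs[TILE][TILE];  // current tile of B

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    unsigned int acc = 0;  // per-thread partial sum, kept in a register

    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative load: every thread fetches one element of each tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // tiles must be complete before anyone reads them

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // finish reading before the next load overwrites
    }
    C[row * N + col] = (elem_t)acc;  // assumption: result truncated to 1 byte
}

// Host function: allocate, copy in, launch, copy out, free.
void matmul_host(const elem_t *hA, const elem_t *hB, elem_t *hC) {
    size_t bytes = (size_t)N * N * sizeof(elem_t);
    elem_t *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);          // 1024 threads, one per CUDA core
    dim3 grid(N / TILE, N / TILE);   // 32 x 32 tiles of C
    matmul_tiled<<<grid, block>>>(dA, dB, dC);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}

int main() {
    elem_t *A = (elem_t *)malloc((size_t)N * N);
    elem_t *B = (elem_t *)malloc((size_t)N * N);
    elem_t *C = (elem_t *)malloc((size_t)N * N);
    for (int i = 0; i < N * N; ++i) {
        A[i] = (elem_t)(rand() % 4);
        B[i] = (elem_t)(rand() % 4);
    }
    matmul_host(A, B, C);

    // Spot-check a few entries against a CPU reference with the same truncation.
    int mismatches = 0;
    for (int s = 0; s < 16; ++s) {
        int r = rand() % N, c = rand() % N;
        unsigned int ref = 0;
        for (int k = 0; k < N; ++k)
            ref += A[r * N + k] * B[k * N + c];
        if ((elem_t)ref != C[r * N + c]) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);
    free(A);
    free(B);
    free(C);
    return mismatches != 0;
}
```

The sketch keeps each thread's partial sum in a register, so only two of the three 32 × 32 shared-memory slots are used. A variant could hold the C tile in the third slot instead, at the cost of extra shared-memory traffic. In the problem's setting the matrices already reside in global memory, so the host-to-device copies in `matmul_host` are only there to make the example self-contained.
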
Derive a lower bound on each of the computation time, the local data access time in
the SM, and the global data access time of your optimized kernel function. Explain.
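
The bounds can be estimated as follows. This is a sketch under stated assumptions, not a definitive answer. It assumes a tiled kernel in which each block of 32 × 32 threads computes one 32 × 32 tile of C, with tile edge $T = 32$ and $N = 1024$. It takes 1K to mean 1024, and 1 GB/s and 1 TB/s to mean $10^{9}$ and $10^{12}$ bytes per second. It counts one multiply and one add as two operations. It assumes each thread keeps its partial sum in a register, so shared memory is touched only to store loaded tiles and to read the two operands of each multiply-add.

```latex
\begin{align*}
t_{\text{comp}}
  &\ge \frac{2N^{3}}{1024 \times 10^{9}\ \text{op/s}}
   = \frac{2^{31}}{1.024 \times 10^{12}}\ \text{s}
   \approx 2.1\ \text{ms} \\[4pt]
D_{\text{global}}
  &= \underbrace{\frac{2N^{3}}{T}}_{\text{tile loads of } A,\,B}
   + \underbrace{N^{2}}_{\text{stores of } C}
   = 2^{26} + 2^{20}\ \text{bytes}
   \approx 68.2\ \text{MB} \\[4pt]
t_{\text{global}}
  &\ge \frac{D_{\text{global}}}{10^{11}\ \text{B/s}}
   \approx 0.68\ \text{ms} \\[4pt]
D_{\text{local}}
  &= \underbrace{2N^{3}}_{\text{operand reads}}
   + \underbrace{\frac{2N^{3}}{T}}_{\text{tile writes}}
   = 2^{31} + 2^{26}\ \text{bytes}
   \approx 2.21\ \text{GB} \\[4pt]
t_{\text{local}}
  &\ge \frac{D_{\text{local}}}{10^{12}\ \text{B/s}}
   \approx 2.2\ \text{ms}
\end{align*}
```

The tile-load term follows from $(N/T)^{2}$ tiles of C, each needing $N/T$ steps that load two tiles of $T^{2}$ bytes. If the three activities overlapped perfectly, the kernel could not finish in less than the largest of the three bounds, about 2.2 ms. With no overlap at all, the sum is about 5.0 ms. For comparison, a kernel without tiling would read $2N^{3} = 2^{31}$ bytes from global memory, which takes about 21.5 ms at 100 GB/s, so the tiling is what brings global traffic below the compute time.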