Part 2: Implement matrix multiplication with basic CUDA (20 pts)
Instead of calling the cublasSgemm() function, implement a kernel function for matrix multiplication in basic CUDA, and time its performance.
You should name your program mmNaive.cu. You can reuse the structure and code segments of mmCUBLAS.cpp, and replace cublasSgemm with your own implementation. Check to make sure the computation on the device is correct.
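As a sketch of what such a kernel might look like (the name `matMulNaive`, the square-matrix restriction, and the row-major layout are illustrative assumptions, not requirements of the assignment; note that mmCUBLAS works with column-major data):

```cuda
// Naive matrix multiplication kernel: one thread computes one element of C.
// Assumes row-major storage and square n x n matrices for simplicity.
__global__ void matMulNaive(const float *A, const float *B, float *C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];  // dot product of row and column
        C[row * n + col] = sum;
    }
}

// Example launch: 16x16 thread blocks, enough blocks to cover the matrix.
// dim3 threads(16, 16);
// dim3 grid((n + threads.x - 1) / threads.x, (n + threads.y - 1) / threads.y);
// matMulNaive<<<grid, threads>>>(d_A, d_B, d_C, n);
```

Correctness can be checked by copying `d_C` back to the host and comparing it element-by-element (within a small floating-point tolerance) against a plain triple-loop CPU reference.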
Use a similar timing method for your implementation: for example, perform a warm-up operation before timing the execution, and time multiple iterations of matrix multiplication so that the measured execution time is long enough.
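A typical timing skeleton with CUDA events might look like the following (`matMulNaive`, `grid`, `threads`, and `nIter` are placeholder names; this mirrors the pattern used in the CUDA SDK samples, not the assignment's exact code):

```cuda
// Warm up once so one-time launch overhead is excluded from the measurement.
matMulNaive<<<grid, threads>>>(d_A, d_B, d_C, n);
cudaDeviceSynchronize();

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

const int nIter = 30;  // enough repetitions for a stable measurement
cudaEventRecord(start, 0);
for (int i = 0; i < nIter; ++i)
    matMulNaive<<<grid, threads>>>(d_A, d_B, d_C, n);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);  // wait for all timed kernels to finish

float msecTotal = 0.0f;
cudaEventElapsedTime(&msecTotal, start, stop);
float msecPerMatrixMul = msecTotal / nIter;
```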
Compile the code, and collect execution times for the same matrix sizes. Show how performance changes with matrix size and compare your performance with mmCUBLAS. How much slower is your implementation? Can you identify the reasons?
