Question: Part 2: Implement matrix multiplication with basic CUDA (20 pts)
Instead of calling the cublasSgemm() function, implement a kernel function for matrix multiplication with basic CUDA, and time its performance.
You should name your program mmNaive.cu. You can reuse the structure and code segments of mmCUBLAS.cpp, replacing the cublasSgemm call with your own implementation. Check to make sure the computation on the device is correct.
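For reference, a minimal sketch of what such a naive kernel could look like is shown below. The kernel name matMulNaive, the 16x16 block shape, and the column-major storage (chosen to mirror cuBLAS conventions) are illustrative assumptions, not requirements of the assignment; for simplicity it computes plain C = A*B for square n x n matrices rather than the full alpha/beta form of cublasSgemm.

```cuda
#include <cuda_runtime.h>

// Naive matrix multiply: one thread computes one element of C = A * B.
// A, B, C are square n x n matrices stored column-major (an assumption
// made here to match the layout cuBLAS expects).
__global__ void matMulNaive(int n, const float *A, const float *B, float *C)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[k * n + row] * B[col * n + k];  // A(row,k) * B(k,col)
        C[col * n + row] = sum;                      // C(row,col)
    }
}

// Example launch: 16x16 thread blocks tiled over the whole output matrix.
void launchMatMulNaive(int n, const float *d_A, const float *d_B, float *d_C)
{
    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    matMulNaive<<<grid, block>>>(n, d_A, d_B, d_C);
}
```

Each thread reads an entire row of A and column of B from global memory with no data reuse through shared memory, which is the main structural difference from a tuned cuBLAS kernel.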
Use a similar timing method for your implementation. For example, perform a warmup operation before timing the execution, and time multiple iterations of the matrix multiplication so that the total execution time is long enough to measure reliably.
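A hedged sketch of that timing pattern follows, using CUDA events around the matMulNaive kernel from the sketch above. The iteration count nIter = 30, the helper name timeMatMulNaive, and the device pointers d_A, d_B, d_C are assumptions for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Declared in the kernel sketch above.
__global__ void matMulNaive(int n, const float *A, const float *B, float *C);

// Times repeated launches of matMulNaive on already-allocated device buffers.
void timeMatMulNaive(int n, const float *d_A, const float *d_B, float *d_C)
{
    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);

    // Warmup launch so one-time startup costs are excluded from the timing.
    matMulNaive<<<grid, block>>>(n, d_A, d_B, d_C);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int nIter = 30;  // assumed iteration count
    cudaEventRecord(start, 0);
    for (int i = 0; i < nIter; ++i)
        matMulNaive<<<grid, block>>>(n, d_A, d_B, d_C);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float msTotal = 0.0f;
    cudaEventElapsedTime(&msTotal, start, stop);
    float msPerMatMul = msTotal / nIter;

    // A square matrix multiply performs 2*n^3 floating-point operations.
    double gflops = 2.0 * n * n * n * 1e-9 / (msPerMatMul * 1e-3);
    printf("n = %d: %.3f ms per matmul, %.2f GFLOP/s\n", n, msPerMatMul, gflops);
}
```

Averaging over many iterations amortizes launch overhead and timer resolution, which is why a single-launch measurement would be unreliable for small matrices.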
Compile the code and collect execution times for the same matrix sizes. Show how performance changes with matrix size and compare your performance with mmCUBLAS. How much slower is your implementation? Can you identify the reasons?
