Question: Does cudaMemcpyHostToDevice copy a variable's value from CPU memory to GPU global memory?

I am seeking your assistance in solving the second bullet point correctly. I am not clear on whether cudaMemcpyHostToDevice copies a variable's value from the CPU (system memory?) to the GPU (global memory on the device?). My understanding is that both the matrix A and the vector b need to be copied over to the GPU in this scenario, that the copy destination is global memory, and that b then needs to be copied into the shared memory on each streaming multiprocessor. The host code may include these actions (see the sketch after this list):
// Allocate device memory for matrix A, vector b, and result vector c using cudaMalloc
// Copy A and b to the GPU via cudaMemcpy (c only needs device space to hold the result)
// Launch the kernel threads by invoking __global__ matrixVectorMultiply, which I need to write; dimGrid and dimBlock need to be set before invoking matrixVectorMultiply (should the argument to dimGrid be 1K, meaning 1024, and dimBlock 1 here?)
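To make the question concrete, here is the sketch I have so far. matrixVectorMultiply is the kernel named above; N, the float element type, the row-major layout of A, and the <<<1, N>>> launch (one block of 1K threads, so all of them can share a single copy of b) are my own guesses rather than anything given in the assignment.

#include <cuda_runtime.h>
#include <stdlib.h>

#define N 1024  // "1K"

__global__ void matrixVectorMultiply(const float *A, const float *b, float *c) {
    __shared__ float b_s[N];      // vector b cached in per-block shared memory (4 KB)
    int row = threadIdx.x;        // one thread per row of A

    b_s[row] = b[row];            // each thread loads one element of b from global memory
    __syncthreads();              // wait until all of b is staged in shared memory

    float sum = 0.0f;
    for (int j = 0; j < N; ++j)   // dot product of row `row` of A with b
        sum += A[row * N + j] * b_s[j];
    c[row] = sum;
}

int main(void) {
    size_t bytesA = N * N * sizeof(float), bytesV = N * sizeof(float);
    float *hA = (float *)malloc(bytesA);
    float *hb = (float *)malloc(bytesV);
    float *hc = (float *)malloc(bytesV);
    // ... initialize hA and hb here ...

    float *dA, *db, *dc;
    cudaMalloc(&dA, bytesA);      // allocations land in GPU global memory
    cudaMalloc(&db, bytesV);
    cudaMalloc(&dc, bytesV);      // c only needs space; nothing to copy in

    cudaMemcpy(dA, hA, bytesA, cudaMemcpyHostToDevice);  // host memory -> device global memory
    cudaMemcpy(db, hb, bytesV, cudaMemcpyHostToDevice);

    matrixVectorMultiply<<<1, N>>>(dA, db, dc);          // 1 block of 1K threads

    cudaMemcpy(hc, dc, bytesV, cudaMemcpyDeviceToHost);  // result back to the host

    cudaFree(dA); cudaFree(db); cudaFree(dc);
    free(hA); free(hb); free(hc);
    return 0;
}

My reasoning for <<<1, N>>> rather than dimGrid = 1K and dimBlock = 1 is that a __shared__ array is visible only to threads within the same block, so one-thread blocks would get no benefit from staging b in shared memory; please correct me if that is wrong.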
Please advise.
--------- Attached Problem --------------
6. [15 points] Matrix-vector multiplication using CUDA

- Design a CUDA program to perform matrix-vector multiplication \( c = A \times b \). The size of matrix \( A \) is \( 1\mathrm{K} \times 1\mathrm{K} \). The size of vectors \( b \) and \( c \) is \( 1\mathrm{K} \times 1 \). Your program should use \( 1\mathrm{K} \) threads in total. Assume the shared memory is large enough to hold the entire vector \( b \). The input matrix and the vector are initially stored in the host memory. Write pseudocode for the host function and the kernel function. Note that your kernel function must use shared memory to store vector \( b \).
- Assume each element of \( A \), \( b \), \( c \) is 4 bytes; data transfer between CPU and GPU is through PCIe, whose bandwidth is \( 16\ \mathrm{GB/s} \) in each direction; the clock rate of the GPU is 1 GHz; the access latency to global memory and shared memory is 100 clock cycles and 10 clock cycles, respectively; and multiply-add operations are overlapped with memory access operations. What is the execution time of your CUDA program in the best case?
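For what it's worth, my partial arithmetic for the second bullet covers only the PCIe transfers so far (assuming \( 1\mathrm{K} = 1024 \) and \( 16\ \mathrm{GB/s} = 16 \times 10^{9}\ \mathrm{B/s} \), which are my own readings of the problem):

\[
t_A = \frac{1024 \times 1024 \times 4\ \mathrm{B}}{16 \times 10^{9}\ \mathrm{B/s}} \approx 262\ \mu\mathrm{s}, \qquad
t_b = t_c = \frac{1024 \times 4\ \mathrm{B}}{16 \times 10^{9}\ \mathrm{B/s}} \approx 0.26\ \mu\mathrm{s}.
\]

What I still cannot work out is how the 100-cycle global-memory latency combines with the overlapped multiply-adds inside the kernel, so the kernel time is the part I most need help with.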