Question: Consider the problem of multiplying a dense n x n matrix A with an n x 1 vector B to generate an n x 1
Consider the problem of multiplying a dense n x n matrix A with an n x vector B to generate an n x vector C The ith element, Ci corresponds to the dotproduct of the ith row of A and the input vector B as illustrated in the following Figure
Part : Describe how you partition the computation tasks, organize threads, and map threads to the tasks.
Part : Write a matrixvector multiplication CUDA kernel matrixVectorMulKernel by completing the following code:
globalvoid matrixVectorMulKernel float A float B float C int vectorLen
Part : Write a host function matrix VectorMul that can be called in the main function with four parameters: pointer to the input matrix, pointer to the input vector, pointer to the output vector, and the number of elements in each dimension. This function should include statements for memory allocation, data transfer, thread organization, kernel function call and free memory. Complete the following code:
void matrixVectorMulfloat hA float hB float hC int vectorLen
Part : If matrixvector multiplication is implemented on a distributed memory system using multiple CPUs instead of GPUs and CUDA, which collective communication operations eg onetoall broadcast, alltoall broadcast, alltoone reduction, alltoall reduction, scatter, gather can be utilized to enhance performance? Describe how these operations can be applied effectively
Step by Step Solution
There are 3 Steps involved in it
1 Expert Approved Answer
Step: 1 Unlock
Question Has Been Solved by an Expert!
Get step-by-step solutions from verified subject matter experts
Step: 2 Unlock
Step: 3 Unlock
