Question: Problem 3 : You experiment with a GPU to compute the following multiplication C = A x B , where m = n = l

Problem 3: You experiment with a GPU to compute the matrix multiplication C = A × B, where m = n = l = 1024 and the elements of A, B, and C all have the same width of 4 bytes. In the GPU implementation, each thread is in charge of computing one element of C. Please answer the following questions and show detailed steps. (25 pts)
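$$A = \begin{bmatrix} x_{1,1} & \cdots & x_{1,l} \\ \vdots & \ddots & \vdots \\ x_{m,1} & \cdots & x_{m,l} \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} y_{1,1} & \cdots & y_{1,n} \\ \vdots & \ddots & \vdots \\ y_{l,1} & \cdots & y_{l,n} \end{bmatrix}$$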
a) From the hardware standpoint, the GPU has 4 Streaming Multiprocessors (SMs), each with exclusive L1 and L2 caches, and 8 warp schedulers managing 4 warps per scheduler. Each warp consists of 32 cores. From the software perspective, threads within the same block can share data directly, while threads across different blocks cannot. What is the maximum block size that matches the GPU hardware architecture? (5 pts) 8 × 4 × 32 = 1024
b) Following the result from a), how many blocks should we have for computing the entire matrix C? Please use the dim3 type we have learned to initialize the threading. (5 pts) dim3 block(32, 32); dim3 thread(32, 32);
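A minimal CUDA sketch of the launch configuration in b), assuming m = n = l = 1024 and a 32 × 32 thread block (1024 threads) from a). The names `N`, `BLOCK`, `matmulNaive`, and `launchNaive` are illustrative and do not come from the question; the kernel is the plain one-thread-per-element product, shown only to make the indexing concrete.

```cuda
#include <cassert>
#include <cuda_runtime.h>

constexpr int N = 1024;    // assumed matrix dimension (m = n = l)
constexpr int BLOCK = 32;  // 32 x 32 = 1024 threads per block, as in a)

// Each thread computes one element C[row][col] of the row-major product.
__global__ void matmulNaive(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k) {
            acc += A[row * n + k] * B[k * n + col];
        }
        C[row * n + col] = acc;
    }
}

// Launch on device pointers dA, dB, dC, each holding n * n floats.
void launchNaive(const float* dA, const float* dB, float* dC, int n = N) {
    dim3 threads(BLOCK, BLOCK);              // threads per block
    dim3 blocks((n + BLOCK - 1) / BLOCK,     // blocks per grid:
                (n + BLOCK - 1) / BLOCK);    // 32 x 32 = 1024 blocks when n = 1024
    matmulNaive<<<blocks, threads>>>(dA, dB, dC, n);
    assert(cudaGetLastError() == cudaSuccess);  // catches an invalid launch configuration
}
```

With n = 1024 the grid is 32 × 32, so 1024 blocks of 1024 threads give exactly one thread per element of C.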
c) After applying tiling with a tile size of 32 × 32, how many tiles can cover the entire matrix? Given the block size calculated in a), how many tiles can cover the entire block? (5 pts) 1024 × 1024 / 256 = 4096, 1024 / 256 = 4
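The counts in c) come from dividing element counts. A worked version, assuming a 1024 × 1024 matrix and a 1024-thread block from a); the symbol T for the tile edge is introduced here and is not in the question:

```latex
\[
\text{tiles per matrix} = \frac{1024 \times 1024}{T^{2}},
\qquad
\text{tiles per block} = \frac{1024}{T^{2}}
\]
\[
T = 32:\quad \frac{1024 \times 1024}{1024} = 1024, \qquad \frac{1024}{1024} = 1
\]
\[
T = 16:\quad \frac{1024 \times 1024}{256} = 4096, \qquad \frac{1024}{256} = 4
\]
```

The divisor 256 in the figures written beside the question corresponds to a 16 × 16 tile; with the 32 × 32 tile named in the question, the same formulas give 1024 tiles for the matrix and 1 tile per block.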
d) Following the block and tile sizes in c), what is the proper size of shared memory that we need to request (Hint: threads within the same block can share data directly, while threads across different blocks cannot)? Please explain your answer. (10 pts)
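One common reading of d), sketched in CUDA under stated assumptions: with a 32 × 32 tile and 4-byte elements, each block stages one tile of A and one tile of B in shared memory, which comes to 2 × 32 × 32 × 4 B = 8192 B (8 KiB) per block. Each thread keeps its partial sum for C in a register, so this sketch allocates no shared tile for C; if the intended answer also buffers C in shared memory, the figure would be 3 × 4096 B = 12 KiB. The names `TILE`, `sharedBytes`, `matmulTiled`, and `launchTiled` are illustrative, and n is assumed to be a multiple of the tile edge.

```cuda
#include <cassert>
#include <cuda_runtime.h>

constexpr int TILE = 32;  // tile edge; one 32 x 32 tile matches a 1024-thread block

// Shared memory per block in this sketch: one tile of A plus one tile of B.
constexpr size_t sharedBytes() { return 2 * TILE * TILE * sizeof(float); }

__global__ void matmulTiled(const float* A, const float* B, float* C, int n) {
    // Visible only to threads of the same block, which is why the tiles live here.
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;  // per-thread accumulator for C[row][col]

    for (int t = 0; t < n / TILE; ++t) {
        // Each thread loads one element of the current A tile and B tile.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();  // wait until both tiles are fully loaded

        for (int k = 0; k < TILE; ++k) {
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();  // wait before the tiles are overwritten
    }
    C[row * n + col] = acc;
}

// Launch on device pointers dA, dB, dC, each holding n * n floats.
void launchTiled(const float* dA, const float* dB, float* dC, int n) {
    assert(n % TILE == 0);  // the sketch does not handle partial tiles
    dim3 threads(TILE, TILE);
    dim3 blocks(n / TILE, n / TILE);
    matmulTiled<<<blocks, threads>>>(dA, dB, dC, n);
    assert(cudaGetLastError() == cudaSuccess);
}
```

Because the two tiles are declared with static sizes, no dynamic shared-memory argument is needed in the launch; `sharedBytes()` only documents the 8192-byte total.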
Just c & d