10. To manipulate tiles, a new CUDA programmer has written the following device kernel, which will transpose each tile in a matrix. The tiles are of size BLOCK_WIDTH by BLOCK_WIDTH, and each of the dimensions of matrix A is known to be a multiple of BLOCK_WIDTH. The kernel invocation and code are shown below. BLOCK_WIDTH is known at compile time, but could be set anywhere from 1 to 20.
dim3 blockDim(BLOCK_WIDTH,BLOCK_WIDTH);
dim3 gridDim(A_width/blockDim.x,A_height/blockDim.y);
BlockTranspose<<<gridDim, blockDim>>>(A, A_width, A_height);
__global__ void
BlockTranspose(float* A_elements, int A_width, int A_height) {
    __shared__ float blockA[BLOCK_WIDTH][BLOCK_WIDTH];
    int baseIdx = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    baseIdx += (blockIdx.y * BLOCK_SIZE + threadIdx.y) * A_width;
    blockA[threadIdx.y][threadIdx.x] = A_elements[baseIdx];
    A_elements[baseIdx] = blockA[threadIdx.x][threadIdx.y];
}
A. Out of the possible range of values for BLOCK_SIZE, for what values of BLOCK_SIZE will this kernel function execute correctly on the device?
B. If the code does not execute correctly for all BLOCK_SIZE values, suggest a fix to the code to make it work for all BLOCK_SIZE values.
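For reference, a common way to make this kernel correct for every block size in the 1 to 20 range is to insert a __syncthreads() barrier between the load into shared memory and the write-back, so that no thread reads blockA until every thread in the block has finished writing its element. Without the barrier, correctness is only guaranteed when the block contains a single thread. The sketch below is illustrative rather than a verified solution, and it assumes BLOCK_SIZE and BLOCK_WIDTH both name the same tile width, so it uses BLOCK_WIDTH throughout.

__global__ void
BlockTranspose(float* A_elements, int A_width, int A_height) {
    __shared__ float blockA[BLOCK_WIDTH][BLOCK_WIDTH];
    // Linear index of this thread's element within the full matrix.
    int baseIdx = blockIdx.x * BLOCK_WIDTH + threadIdx.x;
    baseIdx += (blockIdx.y * BLOCK_WIDTH + threadIdx.y) * A_width;
    // Stage the tile in shared memory.
    blockA[threadIdx.y][threadIdx.x] = A_elements[baseIdx];
    // Barrier: wait until all threads in the block have loaded their elements
    // before any thread reads a transposed element back out.
    __syncthreads();
    A_elements[baseIdx] = blockA[threadIdx.x][threadIdx.y];
}

Since BLOCK_WIDTH is at most 20, a block holds at most 400 threads, which is within the usual per-block thread limit, so the barrier alone is sufficient; no change to the launch configuration is needed.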
