Question:
The following kernel is executed on a large matrix, which is tiled into submatrices. To manipulate tiles, a new CUDA programmer has written the following device kernel to transpose each tile in the matrix. The tiles are of size BLOCK_SIZE by BLOCK_SIZE, and each of the dimensions of matrix A is known to be a multiple of BLOCK_SIZE. The kernel invocation and code are shown below. BLOCK_SIZE is known at compile time but could be set anywhere from 1 to 20.
dim3 blockDim(BLOCK_SIZE,BLOCK_SIZE);
dim3 gridDim(A_width/blockDim.x,A_height/blockDim.y);
BlockTranspose<<<gridDim, blockDim>>>(A, A_width, A_height);
__global__ void BlockTranspose(float* A_elements, int A_width, int A_height)
{
    __shared__ float blockA[BLOCK_SIZE][BLOCK_SIZE];

    int baseIdx = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    baseIdx += (blockIdx.y * BLOCK_SIZE + threadIdx.y) * A_width;

    blockA[threadIdx.y][threadIdx.x] = A_elements[baseIdx];
    A_elements[baseIdx] = blockA[threadIdx.x][threadIdx.y];
}
Out of the possible range of values for BLOCK_SIZE, for what values of BLOCK_SIZE will this kernel function correctly when executing on the device?
If the code does not execute correctly for all BLOCK_SIZE values, suggest a fix to the code to make it work for all BLOCK_SIZE values.
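For reference, below is a sketch of one possible fix. The hazard is that a thread can write its transposed value back to A_elements[baseIdx] before another thread in the same block has read that element into shared memory, so the kernel as written is only guaranteed to be correct when each block holds a single thread (BLOCK_SIZE = 1). On hardware where a whole warp happens to execute in lockstep, block sizes whose thread count fits in one warp may appear to work, but the programming model does not guarantee it. Inserting a __syncthreads() barrier between the shared-memory load and the write-back removes the race and makes the kernel correct for every BLOCK_SIZE from 1 to 20:

__global__ void BlockTranspose(float* A_elements, int A_width, int A_height)
{
    __shared__ float blockA[BLOCK_SIZE][BLOCK_SIZE];

    int baseIdx = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    baseIdx += (blockIdx.y * BLOCK_SIZE + threadIdx.y) * A_width;

    // Each thread loads one element of its tile into shared memory.
    blockA[threadIdx.y][threadIdx.x] = A_elements[baseIdx];

    // Wait until every thread in the block has finished its load before
    // any thread overwrites the tile with transposed values.
    __syncthreads();

    A_elements[baseIdx] = blockA[threadIdx.x][threadIdx.y];
}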