Question: The following kernel is executed on a large matrix, which is tiled into submatrices. To manipulate tiles, a new CUDA programmer has written the following device kernel to transpose each tile in the matrix. The tiles are of size BLOCK_SIZE by BLOCK_SIZE, and each dimension of matrix A is known to be a multiple of BLOCK_SIZE. The kernel invocation and code are shown below. BLOCK_SIZE is known at compile time but could be set anywhere from 1 to 20.

dim3 blockDim(BLOCK_SIZE, BLOCK_SIZE);
dim3 gridDim(A_width / blockDim.x, A_height / blockDim.y);
BlockTranspose<<<gridDim, blockDim>>>(A, A_width, A_height);

__global__ void BlockTranspose(float* A_elements, int A_width, int A_height)
{
    __shared__ float blockA[BLOCK_SIZE][BLOCK_SIZE];

    int baseIdx = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    baseIdx += (blockIdx.y * BLOCK_SIZE + threadIdx.y) * A_width;

    blockA[threadIdx.y][threadIdx.x] = A_elements[baseIdx];
    A_elements[baseIdx] = blockA[threadIdx.x][threadIdx.y];
}

Out of the possible range of values for BLOCK_SIZE, for what values of BLOCK_SIZE will this kernel function correctly when executing on the device?

If the code does not execute correctly for all BLOCK_SIZE values, suggest a fix to the code to make it work for all BLOCK_SIZE values.
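
A sketch of one commonly suggested fix, offered as an illustration rather than an official answer. The kernel as posted has no barrier between the write to blockA and the transposed read, so a thread can read an element that another thread has not stored yet. It can only be expected to work when every thread of the block belongs to a single warp, that is when BLOCK_SIZE * BLOCK_SIZE <= 32, which gives BLOCK_SIZE from 1 to 5. Even that depends on the threads of a warp running in lockstep, which architectures with independent thread scheduling do not guarantee. Adding __syncthreads() between the two accesses removes the dependence on warp size. The default BLOCK_SIZE of 16 and the host helper checkBlockTranspose are assumptions added here for testing; they are not part of the original question.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#ifndef BLOCK_SIZE
#define BLOCK_SIZE 16  // assumption: any compile-time value from 1 to 20
#endif

__global__ void BlockTranspose(float* A_elements, int A_width, int A_height)
{
    __shared__ float blockA[BLOCK_SIZE][BLOCK_SIZE];

    int baseIdx = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    baseIdx += (blockIdx.y * BLOCK_SIZE + threadIdx.y) * A_width;

    blockA[threadIdx.y][threadIdx.x] = A_elements[baseIdx];

    // Barrier: every thread in the block must finish storing its element
    // in shared memory before any thread reads the transposed element.
    __syncthreads();

    A_elements[baseIdx] = blockA[threadIdx.x][threadIdx.y];
}

// Hypothetical host-side check (not in the original question): transposes
// each tile of a tilesX-by-tilesY grid of tiles in place on the device and
// compares the result with a CPU reference.
bool checkBlockTranspose(int tilesX, int tilesY)
{
    const int width = tilesX * BLOCK_SIZE;
    const int height = tilesY * BLOCK_SIZE;
    const size_t bytes = sizeof(float) * width * height;

    std::vector<float> in(width * height), out(width * height);
    for (int i = 0; i < width * height; ++i) in[i] = static_cast<float>(i);

    float* dA = nullptr;
    if (cudaMalloc(&dA, bytes) != cudaSuccess) return false;
    cudaMemcpy(dA, in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(width / block.x, height / block.y);
    BlockTranspose<<<grid, block>>>(dA, width, height);
    cudaError_t err = cudaDeviceSynchronize();

    cudaMemcpy(out.data(), dA, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    if (err != cudaSuccess) return false;

    // Element (r, c) of the output must equal the element at the mirrored
    // position inside the same tile of the input.
    for (int r = 0; r < height; ++r) {
        for (int c = 0; c < width; ++c) {
            int tileRow = (r / BLOCK_SIZE) * BLOCK_SIZE;
            int tileCol = (c / BLOCK_SIZE) * BLOCK_SIZE;
            int srcRow = tileRow + (c - tileCol);
            int srcCol = tileCol + (r - tileRow);
            if (out[r * width + c] != in[srcRow * width + srcCol]) return false;
        }
    }
    return true;
}
```

The barrier is safe here because every thread in the block reaches it unconditionally. To try other tile sizes, the file can be compiled with a different definition, for example nvcc -DBLOCK_SIZE=20, and checkBlockTranspose(2, 3) called from main.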
