Question: We have discussed the blocking technique to improve the performance of dense matrix multiplication in the class. In this part of project assignment, you are
We have discussed the blocking technique to improve the performance of dense matrix multiplication in the class. In this part of project assignment, you are asked to implement the original and blocking codes of dense matrix multiplication, run the following experiments on your computer and analyze the results. Assume the array size is N x N.
1. Implement the original code of matrix multiplication and set the value of N to 1024. Compile the code using the optimization options -O0 (no compiler optimization) and -O3 (standard compiler optimization) with gcc, respectively. What is the speedup using -O3 compared with using -O0? If you are using another compiler other than gcc, use two different optimization options supported by the compiler to run the experiments, one without any compiler optimization and another with standard compiler optimization.
2. In the experiments thereafter, using -O3 option (or standard compiler optimization with compiler other than gcc) when compiling the code. Implement the original code of matrix multiplication, vary the array size from N = 16 to 4096 (doubling the value of N each time). When does the program execution time increase sharply? Whats the relationship between the cache size of your computer and the array size around the jump point?
3. Implement the blocking method, set the blocking factor B to 8, and vary the array size as above (from N = 16 to 4096). Compare the execution time of matrix multiplication with and without blocking. Based on the results, whats your suggestion on applying the blocking method?
4. Use the blocking method, set the array size N to 2048, and vary the blocking factor B from 4 to 512 (in power of two only). What is the optimal blocking factor for the program and what is your suggestion on choosing the blocking factor?
Note:
(1) For this project, you need to use a timing function that can measure the program execution time accurately, such as in microseconds. For example, you can use timing function gettimeofday( ) in C before and after the matrix multiplication to get the accurate execution time.
(2) The array elements can be initialized to some random numbers, but not all zeros.
(3) The expected execution time will increase with N doubled even without the cache miss effect.
(4) When the array size is large, allocating the array as static one may cause segmentation error. You can use malloc( ) to allocate array elements instead.
FYI
Blocking:
Operate on submatrices or blocks
Maximize accesses to data loaded into cache before they are replaced
/* After */ for (jj = 0; jj < N; jj = jj+B)
for (kk = 0; kk < N; kk = kk+B)
for (i = 0; i < N; i = i+1)
for (j = jj; j < min(jj+B,N); j = j+1) {r = 0;
for (k = kk; k < min(kk+B,N); k = k+1) { r = r + y[i][k]*z[k][j];}; x[i][j] = x[i][j] + r; };
B called Blocking Factor
Capacity Misses from 2N3 + N2 to 2N3/B+N2
But may suffer from conflict misses
Step by Step Solution
There are 3 Steps involved in it
Get step-by-step solutions from verified subject matter experts
