Question: programing assignment for matrix multiplication algorithm ([A]x[B] = [C]) sample code: #include #include #include #define FILE DOUBLEPRECISION_VMA to demonstrate evaluating Cycles per Ops ( CpOPs

programing assignment for matrix multiplication algorithm ([A]x[B] = [C])programing assignment for matrix multiplication algorithm ([A]x[B] = [C]) sample code: #include#include #include #define FILE "DOUBLEPRECISION_VMA to demonstrate evaluating Cycles per Ops (

sample code:

#include

#include

#include

#define FILE "DOUBLEPRECISION_VMA to demonstrate evaluating Cycles per Ops ( CpOPs )"

#define SIZE 1000 //1000 base. If vector = 1 and LOOPs=1E10, and /O3 cycle per flop represent approx latency to L1

#define TOTAL_ITERATIONS 1E10 // total iterantions is the produst of SIZE and # of outer accuracy improvement loops (LOOP)

#define CPU_CLK 2.394455e9 //3.43E9 for Dell i7-6600, 2.8E9 for i5 Surface Pro, 3.48E9 for HP i7-7700; 2.394455e9 for Xeon for Sun 1.593E9

int main()

{

//declare vectors and variables

int i;

long long j, LOOP;

double * z;

double * x;

double * y;

double a;

double NumOfOps;

double FLOPS;

double ElapsedTime;

double ElapsedTimePerVector;

double ElapsedTimePerVectorElement;

//declare time.h counter variables

time_t start_time;

time_t end_time;

a = 0.001; // declare the axpy coefficient and variables. Allocate memory/stack space for them

z = (double*)malloc(SIZE * sizeof(double));

x = (double*)malloc(SIZE * sizeof(double));

y = (double*)malloc(SIZE * sizeof(double));

LOOP = TOTAL_ITERATIONS / SIZE;

//fill vectors with random values

for (i = 0; i

{

x[i] = (double)1 + rand() % 100;

y[i] = (double)1 + rand() % 100;

z[i] = 0.0;

}

printf("File being run : %s ", FILE);

printf("Arrays filled with data... computation begins now... ");

//printf("Compiler flags: %s ", FLAGS); //print or suppress printing of compiler flags as needed

printf("Number of elements per vector is: %d ", SIZE);

printf("Number of resolution loops is: %e ", (double)LOOP);

printf("Total number of inner & outer loop iterations is: %e ", (double)TOTAL_ITERATIONS);

printf("Processor clock frequency is: %0.2e cycles per second ", CPU_CLK);

NumOfOps = (double)(2 * SIZE*(double)LOOP); //evaluate total number of multiply adds

printf("# of floating point adds is: %0.3e ", NumOfOps);

printf("Vectors populated.. computation begins now ... ");

start_time = (double)time(0);

for (j = 0; j

{

for (i = 0; i

{

//z[i] = a*x[i];

//z[i] = x[i] + y[i];

z[i] = (a*x[i]) + y[i]; //single line of code to implement daxpy

}

}

end_time = (double)time(0);

printf("Computation Done ! ");

ElapsedTime = ((end_time - start_time)); //elapsed time in double precision format

printf("Measured elasped time was: %0.4e seconds ", ElapsedTime);

/* printf("Start number of seconds since 0 is: %d ", start_time);

printf("End number of seconds since 0 is: %d ", end_time);

printf("Start time in number of system clock ticks since 0 is: %0.12e ", (double)CPU_CLK*start_time);

printf("End time in number of system clock ticks since 0 is: %0.12e ", (double)CPU_CLK*end_time);

printf("Number of system clock ticks in measurement is : %0.12e ", CPU_CLK*(end_time - start_time)); */

//time() is returning an integer value representing the number of seconds since time 0 .. so for a valid measurement,

//the entire computation must take at lease 1 second, otherwise the timer won't be able to capture the event.

//print execution time of multiply add operation

// printf("Execution time per multiply-add operation = %e : ", ElapsedTime/NumOfMultAdds);

ElapsedTimePerVector = (double)ElapsedTime / (double)LOOP;

printf("Execution time per vector is: %0.4e seconds ", ElapsedTimePerVector);

ElapsedTimePerVectorElement = ElapsedTimePerVector / ((double)SIZE);

printf("Execution time per vector element is: %0.4e seconds ", ElapsedTimePerVectorElement);

printf("Execution time per arithmetic Op is: %0.4e seconds ", ElapsedTimePerVectorElement / 2); /eed to divide by 2 for complete multiply add functionality

FLOPS = (2 / ElapsedTimePerVectorElement); /eed 2 in numerator for case of multiply add

printf("Estimated number of clock cycles per OP is %0.2f CPU Clock Cycles per OP ", (CPU_CLK) / (FLOPS));

printf("Estimated number of Ops per clock cycle is %0.2f OPs per CPU Clock Cycles ", (FLOPS) / (CPU_CLK));

printf("Estimated FLOPs per second is: %0.3e OPs per second ", FLOPS);

//getchar();

free(x);

free(y);

free(z);

return 0;

}

Your C program should be single threaded, and sequentia. All matrices [A], [B], and [C] are to be square, i.e. same number of rows and columns Execution should be scalable and be able to handle matrix dimension N x N, from 4 x 4, 16x16, 32x32, 64x64, 128x128, 256x256, 512x512, 1024x1024 and 2048x2048. Set the matrix dimension, N, number of accuracy improvement loops, and system CPU clock speed using DEFINE statements. Use a random number generator to fill the random data into the matrices. Compiler optimizations should be configured for full optimization ( -03 in gcc or /Ox in MS VS). Comment your code to explain what it is doing. As in programming exercise 1 and 2, make your code portable. Using malloc is the preferred, and recommended method. You should use the time.h header file library, and either the "time() or clock() timing functions to capture the start and end execution times for your benchmark. You will need two nested loops to perform the matrix multiplication, and in addition, as in PE2, you will need to use a 3rd, outer accuracy improvement loop. Adjust the total number of iterations of your inner and outer accuracy improvement loops so that total execution time is approximately constant as you vary the dimensions of your A, B and C matrices change from 4x4 to 2048x2048. Your C program should be single threaded, and sequentia. All matrices [A], [B], and [C] are to be square, i.e. same number of rows and columns Execution should be scalable and be able to handle matrix dimension N x N, from 4 x 4, 16x16, 32x32, 64x64, 128x128, 256x256, 512x512, 1024x1024 and 2048x2048. Set the matrix dimension, N, number of accuracy improvement loops, and system CPU clock speed using DEFINE statements. Use a random number generator to fill the random data into the matrices. Compiler optimizations should be configured for full optimization ( -03 in gcc or /Ox in MS VS). Comment your code to explain what it is doing. As in programming exercise 1 and 2, make your code portable. Using malloc is the preferred, and recommended method. You should use the time.h header file library, and either the "time() or clock() timing functions to capture the start and end execution times for your benchmark. You will need two nested loops to perform the matrix multiplication, and in addition, as in PE2, you will need to use a 3rd, outer accuracy improvement loop. Adjust the total number of iterations of your inner and outer accuracy improvement loops so that total execution time is approximately constant as you vary the dimensions of your A, B and C matrices change from 4x4 to 2048x2048

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!