Question: Using the example axpy computation problem (provided under the problem), write a C console program of the matrix multiplication algorithm ([A] * [B] = [C]) for DOUBLE precision data types

Using the example axpy computation problem (provided under the problem), write a C console program of the matrix multiplication algorithm ([A] * [B] = [C]) for DOUBLE precision data types. All matrices [A], [B], and [C] are to be square, i.e. the same number of rows and columns. Execution should be scalable and able to handle matrix dimensions N x N of 4 x 4, 16 x 16, 32 x 32, 64 x 64, 128 x 128, 512 x 512, 1024 x 1024, and 2048 x 2048. Set the matrix dimension, N, the number of accuracy improvement loops, and the system clock speed using DEFINE statements. You'll need to loop many more times for small array sizes, then reduce the loop iterations as you increase the array size (1 accuracy loop for matrices 1024 x 1024 and larger). Use a random number generator to fill the matrices with random data. Compiler optimizations should be on for full optimization.

Your program should print out, using formatted printf:

1. Vector length
2. Number of accuracy loops
3. Total computation time
4. Computation time for the complete NxN matrix multiplication
5. Computation time per arithmetic operation
6. Number of machine cycles per arithmetic operation

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~EXAMPLE~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


#include <stdio.h>

#include <stdlib.h>

#include <time.h>

#define SIZE 4 // vector dimension

#define LOOP 1E9 // number of accuracy improvement loops

#define CPU_CLK 3e9

int main()

{

//declare vectors and variables

long i;

long long j;

long * z;

long * x;

long * y;

long a;

double NumOfOps; //variable declaration for total number of arithmetic Ops in computation

// LOOP * SIZE * number of arithmetic ops required per element

double OPS;

double ElapsedTime;

double ElapsedTimePerVector;

double ElapsedTimePerVectorElement;

long long OPS_PER_INSTR; // number of arithmetic ops required per element

time_t start_time;

time_t end_time;

a = 1;

// declare the axpy coefficient and variables. Allocate memory/stack space for them

z = (long*)malloc(SIZE * sizeof(long));

x = (long*)malloc(SIZE * sizeof(long));

y = (long*)malloc(SIZE * sizeof(long));

//calculate LOOP value for defined accuracy value

//LOOP = MAX_ITERATIONS * ACCURACY;

OPS_PER_INSTR = 2; // arithmetic Ops per vector element from algorithm expressed in C code

//fill vectors with random values 1 to 100

for (i = 0; i < SIZE; i++)

{

x[i] = (long)1 + rand() % 100;

y[i] = (long)1 + rand() % 100;

z[i] = 0;

}

printf("Number of elements per vector is: %d\n", SIZE);

printf("Number of accuracy loops is: %e\n", (double)LOOP);

printf("Processor clock frequency is: %0.2e cycles per second\n", CPU_CLK);

NumOfOps = (double)(OPS_PER_INSTR * SIZE*(double)LOOP); //evaluate total number of multiply adds

printf("# of floating point multiply adds is: %0.3e\n", NumOfOps);

printf("Ops per instruction = %lld\n", OPS_PER_INSTR); // %lld matches the long long argument

///begin timed portion of benchmark

start_time = time(0);

for (j = 0; j < LOOP; j++)

{

for (i = 0; i < SIZE; i++)

{

z[i] = (a*x[i]) + y[i]; //single line of code to implement axpy

}

}

end_time = time(0);

///end timed portion of benchmark

ElapsedTime = difftime(end_time, start_time); //elapsed time in seconds, in double precision format

printf("Measured elapsed time was: %0.4e seconds\n", ElapsedTime);

ElapsedTimePerVector = (double)ElapsedTime / (double)LOOP;

printf("Execution time per vector is: %0.4e seconds\n", ElapsedTimePerVector);

ElapsedTimePerVectorElement = ElapsedTimePerVector / ((double)SIZE);

printf("Execution time per vector element is: %0.4e seconds\n", ElapsedTimePerVectorElement);

printf("Execution time per arithmetic Op is: %0.4e seconds\n", ElapsedTimePerVectorElement / OPS_PER_INSTR); // need to divide by 2 for complete multiply add functionality

printf("Estimated OPs per second is: %0.3e OPs per second\n", OPS = (OPS_PER_INSTR / ElapsedTimePerVectorElement)); // need 2 in numerator for case of multiply add

printf("Estimated number of clock cycles per OP is %0.2f CPU Clock Cycles per OP\n", (CPU_CLK) / (OPS));

getchar(); //keeps the command console open after execution; comment this out if your IDE does not need it

free(x);

free(y);

free(z);

return 0;

}

The example problem: a C console program of the vector multiply-add "axpy" algorithm for integer data types. Instrument, monitor, and measure execution time for the vector multiply-add.

a) The C program should be single threaded and sequential. Execution should be scalable and able to handle the number of vector elements, N, from 1 to 1,000,000. Set the vector dimension, N, the number of accuracy improvement loops, and the system clock speed using DEFINE statements. Use a random number generator to fill the vector elements with random data. Recompile and execute for each vector length (and number of accuracy loops). Compiler optimizations should be off or left at the default. The console window should execute and remain open until manually closed. The code should be portable and not dependent on the development environment or properties settings.

b) Your console program should print out (using formatted printf commands):

1) Vector length (number of vector elements)
2) The number of accuracy improvement loops you run the axpy computation to improve accuracy
3) Total computation time
4) Computation time per axpy vector
5) Computation time per vector element
6) The number of machine cycles per arithmetic operation
