Question: Programming assignment for a matrix multiplication algorithm ([A]x[B] = [C])

sample code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define FILE "DOUBLEPRECISION_VMA to demonstrate evaluating Cycles per Ops ( CpOPs )"

#define SIZE 1000 //1000 is the base case. If SIZE = 1 and LOOP = 1E10, with /O3, cycles per flop represents the approximate latency to L1

#define TOTAL_ITERATIONS 1E10 // total iterations is the product of SIZE and the number of outer accuracy improvement loops (LOOP)

#define CPU_CLK 2.394455e9 //3.43E9 for Dell i7-6600, 2.8E9 for i5 Surface Pro, 3.48E9 for HP i7-7700, 2.394455e9 for Xeon, 1.593E9 for Sun

int main()

{

//declare vectors and variables

int i;

long long j, LOOP;

double * z;

double * x;

double * y;

double a;

double NumOfOps;

double FLOPS;

double ElapsedTime;

double ElapsedTimePerVector;

double ElapsedTimePerVectorElement;

//declare time.h counter variables

time_t start_time;

time_t end_time;

a = 0.001; // declare the axpy coefficient and variables. Allocate memory/stack space for them

z = (double*)malloc(SIZE * sizeof(double));

x = (double*)malloc(SIZE * sizeof(double));

y = (double*)malloc(SIZE * sizeof(double));

LOOP = TOTAL_ITERATIONS / SIZE;

//fill vectors with random values

for (i = 0; i < SIZE; i++)

{

x[i] = (double)1 + rand() % 100;

y[i] = (double)1 + rand() % 100;

z[i] = 0.0;

}

printf("File being run : %s\n", FILE);

printf("Arrays filled with data... computation begins now...\n");

//printf("Compiler flags: %s\n", FLAGS); //print or suppress printing of compiler flags as needed

printf("Number of elements per vector is: %d\n", SIZE);

printf("Number of resolution loops is: %e\n", (double)LOOP);

printf("Total number of inner & outer loop iterations is: %e\n", (double)TOTAL_ITERATIONS);

printf("Processor clock frequency is: %0.2e cycles per second\n", CPU_CLK);

NumOfOps = (double)(2 * SIZE*(double)LOOP); //evaluate total number of multiply and add operations

printf("# of floating point operations is: %0.3e\n", NumOfOps);

printf("Vectors populated.. computation begins now ...\n");

start_time = (double)time(0);

for (j = 0; j < LOOP; j++)

{

for (i = 0; i < SIZE; i++)

{

//z[i] = a*x[i];

//z[i] = x[i] + y[i];

z[i] = (a*x[i]) + y[i]; //single line of code to implement daxpy

} //end of inner loop over vector elements

} //end of outer accuracy improvement loop

end_time = (double)time(0);

printf("Computation Done !\n");

ElapsedTime = difftime(end_time, start_time); //elapsed time in seconds, in double precision format

printf("Measured elapsed time was: %0.4e seconds\n", ElapsedTime);

/* printf("Start number of seconds since 0 is: %d ", start_time);

printf("End number of seconds since 0 is: %d ", end_time);

printf("Start time in number of system clock ticks since 0 is: %0.12e ", (double)CPU_CLK*start_time);

printf("End time in number of system clock ticks since 0 is: %0.12e ", (double)CPU_CLK*end_time);

printf("Number of system clock ticks in measurement is : %0.12e ", CPU_CLK*(end_time - start_time)); */

//time() returns an integer value representing the number of seconds since time 0, so for a valid measurement

//the entire computation must take at least 1 second; otherwise the timer won't be able to capture the event.

//print execution time of multiply add operation

// printf("Execution time per multiply-add operation = %e : ", ElapsedTime/NumOfMultAdds);

ElapsedTimePerVector = (double)ElapsedTime / (double)LOOP;

printf("Execution time per vector is: %0.4e seconds\n", ElapsedTimePerVector);

ElapsedTimePerVectorElement = ElapsedTimePerVector / ((double)SIZE);

printf("Execution time per vector element is: %0.4e seconds\n", ElapsedTimePerVectorElement);

printf("Execution time per arithmetic Op is: %0.4e seconds\n", ElapsedTimePerVectorElement / 2); //need to divide by 2 for complete multiply add functionality

FLOPS = (2 / ElapsedTimePerVectorElement); //need 2 in numerator for case of multiply add

printf("Estimated number of clock cycles per OP is %0.2f CPU Clock Cycles per OP\n", (CPU_CLK) / (FLOPS));

printf("Estimated number of Ops per clock cycle is %0.2f OPs per CPU Clock Cycle\n", (FLOPS) / (CPU_CLK));

printf("Estimated FLOPs per second is: %0.3e OPs per second\n", FLOPS);

//getchar();

free(x);

free(y);

free(z);

return 0;

}
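
The comments in the sample code point out that time() only counts whole seconds, so short runs cannot be measured. A minimal sketch of the same daxpy measurement using clock() from time.h is shown here for comparison. The function names elapsed_seconds and time_daxpy are hypothetical, and note that clock() reports processor time rather than wall-clock time, which should be close for a single-threaded loop like this one.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Converts a pair of clock() readings to seconds of processor time. */
double elapsed_seconds(clock_t start, clock_t end)
{
    return (double)(end - start) / (double)CLOCKS_PER_SEC;
}

/* Runs `loops` passes of z = a*x + y over `size` elements and returns the
   elapsed processor time in seconds, or -1.0 if allocation fails.
   (Hypothetical helper, not part of the original sample.) */
double time_daxpy(int size, long long loops, double a)
{
    double *x = malloc((size_t)size * sizeof *x);
    double *y = malloc((size_t)size * sizeof *y);
    double *z = malloc((size_t)size * sizeof *z);
    if (!x || !y || !z) {
        free(x);
        free(y);
        free(z);
        return -1.0;
    }

    /* Fill the vectors with random values, as the sample code does. */
    for (int i = 0; i < size; i++) {
        x[i] = 1.0 + rand() % 100;
        y[i] = 1.0 + rand() % 100;
        z[i] = 0.0;
    }

    clock_t start = clock();
    for (long long j = 0; j < loops; j++)
        for (int i = 0; i < size; i++)
            z[i] = (a * x[i]) + y[i];
    clock_t end = clock();

    /* Read one result so the stores to z are not dead to the optimizer. */
    printf("z[0] = %f\n", z[0]);

    free(x);
    free(y);
    free(z);
    return elapsed_seconds(start, end);
}
```

Calling time_daxpy(SIZE, LOOP, a) from main would return a value that can replace ElapsedTime in the calculations above, with resolution set by CLOCKS_PER_SEC instead of one second.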

Your C program should be single threaded and sequential. All matrices [A], [B], and [C] are to be square, i.e. the same number of rows and columns. Execution should be scalable and able to handle matrix dimensions N x N of 4x4, 16x16, 32x32, 64x64, 128x128, 256x256, 512x512, 1024x1024 and 2048x2048. Set the matrix dimension N, the number of accuracy improvement loops, and the system CPU clock speed using #define statements. Use a random number generator to fill the matrices with random data. Compiler optimizations should be configured for full optimization (-O3 in gcc or /Ox in MS VS). Comment your code to explain what it is doing.

As in programming exercises 1 and 2, make your code portable. Using malloc is the preferred and recommended method. You should use the time.h header file library, and either the time() or clock() timing function, to capture the start and end execution times for your benchmark. You will need two nested loops to perform the matrix multiplication, and in addition, as in PE2, you will need to use a 3rd, outer accuracy improvement loop. Adjust the total number of iterations of your inner and outer accuracy improvement loops so that total execution time is approximately constant as the dimensions of your A, B and C matrices change from 4x4 to 2048x2048.
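
A minimal sketch of how these requirements might fit together is given below; it is not the assignment's reference solution. It assumes row-major matrices stored in flat malloc'd arrays, clock() for timing, and 2*N^3 operations per multiplication (N multiplies and N adds for each element of [C]). The names loops_for, matmul and cycles_per_op and the TOTAL_OPS value are hypothetical; CPU_CLK is copied from the sample code. In this sketch the i and j loops walk the elements of [C], an inner k loop accumulates each dot product, and the accuracy improvement loop repeats the whole multiplication.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 64                /* matrix dimension (N x N); assumed value */
#define TOTAL_OPS 2.0e10    /* assumed target operation count per run */
#define CPU_CLK 2.394455e9  /* clock frequency in Hz, from the sample code */

/* Number of accuracy improvement loops so that loops * 2*n^3 is close to
   total_ops; never fewer than one. */
long long loops_for(int n, double total_ops)
{
    double ops_per_multiply = 2.0 * n * (double)n * n;
    long long loops = (long long)(total_ops / ops_per_multiply);
    return loops < 1 ? 1 : loops;
}

/* C = A x B for row-major n x n matrices. */
void matmul(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Fills A and B with random data, times `loops` multiplications, and returns
   the estimated clock cycles per operation, or -1.0 if allocation fails. */
double cycles_per_op(int n, long long loops, double cpu_clk)
{
    size_t count = (size_t)n * (size_t)n;
    double *a = malloc(count * sizeof *a);
    double *b = malloc(count * sizeof *b);
    double *c = malloc(count * sizeof *c);
    if (!a || !b || !c) {
        free(a);
        free(b);
        free(c);
        return -1.0;
    }

    for (size_t i = 0; i < count; i++) {
        a[i] = 1.0 + rand() % 100;
        b[i] = 1.0 + rand() % 100;
        c[i] = 0.0;
    }

    clock_t start = clock();
    for (long long l = 0; l < loops; l++)   /* accuracy improvement loop */
        matmul(a, b, c, n);
    clock_t end = clock();

    double elapsed = (double)(end - start) / (double)CLOCKS_PER_SEC;
    double ops = 2.0 * n * (double)n * n * (double)loops;

    /* Printing c[0] also keeps the result live for the optimizer. */
    printf("N = %d, loops = %lld, elapsed = %0.4e s, c[0] = %f\n",
           n, loops, elapsed, c[0]);

    free(a);
    free(b);
    free(c);
    return elapsed * cpu_clk / ops;
}
```

Calling cycles_per_op(N, loops_for(N, TOTAL_OPS), CPU_CLK) from main and printing the result gives an estimate for one matrix size. Keeping TOTAL_OPS fixed while changing N holds the operation count roughly constant, although run time may still grow at large N once the matrices no longer fit in cache.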
