Computer system architectures must aim to minimize the gap between computer arithmetic and real-world arithmetic, and programmers need to be aware of the implications of the underlying approximations. This course project aims to enhance your computer organization and coding skills by developing numerically stable, parallelizable, and efficiently compiled RISC-V assembly implementations of matrix QR decomposition (QRD). Our textbook [1] provides a running example of matrix multiplication, emphasizing the significance of subword parallelism, instruction-level parallelism, cache blocking, and multiple processors in Chapters 3, 4, 5, and 6, respectively. You will emulate the approach used for matrix multiplication to craft code for matrix decompositions, with the goal of achieving performance enhancements through both software and hardware optimizations.
You are encouraged to work in groups of two, with each group member sharing the project workload equally. Starting work on the project as early as possible is recommended to mitigate last-minute problems or challenges. It is also advised that code enhancement techniques be synchronized with corresponding class lectures. Please do not hesitate to seek feedback and assistance promptly throughout the duration of the project.
Matrix Decompositions:
Matrix decomposition, also known as matrix factorization, involves transforming a matrix into a canonical form within the mathematical domain of linear algebra [2]. This process holds significant importance in numerous scientific and engineering applications reliant on linear algebra and applied statistics. Such decompositions offer analytic simplicity and computational convenience by breaking down complex matrix computation problems into simpler ones. For instance, large-scale matrix computations involving determinants and inversions can be streamlined by decomposing the matrix into simpler canonical forms, providing insights into its characteristics and structure. Such decompositions are therefore very common, and since making the common case fast is a guiding principle of computer design, investing in optimizing their implementation can significantly improve overall system performance.
Various types of matrix decompositions exist, with the most popular ones being relevant to solving systems of linear equations. These include the LU, QR, and singular value decompositions. The LU decomposition breaks down a matrix A into lower triangular (L) and upper triangular (U) matrices; the QR decomposition breaks it down into a unitary matrix (Q) and an upper triangular matrix (R); and the singular value decomposition breaks it down into a diagonal matrix of singular values (D) and two unitary matrices (U and V). The accuracy, throughput, latency, and area of matrix decomposition directly impact system performance. Consequently, the literature abounds with proposals for efficient hardware implementations of these decompositions.
QRD has been extensively utilized in signal processing applications, particularly in multiple-input multiple-output (MIMO) communication systems. In these systems, QRD serves as the preferred matrix decomposition method for channel-matrix preprocessing, aiming to simplify data detection processes [3,4,5].
The QR decomposition (QRD):
We denote non-bold lower- and upper-case letters (a, A) as scalars, bold lower-case letters (a) as vectors, and bold upper-case letters (A) as matrices. Let A = [a_1, a_2, ..., a_N] be an N x N matrix having columns a_k = [a_1k, a_2k, ..., a_Nk]^T for k in the range [1, N]. Similarly define the matrices Q and R. The QRD of matrix A is given by

A = QR,

where Q is a unitary matrix (Q^T Q = I_N) and Q^T A = R is an upper triangular matrix.

Algorithm 1 Classical Gram-Schmidt
1: Q = A, R = 0_N
2: for k = 1 : N do
3:   v_k = a_k
4:   for j = 1 : k - 1 do
5:     r_jk = q_j^T a_k
6:     v_k = v_k - r_jk q_j
7:   end for
8:   r_kk = ||v_k||_2
9:   q_k = v_k / r_kk
10: end for

Algorithm 2 Modified Gram-Schmidt
1: Q = A, R = 0_N
2: for k = 1 : N do
3:   r_kk = ||q_k||_2
4:   q_k = q_k / r_kk
5:   for j = k + 1 : N do
6:     r_kj = q_k^T q_j
7:     q_j = q_j - r_kj q_k
8:   end for
9: end for
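For concreteness, here is a small worked example of our own (not from the handout). Take the 2 x 2 matrix A with columns a_1 = [3, 4]^T and a_2 = [1, 2]^T. Then r_11 = ||a_1||_2 = 5 and q_1 = [3/5, 4/5]^T; next, r_12 = q_1^T a_2 = 11/5, v_2 = a_2 - r_12 q_1 = [-8/25, 6/25]^T, r_22 = ||v_2||_2 = 2/5, and q_2 = [-4/5, 3/5]^T. You can check that q_1^T q_2 = 0 and that QR reproduces A exactly.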
QRD is typically computed using Gram-Schmidt (GS) orthogonalization, and there are various ways to implement the GS process. Algorithms 1 and 2 present two different implementations: classical Gram-Schmidt (CGS) and modified Gram-Schmidt (MGS). In exact arithmetic, these two methods produce exactly the same output (exercise: convince yourself of this). However, in the presence of rounding errors, the algorithms behave significantly differently.
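As a plain-C reference for the compilation work that follows, here is a minimal sketch of MGS (our own illustration: the name mgs_qrd is ours, matrices are stored column-major as 1-D arrays in the style of the textbook's DGEMM, R is assumed zero-initialized by the caller, and no rank-deficiency check is performed):

#include <math.h>
#include <stddef.h>

/* Modified Gram-Schmidt QRD of an n x n column-major matrix.
   On entry Q holds a copy of A; on exit Q is unitary and R is upper triangular. */
void mgs_qrd(size_t n, double *Q, double *R)
{
    for (size_t k = 0; k < n; ++k) {
        double norm = 0.0;                       /* r_kk = ||q_k||_2 */
        for (size_t i = 0; i < n; ++i)
            norm += Q[i + k*n] * Q[i + k*n];
        norm = sqrt(norm);
        R[k + k*n] = norm;
        for (size_t i = 0; i < n; ++i)           /* q_k = q_k / r_kk */
            Q[i + k*n] /= norm;
        for (size_t j = k + 1; j < n; ++j) {     /* orthogonalize remaining columns */
            double r = 0.0;                      /* r_kj = q_k^T q_j */
            for (size_t i = 0; i < n; ++i)
                r += Q[i + k*n] * Q[i + j*n];
            R[k + j*n] = r;
            for (size_t i = 0; i < n; ++i)       /* q_j = q_j - r_kj q_k */
                Q[i + j*n] -= r * Q[i + k*n];
        }
    }
}

The CGS variant instead computes all the r_jk for a column from the original a_k before subtracting, which is precisely what makes it less robust to rounding.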
Project deliverables:
You are tasked with compiling and verifying multiple variations of the two QRD algorithms in RISC-V using Visual Studio Code. The project's grade is contingent upon code correctness, presentation clarity, and innovative features, and bonus points will be awarded for valid, cost-effective extensions and use cases. The code variations include optimizing performance through techniques such as loop unrolling and parallelization, implementing error-handling mechanisms to enhance numerical stability, and integrating the QRD algorithms with other signal processing or linear algebra algorithms. Adhere to best practices in coding, documentation, and testing, and provide clear explanations of the implemented algorithms and their functionality.
Deliverable 0: Warm-up: Compile the floating-point procedure of double-precision general matrix multiply (DGEMM) into RISC-V, following the example provided in the textbook [1] in Section 3.5 (pages 209 to 211). Follow each step, assuming identical data types and register allocations as described in the textbook. This exercise is valuable for understanding how to handle indexing over two-dimensional arrays (where matrices are treated as single-dimensional inside the code) and floating-point operations. Save the resulting code as a function, because you will utilize matrix multiplication later in your project.
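For reference, the textbook's unoptimized DGEMM is essentially the following C routine (matrices flattened into 1-D arrays in column-major order); your task is to hand-compile its body into RISC-V:

#include <stddef.h>

/* C = C + A * B for n x n column-major matrices */
void dgemm(size_t n, double *A, double *B, double *C)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            double cij = C[i + j*n];             /* cij = C[i][j] */
            for (size_t k = 0; k < n; ++k)
                cij += A[i + k*n] * B[k + j*n];  /* cij += A[i][k] * B[k][j] */
            C[i + j*n] = cij;                    /* C[i][j] = cij */
        }
}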
Deliverable 1: CGS and MGS: Proceed to compile both the CGS and MGS algorithms into separate routines. Assume various data types, including variations of floating-point representations, and conduct a comparative analysis of instruction count and performance. Verify the superior numerical stability of MGS. For performance comparison, invoke the DGEMM routine to compute QR and compare the result with A, calculating the mean-square error. In certain use cases where access to the original matrix A post-decomposition is unnecessary, explore potential memory-saving techniques in QRD. For the rest of the project, only consider MGS.
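One possible way to wire up the verification, sketched with our own (hypothetical) naming and reusing the dgemm routine from Deliverable 0:

#include <stddef.h>

void dgemm(size_t n, double *A, double *B, double *C); /* from Deliverable 0 */

/* Reconstruct Q * R and report the mean-square error against the original A.
   All buffers are n x n column-major; QR is scratch space. */
double qrd_mse(size_t n, const double *A, double *Q, double *R, double *QR)
{
    for (size_t i = 0; i < n*n; ++i)
        QR[i] = 0.0;              /* dgemm accumulates into its C argument */
    dgemm(n, Q, R, QR);           /* QR = Q * R */
    double mse = 0.0;
    for (size_t i = 0; i < n*n; ++i) {
        double d = QR[i] - A[i];
        mse += d * d;
    }
    return mse / (double)(n * n);
}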
Deliverable 2: Data-level parallelism: Subword parallelism, in which parallelism occurs within a wide word, is discussed in Sec. 3.8. Consider how such forms of data-level parallelism apply to QRD. Since RISC-V lacks support for subword parallelism, focus on commenting on how the x86 AVX instructions highlighted in the book can enhance the CGS and MGS compilations.
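To ground the commentary, consider how the MGS update q_j = q_j - r_kj q_k maps onto AVX. A sketch of ours (assuming n is a multiple of 4; unaligned loads are used so no alignment guarantee is needed):

#include <stddef.h>
#include <immintrin.h>

/* q_j = q_j - r_kj * q_k, four doubles per AVX operation */
void axpy_avx(size_t n, double r_kj, const double *qk, double *qj)
{
    __m256d r = _mm256_set1_pd(r_kj);                /* broadcast r_kj to 4 lanes */
    for (size_t i = 0; i < n; i += 4) {
        __m256d x = _mm256_loadu_pd(qk + i);         /* 4 elements of q_k */
        __m256d y = _mm256_loadu_pd(qj + i);         /* 4 elements of q_j */
        y = _mm256_sub_pd(y, _mm256_mul_pd(r, x));   /* q_j -= r_kj * q_k */
        _mm256_storeu_pd(qj + i, y);
    }
}

Note that the dot products r_kj = q_k^T q_j are reductions across a vector and vectorize less cleanly than this update; that asymmetry with DGEMM's independent accumulations is worth discussing.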
Deliverable 3: Instruction-level parallelism: Section 4.12 underscores the significance of loop unrolling in DGEMM, which provides multiple-issue, out-of-order execution processors with an abundance of instructions to exploit instruction-level parallelism. Implement loop unrolling in your QRD compilations and comment in detail on the potential gains.
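As an illustration of ours (n again assumed a multiple of 4), unrolling the same column update by a factor of four exposes four independent multiply-subtracts per iteration and removes three quarters of the loop overhead:

#include <stddef.h>

/* q_j = q_j - r * q_k, unrolled by 4 to expose ILP */
void axpy_unrolled(size_t n, double r, const double *qk, double *qj)
{
    for (size_t i = 0; i < n; i += 4) {
        /* four independent updates: more work for the out-of-order
           core to schedule in parallel, fewer branch instructions */
        qj[i]     -= r * qk[i];
        qj[i + 1] -= r * qk[i + 1];
        qj[i + 2] -= r * qk[i + 2];
        qj[i + 3] -= r * qk[i + 3];
    }
}

For the dot-product loops, unrolling should additionally keep multiple partial sums to break the dependence chain through the accumulator.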
Deliverable 4: Cache blocking: In Section 5.15, the original DGEMM code is modified to compute on submatrices to ensure that the elements being accessed can fit in the cache. A full-matrix multiplication thus involves invoking DGEMM repeatedly on matrices of smaller block sizes (the blocking factor). Such blocking reduces cache misses and can aid in register allocation while minimizing the number of loads and stores in the program. Assuming a QRD of a 32 x 32 matrix with 1024 elements, where each element occupies 8 bytes, the three matrices (A, Q, and R) occupy 24 KiB, fitting comfortably within the 32 KiB data cache of the Intel Core i7. Can blocking further enhance performance in QRD? Can you partition the matrices into smaller blocks and perform QRD computations on these blocks?
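For comparison, the textbook's blocked DGEMM has roughly the following shape (the name dgemm_blocked is ours, to avoid clashing with the earlier routine; n is assumed to be a multiple of BLOCKSIZE). A blocked QRD would analogously operate on panels of columns rather than square blocks:

#include <stddef.h>

#define BLOCKSIZE 32

/* multiply one BLOCKSIZE x BLOCKSIZE block starting at (si, sj, sk) */
static void do_block(size_t n, size_t si, size_t sj, size_t sk,
                     double *A, double *B, double *C)
{
    for (size_t i = si; i < si + BLOCKSIZE; ++i)
        for (size_t j = sj; j < sj + BLOCKSIZE; ++j) {
            double cij = C[i + j*n];
            for (size_t k = sk; k < sk + BLOCKSIZE; ++k)
                cij += A[i + k*n] * B[k + j*n];
            C[i + j*n] = cij;
        }
}

void dgemm_blocked(size_t n, double *A, double *B, double *C)
{
    for (size_t sj = 0; sj < n; sj += BLOCKSIZE)
        for (size_t si = 0; si < n; si += BLOCKSIZE)
            for (size_t sk = 0; sk < n; sk += BLOCKSIZE)
                do_block(n, si, sj, sk, A, B, C);
}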
Deliverable 5: Multiple processors: Section 6.12 underscores the distribution of the outermost loop's work across 16 cores in DGEMM, showcasing the potential of parallelization. Parallelizing QRD can substantially improve performance, with various approaches documented in the literature, including recursive methods (do some literature review). In MATLAB, a multicore parallelizable QRD implementation can be simulated using parfor. By distributing the QRD computation across multiple threads or cores, each processing a different subset of the data, performance gains can be realized, particularly for larger matrix dimensions. However, for smaller matrices where everything fits within the first-level data cache, parallelization may lead to performance degradation due to increased overheads and resource contention. Hence, careful consideration of matrix dimensions and the underlying architecture is crucial when determining the effectiveness of parallelization for QRD. Propose and test parallelizable QRD algorithms.
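By analogy with the textbook's OpenMP-parallelized DGEMM, one natural starting point (a sketch under our own naming; compile with -fopenmp) is to split the inner j-loop of MGS across cores, since for a fixed k the updates of columns k+1 through N are mutually independent:

#include <stddef.h>

/* One MGS step: orthogonalize columns k+1..n-1 of Q against q_k,
   with the independent column updates distributed across cores. */
void mgs_step_parallel(size_t n, size_t k, double *Q, double *R)
{
    #pragma omp parallel for
    for (size_t j = k + 1; j < n; ++j) {
        double r = 0.0;                        /* r_kj = q_k^T q_j */
        for (size_t i = 0; i < n; ++i)
            r += Q[i + k*n] * Q[i + j*n];
        R[k + j*n] = r;
        for (size_t i = 0; i < n; ++i)         /* q_j = q_j - r_kj q_k */
            Q[i + j*n] -= r * Q[i + k*n];
    }
}

The k-loop itself remains sequential, which is why more sophisticated parallel QRD schemes from the literature (e.g., tiled or recursive ones) restructure the algorithm rather than just the loops.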
Bonus extensions:
The potential extensions for this work are diverse, and innovation in pursuing these extensions is highly encouraged and rewarded with bonus points. Options for extensions include leveraging the optimized QRD codes to simulate QRD-based computations, such as for matrix inversion or zero-forcing data detection with decision feedback in MIMO communication systems. Additionally, exploring alternative implementations of QRD, such as employing Givens rotation (GR) or Householder transformations, presents avenues for further investigation. Another consideration is dealing with complex-valued matrices and implementing the more intricate complex QRD, which poses additional challenges and opportunities for novel approaches and optimizations. Each of these extensions offers a unique opportunity to deepen understanding, develop novel techniques, and enhance the versatility and applicability of QRD algorithms.
References
[1] D. A. Patterson and J. L. Hennessy, Computer Organization and Design RISC-V Edition: The Hardware/Software Interface. Morgan Kaufmann, 2017.
[2] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Baltimore, MD: Johns Hopkins Univ. Press, 1996.
[3] C. Studer, P. Blosch, P. Friedli, and A. Burg, "Matrix decomposition architecture for MIMO systems: Design and implementation trade-offs," in Proc. Forty-First Asilomar Conf. on Signals, Systems and Computers, 2007, pp. 1986-1990.
[4] H. Sarieddeen and M. M. Mansour, "Enhanced low-complexity layer-ordering for MIMO sphere detectors," in Proc. IEEE Int. Conf. Commun. (ICC), May 2016, pp. 1-6.
[5] H. Sarieddeen, M. M. Mansour, and A. Chehab, "Efficient near-optimal 8x8 MIMO detector," in Proc. IEEE Wireless Commun. and Netw. Conf. (WCNC), 2016.
