Computer system architectures must aim to minimize the gap between computer arithmetic and real-world arithmetic, and programmers need to be aware of the implications of the underlying approximations. This course project aims to enhance your computer organization and coding skills by developing numerically stable, parallelizable, and efficiently compiled RISC-V assembly implementations of matrix QR decomposition (QRD). Our textbook [1] provides a running example of matrix multiplication, emphasizing the significance of subword parallelism, instruction-level parallelism, cache blocking, and multiple processors in Chapters 3, 4, 5, and 6, respectively. You will emulate the approach used for matrix multiplication to craft code for matrix decompositions, with the goal of achieving performance enhancements through both software and hardware optimizations.
You are encouraged to work in groups of two, with each group member sharing the project workload equally. Starting work on the project as early as possible is recommended to mitigate last-minute problems or challenges. It is also advised that code enhancement techniques be synchronized with corresponding class lectures. Please do not hesitate to seek feedback and assistance promptly throughout the duration of the project.
Matrix Decompositions:
Matrix decomposition, also known as matrix factorization, involves transforming a matrix into a canonical form within the mathematical domain of linear algebra [2]. This process holds significant importance in numerous scientific and engineering applications reliant on linear algebra and applied statistics. Such decompositions offer analytic simplicity and computational convenience by breaking down complex matrix computation problems into simpler ones. For instance, large-scale matrix computations involving determinants and inversions can be streamlined by decomposing the matrix into simpler canonical forms, providing insights into its characteristics and structure. Such decompositions are therefore very common, and since making the common case fast is a guiding principle of computer design, investing in optimizing their implementation can significantly improve overall system performance.
Various types of matrix decompositions exist, with the most popular ones being relevant to solving systems of linear equations. These include the LU, QR, and singular value decompositions. The LU decomposition breaks down a matrix A into lower triangular (L) and upper triangular (U) matrices; the QR decomposition breaks it down into a unitary matrix (Q) and an upper triangular matrix (R); and the singular value decomposition breaks it down into a diagonal matrix of singular values (D) and two unitary matrices (U and V). The accuracy, throughput, latency, and area of matrix decomposition directly impact system performance. Consequently, the literature abounds with proposals for efficient hardware implementations of these decompositions.
QRD has been extensively utilized in signal processing applications, particularly in multiple-input multiple-output (MIMO) communication systems. In these systems, QRD serves as the preferred matrix decomposition method for channel-matrix preprocessing, aiming to simplify data detection processes [3,4,5].
The QR decomposition (QRD):
We denote non-bold lower- and upper-case letters (a, A) as scalars, bold lower-case letters (a) as vectors, and bold upper-case letters (A) as matrices. Let A = [a_1, a_2, ..., a_N] be an N x N matrix having columns a_k = [a_1k, a_2k, ..., a_Nk]^T for k in the range [1, N]. Similarly define the matrices Q and R. The QRD of matrix A is given by

A = QR,

where Q is a unitary matrix (Q^T Q = I_N) and Q^T A = R is an upper triangular matrix.

Algorithm 1 Classical Gram-Schmidt
1: Q = A, R = 0_N
2: for k = 1 : N do
3:   v_k = a_k
4:   for j = 1 : k - 1 do
5:     r_jk = q_j^T a_k
6:     v_k = v_k - r_jk q_j
7:   end for
8:   r_kk = ||v_k||_2
9:   q_k = v_k / r_kk
10: end for

Algorithm 2 Modified Gram-Schmidt
1: Q = A, R = 0_N
2: for k = 1 : N do
3:   r_kk = ||q_k||_2
4:   q_k = q_k / r_kk
5:   for j = k + 1 : N do
6:     r_kj = q_k^T q_j
7:     q_j = q_j - r_kj q_k
8:   end for
9: end for
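For concreteness, here is a small worked example of our own (not from the handout). Take the 2 x 2 matrix A with columns a_1 = [3, 4]^T and a_2 = [1, 2]^T. Then r_11 = ||a_1||_2 = 5 and q_1 = [3/5, 4/5]^T; next, r_12 = q_1^T a_2 = 11/5, v_2 = a_2 - r_12 q_1 = [-8/25, 6/25]^T, r_22 = ||v_2||_2 = 2/5, and q_2 = [-4/5, 3/5]^T. You can check that q_1^T q_2 = 0 and that QR reproduces A exactly.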
QRD is typically computed using Gram-Schmidt (GS) orthogonalization, and there are various ways to implement the GS process. Algorithms 1 and 2 present two different implementations: classical Gram-Schmidt (CGS) and modified Gram-Schmidt (MGS). In exact arithmetic, these two methods produce exactly the same output (exercise: convince yourself of this). However, in the presence of rounding errors, the algorithms behave significantly differently.
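As a plain-C reference for the compilation work that follows, here is a minimal sketch of MGS (our own illustration: the name mgs_qrd is ours, matrices are stored column-major as 1-D arrays in the style of the textbook's DGEMM, R is assumed zero-initialized by the caller, and no rank-deficiency check is performed):

#include <math.h>
#include <stddef.h>

/* Modified Gram-Schmidt QRD of an n x n column-major matrix.
   On entry Q holds a copy of A; on exit Q is unitary and R is upper triangular. */
void mgs_qrd(size_t n, double *Q, double *R)
{
    for (size_t k = 0; k < n; ++k) {
        double norm = 0.0;                       /* r_kk = ||q_k||_2 */
        for (size_t i = 0; i < n; ++i)
            norm += Q[i + k*n] * Q[i + k*n];
        norm = sqrt(norm);
        R[k + k*n] = norm;
        for (size_t i = 0; i < n; ++i)           /* q_k = q_k / r_kk */
            Q[i + k*n] /= norm;
        for (size_t j = k + 1; j < n; ++j) {     /* orthogonalize remaining columns */
            double r = 0.0;                      /* r_kj = q_k^T q_j */
            for (size_t i = 0; i < n; ++i)
                r += Q[i + k*n] * Q[i + j*n];
            R[k + j*n] = r;
            for (size_t i = 0; i < n; ++i)       /* q_j = q_j - r_kj q_k */
                Q[i + j*n] -= r * Q[i + k*n];
        }
    }
}

The CGS variant instead computes all the r_jk for a column from the original a_k before subtracting, which is precisely what makes it less robust to rounding.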
Project deliverables:
You are tasked with compiling and verifying multiple variations of the two QRD algorithms in RISC-V using Visual Studio Code. The project's grade is contingent upon code correctness, presentation clarity, and innovative features, and bonus points will be awarded for valid, cost-effective extensions and use cases. The code variations include optimizing performance through techniques such as loop unrolling and parallelization, implementing error-handling mechanisms to enhance numerical stability, and integrating the QRD algorithms with other signal processing or linear algebra algorithms. Adhere to best practices in coding, documentation, and testing, and provide clear explanations of the implemented algorithms and their functionality.
Deliverable 0: Warm-up: Compile the floating-point procedure of double-precision general matrix multiply (DGEMM) into RISC-V, following the example provided in the textbook [1] in Section 3.5 (pages 209 to 211). Follow each step, assuming identical data types and register allocations as described in the textbook. This exercise is valuable for understanding how to handle indexing over two-dimensional arrays (where matrices are treated as single-dimensional inside the code) and floating-point operations. Save the resulting code as a function, because you will utilize matrix multiplication later in your project.
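For reference, the textbook's unoptimized DGEMM is essentially the following C routine (matrices flattened into 1-D arrays in column-major order); your task is to hand-compile its body into RISC-V:

#include <stddef.h>

/* C = C + A * B for n x n column-major matrices */
void dgemm(size_t n, double *A, double *B, double *C)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            double cij = C[i + j*n];             /* cij = C[i][j] */
            for (size_t k = 0; k < n; ++k)
                cij += A[i + k*n] * B[k + j*n];  /* cij += A[i][k] * B[k][j] */
            C[i + j*n] = cij;                    /* C[i][j] = cij */
        }
}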
Deliverable 1: CGS and MGS: Proceed to compile both the CGS and MGS algorithms into separate routines. Assume various data types, including variations of floating-point representations, and conduct a comparative analysis of instruction count and performance. Verify the superior numerical stability of MGS. For performance comparison, invoke the DGEMM routine to compute QR and compare the result with A, calculating the mean-square error. In certain use cases where access to the original matrix A post-decomposition is unnecessary, explore potential memory-saving techniques in QRD. For the rest of the project, only consider MGS.
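One possible way to wire up the verification, sketched with our own (hypothetical) naming and reusing the dgemm routine from Deliverable 0:

#include <stddef.h>

void dgemm(size_t n, double *A, double *B, double *C); /* from Deliverable 0 */

/* Reconstruct Q * R and report the mean-square error against the original A.
   All buffers are n x n column-major; QR is scratch space. */
double qrd_mse(size_t n, const double *A, double *Q, double *R, double *QR)
{
    for (size_t i = 0; i < n*n; ++i)
        QR[i] = 0.0;              /* dgemm accumulates into its C argument */
    dgemm(n, Q, R, QR);           /* QR = Q * R */
    double mse = 0.0;
    for (size_t i = 0; i < n*n; ++i) {
        double d = QR[i] - A[i];
        mse += d * d;
    }
    return mse / (double)(n * n);
}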
Deliverable 2: Data-level parallelism: Subword parallelism, in which parallelism occurs within a wide word, is discussed in Sec. 3.8. Consider how such forms of data-level parallelism apply to QRD. Since RISC-V lacks support for subword parallelism, focus on commenting on how the x86 AVX instructions highlighted in the book can enhance the CGS and MGS compilations.
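To ground the commentary, consider how the MGS update q_j = q_j - r_kj q_k maps onto AVX. A sketch of ours (assuming n is a multiple of 4; unaligned loads are used so no alignment guarantee is needed):

#include <stddef.h>
#include <immintrin.h>

/* q_j = q_j - r_kj * q_k, four doubles per AVX operation */
void axpy_avx(size_t n, double r_kj, const double *qk, double *qj)
{
    __m256d r = _mm256_set1_pd(r_kj);                /* broadcast r_kj to 4 lanes */
    for (size_t i = 0; i < n; i += 4) {
        __m256d x = _mm256_loadu_pd(qk + i);         /* 4 elements of q_k */
        __m256d y = _mm256_loadu_pd(qj + i);         /* 4 elements of q_j */
        y = _mm256_sub_pd(y, _mm256_mul_pd(r, x));   /* q_j -= r_kj * q_k */
        _mm256_storeu_pd(qj + i, y);
    }
}

Note that the dot products r_kj = q_k^T q_j are reductions across a vector and vectorize less cleanly than this update; that asymmetry with DGEMM's independent accumulations is worth discussing.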
Deliverable 3: Instruction-level parallelism: Section 4.12 underscores the significance of loop unrolling in DGEMM, which provides multiple-issue, out-of-order execution processors with an abundance of instructions to exploit instruction-level parallelism. Implement loop unrolling in your QRD compilations and comment in detail on the potential gains.
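As an illustration of ours (n again assumed a multiple of 4), unrolling the same column update by a factor of four exposes four independent multiply-subtracts per iteration and removes three quarters of the loop overhead:

#include <stddef.h>

/* q_j = q_j - r * q_k, unrolled by 4 to expose ILP */
void axpy_unrolled(size_t n, double r, const double *qk, double *qj)
{
    for (size_t i = 0; i < n; i += 4) {
        /* four independent updates: more work for the out-of-order
           core to schedule in parallel, fewer branch instructions */
        qj[i]     -= r * qk[i];
        qj[i + 1] -= r * qk[i + 1];
        qj[i + 2] -= r * qk[i + 2];
        qj[i + 3] -= r * qk[i + 3];
    }
}

For the dot-product loops, unrolling should additionally keep multiple partial sums to break the dependence chain through the accumulator.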
Deliverable 4: Cache blocking: In Section 5.15, the original DGEMM code is modified to compute on submatrices to ensure that the elements being accessed can fit in the cache. A full-matrix multiplication thus involves invoking DGEMM repeatedly on matrices of smaller block sizes (the blocking factor). Such blocking reduces cache misses and can aid in register allocation while minimizing the number of loads and stores in the program. Assuming a QRD of a 32 x 32 matrix with 1024 elements, where each element occupies 8 bytes, the three matrices (A, Q, and R) occupy 24 KiB, fitting comfortably within the 32 KiB data cache of the Intel Core i7. Can blocking further enhance performance in QRD? Can you partition the matrices into smaller blocks and perform QRD computations on these blocks?
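For comparison, the textbook's blocked DGEMM has roughly the following shape (the name dgemm_blocked is ours, to avoid clashing with the earlier routine; n is assumed to be a multiple of BLOCKSIZE). A blocked QRD would analogously operate on panels of columns rather than square blocks:

#include <stddef.h>

#define BLOCKSIZE 32

/* multiply one BLOCKSIZE x BLOCKSIZE block starting at (si, sj, sk) */
static void do_block(size_t n, size_t si, size_t sj, size_t sk,
                     double *A, double *B, double *C)
{
    for (size_t i = si; i < si + BLOCKSIZE; ++i)
        for (size_t j = sj; j < sj + BLOCKSIZE; ++j) {
            double cij = C[i + j*n];
            for (size_t k = sk; k < sk + BLOCKSIZE; ++k)
                cij += A[i + k*n] * B[k + j*n];
            C[i + j*n] = cij;
        }
}

void dgemm_blocked(size_t n, double *A, double *B, double *C)
{
    for (size_t sj = 0; sj < n; sj += BLOCKSIZE)
        for (size_t si = 0; si < n; si += BLOCKSIZE)
            for (size_t sk = 0; sk < n; sk += BLOCKSIZE)
                do_block(n, si, sj, sk, A, B, C);
}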
Deliverable 5: Multiple processors: Section 6.12 underscores the distribution of the outermost loop's work across 16 cores in DGEMM, showcasing the potential of parallelization. Parallelizing QRD can substantially improve performance, with various approaches documented in the literature, including recursive methods (do some literature review). In MATLAB, a multicore parallelizable QRD implementation can be simulated using parfor. By distributing the QRD computation across multiple threads or cores, each processing a different subset of the data, performance gains can be realized, particularly for larger matrix dimensions. However, for smaller matrices where everything fits within the first-level data cache, parallelization may lead to performance degradation due to increased overheads and resource contention. Hence, careful consideration of matrix dimensions and the underlying architecture is crucial when determining the effectiveness of parallelization for QRD. Propose and test parallelizable QRD algorithms.
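By analogy with the textbook's OpenMP-parallelized DGEMM, one natural starting point (a sketch under our own naming; compile with -fopenmp) is to split the inner j-loop of MGS across cores, since for a fixed k the updates of columns k+1 through N are mutually independent:

#include <stddef.h>

/* One MGS step: orthogonalize columns k+1..n-1 of Q against q_k,
   with the independent column updates distributed across cores. */
void mgs_step_parallel(size_t n, size_t k, double *Q, double *R)
{
    #pragma omp parallel for
    for (size_t j = k + 1; j < n; ++j) {
        double r = 0.0;                        /* r_kj = q_k^T q_j */
        for (size_t i = 0; i < n; ++i)
            r += Q[i + k*n] * Q[i + j*n];
        R[k + j*n] = r;
        for (size_t i = 0; i < n; ++i)         /* q_j = q_j - r_kj q_k */
            Q[i + j*n] -= r * Q[i + k*n];
    }
}

The k-loop itself remains sequential, which is why more sophisticated parallel QRD schemes from the literature (e.g., tiled or recursive ones) restructure the algorithm rather than just the loops.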
Bonus extensions:
The potential extensions for this work are diverse, and innovation in pursuing these extensions is highly encouraged and rewarded with bonus points. Options for extensions include leveraging the optimized QRD codes to simulate QRD-based computations, such as for matrix inversion or zero-forcing data detection with decision feedback in MIMO communication systems. Additionally, exploring alternative implementations of QRD, such as employing Givens rotation (GR) or Householder transformations, presents avenues for further investigation. Another consideration is dealing with complex-valued matrices and implementing the more intricate complex QRD, which poses additional challenges and opportunities for novel approaches and optimizations. Each of these extensions offers a unique opportunity to deepen understanding, develop novel techniques, and enhance the versatility and applicability of QRD algorithms.
References
[1] D. A. Patterson and J. L. Hennessy, Computer Organization and Design RISC-V Edition: The Hardware/Software Interface. Morgan Kaufmann, 2017.
[2] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Baltimore, MD: Johns Hopkins Univ. Press, 1996.
[3] C. Studer, P. Blosch, P. Friedli, and A. Burg, "Matrix decomposition architecture for MIMO systems: Design and implementation trade-offs," in Proc. Forty-First Asilomar Conf. on Signals, Systems and Computers, 2007, pp. 1986-1990.
[4] H. Sarieddeen and M. M. Mansour, "Enhanced low-complexity layer-ordering for MIMO sphere detectors," in Proc. IEEE Int. Conf. Commun. (ICC), May 2016, pp. 1-6.
[5] H. Sarieddeen, M. M. Mansour, and A. Chehab, "Efficient near-optimal 8x8 MIMO detector," in Proc. IEEE Wireless Commun. and Netw. Conf. (WCNC), 2016.
