Question: MATHEMATICS FOR MACHINE LEARNING Marc Peter Deisenroth A. Aldo Faisal Cheng Soon Ong Contents Foreword 1 Part I Mathematical Foundations 9 1 Introduction and Motivation

MATHEMATICS FOR MACHINE LEARNING Marc Peter Deisenroth A. Aldo Faisal Cheng Soon Ong Contents Foreword 1 Part I Mathematical Foundations 9 1 Introduction and Motivation 11 1.1 Finding Words for Intuitions 12 1.2 Two Ways to Read This Book 13 1.3 Exercises and Feedback 16 2 Linear Algebra 17 2.1 Systems of Linear Equations 19 2.2 Matrices 22 2.3 Solving Systems of Linear Equations 27 2.4 Vector Spaces 35 2.5 Linear Independence 40 2.6 Basis and Rank 44 2.7 Linear Mappings 48 2.8 Affine Spaces 61 2.9 Further Reading 63 Exercises 64 3 Analytic Geometry 70 3.1 Norms 71 3.2 Inner Products 72 3.3 Lengths and Distances 75 3.4 Angles and Orthogonality 76 3.5 Orthonormal Basis 78 3.6 Orthogonal Complement 79 3.7 Inner Product of Functions 80 3.8 Orthogonal Projections 81 3.9 Rotations 91 3.10 Further Reading 94 Exercises 96 4 Matrix Decompositions 98 4.1 Determinant and Trace 99 i This material is published by Cambridge University Press as Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view and download for personal use only. Not for re-distribution, re-sale, or use in derivative works. c by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2021. https://mml-book.com. ii Contents 4.2 Eigenvalues and Eigenvectors 105 4.3 Cholesky Decomposition 114 4.4 Eigendecomposition and Diagonalization 115 4.5 Singular Value Decomposition 119 4.6 Matrix Approximation 129 4.7 Matrix Phylogeny 134 4.8 Further Reading 135 Exercises 137 5 Vector Calculus 139 5.1 Differentiation of Univariate Functions 141 5.2 Partial Differentiation and Gradients 146 5.3 Gradients of Vector-Valued Functions 149 5.4 Gradients of Matrices 155 5.5 Useful Identities for Computing Gradients 158 5.6 Backpropagation and Automatic Differentiation 159 5.7 Higher-Order Derivatives 164 5.8 Linearization and Multivariate Taylor Series 165 5.9 Further Reading 170 Exercises 170 6 Probability and Distributions 172 6.1 Construction of a Probability Space 172 6.2 Discrete and Continuous Probabilities 178 6.3 Sum Rule, Product Rule, and Bayes Theorem 183 6.4 Summary Statistics and Independence 186 6.5 Gaussian Distribution 197 6.6 Conjugacy and the Exponential Family 205 6.7 Change of Variables/Inverse Transform 214 6.8 Further Reading 221 Exercises 222 7 Continuous Optimization 225 7.1 Optimization Using Gradient Descent 227 7.2 Constrained Optimization and Lagrange Multipliers 233 7.3 Convex Optimization 236 7.4 Further Reading 246 Exercises 247 Part II Central Machine Learning Problems 249 8 When Models Meet Data 251 8.1 Data, Models, and Learning 251 8.2 Empirical Risk Minimization 258 8.3 Parameter Estimation 265 8.4 Probabilistic Modeling and Inference 272 8.5 Directed Graphical Models 278 Draft (2021-07-29) of Mathematics for Machine Learning. Feedback: https://mml-book.com. Contents iii 8.6 Model Selection 283 9 Linear Regression 289 9.1 Problem Formulation 291 9.2 Parameter Estimation 292 9.3 Bayesian Linear Regression 303 9.4 Maximum Likelihood as Orthogonal Projection 313 9.5 Further Reading 315 10 Dimensionality Reduction with Principal Component Analysis 317 10.1 Problem Setting 318 10.2 Maximum Variance Perspective 320 10.3 Projection Perspective 325 10.4 Eigenvector Computation and Low-Rank Approximations 333 10.5 PCA in High Dimensions 335 10.6 Key Steps of PCA in Practice 336 10.7 Latent Variable Perspective 339 10.8 Further Reading 343 11 Density Estimation with Gaussian Mixture Models 348 11.1 Gaussian Mixture Model 349 11.2 Parameter Learning via Maximum Likelihood 350 11.3 EM Algorithm 360 11.4 Latent-Variable Perspective 363 11.5 Further Reading 368 12 Classification with Support Vector Machines 370 12.1 Separating Hyperplanes 372 12.2 Primal Support Vector Machine 374 12.3 Dual Support Vector Machine 383 12.4 Kernels 388 12.5 Numerical Solution 390 12.6 Further Reading 392 References 395 Index 407 c 2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020). Foreword Machine learning is the latest in a long line of attempts to distill human knowledge and reasoning into a form that is suitable for constructing machines and engineering automated systems. As machine learning becomes more ubiquitous and its software packages become easier to use, it is natural and desirable that the low-level technical details are abstracted away and hidden from the practitioner. However, this brings with it the danger that a practitioner becomes unaware of the design decisions and, hence, the limits of machine learning algorithms. The enthusiastic practitioner who is interested to learn more about the magic behind successful machine learning algorithms currently faces a daunting set of pre-requisite knowledge: Programming languages and data analysis tools Large-scale computation and the associated frameworks Mathematics and statistics and how machine learning builds on it At universities, introductory courses on machine learning tend to spend early parts of the course covering some of these pre-requisites. For historical reasons, courses in machine learning tend to be taught in the computer science department, where students are often trained in the first two areas of knowledge, but not so much in mathematics and statistics. Current machine learning textbooks primarily focus on machine learning algorithms and methodologies and assume that the reader is competent in mathematics and statistics. Therefore, these books only spend one or two chapters on background mathematics, either at the beginning of the book or as appendices. We have found many people who want to delve into the foundations of basic machine learning methods who struggle with the mathematical knowledge required to read a machine learning textbook. Having taught undergraduate and graduate courses at universities, we find that the gap between high school mathematics and the mathematics level required to read a standard machine learning textbook is too big for many people. This book brings the mathematical foundations of basic machine learning concepts to the fore and collects the information in a single place so that this skills gap is narrowed or even closed. 1 This material is published by Cambridge University Press as Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view and download for personal use only. Not for re-distribution, re-sale, or use in derivative works. c by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2021. https://mml-book.com. 2 Foreword Why Another Book on Machine Learning? Machine learning builds upon the language of mathematics to express concepts that seem intuitively obvious but that are surprisingly difficult to formalize. Once formalized properly, we can gain insights into the task we want to solve. One common complaint of students of mathematics around the globe is that the topics covered seem to have little relevance to practical problems. We believe that machine learning is an obvious and direct motivation for people to learn mathematics. This book is intended to be a guidebook to the vast mathematical lit- Math is linked in erature that forms the foundations of modern machine learning. We mothe popular mind with phobia and anxiety. Youd think were discussing spiders. (Strogatz, 2014, page 281) tivate the need for mathematical concepts by directly pointing out their usefulness in the context of fundamental machine learning problems. In the interest of keeping the book short, many details and more advanced concepts have been left out. Equipped with the basic concepts presented here, and how they fit into the larger context of machine learning, the reader can find numerous resources for further study, which we provide at the end of the respective chapters. For readers with a mathematical background, this book provides a brief but precisely stated glimpse of machine learning. In contrast to other books that focus on methods and models of machine learning (MacKay, 2003; Bishop, 2006; Alpaydin, 2010; Barber, 2012; Murphy, 2012; Shalev-Shwartz and Ben-David, 2014; Rogers and Girolami, 2016) or programmatic aspects of machine learning (Muller and Guido, 2016; Raschka and Mirjalili, 2017; Chollet and Allaire, 2018), we provide only four representative examples of machine learning algorithms. Instead, we focus on the mathematical concepts behind the models themselves. We hope that readers will be able to gain a deeper understanding of the basic questions in machine learning and connect practical questions arising from the use of machine learning with fundamental choices in the mathematical model. We do not aim to write a classical machine learning book. Instead, our intention is to provide the mathematical background, applied to four central machine learning problems, to make it easier to read other machine learning textbooks. Who Is the Target Audience? As applications of machine learning become widespread in society, we believe that everybody should have some understanding of its underlying principles. This book is written in an academic mathematical style, which enables us to be precise about the concepts behind machine learning. We encourage readers unfamiliar with this seemingly terse style to persevere and to keep the goals of each topic in mind. We sprinkle comments and remarks throughout the text, in the hope that it provides useful guidance with respect to the big picture. The book assumes the reader to have mathematical knowledge commonly Draft (2021-07-29) of Mathematics for Machine Learning. Feedback: https://mml-book.com. Foreword 3 covered in high school mathematics and physics. For example, the reader should have seen derivatives and integrals before, and geometric vectors in two or three dimensions. Starting from there, we generalize these concepts. Therefore, the target audience of the book includes undergraduate university students, evening learners and learners participating in online machine learning courses. In analogy to music, there are three types of interaction that people have with machine learning: Astute Listener The democratization of machine learning by the provision of open-source software, online tutorials and cloud-based tools allows users to not worry about the specifics of pipelines. Users can focus on extracting insights from data using off-the-shelf tools. This enables nontech-savvy domain experts to benefit from machine learning. This is similar to listening to music; the user is able to choose and discern between different types of machine learning, and benefits from it. More experienced users are like music critics, asking important questions about the application of machine learning in society such as ethics, fairness, and privacy of the individual. We hope that this book provides a foundation for thinking about the certification and risk management of machine learning systems, and allows them to use their domain expertise to build better machine learning systems. Experienced Artist Skilled practitioners of machine learning can plug and play different tools and libraries into an analysis pipeline. The stereotypical practitioner would be a data scientist or engineer who understands machine learning interfaces and their use cases, and is able to perform wonderful feats of prediction from data. This is similar to a virtuoso playing music, where highly skilled practitioners can bring existing instruments to life and bring enjoyment to their audience. Using the mathematics presented here as a primer, practitioners would be able to understand the benefits and limits of their favorite method, and to extend and generalize existing machine learning algorithms. We hope that this book provides the impetus for more rigorous and principled development of machine learning methods. Fledgling Composer As machine learning is applied to new domains, developers of machine learning need to develop new methods and extend existing algorithms. They are often researchers who need to understand the mathematical basis of machine learning and uncover relationships between different tasks. This is similar to composers of music who, within the rules and structure of musical theory, create new and amazing pieces. We hope this book provides a high-level overview of other technical books for people who want to become composers of machine learning. There is a great need in society for new researchers who are able to propose and explore novel approaches for attacking the many challenges of learning from data. c 2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020). 4 Foreword Acknowledgments We are grateful to many people who looked at early drafts of the book and suffered through painful expositions of concepts. We tried to implement their ideas that we did not vehemently disagree with. We would like to especially acknowledge Christfried Webers for his careful reading of many parts of the book, and his detailed suggestions on structure and presentation. Many friends and colleagues have also been kind enough to provide their time and energy on different versions of each chapter. We have been lucky to benefit from the generosity of the online community, who have suggested improvements via https://github.com, which greatly improved the book. The following people have found bugs, proposed clarifications and suggested relevant literature, either via https://github.com or personal communication. Their names are sorted alphabetically. Abdul-Ganiy Usman Adam Gaier Adele Jackson Aditya Menon Alasdair Tran Aleksandar Krnjaic Alexander Makrigiorgos Alfredo Canziani Ali Shafti Amr Khalifa Andrew Tanggara Angus Gruen Antal A. Buss Antoine Toisoul Le Cann Areg Sarvazyan Artem Artemev Artyom Stepanov Bill Kromydas Bob Williamson Boon Ping Lim Chao Qu Cheng Li Chris Sherlock Christopher Gray Daniel McNamara Daniel Wood Darren Siegel David Johnston Dawei Chen Ellen Broad Fengkuangtian Zhu Fiona Condon Georgios Theodorou He Xin Irene Raissa Kameni Jakub Nabaglo James Hensman Jamie Liu Jean Kaddour Jean-Paul Ebejer Jerry Qiang Jitesh Sindhare John Lloyd Jonas Ngnawe Jon Martin Justin Hsi Kai Arulkumaran Kamil Dreczkowski Lily Wang Lionel Tondji Ngoupeyou Lydia Knufing Mahmoud Aslan Mark Hartenstein Mark van der Wilk Markus Hegland Martin Hewing Matthew Alger Matthew Lee Draft (2021-07-29) of Mathematics for Machine Learning. Feedback: https://mml-book.com. Foreword 5 Maximus McCann Mengyan Zhang Michael Bennett Michael Pedersen Minjeong Shin Mohammad Malekzadeh Naveen Kumar Nico Montali Oscar Armas Patrick Henriksen Patrick Wieschollek Pattarawat Chormai Paul Kelly Petros Christodoulou Piotr Januszewski Pranav Subramani Quyu Kong Ragib Zaman Rui Zhang Ryan-Rhys Griffiths Salomon Kabongo Samuel Ogunmola Sandeep Mavadia Sarvesh Nikumbh Sebastian Raschka Senanayak Sesh Kumar Karri Seung-Heon Baek Shahbaz Chaudhary Shakir Mohamed Shawn Berry Sheikh Abdul Raheem Ali Sheng Xue Sridhar Thiagarajan Syed Nouman Hasany Szymon Brych Thomas Buhler Timur Sharapov Tom Melamed Vincent Adam Vincent Dutordoir Vu Minh Wasim Aftab Wen Zhi Wojciech Stokowiec Xiaonan Chong Xiaowei Zhang Yazhou Hao Yicheng Luo Young Lee Yu Lu Yun Cheng Yuxiao Huang Zac Cranko Zijian Cao Zoe Nolan Contributors through GitHub, whose real names were not listed on their GitHub profile, are: SamDataMad bumptiousmonkey idoamihai deepakiim insad HorizonP cs-maillist kudo23 empet victorBigand 17SKYE jessjing1995 We are also very grateful to Parameswaran Raman and the many anonymous reviewers, organized by Cambridge University Press, who read one or more chapters of earlier versions of the manuscript, and provided constructive criticism that led to considerable improvements. A special mention goes to Dinesh Singh Negi, our LATEX support, for detailed and prompt advice about LATEX-related issues. Last but not least, we are very grateful to our editor Lauren Cowles, who has been patiently guiding us through the gestation process of this book. c 2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020). 6 Foreword Table of Symbols Symbol Typical meaning a, b, c, a, , ? Scalars are lowercase x, y, z Vectors are bold lowercase A, B, C Matrices are bold uppercase x >, A > Transpose of a vector or matrix A -1 Inverse of a matrix hx, yi Inner product of x and y x >y Dot product of x and y B = (b1, b2, b3) (Ordered) tuple B = [b1, b2, b3] Matrix of column vectors stacked horizontally B = {b1, b2, b3} Set of vectors (unordered) Z, N Integers and natural numbers, respectively R, C Real and complex numbers, respectively Rn n-dimensional vector space of real numbers ?x Universal quantifier: for all x ?x Existential quantifier: there exists x a := b a is defined as b a =: b b is defined as a a ? b a is proportional to b, i.e., a = constant b g ? f Function composition: g after f ?? If and only if =? Implies A, C Sets a ? A a is an element of set A Empty set A\B A without B: the set of elements in A but not in B D Number of dimensions; indexed by d = 1, . . . , D N Number of data points; indexed by n = 1, . . . , N Im Identity matrix of size m m 0m,n Matrix of zeros of size m n 1m,n Matrix of ones of size m n ei Standard/canonical vector (where i is the component that is 1) dim Dimensionality of vector space rk(A) Rank of matrix A Im(F) Image of linear mapping F ker(F) Kernel (null space) of a linear mapping F span[b1] Span (generating set) of b1 tr(A) Trace of A det(A) Determinant of A | | Absolute value or determinant (depending on context) kk Norm; Euclidean, unless specified ? Eigenvalue or Lagrange multiplier E? Eigenspace corresponding to eigenvalue ? Draft (2021-07-29) of Mathematics for Machine Learning. Feedback: https://mml-book.com. Foreword 7 Symbol Typical meaning x ? y Vectors x and y are orthogonal V Vector space V ? Orthogonal complement of vector space V PN Qn=1 xn Sum of the xn: x1 + . . . + xN N n=1 xn Product of the xn: x1 . . . xN ? Parameter vector ?f ?x Partial derivative of f with respect to x df dx Total derivative of f with respect to x ? Gradient f* = minx f(x) The smallest function value of f x* ? arg minx f(x) The value x* that minimizes f (note: arg min returns a set of values) L Lagrangian L Negative log-likelihood n k Binomial coefficient, n choose k VX[x] Variance of x with respect to the random variable X EX[x] Expectation of x with respect to the random variable X CovX,Y [x, y] Covariance between x and y. X ?? Y |Z X is conditionally independent of Y given Z X ~ p Random variable X is distributed according to p N , S Gaussian distribution with mean and covariance S Ber() Bernoulli distribution with parameter Bin(N, ) Binomial distribution with parameters N, Beta(a, ) Beta distribution with parameters a, Table of Abbreviations and Acronyms Acronym Meaning e.g. Exempli gratia (Latin: for example) GMM Gaussian mixture model i.e. Id est (Latin: this means) i.i.d. Independent, identically distributed MAP Maximum a posteriori MLE Maximum likelihood estimation/estimator ONB Orthonormal basis PCA Principal component analysis PPCA Probabilistic principal component analysis REF Row-echelon form SPD Symmetric, positive definite SVM Support vector machine c 2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020 Part I Mathematical Foundations 9 This material is published by Cambridge University Press as Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view and download for personal use only. Not for re-distribution, re-sale, or use in derivative works. c by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2021. https://mml-book.com. 1 Introduction and Motivation Machine learning is about designing algorithms that automatically extract valuable information from data. The emphasis here is on automatic, i.e., machine learning is concerned about general-purpose methodologies that can be applied to many datasets, while producing something that is meaningful. There are three concepts that are at the core of machine learning: data, a model, and learning. Since machine learning is inherently data driven, data is at the core data of machine learning. The goal of machine learning is to design generalpurpose methodologies to extract valuable patterns from data, ideally without much domain-specific expertise. For example, given a large corpus of documents (e.g., books in many libraries), machine learning methods can be used to automatically find relevant topics that are shared across documents (Hoffman et al., 2010). To achieve this goal, we design models that are typically related to the process that generates data, similar to model the dataset we are given. For example, in a regression setting, the model would describe a function that maps inputs to real-valued outputs. To paraphrase Mitchell (1997): A model is said to learn from data if its performance on a given task improves after the data is taken into account. The goal is to find good models that generalize well to yet unseen data, which we may care about in the future. Learning can be understood as a learning way to automatically find patterns and structure in data by optimizing the parameters of the model. While machine learning has seen many success stories, and software is readily available to design and train rich and flexible machine learning systems, we believe that the mathematical foundations of machine learning are important in order to understand fundamental principles upon which more complicated machine learning systems are built. Understanding these principles can facilitate creating new machine learning solutions, understanding and debugging existing approaches, and learning about the inherent assumptions and limitations of the methodologies we are working with. 11 This material is published by Cambridge University Press as Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view and download for personal use only. Not for re-distribution, re-sale, or use in derivative works. c by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2021. https://mml-book.com. 12 Introduction and Motivation 1.1 Finding Words for Intuitions A challenge we face regularly in machine learning is that concepts and words are slippery, and a particular component of the machine learning system can be abstracted to different mathematical concepts. For example, the word algorithm is used in at least two different senses in the context of machine learning. In the first sense, we use the phrase machine learning algorithm to mean a system that makes predictions based on inpredictor put data. We refer to these algorithms as predictors. In the second sense, we use the exact same phrase machine learning algorithm to mean a system that adapts some internal parameters of the predictor so that it performs well on future unseen input data. Here we refer to this adaptatraining tion as training a system. This book will not resolve the issue of ambiguity, but we want to highlight upfront that, depending on the context, the same expressions can mean different things. However, we attempt to make the context sufficiently clear to reduce the level of ambiguity. The first part of this book introduces the mathematical concepts and foundations needed to talk about the three main components of a machine learning system: data, models, and learning. We will briefly outline these components here, and we will revisit them again in Chapter 8 once we have discussed the necessary mathematical concepts. While not all data is numerical, it is often useful to consider data in a number format. In this book, we assume that data has already been appropriately converted into a numerical representation suitable for readdata as vectors ing into a computer program. Therefore, we think of data as vectors. As another illustration of how subtle words are, there are (at least) three different ways to think about vectors: a vector as an array of numbers (a computer science view), a vector as an arrow with a direction and magnitude (a physics view), and a vector as an object that obeys addition and scaling (a mathematical view). model A model is typically used to describe a process for generating data, similar to the dataset at hand. Therefore, good models can also be thought of as simplified versions of the real (unknown) data-generating process, capturing aspects that are relevant for modeling the data and extracting hidden patterns from it. A good model can then be used to predict what would happen in the real world without performing real-world experiments. learning We now come to the crux of the matter, the learning component of machine learning. Assume we are given a dataset and a suitable model. Training the model means to use the data available to optimize some parameters of the model with respect to a utility function that evaluates how well the model predicts the training data. Most training methods can be thought of as an approach analogous to climbing a hill to reach its peak. In this analogy, the peak of the hill corresponds to a maximum of some Draft (2021-07-29) of Mathematics for Machine Learning. Feedback: https://mml-book.com. 1.2 Two Ways to Read This Book 13 desired performance measure. However, in practice, we are interested in the model to perform well on unseen data. Performing well on data that we have already seen (training data) may only mean that we found a good way to memorize the data. However, this may not generalize well to unseen data, and, in practical applications, we often need to expose our machine learning system to situations that it has not encountered before. Let us summarize the main concepts of machine learning that we cover in this book: We represent data as vectors. We choose an appropriate model, either using the probabilistic or optimization view. We learn from available data by using numerical optimization methods with the aim that the model performs well on data not used for training. 1.2 Two Ways to Read This Book We can consider two strategies for understanding the mathematics for machine learning: Bottom-up: Building up the concepts from foundational to more advanced. This is often the preferred approach in more technical fields, such as mathematics. This strategy has the advantage that the reader at all times is able to rely on their previously learned concepts. Unfortunately, for a practitioner many of the foundational concepts are not particularly interesting by themselves, and the lack of motivation means that most foundational definitions are quickly forgotten. Top-down: Drilling down from practical needs to more basic requirements. This goal-driven approach has the advantage that the readers know at all times why they need to work on a particular concept, and there is a clear path of required knowledge. The downside of this strategy is that the knowledge is built on potentially shaky foundations, and the readers have to remember a set of words that they do not have any way of understanding. We decided to write this book in a modular way to separate foundational (mathematical) concepts from applications so that this book can be read in both ways. The book is split into two parts, where Part I lays the mathematical foundations and Part II applies the concepts from Part I to a set of fundamental machine learning problems, which form four pillars of machine learning as illustrated in Figure 1.1: regression, dimensionality reduction, density estimation, and classification. Chapters in Part I mostly build upon the previous ones, but it is possible to skip a chapter and work backward if necessary. Chapters in Part II are only loosely coupled and can be read in any order. There are many pointers forward and backward c 2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020). 14 Introduction and Motivation Figure 1.1 The foundations and four pillars of machine learning. Classification Density Regression Dimensionality Estimation Reduction Machine Learning Vector Calculus Probability & Distributions Optimization Linear Algebra Analytic Geometry Matrix Decomposition between the two parts of the book to link mathematical concepts with machine learning algorithms. Of course there are more than two ways to read this book. Most readers learn using a combination of top-down and bottom-up approaches, sometimes building up basic mathematical skills before attempting more complex concepts, but also choosing topics based on applications of machine learning. Part I Is about Mathematics The four pillars of machine learning we cover in this book (see Figure 1.1) require a solid mathematical foundation, which is laid out in Part I. We represent numerical data as vectors and represent a table of such data as a matrix. The study of vectors and matrices is called linear algebra, linear algebra which we introduce in Chapter 2. The collection of vectors as a matrix is also described there. Given two vectors representing two objects in the real world, we want to make statements about their similarity. The idea is that vectors that are similar should be predicted to have similar outputs by our machine learning algorithm (our predictor). To formalize the idea of similarity between vectors, we need to introduce operations that take two vectors as input and return a numerical value representing their similarity. The conanalytic geometry struction of similarity and distances is central to analytic geometry and is discussed in Chapter 3. In Chapter 4, we introduce some fundamental concepts about matrimatrix ces and matrix decomposition. Some operations on matrices are extremely decomposition useful in machine learning, and they allow for an intuitive interpretation of the data and more efficient learning. We often consider data to be noisy observations of some true underlying signal. We hope that by applying machine learning we can identify the signal from the noise. This requires us to have a language for quantifying what noise means. We often would also like to have predictors that Draft (2021-07-29) of Mathematics for Machine Learning. Feedback: https://mml-book.com. 1.2 Two Ways to Read This Book 15 allow us to express some sort of uncertainty, e.g., to quantify the confidence we have about the value of the prediction at a particular test data point. Quantification of uncertainty is the realm of probability theory and probability theory is covered in Chapter 6. To train machine learning models, we typically find parameters that maximize some performance measure. Many optimization techniques require the concept of a gradient, which tells us the direction in which to search for a solution. Chapter 5 is about vector calculus and details the vector calculus concept of gradients, which we subsequently use in Chapter 7, where we talk about optimization to find maxima/minima of functions. optimization Part II Is about Machine Learning The second part of the book introduces four pillars of machine learning as shown in Figure 1.1. We illustrate how the mathematical concepts introduced in the first part of the book are the foundation for each pillar. Broadly speaking, chapters are ordered by difficulty (in ascending order). In Chapter 8, we restate the three components of machine learning (data, models, and parameter estimation) in a mathematical fashion. In addition, we provide some guidelines for building experimental set-ups that guard against overly optimistic evaluations of machine learning systems. Recall that the goal is to build a predictor that performs well on unseen data. In Chapter 9, we will have a close look at linear regression, where our linear regression objective is to find functions that map inputs x ? RD to corresponding observed function values y ? R, which we can interpret as the labels of their respective inputs. We will discuss classical model fitting (parameter estimation) via maximum likelihood and maximum a posteriori estimation, as well as Bayesian linear regression, where we integrate the parameters out instead of optimizing them. Chapter 10 focuses on dimensionality reduction, the second pillar in Fig- dimensionality ure 1.1 reduction , using principal component analysis. The key objective of dimensionality reduction is to find a compact, lower-dimensional representation of high-dimensional data x ? RD, which is often easier to analyze than the original data. Unlike regression, dimensionality reduction is only concerned about modeling the data there are no labels associated with a data point x. In Chapter 11, we will move to our third pillar: density estimation. The density estimation objective of density estimation is to find a probability distribution that describes a given dataset. We will focus on Gaussian mixture models for this purpose, and we will discuss an iterative scheme to find the parameters of this model. As in dimensionality reduction, there are no labels associated with the data points x ? RD. However, we do not seek a low-dimensional representation of the data. Instead, we are interested in a density model that describes the data. Chapter 12 concludes the book with an in-depth discussion of the fourth c 2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020). 16 Introduction and Motivation classification pillar: classification. We will discuss classification in the context of support vector machines. Similar to regression (Chapter 9), we have inputs x and corresponding labels y. However, unlike regression, where the labels were real-valued, the labels in classification are integers, which requires special care. 1.3 Exercises and Feedback We provide some exercises in Part I, which can be done mostly by pen and paper. For Part II, we provide programming tutorials (jupyter notebooks) to explore some properties of the machine learning algorithms we discuss in this book. We appreciate that Cambridge University Press strongly supports our aim to democratize education and learning by making this book freely available for download at https://mml-book.com where tutorials, errata, and additional materials can be found. Mistakes can be reported and feedback provided using the preceding URL. Draft (2021-07-29) of Mathematics for Machine Learning. Feedback: https://mml-book.com. 2 Linear Algebra When formalizing intuitive concepts, a common approach is to construct a set of objects (symbols) and a set of rules to manipulate these objects. This is known as an algebra. Linear algebra is the study of vectors and certain algebra rules to manipulate vectors. The vectors many of us know from school are called geometric vectors, which are usually denoted by a small arrow above the letter, e.g., -?x and -?y . In this book, we discuss more general concepts of vectors and use a bold letter to represent them, e.g., x and y. In general, vectors are special objects that can be added together and multiplied by scalars to produce another object of the same kind. From an abstract mathematical viewpoint, any object that satisfies these two properties can be considered a vector. Here are some examples of such vector objects: 1. Geometric vectors. This example of a vector may be familiar from high school mathematics and physics. Geometric vectors see Figure 2.1(a) are directed segments, which can be drawn (at least in two dimensions). Two geometric vectors ? x, ? y can be added, such that ? x+ ? y = ? z is another geometric vector. Furthermore, multiplication by a scalar ? ? x, ? ? R, is also a geometric vector. In fact, it is the original vector scaled by ?. Therefore, geometric vectors are instances of the vector concepts introduced previously. Interpreting vectors as geometric vectors enables us to use our intuitions about direction and magnitude to reason about mathematical operations. 2. Polynomials are also vectors; see Figure 2.1(b): Two polynomials can Figure 2.1 Different types of vectors. Vectors can be surprising objects, including (a) geometric vectors and (b) polynomials. ? x ? y ? x + ? y (a) Geometric vectors. -2 0 2 x -6 -4 -2 0 2 4 y (b) Polynomials. 17 This material is published by Cambridge University Press as Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view and download for personal use only. Not for re-distribution, re-sale, or use in derivative works. c by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2021. https://mml-book.com. 18 Linear Algebra be added together, which results in another polynomial; and they can be multiplied by a scalar ? ? R, and the result is a polynomial as well. Therefore, polynomials are (rather unusual) instances of vectors. Note that polynomials are very different from geometric vectors. While geometric vectors are concrete drawings, polynomials are abstract concepts. However, they are both vectors in the sense previously described. 3. Audio signals are vectors. Audio signals are represented as a series of numbers. We can add audio signals together, and their sum is a new audio signal. If we scale an audio signal, we also obtain an audio signal. Therefore, audio signals are a type of vector, too. 4. Elements of Rn (tuples of n real numbers) are vectors. Rn is more abstract than polynomials, and it is the concept we focus on in this book. For instance, a = ? ? 1 2 3 ? ? ? R 3 (2.1) is an example of a triplet of numbers. Adding two vectors a, b ? Rn component-wise results in another vector: a + b = c ? Rn . Moreover, multiplying a ? Rn by ? ? R results in a scaled vector ?a ? Rn . Considering vectors as elements of Rn Be careful to check has an additional benefit that whether array operations actually perform vector operations when implementing on a computer. it loosely corresponds to arrays of real numbers on a computer. Many programming languages support array operations, which allow for convenient implementation of algorithms that involve vector operations. Linear algebra focuses on the similarities between these vector concepts. We can add them together and multiply them by scalars. We will largely Pavel Grinfelds series on linear algebra: http://tinyurl. com/nahclwm Gilbert Strangs course on linear algebra: http://tinyurl. com/29p5q8j 3Blue1Brown series on linear algebra: https://tinyurl. com/h5g4kps focus on vectors in Rn since most algorithms in linear algebra are formulated in Rn . We will see in Chapter 8 that we often consider data to be represented as vectors in Rn . In this book, we will focus on finitedimensional vector spaces, in which case there is a 1:1 correspondence between any kind of vector and Rn . When it is convenient, we will use intuitions about geometric vectors and consider array-based algorithms. One major idea in mathematics is the idea of closure. This is the question: What is the set of all things that can result from my proposed operations? In the case of vectors: What is the set of vectors that can result by starting with a small set of vectors, and adding them to each other and scaling them? This results in a vector space (Section 2.4). The concept of a vector space and its properties underlie much of machine learning. The concepts introduced in this chapter are summarized in Figure 2.2. This chapter is mostly based on the lecture notes and books by Drumm and Weil (2001), Strang (2003), Hogben (2013), Liesen and Mehrmann (2015), as well as Pavel Grinfelds Linear Algebra series. Other excellent Draft (2021-07-29) of Mathematics for Machine Learning. Feedback: https://mml-book.com. 2.1 Systems of Linear Equations 19 Figure 2.2 A mind map of the concepts introduced in this chapter, along with where they are used in other parts of the book. Vector Vector space Matrix Chapter 5 Vector calculus Group System of linear equations Matrix inverse Gaussian elimination Linear/affine mapping Linear independence Basis Chapter 10 Dimensionality reduction Chapter 12 Classification Chapter 3 Analytic geometry composes closure Abelian with + represents represents solved by solves property of maximal set resources are Gilbert Strangs Linear Algebra course at MIT and the Linear Algebra Series by 3Blue1Brown. Linear algebra plays an important role in machine learning and general mathematics. The concepts introduced in this chapter are further expanded to include the idea of geometry in Chapter 3. In Chapter 5, we will discuss vector calculus, where a principled knowledge of matrix operations is essential. In Chapter 10, we will use projections (to be introduced in Section 3.8) for dimensionality reduction with principal component analysis (PCA). In Chapter 9, we will discuss linear regression, where linear algebra plays a central role for solving least-squares problems. 2.1 Systems of Linear Equations Systems of linear equations play a central part of linear algebra. Many problems can be formulated as systems of linear equations, and linear algebra gives us the tools for solving them. Example 2.1 A company produces products N1, . . . , Nn for which resources R1, . . . , Rm are required. To produce a unit of product Nj , aij units of resource Ri are needed, where i = 1, . . . , m and j = 1, . . . , n. The objective is to find an optimal production plan, i.e., a plan of how many units xj of product Nj should be produced if a total of bi units of resource Ri are available and (ideally) no resources are left over. If we produce x1, . . . , xn units of the corresponding products, we need c 2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020). 20 Linear Algebra a total of ai1x1 + + ainxn (2.2) many units of resource Ri . An optimal production plan (x1, . . . , xn) ? Rn , therefore, has to satisfy the following system of equations: a11x1 + + a1nxn = b1 . . . am1x1 + + amnxn = bm , (2.3) where aij ? R and bi ? R. system of linear Equation (2.3) is the general form of a system of linear equations, and equations x1, . . . , xn are the unknowns of this system. Every n-tuple (x1, . . . , xn) ? Rn solution that satisfies (2.3) is a solution of the linear equation system. Example 2.2 The system of linear equations x1 + x2 + x3 = 3 (1) x1 - x2 + 2x3 = 2 (2) 2x1 + 3x3 = 1 (3) (2.4) has no solution: Adding the first two equations yields 2x1+3x3 = 5, which contradicts the third equation (3). Let us have a look at the system of linear equations x1 + x2 + x3 = 3 (1) x1 - x2 + 2x3 = 2 (2) x2 + x3 = 2 (3) . (2.5) From the first and third equation, it follows that x1 = 1. From (1)+(2), we get 2x1 + 3x3 = 5, i.e., x3 = 1. From (3), we then get that x2 = 1. Therefore, (1, 1, 1) is the only possible and unique solution (verify that (1, 1, 1) is a solution by plugging in). As a third example, we consider x1 + x2 + x3 = 3 (1) x1 - x2 + 2x3 = 2 (2) 2x1 + 3x3 = 5 (3) . (2.6) Since (1)+(2)=(3), we can omit the third equation (redundancy). From (1) and (2), we get 2x1 = 5-3x3 and 2x2 = 1+x3. We define x3 = a ? R as a free variable, such that any triplet 5 2 - 3 2 a, 1 2 + 1 2 a, a , a ? R (2.7) Draft (2021-07-29) of Mathematics for Machine Learning. Feedback: https://mml-book.com. 2.1 Systems of Linear Equations 21 Figure 2.3 The solution space of a system of two linear equations with two variables can be geometrically interpreted as the intersection of two lines. Every linear equation represents a line. 2x1 - 4x2 = 1 4x1 + 4x2 = 5 x1 x2 is a solution of the system of linear equations, i.e., we obtain a solution set that contains infinitely many solutions. In general, for a real-valued system of linear equations we obtain either no, exactly one, or infinitely many solutions. Linear regression (Chapter 9) solves a version of Example 2.1 when we cannot solve the system of linear equations. Remark (Geometric Interpretation of Systems of Linear Equations). In a system of linear equations with two variables x1, x2, each linear equation defines a line on the x1x2-plane. Since a solution to a system of linear equations must satisfy all equations simultaneously, the solution set is the intersection of these lines. This intersection set can be a line (if the linear equations describe the same line), a point, or empty (when the lines are parallel). An illustration is given in Figure 2.3 for the system 4x1 + 4x2 = 5 2x1 - 4x2 = 1 (2.8) where the solution space is the point (x1, x2) = (1, 1 4 ). Similarly, for three variables, each linear equation determines a plane in three-dimensional space. When we intersect these planes, i.e., satisfy all linear equations at the same time, we can obtain a solution set that is a plane, a line, a point or empty (when the planes have no common intersection). ? For a systematic approach to solving systems of linear equations, we will introduce a useful compact notation. We collect the coefficients aij into vectors and collect the vectors into matrices. In other words, we write the system from (2.3) in the following form: ? ? ? a11 . . . am1 ? ? ? x1 + ? ? ? a12 . . . am2 ? ? ? x2 + + ? ? ? a1n . . . amn ? ? ? xn = ? ? ? b1 . . . bm ? ? ? (2.9) c 2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020). 22 Linear Algebra ?? ? ? ? a11 a1n . . . . . . am1 amn ? ? ? ? ? ? x1 . . . xn ? ? ? = ? ? ? b1 . . . bm ? ? ? . (2.10) In the following, we will have a close look at these matrices and define computation rules. We will return to solving linear equations in Section 2.3. 2.2 Matrices Matrices play a central role in linear algebra. They can be used to compactly represent systems of linear equations, but they also represent linear functions (linear mappings) as we will see later in Section 2.7. Before we discuss some of these interesting topics, let us first define what a matrix is and what kind of operations we can do with matrices. We will see more properties of matrices in Chapter 4. matrix Definition 2.1 (Matrix). With m, n ? N a real-valued (m, n) matrix A is an mn-tuple of elements aij , i = 1, . . . , m, j = 1, . . . , n, which is ordered according to a rectangular scheme consisting of m rows and n columns: A = ? ? ? ? ? a11 a12 a1n a21 a22 a2n . . . . . . . . . am1 am2 amn ? ? ? ? ? , aij ? R . (2.11) row By convention (1, n)-matrices are called rows and (m, 1)-matrices are called column columns. These special matrices are also called row/column vectors. row vector column vector Figure 2.4 By stacking its columns, a matrix A can be represented as a long vector a. re-shape A ? R 42 a ? R 8 Rmn is the set of all real-valued (m, n)-matrices. A ? Rmn can be equivalently represented as a ? Rmn by stacking all n columns of the matrix into a long vector; see Figure 2.4. 2.2.1 Matrix Addition and Multiplication The sum of two matrices A ? Rmn , B ? Rmn is defined as the elementwise sum, i.e., A + B := ? ? ? a11 + b11 a1n + b1n . . . . . . am1 + bm1 amn + bmn ? ? ? ? R mn . (2.12) For matrices A ? Rmn , B ? Rnk Note the size of the , the elements cij of the product matrices. C = AB ? Rmk are computed as C = np.einsum(il, lj, A, B) cij = Xn l=1 ailblj , i = 1, . . . , m, j = 1, . . . , k. (2.13) Draft (2021-07-29) of Mathematics for Machine Learning. Feedback: https://mml-book.com. 2.2 Matrices 23 This means, to compute element cij we multiply the elements of the ith There are n columns in A and n rows in B so that we can compute ailblj for l = 1, . . . , n. Commonly, the dot product between two vectors a, b is denoted by a>b or ha, bi. row of A with the jth column of B and sum them up. Later in Section 3.2, we will call this the dot product of the corresponding row and column. In cases, where we need to be explicit that we are performing multiplication, we use the notation A B to denote multiplication (explicitly showing ). Remark. Matrices can only be multiplied if their neighboring dimensions match. For instance, an n k-matrix A can be multiplied with a k mmatrix B, but only from the left side: |{z} A nk |{z} B km = |{z} C nm (2.14) The product BA is not defined if m 6= n since the neighboring dimensions do not match. ? Remark. Matrix multiplication is not defined as an element-wise operation on matrix elements, i.e., cij 6= aij bij (even if the size of A, B was chosen appropriately). This kind of element-wise multiplication often appears in programming languages when we multiply (multi-dimensional) arrays with each other, and is called a Hadamard product. ? Hadamard product Example 2.3 For A = 1 2 3 3 2 1 ? R23 , B = ? ? 0 2 1 -1 0 1 ? ? ? R32 , we obtain AB = 1 2 3 3 2 1 ? ? 0 2 1 -1 0 1 ? ? = 2 3 2 5 ? R 22 , (2.15) BA = ? ? 0 2 1 -1 0 1 ? ? 1 2 3 3 2 1 = ? ? 6 4 2 -2 0 2 3 2 1 ? ? ? R 33 . (2.16) Figure 2.5 Even if both matrix multiplications AB and BA are defined, the dimensions of the results can be different. From this example, we can already see that matrix multiplication is not commutative, i.e., AB 6= BA; see also Figure 2.5 for an illustration. Definition 2.2 (Identity Matrix). In Rnn , we define the identity matrix identity matrix In := ? ? ? ? ? ? ? ? ? ? 1 0 0 0 0 1 0 0 . . . . . . . . . . . . . . . . . . 0 0 1 0 . . . . . . . . . . . . . . . . . . 0 0 0 1 ? ? ? ? ? ? ? ? ? ? ? R nn (2.17) c 2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020). 24 Linear Algebra as the n n-matrix containing 1 on the diagonal and 0 everywhere else. Now that we defined matrix multiplication, matrix addition and the identity matrix, let us have a look at some properties of matrices: associativity Associativity: ?A ? R mn , B ? R np , C ? R pq : (AB)C = A(BC) (2.18) distributivity Distributivity: ?A, B ? R mn , C, D ? R np : (A + B)C = AC + BC (2.19a) A(C + D) = AC + AD (2.19b) Multiplication with the identity matrix: ?A ? R mn : ImA = AIn = A (2.20) Note that Im 6= In for m 6= n. 2.2.2 Inverse and Transpose Definition 2.3 (Inverse). Consider a square matrix A ? Rnn A square matrix . Let matrix possesses the same number of columns and rows. B ? Rnn have the property that AB = In = BA. B is called the inverse of A and denoted by A -1 . inverse Unfortunately, not every matrix A possesses an inverse A -1 . If this regular inverse does exist, A is called regular/invertible/nonsingular, otherwise invertible nonsingular singular/noninvertible. When the matrix inverse exists, it is unique. In Secsingular noninvertible tion 2.3, we will discuss a general way to compute the inverse of a matrix by solving a system of linear equations. Remark (Existence of the Inverse of a 2 2-matrix). Consider a matrix A := a11 a12 a21 a22 ? R 22 . (2.21) If we multiply A with A 0 := a22 -a12 -a21 a11 (2.22) we obtain AA0 = a11a22 - a12a21 0 0 a11a22 - a12a21 = (a11a22 - a12a21)I . (2.23) Therefore, A -1 = 1 a11a22 - a12a21 a22 -a12 -a21 a11 (2.24) if and only if a11a22 -a12a21 6= 0. In Section 4.1, we will see that a11a22 - Draft (2021-07-29) of Mathematics for Machine Learning. Feedback: https://mml-book.com. 2.2 Matrices 25 a12a21 is the determinant of a 22-matrix. Furthermore, we can generally use the determinant to check whether a matrix is invertible. ? Example 2.4 (Inverse Matrix) The matrices A = ? ? 1 2 1 4 4 5 6 7 7 ? ? , B = ? ? -7 -7 6 2 1 -1 4 5 -4 ? ? (2.25) are inverse to each other since AB = I = BA. Definition 2.4 (Transpose). For A ? Rmn the matrix B ? Rnm with bij = aji is called the transpose of A. We write B = A > . transpose The main diagonal (sometimes called principal diagonal, primary diagonal, leading diagonal, or major diagonal) of a matrix A is the collection of entries Aij where i = j. In general, A > can be obtained by writing the columns of A as the rows of A > . The following are important properties of inverses and transposes: The scalar case of (2.28) is 1 2+4 = 1 6 6= 1 2 + 1 4 . AA-1 = I = A -1A (2.26) (AB) -1 = B -1A -1 (2.27) (A + B) -1 6= A -1 + B -1 (2.28) (A > ) > = A (2.29) (A + B) > = A > + B > (2.30) (AB) > = B >A > (2.31) Definition 2.5 (Symmetric Matrix). A matrix A ? Rnn is symmetric if symmetric matrix A = A > . Note that only (n, n)-matrices can be symmetric. Generally, we call (n, n)-matrices also square matrices because they possess the same num- square matrix ber of rows and columns. Moreover, if A is invertible, then so is A > , and (A -1 ) > = (A > ) -1 =: A -> . Remark (Sum and Product of Symmetric Matrices). The sum of symmetric matrices A, B ? Rnn is always symmetric. However, although their product is always defined, it is generally not symmetric: 1 0 0 0 1 1 1 1 = 1 1 0 0 . (2.32) ? 2.2.3 Multiplication by a Scalar Let us look at what happens to matrices when they are multiplied by a scalar ? ? R. Let A ? Rmn and ? ? R. Then ?A = K, Kij = ? aij . Practically, ? scales each element of A. For ?, ? ? R, the following holds: c 2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020). 26 Linear Algebra associativity Associativity: (??)C = ?(?C), C ? Rmn ?(BC) = (?B)C = B(?C) = (BC)?, B ? Rmn , C ? Rnk . Note that this allows us to move scalar values around. (?C) > = C > ? > = C > ? = ?C > since ? = ? > for all ? ? R. distributivity Distributivity: (? + ?)C = ?C + ?C, C ? Rmn ?(B + C) = ?B + ?C, B, C ? Rmn Example 2.5 (Distributivity) If we define C := 1 2 3 4 , (2.33) then for any ?, ? ? R we obtain (? + ?)C = (? + ?)1 (? + ?)2 (? + ?)3 (? + ?)4 = ? + ? 2? + 2? 3? + 3? 4? + 4? (2.34a) = ? 2? 3? 4? + ? 2? 3? 4? = ?C + ?C . (2.34b) 2.2.4 Compact Representations of Systems of Linear Equations If we consider the system of linear equations 2x1 + 3x2 + 5x3 = 1 4x1 - 2x2 - 7x3 = 8 9x1 + 5x2 - 3x3 = 2 (2.35) and use the rules for matrix multiplication, we can write this equation system in a more compact form as ? ? 2 3 5 4 -2 -7 9 5 -3 ? ? ? ? x1 x2 x3 ? ? = ? ? 1 8 2 ? ? . (2.36) Note that x1 scales the first column, x2 the second one, and x3 the third one. Generally, a system of linear equations can be compactly represented in their matrix form as Ax = b; see (2.3), and the product Ax is a (linear) combination of the columns of A. We will discuss linear combinations in more detail in Section 2.5. Draft (2021-07-29) of Mathematics for Machine Learning. Feedback: https://mml-book.com. 2.3 Solving Systems of Linear Equations 27 2.3 Solving Systems of Linear Equations In (2.3), we introduced the general form of an equation system, i.e., a11x1 + + a1nxn = b1 . . . am1x1 + + amnxn = bm , (2.37) where aij ? R and bi ? R are known constants and xj are unknowns, i = 1, . . . , m, j = 1, . . . , n. Thus far, we saw that matrices can be used as a compact way of formulating systems of linear equations so that we can write Ax = b, see (2.10). Moreover, we defined basic matrix operations, such as addition and multiplication of matrices. In the following, we will focus on solving systems of linear equations and provide an algorithm for finding the inverse of a matrix. 2.3.1 Particular and General Solution Before discussing how to generally solve systems of linear equations, let us have a look at an example. Consider the system of equations 1 0 8 -4 0 1 2 12 ? ? ? ? x1 x2 x3 x4 ? ? ? ? = 42 8 . (2.38) The system has two equations and four unknowns. Therefore, in general we would expect infinitely many solutions. This system of equations is in a particularly easy form, where the first two columns consist of a 1 P and a 0. Remember that we want to find scalars x1, . . . , x4, such that 4 i=1 xici = b, where we define ci to be the ith column of the matrix and b the right-hand-side of (2.38). A solution to the problem in (2.38) can be found immediately by taking 42 times the first column and 8 times the second column so that b = 42 8 = 42 1 0 + 8 0 1 . (2.39) Therefore, a solution is [42, 8, 0, 0]>. This solution is called a particular particular solution solution or special solution. However, this is not the only solution of this special solution system of linear equations. To capture all the other solutions, we need to be creative in generating 0 in a non-trivial way using the columns of the matrix: Adding 0 to our special solution does not change the special solution. To do so, we express the third column using the first two columns (which are of this very simple form) 8 2 = 8 1 0 + 2 0 1 (2.40) c 2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020). 28 Linear Algebra so that 0 = 8c1 + 2c2 - 1c3 + 0c4 and (x1, x2, x3, x4) = (8, 2, -1, 0). In fact, any scaling of this solution by ?1 ? R produces the 0 vector, i.e., 1 0 8 -4 0 1 2 12 ? ??? ?1 ? ? ? ? 8 2 -1 0 ? ? ? ? ? ??? = ?1(8c1 + 2c2 - c3) = 0 . (2.41) Following the same line of reasoning, we express the fourth column of the matrix in (2.38) using the first two columns and generate another set of non-trivial versions of 0 as 1 0 8 -4 0 1 2 12 ? ??? ?2 ? ? ? ? -4 12 0 -1 ? ? ? ? ? ??? = ?2(-4c1 + 12c2 - c4) = 0 (2.42) for any ?2 ? R. Putting everything together, we obtain all solutions of the general solution equation system in (2.38), which is called the general solution, as the set ? ??? ??? x ? R 4 : x = ? ? ? ? 42 8 0 0 ? ? ? ? + ?1 ? ? ? ? 8 2 -1 0 ? ? ? ? + ?2 ? ? ? ? -4 12 0 -1 ? ? ? ? , ?1, ?2 ? R ? ??? ??? . (2.43) Remark. The general approach we followed consisted of the following three steps: 1. Find a particular solution to Ax = b. 2. Find all solutions to Ax = 0. 3. Combine the solutions from steps 1. and 2. to the general solution. Neither the general nor the particular solution is unique. ? The system of linear equations in the preceding example was easy to solve because the matrix in (2.38) has this particularly convenient form, which allowed us to find the particular and the general solution by inspection. However, general equation systems are not of this simple form. Fortunately, there exists a constructive algorithmic way of transforming any system of linear equations into this particularly simple form: Gaussian elimination. Key to Gaussian elimination are elementary transformations of systems of linear equations, which transform the equation system into a simple form. Then, we can apply the three steps to the simple form that we just discussed in the context of the example in (2.38). 2.3.2 Elementary Transformations elementary Key to solving a system of linear equations are elementary transformations transformations that keep the solution set the same, but that transform the equation system into a simpler form: Draft (2021-07-29) of Mathematics for Machine Learning. Feedback: https://mml-book.com. 2.3 Solving Systems of Linear Equations 29 Exchange of two equations (rows in the matrix representing the system of equations) Multiplication of an equation (row) with a constant ? ? R\{0} Addition of two equations (rows) Example 2.6 For a ? R, we seek all solutions of the following system of equations: -2x1 + 4x2 - 2x3 - x4 + 4x5 = -3 4x1 - 8x2 + 3x3 - 3x4 + x5 = 2 x1 - 2x2 + x3 - x4 + x5 = 0 x1 - 2x2 - 3x4 + 4x5 = a . (2.44) We start by converting this system of equations into the compact matrix notation Ax = b. We no longer mention the variables x explicitly and build the augmented matrix (in the form A | b ) augmented matrix ? ? ? ? -2 4 -2 -1 4 -3 4 -8 3 -3 1 2 1 -2 1 -1 1 0 1 -2 0 -3 4 a ? ? ? ? Swap with R3 Swap with R1 where we used the vertical line to separate the left-hand side from the right-hand side in (2.44). We use to indicate a transformation of the augmented matrix using elementary transformations. The augmented matrix A | b compactly represents the system of linear equations Ax = b. Swapping Rows 1 and 3 leads to ? ? ? ? 1 -2 1 -1 1 0 4 -8 3 -3 1 2 -2 4 -2 -1 4 -3 1 -2 0 -3 4 a ? ? ? ? -4R1 +2R1 -R1 When we now apply the indicated transformations (e.g., subtract Row 1 four times from Row 2), we obtain ? ? ? ? 1 -2 1 -1 1 0 0 0 -1 1 -3 2 0 0 0 -3 6 -3 0 0 -1 -2 3 a ? ? ? ? -R2 - R3 ? ? ? ? 1 -2 1 -1 1 0 0 0 -1 1 -3 2 0 0 0 -3 6 -3 0 0 0 0 0 a+ 1 ? ? ? ? (-1) (- 1 3 ) ? ? ? ? 1 -2 1 -1 1 0 0 0 1 -1 3 -2 0 0 0 1 -2 1 0 0 0 0 0 a+ 1 ? ? ? ? c 2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020). 30 Linear Algebra row-echelon form This (augmented) matrix is in a convenient form, the row-echelon form (REF). Reverting this compact notation back into the explicit notation with the variables we seek, we obtain x1 - 2x2 + x3 - x4 + x5 = 0 x3 - x4 + 3x5 = -2 x4 - 2x5 = 1 0 = a + 1 . (2.45) particular solution Only for a = -1 this system can be solved. A particular solution is ? ? ? ? ? ? x1 x2 x3 x4 x5 ? ? ? ? ? ? = ? ? ? ? ? ? 2 0 -1 1 0 ? ? ? ? ? ? . (2.46) general solution The general solution, which captures the set of all possible solutions, is ? ????? ????? x ? R 5 : x = ? ? ? ? ? ? 2 0 -1 1 0 ? ? ? ? ? ? + ?1 ? ? ? ? ? ? 2 1 0 0 0 ? ? ? ? ? ? + ?2 ? ? ? ? ? ? 2 0 -1 2 1 ? ? ? ? ? ? , ?1, ?2 ? R ? ????? ????? . (2.47) In the following, we will detail a constructive way to obtain a particular and general solution of a system of linear equations. Remark (Pivots and Staircase Structure). The leading coefficient of a row pivot (first nonzero number from the left) is called the pivot and is always strictly to the right of the pivot of the row above it. Therefore, any equation system in row-echelon form always has a staircase structure. ? row-echelon form Definition 2.6 (Row-Echelon Form). A matrix is in row-echelon form if All rows that contain only zeros are at the bottom of the matrix; correspondingly, all rows that contain at least one nonzero element are on top of rows that contain only zeros. Looking at nonzero rows only, the first nonzero number from the left pivot (also called the pivot or the leading coefficient) is always strictly to the leading coefficient right of the pivot of the row above it. In other texts, it is sometimes required that the pivot is 1. Remark (Basic and Free Variables). The variables corresponding to the pivots in the row-echelon form are called basic variables and the other basic variable variables are free variables. For example, in (2.45), x1, x3, x4 are basic free variable variables, whereas x2, x5 are free variables. ? Remark (Obtaining a Particular Solution). The row-echelon form makes Draft (2021-07-29) of Mathematics for Machine Learning. Feedback: https://mml-book.com. 2.3 Solving Systems of Linear Equations 31 our lives easier when we need to determine a particular solution. To do this, we express the right-hand side of the equation system using the pivot columns, such that b = PP i=1 ?ipi , where pi , i = 1, . . . , P, are the pivot columns. The ?i are determined easiest if we start with the rightmost pivot column and work our way to the left. In the previous example, we would try to find ?1, ?2, ?3 so that ?1 ? ? ? ? 1 0 0 0 ? ? ? ? + ?2 ? ? ? ? 1 1 0 0 ? ? ? ? + ?3 ? ? ? ? -1 -1 1 0 ? ? ? ? = ? ? ? ? 0 -2 1 0 ? ? ? ? . (2.48) From here, we find relatively directly that ?3 = 1, ?2 = -1, ?1 = 2. When we put everything together, we must not forget the non-pivot columns for which we set the coefficients implicitly to 0. Therefore, we get the particular solution x = [2, 0, -1, 1, 0]>. ? Remark (Reduced Row Echelon Form). An equation system is in reduced reduced row-echelon form (also: row-reduced echelon form or row canonical form) if row-echelon form It is in row-echelon form. Every pivot is 1. The pivot is the only nonzero entry in its column. ? The reduced row-echelon form will play an important role later in Section 2.3.3 because it allows us to determine the general solution of a system of linear equations in a straightforward way. Gaussian Remark (Gaussian Elimination). Gaussian elimination is an algorithm that elimination performs elementary transformations to bring a system of linear equations into reduced row-echelon form. ? Example 2.7 (Reduced Row Echelon Form) Verify that the following matrix is in reduced row-echelon form (the pivots are in bold): A = ? ? 1 3 0 0 3 0 0 1 0 9 0 0 0 1 -4 ? ? . (2.49) The key idea for finding the solutions of Ax = 0 is to look at the nonpivot columns, which we will need to express as a (linear) combination of the pivot columns. The reduced row echelon form makes this relatively straightforward, and we express the non-pivot columns in terms of sums and multiples of the pivot columns that are on their left: The second column is 3 times the first column (we can ignore the pivot columns on the right of the second column). Therefore, to obtain 0, we need to subtract c 2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020). 32 Linear Algebra the second column from three times the first column. Now, we look at the fifth column, which is our second non-pivot column. The fifth column can be expressed as 3 times the first pivot column, 9 times the second pivot column, and -4 times the third pivot column. We need to keep track of the indices of the pivot columns and translate this into 3 times the first column, 0 times the second column (which is a non-pivot column), 9 times the third column (which is our second pivot column), and -4 times the fourth column (which is the third pivot column). Then we need to subtract the fifth column to obtain 0. In the end, we are still solving a homogeneous equation system. To summarize, all solutions of Ax = 0, x ? R5 are given by ? ????? ????? x ? R 5 : x = ?1 ? ? ? ? ? ? 3 -1 0 0 0 ? ? ? ? ? ? + ?2 ? ? ? ? ? ? 3 0 9 -4 -1 ? ? ? ? ? ? , ?1, ?2 ? R ? ????? ????? . (2.50) 2.3

Step by Step Solution

★★★★★

3.37 Rating (153 Votes )

There are 3 Steps involved in it

1 Expert Approved Answer

Step: 1 Unlock

Sure lets dive into how to solve systems of linear equations Well use a stepbystep approach to understand the solution methodology and ensure the explanation is clear and comprehensive Solving Systems ... View full answer

Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock

Step: 3 Unlock

Students Have Also Explored These Related Programming Questions!

what would you like to recommend to Australian government in regards to their fiscal stimulus policy for supporting covid affected society?

Jim Hanks borrowed money by issuing two notes on January 1, Year 1. The financing transactions are described next. 1. Borrowed funds by issuing a $60,000 face value discount note to State Bank. The...

For the partial ANOV table for a simple linear regression model given in Figure 8.5, determine the values of the missing entries.

Cousins Salted Snack Company is considering two possible investments: a delivery truck or a bagging machine. The delivery truck would cost $39,287 and could be used to deliver an additional 48,200...

13. If the company pursues the investment opportunity and otherwise performs the same as last year, what residual income will it earn this year? 14. If Westervilles chief executive officer will earn...

Susan Jones, owner of HyperTech, LLC decided that her storefront windows need to be washed in order to attract more foot traffic. The business sells iDevices, which are visible from the windows on...

SAP is a piece of enterprise resource planning (ERP) software that provides businesses with a unified platform from which they can manage various facets of their operations. SAP's ERP software...

Use Mowrer's two-process theory of avoidance and the anxiety conservation hypothesis to explain the following scenario: you were bitten by fox when walking through the forest near your house 5-years...

Suppose we are given a set S of n items, each with a distinct value and a positive weight. For any element x S, we define two subsets S Answered: 1 week ago

Cyber risk is reduced through a) the use of floating interest rates. b) hiring traders to actively manage buying and selling of securities. c) cloud computing and other information redundancies. d)...

Problem 3. Consider the following pseudo-code fragment. def stars (n): for i in [1:n]: print '*' i many times 123 a) Using the O-notation, upperbound the running time of STARS. b) Using the...

15. Palmer Frosted Flakes Company offers its customers a pottery cereal bowl if they send in 3 boxtops from Palmer Frosted Flakes boxes and $1. The company estimates that 60% of the boxtops will be...

Lata Inc. produces aluminum cans. Production of 12-ounce cans has a standard unit quantity of 4.5 ounces of aluminum per can. During the month of April, 300,000 cans were produced using 1,250,000...

Suppose Green Network Energy needs to raise money to finance its new manufacturing facility, but their CFO does not think the company is financially capable of making the periodic interest payments...

7. A trader executes a bear spread on the Japanese yen consisting of a long 103 March put and a short 101 March put. (a) If the price of the 103 put is 2.81(100ths of / ), while the price of the 101...

3. What currency options contracts are currently traded on the PHLX?

4. What are the implied volatilities of the euro and the Japanese yen over the past week and month? Are implied volatilities generally higher or lower for longer maturity contracts? Explain.