Pattern Recognition and Machine Learning, 1st Edition, Christopher M. Bishop - Solutions
3.13 ( ) Show that the predictive distribution $p(t|x, \mathbf{t})$ for the model discussed in Exercise 3.12 is given by a Student's t-distribution of the form $p(t|x, \mathbf{t}) = \mathrm{St}(t|\mu, \lambda, \nu)$ (3.114) and obtain expressions for $\mu$, $\lambda$ and $\nu$.
3.12 ( ) We saw in Section 2.3.6 that the conjugate prior for a Gaussian distribution with unknown mean and unknown precision (inverse variance) is a normal-gamma distribution. This property also holds for the case of the conditional Gaussian distribution $p(t|x, \mathbf{w}, \beta)$ of the linear regression …
3.11 ( ) We have seen that, as the size of a data set increases, the uncertainty associated with the posterior distribution over model parameters decreases. Make use of the matrix identity (Appendix C) $\left(\mathbf{M} + \mathbf{v}\mathbf{v}^{\mathrm{T}}\right)^{-1} = \mathbf{M}^{-1} - \dfrac{(\mathbf{M}^{-1}\mathbf{v})(\mathbf{v}^{\mathrm{T}}\mathbf{M}^{-1})}{1 + \mathbf{v}^{\mathrm{T}}\mathbf{M}^{-1}\mathbf{v}}$ (3.110) to show that the uncertainty $\sigma_N^2(x)$ …
3.10 ( ) www By making use of the result (2.115) to evaluate the integral in (3.57), verify that the predictive distribution for the Bayesian linear regression model is given by (3.58) in which the input-dependent variance is given by (3.59).
3.9 ( ) Repeat the previous exercise but instead of completing the square by hand, make use of the general result for linear-Gaussian models given by (2.116).
3.8 ( ) www Consider the linear basis function model in Section 3.1, and suppose that we have already observed N data points, so that the posterior distribution over $\mathbf{w}$ is given by (3.49). This posterior can be regarded as the prior for the next observation. By considering an additional data point …
3.7 () By using the technique of completing the square, verify the result (3.49) for the posterior distribution of the parameters w in the linear basis function model in which mN and SN are defined by (3.50) and (3.51) respectively.
3.6 () www Consider a linear basis function regression model for a multivariate target variable $\mathbf{t}$ having a Gaussian distribution of the form $p(\mathbf{t}|\mathbf{W}, \Sigma) = \mathcal{N}(\mathbf{t}|\mathbf{y}(\mathbf{x}, \mathbf{W}), \Sigma)$ (3.107) where $\mathbf{y}(\mathbf{x}, \mathbf{W}) = \mathbf{W}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x})$ (3.108), together with a training data set comprising input basis vectors $\boldsymbol{\phi}(\mathbf{x}_n)$ and corresponding …
3.5 () www Using the technique of Lagrange multipliers, discussed in Appendix E, show that minimization of the regularized error function (3.29) is equivalent to minimizing the unregularized sum-of-squares error (3.12) subject to the constraint (3.30). Discuss the relationship between the …
3.4 () www Consider a linear model of the form $y(x, \mathbf{w}) = w_0 + \sum_{i=1}^{D} w_i x_i$ (3.105) together with a sum-of-squares error function of the form $E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2$ (3.106). Now suppose that Gaussian noise $\epsilon_i$ with zero mean and variance $\sigma^2$ is added independently to each of the input …
3.3 () Consider a data set in which each data point $t_n$ is associated with a weighting factor $r_n > 0$, so that the sum-of-squares error function becomes $E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} r_n\left\{t_n - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2$ (3.104). Find an expression for the solution $\mathbf{w}$ that minimizes this error function. Give two alternative …
3.2 ( ) Show that the matrix $\Phi\left(\Phi^{\mathrm{T}}\Phi\right)^{-1}\Phi^{\mathrm{T}}$ (3.103) takes any vector $\mathbf{v}$ and projects it onto the space spanned by the columns of $\Phi$. Use this result to show that the least-squares solution (3.15) corresponds to an orthogonal projection of the vector $\mathbf{t}$ onto the manifold $\mathcal{S}$, as shown in Figure 3.2.
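A quick numerical sanity check of the projection property in Exercise 3.2 can be written in a few lines of NumPy; the matrix sizes, random data, and seed below are arbitrary illustrative choices, not part of the exercise.

```python
# Minimal sketch: verify that P = Phi (Phi^T Phi)^{-1} Phi^T is an orthogonal
# projector onto the column space of Phi, and that the least-squares fit
# of t equals P t (all sizes and data here are arbitrary).
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 4))          # design matrix with full column rank
t = rng.normal(size=20)                 # target vector

P = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T
w_ls, *_ = np.linalg.lstsq(Phi, t, rcond=None)

assert np.allclose(P @ P, P)            # idempotent
assert np.allclose(P, P.T)              # symmetric, hence an orthogonal projection
assert np.allclose(P @ t, Phi @ w_ls)   # projects t onto the model manifold S
print("projection checks passed")
```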
3.1 () www Show that the 'tanh' function and the logistic sigmoid function (3.6) are related by $\tanh(a) = 2\sigma(2a) - 1$ (3.100). Hence show that a general linear combination of logistic sigmoid functions of the form $y(x, \mathbf{w}) = w_0 + \sum_{j=1}^{M} w_j\,\sigma\!\left(\frac{x - \mu_j}{s}\right)$ (3.101) is equivalent to a linear combination …
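The first identity in Exercise 3.1 is easy to confirm numerically; this sketch assumes nothing beyond NumPy, and the grid of test points is an arbitrary choice.

```python
# Check tanh(a) = 2*sigma(2a) - 1 on a grid of points.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-5.0, 5.0, 201)
assert np.allclose(np.tanh(a), 2.0 * sigmoid(2.0 * a) - 1.0)
print("tanh(a) == 2*sigma(2a) - 1 holds numerically")
```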
2.61 () Show that the K-nearest-neighbour density model defines an improper distribution whose integral over all space is divergent.
2.60 ( ) www Consider a histogram-like density model in which the space x is divided into fixed regions for which the density $p(x)$ takes the constant value $h_i$ over the $i$th region, and that the volume of region $i$ is denoted $\Delta_i$. Suppose we have a set of $N$ observations of x such that $n_i$ of these …
2.59 () By changing variables using y = x/σ, show that the density (2.236) will be correctly normalized, provided f(x) is correctly normalized.
2.58 () The result (2.226) showed that the negative gradient of $\ln g(\boldsymbol{\eta})$ for the exponential family is given by the expectation of $\mathbf{u}(\mathbf{x})$. By taking the second derivatives of (2.195), show that $-\nabla\nabla \ln g(\boldsymbol{\eta}) = \mathbb{E}\left[\mathbf{u}(\mathbf{x})\mathbf{u}(\mathbf{x})^{\mathrm{T}}\right] - \mathbb{E}[\mathbf{u}(\mathbf{x})]\,\mathbb{E}\left[\mathbf{u}(\mathbf{x})^{\mathrm{T}}\right] = \operatorname{cov}[\mathbf{u}(\mathbf{x})]$ (2.300).
2.57 () Verify that the multivariate Gaussian distribution can be cast in exponential family form (2.194) and derive expressions for η, u(x), h(x) and g(η) analogous to(2.220)–(2.223).
2.56 ( ) www Express the beta distribution (2.13), the gamma distribution (2.146), and the von Mises distribution (2.179) as members of the exponential family (2.194)and thereby identify their natural parameters.
2.55 () By making use of the result (2.168), together with (2.184) and the trigonometric identity (2.178), show that the maximum likelihood solution $m_{\mathrm{ML}}$ for the concentration of the von Mises distribution satisfies $A(m_{\mathrm{ML}}) = r$ where $r$ is the radius of the mean of the observations viewed as unit …
2.54 () By computing first and second derivatives of the von Mises distribution (2.179), and using $I_0(m) > 0$ for $m > 0$, show that the maximum of the distribution occurs when $\theta = \theta_0$ and that the minimum occurs when $\theta = \theta_0 + \pi$ (mod $2\pi$).
2.53 () Using the trigonometric identity (2.183), show that solution of (2.182) for θ0 is given by (2.184).
2.52 ( ) For large $m$, the von Mises distribution (2.179) becomes sharply peaked around the mode $\theta_0$. By defining $\xi = m^{1/2}(\theta - \theta_0)$ and making the Taylor expansion of the cosine function given by $\cos\alpha = 1 - \frac{\alpha^2}{2} + O(\alpha^4)$ (2.299), show that as $m \to \infty$, the von Mises distribution tends to a …
2.51 () www The various trigonometric identities used in the discussion of periodic variables in this chapter can be proven easily from the relation exp(iA) = cosA + i sinA (2.296)in which i is the square root of minus one. By considering the identity exp(iA) exp(−iA) = 1 (2.297)prove the result
2.50 () Show that in the limit ν →∞, the multivariate Student’s t-distribution (2.162)reduces to a Gaussian with mean μ and precision Λ.
2.49 ( ) By using the definition (2.161) of the multivariate Student’s t-distribution as a convolution of a Gaussian with a gamma distribution, verify the properties (2.164),(2.165), and (2.166) for the multivariate t-distribution defined by (2.162).
2.48 () By following analogous steps to those used to derive the univariate Student’s t-distribution (2.159), verify the result (2.162) for the multivariate form of the Student’s t-distribution, by marginalizing over the variable η in (2.161). Using the definition (2.161), show by exchanging
2.47 () www Show that in the limit ν → ∞, the t-distribution (2.159) becomes a Gaussian. Hint: ignore the normalization coefficient, and simply look at the dependence on x.
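The limiting behaviour asked for in Exercise 2.47 can also be seen numerically; this sketch assumes SciPy is available and uses an arbitrary grid of evaluation points.

```python
# As the degrees of freedom nu grow, the Student's t density approaches the
# standard Gaussian density pointwise.
import numpy as np
from scipy.stats import t as student_t, norm

x = np.linspace(-4.0, 4.0, 401)
for nu in (1, 10, 100, 10_000):
    gap = np.max(np.abs(student_t.pdf(x, df=nu) - norm.pdf(x)))
    print(f"nu = {nu:6d}   max |St - N| = {gap:.2e}")
```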
2.46 () www Verify that evaluating the integral in (2.158) leads to the result (2.159).
2.45 () Verify that the Wishart distribution defined by (2.155) is indeed a conjugate prior for the precision matrix of a multivariate Gaussian.
2.44 ( ) Consider a univariate Gaussian distribution $\mathcal{N}(x|\mu, \tau^{-1})$ having conjugate Gaussian-gamma prior given by (2.154), and a data set $\mathbf{x} = \{x_1, \ldots, x_N\}$ of i.i.d. observations. Show that the posterior distribution is also a Gaussian-gamma distribution of the same functional form as the …
2.43 () The following distribution $p(x|\sigma^2, q) = \dfrac{q}{2\,(2\sigma^2)^{1/q}\,\Gamma(1/q)} \exp\!\left(-\dfrac{|x|^q}{2\sigma^2}\right)$ (2.293) is a generalization of the univariate Gaussian distribution. Show that this distribution is normalized, so that $\int_{-\infty}^{\infty} p(x|\sigma^2, q)\,\mathrm{d}x = 1$ (2.294), and that it reduces to the Gaussian when $q = 2$. Consider a …
2.42 ( ) Evaluate the mean, variance, and mode of the gamma distribution (2.146).
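The quantities asked for in Exercise 2.42 (mean $a/b$, variance $a/b^2$, and mode $(a-1)/b$ for $a \geq 1$) are standard results and can be cross-checked against SciPy; the parameter values below are arbitrary, and SciPy's scale parameter corresponds to $1/b$ in PRML's rate parametrization.

```python
# Cross-check the gamma distribution Gam(lambda | a, b) moments (rate b)
# against SciPy's gamma (which uses scale = 1/b).
import numpy as np
from scipy.stats import gamma

a, b = 3.0, 2.0
dist = gamma(a, scale=1.0 / b)

assert np.isclose(dist.mean(), a / b)                     # E[lambda] = a / b
assert np.isclose(dist.var(), a / b**2)                   # var[lambda] = a / b^2
grid = np.linspace(1e-6, 10.0, 200_001)
mode_numeric = grid[np.argmax(dist.pdf(grid))]
assert np.isclose(mode_numeric, (a - 1) / b, atol=1e-3)   # mode = (a - 1) / b
print("gamma mean, variance, and mode match the closed-form expressions")
```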
2.41 () Use the definition of the gamma function (1.141) to show that the gamma distribution(2.146) is normalized.
2.40 ( ) www Consider a D-dimensional Gaussian random variable $\mathbf{x}$ with distribution $\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \Sigma)$ in which the covariance $\Sigma$ is known and for which we wish to infer the mean $\boldsymbol{\mu}$ from a set of observations $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$. Given a prior distribution $p(\boldsymbol{\mu}) = \mathcal{N}(\boldsymbol{\mu}|\boldsymbol{\mu}_0, \Sigma_0)$, find the corresponding …
2.39 ( ) Starting from the results (2.141) and (2.142) for the posterior distribution of the mean of a Gaussian random variable, dissect out the contributions from the first $N - 1$ data points and hence obtain expressions for the sequential update of $\mu_N$ and $\sigma_N^2$. Now derive the same results …
2.38 () Use the technique of completing the square for the quadratic form in the exponent to derive the results (2.141) and (2.142).
2.37 ( ) Using an analogous procedure to that used to obtain (2.126), derive an expression for the sequential estimation of the covariance of a multivariate Gaussian distribution, by starting with the maximum likelihood expression (2.122). Verify that substituting the expression for a Gaussian
2.36 ( ) www Using an analogous procedure to that used to obtain (2.126), derive an expression for the sequential estimation of the variance of a univariate Gaussian distribution, by starting with the maximum likelihood expression $\sigma_{\mathrm{ML}}^2 = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu)^2$ (2.292). Verify that substituting …
2.35 ( ) Use the result (2.59) to prove (2.62). Now, using the results (2.59) and (2.62), show that $\mathbb{E}[\mathbf{x}_n\mathbf{x}_m^{\mathrm{T}}] = \boldsymbol{\mu}\boldsymbol{\mu}^{\mathrm{T}} + I_{nm}\Sigma$ (2.291) where $\mathbf{x}_n$ denotes a data point sampled from a Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance $\Sigma$, and $I_{nm}$ denotes the $(n, m)$ element of the identity matrix. Hence prove …
2.34 ( ) www To find the maximum likelihood solution for the covariance matrix of a multivariate Gaussian, we need to maximize the log likelihood function (2.118)with respect to Σ, noting that the covariance matrix must be symmetric and positive definite. Here we proceed by ignoring these
2.33 ( ) Consider the same joint distribution as in Exercise 2.32, but now use the technique of completing the square to find expressions for the mean and covariance of the conditional distribution p(x|y). Again, verify that these agree with the corresponding expressions (2.111) and (2.112).
2.32 ( ) www This exercise and the next provide practice at manipulating the quadratic forms that arise in linear-Gaussian models, as well as giving an independent check of results derived in the main text. Consider a joint distribution p(x, y)defined by the marginal and conditional distributions
2.31 ( ) Consider two multidimensional random vectors x and z having Gaussian distributions p(x) = N(x|μx,Σx) and p(z) = N(z|μz,Σz) respectively, together with their sum y = x+z. Use the results (2.109) and (2.110) to find an expression for the marginal distribution p(y) by considering the
2.30 () By starting from (2.107) and making use of the result (2.105), verify the result(2.108).
2.29 ( ) Using the partitioned matrix inversion formula (2.76), show that the inverse of the precision matrix (2.104) is given by the covariance matrix (2.105).
2.28 ( ) www Consider a joint distribution over the variable $\mathbf{z} = \begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix}$ (2.290) whose mean and covariance are given by (2.108) and (2.105) respectively. By making use of the results (2.92) and (2.93), show that the marginal distribution $p(\mathbf{x})$ is given by (2.99). Similarly, by making use of the results …
2.27 () Let x and z be two independent random vectors, so that p(x, z) = p(x)p(z). Show that the mean of their sum y = x + z is given by the sum of the means of each of the variables separately. Similarly, show that the covariance matrix of y is given by the sum of the covariance matrices of x and z.
2.26 ( ) A very useful result from linear algebra is the Woodbury matrix inversion formula, given by $(\mathbf{A} + \mathbf{B}\mathbf{C}\mathbf{D})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{B}\left(\mathbf{C}^{-1} + \mathbf{D}\mathbf{A}^{-1}\mathbf{B}\right)^{-1}\mathbf{D}\mathbf{A}^{-1}$ (2.289). By multiplying both sides by $(\mathbf{A} + \mathbf{B}\mathbf{C}\mathbf{D})$, prove the correctness of this result.
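Before proving (2.289) algebraically it can be reassuring to check it numerically; the matrix sizes and the random, well-conditioned matrices below are arbitrary illustrative choices.

```python
# Numerical check of the Woodbury identity (2.289).
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5)) + 5.0 * np.eye(5)   # keep A safely invertible
B = rng.normal(size=(5, 3))
C = rng.normal(size=(3, 3)) + 5.0 * np.eye(3)   # keep C safely invertible
D = rng.normal(size=(3, 5))

lhs = np.linalg.inv(A + B @ C @ D)
Ainv, Cinv = np.linalg.inv(A), np.linalg.inv(C)
rhs = Ainv - Ainv @ B @ np.linalg.inv(Cinv + D @ Ainv @ B) @ D @ Ainv
assert np.allclose(lhs, rhs)
print("Woodbury identity verified numerically")
```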
2.25 ( ) In Sections 2.3.1 and 2.3.2, we considered the conditional and marginal distributions for a multivariate Gaussian. More generally, we can consider a partitioning of the components of x into three groups xa, xb, and xc, with a corresponding partitioning of the mean vector μ and of the
2.24 ( ) www Prove the identity (2.76) by multiplying both sides by the matrix $\begin{pmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D} \end{pmatrix}$ (2.287) and making use of the definition (2.77).
2.23 ( ) By diagonalizing the coordinate system using the eigenvector expansion (2.45), show that the volume contained within the hyperellipsoid corresponding to a constant Mahalanobis distance $\Delta$ is given by $V_D\,|\Sigma|^{1/2}\,\Delta^D$ (2.286) where $V_D$ is the volume of the unit sphere in $D$ dimensions, and the …
2.22 () www Show that the inverse of a symmetric matrix is itself symmetric.
2.21 () Show that a real, symmetric matrix of size D×D has D(D+1)/2 independent parameters.
2.20 ( ) www A positive definite matrix $\Sigma$ can be defined as one for which the quadratic form $\mathbf{a}^{\mathrm{T}}\Sigma\mathbf{a}$ (2.285) is positive for any real value of the vector $\mathbf{a}$. Show that a necessary and sufficient condition for $\Sigma$ to be positive definite is that all of the eigenvalues $\lambda_i$ of $\Sigma$, defined by (2.45), are …
2.19 ( ) Show that a real, symmetric matrix Σ having the eigenvector equation (2.45)can be expressed as an expansion in the eigenvectors, with coefficients given by the eigenvalues, of the form (2.48). Similarly, show that the inverse matrix Σ−1 has a representation of the form (2.49).
2.18 ( ) Consider a real, symmetric matrix Σ whose eigenvalue equation is given by (2.45). By taking the complex conjugate of this equation and subtracting the original equation, and then forming the inner product with eigenvector ui, show that the eigenvalues λi are real. Similarly, use the
2.17 () www Consider the multivariate Gaussian distribution given by (2.43). By writing the precision matrix (inverse covariance matrix) Σ−1 as the sum of a symmetric and an anti-symmetric matrix, show that the anti-symmetric term does not appear in the exponent of the Gaussian, and hence that
2.16 ( ) www Consider two random variables x1 and x2 having Gaussian distributions with means μ1, μ2 and precisions τ1, τ2 respectively. Derive an expression for the differential entropy of the variable x = x1 + x2. To do this, first find the distribution of x by using the relation p(x) =
2.15 ( ) Show that the entropy of the multivariate Gaussian $\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \Sigma)$ is given by $H[\mathbf{x}] = \frac{1}{2}\ln|\Sigma| + \frac{D}{2}\left(1 + \ln(2\pi)\right)$ (2.283) where $D$ is the dimensionality of $\mathbf{x}$.
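The entropy formula (2.283) can be checked against SciPy's built-in differential entropy (in nats); the dimensionality, seed, and covariance below are arbitrary.

```python
# Compare the closed-form Gaussian entropy with scipy's entropy().
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
D = 4
A = rng.normal(size=(D, D))
Sigma = A @ A.T + np.eye(D)     # symmetric positive definite covariance
mu = rng.normal(size=D)

closed_form = 0.5 * np.log(np.linalg.det(Sigma)) + 0.5 * D * (1.0 + np.log(2.0 * np.pi))
assert np.isclose(closed_form, multivariate_normal(mu, Sigma).entropy())
print("entropy formula (2.283) matches scipy's value")
```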
2.14 ( ) www This exercise demonstrates that the multivariate distribution with maximum entropy, for a given covariance, is a Gaussian. The entropy of a distribution $p(\mathbf{x})$ is given by $H[\mathbf{x}] = -\int p(\mathbf{x}) \ln p(\mathbf{x})\,\mathrm{d}\mathbf{x}$ (2.279). We wish to maximize $H[\mathbf{x}]$ over all distributions $p(\mathbf{x})$ subject to the constraints …
2.13 ( ) Evaluate the Kullback-Leibler divergence (1.113) between two Gaussians p(x) = N(x|μ,Σ) and q(x) = N(x|m,L).
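For Exercise 2.13, the answer is the well-known closed-form KL divergence between two Gaussians; the sketch below (an illustration, not a derivation) compares that standard expression with a Monte Carlo estimate for arbitrary parameter choices, assuming SciPy is available.

```python
# Compare the standard closed-form KL(p || q) for two multivariate Gaussians
# with a Monte Carlo estimate of E_p[ln p(x) - ln q(x)].
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
D = 3
mu, m = rng.normal(size=D), rng.normal(size=D)
A = rng.normal(size=(D, D)); Sigma = A @ A.T + np.eye(D)
B = rng.normal(size=(D, D)); L = B @ B.T + np.eye(D)
p, q = multivariate_normal(mu, Sigma), multivariate_normal(m, L)

# closed form: 0.5 * [ln(|L|/|Sigma|) - D + tr(L^-1 Sigma) + (m-mu)^T L^-1 (m-mu)]
Linv = np.linalg.inv(L)
diff = m - mu
kl_closed = 0.5 * (np.log(np.linalg.det(L) / np.linalg.det(Sigma)) - D
                   + np.trace(Linv @ Sigma) + diff @ Linv @ diff)

x = p.rvs(size=200_000, random_state=rng)
kl_mc = np.mean(p.logpdf(x) - q.logpdf(x))
print(f"closed form: {kl_closed:.4f}   Monte Carlo: {kl_mc:.4f}")
```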
2.12 () The uniform distribution for a continuous variable $x$ is defined by $U(x|a, b) = \dfrac{1}{b - a}, \quad a \leq x \leq b$ (2.278). Verify that this distribution is normalized, and find expressions for its mean and variance.
2.11 () www By expressing the expectation of $\ln\mu_j$ under the Dirichlet distribution (2.38) as a derivative with respect to $\alpha_j$, show that $\mathbb{E}[\ln\mu_j] = \psi(\alpha_j) - \psi(\alpha_0)$ (2.276) where $\alpha_0$ is given by (2.39) and $\psi(a) \equiv \dfrac{\mathrm{d}}{\mathrm{d}a}\ln\Gamma(a)$ (2.277) is the digamma function.
2.10 ( ) Using the property $\Gamma(x + 1) = x\Gamma(x)$ of the gamma function, derive the following results for the mean, variance, and covariance of the Dirichlet distribution given by (2.38): $\mathbb{E}[\mu_j] = \dfrac{\alpha_j}{\alpha_0}$ (2.273), $\operatorname{var}[\mu_j] = \dfrac{\alpha_j(\alpha_0 - \alpha_j)}{\alpha_0^2(\alpha_0 + 1)}$ (2.274), $\operatorname{cov}[\mu_j, \mu_l] = -\dfrac{\alpha_j\alpha_l}{\alpha_0^2(\alpha_0 + 1)}, \quad j \neq l$ …
2.9 ( ) www In this exercise, we prove the normalization of the Dirichlet distribution (2.38) using induction. We have already shown in Exercise 2.5 that the beta distribution, which is a special case of the Dirichlet for M = 2, is normalized. We now assume that the Dirichlet distribution is …
2.8 () Consider two variables x and y with joint distribution p(x, y). Prove the following two results E[x] = Ey [Ex[x|y]] (2.270)var[x] = Ey [varx[x|y]] + vary [Ex[x|y]] . (2.271)Here Ex[x|y] denotes the expectation of x under the conditional distribution p(x|y), with a similar notation for the
2.7 ( ) Consider a binomial random variable x given by (2.9), with prior distribution for μ given by the beta distribution (2.13), and suppose we have observed m occurrences of x = 1 and l occurrences of x = 0. Show that the posterior mean value of x lies between the prior mean and the maximum …
2.6 () Make use of the result (2.265) to show that the mean, variance, and mode of the beta distribution (2.13) are given respectively by $\mathbb{E}[\mu] = \dfrac{a}{a + b}$ (2.267), $\operatorname{var}[\mu] = \dfrac{ab}{(a + b)^2(a + b + 1)}$ (2.268), $\operatorname{mode}[\mu] = \dfrac{a - 1}{a + b - 2}$ (2.269).
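The three expressions in (2.267)-(2.269) can be cross-checked against SciPy's beta distribution; the parameter values below are arbitrary.

```python
# Cross-check the beta distribution mean, variance, and mode.
import numpy as np
from scipy.stats import beta

a, b = 3.0, 5.0
dist = beta(a, b)

assert np.isclose(dist.mean(), a / (a + b))                          # (2.267)
assert np.isclose(dist.var(), a * b / ((a + b) ** 2 * (a + b + 1)))  # (2.268)
grid = np.linspace(1e-6, 1.0 - 1e-6, 100_001)
mode_numeric = grid[np.argmax(dist.pdf(grid))]
assert np.isclose(mode_numeric, (a - 1) / (a + b - 2), atol=1e-4)    # (2.269)
print("beta mean, variance, and mode match the closed-form expressions")
```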
2.5 ( ) www In this exercise, we prove that the beta distribution, given by (2.13), is correctly normalized, so that (2.14) holds. This is equivalent to showing that $\int_0^1 \mu^{a-1}(1 - \mu)^{b-1}\,\mathrm{d}\mu = \dfrac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)}$ (2.265). From the definition (1.141) of the gamma function, we have $\Gamma(a)\Gamma(b) =$ …
2.4 ( ) Show that the mean of the binomial distribution is given by (2.11). To do this, differentiate both sides of the normalization condition (2.264) with respect to μ and then rearrange to obtain an expression for the mean of n. Similarly, by differentiating(2.264) twice with respect to μ and
2.3 ( ) www In this exercise, we prove that the binomial distribution (2.9) is normalized. First use the definition (2.10) of the number of combinations of m identical objects chosen from a total of N to show that $\dbinom{N}{m} + \dbinom{N}{m - 1} = \dbinom{N + 1}{m}$ (2.262). Use this result to prove by induction the following …
2.2 ( ) The form of the Bernoulli distribution given by (2.2) is not symmetric between the two values of x. In some situations, it will be more convenient to use an equivalent formulation for which $x \in \{-1, 1\}$, in which case the distribution can be written $p(x|\mu) = \left(\dfrac{1 - \mu}{2}\right)^{(1-x)/2}\left(\dfrac{1 + \mu}{2}\right)^{(1+x)/2}$ …
2.1 () www Verify that the Bernoulli distribution (2.2) satisfies the following properties: $\sum_{x=0}^{1} p(x|\mu) = 1$ (2.257), $\mathbb{E}[x] = \mu$ (2.258), $\operatorname{var}[x] = \mu(1 - \mu)$ (2.259). Show that the entropy $H[x]$ of a Bernoulli distributed random binary variable $x$ is given by $H[x] = -\mu\ln\mu - (1 - \mu)\ln(1 - \mu)$.
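The four properties in Exercise 2.1 can be confirmed directly for any particular value of μ; the value used below is arbitrary.

```python
# Direct numerical check of the Bernoulli normalization, mean, variance,
# and entropy for one value of mu.
import numpy as np

mu = 0.3
p = {0: 1.0 - mu, 1: mu}                       # Bernoulli pmf over {0, 1}

assert np.isclose(sum(p.values()), 1.0)                                      # (2.257)
assert np.isclose(sum(x * p[x] for x in (0, 1)), mu)                         # (2.258)
assert np.isclose(sum((x - mu) ** 2 * p[x] for x in (0, 1)), mu * (1 - mu))  # (2.259)
H = -mu * np.log(mu) - (1 - mu) * np.log(1 - mu)
assert np.isclose(H, -sum(p[x] * np.log(p[x]) for x in (0, 1)))              # entropy
print("Bernoulli normalization, mean, variance, and entropy all check out")
```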
1.30 ( ) Evaluate the Kullback-Leibler divergence (1.113) between two Gaussians p(x) = N(x|μ, σ2) and q(x) = N(x|m, s2).
1.29 () www Consider an M-state discrete random variable x, and use Jensen's inequality in the form (1.115) to show that the entropy of its distribution p(x) satisfies $H[x] \leq \ln M$.
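The bound in Exercise 1.29 is easy to probe empirically (an illustration, not a proof); the number of states and the random distributions below are arbitrary.

```python
# H[x] never exceeds ln M for random M-state distributions; equality holds
# for the uniform distribution.
import numpy as np

rng = np.random.default_rng(4)
M = 8
for _ in range(1000):
    p = rng.dirichlet(np.ones(M))              # a random distribution over M states
    H = -np.sum(p * np.log(p))
    assert H <= np.log(M) + 1e-12
print("entropy stayed below ln M on every trial")
```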
1.28 () In Section 1.6, we introduced the idea of entropy h(x) as the information gained on observing the value of a random variable x having distribution p(x). We saw that, for independent variables x and y for which p(x, y) = p(x)p(y), the entropy functions are additive, so that h(x, y) =
1.27 ( ) www Consider the expected loss for regression problems under the Lq loss function given by (1.91). Write down the condition that y(x) must satisfy in order to minimize E[Lq]. Show that, for q = 1, this solution represents the conditional median, i.e., the function y(x) such that the
1.26 () By expansion of the square in (1.151), derive a result analogous to (1.90) and hence show that the function y(x) that minimizes the expected squared loss for the case of a vector t of target variables is again given by the conditional expectation of t.
1.25 () www Consider the generalization of the squared loss function (1.87) for a single target variable t to the case of multiple target variables described by the vector $\mathbf{t}$, given by $\mathbb{E}[L(\mathbf{t}, \mathbf{y}(\mathbf{x}))] = \int\!\!\int \|\mathbf{y}(\mathbf{x}) - \mathbf{t}\|^2\, p(\mathbf{x}, \mathbf{t})\,\mathrm{d}\mathbf{x}\,\mathrm{d}\mathbf{t}$ (1.151). Using the calculus of variations, show that the function $\mathbf{y}(\mathbf{x})$ …
1.24 ( ) www Consider a classification problem in which the loss incurred when an input vector from class Ck is classified as belonging to class Cj is given by the loss matrix Lkj, and for which the loss incurred in selecting the reject option is λ.Find the decision criterion that will give the
1.23 () Derive the criterion for minimizing the expected loss when there is a general loss matrix and general prior probabilities for the classes.
1.22 () www Given a loss matrix with elements Lkj, the expected risk is minimized if, for each x, we choose the class that minimizes (1.81). Verify that, when the loss matrix is given by Lkj = 1 − Ikj, where Ikj are the elements of the identity matrix, this reduces to the criterion of choosing
1.37 () Using the definition (1.111) together with the product rule of probability, prove the result (1.112).
1.31 ( ) www Consider two variables x and y having joint distribution p(x, y). Show that the differential entropy of this pair of variables satisfies $H[x, y] \leq H[x] + H[y]$ (1.152) with equality if, and only if, x and y are statistically independent.
1.41 () www Using the sum and product rules of probability, show that the mutual information I(x, y) satisfies the relation (1.121).
1.40 () By applying Jensen’s inequality (1.115) with f(x) = lnx, show that the arithmetic mean of a set of real numbers is never less than their geometrical mean.
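The inequality in Exercise 1.40 can likewise be spot-checked on random data (illustration only, not a proof); the sampling ranges below are arbitrary.

```python
# Arithmetic mean >= geometric mean for random sets of positive numbers.
import numpy as np

rng = np.random.default_rng(5)
for _ in range(1000):
    x = rng.uniform(0.01, 10.0, size=rng.integers(2, 20))
    arithmetic = x.mean()
    geometric = np.exp(np.mean(np.log(x)))
    assert arithmetic >= geometric - 1e-12
print("arithmetic mean >= geometric mean held on all random samples")
```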
1.39 ( ) Consider two binary variables x and y having the joint distribution given in Table 1.3. Evaluate the following quantities: (a) H[x], (b) H[y], (c) H[y|x], (d) H[x|y], (e) H[x, y], (f) I[x, y]. Draw a diagram to show the relationship between these various quantities.
1.38 ( ) www Using proof by induction, show that the inequality (1.114) for convex functions implies the result (1.115).
1.36 () A strictly convex function is defined as one for which every chord lies above the function. Show that this is equivalent to the condition that the second derivative of the function be positive.
1.35 () www Use the results (1.106) and (1.107) to show that the entropy of the univariate Gaussian (1.109) is given by (1.110).
1.34 ( ) www Use the calculus of variations to show that the stationary point of the functional (1.108) is given by (1.108). Then use the constraints (1.105), (1.106), and (1.107) to eliminate the Lagrange multipliers and hence show that the maximum entropy solution is given by the Gaussian
1.33 ( ) Suppose that the conditional entropy H[y|x] between two discrete random variables x and y is zero. Show that, for all values of x such that p(x) > 0, the variable y must be a function of x, in other words for each x there is only one value of y such that p(y|x) = 0.
1.32 () Consider a vector x of continuous variables with distribution p(x) and corresponding entropy H[x]. Suppose that we make a nonsingular linear transformation of x to obtain a new variable y = Ax. Show that the corresponding entropy is given by H[y] = H[x] + ln|A| where |A| denotes the
1.21 ( ) Consider two nonnegative numbers $a$ and $b$, and show that, if $a \leq b$, then $a \leq (ab)^{1/2}$. Use this result to show that, if the decision regions of a two-class classification problem are chosen to minimize the probability of misclassification, this probability will satisfy $p(\text{mistake}) \leq \int \{p(\mathbf{x},$ …
1.20 ( ) www In this exercise, we explore the behaviour of the Gaussian distribution in high-dimensional spaces. Consider a Gaussian distribution in D dimensions given by $p(\mathbf{x}) = \dfrac{1}{(2\pi\sigma^2)^{D/2}} \exp\!\left(-\dfrac{\|\mathbf{x}\|^2}{2\sigma^2}\right)$ (1.147). We wish to find the density with respect to radius in polar coordinates in which …
1.19 ( ) Consider a sphere of radius a in D-dimensions together with the concentric hypercube of side 2a, so that the sphere touches the hypercube at the centres of each of its sides. By using the results of Exercise 1.18, show that the ratio of the volume of the sphere to the volume of the cube
1.7 ( ) www In this exercise, we prove the normalization condition (1.48) for the univariate Gaussian. To do this, consider the integral $I = \int_{-\infty}^{\infty} \exp\!\left(-\dfrac{1}{2\sigma^2}x^2\right)\mathrm{d}x$ (1.124), which we can evaluate by first writing its square in the form $I^2 = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} \exp\!\left(-\dfrac{1}{2\sigma^2}x^2 - \dfrac{1}{2\sigma^2}y^2\right)\mathrm{d}x\,\mathrm{d}y$ …
1.6 () Show that if two variables x and y are independent, then their covariance is zero.