Pattern Recognition and Machine Learning, 1st Edition, Christopher M. Bishop - Solutions
6.22 ( ) Consider a regression problem with N training set input vectors x1, . . . , xN and L test set input vectors xN+1, . . . , xN+L, and suppose we define a Gaussian process prior over functions t(x). Derive an expression for the joint predictive distribution for t(xN+1), . . . , t(xN+L), given
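One way to organize the computation (a sketch using the standard Gaussian process regression identities applied jointly to all L test inputs, rather than the book's own derivation; the RBF kernel, noise precision beta, and toy data below are illustrative assumptions):

import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential kernel between two sets of scalar inputs (illustrative choice).
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale ** 2)

def gp_joint_predictive(x, t, x_star, beta=25.0):
    # Joint Gaussian predictive over all L test inputs at once:
    #   mean = K_*^T C_N^{-1} t,  cov = C_* - K_*^T C_N^{-1} K_*,
    # with C_N = K(X, X) + (1/beta) I and C_* = K(X_*, X_*) + (1/beta) I
    # (cf. the single-test-point results (6.66) and (6.67)).
    K = rbf_kernel(x, x)                      # N x N training covariance
    K_s = rbf_kernel(x, x_star)               # N x L cross-covariance
    K_ss = rbf_kernel(x_star, x_star)         # L x L test covariance
    C_N = K + np.eye(len(x)) / beta
    mean = K_s.T @ np.linalg.solve(C_N, t)
    cov = K_ss + np.eye(len(x_star)) / beta - K_s.T @ np.linalg.solve(C_N, K_s)
    return mean, cov

# Toy usage with made-up data.
x = np.linspace(0.0, 1.0, 5)
t = np.sin(2.0 * np.pi * x)
mean, cov = gp_joint_predictive(x, t, np.array([0.25, 0.75]))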
6.21 ( ) www Consider a Gaussian process regression model in which the kernel function is defined in terms of a fixed set of nonlinear basis functions. Show that the predictive distribution is identical to the result (3.58) obtained in Section 3.3.2 for the Bayesian linear regression model. To do
6.20 ( ) www Verify the results (6.66) and (6.67).
6.19 ( ) Another viewpoint on kernel regression comes from a consideration of regression problems in which the input variables as well as the target variables are corrupted with additive noise. Suppose each target value tn is generated as usual by taking a function y(zn) evaluated at a point zn,
6.18 () Consider a Nadaraya-Watson model with one input variable x and one target variable t having Gaussian components with isotropic covariances, so that the covariance matrix is given by σ²I where I is the unit matrix. Write down expressions for the conditional density p(t|x) and for the
6.17 ( ) www Consider the sum-of-squares error function (6.39) for data having noisy inputs, where ν(ξ) is the distribution of the noise. Use the calculus of variations to minimize this error function with respect to the function y(x), and hence show that the optimal solution is given by an
6.16 ( ) Consider a parametric model governed by the parameter vector w together with a data set of input values x1, . . . , xN and a nonlinear feature mapping φ(x). Suppose that the dependence of the error function on w takes the form J(w) = f(wTφ(x1), . . . , wTφ(xN)) + g(wTw) (6.97) where g(·)
6.15 () By considering the determinant of a 2 × 2 Gram matrix, show that a positive-definite kernel function k(x, x′) satisfies the Cauchy-Schwartz inequality k(x1, x2)² ≤ k(x1, x1)k(x2, x2). (6.96)
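One way to see this (a sketch rather than the book's worked solution): for a valid kernel the 2 × 2 Gram matrix K with entries Kij = k(xi, xj) is positive semidefinite, so det K = k(x1, x1)k(x2, x2) − k(x1, x2)² ≥ 0, which rearranges to the stated inequality.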
6.14 () www Write down the form of the Fisher kernel, defined by (6.33), for the case of a distribution p(x|μ) = N(x|μ, S) that is Gaussian with mean μ and fixed covariance S.
6.13 () Show that the Fisher kernel, defined by (6.33), remains invariant if we make a nonlinear transformation of the parameter vector θ → ψ(θ), where the function ψ(·) is invertible and differentiable.
6.12 ( ) www Consider the space of all possible subsets A of a given fixed set D. Show that the kernel function (6.27) corresponds to an inner product in a feature space of dimensionality 2^|D| defined by the mapping φ(A) where A is a subset of D and the element φU(A), indexed by the subset U, is
6.11 () By making use of the expansion (6.25), and then expanding the middle factor as a power series, show that the Gaussian kernel (6.23) can be expressed as the inner product of an infinite-dimensional feature vector.
6.10 () Show that an excellent choice of kernel for learning a function f(x) is given by k(x, x′) = f(x)f(x′) by showing that a linear learning machine based on this kernel will always find a solution proportional to f(x).
6.9 () Verify the results (6.21) and (6.22) for constructing valid kernels.
6.8 () Verify the results (6.19) and (6.20) for constructing valid kernels.
6.7 () www Verify the results (6.17) and (6.18) for constructing valid kernels.
6.6 () Verify the results (6.15) and (6.16) for constructing valid kernels.
6.5 () www Verify the results (6.13) and (6.14) for constructing valid kernels.
6.4 () In Appendix C, we give an example of a matrix that has positive elements but that has a negative eigenvalue and hence that is not positive definite. Find an example of the converse property, namely a 2 × 2 matrix with positive eigenvalues yet that has at least one negative element.
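One possible answer (there are many): the symmetric matrix with rows (2, −1) and (−1, 2) has eigenvalues 1 and 3, both positive, yet it contains negative off-diagonal elements.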
6.3 () The nearest-neighbour classifier (Section 2.5.2) assigns a new input vector x to the same class as that of the nearest input vector xn from the training set, where in the simplest case, the distance is defined by the Euclidean metric ‖x − xn‖². By expressing this rule in terms of scalar
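As a hint at the intended manipulation: ‖x − xn‖² = xTx − 2xTxn + xnTxn, so the classification rule can be written entirely in terms of scalar products, which can then be replaced by a kernel k(x, xn).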
6.2 ( ) In this exercise, we develop a dual formulation of the perceptron learning algorithm. Using the perceptron learning rule (4.55), show that the learned weight vector w can be written as a linear combination of the vectors tnφ(xn) where tn ∈{−1, +1}. Denote the coefficients of this
6.1 ( ) www Consider the dual formulation of the least squares linear regression problem given in Section 6.1. Show that the solution for the components an of the vector a can be expressed as a linear combination of the elements of the vector φ(xn). Denoting these coefficients by the vector w, show
5.41 ( ) By following analogous steps to those given in Section 5.7.1 for regression networks, derive the result (5.183) for the marginal likelihood in the case of a network having a cross-entropy error function and logistic-sigmoid output-unit activation function.
5.40 () www Outline the modifications needed to the framework for Bayesian neural networks, discussed in Section 5.7.3, to handle multiclass problems using networks having softmax output-unit activation functions.
5.39 () www Make use of the Laplace approximation result (4.135) to show that the evidence function for the hyperparameters α and β in the Bayesian neural network model can be approximated by (5.175).
5.38 () Using the general result (2.115), derive the predictive distribution (5.172) for the Laplace approximation to the Bayesian neural network model.
5.37 () Verify the results (5.158) and (5.160) for the conditional mean and variance of the mixture density network model.
5.36 () Derive the result (5.157) for the derivative of the error function with respect to the network output activations controlling the component variances in the mixture density network.
5.35 () Derive the result (5.156) for the derivative of the error function with respect to the network output activations controlling the component means in the mixture density network.
5.34 () www Derive the result (5.155) for the derivative of the error function with respect to the network output activations controlling the mixing coefficients in the mixture density network.
5.33 () Write down a pair of equations that express the Cartesian coordinates (x1, x2) for the robot arm shown in Figure 5.18 in terms of the joint angles θ1 and θ2 and the lengths L1 and L2 of the links. Assume the origin of the coordinate system is given by the attachment point of the lower
5.32 ( ) Show that the derivatives of the mixing coefficients {πk}, defined by (5.146), with respect to the auxiliary parameters {ηj} are given by ∂πk/∂ηj = δjk πj − πj πk. (5.208) Hence, by making use of the constraint Σk πk = 1, derive the result (5.147).
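A brief sketch, assuming the mixing coefficients are softmax functions of the auxiliary parameters, πk = exp(ηk)/Σl exp(ηl): differentiating this quotient with respect to ηj gives ∂πk/∂ηj = δjk πj − πj πk, as stated in (5.208).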
5.31 () Verify the result (5.143).
5.30 () Verify the result (5.142).
5.29 () www Verify the result (5.141).
5.28 () www Consider a neural network, such as the convolutional network discussed in Section 5.5.6, in which multiple weights are constrained to have the same value. Discuss how the standard backpropagation algorithm must be modified in order to ensure that such constraints are satisfied when
5.27 ( ) www Consider the framework for training with transformed data in the special case in which the transformation consists simply of the addition of random noise x → x + ξ where ξ has a Gaussian distribution with zero mean and unit covariance. By following an argument analogous to that of
5.26 ( ) Consider a multilayer perceptron with arbitrary feed-forward topology, which is to be trained by minimizing the tangent propagation error function (5.127) in which the regularizing function is given by (5.128). Show that the regularization term Ω can be written as a sum over patterns of
5.25 ( ) www Consider a quadratic error function of the form E = E0 + (1/2)(w − w∗)TH(w − w∗) (5.195) where w∗ represents the minimum, and the Hessian matrix H is positive definite and constant. Suppose the initial weight vector w(0) is chosen to be at the origin and is updated using simple gradient
5.24 () Verify that the network function defined by (5.113) and (5.114) is invariant under the transformation (5.115) applied to the inputs, provided the weights and biases are simultaneously transformed using (5.116) and (5.117). Similarly, show that the network outputs can be transformed
5.23 ( ) Extend the results of Section 5.4.5 for the exact Hessian of a two-layer network to include skip-layer connections that go directly from inputs to outputs.
5.22 ( ) Derive the results (5.93), (5.94), and (5.95) for the elements of the Hessian matrix of a two-layer feed-forward network by application of the chain rule of calculus.
5.21 ( ) Extend the expression (5.86) for the outer product approximation of the Hessian matrix to the case of K > 1 output units. Hence, derive a recursive expression analogous to (5.87) for incrementing the number N of patterns and a similar expression for incrementing the number K of outputs. Use
5.20 () Derive an expression for the outer product approximation to the Hessian matrix for a network having K outputs with a softmax output-unit activation function and a cross-entropy error function, corresponding to the result (5.84) for the sum-of-squares error function.
5.19 () www Derive the expression (5.85) for the outer product approximation to the Hessian matrix for a network having a single output with a logistic sigmoid output-unit activation function and a cross-entropy error function, corresponding to the result (5.84) for the sum-of-squares error
5.18 () Consider a two-layer network of the form shown in Figure 5.1 with the addition of extra parameters corresponding to skip-layer connections that go directly from the inputs to the outputs. By extending the discussion of Section 5.3.2, write down the equations for the derivatives of the
5.17 () Consider a squared loss function of the form E = (1/2) ∫∫ {y(x,w) − t}² p(x, t) dx dt (5.193) where y(x,w) is a parametric function such as a neural network. The result (1.89) shows that the function y(x,w) that minimizes this error is given by the conditional expectation of t given x. Use this
5.16 () The outer product approximation to the Hessian matrix for a neural network using a sum-of-squares error function is given by (5.84). Extend this result to the case of multiple outputs.
5.15 ( ) In Section 5.3.4, we derived a procedure for evaluating the Jacobian matrix of a neural network using a backpropagation procedure. Derive an alternative formalism for finding the Jacobian based on forward propagation equations.
5.14 () By making a Taylor expansion, verify that the terms that are O(ε) cancel on the right-hand side of (5.69).
5.13 () Show that as a consequence of the symmetry of the Hessian matrix H, the number of independent elements in the quadratic error function (5.28) is given by W(W + 3)/2.
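The counting behind this result: the quadratic form (5.28) involves a gradient vector with W independent elements and a symmetric W × W Hessian with W(W + 1)/2 independent elements, and W + W(W + 1)/2 = W(W + 3)/2.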
5.12 ( ) www By considering the local Taylor expansion (5.32) of an error function about a stationary point w∗, show that the necessary and sufficient condition for the stationary point to be a local minimum of the error function is that the Hessian matrix H, defined by (5.30) with w = w∗, be
5.11 ( ) www Consider a quadratic error function defined by (5.32), in which the Hessian matrix H has an eigenvalue equation given by (5.33). Show that the contours of constant error are ellipses whose axes are aligned with the eigenvectors ui, with lengths that are inversely proportional to the
5.10 () www Consider a Hessian matrix H with eigenvector equation (5.33). By setting the vector v in (5.39) equal to each of the eigenvectors ui in turn, show that H is positive definite if, and only if, all of its eigenvalues are positive.
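Sketch of the argument: expanding an arbitrary vector as v = Σi ci ui gives vTHv = Σi λi ci², which is positive for every nonzero v exactly when all the eigenvalues λi are positive.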
5.9 () www The error function (5.21) for binary classification problems was derived for a network having a logistic-sigmoid output activation function, so that 0 ≤ y(x,w) ≤ 1, and data having target values t ∈ {0, 1}. Derive the corresponding error function if we consider a network having an
5.8 () We saw in (4.88) that the derivative of the logistic sigmoid activation function can be expressed in terms of the function value itself. Derive the corresponding result for the ‘tanh’ activation function defined by (5.59).
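For reference, the expected identity is d tanh(a)/da = 1 − tanh²(a), obtained by applying the quotient rule to tanh(a) = (e^a − e^−a)/(e^a + e^−a).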
5.7 () Show the derivative of the error function (5.24) with respect to the activation ak for output units having a softmax activation function satisfies (5.18).
5.6 () www Show the derivative of the error function (5.21) with respect to the activation ak for an output unit having a logistic sigmoid activation function satisfies (5.18).
5.5 () www Show that maximizing likelihood for a multiclass neural network model in which the network outputs have the interpretation yk(x,w) = p(tk = 1|x) is equivalent to the minimization of the cross-entropy error function (5.24).
5.4 ( ) Consider a binary classification problem in which the target values are t ∈ {0, 1}, with a network output y(x,w) that represents p(t = 1|x), and suppose that there is a probability ε that the class label on a training data point has been incorrectly set. Assuming independent and
5.3 ( ) Consider a regression problem involving multiple target variables in which it is assumed that the distribution of the targets, conditioned on the input vector x, is a Gaussian of the form p(t|x,w) = N(t|y(x,w),Σ) (5.192) where y(x,w) is the output of a neural network with input vector x and
5.2 () www Show that maximizing the likelihood function under the conditional distribution (5.16) for a multioutput neural network is equivalent to minimizing the sum-of-squares error function (5.11).
5.1 ( ) Consider a two-layer network function of the form (5.7) in which the hidden-unit nonlinear activation functions g(·) are given by logistic sigmoid functions of the form σ(a) = {1 + exp(−a)}⁻¹. (5.191) Show that there exists an equivalent network, which computes exactly the same
4.26 ( ) In this exercise, we prove the relation (4.152) for the convolution of a probit function with a Gaussian distribution. To do this, show that the derivative of the left-hand side with respect to μ is equal to the derivative of the right-hand side, and then integrate both sides with respect
4.25 ( ) Suppose we wish to approximate the logistic sigmoid σ(a) defined by (4.59) by a scaled probit function Φ(λa), where Φ(a) is defined by (4.114). Show that if λ is chosen so that the derivatives of the two functions are equal at a = 0, then λ² = π/8.
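A quick check of the scaling: σ′(0) = σ(0){1 − σ(0)} = 1/4, while (d/da)Φ(λa) evaluated at a = 0 is λN(0|0, 1) = λ/√(2π); equating the two gives λ = √(2π)/4 and hence λ² = 2π/16 = π/8.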
4.24 ( ) Use the results from Section 2.3.2 to derive the result (4.151) for the marginalization of the logistic regression model with respect to a Gaussian posterior distribution over the parameters w.
4.23 ( ) www In this exercise, we derive the BIC result (4.139) starting from the Laplace approximation to the model evidence given by (4.137). Show that if the prior over parameters is Gaussian of the form p(θ) = N(θ|m,V0), the log model evidence under the Laplace approximation takes the form ln
4.22 () Using the result (4.135), derive the expression (4.137) for the log model evidence under the Laplace approximation.
4.21 () Show that the probit function (4.114) and the erf function (4.115) are related by (4.116).
4.20 ( ) Show that the Hessian matrix for the multiclass logistic regression problem, defined by (4.110), is positive semidefinite. Note that the full Hessian matrix for this problem is of size MK × MK, where M is the number of parameters and K is the number of classes. To prove the positive
4.19 () www Write down expressions for the gradient of the log likelihood, as well as the corresponding Hessian matrix, for the probit regression model defined in Section 4.3.5. These are the quantities that would be required to train such a model using IRLS.
4.18 () Using the result (4.91) for the derivatives of the softmax activation function, show that the gradients of the cross-entropy error (4.108) are given by (4.109).
4.17 () www Show that the derivatives of the softmax activation function (4.104), where the ak are defined by (4.105), are given by (4.106).
4.16 () Consider a binary classification problem in which each observation xn is known to belong to one of two classes, corresponding to t = 0 and t = 1, and suppose that the procedure for collecting training data is imperfect, so that training points are sometimes mislabelled. For every data
4.15 ( ) Show that the Hessian matrix H for the logistic regression model, given by (4.97), is positive definite. Here R is a diagonal matrix with elements yn(1 − yn), and yn is the output of the logistic regression model for input vector xn. Hence show that the error function is a concave
4.14 () Show that for a linearly separable data set, the maximum likelihood solution for the logistic regression model is obtained by finding a vector w whose decision boundary wTφ(x) = 0 separates the classes and then taking the magnitude of w to infinity.
4.13 () www By making use of the result (4.88) for the derivative of the logistic sigmoid, show that the derivative of the error function (4.90) for the logistic regression model is given by (4.91).
4.12 () www Verify the relation (4.88) for the derivative of the logistic sigmoid function defined by (4.59).
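A one-line check: with σ(a) = {1 + exp(−a)}⁻¹, direct differentiation gives dσ/da = exp(−a)/{1 + exp(−a)}² = σ(a){1 − σ(a)}, which is the relation (4.88).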
4.11 ( ) Consider a classification problem with K classes for which the feature vector φ has M components each of which can take L discrete states. Let the values of the components be represented by a 1-of-L binary coding scheme. Further suppose that, conditioned on the class Ck, the M components
4.10 ( ) Consider the classification model of Exercise 4.9 and now suppose that the class-conditional densities are given by Gaussian distributions with a shared covariance matrix, so that p(φ|Ck) = N(φ|μk,Σ). (4.160) Show that the maximum likelihood solution for the mean of the Gaussian
4.9 () www Consider a generative classification model for K classes defined by prior class probabilities p(Ck) = πk and general class-conditional densities p(φ|Ck) where φ is the input feature vector. Suppose we are given a training data set {φn, tn} where n = 1, . . . , N, and tn is a binary
4.8 () Using (4.57) and (4.58), derive the result (4.65) for the posterior class probability in the two-class generative model with Gaussian densities, and verify the results (4.66) and (4.67) for the parameters w and w0.
4.7 () www Show that the logistic sigmoid function (4.59) satisfies the property σ(−a) = 1 − σ(a) and that its inverse is given by σ⁻¹(y) = ln{y/(1 − y)}.
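Both parts follow from short manipulations: σ(−a) = 1/(1 + e^a) = e^−a/(1 + e^−a) = 1 − σ(a), and solving y = {1 + exp(−a)}⁻¹ for a gives a = ln{y/(1 − y)}.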
4.6 () Using the definitions of the between-class and within-class covariance matrices given by (4.27) and (4.28), respectively, together with (4.34) and (4.36) and the choice of target values described in Section 4.1.5, show that the expression (4.33) that minimizes the sum-of-squares error
4.5 () By making use of (4.20), (4.23), and (4.24), show that the Fisher criterion (4.25) can be written in the form (4.26).
4.4 () www Show that maximization of the class separation criterion given by (4.23) with respect to w, using a Lagrange multiplier to enforce the constraint wTw = 1, leads to the result that w ∝ (m2 − m1).
4.3 ( ) Extend the result of Exercise 4.2 to show that if multiple linear constraints are satisfied simultaneously by the target vectors, then the same constraints will also be satisfied by the least-squares prediction of a linear model.
4.2 ( ) www Consider the minimization of a sum-of-squares error function (4.15), and suppose that all of the target vectors in the training set satisfy a linear constraint aTtn + b = 0 (4.157) where tn corresponds to the nth row of the matrix T in (4.15). Show that as a consequence of this
4.1 ( ) Given a set of data points {xn}, we can define the convex hull to be the set of all points x given by x = Σn αn xn (4.156) where αn ≥ 0 and Σn αn = 1. Consider a second set of points {yn} together with their corresponding convex hull. By definition, the two sets of points will be linearly
3.24 ( ) Repeat the previous exercise but now use Bayes’ theorem in the form p(t) = p(t|w, β)p(w, β)/p(w, β|t) (3.119) and then substitute for the prior and posterior distributions and the likelihood function in order to derive the result (3.118).
3.23 ( ) www Show that the marginal probability of the data, in other words the model evidence, for the model described in Exercise 3.12 is given by p(t) = (1/(2π)^(N/2)) (b0^a0/bN^aN) (Γ(aN)/Γ(a0)) (|SN|^(1/2)/|S0|^(1/2)) (3.118) by first marginalizing with respect to w and then with respect to β.
3.22 ( ) Starting from (3.86) verify all of the steps needed to show that maximization of the log marginal likelihood function (3.86) with respect to β leads to the re-estimation equation (3.95).
3.21 ( ) An alternative way to derive the result (3.92) for the optimal value of α in the evidence framework is to make use of the identity (d/dα) ln |A| = Tr(A⁻¹ (d/dα)A). (3.117) Prove this identity by considering the eigenvalue expansion of a real, symmetric matrix A, and making use of the
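Sketch via the eigen-expansion: for a real symmetric A with eigenvalues λi and orthonormal eigenvectors ui (assumed distinct for simplicity), ln |A| = Σi ln λi and, to first order in perturbation theory, dλi/dα = uiT(dA/dα)ui, so (d/dα) ln |A| = Σi λi⁻¹ uiT(dA/dα)ui = Tr(A⁻¹ dA/dα).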
3.20 ( ) www Starting from (3.86) verify all of the steps needed to show that maximization of the log marginal likelihood function (3.86) with respect to α leads to the re-estimation equation (3.92).
3.19 ( ) Show that the integration over w in the Bayesian linear regression model gives the result (3.85). Hence show that the log marginal likelihood is given by (3.86).
3.18 ( ) www By completing the square over w, show that the error function (3.79) in Bayesian linear regression can be written in the form (3.80).
3.17 () Show that the evidence function for the Bayesian linear regression model can be written in the form (3.78) in which E(w) is defined by (3.79).
3.16 ( ) Derive the result (3.86) for the log evidence function p(t|α, β) of the linear regression model by making use of (2.115) to evaluate the integral (3.77) directly.
3.15 () www Consider a linear basis function model for regression in which the parameters α and β are set using the evidence framework. Show that the function E(mN) defined by (3.82) satisfies the relation 2E(mN) = N.
3.14 ( ) In this exercise, we explore in more detail the properties of the equivalent kernel defined by (3.62), where SN is defined by (3.54). Suppose that the basis functions φj(x) are linearly independent and that the number N of data points is greater than the number M of basis functions.