Foundations of Machine Learning, 2nd Edition, Mehryar Mohri, Afshin Rostamizadeh - Solutions
13.2 Stability analysis for L2-regularized conditional Maxent. (a) Give an upper bound on the stability of L2-regularized conditional Maxent in terms of the sample size and the regularization parameter λ (Hint: use the techniques and results of Chapter 14). (b) Use the previous question to derive a stability-based…
13.1 Extension to Bregman divergences. (a) Show how conditional Maxent models can be extended by using arbitrary Bregman divergences instead of the (unnormalized) relative entropy. (b) Prove a duality theorem similar to Theorem 13.1 for these extensions. (c) Derive theoretical guarantees for these…
12.5 L2-regularization. Let w be the solution of Maxent with a norm-2 squared regularization. (a) Prove the following inequality: ‖w‖₂ ≤ 2r/λ (Hint: you could compare the values of the objective function at w and 0). Generalize this result to other ‖·‖_p^p-regularizations with p > 1. (b) Use the…
12.4 Extension to Bregman divergences. Derive theoretical guarantees for the extensions discussed in Section 12.8. What additional property is needed for the Bregman divergence so that your learning guarantees hold?
12.3 Dual of norm-2 squared regularized Maxent. Derive the dual formulation of the norm-2 squared regularized Maxent optimization shown in equation (12.16).
12.2 Lagrange duality. Derive the dual problem of the Maxent problem and justify it carefully in the case of the stricter constraint of positivity for the distribution p: p(x) > 0 for all x ∈ X.
12.1 Convexity. Prove directly that the function w ↦ log Z(w) = log(∑_{x∈X} e^{w·Φ(x)}) is convex (Hint: compute its Hessian).
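A quick numerical sanity check of this convexity claim, a sketch only (the feature matrix Φ below is made up): the Hessian of log Z at w is Φᵀ(diag(p) − p pᵀ)Φ, where p is the Gibbs distribution induced by w, so its smallest eigenvalue should be nonnegative.

```python
# Sanity check (not a proof): the Hessian of w -> log Z(w) = log sum_x exp(w . phi(x))
# equals Phi^T (diag(p) - p p^T) Phi and should be positive semidefinite.
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 5))        # one feature vector phi(x) per row (assumed data)
w = rng.normal(size=5)

z = Phi @ w
p = np.exp(z - z.max())
p /= p.sum()                          # Gibbs distribution p_w(x)

H = Phi.T @ (np.diag(p) - np.outer(p, p)) @ Phi   # Hessian of log Z at w
print(np.linalg.eigvalsh(H).min() >= -1e-10)      # True: PSD, consistent with convexity
```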
11.11 On-line quadratic SVR. Derive an on-line algorithm for the quadratic SVR algorithm (provide the full pseudocode).
11.10 On-line Lasso. Use the formulation (11.33) of the optimization problem of Lasso and stochastic gradient descent (see section 8.3.1) to show that the problem can be solved using the on-line algorithm of figure 11.9.
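For intuition, here is a rough proximal-SGD sketch for Lasso; it is not necessarily the exact algorithm of figure 11.9, and all data below are synthetic.

```python
# Proximal stochastic gradient descent for Lasso (a sketch): one squared-error
# gradient step per sample, followed by soft-thresholding for the L1 penalty.
import numpy as np

def lasso_sgd(X, y, lam=0.1, eta=0.01, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(m):
            grad = (X[i] @ w - y[i]) * X[i]   # gradient of 1/2 (w.x_i - y_i)^2
            w = w - eta * grad
            w = np.sign(w) * np.maximum(np.abs(w) - eta * lam, 0.0)  # prox of lam*||w||_1
    return w

# toy usage with synthetic data
X = np.random.default_rng(1).normal(size=(100, 10))
w_true = np.zeros(10); w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true
print(np.round(lasso_sgd(X, y), 2))
```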
11.9 Leave-one-out error. In general, the computation of the leave-one-out error can be very costly since, for a sample of size m, it requires training the algorithm m times. The objective of this problem is to show that, remarkably, in the case of kernel ridge regression, the leave-one-out error…
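A sketch of the standard kernel ridge regression leave-one-out shortcut this exercise points toward (the exact constants in the book's formulation may differ): with H = K(K + λI)⁻¹, the LOO residual at point i is (y_i − ŷ_i)/(1 − H_ii), so no retraining is needed; the toy data below are made up.

```python
# Verify the KRR leave-one-out shortcut against brute-force retraining.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)
lam = 0.5

def gram(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)                  # Gaussian kernel (arbitrary choice)

K = gram(X, X)
m = len(y)
H = K @ np.linalg.solve(K + lam * np.eye(m), np.eye(m))
shortcut = (y - H @ y) / (1.0 - np.diag(H))     # LOO residuals without retraining

brute = np.empty(m)
for i in range(m):
    idx = np.delete(np.arange(m), i)
    alpha = np.linalg.solve(K[np.ix_(idx, idx)] + lam * np.eye(m - 1), y[idx])
    brute[i] = y[i] - K[i, idx] @ alpha         # retrain without point i

print(np.allclose(shortcut, brute))             # True
```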
11.8 Optimal kernel matrix. Suppose that, in addition to optimizing the dual variables α ∈ R^m as in (11.16), we also wish to optimize over the entries of the PDS kernel matrix K ∈ R^{m×m}: min_{K⪰0} max_α −λαᵀα − αᵀKα + 2αᵀy, s.t. ‖K‖₂ ≤ 1. (a) What is the closed-form solution for the optimal K for the joint…
11.7 SVR dual formulations. Give a detailed and carefully justified derivation of the dual formulations of the SVR algorithm both for the ε-insensitive loss and the quadratic ε-insensitive loss.
11.6 SVR and squared loss. Assuming that 2rΛ ≥ 1, use theorem 11.13 to derive a generalization bound for the squared loss.
11.5 Huber loss. Derive the primal and dual optimization problem used to solve the SVR problem with the Huber loss: L_c(ξ_i) = (1/2)ξ_i² if |ξ_i| ≤ c, and c|ξ_i| − (1/2)c² otherwise, where ξ_i = w·Φ(x_i) + b − y_i.
11.4 Perturbed kernels. Suppose two different kernel matrices, K and K′, are used to train two kernel ridge regression hypotheses with the same regularization parameter λ. In this problem, we will show that the difference in the optimal dual variables, α and α′ respectively, is bounded by a quantity…
11.3 Linear regression. (a) What condition is required on the data X in order to guarantee that XX⊤ is invertible? (b) Assume the problem is under-determined. Then, we can choose a solution w such that the equality X⊤w = X⊤(XX⊤)†Xy (which can be shown to equal X†Xy) holds. One particular choice that…
11.2 Pseudo-dimension of linear functions. Let H be the set of all linear functions in dimension d, i.e. h(x) = w⊤x for some w ∈ R^d. Show that Pdim(H) = d.
11.1 Pseudo-dimension and monotonic functions. Assume that φ is a strictly monotonic function and let φ∘H be the family of functions defined by φ∘H = {φ(h(·)) : h ∈ H}, where H is some set of real-valued functions. Show that Pdim(φ∘H) = Pdim(H).
10.10 k-partite weight function. Show how the weight function ω can be defined so that L_ω encodes the natural loss function associated to a k-partite ranking scenario.
10.9 Deviation bound for the AUC. Let h be a fixed scoring function used to rank the points of X. Use Hoeffding's bound to show that with high probability the AUC of h for a finite sample is close to its average.
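For concreteness, a small sketch of the empirical AUC of a fixed scoring function on a finite sample (the data and the scoring function below are made up): it is the fraction of (positive, negative) pairs that h orders correctly, with ties counted as one half.

```python
# Empirical AUC of a fixed scoring function h on a finite sample.
import numpy as np

def empirical_auc(scores_pos, scores_neg):
    diffs = scores_pos[:, None] - scores_neg[None, :]
    return (diffs > 0).mean() + 0.5 * (diffs == 0).mean()   # ties count as 1/2

rng = np.random.default_rng(0)
h = lambda x: x @ np.array([1.0, -0.5])          # fixed linear scoring function (assumed)
X_pos = rng.normal(loc=0.5, size=(40, 2))
X_neg = rng.normal(loc=-0.5, size=(60, 2))
print(empirical_auc(h(X_pos), h(X_neg)))
```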
10.8 Multipartite ranking. Consider the ranking scenario in a k-partite setting where X is partitioned into k subsets X₁, …, X_k with k ≥ 1. The bipartite case (k = 2) is already specifically examined in the chapter. Give a precise formulation of the problem in terms of k distributions. Does…
10.7 Bipartite ranking. Suppose that we use a binary classifier for ranking in the bipartite setting. Prove that if the error of the binary classifier is ε, then that of the ranking it induces is also at most ε. Show that the converse does not hold.
10.6 Margin-maximization ranking. Give a linear programming (LP) algorithm returning a linear hypothesis for pairwise ranking based on margin maximization.
10.5 RankPerceptron. Adapt the Perceptron algorithm to derive a pairwise ranking algorithm based on a linear scoring function. Assume that the training sample is linearly separable for pairwise ranking. Give an upper bound on the number of updates made by the algorithm in terms of the ranking margin.
10.4 Margin maximization and RankBoost. Give an example showing that RankBoost does not achieve the maximum margin, as in the case of AdaBoost.
10.3 Empirical margin loss of RankBoost. Derive an upper bound on the empirical pairwise ranking margin loss of RankBoost similar to that of theorem 7.7 for AdaBoost.
10.2 On-line ranking. Give an on-line version of the SVM-based ranking algorithm presented in section 10.3.
10.1 Uniform margin-bound for ranking. Use theorem 10.1 to derive a margin-based learning bound for ranking that holds uniformly for all ρ > 0 (see similar binary classification bounds of theorem 5.9 and exercise 5.2).
9.6 Give an example where the generalization error of each of the k(k−1)/2 binary classifiers h_{ll′}, l ≠ l′, used in the definition of the OVO technique is r and that of the OVO hypothesis is (k − 1)r.
9.5 Decision trees. Show that the VC-dimension of a binary decision tree with n nodes in dimension N is in O(n log N).
9.4 Multi-class algorithm based on RankBoost. This problem requires familiarity with the material presented both in this chapter and in chapter 10. An alternative boosting-type multi-class classification algorithm is one based on a ranking criterion. We will define and examine that algorithm in the…
9.3 Alternative multi-class boosting algorithm. Consider the objective function G defined for any sample S = ((x₁, y₁), …, (x_m, y_m)) ∈ (X × Y)^m and α = (α₁, …, α_n) ∈ R^n, n ≥ 1, by G(α) = ∑_{i=1}^m e^{−(1/k) ∑_{l=1}^k y_i[l] f_n(x_i, l)} = ∑_{i=1}^m e^{−(1/k) ∑_{l=1}^k y_i[l] ∑_{t=1}^n α_t h_t(x_i, l)}. (9.25) Use the convexity of…
9.2 Multi-class classification with kernel-based hypotheses constrained by an Lp norm. Use corollary 9.4 to define alternative multi-class classification algorithms with kernel-based hypotheses constrained by an Lp norm with p ≠ 2. For which value of p ≥ 1 is the bound of proposition 9.3 tightest?
9.1 Generalization bounds for multi-label case. Use similar techniques to those used in the proof of theorem 9.2 to derive a margin-based learning bound in the multi-label case.
8.11 On-line to batch: kernel Perceptron margin bound. In this problem, we give a margin-based generalization guarantee for the kernel Perceptron algorithm. Let h₁, …, h_T be the sequence of hypotheses generated by the kernel Perceptron algorithm and let ĥ be defined as in exercise 8.10.
8.9 General inequality. In this exercise we generalize the result of exercise 8.7 by using a more general inequality: log(1 − x) ≥ −x − αx² for some 0 < α < 2. (a) First prove that the inequality is true for x ∈ [0, 1 − 1/(2α)]. What does this imply about the valid range of α? (b) Give a…
8.8 Polynomial weighted algorithm. The objective of this problem is to show how another regret minimization algorithm can be defined and studied. Let L be a loss function convex in its first argument and taking values in [0, M]. We will assume N > e² and then, for any expert i ∈ [N], we denote by r_{t,i}…
8.7 Second-order regret bound. Consider the randomized algorithm that differs from the RWM algorithm only by the weight update, i.e., w_{t+1,i} ← (1 − (1 − β) l_{t,i}) w_{t,i}, t ∈ [T], which is applied to all i ∈ [N] with 1/2 ≤ β < 1. This algorithm can be used in a more general setting than RWM since the losses…
8.6 Margin Perceptron. Given a training sample S that is linearly separable with a maximum margin ρ > 0, theorem 8.8 states that the Perceptron algorithm run cyclically over S is guaranteed to converge after at most R²/ρ² updates, where R is the radius of the sphere containing the sample points.
8.5 On-line SVM algorithm. Consider the algorithm described in figure 8.11. Show that this algorithm corresponds to the stochastic gradient descent technique applied to the SVM problem (5.24) with hinge loss and no offset (i.e., fix p = 1 and b = 0).
8.4 Tightness of lower bound. Is the lower bound of theorem 8.5 tight? Explain why or show a counter-example.
8.3 Sparse instances. Suppose each input vector x_t, t ∈ [T], coincides with the t-th unit vector of R^T. How many updates are required for the Perceptron algorithm to converge? Show that the number of updates matches the margin bound of theorem 8.8.
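A toy sketch of this setup (labels chosen as +1 here purely for illustration): every unit vector is orthogonal to the current weight vector the first time it is seen, so each of the T points triggers exactly one update, which matches the R²/ρ² bound since R = 1 and the best margin is 1/√T.

```python
# Count Perceptron updates on the T unit vectors of R^T.
import numpy as np

def perceptron_updates(X, y, max_epochs=100):
    w = np.zeros(X.shape[1])
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in zip(X, y):
            if label * (w @ x) <= 0:        # mistake (or on the boundary)
                w += label * x
                updates += 1
                mistakes += 1
        if mistakes == 0:
            return updates
    return updates

T = 10
X = np.eye(T)                # x_t = t-th unit vector of R^T
y = np.ones(T)               # labels chosen as +1 for illustration
print(perceptron_updates(X, y))   # T updates, matching R^2 / rho^2 = T here
```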
8.2 Generalized mistake bound. Theorem 8.8 presents a margin bound on the maximum number of updates for the Perceptron algorithm for the special case η = 1. Consider now the general Perceptron update w_{t+1} ← w_t + η y_t x_t, where η > 0. Prove a bound on the maximum number of mistakes. How does η affect the…
8.1 Perceptron lower bound. Let S be a labeled sample of m points in R^N with x_i = ((−1)^i, …, (−1)^i, (−1)^{i+1}, 0, …, 0), where the first i components are nonzero, and y_i = (−1)^{i+1}. (8.30) Show that the Perceptron algorithm makes Ω(2^N) updates before finding a separating hyperplane, regardless of the…
7.12 Empirical margin loss boosting. As discussed in the chapter, AdaBoost can be viewed as coordinate descent applied to a convex upper bound on the empirical error. Here, we consider an algorithm seeking to minimize the empirical margin loss. For any 0 < ρ ≤ 1, let R̂_{S,ρ}(f) = (1/m) ∑_{i=1}^m 1_{y_i f(x_i) ≤ ρ}…
7.10 Boosting in the presence of unknown labels. Consider the following variant of the classification problem where, in addition to the positive and negative labels +1 and −1, points may be labeled with 0. This can correspond to cases where the true label of a point is unknown, a situation that…
7.9 AdaBoost example. In this exercise we consider a concrete example that consists of eight training points and eight weak classifiers. (a) Define an m×n matrix M where M_{ij} = y_i h_j(x_i), i.e., M_{ij} = +1 if training example i is classified correctly by weak classifier h_j, and −1 otherwise. Let d_t, t ∈ …
7.8 Simplified AdaBoost. Suppose we simplify AdaBoost by setting the parameter α_t to a fixed value α_t = α > 0, independent of the boosting round t. (a) Let γ be such that (1/2 − ε_t) ≥ γ > 0. Find the best value of α as a function of γ by analyzing the empirical error. (b) For this value of α, does the algorithm…
7.7 Noise-tolerant AdaBoost. AdaBoost may be significantly overfitting in the presence of noise, in part due to the high penalization of misclassified examples. To reduce this effect, one could use instead the following objective function: F = ∑_{i=1}^m G(−y_i f(x_i)), (7.32) where G is the function defined…
7.6 Fix ε ∈ (0, 1/2). Let the training sample be defined by m points in the plane with m/4 negative points all at coordinate (1, 1), another m/4 negative points all at coordinate (−1, −1), m(1 − ε)/4 positive points all at coordinate (1, −1), and m(1 + ε)/4 positive points all at coordinate…
7.5 Define the unnormalized correlation of two vectors x and x′ as the inner product between these vectors. Prove that the distribution vector (D_{t+1}(1), …, D_{t+1}(m)) defined by AdaBoost and the vector with components y_i h_t(x_i) are uncorrelated.
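A toy numerical check of this property (the data and the weak classifier below are made up): after one AdaBoost reweighting step, the inner product between D_{t+1} and the vector of margins y_i h_t(x_i) vanishes.

```python
# After AdaBoost's reweighting step, D_{t+1} has zero inner product
# with the margin vector (y_i h_t(x_i))_i.
import numpy as np

rng = np.random.default_rng(0)
m = 12
y = rng.choice([-1, 1], size=m)      # labels
h_t = y.copy()
h_t[:4] *= -1                        # a weak classifier that errs on the first 4 points
D_t = np.full(m, 1.0 / m)            # current distribution (uniform here)

margins = y * h_t
eps_t = D_t[margins < 0].sum()                 # weighted error of h_t (= 1/3 here)
alpha_t = 0.5 * np.log((1 - eps_t) / eps_t)    # AdaBoost step size

D_next = D_t * np.exp(-alpha_t * margins)
D_next /= D_next.sum()                         # distribution D_{t+1}

print(np.isclose(D_next @ margins, 0.0))       # True
```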
7.4 Weighted instances. Let the training sample be S = ((x₁, y₁), …, (x_m, y_m)). Suppose we wish to penalize differently errors made on x_i versus x_j. To do that, we associate some non-negative importance weight w_i to each point x_i and define the objective function F(α) = ∑_{i=1}^m w_i e^{−y_i f(x_i)},…
7.3 Update guarantee. Assume that the main weak learner assumption of AdaBoost holds. Let h_t be the base learner selected at round t. Show that the base learner h_{t+1} selected at round t + 1 must be different from h_t.
7.2 Alternative objective functions. This problem studies boosting-type algorithms defined with objective functions different from that of AdaBoost. We assume that the training data are given as m labeled examples (x₁, y₁), …, (x_m, y_m) ∈ X × {−1, +1}. We further assume that Φ is a strictly…
7.1 VC-dimension of the hypothesis set of AdaBoost. Prove the upper bound on the VC-dimension of the hypothesis set F_T of AdaBoost after T rounds of boosting, as stated in equation (7.9).
6.22 Anomaly detection. For this problem, consider a Hilbert space H with associated feature map Φ: X → H and kernel K(x, x′) = Φ(x)·Φ(x′). (a) First, let us consider finding the smallest enclosing sphere for a given sample S = (x₁, …, x_m). Let c ∈ H denote the center of the sphere and let r > …
6.21 Mercer's condition. Let X ⊂ R^N be a compact set and K: X × X → R a continuous kernel function. Prove that if K verifies Mercer's condition (theorem 6.2), then it is PDS. (Hint: assume that K is not PDS and consider a set {x₁, …, x_m} ⊆ X and a column vector c ∈ R^{m×1} such that ∑_{i,j=1}^m …
6.20 n-gram kernel. Show that for all n ≥ 1, and any n-gram kernel K_n, K_n(x, y) can be computed in linear time O(|x| + |y|), for all x, y ∈ Σ*, assuming n and the alphabet size are constants.
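A simple sketch of one way to get this running time, assuming K_n(x, y) is the inner product of n-gram count vectors: building each count table with hashing costs O(|x|) and O(|y|), and the sparse dot product stays within the same budget.

```python
# n-gram kernel as a sparse dot product of n-gram count tables.
from collections import Counter

def ngram_kernel(x: str, y: str, n: int = 3) -> int:
    cx = Counter(x[i:i + n] for i in range(len(x) - n + 1))
    cy = Counter(y[i:i + n] for i in range(len(y) - n + 1))
    small, big = (cx, cy) if len(cx) <= len(cy) else (cy, cx)
    return sum(c * big[g] for g, c in small.items())

print(ngram_kernel("acgtacgt", "cgtacg", n=3))
```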
6.19 Sequence kernels. Let X = {a, c, g, t}. To classify DNA sequences using SVMs, we wish to define a kernel between sequences defined over X. We are given a finite set I ⊂ X* of non-coding regions (introns). For x ∈ X*, denote by |x| the length of x and by F(x) the set of factors of x, i.e., the set…
6.18 Metrics and Kernels. Let X be a non-empty set and K: X × X → R be a negative definite symmetric kernel such that K(x, x) = 0 for all x ∈ X. (a) Show that there exists a Hilbert space H and a mapping Φ(x) from X to H such that: K(x, x′) = ‖Φ(x) − Φ(x′)‖². Assume that K(x, x′) = 0 ⇒ x = x′. Use…
6.17 Relationship between NDS and PDS kernels. Prove the statement of theorem 6.17. (Hint: Use the fact that if K is PDS then exp(K) is also PDS, along with theorem 6.16.)
6.16 Fraud detection. To prevent fraud, a credit-card company decides to contact Professor Villebanque and provides him with a random list of several thousand fraudulent and non-fraudulent events. There are many different types of events, e.g., transactions of various amounts, changes of address or…
6.15 Image classification kernel. For α ≥ 0, the kernel K_α: (x, x′) ↦ ∑_{k=1}^N min(|x_k|^α, |x′_k|^α) (6.30) over R^N × R^N is used in image classification. Show that K_α is PDS for all α ≥ 0. To do so, proceed as follows. (a) Use the fact that (f, g) ↦ ∫_0^{+∞} f(t)g(t) dt is an inner product over the set of…
6.14 Classifier-based kernel. Let S be a training sample of size m. Assume that S has been generated according to some probability distribution D(x, y), where (x, y) ∈ X × {−1, +1}. (a) Define the Bayes classifier h*: X → {−1, +1}. Show that the kernel K* defined by K*(x, x′) = h*(x)h*(x′) for any…
6.13 High-dimensional mapping. Let Φ: X → H be a feature mapping such that the dimension N of H is very large and let K: X × X → R be a PDS kernel defined by K(x, x′) = E_{i∼D}[ [Φ(x)]_i [Φ(x′)]_i ], (6.27) where [Φ(x)]_i is the i-th component of Φ(x) (and similarly for Φ(x′)) and where D is a distribution over…
6.12 Explicit polynomial kernel mapping. Let K be a polynomial kernel of degree d, i.e., K: R^N × R^N → R, K(x, x′) = (x·x′ + c)^d, with c > 0. Show that the dimension of the feature space associated to K is the binomial coefficient (N + d choose d). (6.26) Write K in terms of the kernels k_i: (x, x′) ↦ (x·x′)^i, i ∈ {0, …, d}.
6.11 Explicit mappings. (a) Denote a data set x₁, …, x_m and a kernel K(x_i, x_j) with a Gram matrix K. Assuming K is positive semidefinite, give a map Φ(·) such that K(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩. (b) Show the converse of the previous statement, i.e., if there exists a mapping Φ(x) from input…
6.10 For any p > 0, let K_p be the kernel defined over R₊ × R₊ by K_p(x, y) = e^{−(x+y)^p}. (6.25) Show that K_p is positive definite symmetric (PDS) iff p ≤ 1. (Hint: you can use the fact that if K is NDS, then for any 0 < α ≤ 1, K^α is also NDS.)
6.9 Let H be a Hilbert space with the corresponding dot product ⟨·, ·⟩. Show that the kernel K defined over H × H by K(x, y) = 1 − ⟨x, y⟩ is negative definite.
6.8 Is the kernel K defined over R^n × R^n by K(x, y) = ‖x − y‖^{3/2} PDS? Is it NDS?
6.7 Define a difference kernel as K(x, x′) = |x − x′| for x, x′ ∈ R. Show that this kernel is not positive definite symmetric (PDS).
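A quick numerical illustration rather than a proof (the sample points are arbitrary): the Gram matrix of this kernel on a few real points already has a negative eigenvalue, so the kernel cannot be PDS.

```python
# Gram matrix of K(x, x') = |x - x'| on a few points has a negative eigenvalue.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 5.0])
K = np.abs(x[:, None] - x[None, :])
print(np.linalg.eigvalsh(K).min())   # negative, hence not PDS
```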
6.6 Show that the following kernels K are NDS: (a) K(x, y) = [sin(x − y)]² over R × R. (b) K(x, y) = log(x + y) over (0, +∞) × (0, +∞).
6.5 Set kernel. Let X be a finite set and let K₀ be a PDS kernel over X. Show that the kernel K′ defined by ∀A, B ∈ 2^X, K′(A, B) = ∑_{x∈A, x′∈B} K₀(x, x′) is a PDS kernel.
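A small sketch of this construction with an arbitrary choice of base kernel (a Gaussian kernel on the reals, purely for illustration): K′(A, B) sums the base kernel over all pairs drawn from the two sets.

```python
# Set kernel: sum a base PDS kernel over all cross pairs of two finite sets.
import numpy as np

def k0(x, xp, gamma=0.5):
    return np.exp(-gamma * (x - xp) ** 2)        # base kernel (arbitrary choice)

def k_prime(A, B):
    return sum(k0(x, xp) for x in A for xp in B)

A, B = {0.0, 1.0, 2.0}, {1.5, 3.0}
print(k_prime(A, B))
```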
6.4 Symmetric difference kernel. Let X be a finite set. Show that the kernel K defined over 2^X, the set of subsets of X, by ∀A, B ∈ 2^X, K(A, B) = exp(−|AΔB|/2), where AΔB is the symmetric difference of A and B, is PDS (Hint: you could use the fact that K is the result of the normalization of a kernel…
6.3 Graph kernel. Let G = (V, E) be an undirected graph with vertex set V and edge set E. V could represent a set of documents or biosequences and E the set of connections between them. Let w[e] ∈ R denote the weight assigned to edge e ∈ E. The weight of a path is the product of the weights of its…
6.2 Show that the following kernels K are PDS: (a) K(x, y) = cos(x − y) over R × R. (b) K(x, y) = cos(x² − y²) over R × R. (c) For all integers n > 0, K(x, y) = ∑_{i=1}^N cos^n(x_i² − y_i²) over R^N × R^N. (d) K(x, y) = (x + y)^{−1} over (0, +∞) × (0, +∞). (e) K(x, x′) = cos∠(x, x′) over R^n × R^n, where…
6.1 Let K: X × X → R be a PDS kernel, and let f: X → R be a positive function. Show that the kernel K′ defined for all x, y ∈ X by K′(x, y) = K(x, y)/(f(x)f(y)) is a PDS kernel.
5.7 VC-dimension of canonical hyperplanes. The objective of this problem is to derive a bound on the VC-dimension of canonical hyperplanes that does not depend on the dimension of the feature space. Let S ⊆ {x : ‖x‖ ≤ r}. We will show that the VC-dimension d of the set of canonical hyperplanes {x ↦ …
5.6 Sparse SVM. One can give two types of arguments in favor of the SVM algorithm: one based on the sparsity of the support vectors, another based on the notion of margin. Suppose that, instead of maximizing the margin, we choose to maximize sparsity by minimizing the Lp norm of the vector…
5.5 SVMs hands-on. (a) Download and install the libsvm software library from: http://www.csie.ntu.edu.tw/~cjlin/libsvm/. (b) Download the satimage data set found at: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. Merge the training and validation sets into one. We will refer to the resulting…
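A rough sketch of the kind of experiment this exercise asks for, using scikit-learn's SVC (which wraps libsvm) in place of the libsvm command-line tools and a synthetic dataset as a stand-in for satimage; both substitutions are assumptions, not the exercise's prescribed setup.

```python
# Cross-validate an SVM with a polynomial kernel over a small grid of C values.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=36, n_informative=10,
                           n_classes=2, random_state=0)

for C in [0.1, 1.0, 10.0]:
    scores = cross_val_score(SVC(kernel="poly", degree=3, C=C), X, y, cv=10)
    print(f"C={C:5.1f}  mean accuracy={scores.mean():.3f}")
```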
5.4 Sequential minimal optimization (SMO). The SMO algorithm is an optimization algorithm introduced to speed up the training of SVMs. SMO reduces a (potentially) large quadratic programming (QP) optimization problem to a series of small optimizations involving only two Lagrange multipliers. SMO…
5.3 Importance-weighted SVM. Suppose you wish to use SVMs to solve a learning problem where some training data points are more important than others. More formally, assume that each training point consists of a triplet (x_i, y_i, p_i), where 0 ≤ p_i ≤ 1 is the importance of the i-th point. Rewrite the…
5.2 Tighter Rademacher Bound. Derive the following tighter version of the bound of theorem 5.9: for any δ > 0, with probability at least 1 − δ, for all h ∈ H and ρ ∈ (0, 1] the following holds: R(h) ≤ R̂_{S,ρ}(h) + (2γ/ρ) R_m(H) + √(log log_γ(2/ρ) / m) + √(log(2/δ) / (2m)) (5.49) for any γ > 1.
5.1 Soft margin hyperplanes. The function of the slack variables used in the optimization problem for soft margin hyperplanes has the form: ξ ↦ ∑_{i=1}^m ξ_i. Instead, we could use ξ ↦ ∑_{i=1}^m ξ_i^p, with p > 1. (a) Give the dual formulation of the problem in this general case. (b) How does this more…
4.5 Same questions as in Exercise 4.4, with the loss of h: X → R at point (x, y) ∈ X × {−1, +1} defined instead to be 1_{y h(x)…
4.4 In this problem, the loss of h: X → R at point (x, y) ∈ X × {−1, +1} is defined to be 1_{y h(x) ≤ 0}. (a) Define the Bayes classifier and a Bayes scoring function h* for this loss. (b) Express the excess error of h in terms of h* (counterpart of Lemma 4.5 for the loss considered here). (c) Give a…
4.3 Show that for the squared hinge loss, Φ(u) = max(0, 1 + u)², the statement of Theorem 4.7 holds with s = 2 and c = 1/2, and therefore that the excess error can be upper bounded as follows: R(h) − R* ≤ (L(h) − L*)^{1/2}.
4.2 Show that for the squared loss, Φ(u) = (1 + u)², the statement of Theorem 4.7 holds with s = 2 and c = 1/2, and therefore that the excess error can be upper bounded as follows: R(h) − R* ≤ (L(h) − L*)^{1/2}.
4.1 For any hypothesis set H, show that the following inequalities hold: E_{S∼D^m}[R̂_S(h_S^{ERM})] ≤ inf_{h∈H} R(h) ≤ E_{S∼D^m}[R(h_S^{ERM})]. (4.13)
3.31 Generalization bound based on covering numbers. Let H be a family of functions mapping X to a subset of real numbers Y ⊆ R. For any ε > 0, the covering number N(H, ε) of H for the L∞ norm is the minimal k ∈ N such that H can be covered with k balls of radius ε, that is, there exist {h₁, …
3.30 VC-dimension generalization bound: realizable case. In this exercise we show that the bound given in corollary 3.19 can be improved to O((d log(m/d))/m) in the realizable setting. Assume we are in the realizable scenario, i.e. the target concept is included in our hypothesis class H. We will…
3.29 Infinite VC-dimension. (a) Show that if a concept class C has infinite VC-dimension, then it is not PAC-learnable. (b) In the standard PAC-learning scenario, the learning algorithm receives all examples first and then computes its hypothesis. Within that setting, PAC-learning of concept classes with…
3.28 VC-dimension of convex combinations. Let H be a family of functions mapping from an input space X to {−1, +1} and let T be a positive integer. Give an upper bound on the VC-dimension of the family of functions F_T defined by F_T = {sgn(∑_{t=1}^T α_t h_t) : h_t ∈ H, α_t ≥ 0, ∑_{t=1}^T α_t ≤ 1}. (Hint: you can use…
3.27 VC-dimension of neural networks. Let C be a concept class over R^r with VC-dimension d. A C-neural network with one intermediate layer is a concept defined over R^n that can be represented by a directed acyclic graph such as that of Figure 3.7, in which the input nodes are those at the bottom and…
3.26 Symmetric functions. A function h: {0, 1}^n → {0, 1} is symmetric if its value is uniquely determined by the number of 1's in the input. Let C denote the set of all symmetric functions. (a) Determine the VC-dimension of C. (b) Give lower and upper bounds on the sample complexity of any consistent…
3.25 VC-dimension of symmetric difference of concepts. For two sets A and B, let AΔB denote the symmetric difference of A and B, i.e., AΔB = (A∪B) − (A∩B). Let H be a non-empty family of subsets of X with finite VC-dimension. Let A be an element of H and define HΔA = {XΔA : X ∈ H}. Show that VCdim(HΔA) = …
3.24 VC-dimension of union of concepts. Let A and B be two sets of functions mapping from X into {0, 1}, and assume that both A and B have finite VC-dimension, with VCdim(A) = d_A and VCdim(B) = d_B. Let C = A ∪ B be the union of A and B. (a) Prove that for all m, Π_C(m) ≤ Π_A(m) + Π_B(m). (b) Use Sauer's…
3.23 VC-dimension of intersection concepts. (a) Let C₁ and C₂ be two concept classes. Show that for any concept class C = {c₁ ∩ c₂ : c₁ ∈ C₁, c₂ ∈ C₂}, Π_C(m) ≤ Π_{C₁}(m) Π_{C₂}(m). (3.53) (b) Let C be a concept class with VC-dimension d and let C_s be the concept class formed by all intersections of s…
3.22 VC-dimension of intersection of halfspaces. Consider the class C_k of convex intersections of k halfspaces. Give lower and upper bound estimates for VCdim(C_k).
3.21 VC-dimension of union of halfspaces. Provide an upper bound on the VC-dimension of the class of hypotheses described by the unions of k halfspaces.