2. Which properties of Lasso path generalize to other loss functions? Recall we showed the optimality...


Question:


2. Which properties of the Lasso path generalize to other loss functions?

Recall we showed the optimality conditions for a Lasso solution:

$$\hat\beta(\lambda)_k \neq 0 \;\Rightarrow\; X_k^T\bigl(Y - X\hat\beta(\lambda)\bigr) = \frac{\lambda}{2}\,\mathrm{sgn}\bigl(\hat\beta(\lambda)_k\bigr) \qquad (1)$$

$$\hat\beta(\lambda)_k = 0 \;\Rightarrow\; \bigl|X_k^T\bigl(Y - X\hat\beta(\lambda)\bigr)\bigr| \le \frac{\lambda}{2} \qquad (2)$$

$$\forall k:\quad \bigl|X_k^T\bigl(Y - X\hat\beta(\lambda)\bigr)\bigr| \le \frac{\lambda}{2} \qquad (3)$$

where, as we noted in class, $X^T\bigl(Y - X\hat\beta(\lambda)\bigr) = -\frac{1}{2}\,\frac{\partial RSS(\beta)}{\partial \beta}\Big|_{\beta=\hat\beta(\lambda)}$ is, up to a constant factor, the derivative of the loss function.

We noted in class the following properties of the set of solutions $\{\hat\beta(\lambda) : 0 \le \lambda \le \infty\}$:

i. All the variables in the solution are "highly correlated" with the current residual, from (1) above, and all the variables with zero coefficients are "less correlated" with the current residual, from (2)-(3) above.

ii. The solution path $\{\hat\beta(\lambda) : 0 \le \lambda \le \infty\}$ as a function of $\lambda$ can be described by a collection of "breakpoints" $\infty > \lambda_1 > \lambda_2 > \dots > \lambda_K > 0$ such that the set $\mathcal{A}_k$ of active variables with non-zero coefficients is fixed for all solutions $\hat\beta(\lambda)$ with $\lambda_k > \lambda > \lambda_{k+1}$.

iii. $\hat\beta(\lambda)$ is a piecewise linear function; in other words, for $\lambda$ in this range we have $\hat\beta(\lambda) = \hat\beta(\lambda_k) + v_k(\lambda_k - \lambda)$, for a vector $v_k$ we explicitly derived in class.

Assume now that we want to build a different type of model with a different convex and infinitely differentiable loss function, say a logistic regression model for a binary classification task, and add a lasso penalty to that:

$$\hat\beta(\lambda) = \arg\min_{\beta} \sum_{i=1}^{n} \log\bigl\{1 + \exp\{-y_i x_i^T \beta\}\bigr\} + \lambda \|\beta\|_1$$

We would like to investigate which of the properties above still hold for the solution of this problem.

(a) Using simple arguments about derivatives and sub-derivatives, as we used in class for the quadratic loss case, argue that three conditions like (1)-(3) can be written for this case too, with the appropriate derivative replacing the empirical correlation. Derive these expressions explicitly for the logistic case.

(b) Explain clearly why this implies that properties (i) and (ii) still hold (for (ii), you may find the continuity of the derivative useful).

(c) Does the piecewise linearity still hold? A clear intuitive explanation is sufficient here. Hint: Consider how we obtained the linearity in $\lambda$ for squared loss in class by decomposing the correlation vector $X^T(Y - X\hat\beta) = X^TY - X^TX\hat\beta$.
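As a starting point for part (a), the quantity that plays the role of the empirical correlation can be checked numerically. The following is a minimal sketch, not the course's solution. It assumes labels coded as y_i in {-1, +1} and a minus sign inside the exponent of the logistic loss (the sign is not legible in the transcribed problem). The function names `logistic_loss`, `neg_gradient` and `kkt_violation` and the toy data are hypothetical, introduced only for illustration. Under these assumptions, minus the gradient of the loss has k-th component sum_i y_i x_ik / (1 + exp(y_i x_i^T beta)), and conditions analogous to (1)-(3) would read: this component equals lambda * sgn(beta_k) when beta_k is non-zero, and is at most lambda in absolute value otherwise.

```python
import math

def logistic_loss(beta, X, y):
    """Sum over i of log(1 + exp(-y_i * x_i^T beta)); assumes y_i in {-1, +1}."""
    total = 0.0
    for xi, yi in zip(X, y):
        margin = yi * sum(b * v for b, v in zip(beta, xi))
        total += math.log1p(math.exp(-margin))
    return total

def neg_gradient(beta, X, y):
    """Minus the gradient of the loss, the analogue of X^T(Y - X beta).

    Component k is sum_i y_i * x_ik / (1 + exp(y_i * x_i^T beta)).
    """
    g = [0.0] * len(beta)
    for xi, yi in zip(X, y):
        margin = yi * sum(b * v for b, v in zip(beta, xi))
        weight = 1.0 / (1.0 + math.exp(margin))
        for k, v in enumerate(xi):
            g[k] += yi * v * weight
    return g

def kkt_violation(beta, X, y, lam, tol=1e-8):
    """Largest violation of the assumed optimality conditions at beta."""
    worst = 0.0
    for bk, gk in zip(beta, neg_gradient(beta, X, y)):
        if abs(bk) > tol:
            # Active variable: -dL/dbeta_k should equal lam * sgn(beta_k).
            worst = max(worst, abs(gk - lam * math.copysign(1.0, bk)))
        else:
            # Inactive variable: |dL/dbeta_k| should not exceed lam.
            worst = max(worst, max(0.0, abs(gk) - lam))
    return worst

# Hypothetical toy data: 4 observations, 2 features.
X = [[1.0, 2.0], [-1.0, 0.5], [0.5, -1.5], [2.0, 1.0]]
y = [1, -1, -1, 1]

# Smallest lambda for which beta = 0 satisfies the conditions.
zero = [0.0, 0.0]
lam_max = max(abs(gk) for gk in neg_gradient(zero, X, y))
print(lam_max, kkt_violation(zero, X, y, lam_max))
```

Note that the per-observation weight 1 / (1 + exp(y_i x_i^T beta)) depends nonlinearly on beta, whereas the squared-loss correlation X^T Y - X^T X beta is linear in beta. That contrast is what the hint in part (c) points at.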

Posted Date: May 17, 2024 06:02 PM
