Question: In the lectures, we introduced Gradient Descent, an optimization method to find the minimum value of a function. In this problem we try to solve a fairly simple optimization problem:

$$\min_{x \in \mathbb{R}} f(x) = x^2$$

That is, finding the minimum value of $x^2$ over the real line. Of course you know the minimum is attained at $x = 0$, but this time we do it with gradient descent. Recall that to perform gradient descent, you start at an arbitrary initial point $x_0$ and update

$$x_{t+1} = x_t - \eta \nabla_x f(x_t),$$

where $\eta$ is the learning rate. Hopefully, after $T$ iterations, $x_T$ will be close to the minimum point.

(a) [10 pts] Assume $x_0 = 1$ and we choose the learning rate to be $\eta = 1$. Now suppose a sequence $x_1, \ldots, x_T$ is obtained through the gradient descent algorithm. Prove that for arbitrary $T > 0$, $f(x_T) = 1$; hence gradient descent fails completely. Can you provide an intuitive explanation as to why?

(b) [10 pts] Assume $x_0 = 1$ and $\eta = 2$. Prove that $x_{t+1}^2 > x_t^2$ is always true: gradient descent even increases the function value!

(c) [10 pts] What is the reason gradient descent fails to work in the above two cases, even for such a simple optimization problem? What can be done to make gradient descent work? (You don't need a perfect solution here. In fact, a lot of research, even today, has been put into improving the stability and efficiency of (stochastic) gradient descent algorithms.)
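Before proving the claims, it can help to watch the updates numerically. Below is a minimal Python sketch (not part of the original assignment; the function name `gradient_descent` and the step count are illustrative choices) that applies $x_{t+1} = x_t - \eta \cdot 2x_t$ for both learning rates:

```python
# Illustrative sketch, not part of the assignment: gradient descent
# on f(x) = x^2, whose gradient is f'(x) = 2x.

def gradient_descent(x0, lr, steps):
    """Run x_{t+1} = x_t - lr * f'(x_t) and return all iterates."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - lr * 2 * x)  # gradient of x^2 is 2x
    return xs

# (a) lr = 1: the iterates oscillate between 1 and -1, so f(x_T) = 1 forever.
print(gradient_descent(1.0, 1.0, 5))   # [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

# (b) lr = 2: the update becomes x_{t+1} = -3 * x_t, so |x_t| and f(x_t) grow.
print(gradient_descent(1.0, 2.0, 5))   # [1.0, -3.0, 9.0, -27.0, 81.0, -243.0]
```

Since $\nabla_x f(x) = 2x$, each update is $x_{t+1} = (1 - 2\eta)\,x_t$, so the iterates converge only when $|1 - 2\eta| < 1$, i.e. $0 < \eta < 1$: $\eta = 1$ oscillates forever and $\eta = 2$ diverges, which is exactly the behavior parts (a) and (b) ask you to prove.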
