Question:
Consider a 2-dimensional weight space, and two different error
functions:

E1([w1, w2]) = w1^2 + 4 w2^2 - 97 w1 + 13 w2
E2([w1, w2]) = 1000 w1^2 + 10 w2^2 + 7 w1 - 3 w2
If you optimize each of these using batch gradient descent, with the
learning rate set as high as you can without the system oscillating,
what is the highest learning rate you can use for each of these? What is
each of their rates of convergence? Which is faster? Please show the
relevant formula and your calculations, and draw a diagram.
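A sketch of the relevant calculation, assuming the garbled error functions are the quadratics E1 = w1^2 + 4 w2^2 - 97 w1 + 13 w2 and E2 = 1000 w1^2 + 10 w2^2 + 7 w1 - 3 w2 (the squared exponents are reconstructed, since a curvature-dependent learning-rate bound only makes sense for quadratic terms). For a quadratic, batch gradient descent is stable for learning rates eta < 2 / lambda_max, where lambda_max is the largest eigenvalue of the Hessian; at that eta, the slowest direction contracts by a factor |1 - eta * lambda_min| per step:

```python
import numpy as np

def gd_analysis(hessian):
    """Max stable learning rate and per-step contraction factor of the
    slowest mode, for batch gradient descent on a quadratic error surface."""
    eigs = np.linalg.eigvalsh(hessian)
    lam_min, lam_max = eigs[0], eigs[-1]
    eta_max = 2.0 / lam_max                  # oscillation threshold
    # Just below the threshold, the slowest eigendirection shrinks by
    # |1 - eta * lambda_min| each step (smaller factor = faster convergence).
    rate = abs(1.0 - eta_max * lam_min)
    return eta_max, rate

# Hessians of the (assumed) quadratics: diagonal, entries 2 * coefficient.
H1 = np.diag([2.0, 8.0])        # from w1^2 + 4 w2^2
H2 = np.diag([2000.0, 20.0])    # from 1000 w1^2 + 10 w2^2

print(gd_analysis(H1))          # eta_max = 0.25,  rate = 0.5
print(gd_analysis(H2))          # eta_max = 0.001, rate = 0.98
```

Under these assumptions E1 allows eta up to 2/8 = 0.25 and contracts its slow mode by 0.5 per step, while E2's much larger curvature ratio (2000 vs. 20) forces eta down to 0.001, leaving a contraction factor of 0.98 per step, so E1 converges far faster. (Strictly, at eta = 2/lambda_max the fastest mode oscillates without decaying, so in practice eta is set just below this bound.)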
