
Gradient Descent Optimization
(a) Consider Figure 5, which depicts a loss function $L(\mathbf{x}): \mathbb{R}^2 \rightarrow \mathbb{R}$. The red dot represents the current estimate $\mathbf{x}_t = [x_1, x_2]$ at step $t$. Please sketch an estimate of the path of updates that would be taken by vanilla SGD until "convergence". Hint: use the tangent vector to help illustrate your reasoning.

Figure 5: Contour lines of an arbitrary cost function with current estimate $\mathbf{x}_t$.
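To make the sketched path concrete, here is a minimal Python sketch of the vanilla SGD rule $\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \nabla L(\mathbf{x}_t)$. The quadratic loss `grad_L`, the starting point, and the learning rate are hypothetical stand-ins for Figure 5 (they are not given in the problem), chosen so that elongated contours make the characteristic zigzag visible.

```python
import numpy as np

def sgd_step(x, grad_fn, lr):
    """One vanilla SGD update: move opposite the gradient, scaled by lr."""
    return x - lr * grad_fn(x)

def grad_L(x):
    # Hypothetical stand-in for the loss in Figure 5: an elongated
    # quadratic bowl L(x) = 2*x1^2 + 0.5*x2^2 with elliptical contours.
    return np.array([4.0 * x[0], 1.0 * x[1]])

x = np.array([3.0, 2.0])   # current estimate x_t (the red dot)
path = [x.copy()]
for _ in range(50):
    x = sgd_step(x, grad_L, lr=0.4)
    path.append(x.copy())

# Each step is perpendicular to the local contour line. At this learning
# rate the steep x1 coordinate overshoots and flips sign every step, so
# the path zigzags across the narrow axis while x2 decays smoothly.
```

Plotting `path` over the contours reproduces the kind of trajectory the question asks you to sketch: steps normal to each contour, oscillating across the valley as they approach the minimum.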
(b) It is worth mentioning that the contour lines shown in Fig. 5 will change during optimization, since the loss is evaluated over a single batch rather than the whole dataset. As discussed in class, this observation suggests that an unfortunate update might leave us stuck at a saddle point, where the vanilla SGD gradient is 0. One way to combat this problem is to use first- and/or second-order momentum. Please briefly answer the following questions (a code sketch contrasting these updates follows the list):
i. Why is first-order momentum helpful in handling the saddle-point problem?
ii. Give one example of another difficulty of optimization that first-order momentum can help address, and explain why.
iii. How would using first- and second-order momentum change the update path sketched in (a)? Sketch the changed update direction and explain why you think the updates will change in that way. Hint: use the tangent and the momentum vectors to help illustrate your reasoning.
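For parts (i) through (iii), a minimal sketch of the two update rules may help frame the answers. `momentum_step`, `adam_step`, the toy gradient, and all hyperparameters below are hypothetical illustrations under the same assumed quadratic as the SGD sketch above, not the course's prescribed implementation.

```python
import numpy as np

def momentum_step(x, v, grad_fn, lr=0.1, beta=0.9):
    """First-order momentum (heavy ball): the velocity v accumulates past
    gradients, so the update keeps moving where the current gradient is
    ~0 (e.g. a saddle point) and averages out oscillating components."""
    v = beta * v + grad_fn(x)
    return x - lr * v, v

def adam_step(x, m, s, t, grad_fn, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-style step with first (m) and second (s) order moments: the
    second moment rescales each coordinate, so steep directions take
    smaller steps and flat ones larger steps, straightening the path."""
    g = grad_fn(x)
    m = beta1 * m + (1.0 - beta1) * g          # first moment (momentum)
    s = beta2 * s + (1.0 - beta2) * g * g      # second moment (scale)
    m_hat = m / (1.0 - beta1 ** t)             # bias correction
    s_hat = s / (1.0 - beta2 ** t)
    return x - lr * m_hat / (np.sqrt(s_hat) + eps), m, s

def grad_L(x):
    # Same hypothetical elongated quadratic as in the SGD sketch above.
    return np.array([4.0 * x[0], 1.0 * x[1]])

# Momentum path: consecutive opposing gradient components cancel in v.
x, v = np.array([3.0, 2.0]), np.zeros(2)
for _ in range(50):
    x, v = momentum_step(x, v, grad_L)

# Adam path: per-coordinate scaling equalizes progress along x1 and x2.
x, m, s = np.array([3.0, 2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 51):
    x, m, s = adam_step(x, m, s, t, grad_L)
```

Compared with the zigzag in (a), the momentum path curves smoothly into the valley because opposing gradient components from successive steps cancel inside the velocity, while the Adam-style second moment shrinks steps along the steep axis and enlarges them along the flat one, which is the behavior part (iii) asks you to sketch.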