# Problem 1: Stochastic Variance Reduced Gradient Descent (SVRG)

As we discussed in the video lectures, decomposable functions of the form $$ \min_{\omega} \left[ F(\omega) = \frac{1}{n} \sum_{i=1}^{n} f_i(\omega) \right] $$ are very common in statistics/ML problems. Here, each $f_i$ corresponds to a loss for a particular training example. For example, if $f_i(\omega) = (\omega^\top x_i - y_i)^2$, then $F(\omega)$ is a least squares regression problem. The standard gradient descent (GD) update $$ \omega_t = \omega_{t-1} - \eta_t \nabla F(\omega_{t-1}) $$
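To make the notation concrete, here is a minimal sketch of one full-gradient step for a decomposable objective (NumPy; the helper `grad_fi(w, x_i, y_i)`, which returns the gradient of a single-example loss, is an assumed interface for illustration):

```python
import numpy as np

def gd_step(w, X, y, grad_fi, eta):
    """One full GD step: subtract eta times the average of the n single-example gradients."""
    n = X.shape[0]
    # Full gradient: n single-example gradient evaluations per update.
    full_grad = np.mean([grad_fi(w, X[i], y[i]) for i in range(n)], axis=0)
    return w - eta * full_grad
```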

evaluates the full gradient $\nabla F(\omega) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(\omega)$, which requires evaluating $n$ derivatives. This can be prohibitively expensive when the number of training examples $n$ is large. Stochastic gradient descent (SGD) instead evaluates the gradient of one (or a small subset) of the training examples, drawn randomly from $\{1, \dots, n\}$, per iteration: $$ \omega_t = \omega_{t-1} - \eta_t \nabla f_i(\omega_{t-1}). $$
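A corresponding sketch of one SGD step, under the same assumed `grad_fi` interface:

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative fixed seed

def sgd_step(w, X, y, grad_fi, eta):
    """One SGD step using the gradient of a single, uniformly random example."""
    i = rng.integers(X.shape[0])  # example index drawn from {1, ..., n}
    return w - eta * grad_fi(w, X[i], y[i])
```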

In expectation, the updates are equivalent, but SGD has the computational advantage of only evaluating a single gradient $\nabla f_i(\omega)$ per iteration. The disadvantage is that the randomness introduces variance, which slows convergence. This was our motivation in class to introduce the SVRG algorithm.
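For reference, a sketch of the SVRG loop under the same assumed `grad_fi` interface: each outer epoch computes the full gradient at a snapshot, and the $T$ inner steps correct each stochastic gradient with that snapshot so the update stays unbiased while its variance shrinks. (Using the last inner iterate as the next snapshot is one common option; that choice is an assumption here.)

```python
import numpy as np

def svrg(w0, X, y, grad_fi, eta, T, n_outer):
    """Sketch of SVRG with an assumed grad_fi(w, x_i, y_i) single-example gradient."""
    rng = np.random.default_rng(0)
    n = X.shape[0]
    w_snap = np.asarray(w0, dtype=float).copy()
    for _ in range(n_outer):
        # Full gradient at the snapshot: n single-example gradient evaluations.
        mu = np.mean([grad_fi(w_snap, X[i], y[i]) for i in range(n)], axis=0)
        w = w_snap.copy()
        for _ in range(T):  # T cheap inner steps per epoch
            i = rng.integers(n)
            # Control-variate gradient: unbiased estimate of grad F(w) whose
            # variance shrinks as w and w_snap approach the optimum.
            g = grad_fi(w, X[i], y[i]) - grad_fi(w_snap, X[i], y[i]) + mu
            w = w - eta * g
        w_snap = w  # take the last inner iterate as the next snapshot
    return w_snap
```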

Given the dataset in **digits.zip**, plot the performance of GD, SGD, and SVRG for logistic regression with $\ell_2$ regularization, measured as negative log likelihood on the training data, against the number of single-example gradient evaluations (GD performs $n$ such evaluations per iteration and SGD performs $1$). Choose the $\ell_2$ parameter to optimize performance on the test set. How does the choice of $T$ (the number of inner-loop iterations) affect the performance of SVRG? There should be one plot with a title and three lines with different colors, markers, and legend labels.
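A possible starting point for the logistic regression pieces, assuming labels $y_i \in \{0, 1\}$ and the common convention of folding the full $\ell_2$ penalty $\lambda \|\omega\|^2$ into each $f_i$ (so the averaged objective matches the plotted NLL); the variable names and the plotting skeleton below are illustrative, not prescribed by the assignment:

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y, lam):
    """Regularized training objective: mean negative log likelihood + lam * ||w||^2."""
    p = sigmoid(X @ w)
    eps = 1e-12  # guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)) + lam * (w @ w)

def grad_fi(w, x_i, y_i, lam):
    """Gradient of the i-th regularized single-example loss."""
    return (sigmoid(x_i @ w) - y_i) * x_i + 2.0 * lam * w

# Illustrative plotting skeleton; evals_* counts single-example gradient
# evaluations and nll_* the recorded training NLL for each optimizer
# (hypothetical arrays, filled in by your training loops).
# plt.plot(evals_gd,   nll_gd,   'b-o', label='GD')
# plt.plot(evals_sgd,  nll_sgd,  'g-s', label='SGD')
# plt.plot(evals_svrg, nll_svrg, 'r-^', label='SVRG')
# plt.xlabel('single-example gradient evaluations')
# plt.ylabel('training negative log likelihood')
# plt.title('GD vs. SGD vs. SVRG on digits')
# plt.legend()
# plt.show()
```

Binding `lam` (e.g. with `functools.partial`) recovers the `grad_fi(w, x_i, y_i)` interface assumed in the optimizer sketches above.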
