1. (40 points) The weight decay regularizer is also called the L2 regularizer, since $w^T w$ is the square of the 2-norm of the weight vector, $\|w\|_2 = \sqrt{\sum_{i=0}^{d} w_i^2}$. Another common regularizer is called the L1 regularizer, since the 1-norm ($\|w\|_1 = \sum_{i=0}^{d} |w_i|$) is used as the regularizer.
Below are the definitions of the two regularizations¹:

L1 regularization: $E_{\text{aug}}(w) = E_{\text{in}}(w) + \lambda \|w\|_1$

L2 regularization: $E_{\text{aug}}(w) = E_{\text{in}}(w) + \lambda w^T w$
(a) (10 points out of 40 points) Answer LFD Problem 4.8.
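For reference, the gradient of the L2-augmented error is $\nabla E_{\text{in}}(w) + 2\lambda w$, so with learning rate $\eta$ the update in Problem 4.8 should come out to the familiar weight-decay form:

$$w(t+1) \leftarrow (1 - 2\eta\lambda)\, w(t) - \eta \nabla E_{\text{in}}(w(t))$$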
(b) (10 points out of 40 points) Similar to Problem 4.8, derive the update rule of gradient descent for minimizing the augmented error with the L1 regularizer.
Note that the gradient of the 1-norm is not well-defined at 0. To address this issue, we can utilize the subgradient idea, defined as follows:

$$\frac{\partial}{\partial w_i} \|w\|_1 = \begin{cases} +1 & \text{if } w_i > 0 \\ \text{any value in } [-1, 1] & \text{if } w_i = 0 \\ -1 & \text{if } w_i < 0 \end{cases}$$
¹When applying these regularizations to linear regression, they are called Ridge Regression (L2 regularizer) and Lasso Regression (L1 regularizer), respectively.
To simplify the discussion, we let $\frac{\partial}{\partial w_i} \|w\|_1 = 0$ when $w_i = 0$. Please write down the update rule of gradient descent for L1 regularization. (You can define a $\mathrm{sign}()$ function that returns $+1$, $0$, $-1$ when the input is positive, zero, or negative.)
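With that convention, the update rule you derive should take the shape below (a sketch, assuming learning rate $\eta$ and $\mathrm{sign}()$ applied element-wise):

$$w(t+1) \leftarrow w(t) - \eta \nabla E_{\text{in}}(w(t)) - \eta\lambda\, \mathrm{sign}(w(t))$$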
Truncated gradient (for part (c)): In Lasso regression (linear regression with L1 regularization), one nice property is that it tends to learn a weight vector with many 0s. However, if we perform gradient descent on the augmented error with L1 regularization, it won't lead to this nice property, partly due to the not-well-defined behavior of the subgradient. In this homework, you will implement truncated gradient [1], an approach that tries to maintain this nice property of L1 regularization, as described below.
Let $\tilde{w}(t+1) \leftarrow w(t) - \eta \nabla E_{\text{in}}(w(t))$ be the update rule of gradient descent without regularization. The update rule for L1 regularization that you derived should be in the form of

$$w(t+1) \leftarrow \tilde{w}(t+1) + \text{additional term}$$

The additional term represents the effect of L1 regularization compared with no regularization. Truncated gradient works as follows: at each step $t$, you first perform the update and obtain $w(t+1)$. Then for each dimension $i$, if $\tilde{w}_i(t+1)$ and $w_i(t+1)$ have different signs, or when $\tilde{w}_i(t+1) = 0$, we set $w_i(t+1)$ to 0 (i.e., we truncate the update if the additional term makes the new weight change sign).
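A minimal sketch of one truncated-gradient step, assuming NumPy and a gradient function grad_Ein from your HW2 code (the names here are illustrative, not required):

    import numpy as np

    def truncated_gradient_step(w, grad_Ein, eta, lam):
        """One gradient-descent step with L1 regularization, then truncation."""
        w_tilde = w - eta * grad_Ein(w)            # update without regularization
        w_new = w_tilde - eta * lam * np.sign(w)   # add the L1 "additional term"
        # Truncate: zero out any coordinate where the additional term flipped
        # the sign of the unregularized update (np.sign(0) == 0, so coordinates
        # with w_tilde == 0 are also reset to 0).
        w_new[np.sign(w_tilde) != np.sign(w_new)] = 0.0
        return w_new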
(c) (20 points out of 40 points) Update your implementation of logistic regression in HW2 to include the L1 and L2 regularizers (use truncated gradient for the L1 regularizer and regular gradient descent for the L2 regularizer). Conduct the following experiment and include the results in your report. Also submit the updated Python implementation (feel free to update the function headers and/or define new functions).
You will work with the digits dataset, classifying whether a digit belongs to $\{1, 6, 9\}$ (labeled as $-1$) or $\{0, 7, 8\}$ (labeled as $+1$). Please download the pre-processed data (check the label format and make sure you are working with the $+1/-1$ labels) on Canvas. Examine different $\lambda = 0, 0.0001, 0.001, 0.005, 0.01, 0.05, 0.1$ for both L1 and L2 regularizations.
Train your models on the training set. For each trained model, report (1) the
classification error on the test set and (2) the number of 0s in your learned weight vector.
Describe your observations and the property of the L1 regularizer (when coupled
with truncated gradient).
For the other parameters, please use the following. Normalize the features. Set the learning rate $\eta = 0.01$. The maximum number of iterations is $10^4$. Terminate learning if the magnitude of every element of the gradient (of $E_{\text{in}}$) is less than $10^{-6}$. When calculating the classification error, classify the data using a cutoff probability of 0.5.
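One way the experiment loop could look, reusing the step above. This is a sketch: logistic_grad and the data arguments are placeholders for your own HW2 code, and you still need to load and normalize the Canvas data yourself.

    import numpy as np

    def logistic_grad(X, y, w):
        """Gradient of E_in(w) = mean(log(1 + exp(-y * Xw))) for +1/-1 labels."""
        s = 1.0 / (1.0 + np.exp(y * (X @ w)))    # sigmoid(-y * Xw)
        return X.T @ (-y * s) / len(y)

    def train(X, y, lam, reg, eta=0.01, max_iter=10**4, tol=1e-6):
        w = np.zeros(X.shape[1])
        for _ in range(max_iter):
            g = logistic_grad(X, y, w)
            if np.max(np.abs(g)) < tol:           # every |g_i| below 1e-6
                break
            if reg == "L2":                       # plain gradient descent on E_aug
                w = w - eta * (g + 2.0 * lam * w)
            else:                                 # L1 with truncated gradient
                w_tilde = w - eta * g
                w_new = w_tilde - eta * lam * np.sign(w)
                w_new[np.sign(w_tilde) != np.sign(w_new)] = 0.0
                w = w_new
        return w

    def run_experiment(X_train, y_train, X_test, y_test):
        """Inputs are assumed to hold normalized features and +1/-1 labels."""
        for reg in ("L1", "L2"):
            for lam in (0, 0.0001, 0.001, 0.005, 0.01, 0.05, 0.1):
                w = train(X_train, y_train, lam, reg)
                prob = 1.0 / (1.0 + np.exp(-(X_test @ w)))  # P(y = +1 | x)
                pred = np.where(prob >= 0.5, 1, -1)         # cutoff at 0.5
                err = np.mean(pred != y_test)
                print(f"{reg} lambda={lam}: test error={err:.4f}, "
                      f"zeros in w={int(np.sum(w == 0))}")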
