Question: Problem 1. (50 pt) Given a Markov stationary policy $\pi$, consider the policy evaluation problem to compute $v^\pi$.

For example, we can apply the temporal difference (TD) learning algorithm given by

$$v_{t+1}(s) = v_t(s) + \alpha\, \delta_t\, \mathbb{1}_{\{s = s_t\}},$$

where

$$\delta_t = r_t + \gamma\, v_t(s_{t+1}) - v_t(s_t)$$

is known as the TD error. Alternatively, we can apply the $n$-step TD learning algorithm given by

$$v_{t+1}(s) = v_t(s) + \alpha \bigl( G_t^{(n)} - v_t(s_t) \bigr) \mathbb{1}_{\{s = s_t\}},$$

where

$$G_t^{(n)} = r_t + \gamma\, r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} v_t(s_{t+n})$$

for $n = 1, 2, \dots$ Note that $\delta_t = G_t^{(1)} - v_t(s_t)$.
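The two updates above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original problem: the trajectory, the value table `v`, and the helper names `n_step_return` and `td_update` are all hypothetical, and the state space is assumed to be small enough for a list-based table.

```python
def n_step_return(rewards, states, v, t, n, gamma):
    """G_t^(n) = r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1} + gamma^n * v(s_{t+n})."""
    discounted = sum(gamma**k * rewards[t + k] for k in range(n))
    return discounted + gamma**n * v[states[t + n]]


def td_update(v, s, target, alpha):
    """Return a copy of v in which only entry s moves toward target.

    Leaving every other entry unchanged plays the role of the indicator 1{s = s_t}.
    """
    v_new = list(v)
    v_new[s] += alpha * (target - v[s])
    return v_new


# Hypothetical trajectory: states[k] is s_k, rewards[k] is r_k.
states = [0, 1, 2, 0, 1]
rewards = [1.0, 0.0, 2.0, 1.0]
v = [0.5, -0.2, 1.0]
gamma, alpha, t = 0.9, 0.1, 0

# One-step TD error and the 1-step and 3-step returns at time t.
delta = rewards[t] + gamma * v[states[t + 1]] - v[states[t]]
g1 = n_step_return(rewards, states, v, t, 1, gamma)
g3 = n_step_return(rewards, states, v, t, 3, gamma)

v_td = td_update(v, states[t], v[states[t]] + delta, alpha)  # one-step TD update
v_n3 = td_update(v, states[t], g3, alpha)                    # 3-step TD update
```

With these numbers `delta` equals `g1 - v[states[t]]`, which matches the note that the TD error is the 1-step return minus the current estimate.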

The $n$-step TD algorithms for $n < \infty$ use bootstrapping. Therefore, they use a biased estimate of $v^\pi$. On the other hand, as $n \to \infty$, the $n$-step TD algorithm becomes a Monte Carlo method, where we use an unbiased estimate of $v^\pi$.
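The dependence on bootstrapping can be seen numerically. The sketch below assumes an episodic setting in which the episode ends after the last reward and the terminal state has value 0; that is an assumption of this example, since the problem itself is stated without termination. The data and the helper `n_step_return` are hypothetical.

```python
def n_step_return(rewards, states, v, t, n, gamma):
    """n-step return for an episode that terminates after the last reward.

    states has len(rewards) + 1 entries; the last one is terminal and is never
    looked up, which amounts to giving it value 0 (an assumption of this sketch).
    """
    T = len(rewards)
    steps = min(n, T - t)
    g = sum(gamma**k * rewards[t + k] for k in range(steps))
    if t + steps < T:
        # Still inside the episode: bootstrap from the current estimate.
        g += gamma**steps * v[states[t + steps]]
    return g


rewards = [1.0, 0.0, 2.0]
states = [0, 1, 2, None]  # None marks the terminal state
gamma = 0.9

# Two different value estimates for the same trajectory.
v_a = {0: 0.0, 1: 0.0, 2: 0.0}
v_b = {0: 5.0, 1: -3.0, 2: 7.0}

# Discounted sum of all rewards from t = 0 (the Monte Carlo return).
mc_return = sum(gamma**k * r for k, r in enumerate(rewards))

# Small n: the target changes with the value estimate (bootstrapping).
g1_a = n_step_return(rewards, states, v_a, 0, 1, gamma)
g1_b = n_step_return(rewards, states, v_b, 0, 1, gamma)

# Large n: the target is the Monte Carlo return whatever the estimate is.
g_big_a = n_step_return(rewards, states, v_a, 0, 10, gamma)
g_big_b = n_step_return(rewards, states, v_b, 0, 10, gamma)
```

The 1-step targets differ between `v_a` and `v_b`, while the large-`n` targets coincide with `mc_return`, which is why only the latter is free of bias inherited from the estimate.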

However, these approaches delay the update for $n$ stages, and we update the value function estimate only for the current state. As an intermediate step to address these challenges, we first introduce the $\lambda$-return algorithm given by

$$v_{t+1}(s) = v_t(s) + \alpha \bigl( G_t^{\lambda} - v_t(s_t) \bigr) \mathbb{1}_{\{s = s_t\}},$$

where, given $\lambda \in [0, 1]$, we define

$$G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)},$$

taking a weighted average of the $G_t^{(n)}$.
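A minimal sketch of the $\lambda$-return as a weighted average follows. It assumes a finite episode whose terminal state has value 0, so that every $n$-step return with $n$ at or beyond the remaining horizon equals the full return and the infinite sum collapses to a finite one. The helper names and the data are hypothetical.

```python
def n_step_return(rewards, states, v, t, n, gamma):
    """n-step return, truncated at the end of the episode (terminal value 0)."""
    T = len(rewards)
    steps = min(n, T - t)
    g = sum(gamma**k * rewards[t + k] for k in range(steps))
    if t + steps < T:
        g += gamma**steps * v[states[t + steps]]
    return g


def lambda_return(rewards, states, v, t, gamma, lam):
    """(1 - lam) * sum_{n>=1} lam^(n-1) * G_t^(n), with the tail collapsed.

    All n >= T - t give the same full return, and their weights
    (1 - lam) * lam^(n-1) add up to lam^(T - t - 1).
    """
    horizon = len(rewards) - t
    total = 0.0
    for n in range(1, horizon):
        weight = (1 - lam) * lam**(n - 1)
        total += weight * n_step_return(rewards, states, v, t, n, gamma)
    total += lam**(horizon - 1) * n_step_return(rewards, states, v, t, horizon, gamma)
    return total


rewards = [1.0, 0.0, 2.0]
states = [0, 1, 2, None]  # None marks the terminal state
v = {0: 0.5, 1: -0.2, 2: 1.0}
gamma = 0.9

g_half = lambda_return(rewards, states, v, 0, gamma, 0.5)
g_zero = lambda_return(rewards, states, v, 0, gamma, 0.0)  # reduces to the 1-step return
g_one = lambda_return(rewards, states, v, 0, gamma, 1.0)   # reduces to the Monte Carlo return
```

The two endpoint cases show how $\lambda$ interpolates between one-step TD ($\lambda = 0$) and Monte Carlo ($\lambda = 1$).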

(a) By the definition of $G_t^{(n)}$, we can show that $G_t^{(n)} = r_t + \gamma\, G_{t+1}^{(n-1)}$. Derive an analogous recursive relationship for $G_t^{\lambda}$ and $G_{t+1}^{\lambda}$.

(b) Show that the term $G_t^{\lambda} - v_t(s_t)$ in the $\lambda$-return update can be written as the sum of TD errors.

The TD algorithm, Monte Carlo method, and $\lambda$-return algorithm look forward to approximate $v^\pi$.
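One possible route through the two parts is sketched below. It is not the site's locked answer, and it rests on assumptions not stated in the problem: the value estimate is held fixed along the trajectory (written $v$, so that $G_t^{(n)}$ and $G_{t+1}^{(n-1)}$ bootstrap from the same function), $\lambda < 1$, and rewards and values are bounded so that the series converge.

```latex
% Part (a): split off n = 1, apply G_t^{(n)} = r_t + \gamma G_{t+1}^{(n-1)} for n >= 2,
% and re-index the remaining sum with m = n - 1.
\begin{align*}
G_t^{\lambda}
&= (1-\lambda)\, G_t^{(1)}
   + (1-\lambda) \sum_{n=2}^{\infty} \lambda^{n-1}
     \bigl( r_t + \gamma\, G_{t+1}^{(n-1)} \bigr) \\
&= (1-\lambda) \bigl( r_t + \gamma\, v(s_{t+1}) \bigr)
   + \lambda\, r_t
   + \gamma \lambda\, (1-\lambda) \sum_{m=1}^{\infty} \lambda^{m-1} G_{t+1}^{(m)} \\
&= r_t + \gamma \bigl[ (1-\lambda)\, v(s_{t+1}) + \lambda\, G_{t+1}^{\lambda} \bigr].
\end{align*}

% Part (b): subtract v(s_t), group the one-step TD error, and unroll the recursion.
\begin{align*}
G_t^{\lambda} - v(s_t)
&= \underbrace{r_t + \gamma\, v(s_{t+1}) - v(s_t)}_{\delta_t}
   + \gamma \lambda \bigl( G_{t+1}^{\lambda} - v(s_{t+1}) \bigr) \\
&= \sum_{k=0}^{\infty} (\gamma \lambda)^{k}\, \delta_{t+k}.
\end{align*}
```

The last equality uses the fact that the remainder $(\gamma\lambda)^{K} \bigl( G_{t+K}^{\lambda} - v(s_{t+K}) \bigr)$ vanishes as $K \to \infty$ under the boundedness assumption with $\gamma\lambda < 1$.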

Alternatively, we can look backward via the eligibility trace method. The TD($\lambda$) ...
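As a numerical sanity check on the forward-view quantities defined earlier, the sketch below computes $G_t^{\lambda}$ through the recursion $G_t^{\lambda} = r_t + \gamma\,[(1-\lambda)\, v(s_{t+1}) + \lambda\, G_{t+1}^{\lambda}]$ and compares $G_t^{\lambda} - v(s_t)$ with the discounted sum of TD errors $\sum_{k \ge t} (\gamma\lambda)^{k-t} \delta_k$. It assumes a fixed value table, a finite episode, and a terminal state with value 0; the data and function names are hypothetical.

```python
def lambda_returns_recursive(rewards, states, v, gamma, lam):
    """Backward pass: G_t = r_t + gamma * ((1 - lam) * v(s_{t+1}) + lam * G_{t+1})."""
    T = len(rewards)
    g = [0.0] * (T + 1)  # g[T] = 0 at the terminal state
    for t in reversed(range(T)):
        v_next = v[states[t + 1]]
        g[t] = rewards[t] + gamma * ((1 - lam) * v_next + lam * g[t + 1])
    return g[:T]


def td_errors(rewards, states, v, gamma):
    """delta_t = r_t + gamma * v(s_{t+1}) - v(s_t) for each step of the episode."""
    return [
        rewards[t] + gamma * v[states[t + 1]] - v[states[t]]
        for t in range(len(rewards))
    ]


rewards = [1.0, 0.0, 2.0]
states = [0, 1, 2, 3]      # state 3 is terminal
v = [0.5, -0.2, 1.0, 0.0]  # terminal value fixed at 0
gamma, lam = 0.9, 0.5

T = len(rewards)
g_lam = lambda_returns_recursive(rewards, states, v, gamma, lam)
deltas = td_errors(rewards, states, v, gamma)

# Left side: lambda-return minus the current estimate at each visited state.
lhs = [g_lam[t] - v[states[t]] for t in range(T)]
# Right side: sum of TD errors from t onward, discounted by (gamma * lam).
rhs = [
    sum((gamma * lam) ** (k - t) * deltas[k] for k in range(t, T))
    for t in range(T)
]
```

The lists `lhs` and `rhs` agree entry by entry up to floating-point error on this example.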
