Problem 1. (50 pt) Given a Markov stationary policy $\pi$, consider the policy evaluation problem of computing $v^\pi$. For example, we can apply the temporal-difference (TD) learning algorithm given by
\[
v_{t+1}(s) = v_t(s) + \alpha\, \delta_t\, \mathbb{1}_{\{s_t = s\}},
\]
where $\delta_t := r_t + \gamma v_t(s_{t+1}) - v_t(s_t)$ is known as the TD error. Alternatively, we can apply the $n$-step TD learning algorithm given by
\[
v_{t+1}(s) = v_t(s) + \alpha\, \bigl(G_t^{(n)} - v_t(s)\bigr)\, \mathbb{1}_{\{s_t = s\}},
\]
where $G_t^{(n)} := r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n v_t(s_{t+n})$ for $n = 1, 2, \ldots$. Note that $\delta_t = G_t^{(1)} - v_t(s_t)$. The $n$-step TD algorithms for $n < \infty$ use bootstrapping and therefore rely on a biased estimate of $v^\pi$. On the other hand, as $n \to \infty$, the $n$-step TD algorithm becomes a Monte Carlo method, which uses an unbiased estimate of $v^\pi$. However, these approaches delay the update for $n$ stages, and we update the value-function estimate only for the current state. As an intermediate step toward addressing these challenges, we first introduce the $\lambda$-return algorithm given by
\[
v_{t+1}(s) = v_t(s) + \alpha\, \bigl(G_t^{\lambda} - v_t(s)\bigr)\, \mathbb{1}_{\{s_t = s\}},
\]
where, given $\lambda \in [0,1]$, we define
\[
G_t^{\lambda} := (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)},
\]
a weighted average of the $G_t^{(n)}$'s.

(a) By the definition of $G_t^{(n)}$, we can show that $G_t^{(n)} = r_t + \gamma G_{t+1}^{(n-1)}$. Derive an analogous recursive relationship between $G_t^{\lambda}$ and $G_{t+1}^{\lambda}$.

(b) Show that the term $G_t^{\lambda} - v_t(s)$ in the $\lambda$-return update can be written as a sum of TD errors.

The TD algorithm, the Monte Carlo method, and the $\lambda$-return algorithm look forward to approximate $v^\pi$. Alternatively, we can look backward via the eligibility-trace method. The TD($\lambda$) algorithm is given by
\[
z_t(s) = \gamma\lambda\, z_{t-1}(s) + \mathbb{1}_{\{s = s_t\}}, \qquad s \in S,
\]
\[
v_{t+1}(s) = v_t(s) + \alpha\, \delta_t\, z_t(s), \qquad s \in S,
\]
where $z_t \in \mathbb{R}^{|S|}$ is called the eligibility vector and the initial condition is $z_{-1}(s) = 0$ for all $s$.

(c) In the TD($\lambda$) algorithm, $z_t$ is computed recursively. Express $z_t$ only in terms of the states visited in the past. This representation of the eligibility vector shows that eligibility traces combine the frequency heuristic and the recency heuristic to address the credit-assignment problem: for the rewards received, the frequency heuristic assigns higher credit to frequently visited states, while the recency heuristic assigns higher credit to recently visited states. The eligibility vector assigns higher credit to states that are visited both frequently and recently. Note that in the TD($\lambda$) algorithm the value-function estimate of every state gets updated, unlike in the $n$-step TD algorithms, where only the estimate for the current state is updated. If a state has not been visited recently and frequently, then its eligibility (i.e., the associated entry of the eligibility vector) will be close to zero, so the updates driven by the TD error take very small steps for such states.
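For concreteness, the following is a minimal Python sketch of the tabular TD($\lambda$) update described above. The `env.reset()` / `env.step()` / `policy(s)` interface, the episodic termination handling, and all parameter values are assumptions made purely for illustration, not part of the problem.

```python
import numpy as np

def td_lambda(env, policy, n_states, gamma=0.99, lam=0.9, alpha=0.05, episodes=500):
    """Tabular TD(lambda) policy evaluation with an accumulating eligibility trace.

    Assumed (hypothetical) interfaces: env.reset() -> state index,
    env.step(action) -> (next_state, reward, done), policy(state) -> action.
    """
    v = np.zeros(n_states)                 # value-function estimate v_t
    for _ in range(episodes):
        z = np.zeros(n_states)             # eligibility vector, z_{-1}(s) = 0
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            # TD error: delta_t = r_t + gamma * v_t(s_{t+1}) - v_t(s_t)
            bootstrap = 0.0 if done else v[s_next]
            delta = r + gamma * bootstrap - v[s]
            # trace update: z_t(s) = gamma * lam * z_{t-1}(s) + 1{s = s_t}
            z *= gamma * lam
            z[s] += 1.0
            # every state's estimate moves, scaled by its eligibility
            v += alpha * delta * z
            s = s_next
    return v
```

The line `v += alpha * delta * z` updates every entry of the value-function estimate at once, with step sizes scaled by the eligibilities, matching the update for all $s \in S$ above.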
Though the $\lambda$-return algorithm is forward-looking while TD($\lambda$) is backward-looking, the two are equivalent, as you will show next for the finite-horizon problem with horizon length $T < \infty$.

(d) Assume that the initial value-function estimates are zero, i.e., $v_0(s) = 0$ for all $s$. Then the recursive update in the $\lambda$-return algorithm yields that $v_T(s)$ can be written as
\[
v_T(s) = \sum_{t=0}^{T-1} \alpha\, \bigl(G_t^{\lambda} - v_t(s_t)\bigr)\, \mathbb{1}_{\{s_t = s\}}.
\]
Correspondingly, the recursive update in the TD($\lambda$) algorithm yields that $v_T(s)$ can be written as
\[
v_T(s) = \sum_{t=0}^{T-1} \alpha\, \delta_t\, z_t(s).
\]
Show that
\[
\sum_{t=0}^{T-1} \alpha\, \delta_t\, z_t(s) = \sum_{t=0}^{T-1} \alpha\, \bigl(G_t^{\lambda} - v_t(s_t)\bigr)\, \mathbb{1}_{\{s_t = s\}} \quad \text{for all } s.
\]
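As a numerical sanity check on the forward/backward correspondence in part (d) (not a substitute for the requested derivation), the sketch below compares the two sets of accumulated increments on a randomly generated episode. It makes two simplifying assumptions not stated in the problem: the value estimates are held fixed within the episode, so the same $v$ enters every $G_t^{(n)}$ and every $\delta_t$, and the post-episode state $s_T$ is treated as terminal with value zero. The episode, rewards, and parameter values are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic episode (illustrative only): states s_0..s_T, rewards r_0..r_{T-1}.
n_states, T = 5, 8
gamma, lam, alpha = 0.9, 0.7, 0.1
states = rng.integers(0, n_states, size=T + 1)
rewards = rng.normal(size=T)
v = rng.normal(size=n_states)      # value estimates, held fixed over the episode
v_terminal = 0.0                   # assumption: s_T is terminal with value zero

def value(t):
    return v_terminal if t == T else v[states[t]]

def n_step_return(t, n):
    # G_t^{(n)} = r_t + gamma r_{t+1} + ... + gamma^{n-1} r_{t+n-1} + gamma^n v(s_{t+n}),
    # truncated at the episode end.
    n = min(n, T - t)
    return sum(gamma**k * rewards[t + k] for k in range(n)) + gamma**n * value(t + n)

def lambda_return(t):
    # G_t^lambda = (1-lam) sum_{n=1}^{T-t-1} lam^{n-1} G_t^{(n)} + lam^{T-t-1} G_t^{(T-t)},
    # since G_t^{(n)} = G_t^{(T-t)} for all n >= T-t in an episode of length T.
    head = sum((1 - lam) * lam**(n - 1) * n_step_return(t, n) for n in range(1, T - t))
    return head + lam**(T - t - 1) * n_step_return(t, T - t)

# Forward view: lambda-return increments, accumulated per state.
forward = np.zeros(n_states)
for t in range(T):
    forward[states[t]] += alpha * (lambda_return(t) - v[states[t]])

# Backward view: TD(lambda) increments via the eligibility trace.
backward = np.zeros(n_states)
z = np.zeros(n_states)
for t in range(T):
    delta = rewards[t] + gamma * value(t + 1) - v[states[t]]
    z = gamma * lam * z
    z[states[t]] += 1.0
    backward += alpha * delta * z

print(np.allclose(forward, backward))   # True: the accumulated increments coincide
```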
