Problem 1. (50pt) Given a Markov stationary policy $\pi$, consider the policy evaluation problem to compute $v^\pi$. For example, we can apply the temporal difference (TD) learning algorithm given by

$$v_{t+1}(s) = v_t(s) + \alpha\,\delta_t \cdot \mathbb{I}\{s_t = s\},$$

where $\delta_t := r_t + \gamma v_t(s_{t+1}) - v_t(s_t)$ is known as the TD error. Alternatively, we can apply the n-step TD learning algorithm given by

$$v_{t+1}(s) = v_t(s) + \alpha\bigl(G_t^{(n)} - v_t(s)\bigr) \cdot \mathbb{I}\{s_t = s\},$$

where $G_t^{(n)} := r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n v_t(s_{t+n})$ for $n = 1, 2, \ldots$. Note that $\delta_t = G_t^{(1)} - v_t(s_t)$.
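To make the two update rules concrete, here is a minimal tabular sketch of one-step TD policy evaluation together with a helper that computes the n-step return. The episode format, the fixed step size `alpha`, and the function names are illustrative assumptions, not part of the problem statement.

```python
import numpy as np

def td0_policy_evaluation(episodes, n_states, alpha=0.1, gamma=0.99):
    """One-step TD: v(s_t) <- v(s_t) + alpha * delta_t, updating only the visited state.

    `episodes` is assumed to be a list of trajectories, each a list of
    (s_t, r_t, s_{t+1}, done) tuples collected by following the fixed policy.
    """
    v = np.zeros(n_states)
    for episode in episodes:
        for s, r, s_next, done in episode:
            # TD error: delta_t = r_t + gamma * v_t(s_{t+1}) - v_t(s_t);
            # the bootstrap term is dropped at termination.
            delta = r + (0.0 if done else gamma * v[s_next]) - v[s]
            v[s] += alpha * delta
    return v

def n_step_return(rewards, v, s_end, gamma, terminated=False):
    """G_t^(n) = r_t + gamma*r_{t+1} + ... + gamma^{n-1}*r_{t+n-1} + gamma^n * v(s_{t+n})."""
    g = sum(gamma**k * r for k, r in enumerate(rewards))
    if not terminated:
        g += gamma**len(rewards) * v[s_end]
    return g
```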
The n-step TD algorithms for finite $n$ use bootstrapping. Therefore, they use a biased estimate of $v^\pi$. On the other hand, as $n \to \infty$, the n-step TD algorithm becomes a Monte Carlo method, where we use an unbiased estimate of $v^\pi$. However, these approaches delay the update for $n$ stages, and we update the value function estimate only for the current state.
As an intermediate step to address these challenges, we first introduce the $\lambda$-return algorithm given by

$$v_{t+1}(s) = v_t(s) + \alpha\bigl(G_t^{\lambda} - v_t(s)\bigr) \cdot \mathbb{I}\{s_t = s\},$$

where, given $\lambda \in [0,1]$, we define $G_t^{\lambda} := (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$, taking a weighted average of the $G_t^{(n)}$'s.
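As an illustration, the weighted average defining $G_t^{\lambda}$ can be computed directly for a finite episode. The sketch below computes $G_0^{\lambda}$ under the assumption that the episode terminates after $T$ steps, so every $G_0^{(n)}$ with $n \geq T$ equals the full Monte Carlo return and absorbs the remaining geometric weight $\lambda^{T-1}$; the trajectory format and variable names are assumptions for illustration only.

```python
def lambda_return(rewards, values, gamma, lam):
    """Compute G_0^lambda = (1 - lam) * sum_{n>=1} lam^{n-1} * G_0^(n) for a finite episode.

    rewards[k] is r_k for k = 0..T-1, and values[k] is the current estimate v(s_k);
    the terminal value is taken to be zero.
    """
    T = len(rewards)
    # n-step returns G_0^(1), ..., G_0^(T); the last one is the Monte Carlo return.
    g_n = []
    for n in range(1, T + 1):
        g = sum(gamma**k * rewards[k] for k in range(n))
        if n < T:
            g += gamma**n * values[n]          # bootstrap with v(s_n)
        g_n.append(g)
    # (1 - lam) times the first T-1 weighted n-step returns, plus the geometric
    # tail weight lam^(T-1) assigned to the Monte Carlo return G_0^(T).
    out = (1.0 - lam) * sum(lam**(n - 1) * g_n[n - 1] for n in range(1, T))
    out += lam**(T - 1) * g_n[-1]
    return out
```

Setting `lam = 0` recovers the one-step return $G_0^{(1)}$, and `lam = 1` recovers the Monte Carlo return, matching the limits discussed above.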
(a) By the definition of $G_t^{(n)}$, we can show that $G_t^{(n)} = r_t + \gamma G_{t+1}^{(n-1)}$. Derive an analogous recursive relationship for $G_t^{\lambda}$ and $G_{t+1}^{\lambda}$.
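For reference, the stated n-step recursion follows directly from the definition of $G_t^{(n)}$ by factoring out one $\gamma$ (treating the value estimate as held fixed at $v_t$ across the step, which is how the identity is usually read):

$$G_t^{(n)} = r_t + \gamma\bigl(r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-2} r_{t+n-1} + \gamma^{n-1} v_t(s_{t+n})\bigr) = r_t + \gamma\,G_{t+1}^{(n-1)}.$$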
(b) Show that the term $G_t^{\lambda} - v_t(s)$ in the $\lambda$-return update can be written as the sum of TD errors.
The TD algorithm, the Monte Carlo method, and the $\lambda$-return algorithm look forward to approximate $v^\pi$. Alternatively, we can look backward via the eligibility trace method, TD($\lambda$).
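The backward view is typically implemented with an eligibility trace vector that is updated at every step. The sketch below shows the standard accumulating-trace TD($\lambda$) update; since the problem's own definition of TD($\lambda$) is not reproduced above, treat this as an illustrative assumption rather than the problem's exact formulation.

```python
import numpy as np

def td_lambda(episodes, n_states, alpha=0.1, gamma=0.99, lam=0.9):
    """TD(lambda) with accumulating eligibility traces (backward view).

    Every visited state keeps a decaying trace e(s); the TD error delta_t
    updates all states in proportion to their traces, not just s_t.
    """
    v = np.zeros(n_states)
    for episode in episodes:
        e = np.zeros(n_states)                 # eligibility traces, reset per episode
        for s, r, s_next, done in episode:
            delta = r + (0.0 if done else gamma * v[s_next]) - v[s]
            e *= gamma * lam                   # decay all traces
            e[s] += 1.0                        # accumulate trace for the current state
            v += alpha * delta * e             # credit propagates backward along the trace
    return v
```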