Question: Problem 1. (50 pt) Given a Markov stationary policy $\pi$, consider the policy evaluation problem to compute $v_\pi$. For example, we can apply the temporal difference (TD) learning algorithm given by
$$v(s_t) \leftarrow v(s_t) + \alpha_t \delta_t,$$
where $\delta_t := r_{t+1} + \gamma v(s_{t+1}) - v(s_t)$ is known as the TD error. Alternatively, we can apply the $n$-step TD learning algorithm given by
$$v(s_t) \leftarrow v(s_t) + \alpha_t \big( G_{t:t+n} - v(s_t) \big),$$
where $G_{t:t+n} := r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n v(s_{t+n})$ for $n = 1, 2, \ldots$.
Note that the $n$-step TD algorithms with $n < \infty$ use bootstrapping. Therefore, they use a biased estimate of $v_\pi$. On the other hand, as $n \to \infty$, the $n$-step TD algorithm becomes a Monte Carlo method, where we use an unbiased estimate of $v_\pi$. However, these approaches delay the update for $n$ stages, and we update the value function estimate only for the current state.
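To make these forward-view updates concrete, a minimal tabular sketch is given below; it is only an illustration, not part of the problem. The environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward, done)`), the policy function `policy`, the value table `v`, and the constant step size `alpha` are all assumptions made for the example; the update rules themselves follow the TD and $n$-step TD formulas above.

```python
import numpy as np

def td0_episode(env, policy, v, alpha=0.1, gamma=0.99):
    """One episode of tabular TD(0): v(s_t) <- v(s_t) + alpha_t * delta_t."""
    s = env.reset()
    done = False
    while not done:
        s_next, r, done = env.step(policy(s))
        # TD error: delta_t = r_{t+1} + gamma * v(s_{t+1}) - v(s_t)
        delta = r + (0.0 if done else gamma * v[s_next]) - v[s]
        v[s] += alpha * delta
        s = s_next
    return v

def n_step_td_episode(env, policy, v, n=3, alpha=0.1, gamma=0.99):
    """One episode of tabular n-step TD; the update for s_tau waits n stages."""
    states, rewards = [env.reset()], [0.0]   # rewards[k] holds r_k (index 0 unused)
    T, t = np.inf, 0
    while True:
        if t < T:
            s_next, r, done = env.step(policy(states[t]))
            states.append(s_next)
            rewards.append(r)
            if done:
                T = t + 1
        tau = t - n + 1                      # time whose value estimate is updated
        if tau >= 0:
            # n-step return G_{tau:tau+n}, truncated at termination
            G = sum(gamma ** (k - tau - 1) * rewards[k]
                    for k in range(tau + 1, min(tau + n, T) + 1))
            if tau + n < T:
                G += gamma ** n * v[states[tau + n]]
            v[states[tau]] += alpha * (G - v[states[tau]])
        if tau == T - 1:
            break
        t += 1
    return v
```

The delay discussed above is visible here: the estimate for $s_\tau$ is only updated once $\tau = t - n + 1 \ge 0$, i.e., after $n$ further transitions (or termination) have been observed.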
As an intermediate step to address these challenges, we first introduce the $\lambda$-return algorithm given by
$$v(s_t) \leftarrow v(s_t) + \alpha_t \big( G_t^\lambda - v(s_t) \big),$$
where, given $\lambda \in [0, 1]$, we define $G_t^\lambda := (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}$, taking a weighted average of the $G_{t:t+n}$'s.
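As a sketch of how $G_t^\lambda$ could be evaluated for a stored episode (again an illustration, not something the problem prescribes), the infinite sum can be truncated at the episode length $T$ using the convention that $G_{t:t+n}$ equals the full Monte Carlo return once $t + n \ge T$; the trajectory containers `states`, `rewards`, and the value table `v` below are assumed inputs.

```python
def lambda_return(t, states, rewards, v, lam=0.9, gamma=0.99):
    """Forward-view lambda-return G_t^lambda for a stored (finished) episode.

    states[k] is s_k, rewards[k] is r_{k+1}, and the episode ends after
    T = len(rewards) transitions; v is the current value table.
    """
    T = len(rewards)

    def n_step_return(n):
        # G_{t:t+n} = r_{t+1} + ... + gamma^{n-1} r_{t+n} + gamma^n v(s_{t+n})
        end = min(t + n, T)
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        if t + n < T:
            G += gamma ** n * v[states[t + n]]
        return G

    # G_t^lambda = (1 - lam) * sum_{n>=1} lam^(n-1) * G_{t:t+n}; all terms with
    # t + n >= T equal the Monte Carlo return, so the geometric tail collapses
    # into the single term lam^(T-t-1) * G_t.
    G_lam = (1 - lam) * sum(lam ** (n - 1) * n_step_return(n)
                            for n in range(1, T - t))
    G_lam += lam ** (T - t - 1) * n_step_return(T - t)
    return G_lam
```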
(a) By the definition of $G_{t:t+n}$, we can show that $G_{t:t+n} = r_{t+1} + \gamma G_{t+1:t+n}$. Derive an analogous recursive relationship for $G_t^\lambda$ and $G_{t+1}^\lambda$.
(b) Show that the term $G_t^\lambda - v(s_t)$ in the $\lambda$-return update can be written as a sum of TD errors.
The $n$-step TD algorithm, the Monte Carlo method, and the $\lambda$-return algorithm look forward to approximate $v_\pi$. Alternatively, we can look backward via the eligibility trace method. The TD($\lambda$) algorithm is given by
$$e_t(s) = \gamma \lambda \, e_{t-1}(s) + \mathbb{1}\{s_t = s\}, \quad \forall s \in \mathcal{S},$$
$$v(s) \leftarrow v(s) + \alpha_t \delta_t e_t(s), \quad \forall s \in \mathcal{S},$$
where $e_t$ is called the eligibility vector and the initial $e_{-1}(s) = 0$ for all $s \in \mathcal{S}$.
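A minimal tabular sketch of these two backward-view updates is shown below (illustrative only; the `env`/`policy` interface, the constant step size, and storing `v` as a NumPy array are assumptions). Each step decays every state's trace by $\gamma\lambda$, increments the trace of the visited state, and then updates all states in proportion to their traces.

```python
import numpy as np

def td_lambda_episode(env, policy, v, lam=0.9, alpha=0.1, gamma=0.99):
    """One episode of tabular TD(lambda) with accumulating eligibility traces.

    v is a NumPy array of value estimates indexed by state.
    """
    e = np.zeros_like(v)                 # e_{-1}(s) = 0 for all s in S
    s = env.reset()
    done = False
    while not done:
        s_next, r, done = env.step(policy(s))
        # TD error for the current transition
        delta = r + (0.0 if done else gamma * v[s_next]) - v[s]
        # e_t(s) = gamma * lam * e_{t-1}(s) + 1{s_t = s}
        e *= gamma * lam
        e[s] += 1.0
        # v(s) <- v(s) + alpha_t * delta_t * e_t(s), for all s in S
        v += alpha * delta * e
        s = s_next
    return v
```

Unlike the forward-view methods, every state with a nonzero trace is updated at every step, so no update has to wait for future rewards.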