Question:

Problem 2. (30pt) Given a Markov stationary policy $\pi$, we studied the minimization of the projected Bellman error for policy evaluation via function approximation. Alternatively, we can choose the objective function as
$$ J(\theta) = \frac{1}{2} \sum_{s \in S} \mu_\pi(s) \, \big| v_\pi(s) - v(s;\theta) \big|^2, $$
where $\mu_\pi \in \Delta(S)$ is the stationary distribution of the Markov chain induced by $\pi$ and $v(\cdot;\theta)$ is the approximation of $v_\pi$ with the parameter $\theta \in \mathbb{R}^d$. Then, the gradient of $J(\theta)$ with respect to $\theta$ is given by
$$ \nabla J(\theta) = -\,\mathbb{E}_{s \sim \mu_\pi}\big[ \big( v_\pi(s) - v(s;\theta) \big) \nabla_\theta v(s;\theta) \big]. $$
To find $\theta$ approximating $v_\pi(s)$, we can apply the stochastic gradient method according to
$$ \theta_{k+1} = \theta_k + \alpha_k \big( v_\pi(s_k) - v(s_k;\theta_k) \big) \nabla_\theta v(s_k;\theta_k), \qquad (**) $$
where $\alpha_k > 0$ is the step size and $s_k \in S$ denotes the current state at stage $k$.
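For concreteness, here is a minimal Python sketch of one step of (**), assuming a linear parametrization $v(s;\theta) = \hat{\phi}(s)^T \theta$ (so that $\nabla_\theta v(s;\theta) = \hat{\phi}(s)$) and assuming some estimate of $v_\pi(s_k)$, such as a sampled return, is available as `target`. The function name and arguments are illustrative, not part of the problem.

```python
import numpy as np

def sgd_value_update(theta, phi_s, target, alpha):
    """One stochastic gradient step of (**) for a linear model v(s; theta) = phi(s)^T theta.

    theta  : parameter vector theta_k, shape (d,)
    phi_s  : feature vector phi(s_k) of the current state, shape (d,)
    target : an estimate standing in for v_pi(s_k), e.g. a sampled return
    alpha  : step size alpha_k > 0
    """
    v_hat = phi_s @ theta          # v(s_k; theta_k)
    grad = phi_s                   # grad_theta v(s_k; theta) for a linear model
    return theta + alpha * (target - v_hat) * grad

# Illustrative usage with made-up numbers (not taken from the problem):
theta = np.zeros(3)
theta = sgd_value_update(theta, phi_s=np.array([1.0, 0.0, 0.5]), target=2.0, alpha=0.1)
```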
(a) Show that with direct parametrization, i.e., $v = \theta$, the update (**) reduces to
- the n-step TD learning algorithm if we use $v_\pi(s_t) \approx G_t^{(n)}$,
- the Monte Carlo method if we use $v_\pi(s_t) \approx G_t$, where $G_t = \lim_{n \to \infty} G_t^{(n)}$,
- the $\lambda$-return update if we use $v_\pi(s_t) \approx G_t^{\lambda}$.
Recall the indicator function $I\{s_t = s\}$ in these (non-parametric) updates.
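As a rough illustration of the three targets in part (a), the following sketch computes $G_t^{(n)}$, $G_t$, and $G_t^{\lambda}$ from a recorded episode under a tabular (direct) value estimate. It assumes an episodic setting with `rewards[k]` holding $r_{k+1}$, a discount `gamma`, and a value table `V`; these conventions are illustrative assumptions, not givens of the problem.

```python
def n_step_return(rewards, states, V, t, n, gamma):
    """G_t^(n): n-step return, bootstrapped with the current value table V."""
    T = len(rewards)                           # episode length; rewards[k] stores r_{k+1}
    horizon = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
    if t + n < T:                              # bootstrap only if the episode has not ended
        G += gamma ** n * V[states[t + n]]
    return G

def monte_carlo_return(rewards, t, gamma):
    """G_t = lim_{n -> inf} G_t^(n): the full discounted return to the end of the episode."""
    return sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))

def lambda_return(rewards, states, V, t, gamma, lam):
    """G_t^lambda = (1 - lam) * sum_{n=1}^{T-t-1} lam^(n-1) G_t^(n) + lam^(T-t-1) G_t."""
    T = len(rewards)
    G_lam = sum((1 - lam) * lam ** (n - 1) * n_step_return(rewards, states, V, t, n, gamma)
                for n in range(1, T - t))
    return G_lam + lam ** (T - t - 1) * monte_carlo_return(rewards, t, gamma)
```

With direct parametrization, plugging any of these targets into (**) gives a tabular update of the form $V(s_t) \leftarrow V(s_t) + \alpha_k (G - V(s_t))$, which touches only the component of $\theta$ selected by the indicator $I\{s = s_t\}$.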
(b) The direct parametrization can be viewed as linear function approximation with the feature matrix $I \in \mathbb{R}^{|S| \times |S|}$. What if we have the feature matrix
$$ \Phi = [\phi_1 \;\cdots\; \phi_d] = \begin{bmatrix} \hat{\phi}^T(s_1) \\ \vdots \\ \hat{\phi}^T(s_{|S|}) \end{bmatrix} \in \mathbb{R}^{|S| \times d}, $$
where $\phi_i \in \mathbb{R}^{|S|}$ and $\hat{\phi}(s) \in \mathbb{R}^d$? We have $d \le |S|$ and $\Phi$ is full column rank. Formulate the counterparts of the n-step TD learning, Monte Carlo, and $\lambda$-return algorithms based on (**) under linear function approximation according to the feature matrix $\Phi$.
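Purely as a hedged sketch of what one part (b) counterpart might look like (the exact formulation is what the problem asks you to derive), a semi-gradient n-step TD step under the linear parametrization $v(s;\theta) = \hat{\phi}(s)^T\theta$ replaces $v_\pi(s_t)$ in (**) with an n-step return bootstrapped through the features. Here `feat` is an assumed callable returning the feature vector $\hat{\phi}(s)$ as a NumPy array; the trajectory conventions are the same illustrative ones used above.

```python
def linear_n_step_td_update(theta, feat, rewards, states, t, n, gamma, alpha):
    """One update of (**) with v(s; theta) = feat(s)^T theta and v_pi(s_t) ~ G_t^(n)."""
    T = len(rewards)                          # rewards[k] stores r_{k+1}
    horizon = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
    if t + n < T:
        G += gamma ** n * (feat(states[t + n]) @ theta)   # bootstrap with the current theta
    phi_t = feat(states[t])
    return theta + alpha * (G - phi_t @ theta) * phi_t
```

Replacing $G$ with the full Monte Carlo return or the $\lambda$-return from part (a) would give the other two counterparts; only the target changes, while the gradient term $\hat{\phi}(s_t)$ stays the same.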