Question: Problem 3. (50 pt) Consider an infinite horizon MDP characterized by $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$ with reward function $r: \mathcal{S} \times \mathcal{A} \to [0, 1]$. We would like to evaluate the value of a Markov stationary policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$. However, we do not know the transition kernel $P$. Rather than applying a model-free approach, we decided to use a model-based approach where we first estimate the underlying transition kernel by following some fully stochastic policy in the MDP (for good exploration) and observing the triples $(s_t, a_t, s_{t+1}) \in \mathcal{S} \times \mathcal{A} \times \mathcal{S}$ for $t = 0, 1, \dots$. Let $\widehat{P}$ be our estimate of $P$ based on the data collected. Now, we can apply value iteration directly as if the underlying MDP were $\widehat{\mathcal{M}} = (\mathcal{S}, \mathcal{A}, \widehat{P}, r, \gamma)$ and obtain $\widehat{V}^{\pi}$.
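(Not part of the original statement, but as a concrete illustration of this model-based pipeline, here is a minimal sketch on a small tabular MDP. Everything in it is an assumption made for illustration: the random 5-state MDP, the uniform exploration policy, and the helper names `estimate_kernel` and `policy_evaluation`. The empirical kernel is the usual count-based estimate, and $\widehat{V}^{\pi}$ is computed by iterative policy evaluation on $\widehat{\mathcal{M}}$.)

```python
import numpy as np

# Hypothetical tabular sizes; the problem does not fix |S|, |A|, or gamma.
n_states, n_actions, gamma = 5, 3, 0.9
rng = np.random.default_rng(0)

# Ground-truth kernel P[s, a] is a distribution over next states (unknown to the agent).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))   # rewards in [0, 1]
pi = rng.dirichlet(np.ones(n_actions), size=n_states)    # policy to evaluate, pi[s] in Delta(A)

def estimate_kernel(P, n_steps=50_000):
    """Follow a uniform (fully stochastic) policy and build the count-based estimate P_hat."""
    counts = np.zeros((n_states, n_actions, n_states))
    s = 0
    for _ in range(n_steps):
        a = rng.integers(n_actions)                       # uniform exploration policy
        s_next = rng.choice(n_states, p=P[s, a])
        counts[s, a, s_next] += 1
        s = s_next
    totals = counts.sum(axis=2, keepdims=True)
    # Fall back to a uniform distribution for any (s, a) pair that was never visited.
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)

def policy_evaluation(P_model, tol=1e-10):
    """Iteratively apply the Bellman operator for pi under the given transition model."""
    V = np.zeros(n_states)
    while True:
        Q = r + gamma * P_model @ V                       # Q[s, a]
        V_new = (pi * Q).sum(axis=1)                      # average over a ~ pi(.|s)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

P_hat = estimate_kernel(P)
V_true, V_hat = policy_evaluation(P), policy_evaluation(P_hat)
print("|V^pi(s0) - V_hat^pi(s0)| =", abs(V_true[0] - V_hat[0]))
```

With more exploration steps the empirical kernel concentrates around $P$ and the printed gap shrinks, which is exactly the behavior quantified by the lemma stated next.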
Prove the simulation lemma bounding the difference between $\widehat{V}^{\pi}$ and the true value of the policy, denoted by $V^{\pi}$, by showing that
$$\big| V^{\pi}(s_0) - \widehat{V}^{\pi}(s_0) \big| \;\le\; \frac{\gamma}{(1-\gamma)^2}\, \mathbb{E}_{(s,a) \sim d^{\pi}_{s_0}} \Big[ \big\| \widehat{P}(\cdot \mid s, a) - P(\cdot \mid s, a) \big\|_1 \Big],$$
where $s_0$ is the initial state and $d^{\pi}_{s_0}$ is the discounted state visitation distribution under policy $\pi$ (with $a \sim \pi(\cdot \mid s)$ in the expectation). Note that the difference $\big| V^{\pi}(s_0) - \widehat{V}^{\pi}(s_0) \big|$ gets smaller as the model approximation error $\big\| \widehat{P}(\cdot \mid s, a) - P(\cdot \mid s, a) \big\|_1$ gets smaller. However, the impact of the model approximation error gets larger as $\gamma \approx 1$, since the approximation error propagates across more stages.
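(Not part of the original statement, but for reference, one standard route to this bound is sketched below. It is only a sketch: it uses the convention $P^{\pi}(s' \mid s) = \sum_a \pi(a \mid s)\, P(s' \mid s, a)$, reads $(s,a) \sim d^{\pi}_{s_0}$ as $s \sim d^{\pi}_{s_0}$, $a \sim \pi(\cdot \mid s)$, and relies on the identity $e_{s_0}^{\top}(I - \gamma P^{\pi})^{-1} = \tfrac{1}{1-\gamma}\, d^{\pi\,\top}_{s_0}$, which follows from the definition of the discounted visitation distribution.)

```latex
% Proof sketch (a standard decomposition; the details are what the assignment asks you to fill in).
\begin{align*}
\big| V^{\pi}(s_0) - \widehat{V}^{\pi}(s_0) \big|
  &= \Big| \gamma\, e_{s_0}^{\top} (I - \gamma P^{\pi})^{-1}
      \big( P^{\pi} - \widehat{P}^{\pi} \big) \widehat{V}^{\pi} \Big|
      && \text{(subtract the two Bellman equations, rearrange)} \\
  &\le \frac{\gamma}{1-\gamma}\,
      \mathbb{E}_{s \sim d^{\pi}_{s_0}}
      \Big[ \big| \big( (P^{\pi} - \widehat{P}^{\pi}) \widehat{V}^{\pi} \big)(s) \big| \Big]
      && \text{(triangle inequality, visitation identity)} \\
  &\le \frac{\gamma}{1-\gamma}\,
      \mathbb{E}_{(s,a) \sim d^{\pi}_{s_0}}
      \Big[ \big\| P(\cdot \mid s, a) - \widehat{P}(\cdot \mid s, a) \big\|_{1} \Big]
      \, \big\| \widehat{V}^{\pi} \big\|_{\infty}
      && \text{(expand over } a \sim \pi(\cdot \mid s)\text{, H\"older)} \\
  &\le \frac{\gamma}{(1-\gamma)^{2}}\,
      \mathbb{E}_{(s,a) \sim d^{\pi}_{s_0}}
      \Big[ \big\| \widehat{P}(\cdot \mid s, a) - P(\cdot \mid s, a) \big\|_{1} \Big].
      && \text{(since } 0 \le \widehat{V}^{\pi} \le \tfrac{1}{1-\gamma}\text{)}
\end{align*}
```

The last step uses $\|\widehat{V}^{\pi}\|_{\infty} \le \tfrac{1}{1-\gamma}$, which holds because $r \in [0,1]$; this horizon factor, combined with the $\tfrac{1}{1-\gamma}$ from the visitation identity, is where the $(1-\gamma)^{-2}$ dependence highlighted in the problem comes from.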
