Question: Problem 3. (50pt) Consider an infinite horizon MDP, characterized by $M = (\mathcal{S}, \mathcal{A}, r, p, \gamma)$ with $r : \mathcal{S} \times \mathcal{A} \to [0,1]$. We would like to evaluate the value of a Markov stationary policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$. However, we do not know the transition kernel $p$. Rather than applying a model-free approach, we decide to use a model-based approach: we first estimate the underlying transition kernel by following some fully stochastic policy in the MDP (for good exploration) and observing the triples $(s_k, a_k, s_{k+1}) \in \mathcal{S} \times \mathcal{A} \times \mathcal{S}$ for $k = 0, 1, \ldots$. Let $\hat{p}$ be our estimate of $p$ based on the data collected. Now, we can apply value iteration directly as if the underlying MDP were $\hat{M} = (\mathcal{S}, \mathcal{A}, r, \hat{p}, \gamma)$ and obtain $\hat{v}^\pi$. Prove the simulation lemma bounding the difference between $\hat{v}^\pi$ and the true value of the policy, denoted by $v^\pi$, by showing that
\[
|v^\pi(s_0) - \hat{v}^\pi(s_0)| \le \frac{\gamma}{(1-\gamma)^2}\, \mathbb{E}_{s \sim d^\pi,\, a \sim \pi(\cdot \mid s)}\!\left[\|\hat{p}(\cdot \mid s, a) - p(\cdot \mid s, a)\|_1\right],
\]
where $s_0$ is the initial state and $d^\pi$ is the discounted state visitation distribution under policy $\pi$. Note that the difference $|v^\pi(s_0) - \hat{v}^\pi(s_0)|$ gets smaller with a smaller model approximation error $\|\hat{p}(\cdot \mid s, a) - p(\cdot \mid s, a)\|_1$. However, the impact of the model approximation error gets larger as $\gamma \to 1$, since the approximation error propagates more across stages.
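One standard route to this bound, sketched here under the stated assumptions with the induced state-to-state matrices $P_\pi(s' \mid s) = \sum_a \pi(a \mid s)\, p(s' \mid s, a)$ and $\hat{P}_\pi$ (defined analogously from $\hat{p}$) introduced for convenience, starts from the two Bellman evaluation equations
\[
v^\pi = r_\pi + \gamma P_\pi v^\pi, \qquad \hat{v}^\pi = r_\pi + \gamma \hat{P}_\pi \hat{v}^\pi,
\qquad r_\pi(s) = \sum_a \pi(a \mid s)\, r(s, a).
\]
Subtracting and rearranging gives
\[
v^\pi - \hat{v}^\pi = \gamma P_\pi (v^\pi - \hat{v}^\pi) + \gamma (P_\pi - \hat{P}_\pi)\, \hat{v}^\pi
\quad \Longrightarrow \quad
v^\pi - \hat{v}^\pi = \gamma (I - \gamma P_\pi)^{-1} (P_\pi - \hat{P}_\pi)\, \hat{v}^\pi .
\]
Evaluating the $s_0$-th coordinate, and using $e_{s_0}^\top (I - \gamma P_\pi)^{-1} = \frac{1}{1-\gamma}\, d^\pi$ (the discounted visitation written as a row vector), H\"older's inequality, and $\|\hat{v}^\pi\|_\infty \le \frac{1}{1-\gamma}$ (valid since $r \in [0,1]$), yields
\[
|v^\pi(s_0) - \hat{v}^\pi(s_0)|
\le \frac{\gamma}{1-\gamma}\, \mathbb{E}_{s \sim d^\pi,\, a \sim \pi(\cdot \mid s)}\!\left[\|\hat{p}(\cdot \mid s, a) - p(\cdot \mid s, a)\|_1\right] \cdot \frac{1}{1-\gamma},
\]
which is the claimed bound.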
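As a quick numerical sanity check of the inequality, the following sketch (assuming a small randomly generated MDP; names such as n_states, policy_value, and p_hat are illustrative and not part of the problem) compares the value gap at $s_0$ against the right-hand side using exact policy evaluation:

# Numerical sanity check of the simulation lemma bound on a small random MDP.
# All names and the perturbation used for p_hat are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
s0 = 0

# True kernel p(s'|s,a), reward r(s,a) in [0,1], and a fixed stochastic policy pi(a|s).
p = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # shape (S, A, S')
r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))              # shape (S, A)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)              # shape (S, A)

# Estimated kernel p_hat: the true kernel mixed with noise, a stand-in for
# the estimation error that would come from learning p from observed triples.
noise = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
p_hat = 0.9 * p + 0.1 * noise

def policy_value(kernel):
    """Exact policy evaluation: v = (I - gamma * P_pi)^{-1} r_pi."""
    P = np.einsum("sa,sat->st", pi, kernel)    # state-to-state transitions under pi
    r_pi = np.einsum("sa,sa->s", pi, r)        # expected reward under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P, r_pi)

v = policy_value(p)          # true value v^pi
v_hat = policy_value(p_hat)  # model-based value v_hat^pi

# Discounted state visitation d^pi(s) = (1-gamma) * sum_t gamma^t Pr(s_t = s | s_0),
# computed in the true MDP.
P_pi = np.einsum("sa,sat->st", pi, p)
e0 = np.zeros(n_states)
e0[s0] = 1.0
d = (1 - gamma) * np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, e0)

# Right-hand side of the simulation lemma.
l1_err = np.abs(p_hat - p).sum(axis=2)               # ||p_hat(.|s,a) - p(.|s,a)||_1
expected_err = np.einsum("s,sa,sa->", d, pi, l1_err)  # average under d^pi and pi
bound = gamma / (1 - gamma) ** 2 * expected_err

print(f"|v(s0) - v_hat(s0)| = {abs(v[s0] - v_hat[s0]):.4f}")
print(f"simulation-lemma bound = {bound:.4f}")
assert abs(v[s0] - v_hat[s0]) <= bound

Since both values are computed exactly from the linear system $v = (I - \gamma P_\pi)^{-1} r_\pi$, the printed gap and bound reflect the inequality itself rather than any sampling noise.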

