Problem 3. (50 pt) Consider an infinite horizon MDP, characterized by $M = \langle S, A, r, p, \gamma \rangle$ with $r: S \times A \to [0,1]$. We would like to evaluate the value of a Markov stationary policy $\pi: S \to \Delta(A)$. However, we do not know the transition kernel $p$. Rather than applying a model-free approach, we decide to use a model-based approach: we first estimate the underlying transition kernel by following some fully stochastic policy in the MDP (for good exploration) and observing the triples $(s_k, a_k, s_{k+1}) \in S \times A \times S$ for $k = 0, 1, \dots$. Let $\widehat{p}$ be our estimate of $p$ based on the data collected. We can then apply value iteration directly as if the underlying MDP were $\widehat{M} = \langle S, A, r, \widehat{p}, \gamma \rangle$ and obtain $\widehat{v}^{\pi}$. Prove the simulation lemma bounding the difference between $\widehat{v}^{\pi}$ and the true value of the policy, denoted by $v^{\pi}$, by showing that
\[ \big| v^{\pi}(s_0) - \widehat{v}^{\pi}(s_0) \big| \le \frac{\gamma}{(1-\gamma)^2} \, \mathbb{E}_{s \sim d_{s_0}^{\pi},\, a \sim \pi(s)} \big\| \widehat{p}(\cdot \mid s, a) - p(\cdot \mid s, a) \big\|_1, \]
where $s_0$ is the initial state and $d_{s_0}^{\pi}$ is the discounted state visitation distribution under policy $\pi$. Note that the difference $|v^{\pi}(s_0) - \widehat{v}^{\pi}(s_0)|$ shrinks as the model approximation error $\| \widehat{p}(\cdot \mid s, a) - p(\cdot \mid s, a) \|_1$ shrinks. However, the impact of the model approximation error grows as $\gamma \to 1$, since the approximation error propagates across more stages.
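Proof sketch (a hedged outline of one standard argument, not the expert solution referenced below; it assumes the normalized convention $d_{s_0}^{\pi}(s) = (1-\gamma) \sum_{t \ge 0} \gamma^t \Pr(s_t = s \mid s_0, \pi)$, which the problem does not state explicitly):

Write $P^{\pi}$ and $\widehat{P}^{\pi}$ for the state transition matrices induced by $\pi$ under $p$ and $\widehat{p}$, i.e. $P^{\pi}(s' \mid s) = \sum_a \pi(a \mid s)\, p(s' \mid s, a)$, and let $r^{\pi}(s) = \sum_a \pi(a \mid s)\, r(s, a)$. Then $v^{\pi} = (I - \gamma P^{\pi})^{-1} r^{\pi}$ and $\widehat{v}^{\pi} = (I - \gamma \widehat{P}^{\pi})^{-1} r^{\pi}$. Applying the identity $A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}$ with $A = I - \gamma P^{\pi}$ and $B = I - \gamma \widehat{P}^{\pi}$,
\[ v^{\pi} - \widehat{v}^{\pi} = \gamma\, (I - \gamma P^{\pi})^{-1} \big( P^{\pi} - \widehat{P}^{\pi} \big)\, \widehat{v}^{\pi}. \]
Evaluating at $s_0$ and using $e_{s_0}^{\top} (I - \gamma P^{\pi})^{-1} = \frac{1}{1-\gamma} (d_{s_0}^{\pi})^{\top}$ (this row vector is exactly the normalized discounted visitation distribution scaled by $\frac{1}{1-\gamma}$),
\[ \big| v^{\pi}(s_0) - \widehat{v}^{\pi}(s_0) \big| \le \frac{\gamma}{1-\gamma}\, \mathbb{E}_{s \sim d_{s_0}^{\pi}} \Big| \sum_a \pi(a \mid s) \sum_{s'} \big( p(s' \mid s, a) - \widehat{p}(s' \mid s, a) \big)\, \widehat{v}^{\pi}(s') \Big|. \]
By Hölder's inequality the inner sum over $s'$ is at most $\| \widehat{p}(\cdot \mid s, a) - p(\cdot \mid s, a) \|_1 \, \| \widehat{v}^{\pi} \|_{\infty}$, and since $r \in [0,1]$ we have $\| \widehat{v}^{\pi} \|_{\infty} \le \frac{1}{1-\gamma}$. Combining the two factors of $\frac{1}{1-\gamma}$,
\[ \big| v^{\pi}(s_0) - \widehat{v}^{\pi}(s_0) \big| \le \frac{\gamma}{(1-\gamma)^2}\, \mathbb{E}_{s \sim d_{s_0}^{\pi},\, a \sim \pi(s)} \big\| \widehat{p}(\cdot \mid s, a) - p(\cdot \mid s, a) \big\|_1, \]
which is the claimed simulation lemma. The $\frac{1}{(1-\gamma)^2}$ factor makes explicit how the model error compounds as $\gamma \to 1$.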
