Problem 3. (50 pt) Consider an infinite-horizon MDP characterized by $\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, p, \gamma)$ with reward function $r : \mathcal{S} \times \mathcal{A} \to [0,1]$. We would like to evaluate the value of a Markov stationary policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$. However, we do not know the transition kernel $p$. Rather than applying a model-free approach, we decided to use a model-based approach: we first estimate the underlying transition kernel by following some fully stochastic policy in the MDP (for good exploration) and observing the triples $(s_k, a_k, s_{k+1}) \in \mathcal{S} \times \mathcal{A} \times \mathcal{S}$ for $k = 0, 1, \dots$ Let $\hat{p}$ be our estimate of $p$ based on the data collected. We can then apply value iteration directly, as if the underlying MDP were $\widehat{\mathcal{M}} = (\mathcal{S}, \mathcal{A}, r, \hat{p}, \gamma)$, and obtain $\hat{v}^{\pi}$.

Prove the simulation lemma bounding the difference between $\hat{v}^{\pi}$ and the true value of the policy, denoted by $v^{\pi}$, by showing that
$$
\left| v^{\pi}(s_0) - \hat{v}^{\pi}(s_0) \right| \;\le\; \frac{\gamma}{(1-\gamma)^2}\, \mathbb{E}_{s \sim d^{\pi}_{s_0},\, a \sim \pi(\cdot \mid s)} \left\| \hat{p}(\cdot \mid s, a) - p(\cdot \mid s, a) \right\|_1,
$$
where $s_0$ is the initial state and $d^{\pi}_{s_0}$ is the discounted state visitation distribution under policy $\pi$. Note that the difference $\left| v^{\pi}(s_0) - \hat{v}^{\pi}(s_0) \right|$ shrinks as the model approximation error $\left\| \hat{p}(\cdot \mid s, a) - p(\cdot \mid s, a) \right\|_1$ shrinks. However, the impact of the model approximation error becomes larger as $\gamma \approx 1$, since the approximation error propagates across more stages.
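
One standard route to this bound (a sketch only, using matrix notation not given in the problem statement: $P^{\pi}$ and $\widehat{P}^{\pi}$ denote the state-to-state transition matrices induced by $\pi$ under $p$ and $\hat{p}$, respectively) subtracts the two Bellman evaluation equations so that the rewards cancel:
$$
\begin{aligned}
v^{\pi} - \hat{v}^{\pi}
  &= \gamma P^{\pi} v^{\pi} - \gamma \widehat{P}^{\pi} \hat{v}^{\pi}
   = \gamma \left( P^{\pi} - \widehat{P}^{\pi} \right) \hat{v}^{\pi} + \gamma P^{\pi} \left( v^{\pi} - \hat{v}^{\pi} \right) \\
\Longrightarrow \quad
v^{\pi} - \hat{v}^{\pi}
  &= \gamma \left( I - \gamma P^{\pi} \right)^{-1} \left( P^{\pi} - \widehat{P}^{\pi} \right) \hat{v}^{\pi}.
\end{aligned}
$$
Evaluating at $s_0$, the row $e_{s_0}^{\top} (I - \gamma P^{\pi})^{-1}$ equals $\frac{1}{1-\gamma}\, d^{\pi}_{s_0}$, and Hölder's inequality together with $\|\hat{v}^{\pi}\|_{\infty} \le \frac{1}{1-\gamma}$ (rewards in $[0,1]$) yields the stated $\frac{\gamma}{(1-\gamma)^2}$ factor.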
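
For intuition only (not part of the problem), below is a minimal numerical sketch on a small random tabular MDP: it estimates $\hat{p}$ from triples collected under a uniform behavior policy, evaluates $v^{\pi}$ and $\hat{v}^{\pi}$, and compares the gap at $s_0$ against the right-hand side of the bound. All names and constants (`policy_eval`, the MDP sizes, the sample count) are illustrative, and an exact linear solve stands in for running value iteration to convergence.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small random tabular MDP (illustrative instance): gamma < 1, rewards in [0, 1].
nS, nA, gamma = 5, 3, 0.9
p = rng.dirichlet(np.ones(nS), size=(nS, nA))   # true kernel p(s' | s, a)
r = rng.uniform(0.0, 1.0, size=(nS, nA))        # reward r(s, a)
pi = rng.dirichlet(np.ones(nA), size=nS)        # policy pi(a | s) to evaluate
s0 = 0

def policy_eval(kernel):
    """Exact policy evaluation: solve (I - gamma * P_pi) v = r_pi."""
    P_pi = np.einsum("sa,sat->st", pi, kernel)  # state-to-state matrix under pi
    r_pi = np.einsum("sa,sa->s", pi, r)
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

# Collect (s_k, a_k, s_{k+1}) triples with a uniform (fully stochastic) behavior policy.
counts = np.zeros((nS, nA, nS))
s = s0
for _ in range(20_000):
    a = rng.integers(nA)
    s_next = rng.choice(nS, p=p[s, a])
    counts[s, a, s_next] += 1
    s = s_next

# Empirical kernel; (s, a) pairs that were never visited fall back to uniform.
visits = counts.sum(axis=2, keepdims=True)
p_hat = np.where(visits > 0, counts / np.maximum(visits, 1), 1.0 / nS)

v, v_hat = policy_eval(p), policy_eval(p_hat)

# Right-hand side of the simulation lemma.
P_pi = np.einsum("sa,sat->st", pi, p)
# Discounted state visitation distribution d^pi_{s0} (normalized, sums to 1).
d = (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, np.eye(nS)[s0])
l1_err = np.abs(p_hat - p).sum(axis=2)          # ||p_hat(.|s,a) - p(.|s,a)||_1
rhs = gamma / (1 - gamma) ** 2 * np.einsum("s,sa,sa->", d, pi, l1_err)

print(f"|v(s0) - v_hat(s0)| = {abs(v[s0] - v_hat[s0]):.4f}  <=  bound = {rhs:.4f}")
```

Increasing the sample count shrinks the $\ell_1$ model error and hence the bound, while pushing $\gamma$ toward $1$ inflates the $\gamma/(1-\gamma)^2$ factor, matching the remark in the problem statement.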