Question: Problem 3. (50 pt) Consider an infinite-horizon MDP characterized by $M = (\mathcal{S}, \mathcal{A}, r, p, \gamma)$ with reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. We would like to evaluate the value of a Markov stationary policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$. However, we do not know the transition kernel $p$. Rather than applying a model-free approach, we decide to use a model-based approach: we first estimate the underlying transition kernel by following some fully stochastic policy in the MDP (for good exploration) and observing the triples $(s_k, a_k, s_k') \in \mathcal{S} \times \mathcal{A} \times \mathcal{S}$ for $k = 1, \dots, n$. Let $\widehat{p}$ be our estimate of $p$ based on the data collected. Now, we can apply value iteration directly as if the underlying MDP were $\widehat{M} = (\mathcal{S}, \mathcal{A}, r, \widehat{p}, \gamma)$ and obtain $\widehat{v}^\pi$. Prove the simulation lemma bounding the difference between $\widehat{v}^\pi$ and the true value of the policy, denoted by $v^\pi$, by showing that
$$ v^\pi(s) - \widehat{v}^\pi(s) = \frac{\gamma}{1-\gamma}\, \mathbb{E}_{s' \sim d_s^\pi,\ a \sim \pi(\cdot \mid s')} \Big[ \big\langle p(\cdot \mid s', a) - \widehat{p}(\cdot \mid s', a),\ \widehat{v}^\pi \big\rangle \Big], $$
where $s$ is the initial state and $d_s^\pi$ is the discounted state visitation distribution under policy $\pi$. Note that the difference $v^\pi(s) - \widehat{v}^\pi(s)$ gets smaller as the model approximation error $\widehat{p}(\cdot \mid s, a) - p(\cdot \mid s, a)$ gets smaller. However, the impact of the model approximation error grows with $\gamma$, since the approximation error propagates across more stages.
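One way the stated identity can be obtained is the standard resolvent argument, sketched below in vector notation. This is only a sketch, and it assumes the convention that $d_s^\pi$ is the normalized discounted visitation distribution $d_s^\pi(s') = (1-\gamma)\sum_{t \ge 0}\gamma^t \Pr(s_t = s' \mid s_0 = s, \pi)$; that convention is an assumption, not stated in the problem.

```latex
% Sketch only; assumes d_s^pi is the normalized discounted visitation distribution.
\begin{align*}
  v^\pi - \widehat{v}^\pi
    &= (I - \gamma P^\pi)^{-1} r^\pi - (I - \gamma \widehat{P}^\pi)^{-1} r^\pi \\
    &= (I - \gamma P^\pi)^{-1}
       \Big[(I - \gamma \widehat{P}^\pi) - (I - \gamma P^\pi)\Big]
       (I - \gamma \widehat{P}^\pi)^{-1} r^\pi \\
    &= \gamma\, (I - \gamma P^\pi)^{-1}\big(P^\pi - \widehat{P}^\pi\big)\, \widehat{v}^\pi .
\end{align*}
% Reading off the row of (I - gamma P^pi)^{-1} for the initial state s and using
% e_s^T (I - gamma P^pi)^{-1} = (1/(1-gamma)) d_s^pi yields
\[
  v^\pi(s) - \widehat{v}^\pi(s)
  = \frac{\gamma}{1-\gamma}\,
    \mathbb{E}_{s' \sim d_s^\pi,\ a \sim \pi(\cdot \mid s')}
    \Big[\big\langle p(\cdot \mid s', a) - \widehat{p}(\cdot \mid s', a),\ \widehat{v}^\pi \big\rangle\Big].
\]
```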
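For intuition about the model-based pipeline the problem describes, here is a minimal numerical sketch: estimate $\widehat{p}$ by empirical transition frequencies from the observed $(s_k, a_k, s_k')$ triples, then evaluate $\widehat{v}^\pi$ by iterating the Bellman expectation operator under $\widehat{p}$. The tabular array shapes, the smoothing term, and the helper names (`estimate_kernel`, `evaluate_policy`) are illustrative assumptions, not part of the assignment.

```python
# Minimal sketch for a small tabular MDP (shapes and helper names are assumed).
import numpy as np

def estimate_kernel(transitions, n_states, n_actions, smoothing=1e-6):
    """Empirical estimate p_hat(s'|s,a) from observed (s, a, s') triples."""
    counts = np.full((n_states, n_actions, n_states), smoothing)
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1.0
    return counts / counts.sum(axis=2, keepdims=True)

def evaluate_policy(p, r, pi, gamma, tol=1e-10):
    """Iterate the Bellman expectation operator v <- r_pi + gamma * P_pi v."""
    # p: (S, A, S) kernel, r: (S, A) rewards, pi: (S, A) action probabilities.
    r_pi = np.einsum("sa,sa->s", pi, r)        # policy-averaged reward
    p_pi = np.einsum("sa,sat->st", pi, p)      # policy-averaged transition matrix
    v = np.zeros(r.shape[0])
    while True:
        v_new = r_pi + gamma * p_pi @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# Usage sketch:
#   p_hat = estimate_kernel(transitions, n_states, n_actions)
#   v_hat = evaluate_policy(p_hat, r, pi, gamma)   # value under the learned model
#   v_true = evaluate_policy(p, r, pi, gamma)      # value under the true model
# The simulation lemma relates v_true[s] - v_hat[s] to
# gamma/(1-gamma) * E_{s'~d_s^pi, a~pi(s')}[(p - p_hat)(.|s',a) @ v_hat].
```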
