Question:

Define a proper policy for an MDP as one that is guaranteed to reach a terminal state. Show that it is possible for a passive ADP agent to learn a transition model for which its policy π is improper, even if π is proper for the true MDP; with such models, the value determination step may fail if γ = 1. Show that this problem cannot arise if value determination is applied to the learned model only at the end of a trial.
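As a minimal sketch of the failure mode the question describes (the transition probabilities, rewards, and state layout below are assumed purely for illustration and are not part of the exercise): value determination for a fixed policy π solves the linear system (I − γ·P_π)V = R over the nonterminal states. If the learned model makes π improper and γ = 1, then the rows of P_π sum to 1 and I − P_π is singular, so the solve fails.

```python
import numpy as np

# Illustrative sketch (assumed numbers, not from the exercise): two
# nonterminal states under a learned transition model in which the
# agent's policy pi never reaches the terminal state, i.e. pi is
# improper for that model.
#
# Value determination solves  (I - gamma * P_pi) V = R  over the
# nonterminal states, where P_pi holds the learned transition
# probabilities among nonterminal states under pi.

R = np.array([-0.04, -0.04])        # assumed AIMA-style step rewards

# Learned model: the two states just cycle between each other under pi,
# so no probability mass ever reaches a terminal state.
P_pi_improper = np.array([[0.0, 1.0],
                          [1.0, 0.0]])

def value_determination(P, R, gamma):
    """Solve (I - gamma * P) V = R for the fixed policy's values."""
    A = np.eye(len(R)) - gamma * P
    return np.linalg.solve(A, R)

gamma = 1.0
try:
    print("V =", value_determination(P_pi_improper, R, gamma))
except np.linalg.LinAlgError as e:
    # With gamma = 1 and an improper policy, every row of P_pi sums to 1,
    # so (I - P_pi) has a zero eigenvalue and the linear solve fails.
    print("Value determination failed:", e)

# By contrast, if the learned model keeps pi proper (some probability of
# reaching a terminal state from every state), the matrix is nonsingular
# even at gamma = 1 and the solve succeeds.
P_pi_proper = np.array([[0.0, 0.9],   # 0.1 leaks to the terminal state
                        [0.9, 0.0]])
print("V =", value_determination(P_pi_proper, R, gamma))
```

This is only meant to make the question's claim concrete: running value determination on an intermediate learned model can hit a singular system at γ = 1, whereas a model estimated only from complete trials has, by construction, observed each visited state on a path that reached a terminal state.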