Question:
Define a proper policy for an MDP as one that is guaranteed to reach a terminal state. Show that it is possible for a passive ADP agent to learn a transition model under which its policy π is improper, even though π is proper for the true MDP; with such a model, the value-determination step may fail if γ = 1. Show that this problem cannot arise if value determination is applied to the learned model only at the end of a trial.
Step by Step Solution
Consider a world with two states S0 and S1 with two a...
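To see concretely why value determination breaks down, here is a minimal sketch of the failure mode suggested by the two-state setup. All the numbers (the reward r, the looping transition model) are illustrative assumptions, not part of the original exercise: suppose the learned model says that under π the agent bounces between S0 and S1 forever and never reaches the terminal state, so π is improper in the learned model. Value determination then solves the linear system (I − γT)V = R, which is singular exactly when γ = 1.

```python
# Hypothetical learned model: two non-terminal states S0 and S1.
# Under policy pi, the learned transitions loop forever between
# S0 and S1 (never reaching the terminal state), so pi is improper
# in the learned model even if it is proper in the true MDP.
#
# Value determination solves:
#   V0 = r + gamma * V1
#   V1 = r + gamma * V0
# i.e. (I - gamma*T)V = R with A = [[1, -gamma], [-gamma, 1]].

def value_determination(r, gamma):
    """Solve the 2x2 system (I - gamma*T)V = R for the looping model."""
    det = 1.0 - gamma * gamma          # determinant of A
    if abs(det) < 1e-12:
        # gamma = 1 with an improper policy: infinite expected cost,
        # the linear system has no (unique) solution.
        raise ValueError("singular system: improper policy and gamma = 1")
    # Cramer's rule for the 2x2 system
    v0 = r * (1.0 + gamma) / det
    v1 = r * (1.0 + gamma) / det
    return v0, v1

print(value_determination(-1.0, 0.9))  # gamma < 1: finite values near -10

try:
    value_determination(-1.0, 1.0)     # gamma = 1: value determination fails
except ValueError as e:
    print("gamma = 1:", e)
```

This is the core of the argument: with γ < 1 the discounting keeps the system nonsingular even for an improper learned model, but at γ = 1 an improper π makes the expected total reward unbounded and the equations unsolvable. Evaluating only at the end of a trial guarantees the trial actually reached a terminal state, so the learned model cannot assign π zero probability of termination.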
