Question: Consider the Markov Decision Process (MDP) over the two states A and B, with transition probabilities T(s, a, s') and reward function R(s, a) as given in the tables below. Assume the discount factor γ = 1, i.e. there is no actual discounting. (The tables of T(s, a, s') and R(s, a) are not reproduced here.) We follow the steps of the Policy Iteration algorithm as explained in class. Write down the Bellman equation. The initial policy takes the same action in state A and in state B. Calculate the values V^π(A) and V^π(B) from two iterations of policy evaluation (the Bellman equation), after initializing both V^π(A) and V^π(B) to 0. Finally, find an improved policy π_new based on the calculated values V^π(A) and V^π(B).
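For reference, the policy-evaluation (Bellman) equation for a fixed policy π has the general form below; this is the textbook form, not a value worked out from the question's (missing) tables, and here γ = 1:

V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s'} T(s, \pi(s), s') \, V^{\pi}(s')

and the improved policy is the greedy policy with respect to the evaluated values:

\pi_{\text{new}}(s) = \arg\max_{a} \Big[ R(s, a) + \gamma \sum_{s'} T(s, a, s') \, V^{\pi}(s') \Big]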

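Since the worked solution is not reproduced here, the following is a minimal Python sketch of the procedure the question describes: two synchronous sweeps of policy evaluation starting from V = 0, followed by one greedy policy-improvement step. The action names (a1, a2) and all numbers in T and R are hypothetical placeholders, not the values from the question's tables; substitute the actual table entries before reading anything off the output.

GAMMA = 1.0  # the question states there is no actual discounting

STATES = ["A", "B"]
ACTIONS = ["a1", "a2"]  # hypothetical action names

# T[(s, a)] = list of (next_state, probability); R[(s, a)] = immediate reward.
# Placeholder numbers only, NOT the values from the question's tables.
T = {
    ("A", "a1"): [("A", 0.5), ("B", 0.5)],
    ("A", "a2"): [("A", 1.0)],
    ("B", "a1"): [("A", 0.5), ("B", 0.5)],
    ("B", "a2"): [("B", 1.0)],
}
R = {("A", "a1"): 1.0, ("A", "a2"): 0.0,
     ("B", "a1"): 2.0, ("B", "a2"): 0.0}

def q_value(s, a, V):
    """One-step lookahead: R(s, a) + gamma * sum over s' of T(s, a, s') * V(s')."""
    return R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in T[(s, a)])

# Initial policy: the same action in both states, as in the question.
policy = {"A": "a1", "B": "a1"}

# Policy evaluation: two sweeps of the Bellman equation, starting from V = 0.
V = {s: 0.0 for s in STATES}
for sweep in range(2):
    V = {s: q_value(s, policy[s], V) for s in STATES}
    print(f"after sweep {sweep + 1}: V(A) = {V['A']:.3f}, V(B) = {V['B']:.3f}")

# Policy improvement: pick the greedy action with respect to the evaluated V.
policy_new = {s: max(ACTIONS, key=lambda a: q_value(s, a, V)) for s in STATES}
print("improved policy:", policy_new)

With the real tables plugged in, the two printed sweeps give the requested V^π(A) and V^π(B), and policy_new is the improved policy π_new asked for in the last part of the question.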