Question:

Consider the Markov Decision Process (MDP) with transition probabilities and reward function as given in the tables below. Assume the discount factor γ = 1 (i.e., there is no actual discounting).

[Tables of transition probabilities T(s, a, s') and rewards R(s, a): not recoverable from the extracted text.]

We follow the steps of the Policy Iteration algorithm as explained in class.

1. Write down the Bellman equation.
2. The initial policy is π(A) = 1 and π(B) = 1; that is, action 1 is taken in state A, and the same action is taken in state B as well. Calculate the values V^π(A) and V^π(B) from two iterations of policy evaluation (Bellman equation) after initializing both V^π(A) and V^π(B) to 0.
3. Find an improved policy π_new based on the calculated values V^π(A) and V^π(B).
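As a reference for part 1, a standard form of the Bellman equation for evaluating a fixed policy π, written with a reward R(s, a) that depends only on the state and action (matching the question's reward table), together with the greedy improvement step used in part 3, is:

V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s'} T(s, \pi(s), s') \, V^{\pi}(s')

\pi_{\text{new}}(s) = \arg\max_{a} \Big[ R(s, a) + \gamma \sum_{s'} T(s, a, s') \, V^{\pi}(s') \Big]

With γ = 1, as stated in the question, these are exactly the updates applied in parts 2 and 3.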

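Below is a minimal Python sketch of the procedure described in parts 2 and 3: two synchronous sweeps of policy evaluation starting from V = 0, followed by greedy policy improvement. The transition probabilities T and rewards R in this sketch are placeholder values, since the question's tables did not survive extraction; substitute the actual table entries before reading anything into the printed numbers.

# Sketch of the policy-iteration steps described in the question.
# NOTE: T and R below are PLACEHOLDER values (the original tables are not
# recoverable); replace them with the entries from the problem's tables.

states = ["A", "B"]
actions = [1, 2]

# Hypothetical model: T[(s, a, s_next)] = probability, R[(s, a)] = reward.
T = {
    ("A", 1, "A"): 0.5, ("A", 1, "B"): 0.5,
    ("A", 2, "A"): 1.0, ("A", 2, "B"): 0.0,
    ("B", 1, "A"): 0.0, ("B", 1, "B"): 1.0,
    ("B", 2, "A"): 1.0, ("B", 2, "B"): 0.0,
}
R = {("A", 1): 2.0, ("A", 2): 0.0, ("B", 1): 1.0, ("B", 2): 0.0}

gamma = 1.0  # discount factor, as stated in the question


def q_value(s, a, V):
    """Q(s, a) = R(s, a) + gamma * sum over s' of T(s, a, s') * V(s')."""
    return R[(s, a)] + gamma * sum(T[(s, a, s2)] * V[s2] for s2 in states)


# Part 2: policy evaluation, two synchronous Bellman backups under pi.
pi = {"A": 1, "B": 1}            # initial policy: action 1 in both states
V = {s: 0.0 for s in states}     # initialize V^pi(A) = V^pi(B) = 0
for _ in range(2):
    V = {s: q_value(s, pi[s], V) for s in states}

# Part 3: policy improvement, greedy with respect to the evaluated V.
pi_new = {s: max(actions, key=lambda a: q_value(s, a, V)) for s in states}

print("V after two evaluation sweeps:", V)
print("Improved policy:", pi_new)

The same loop generalizes directly: run more evaluation sweeps (or solve the linear system exactly), then alternate evaluation and improvement until the policy stops changing.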
