Problem 3. (25 points) Consider the MDP shown below. There are five states {A, B, C, D, T} in the MDP, where T is the target state, and two actions {a1, a2}. The numbers on each transition show the probability of moving to the next state and the reward of the transition, respectively. For example, if the agent takes action a1 at state A, it ends up at state B with probability 0.8 and receives a reward of -10, and with probability 0.2 it moves to state C and receives a reward of -10.
a) For a policy which always takes action a1 at every state, write down the Bellman recursive value function for each state, i.e., v(A), v(B), v(C), v(D), v(T), and compute the final state values when γ = 1/2.
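As a reference for the form being asked for, the standard Bellman evaluation equation for the fixed policy that always takes a1 can be sketched as follows (P and R denote the transition probabilities and rewards from the diagram, which is not reproduced in text):

    v(s) = Σ_{s'} P(s' | s, a1) [ R(s, a1, s') + γ v(s') ]

Using the transitions the problem statement gives for state A, this instantiates to

    v(A) = 0.8 (-10 + γ v(B)) + 0.2 (-10 + γ v(C)) = -10 + γ (0.8 v(B) + 0.2 v(C)),

with γ = 1/2; the equations for B, C, D, and T follow the same pattern from their own outgoing transitions.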
b) Consider a random policy which uniformly selects actions at each state (the probability of taking each of the two actions under this policy is 1/2). Apply one iteration of the Value Iteration algorithm (one-step policy evaluation followed by policy greedification) on this MDP with γ = 1 and show the new improved policy.
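For reference, one such iteration can be sketched in standard notation (again assuming P and R are read off the diagram): the one-step evaluation under the uniform random policy is

    v_new(s) = Σ_{a ∈ {a1, a2}} (1/2) Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ v(s') ],

and the greedified policy is

    π'(s) = argmax_{a ∈ {a1, a2}} Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ v_new(s') ],

with γ = 1.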
c) Consider the following episode generated by an arbitrary policy π. Assume the current state values are: v(A) = 0, v(B) = 5, v(C) = 2, v(D) = 10, and v(T) = 0.
Please, i) write down the Temporal Difference (TD) evaluation equation for updating the state values, and ii) compute the final values after processing the episode shown in the figure with α = γ = 1/2.
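For reference, the standard tabular TD(0) update applied after each observed transition (s_t, r_{t+1}, s_{t+1}) is

    v(s_t) ← v(s_t) + α [ r_{t+1} + γ v(s_{t+1}) - v(s_t) ].

A minimal Python sketch of processing an episode this way is given below. Since the episode figure is not reproduced in text, the transition list used here is a hypothetical placeholder; the initial values and the settings α = γ = 1/2 are the ones stated above.

    # Tabular TD(0) evaluation sketch.
    alpha = 0.5   # learning rate, alpha = 1/2 as in the problem
    gamma = 0.5   # discount factor, gamma = 1/2 as in the problem

    # Initial state values given in part (c).
    v = {"A": 0.0, "B": 5.0, "C": 2.0, "D": 10.0, "T": 0.0}

    # Hypothetical placeholder episode: replace with the (state, reward, next_state)
    # transitions shown in the problem's figure.
    episode = [("A", -10.0, "B"), ("B", -1.0, "D"), ("D", 10.0, "T")]

    for s, r, s_next in episode:
        td_target = r + gamma * v[s_next]      # bootstrap target
        v[s] += alpha * (td_target - v[s])     # TD(0) update

    print(v)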