Question:

Exercise 3. Consider a discounted dynamic programming problem with state space S = {0, 1}, where the set of admissible actions at any state x ∈ S is A(x) = {1, 2}. The cost function C(x, a) is given by C(0, 1) = 1, C(1, 1) = 2, C(0, 2) = 0, C(1, 2) = 2. The transition probabilities p(y|x, a) are fully determined by p(0|0, 1) = 1/2, p(0|0, 2) = 1/4, p(0|1, 1) = 2/3, p(0|1, 2) = 1/3. Let β = 1/2.

(a) Starting with W(x) = W^(0)(x) = 0 for all x ∈ S, use the value iteration algorithm to approximate the value function W_β by W^(3) := T_β^3 W. What is the stationary policy obtained as the minimiser of T_β W^(3)? Determine, with justification, whether it is an optimal policy. [40 marks]

(b) Now let f be the stationary policy that chooses action 1 in both states 0 and 1. Apply the policy iteration algorithm with the initial policy g^(0) = f to generate policies until you reach an optimal stationary policy. [10 marks]

(Hint: adjust the dynamic programming operator T_β, as well as the value iteration and policy iteration algorithms, accordingly, since we are dealing with a minimisation problem here.)
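The computation in both parts is mechanical, so a short numerical sketch can serve as a check on the hand calculation. Below is a minimal sketch, assuming a Python environment with numpy; the model data (states, actions, costs, transition probabilities, β) are taken directly from the statement, while helper names such as bellman_backup are ours and not part of the exercise.

```python
# Minimal sketch (not an official solution): value iteration and policy
# iteration for the two-state minimisation problem above.
import numpy as np

states = [0, 1]
actions = [1, 2]
beta = 0.5

# Cost C(x, a) from the exercise statement.
C = {(0, 1): 1.0, (1, 1): 2.0, (0, 2): 0.0, (1, 2): 2.0}

# Transition probabilities p(y | x, a); p(1|x,a) = 1 - p(0|x,a).
p0 = {(0, 1): 1/2, (0, 2): 1/4, (1, 1): 2/3, (1, 2): 1/3}
def p(y, x, a):
    return p0[(x, a)] if y == 0 else 1.0 - p0[(x, a)]

def bellman_backup(W):
    """One application of the minimisation operator T_beta.

    Returns the new value vector and the greedy (minimising) policy;
    ties break toward the smaller action number."""
    new_W = np.zeros(2)
    policy = {}
    for x in states:
        q = {a: C[(x, a)] + beta * sum(p(y, x, a) * W[y] for y in states)
             for a in actions}
        best = min(q, key=q.get)
        new_W[x], policy[x] = q[best], best
    return new_W, policy

# (a) Value iteration: W^(3) = T_beta^3 W^(0), starting from W^(0) = 0.
W = np.zeros(2)
for n in range(3):
    W, _ = bellman_backup(W)
    print(f"W^({n+1}) =", W)
_, greedy = bellman_backup(W)   # stationary policy minimising T_beta W^(3)
print("greedy policy at W^(3):", greedy)

# (b) Policy iteration starting from f(x) = 1 for all x.
g = {0: 1, 1: 1}
while True:
    # Policy evaluation: solve (I - beta * P_g) W = C_g exactly.
    P_g = np.array([[p(y, x, g[x]) for y in states] for x in states])
    C_g = np.array([C[(x, g[x])] for x in states])
    W_g = np.linalg.solve(np.eye(2) - beta * P_g, C_g)
    # Policy improvement: greedy minimiser of T_beta W_g.
    _, g_new = bellman_backup(W_g)
    if g_new == g:              # policy stable, hence optimal
        print("optimal policy:", g, "with value", W_g)
        break
    g = g_new
```

Running the sketch prints W^(1), W^(2), W^(3) and the greedy policy for part (a), and the policy-iteration sequence for part (b). The justification of optimality asked for in (a) still has to be argued by hand, e.g. via the standard contraction bound ||W^(n) − W_β|| ≤ β^n/(1 − β) · ||W^(1) − W^(0)||.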
