Question: The aim of this problem is to program value iteration and policy iteration for Markov decision processes in Python. Consider this MDP example (γ = 0.9):

[MDP diagram: states Poor & Unknown (+0), Poor & Famous (+0), Rich & Unknown (+10), Rich & Famous (+10); actions A (Advertising) and S (Saving money); transition probabilities of 1 or 1/2.]

You own a company. In every state you must choose between Saving money or Advertising.

Write a program in Python to implement value iteration and policy iteration specifically for this simple MDP example. Start by creating a simple MDP class (class MDP). This class should include the following members:

- A constructor def __init__(self, T, R, discount) with the following parameters:
  T -- transition function: |A| x |S| x |S'| array
  R -- reward function: |A| x |S| array
  discount -- discount factor γ: scalar in [0, 1)
  The constructor should verify that the inputs are valid (using the assert command) and set the corresponding variables in an MDP object.

- A procedure for value iteration, def valueIteration(self, initialV, nIterations, tolerance). Set nIterations and tolerance to np.inf and 0.01 as default values, respectively.
  initialV -- initial value function: array of |S| entries
  nIterations -- limit on the number of iterations: scalar (default: infinity)
  tolerance -- threshold on ||Vn - Vn+1||∞ that will be compared to a variable epsilon (initialized to np.inf): scalar (default: 0.01)
  This procedure should return a new value function:
  newV -- new value function: array of |S| entries
  iteration -- the number of iterations performed: scalar
  epsilon -- ||Vn - Vn+1||∞: scalar

- A procedure to extract a policy from a value function, def extractPolicy(self, V).
  V -- value function: array of |S| entries
  This procedure should return a policy:
  policy -- policy: array of |S| entries

- A procedure to evaluate a policy by solving a system of linear equations, def evaluatePolicy(self, policy).
  policy -- policy: array of |S| entries
  This procedure should return a value function:
  V -- value function: array of |S| entries

- A procedure for policy iteration, def policyIteration(self, initialPolicy, nIterations). Set nIterations to np.inf as a default value.
  initialPolicy -- initial policy: array of |S| entries
  nIterations -- limit on the number of iterations: scalar (default: infinity)
  This procedure should return a new policy:
  newPolicy -- new policy: array of |S| entries
  iteration -- the number of iterations performed: scalar

- A procedure for partial policy evaluation, def evaluatePolicyPartially(self, policy, initialV, nIterations, tolerance). Set nIterations and tolerance to np.inf and 0.01 as default values, respectively.
  policy -- policy: array of |S| entries
  initialV -- initial value function: array of |S| entries
  nIterations -- limit on the number of iterations: scalar (default: infinity)
  tolerance -- threshold on ||Vn - Vn+1||∞ that will be compared to a variable epsilon (initialized to np.inf): scalar (default: 0.01)
  This procedure should return a new value function:
  newV -- new value function: array of |S| entries
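A minimal sketch of such a class, assuming NumPy is used throughout, is shown below. Only the parameter names come from the problem statement; the attribute names nStates and nActions, the iterId counter, and the extra return values are my own choices, and the method bodies are one possible implementation rather than a reference solution.

import numpy as np

class MDP:
    def __init__(self, T, R, discount):
        # T -- transition function: |A| x |S| x |S'| array
        assert T.ndim == 3, "T must be a 3-dimensional array"
        self.nActions, self.nStates = T.shape[0], T.shape[1]
        assert T.shape == (self.nActions, self.nStates, self.nStates), "T must be |A| x |S| x |S'|"
        assert np.allclose(T.sum(axis=2), 1), "each row of T must sum to 1"
        # R -- reward function: |A| x |S| array
        assert R.shape == (self.nActions, self.nStates), "R must be |A| x |S|"
        # discount -- scalar in [0, 1)
        assert 0 <= discount < 1, "discount must lie in [0, 1)"
        self.T, self.R, self.discount = T, R, discount

    def valueIteration(self, initialV, nIterations=np.inf, tolerance=0.01):
        # Repeated Bellman optimality backups:
        # V(s) <- max_a [ R(a,s) + discount * sum_s' T(a,s,s') V(s') ]
        V, iterId, epsilon = initialV.copy(), 0, np.inf
        while iterId < nIterations and epsilon > tolerance:
            newV = np.max(self.R + self.discount * (self.T @ V), axis=0)
            epsilon = np.max(np.abs(newV - V))  # ||Vn - Vn+1||_inf
            V = newV
            iterId += 1
        return V, iterId, epsilon

    def extractPolicy(self, V):
        # Greedy policy with respect to V
        return np.argmax(self.R + self.discount * (self.T @ V), axis=0)

    def evaluatePolicy(self, policy):
        # Solve the linear system (I - discount * T^pi) V = R^pi exactly
        Rpi = self.R[policy, np.arange(self.nStates)]
        Tpi = self.T[policy, np.arange(self.nStates), :]
        return np.linalg.solve(np.eye(self.nStates) - self.discount * Tpi, Rpi)

    def policyIteration(self, initialPolicy, nIterations=np.inf):
        # Alternate exact policy evaluation and greedy improvement
        # (the value function is returned as well for convenience)
        policy, iterId = initialPolicy.copy(), 0
        while iterId < nIterations:
            V = self.evaluatePolicy(policy)
            newPolicy = self.extractPolicy(V)
            iterId += 1
            if np.array_equal(newPolicy, policy):
                break
            policy = newPolicy
        return policy, V, iterId

    def evaluatePolicyPartially(self, policy, initialV, nIterations=np.inf, tolerance=0.01):
        # Iterative policy evaluation:
        # V(s) <- R(pi(s),s) + discount * sum_s' T(pi(s),s,s') V(s')
        # (iteration count and epsilon are returned alongside newV for convenience)
        Rpi = self.R[policy, np.arange(self.nStates)]
        Tpi = self.T[policy, np.arange(self.nStates), :]
        V, iterId, epsilon = initialV.copy(), 0, np.inf
        while iterId < nIterations and epsilon > tolerance:
            newV = Rpi + self.discount * (Tpi @ V)
            epsilon = np.max(np.abs(newV - V))
            V = newV
            iterId += 1
        return V, iterId, epsilon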
After defining your MDP class with all its members, you should instantiate an MDP object to construct the simple MDP described in the given network:

mdp = MDP(T, R, discount)

T -- transition function: |A| x |S| x |S'| array
R -- reward function: |A| x |S| array
discount -- discount factor: scalar in [0, 1)

Finally, you should test each procedure on the given example and report your findings. You can verify that your code is running properly by adding print statements to check that the output of each function makes sense and matches the results reported in the lecture "Markov Decision Processes" slides for value iteration and in the lecture "Policy Iteration" slides for policy iteration. Report your findings in tables like those in the aforementioned slides. You should also do the following:

- Report the policy, value function, and the number of iterations needed by value iteration when using a tolerance of 0.01 and starting from a value function set to 0 for all states.
- Report the policy, value function, and the number of iterations needed by policy iteration to find an optimal policy when starting from the policy that chooses action 0 in all states. Note: action 0 corresponds to "A: Advertising" whereas action 1 corresponds to "S: Saving money".
- Report the number of iterations needed by modified policy iteration to converge when varying the number of iterations in partial policy evaluation from 1 to 10. Use a tolerance of 0.01, start with the policy that chooses action 0 in all states, and start with the value function that assigns 0 to all states. Discuss the impact of the number of iterations in partial policy evaluation on the results and relate the results to value iteration and policy iteration.
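A rough test script along these lines could look like the following. The state ordering [Poor & Unknown, Poor & Famous, Rich & Unknown, Rich & Famous] and the numeric entries of T are my reading of the transition diagram (only the 1/2 probabilities and the +0/+10 rewards are visible in the text), so verify them against the figure and the lecture slides before trusting the printed numbers.

import numpy as np

# State order assumed: [Poor&Unknown, Poor&Famous, Rich&Unknown, Rich&Famous].
# Action 0 = "A: Advertising", action 1 = "S: Saving money" (as stated above).
# The entries of T below are my reading of the diagram -- check them against the figure.
T = np.array([[[0.5, 0.5, 0.0, 0.0],    # Advertise from Poor&Unknown
               [0.0, 1.0, 0.0, 0.0],    # Advertise from Poor&Famous
               [0.5, 0.5, 0.0, 0.0],    # Advertise from Rich&Unknown
               [0.0, 1.0, 0.0, 0.0]],   # Advertise from Rich&Famous
              [[1.0, 0.0, 0.0, 0.0],    # Save from Poor&Unknown
               [0.5, 0.0, 0.0, 0.5],    # Save from Poor&Famous
               [0.5, 0.0, 0.5, 0.0],    # Save from Rich&Unknown
               [0.0, 0.0, 0.5, 0.5]]])  # Save from Rich&Famous
R = np.array([[0., 0., 10., 10.],       # rewards under Advertise
              [0., 0., 10., 10.]])      # rewards under Save
mdp = MDP(T, R, discount=0.9)

# Value iteration: tolerance 0.01, starting from V = 0 in every state
V, nIter, eps = mdp.valueIteration(initialV=np.zeros(mdp.nStates))
print("value iteration:", V, "policy:", mdp.extractPolicy(V), "iterations:", nIter)

# Policy iteration: starting from the policy that chooses action 0 everywhere
policy, Vpi, nIter = mdp.policyIteration(np.zeros(mdp.nStates, dtype=int))
print("policy iteration:", policy, "value:", Vpi, "iterations:", nIter)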

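Modified policy iteration is not one of the prescribed class members, so the helper below, a hypothetical modifiedPolicyIteration function reusing the mdp object built above, is only one way to assemble it from evaluatePolicyPartially and extractPolicy; convergence is measured here by how much a full Bellman backup still changes the value function, mirroring the stopping rule of value iteration.

import numpy as np

def modifiedPolicyIteration(mdp, policy, V, nEvalIterations, tolerance=0.01):
    # Alternate nEvalIterations steps of partial policy evaluation with one
    # greedy improvement step; stop once a Bellman optimality backup changes
    # the value function by less than the tolerance.
    iterId, epsilon = 0, np.inf
    while epsilon > tolerance:
        # tolerance=0 forces evaluatePolicyPartially to run all nEvalIterations backups
        V, _, _ = mdp.evaluatePolicyPartially(policy, V, nIterations=nEvalIterations, tolerance=0)
        policy = mdp.extractPolicy(V)
        newV = np.max(mdp.R + mdp.discount * (mdp.T @ V), axis=0)
        epsilon = np.max(np.abs(newV - V))
        V = newV
        iterId += 1
    return policy, V, iterId

# Sweep the number of partial-evaluation iterations from 1 to 10,
# starting from action 0 everywhere and V = 0 in every state.
for k in range(1, 11):
    policy, V, nIter = modifiedPolicyIteration(mdp, np.zeros(mdp.nStates, dtype=int), np.zeros(mdp.nStates), k)
    print(f"nEvalIterations={k}: {nIter} outer iterations, policy {policy}")

With a single partial-evaluation iteration the scheme behaves essentially like value iteration, and as the number of inner iterations grows it approaches policy iteration; this is the relationship the last reporting item asks you to discuss.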