Question: CSC 792: Topics Applied Reinforcement Learning, Assignment 1 (Due Date: 2/23/2023, 11:59 pm)

The aim of this assignment is to program value iteration, policy iteration, and modified policy iteration for Markov decision processes in Python.


Write a program in Python that implements value iteration, policy iteration, and modified policy iteration for this simple MDP example. Start by creating a simple MDP class, class MDP. This class should include the following members: a constructor for the MDP class, def __init__(), with the parameters self, T, R, discount.

o T -- transition function: |A| x |S| x |S'| array
o R -- reward function: |A| x |S| array
o discount -- discount factor: scalar in [0, 1)

The constructor should verify that the inputs are valid (using the assert command) and set the corresponding variables in an MDP object.

a procedure for value iteration, def valueIteration(), with the parameters self, initialV, nIterations, tolerance. Set nIterations and tolerance to np.inf and 0.01 as default values, respectively.

o initialV -- initial value function: array of |S| entries
o nIterations -- limit on the number of iterations: scalar (default: infinity)
o tolerance -- threshold on ||V_{n+1} - V_n||_inf that will be compared to a variable epsilon (initialized to np.inf): scalar (default: 0.01)

This procedure should return a new value function V.

o newV -- new value function: array of |S| entries
o iteration -- the number of iterations performed: scalar
o epsilon -- ||V_{n+1} - V_n||_inf: scalar

a procedure to extract a policy from a value function, def extractPolicy(), with the parameters self, V.

o V -- value function: array of |S| entries

This procedure should return a policy.

o policy -- policy: array of |S| entries

a procedure to evaluate a policy by solving a system of linear equations, def evaluatePolicy(), with the parameters self, policy.

o policy -- policy: array of |S| entries

This procedure should return a value function V.

o V -- value function: array of |S| entries

a procedure for policy iteration, def policyIteration(), with the parameters self, initialPolicy, nIterations. Set nIterations to np.inf as a default value.

o initialPolicy -- initial policy: array of |S| entries
o nIterations -- limit on the number of iterations: scalar (default: infinity)

This procedure should return a new policy.

o newPolicy -- new policy: array of |S| entries
o iteration -- the number of iterations performed: scalar
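A minimal sketch of what these members could look like, assuming NumPy and the array shapes above. The T and R used at the bottom are made-up placeholder numbers for smoke-testing the class, not the MDP from the lecture network:

```python
import numpy as np

class MDP:
    def __init__(self, T, R, discount):
        # Validate inputs with assert before storing them on the object.
        assert T.ndim == 3, "T must be an |A| x |S| x |S'| array"
        assert T.shape[1] == T.shape[2], "T must be square in the state dims"
        assert np.allclose(T.sum(axis=2), 1.0), "each row of T must sum to 1"
        assert R.shape == T.shape[:2], "R must be an |A| x |S| array"
        assert 0 <= discount < 1, "discount must lie in [0, 1)"
        self.T, self.R, self.discount = T, R, discount
        self.nActions, self.nStates = R.shape

    def valueIteration(self, initialV, nIterations=np.inf, tolerance=0.01):
        V, iteration, epsilon = initialV, 0, np.inf
        while iteration < nIterations and epsilon > tolerance:
            # Bellman optimality backup: max over actions of R + gamma * T V.
            newV = np.max(self.R + self.discount * (self.T @ V), axis=0)
            epsilon = np.max(np.abs(newV - V))
            V, iteration = newV, iteration + 1
        return V, iteration, epsilon

    def extractPolicy(self, V):
        # Greedy policy with respect to V.
        return np.argmax(self.R + self.discount * (self.T @ V), axis=0)

    def evaluatePolicy(self, policy):
        # Solve (I - gamma * T_pi) V = R_pi exactly.
        idx = np.arange(self.nStates)
        Tpi, Rpi = self.T[policy, idx, :], self.R[policy, idx]
        return np.linalg.solve(np.eye(self.nStates) - self.discount * Tpi, Rpi)

    def policyIteration(self, initialPolicy, nIterations=np.inf):
        policy, iteration = initialPolicy, 0
        while iteration < nIterations:
            V = self.evaluatePolicy(policy)       # exact evaluation
            newPolicy = self.extractPolicy(V)     # greedy improvement
            iteration += 1
            if np.array_equal(newPolicy, policy): # converged
                break
            policy = newPolicy
        return policy, iteration

# Smoke test on a made-up 2-state, 2-action MDP (placeholder numbers only).
T = np.array([[[0.5, 0.5], [0.8, 0.2]],
              [[1.0, 0.0], [0.1, 0.9]]])
R = np.array([[0.0, 1.0], [0.5, 2.0]])
mdp = MDP(T, R, 0.9)
V, nIter, eps = mdp.valueIteration(np.zeros(2))
policy = mdp.extractPolicy(V)
```

With a tolerance of 0.01, the value function returned by valueIteration is within discount * tolerance / (1 - discount) of the optimal one, so the greedy policy extracted from it should agree with the one found by policyIteration on this small example.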

a procedure for partial policy evaluation, def evaluatePolicyPartially(), with the parameters self, policy, initialV, nIterations, tolerance. Set nIterations and tolerance to np.inf and 0.01 as default values, respectively.

o policy -- policy: array of |S| entries
o initialV -- initial value function: array of |S| entries
o nIterations -- limit on the number of iterations: scalar (default: infinity)
o tolerance -- threshold on ||V_{n+1} - V_n||_inf that will be compared to a variable epsilon (initialized to np.inf): scalar (default: 0.01)

This procedure should return a new value function V.

o newV -- new value function: array of |S| entries

a procedure for modified policy iteration, def modifiedPolicyIteration(), with the parameters self, initialPolicy, initialV, nEvalIterations, nIterations, tolerance. Set nEvalIterations, nIterations, and tolerance to 5, np.inf, and 0.01 as default values, respectively.

o initialPolicy -- initial policy: array of |S| entries
o initialV -- initial value function: array of |S| entries
o nEvalIterations -- limit on the number of iterations to be performed in each partial policy evaluation: scalar (default: 5)
o nIterations -- limit on the number of iterations to be performed in modified policy iteration: scalar (default: infinity)
o tolerance -- threshold on ||V_{n+1} - V_n||_inf that will be compared to a variable epsilon (initialized to np.inf): scalar (default: 0.01)

This procedure should return a policy.

o policy -- policy: array of |S| entries
o iteration -- the number of iterations performed: scalar
o epsilon -- ||V_{n+1} - V_n||_inf: scalar

After defining your MDP class with all its members, instantiate an MDP object to construct the simple MDP described in the given network: mdp = MDP(T, R, discount)

o T -- transition function: |A| x |S| x |S'| array
o R -- reward function: |A| x |S| array
o discount -- discount factor: scalar in [0, 1)

Finally, test each procedure on the given example and report your findings.
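These last two procedures could be sketched as standalone functions over (T, R, discount) arrays; in the assignment they would be methods of the MDP class. The arrays in the quick check at the bottom are made-up placeholders, not the lecture's network:

```python
import numpy as np

def evaluatePolicyPartially(T, R, discount, policy, initialV,
                            nIterations=np.inf, tolerance=0.01):
    # Repeated Bellman backups for a fixed policy, instead of a linear solve.
    nStates = R.shape[1]
    idx = np.arange(nStates)
    Tpi, Rpi = T[policy, idx, :], R[policy, idx]
    V, iteration, epsilon = initialV, 0, np.inf
    while iteration < nIterations and epsilon > tolerance:
        newV = Rpi + discount * (Tpi @ V)
        epsilon = np.max(np.abs(newV - V))
        V, iteration = newV, iteration + 1
    return V, iteration, epsilon

def modifiedPolicyIteration(T, R, discount, initialPolicy, initialV,
                            nEvalIterations=5, nIterations=np.inf,
                            tolerance=0.01):
    policy, V, iteration, epsilon = initialPolicy, initialV, 0, np.inf
    while iteration < nIterations and epsilon > tolerance:
        # Partial evaluation of the current policy, then a greedy improvement.
        V, _, _ = evaluatePolicyPartially(T, R, discount, policy, V,
                                          nIterations=nEvalIterations)
        Q = R + discount * (T @ V)
        policy, newV = np.argmax(Q, axis=0), np.max(Q, axis=0)
        epsilon = np.max(np.abs(newV - V))
        V, iteration = newV, iteration + 1
    return policy, iteration, epsilon

# Quick check on made-up placeholder arrays (not the lecture's MDP).
T = np.array([[[0.5, 0.5], [0.8, 0.2]],
              [[1.0, 0.0], [0.1, 0.9]]])
R = np.array([[0.0, 1.0], [0.5, 2.0]])
policy, iteration, epsilon = modifiedPolicyIteration(
    T, R, 0.9, np.zeros(2, dtype=int), np.zeros(2))
```

With nEvalIterations = 1, modified policy iteration behaves like value iteration; as nEvalIterations grows, each outer step approaches the exact policy evaluation of policy iteration.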
You can verify that your code is running properly by adding print statements and checking that the output of each function makes sense and matches the results reported on slide 14 of the Markov Decision Processes lecture for value iteration and on slide 5 of the Policy Iteration lecture for policy iteration. Report your findings in tables like those in the aforementioned slides. You should also do the following:

Report the policy, value function, and the number of iterations needed by value iteration when using a tolerance of 0.01 and starting from a value function set to 0 for all states.

Report the policy, value function, and the number of iterations needed by policy iteration to find an optimal policy when starting from the policy that chooses action 0 in all states. Note: action 0 corresponds to A (Advertising), whereas action 1 corresponds to S (Saving money).

Report the number of iterations needed by modified policy iteration to converge when varying the number of iterations in partial policy evaluation from 1 to 10. Use a tolerance of 0.01, start with the policy that chooses action 0 in all states, and start with the value function that assigns 0 to all states. Discuss the impact of the number of iterations in partial policy evaluation on the results and relate the results to value iteration and policy iteration.

Upload your code along with a report of your findings.
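The last experiment could be driven by a loop like the one below. To keep the snippet self-contained it inlines a compact modified policy iteration and uses made-up placeholder T and R; the real run would use the lecture's network and the MDP class methods instead:

```python
import numpy as np

# Placeholder arrays for illustration only (not the lecture's MDP).
T = np.array([[[0.5, 0.5], [0.8, 0.2]],
              [[1.0, 0.0], [0.1, 0.9]]])
R = np.array([[0.0, 1.0], [0.5, 2.0]])
discount, nStates = 0.9, 2
idx = np.arange(nStates)

def mpi(nEvalIterations, tolerance=0.01):
    # Compact modified policy iteration starting from policy 0 and V = 0.
    policy, V = np.zeros(nStates, dtype=int), np.zeros(nStates)
    iteration, epsilon = 0, np.inf
    while epsilon > tolerance:
        for _ in range(nEvalIterations):      # partial policy evaluation
            V = R[policy, idx] + discount * (T[policy, idx, :] @ V)
        Q = R + discount * (T @ V)            # greedy improvement
        policy, newV = np.argmax(Q, axis=0), np.max(Q, axis=0)
        epsilon = np.max(np.abs(newV - V))
        V, iteration = newV, iteration + 1
    return policy, iteration

# Vary the number of partial-evaluation iterations from 1 to 10 and tabulate.
results = []
for nEval in range(1, 11):
    policy, iteration = mpi(nEval)
    results.append((nEval, iteration, policy))
    print(f"nEvalIterations={nEval:2d}  iterations={iteration:3d}  "
          f"policy={policy}")
```

On this placeholder MDP, every setting converges to the same policy, and larger nEvalIterations values need fewer outer iterations, which is the trade-off between value iteration and policy iteration that the report should discuss.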

You own a company. In every state you must choose between Saving money and Advertising.
