Question: CSC 792: Topics Applied Reinforcement Learning, Assignment 1 (Due Date: 2/23/2023, 11:59 pm)

The aim of this assignment is to program value iteration, policy iteration, and modified policy iteration for Markov decision processes in Python.


Write a program in Python that implements value iteration, policy iteration, and modified policy iteration for this simple MDP example. Start by creating a simple MDP class, class MDP. This class should include the following members: a constructor for the MDP class, def __init__(), with the parameters self, T, R, discount.

o T -- transition function: |A| x |S| x |S'| array
o R -- reward function: |A| x |S| array
o discount -- discount factor: scalar in [0, 1)

The constructor should verify that the inputs are valid (using the assert command) and set the corresponding variables in an MDP object.

a procedure for value iteration, def valueIteration(), with the parameters self, initialV, nIterations, tolerance. Set nIterations and tolerance to np.inf and 0.01 as default values, respectively.

o initialV -- initial value function: array of |S| entries
o nIterations -- limit on the number of iterations: scalar (default: infinity)
o tolerance -- threshold on ||V_{n+1} - V_n||_inf that will be compared to a variable epsilon (initialized to np.inf): scalar (default: 0.01)

This procedure should return a new value function V.

o newV -- new value function: array of |S| entries
o iteration -- the number of iterations performed: scalar
o epsilon -- ||V_{n+1} - V_n||_inf: scalar

a procedure to extract a policy from a value function, def extractPolicy(), with the parameters self, V.

o V -- value function: array of |S| entries

This procedure should return a policy.

o policy -- policy: array of |S| entries

a procedure to evaluate a policy by solving a system of linear equations, def evaluatePolicy(), with the parameters self, policy.

o policy -- policy: array of |S| entries

This procedure should return a value function V.

o V -- value function: array of |S| entries

a procedure for policy iteration, def policyIteration(), with the parameters self, initialPolicy, nIterations. Set nIterations to np.inf as a default value.

o initialPolicy -- initial policy: array of |S| entries
o nIterations -- limit on the number of iterations: scalar (default: infinity)

This procedure should return a new policy.

o newPolicy -- new policy: array of |S| entries
o iteration -- the number of iterations performed: scalar
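A minimal sketch of what these members could look like, assuming NumPy and the array shapes above. The T and R used at the bottom are made-up placeholder numbers for smoke-testing the class, not the MDP from the lecture network:

```python
import numpy as np

class MDP:
    def __init__(self, T, R, discount):
        # Validate inputs with assert before storing them on the object.
        assert T.ndim == 3, "T must be an |A| x |S| x |S'| array"
        assert T.shape[1] == T.shape[2], "T must be square in the state dims"
        assert np.allclose(T.sum(axis=2), 1.0), "each row of T must sum to 1"
        assert R.shape == T.shape[:2], "R must be an |A| x |S| array"
        assert 0 <= discount < 1, "discount must lie in [0, 1)"
        self.T, self.R, self.discount = T, R, discount
        self.nActions, self.nStates = R.shape

    def valueIteration(self, initialV, nIterations=np.inf, tolerance=0.01):
        V, iteration, epsilon = initialV, 0, np.inf
        while iteration < nIterations and epsilon > tolerance:
            # Bellman optimality backup: max over actions of R + gamma * T V.
            newV = np.max(self.R + self.discount * (self.T @ V), axis=0)
            epsilon = np.max(np.abs(newV - V))
            V, iteration = newV, iteration + 1
        return V, iteration, epsilon

    def extractPolicy(self, V):
        # Greedy policy with respect to V.
        return np.argmax(self.R + self.discount * (self.T @ V), axis=0)

    def evaluatePolicy(self, policy):
        # Solve (I - gamma * T_pi) V = R_pi exactly.
        idx = np.arange(self.nStates)
        Tpi, Rpi = self.T[policy, idx, :], self.R[policy, idx]
        return np.linalg.solve(np.eye(self.nStates) - self.discount * Tpi, Rpi)

    def policyIteration(self, initialPolicy, nIterations=np.inf):
        policy, iteration = initialPolicy, 0
        while iteration < nIterations:
            V = self.evaluatePolicy(policy)       # exact evaluation
            newPolicy = self.extractPolicy(V)     # greedy improvement
            iteration += 1
            if np.array_equal(newPolicy, policy): # converged
                break
            policy = newPolicy
        return policy, iteration

# Smoke test on a made-up 2-state, 2-action MDP (placeholder numbers only).
T = np.array([[[0.5, 0.5], [0.8, 0.2]],
              [[1.0, 0.0], [0.1, 0.9]]])
R = np.array([[0.0, 1.0], [0.5, 2.0]])
mdp = MDP(T, R, 0.9)
V, nIter, eps = mdp.valueIteration(np.zeros(2))
policy = mdp.extractPolicy(V)
```

With a tolerance of 0.01, the value function returned by valueIteration is within discount * tolerance / (1 - discount) of the optimal one, so the greedy policy extracted from it should agree with the one found by policyIteration on this small example.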

a procedure for partial policy evaluation, def evaluatePolicyPartially(), with the parameters self, policy, initialV, nIterations, tolerance. Set nIterations and tolerance to np.inf and 0.01 as default values, respectively.

o policy -- policy: array of |S| entries
o initialV -- initial value function: array of |S| entries
o nIterations -- limit on the number of iterations: scalar (default: infinity)
o tolerance -- threshold on ||V_{n+1} - V_n||_inf that will be compared to a variable epsilon (initialized to np.inf): scalar (default: 0.01)

This procedure should return a new value function V.

o newV -- new value function: array of |S| entries

a procedure for modified policy iteration, def modifiedPolicyIteration(), with the parameters self, initialPolicy, initialV, nEvalIterations, nIterations, tolerance. Set nEvalIterations, nIterations, and tolerance to 5, np.inf, and 0.01 as default values, respectively.

o initialPolicy -- initial policy: array of |S| entries
o initialV -- initial value function: array of |S| entries
o nEvalIterations -- limit on the number of iterations to be performed in each partial policy evaluation: scalar (default: 5)
o nIterations -- limit on the number of iterations to be performed in modified policy iteration: scalar (default: infinity)
o tolerance -- threshold on ||V_{n+1} - V_n||_inf that will be compared to a variable epsilon (initialized to np.inf): scalar (default: 0.01)

This procedure should return a policy.

o policy -- policy: array of |S| entries
o iteration -- the number of iterations performed: scalar
o epsilon -- ||V_{n+1} - V_n||_inf: scalar

After defining your MDP class with all its members, instantiate an MDP object to construct the simple MDP described in the given network: mdp = MDP(T, R, discount)

o T -- transition function: |A| x |S| x |S'| array
o R -- reward function: |A| x |S| array
o discount -- discount factor: scalar in [0, 1)

Finally, test each procedure on the given example and report your findings.
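These last two procedures could be sketched as standalone functions over (T, R, discount) arrays; in the assignment they would be methods of the MDP class. The arrays in the quick check at the bottom are made-up placeholders, not the lecture's network:

```python
import numpy as np

def evaluatePolicyPartially(T, R, discount, policy, initialV,
                            nIterations=np.inf, tolerance=0.01):
    # Repeated Bellman backups for a fixed policy, instead of a linear solve.
    nStates = R.shape[1]
    idx = np.arange(nStates)
    Tpi, Rpi = T[policy, idx, :], R[policy, idx]
    V, iteration, epsilon = initialV, 0, np.inf
    while iteration < nIterations and epsilon > tolerance:
        newV = Rpi + discount * (Tpi @ V)
        epsilon = np.max(np.abs(newV - V))
        V, iteration = newV, iteration + 1
    return V, iteration, epsilon

def modifiedPolicyIteration(T, R, discount, initialPolicy, initialV,
                            nEvalIterations=5, nIterations=np.inf,
                            tolerance=0.01):
    policy, V, iteration, epsilon = initialPolicy, initialV, 0, np.inf
    while iteration < nIterations and epsilon > tolerance:
        # Partial evaluation of the current policy, then a greedy improvement.
        V, _, _ = evaluatePolicyPartially(T, R, discount, policy, V,
                                          nIterations=nEvalIterations)
        Q = R + discount * (T @ V)
        policy, newV = np.argmax(Q, axis=0), np.max(Q, axis=0)
        epsilon = np.max(np.abs(newV - V))
        V, iteration = newV, iteration + 1
    return policy, iteration, epsilon

# Quick check on made-up placeholder arrays (not the lecture's MDP).
T = np.array([[[0.5, 0.5], [0.8, 0.2]],
              [[1.0, 0.0], [0.1, 0.9]]])
R = np.array([[0.0, 1.0], [0.5, 2.0]])
policy, iteration, epsilon = modifiedPolicyIteration(
    T, R, 0.9, np.zeros(2, dtype=int), np.zeros(2))
```

With nEvalIterations = 1, modified policy iteration behaves like value iteration; as nEvalIterations grows, each outer step approaches the exact policy evaluation of policy iteration.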
You can verify that your code is running properly by adding print statements and checking that the output of each function makes sense and matches the results reported on slide 14 of the Markov Decision Processes lecture for value iteration and on slide 5 of the Policy Iteration lecture for policy iteration. Report your findings in tables like those in the aforementioned slides. You should also do the following:

Report the policy, value function, and the number of iterations needed by value iteration when using a tolerance of 0.01 and starting from a value function set to 0 for all states.

Report the policy, value function, and the number of iterations needed by policy iteration to find an optimal policy when starting from the policy that chooses action 0 in all states. Note: action 0 corresponds to A (Advertising), whereas action 1 corresponds to S (Saving money).

Report the number of iterations needed by modified policy iteration to converge when varying the number of iterations in partial policy evaluation from 1 to 10. Use a tolerance of 0.01, start with the policy that chooses action 0 in all states, and start with the value function that assigns 0 to all states. Discuss the impact of the number of iterations in partial policy evaluation on the results and relate the results to value iteration and policy iteration.

Upload your code along with a report of your findings.
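The last experiment could be driven by a loop like the one below. To keep the snippet self-contained it inlines a compact modified policy iteration and uses made-up placeholder T and R; the real run would use the lecture's network and the MDP class methods instead:

```python
import numpy as np

# Placeholder arrays for illustration only (not the lecture's MDP).
T = np.array([[[0.5, 0.5], [0.8, 0.2]],
              [[1.0, 0.0], [0.1, 0.9]]])
R = np.array([[0.0, 1.0], [0.5, 2.0]])
discount, nStates = 0.9, 2
idx = np.arange(nStates)

def mpi(nEvalIterations, tolerance=0.01):
    # Compact modified policy iteration starting from policy 0 and V = 0.
    policy, V = np.zeros(nStates, dtype=int), np.zeros(nStates)
    iteration, epsilon = 0, np.inf
    while epsilon > tolerance:
        for _ in range(nEvalIterations):      # partial policy evaluation
            V = R[policy, idx] + discount * (T[policy, idx, :] @ V)
        Q = R + discount * (T @ V)            # greedy improvement
        policy, newV = np.argmax(Q, axis=0), np.max(Q, axis=0)
        epsilon = np.max(np.abs(newV - V))
        V, iteration = newV, iteration + 1
    return policy, iteration

# Vary the number of partial-evaluation iterations from 1 to 10 and tabulate.
results = []
for nEval in range(1, 11):
    policy, iteration = mpi(nEval)
    results.append((nEval, iteration, policy))
    print(f"nEvalIterations={nEval:2d}  iterations={iteration:3d}  "
          f"policy={policy}")
```

On this placeholder MDP, every setting converges to the same policy, and larger nEvalIterations values need fewer outer iterations, which is the trade-off between value iteration and policy iteration that the report should discuss.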

You own a company. In every state you must choose between Saving money and Advertising.
