Question:
The system parameters are d, D, Smax, , 1, 2, 3, and k. Some of these parameters capture the probability distributions governing the state transitions and rewards. You may assume that these parameters are known for all the deliverables below.
Deliverables:
(10 marks) Implement iterative policy evaluation to find the value function of the greedy policy for problem 1. The greedy policy is the one that minimizes the immediate average reward. You MUST implement this code in the Python script policy_evaluation.py that is provided to you. The function MUST return the value function for the greedy policy. The only thing that you should include in report.pdf for this part is an equation that describes the greedy policy.
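The required implementation belongs in the provided policy_evaluation.py, but as a hedged sketch of what iterative policy evaluation looks like for a generic finite MDP with costs (all array names, shapes, and the cost-minimization convention here are illustrative assumptions, not the assignment's actual interface):

```python
import numpy as np

def evaluate_policy(P, c, policy, gamma=0.95, theta=1e-2, k_min=10):
    """Iterative policy evaluation for a finite MDP with costs (sketch).

    P      : (A, S, S) array, P[a, s, s'] = transition probability
    c      : (A, S) array, c[a, s] = immediate cost of action a in state s
    policy : (S,) array of action indices (the fixed policy to evaluate)
    All argument names and the cost convention are illustrative.
    """
    S = P.shape[1]
    states = np.arange(S)
    P_pi = P[policy, states]            # (S, S) transition matrix under the policy
    c_pi = c[policy, states]            # (S,) immediate costs under the policy
    V = np.zeros(S)
    k = 0
    while True:
        V_new = c_pi + gamma * (P_pi @ V)   # Bellman expectation backup
        k += 1
        # stop only after k_min sweeps AND once the sup-norm change is small
        if k >= k_min and np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new
```

The stopping rule combines the convergence threshold with the Kmin floor described in the notes below the deliverables.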
Note: I don't think that the hardcoded rule is sub-optimal. You will get 5 buffer marks (out of 100 marks for this entire course) if you manage to RIGOROUSLY prove this. I have not taught you this kind of proof; you will have to work it out yourself. Alternatively, you can disprove my claim and earn the same 5 buffer marks.
(15 marks) Implement value iteration to solve the Bellman optimality equation for problem 3. You MUST implement this code in the Python script value_iteration.py that is provided to you. The function MUST return the optimal value function and the optimal policy. You MUST NOT include anything in report.pdf for this part. It is crucial that your code can run for any value of the deferability parameter, even 0 (a value of 0 means non-deferrable demands, as in problem 1).
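Again as a hedged sketch only (not the required value_iteration.py, and with illustrative array names under a cost-minimization convention), value iteration repeatedly applies the Bellman optimality backup until the threshold-plus-Kmin stopping rule is met:

```python
import numpy as np

def value_iteration(P, c, gamma=0.95, theta=1e-2, k_min=10):
    """Value iteration for a finite MDP with costs (sketch).

    P: (A, S, S) transition probabilities, c: (A, S) immediate costs.
    Returns the (approximately) optimal value function and a greedy policy.
    """
    S = P.shape[1]
    V = np.zeros(S)
    k = 0
    while True:
        Q = c + gamma * (P @ V)      # broadcasting: (A, S, S) @ (S,) -> (A, S)
        V_new = Q.min(axis=0)        # Bellman optimality backup (costs -> min)
        k += 1
        if k >= k_min and np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new
    policy = (c + gamma * (P @ V_new)).argmin(axis=0)   # greedy w.r.t. V
    return V_new, policy
```

Note the single `P @ V` line: NumPy broadcasts the matrix-vector product over the action axis, which is the kind of vectorized Q-function computation the run-time notes below hint at.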
(15 marks) Implement policy iteration to solve the Bellman optimality equation for problem 3. You MUST implement this code in the Python script policy_iteration.py that is provided to you. The function MUST return the optimal value function and the optimal policy. You MUST NOT include anything in report.pdf for this part. It is crucial that your code can run for any value of the deferability parameter, even 0 (a value of 0 means non-deferrable demands, as in problem 1).
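A hedged sketch of the policy-iteration loop (again with illustrative names and a cost-minimization convention, not the assignment's actual interface): alternate iterative policy evaluation with greedy policy improvement until the policy stops changing.

```python
import numpy as np

def policy_iteration(P, c, gamma=0.95, theta=1e-2, k_min=10):
    """Policy iteration for a finite MDP with costs (sketch).

    P: (A, S, S) transition probabilities, c: (A, S) immediate costs.
    """
    A, S, _ = P.shape
    states = np.arange(S)
    policy = np.zeros(S, dtype=int)          # arbitrary initial policy
    while True:
        # --- policy evaluation (iterative, threshold + K_min stopping rule) ---
        P_pi, c_pi = P[policy, states], c[policy, states]
        V = np.zeros(S)
        k = 0
        while True:
            V_new = c_pi + gamma * (P_pi @ V)
            k += 1
            if k >= k_min and np.max(np.abs(V_new - V)) < theta:
                V = V_new
                break
            V = V_new
        # --- policy improvement: greedy (here, cost-minimizing) action ---
        new_policy = (c + gamma * (P @ V)).argmin(axis=0)
        if np.array_equal(new_policy, policy):
            return V, policy                 # policy is stable -> done
        policy = new_policy
```

Because the evaluation step is itself iterative with the same threshold, the returned value function should almost match the one from value iteration, which is exactly the sanity check suggested in the notes below.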
Read these points before attempting the above deliverables:
Default system parameters: The following are the default values of the involved parameters: D=5, =4, Smax=15, 1=10, 2=1, 3=0.2, k=0.3. The default value of the discount factor is 0.95. The pseudocode of iterative policy evaluation, value iteration, and policy iteration in the lecture slides used a convergence threshold (this is different from 1, 2, and 3); set this convergence threshold as 2. An additional convergence parameter is Kmin, the minimum number of iterations for iterative policy evaluation, value iteration, and the policy-evaluation step of policy iteration. Set Kmin=10. These default values MUST be used for testing your code and doing the analysis in the next section unless mentioned otherwise. NOTE: Your code must run for any parameter values, not just the default ones.
You are provided a Python script Assignment2Tools.py that contains a function prob_vector_generator(). This function can be used to generate a probability distribution, d, with a pre-specified mean and standard deviation. Instructions for using this function are included in policy_evaluation.py, value_iteration.py, and policy_iteration.py.
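The real helper lives in Assignment2Tools.py and its actual signature should be taken from the instructions there. Purely for local experimentation, a HYPOTHETICAL stand-in with the same intent (a discretized, renormalized Gaussian, which only approximately hits the requested mean and standard deviation) might look like:

```python
import numpy as np

def prob_vector_stub(n, mean, std):
    """HYPOTHETICAL stand-in for prob_vector_generator(); the provided
    helper in Assignment2Tools.py may work differently.

    Returns a length-n probability vector over support {0, ..., n-1}
    whose mean and standard deviation are approximately the requested ones.
    """
    x = np.arange(n)
    weights = np.exp(-0.5 * ((x - mean) / std) ** 2)   # Gaussian bumps
    return weights / weights.sum()                     # normalize to sum to 1
```

Any vector produced this way is a valid distribution (non-negative, sums to 1), which is all the dynamic-programming code downstream actually relies on.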
For a given set of system parameters, the optimal value functions obtained using value iteration and policy iteration should be almost identical. Otherwise, there is something wrong with your code.
While coding the Q-function, you may want to use NumPy's broadcasting operations to speed up your computation. The following are the run times for the default parameters on my office laptop:
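As an illustration of the kind of vectorization meant here (the array names and shapes are hypothetical, not the assignment's), the whole Q-table over all (action, state) pairs can be computed in one expression, with no Python loops:

```python
import numpy as np

A, S, gamma = 3, 6, 0.95
rng = np.random.default_rng(0)
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)    # each P[a, s, :] is a probability vector
c = rng.random((A, S))               # immediate costs for every (a, s)
V = rng.random(S)                    # current value-function estimate

# (A, S, S) @ (S,) broadcasts the matrix-vector product over the action
# axis, giving the expected next-state value for every (a, s) at once.
Q = c + gamma * (P @ V)              # shape (A, S)
```

One vectorized backup like this replaces a double loop over actions and states, which is where most of the run-time difference comes from.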
policy_evaluation.py produces results in about 5 seconds.
The run time of value_iteration.py and policy_iteration.py is less than 32 minutes. One of them runs around three times faster than the other (I am not going to tell you which one).
Must account for action spaces: You must take into consideration that different states may have different action spaces. This means a few things. First, while implementing value/policy iteration, the maxima/minima should be taken over the action space corresponding to each state. Second, for policy iteration, the policy can be initialized to any arbitrary action in the action space corresponding to each state.
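One common way to handle state-dependent action spaces while staying fully vectorized (the Q-table, mask, and values below are a made-up toy example) is to mask invalid (action, state) pairs with infinite cost before taking the minimum:

```python
import numpy as np

# Hypothetical Q-table (A=3 actions, S=4 states) and a validity mask:
# valid[a, s] is True iff action a is allowed in state s.
Q = np.array([[1.0, 5.0, 2.0, 9.0],
              [3.0, 1.0, 8.0, 2.0],
              [0.5, 9.0, 9.0, 1.0]])
valid = np.array([[True,  True, False, True],
                  [True,  True, True,  True],
                  [False, True, True,  False]])

Q_masked = np.where(valid, Q, np.inf)   # invalid actions can never win the min
V = Q_masked.min(axis=0)                # per-state optimal cost over valid actions
policy = Q_masked.argmin(axis=0)        # per-state minimizing valid action
init_policy = valid.argmax(axis=0)      # first valid action per state
```

The last line sketches one way to satisfy the initialization requirement for policy iteration: `argmax` on a boolean array returns the index of the first `True`, i.e. some valid action for every state.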