Question:
The system parameters are d, D, Smax, , 1, 2, 3, and k. Some of these parameters capture the probability distributions governing the state transitions and rewards. You may assume that these parameters are known for all the deliverables below.
Deliverables:
(10 marks) Implement iterative policy evaluation to find the value function of the greedy policy for problem 1. The greedy policy is the one that minimizes the immediate average reward. You MUST implement this code in the Python script policy_evaluation.py that is provided to you. The function MUST return the value function for the greedy policy. The only thing that you should include in report.pdf for this part is an equation that describes the greedy policy.
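The required implementation belongs in the provided policy_evaluation.py, but as a hedged sketch of what iterative policy evaluation looks like for a generic finite MDP with costs (all array names, shapes, and the cost-minimization convention here are illustrative assumptions, not the assignment's actual interface):

```python
import numpy as np

def evaluate_policy(P, c, policy, gamma=0.95, theta=1e-2, k_min=10):
    """Iterative policy evaluation for a finite MDP with costs (sketch).

    P      : (A, S, S) array, P[a, s, s'] = transition probability
    c      : (A, S) array, c[a, s] = immediate cost of action a in state s
    policy : (S,) array of action indices (the fixed policy to evaluate)
    All argument names and the cost convention are illustrative.
    """
    S = P.shape[1]
    states = np.arange(S)
    P_pi = P[policy, states]            # (S, S) transition matrix under the policy
    c_pi = c[policy, states]            # (S,) immediate costs under the policy
    V = np.zeros(S)
    k = 0
    while True:
        V_new = c_pi + gamma * (P_pi @ V)   # Bellman expectation backup
        k += 1
        # stop only after k_min sweeps AND once the sup-norm change is small
        if k >= k_min and np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new
```

The stopping rule combines the convergence threshold with the Kmin floor described in the notes below the deliverables.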
Note: I don't think that the hardcoded rule is sub-optimal. You will get 5 buffer marks (out of 100 marks for this entire course) if you manage to RIGOROUSLY prove this. I have not taught you this kind of proof; you will have to work it out yourself. Alternatively, you can disprove my claim and earn the same 5 buffer marks.
(15 marks) Implement value iteration to solve the Bellman optimality equation for problem 3. You MUST implement this code in the Python script value_iteration.py that is provided to you. The function MUST return the optimal value function and the optimal policy. You MUST NOT include anything in report.pdf for this part. It is crucial that your code can run for any value of the deferability parameter, even 0 (a value of 0 means non-deferrable demands, as in problem 1).
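Again as a hedged sketch only (not the required value_iteration.py, and with illustrative array names under a cost-minimization convention), value iteration repeatedly applies the Bellman optimality backup until the threshold-plus-Kmin stopping rule is met:

```python
import numpy as np

def value_iteration(P, c, gamma=0.95, theta=1e-2, k_min=10):
    """Value iteration for a finite MDP with costs (sketch).

    P: (A, S, S) transition probabilities, c: (A, S) immediate costs.
    Returns the (approximately) optimal value function and a greedy policy.
    """
    S = P.shape[1]
    V = np.zeros(S)
    k = 0
    while True:
        Q = c + gamma * (P @ V)      # broadcasting: (A, S, S) @ (S,) -> (A, S)
        V_new = Q.min(axis=0)        # Bellman optimality backup (costs -> min)
        k += 1
        if k >= k_min and np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new
    policy = (c + gamma * (P @ V_new)).argmin(axis=0)   # greedy w.r.t. V
    return V_new, policy
```

Note the single `P @ V` line: NumPy broadcasts the matrix-vector product over the action axis, which is the kind of vectorized Q-function computation the run-time notes below hint at.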
(15 marks) Implement policy iteration to solve the Bellman optimality equation for problem 3. You MUST implement this code in the Python script policy_iteration.py that is provided to you. The function MUST return the optimal value function and the optimal policy. You MUST NOT include anything in report.pdf for this part. It is crucial that your code can run for any value of the deferability parameter, even 0 (a value of 0 means non-deferrable demands, as in problem 1).
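A hedged sketch of the policy-iteration loop (again with illustrative names and a cost-minimization convention, not the assignment's actual interface): alternate iterative policy evaluation with greedy policy improvement until the policy stops changing.

```python
import numpy as np

def policy_iteration(P, c, gamma=0.95, theta=1e-2, k_min=10):
    """Policy iteration for a finite MDP with costs (sketch).

    P: (A, S, S) transition probabilities, c: (A, S) immediate costs.
    """
    A, S, _ = P.shape
    states = np.arange(S)
    policy = np.zeros(S, dtype=int)          # arbitrary initial policy
    while True:
        # --- policy evaluation (iterative, threshold + K_min stopping rule) ---
        P_pi, c_pi = P[policy, states], c[policy, states]
        V = np.zeros(S)
        k = 0
        while True:
            V_new = c_pi + gamma * (P_pi @ V)
            k += 1
            if k >= k_min and np.max(np.abs(V_new - V)) < theta:
                V = V_new
                break
            V = V_new
        # --- policy improvement: greedy (here, cost-minimizing) action ---
        new_policy = (c + gamma * (P @ V)).argmin(axis=0)
        if np.array_equal(new_policy, policy):
            return V, policy                 # policy is stable -> done
        policy = new_policy
```

Because the evaluation step is itself iterative with the same threshold, the returned value function should almost match the one from value iteration, which is exactly the sanity check suggested in the notes below.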
Read these points before attempting the above deliverables:
Default system parameters: The following are the default values of the involved parameters: D=5, =4, Smax=15, 1=10, 2=1, 3=0.2, k=0.3. The default value of the discount factor is 0.95. The pseudocode of iterative policy evaluation, value iteration, and policy iteration in the lecture slides used a convergence threshold (this is different from 1, 2, and 3); set this convergence threshold as 2. An additional convergence parameter is Kmin, the minimum number of iterations for iterative policy evaluation, value iteration, and the policy-evaluation step of policy iteration. Set Kmin=10. These default values MUST be used for testing your code and doing the analysis in the next section unless mentioned otherwise. NOTE: Your code must run for any parameter values, not just the default ones.
You are provided a Python script Assignment2Tools.py that contains a function prob_vector_generator(). This function can be used to generate a probability distribution, d, with a pre-specified mean and standard deviation. Instructions for using this function are included in policy_evaluation.py, value_iteration.py, and policy_iteration.py.
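The real helper lives in Assignment2Tools.py and its actual signature should be taken from the instructions there. Purely for local experimentation, a HYPOTHETICAL stand-in with the same intent (a discretized, renormalized Gaussian, which only approximately hits the requested mean and standard deviation) might look like:

```python
import numpy as np

def prob_vector_stub(n, mean, std):
    """HYPOTHETICAL stand-in for prob_vector_generator(); the provided
    helper in Assignment2Tools.py may work differently.

    Returns a length-n probability vector over support {0, ..., n-1}
    whose mean and standard deviation are approximately the requested ones.
    """
    x = np.arange(n)
    weights = np.exp(-0.5 * ((x - mean) / std) ** 2)   # Gaussian bumps
    return weights / weights.sum()                     # normalize to sum to 1
```

Any vector produced this way is a valid distribution (non-negative, sums to 1), which is all the dynamic-programming code downstream actually relies on.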
For a given set of system parameters, the optimal value functions obtained using value iteration and policy iteration should be almost identical. Otherwise, there is something wrong with your code.
While coding the Q-function, you may want to use NumPy's broadcasting operations to speed up your computation. The following are the run times for the default parameters on my office laptop:
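As an illustration of the kind of vectorization meant here (the array names and shapes are hypothetical, not the assignment's), the whole Q-table over all (action, state) pairs can be computed in one expression, with no Python loops:

```python
import numpy as np

A, S, gamma = 3, 6, 0.95
rng = np.random.default_rng(0)
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)    # each P[a, s, :] is a probability vector
c = rng.random((A, S))               # immediate costs for every (a, s)
V = rng.random(S)                    # current value-function estimate

# (A, S, S) @ (S,) broadcasts the matrix-vector product over the action
# axis, giving the expected next-state value for every (a, s) at once.
Q = c + gamma * (P @ V)              # shape (A, S)
```

One vectorized backup like this replaces a double loop over actions and states, which is where most of the run-time difference comes from.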
policy_evaluation.py produces results in about 5 seconds.
The run time of value_iteration.py and policy_iteration.py is less than 32 minutes. One of them runs around three times faster than the other (I am not going to tell you which one).
Must account for action spaces: You must take into consideration that different states may have different action spaces. This means a few things. First, while implementing value/policy iteration, the maxima/minima should be taken over the action space corresponding to each state. Second, for policy iteration, the policy can be initialized to any arbitrary action in the action space corresponding to each state.
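One common way to handle state-dependent action spaces while staying fully vectorized (the Q-table, mask, and values below are a made-up toy example) is to mask invalid (action, state) pairs with infinite cost before taking the minimum:

```python
import numpy as np

# Hypothetical Q-table (A=3 actions, S=4 states) and a validity mask:
# valid[a, s] is True iff action a is allowed in state s.
Q = np.array([[1.0, 5.0, 2.0, 9.0],
              [3.0, 1.0, 8.0, 2.0],
              [0.5, 9.0, 9.0, 1.0]])
valid = np.array([[True,  True, False, True],
                  [True,  True, True,  True],
                  [False, True, True,  False]])

Q_masked = np.where(valid, Q, np.inf)   # invalid actions can never win the min
V = Q_masked.min(axis=0)                # per-state optimal cost over valid actions
policy = Q_masked.argmin(axis=0)        # per-state minimizing valid action
init_policy = valid.argmax(axis=0)      # first valid action per state
```

The last line sketches one way to satisfy the initialization requirement for policy iteration: `argmax` on a boolean array returns the index of the first `True`, i.e. some valid action for every state.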