Question:

Exercise 3. Consider a discounted dynamic programming problem with state space S = {0, 1}, where the set of admissible actions at any state x ∈ S is A(x) = {1, 2}. The cost function C(x, a) is given by C(0, 1) = 1, C(1, 1) = 2, C(0, 2) = 0, C(1, 2) = 2. The transition probabilities p(y|x, a) are fully determined by p(0|0, 1) = 1/2, p(0|0, 2) = 1/4, p(0|1, 1) = 2/3, p(0|1, 2) = 1/3. Let β = 1/2.

(a) Starting with W(x) = W^(0)(x) = 0 for all x ∈ S, use the value iteration algorithm to approximate the value function W_β by W^(3) := T_β^3 W. What is the stationary policy obtained as the minimiser of T_β W^(3)? Determine, with justification, whether it is an optimal policy. [40 marks]

(b) Now let f be the stationary policy that chooses action 1 in both states 0 and 1. Apply the policy iteration algorithm with the initial policy g^(0) = f to generate policies until you reach an optimal stationary policy. [10 marks]

(Hint: adjust the dynamic programming operator T_β, as well as the value iteration and policy iteration algorithms, accordingly, since we are dealing with a minimisation problem here.)
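The computation in both parts is mechanical, so a short numerical sketch can serve as a check on the hand calculation. Below is a minimal sketch, assuming a Python environment with numpy; the model data (states, actions, costs, transition probabilities, β) are taken directly from the statement, while helper names such as bellman_backup are ours and not part of the exercise.

```python
# Minimal sketch (not an official solution): value iteration and policy
# iteration for the two-state minimisation problem above.
import numpy as np

states = [0, 1]
actions = [1, 2]
beta = 0.5

# Cost C(x, a) from the exercise statement.
C = {(0, 1): 1.0, (1, 1): 2.0, (0, 2): 0.0, (1, 2): 2.0}

# Transition probabilities p(y | x, a); p(1|x,a) = 1 - p(0|x,a).
p0 = {(0, 1): 1/2, (0, 2): 1/4, (1, 1): 2/3, (1, 2): 1/3}
def p(y, x, a):
    return p0[(x, a)] if y == 0 else 1.0 - p0[(x, a)]

def bellman_backup(W):
    """One application of the minimisation operator T_beta.

    Returns the new value vector and the greedy (minimising) policy;
    ties break toward the smaller action number."""
    new_W = np.zeros(2)
    policy = {}
    for x in states:
        q = {a: C[(x, a)] + beta * sum(p(y, x, a) * W[y] for y in states)
             for a in actions}
        best = min(q, key=q.get)
        new_W[x], policy[x] = q[best], best
    return new_W, policy

# (a) Value iteration: W^(3) = T_beta^3 W^(0), starting from W^(0) = 0.
W = np.zeros(2)
for n in range(3):
    W, _ = bellman_backup(W)
    print(f"W^({n+1}) =", W)
_, greedy = bellman_backup(W)   # stationary policy minimising T_beta W^(3)
print("greedy policy at W^(3):", greedy)

# (b) Policy iteration starting from f(x) = 1 for all x.
g = {0: 1, 1: 1}
while True:
    # Policy evaluation: solve (I - beta * P_g) W = C_g exactly.
    P_g = np.array([[p(y, x, g[x]) for y in states] for x in states])
    C_g = np.array([C[(x, g[x])] for x in states])
    W_g = np.linalg.solve(np.eye(2) - beta * P_g, C_g)
    # Policy improvement: greedy minimiser of T_beta W_g.
    _, g_new = bellman_backup(W_g)
    if g_new == g:              # policy stable, hence optimal
        print("optimal policy:", g, "with value", W_g)
        break
    g = g_new
```

Running the sketch prints W^(1), W^(2), W^(3) and the greedy policy for part (a), and the policy-iteration sequence for part (b). The justification of optimality asked for in (a) still has to be argued by hand, e.g. via the standard contraction bound ||W^(n) − W_β|| ≤ β^n/(1 − β) · ||W^(1) − W^(0)||.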
