Question: Problem 1. Consider an MDP with two states S = {0, 1} and two actions A = {1, 2}, with the reward function and the transition probabilities as follows:
Since there are only two states, the remaining transition probabilities can be deduced; for example, p(1 | s, a) = 1 - p(0 | s, a).
The discount factor is γ.
(a) For the policy that chooses a fixed action in state 0 and a fixed action in state 1, find the state-value function by writing out the Bellman expectation equation and solving it explicitly (a generic form of these equations is sketched after this list).
(b) For the same policy, obtain the state-value function using iterative updates based on the Bellman expectation equation, and list the first few iteration values (see the iterative policy-evaluation sketch after this list).
(c) For the same policy, calculate the action-value function q_π (see the sketch after this list).
(d) Based on the value function, obtain an improved policy based on q_π.
(e) Obtain the optimal value function using value iteration based on the Bellman optimality equation, with all initial values set to 0 (see the value-iteration sketch after this list).
(f) Obtain the optimal policy.
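Since the numerical rewards, transition probabilities, and discount factor are not reproduced above, the following is only a generic sketch of the Bellman expectation equations asked for in part (a), written for a deterministic policy π on the two states; r(s, a) and p(s' | s, a) stand in for whatever values the problem specifies.

```latex
% Bellman expectation equations for a deterministic policy \pi in a
% two-state MDP (generic symbols; substitute the numbers given in the problem).
\begin{aligned}
v_\pi(0) &= r\bigl(0,\pi(0)\bigr)
  + \gamma\Bigl[p\bigl(0 \mid 0,\pi(0)\bigr)\,v_\pi(0) + p\bigl(1 \mid 0,\pi(0)\bigr)\,v_\pi(1)\Bigr],\\
v_\pi(1) &= r\bigl(1,\pi(1)\bigr)
  + \gamma\Bigl[p\bigl(0 \mid 1,\pi(1)\bigr)\,v_\pi(0) + p\bigl(1 \mid 1,\pi(1)\bigr)\,v_\pi(1)\Bigr].
\end{aligned}
```

These are two linear equations in the two unknowns v_π(0) and v_π(1), so plugging in the given numbers and solving by substitution (or as v_π = (I - γ P_π)^{-1} r_π) gives the explicit solution for part (a).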
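For part (b), a minimal iterative policy-evaluation sketch is shown below. The arrays P and R, the discount factor gamma, and the policy pi are placeholder values invented for illustration, since the actual numbers are not visible in the question; actions are indexed 0 and 1 here, corresponding to the problem's actions 1 and 2.

```python
import numpy as np

# Placeholder MDP -- substitute the rewards, transition probabilities and
# discount factor given in the original problem (these numbers are made up).
gamma = 0.9                                # discount factor (placeholder)
P = np.array([[[0.7, 0.3],                 # P[a, s, s']: action 0 from state 0
               [0.4, 0.6]],                # action 0 from state 1
              [[0.2, 0.8],                 # action 1 from state 0
               [0.5, 0.5]]])               # action 1 from state 1
R = np.array([[1.0, 0.0],                  # R[a, s]: reward for action 0 in states 0 and 1
              [0.0, 2.0]])                 # reward for action 1 in states 0 and 1
pi = np.array([0, 1])                      # deterministic policy: action taken in each state (placeholder)

# Iterative policy evaluation: repeatedly apply the Bellman expectation backup
#   v_{k+1}(s) = R[pi(s), s] + gamma * sum_{s'} P[pi(s), s, s'] * v_k(s')
v = np.zeros(2)
for k in range(1, 101):
    v_new = np.array([R[pi[s], s] + gamma * P[pi[s], s] @ v for s in range(2)])
    if k <= 5:                             # list the first few iteration values, as part (b) asks
        print(f"iteration {k}: v = {v_new}")
    if np.max(np.abs(v_new - v)) < 1e-10:  # stop once the values stop changing
        v = v_new
        break
    v = v_new

print("converged v_pi =", v)
```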
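For parts (c) and (d), the sketch below first solves the Bellman expectation equation explicitly as a 2x2 linear system (the same calculation part (a) asks for by hand), then forms the action-value function q_π and the greedy, improved policy. As before, gamma, P, R, and pi are made-up placeholders, not the problem's actual values.

```python
import numpy as np

# Placeholder MDP (made-up numbers; use the ones given in the problem).
gamma = 0.9
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.2, 0.8], [0.5, 0.5]]])          # P[a, s, s']
R = np.array([[1.0, 0.0], [0.0, 2.0]])            # R[a, s]
pi = np.array([0, 1])                             # deterministic policy (placeholder)

# Explicit solution of (I - gamma * P_pi) v_pi = r_pi for the two states.
P_pi = np.array([P[pi[s], s] for s in range(2)])  # 2x2 transition matrix under pi
r_pi = np.array([R[pi[s], s] for s in range(2)])  # reward vector under pi
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Part (c): q_pi(s, a) = R[a, s] + gamma * sum_{s'} P[a, s, s'] * v_pi(s')
q_pi = np.array([[R[a, s] + gamma * P[a, s] @ v_pi for a in range(2)]
                 for s in range(2)])

# Part (d): the improved policy acts greedily with respect to q_pi.
pi_improved = q_pi.argmax(axis=1)

print("v_pi =", v_pi)
print("q_pi =\n", q_pi)
print("improved policy (action index per state):", pi_improved)
```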
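For parts (e) and (f), a value-iteration sketch based on the Bellman optimality backup follows, starting from all-zero values and then reading off the greedy policy. The same caveat applies: gamma, P, and R are placeholders, not the values from the problem.

```python
import numpy as np

# Placeholder MDP (made-up numbers; use the ones given in the problem).
gamma = 0.9
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.2, 0.8], [0.5, 0.5]]])          # P[a, s, s']
R = np.array([[1.0, 0.0], [0.0, 2.0]])            # R[a, s]

# Part (e): value iteration with all initial values set to zero.
#   v_{k+1}(s) = max_a [ R[a, s] + gamma * sum_{s'} P[a, s, s'] * v_k(s') ]
v = np.zeros(2)
for k in range(1000):
    q = np.array([[R[a, s] + gamma * P[a, s] @ v for a in range(2)]
                  for s in range(2)])
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-10:
        v = v_new
        break
    v = v_new

# Part (f): the optimal policy is greedy with respect to the optimal action values.
q_star = np.array([[R[a, s] + gamma * P[a, s] @ v for a in range(2)]
                   for s in range(2)])
pi_star = q_star.argmax(axis=1)

print("optimal values v* =", v)
print("optimal policy (action index per state):", pi_star)
```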
