Question: Problem 2. Consider an MDP with two states S = {0, 1}, two actions A = {1, 2}, and the following reward function and transition probabilities:
The other transition probabilities can be deduced from those given.
The discount factor is
Exercise on model-free prediction:
(a) For the policy that chooses a fixed given action in each state, starting from the given initial state, generate one episode of triplets (S_0, A_0, R_1), (S_1, A_1, R_2), …, for the specified number of time steps.
(b) Based on the episode, use Monte Carlo policy evaluation to estimate the value function.
(c) Based on the episode, use one-step temporal-difference (TD(0)) policy evaluation to estimate the value function.
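The reward values, transition probabilities, discount factor, policy, initial state, and episode length are not visible in the question text above, so the Python sketch below uses illustrative placeholder values: GAMMA, P_TO_1, REWARD, the policy dictionary, the start state, and T are all assumptions, not the problem's actual data. It shows one way to carry out parts (a)-(c): simulate one episode of (S_t, A_t, R_{t+1}) triplets under a fixed policy, then form an every-visit Monte Carlo estimate and a TD(0) estimate of the value function from that single episode.

```python
import random

# Placeholder two-state, two-action MDP. The actual rewards, transition
# probabilities, and discount factor are not shown in the question; the
# numbers below are illustrative assumptions only.
STATES = [0, 1]
ACTIONS = [1, 2]
GAMMA = 0.9                                   # assumed discount factor

P_TO_1 = {(0, 1): 0.3, (0, 2): 0.8, (1, 1): 0.6, (1, 2): 0.9}  # assumed P(next=1 | s, a)
REWARD = {(0, 1): 1.0, (0, 2): 0.0, (1, 1): 0.0, (1, 2): 2.0}  # assumed r(s, a)

def step(s, a, rng):
    """Sample the next state and reward for taking action a in state s."""
    s_next = 1 if rng.random() < P_TO_1[(s, a)] else 0
    return s_next, REWARD[(s, a)]

# Part (a): generate one episode of triplets (S_t, A_t, R_{t+1}) under a fixed policy.
policy = {0: 1, 1: 2}                         # assumed deterministic policy

def generate_episode(start=0, T=10, seed=0):  # start state and T are assumed
    rng = random.Random(seed)
    s, episode = start, []
    for _ in range(T):
        a = policy[s]
        s_next, r = step(s, a, rng)
        episode.append((s, a, r))
        s = s_next
    return episode

# Part (b): every-visit Monte Carlo evaluation from the single (truncated) episode.
def mc_evaluate(episode):
    returns = {s: [] for s in STATES}
    G = 0.0
    for s, _, r in reversed(episode):         # accumulate discounted returns backward
        G = r + GAMMA * G
        returns[s].append(G)
    return {s: (sum(g) / len(g) if g else 0.0) for s, g in returns.items()}

# Part (c): one-step TD (TD(0)) evaluation with a fixed step size alpha.
def td0_evaluate(episode, alpha=0.1):
    V = {s: 0.0 for s in STATES}
    # The final transition is skipped because its successor state is not stored.
    for (s, _, r), (s_next, _, _) in zip(episode, episode[1:]):
        V[s] += alpha * (r + GAMMA * V[s_next] - V[s])
    return V

if __name__ == "__main__":
    ep = generate_episode()
    print("episode        :", ep)
    print("MC estimate    :", mc_evaluate(ep))
    print("TD(0) estimate :", td0_evaluate(ep))
```

Because only one truncated episode is used, the Monte Carlo returns are cut off at the episode horizon, so both estimates should be read as rough single-episode approximations of the true value function.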
Exercise on model-free control:
(a) Use the SARSA algorithm to estimate the optimal action-value function by running the algorithm given in Sutton and Barto's book (2nd edition, available online).
(b) Use the Q-learning algorithm to estimate the optimal action-value function by running the algorithm given in Sutton and Barto's book (2nd edition, available online).
You only need to simulate one episode. In both cases, you will need to choose an appropriate fixed step size, exploration probability, and number of time steps for the episode.
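For the control exercise, here is a minimal sketch of tabular SARSA and Q-learning run for a single long episode on the same placeholder MDP as above. The step size alpha = 0.1, exploration probability eps = 0.1, starting state 0, and episode length T = 1000 are assumed choices you would replace with your own; the update rules follow the standard tabular forms in Sutton and Barto (2nd edition).

```python
import random

# Same placeholder MDP as in the prediction sketch: rewards, transition
# probabilities, and discount factor are assumptions, not the problem's data.
STATES, ACTIONS, GAMMA = [0, 1], [1, 2], 0.9
P_TO_1 = {(0, 1): 0.3, (0, 2): 0.8, (1, 1): 0.6, (1, 2): 0.9}  # assumed
REWARD = {(0, 1): 1.0, (0, 2): 0.0, (1, 1): 0.0, (1, 2): 2.0}  # assumed

def step(s, a, rng):
    s_next = 1 if rng.random() < P_TO_1[(s, a)] else 0
    return s_next, REWARD[(s, a)]

def eps_greedy(Q, s, eps, rng):
    """Pick a random action with probability eps, otherwise a greedy one."""
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def sarsa(T=1000, alpha=0.1, eps=0.1, seed=0):
    """On-policy TD control: update toward the action actually taken next."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    s = 0                                     # assumed start state
    a = eps_greedy(Q, s, eps, rng)
    for _ in range(T):
        s_next, r = step(s, a, rng)
        a_next = eps_greedy(Q, s_next, eps, rng)
        Q[(s, a)] += alpha * (r + GAMMA * Q[(s_next, a_next)] - Q[(s, a)])
        s, a = s_next, a_next
    return Q

def q_learning(T=1000, alpha=0.1, eps=0.1, seed=0):
    """Off-policy TD control: update toward the greedy action's value."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    s = 0                                     # assumed start state
    for _ in range(T):
        a = eps_greedy(Q, s, eps, rng)
        s_next, r = step(s, a, rng)
        best_next = max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + GAMMA * best_next - Q[(s, a)])
        s = s_next
    return Q

if __name__ == "__main__":
    print("SARSA Q     :", sarsa())
    print("Q-learning Q:", q_learning())
```

The two functions differ only in the bootstrap target: SARSA uses the value of the ε-greedy action it actually takes next, while Q-learning uses the maximum action value in the next state.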
