Question: Consider an undiscounted MDP having three states, (1, 2, 3), with rewards 1, 2, and 0, respectively. State 3 is a terminal state. In states 1 and 2 there are two possible actions: A and B.

The transition model is as follows:

In state 1, action A moves the agent to state 2 with probability 0.8 and leaves it in state 1 with probability 0.2. Action B moves the agent to state 3 with probability 0.1 and leaves it in state 1 with probability 0.9.

In state 2, action A moves the agent to state 1 with probability 0.8 and leaves it in state 2 with probability 0.2. Action B moves the agent to state 3 with probability 0.1 and leaves it in state 2 with probability 0.9.
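One way to write the transition model down is as a nested dictionary. This is only a sketch: the names `T`, `R`, and `check_model` are illustrative and not part of the question, and the rewards are taken exactly as stated above.

```python
# Transition model from the text: T[state][action] maps next_state -> probability.
# State 3 is terminal, so it has no outgoing actions.
T = {
    1: {"A": {2: 0.8, 1: 0.2}, "B": {3: 0.1, 1: 0.9}},
    2: {"A": {1: 0.8, 2: 0.2}, "B": {3: 0.1, 2: 0.9}},
}
R = {1: 1.0, 2: 2.0, 3: 0.0}  # rewards as stated in the question


def check_model(model):
    """Return True if every (state, action) distribution sums to 1."""
    return all(
        abs(sum(dist.values()) - 1.0) < 1e-12
        for actions in model.values()
        for dist in actions.values()
    )


print(check_model(T))
```

Checking that each distribution sums to 1 is a cheap guard against transcription mistakes before any equations are solved.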

Let us apply policy iteration, determining the policy and the values of states 1 and 2 at each step.

We call the utility at state 1 u1, the utility at state 2 u2, and the utility at state 3 u3.

The whole process includes several iterations, and each iteration includes three major steps:

1. initialization
2. value determination
3. policy update

Assume that the initial policy chooses action B in both states. Let us calculate the first iteration.

First, initialization is easy, because we have already assumed that the initial policy chooses action B in both states.

After initialization, we do value determination. We have a set of three linear equations in u1, u2, and u3.
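Value determination for a fixed policy can be sketched as follows, assuming the usual undiscounted form u(s) = R(s) + sum over s' of P(s' | s, policy(s)) * u(s'), with u3 fixed at the terminal reward. The function name `evaluate_policy` and the dictionary layout are assumptions made for illustration. With u3 known, the remaining 2x2 system is solved by Cramer's rule.

```python
# T[state][action] maps next_state -> probability (state 3 is terminal).
T = {
    1: {"A": {2: 0.8, 1: 0.2}, "B": {3: 0.1, 1: 0.9}},
    2: {"A": {1: 0.8, 2: 0.2}, "B": {3: 0.1, 2: 0.9}},
}
R = {1: 1.0, 2: 2.0, 3: 0.0}  # rewards as stated in the question


def evaluate_policy(policy):
    """Solve u(s) = R(s) + sum_s' P(s'|s, policy[s]) * u(s') for s in {1, 2}.

    Returns (u1, u2, u3), or None when the 2x2 system is singular.
    """
    u3 = R[3]  # terminal state: its utility is just its reward
    p1, p2 = T[1][policy[1]], T[2][policy[2]]
    # Rearranged as a11*u1 + a12*u2 = b1 and a21*u1 + a22*u2 = b2.
    a11, a12 = 1.0 - p1.get(1, 0.0), -p1.get(2, 0.0)
    a21, a22 = -p2.get(1, 0.0), 1.0 - p2.get(2, 0.0)
    b1 = R[1] + p1.get(3, 0.0) * u3
    b2 = R[2] + p2.get(3, 0.0) * u3
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        return None  # no unique solution for this policy
    u1 = (b1 * a22 - a12 * b2) / det
    u2 = (a11 * b2 - a21 * b1) / det
    return u1, u2, u3


print([round(u, 6) for u in evaluate_policy({1: "B", 2: "B"})])
```

For the all-B policy the equations reduce to u1 = 1 + 0.9 u1 + 0.1 u3, u2 = 2 + 0.9 u2 + 0.1 u3, and u3 = 0, so with the rewards as stated the call above should give u1 = 10, u2 = 20, and u3 = 0.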

1. Find u1, u2, and u3.
2. Which action is preferred for state 1 at this iteration?
3. Which action is preferred for state 2 at this iteration?
4. Now we start the second iteration. Based on the preferred actions calculated in the previous iteration, we initialize the policy again. Then, in the value determination of this second iteration, the set of equations is updated. Solve them again, and find u1, u2, and u3.
5. Which action is preferred for state 1 at this iteration?
6. Which action is preferred for state 2 at this iteration?
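The "which action is preferred" questions correspond to the policy update step. A minimal sketch follows, assuming the preferred action is the one with the highest expected next-state utility (the reward depends only on the state here, so it does not affect the comparison). The name `improve_policy` is hypothetical.

```python
# T[state][action] maps next_state -> probability (state 3 is terminal).
T = {
    1: {"A": {2: 0.8, 1: 0.2}, "B": {3: 0.1, 1: 0.9}},
    2: {"A": {1: 0.8, 2: 0.2}, "B": {3: 0.1, 2: 0.9}},
}


def improve_policy(u):
    """For each non-terminal state, pick the action maximizing sum_s' P(s') * u[s'].

    `u` maps state -> utility. Ties go to the alphabetically first action.
    """
    policy = {}
    for s, actions in T.items():
        policy[s] = max(
            sorted(actions),
            key=lambda a: sum(p * u[s2] for s2, p in actions[a].items()),
        )
    return policy


# Example input: u1 = 10, u2 = 20, u3 = 0, the values the all-B policy
# yields under the rewards as stated in the question.
print(improve_policy({1: 10.0, 2: 20.0, 3: 0.0}))
```

With those utilities, state 1 compares 0.8 * 20 + 0.2 * 10 = 18 for A against 0.9 * 10 = 9 for B, and state 2 compares 0.8 * 10 + 0.2 * 20 = 12 for A against 0.9 * 20 = 18 for B, so the sketch selects A in state 1 and B in state 2.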

Now, what happens to policy iteration if we let the initial policy choose action A in both states?
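A quick way to probe this case is to write out the value-determination equations for the all-A policy and inspect them. The sketch below assumes the same undiscounted equations u(s) = R(s) + sum over s' of P(s' | s, A) * u(s'), with the rewards as stated. The helper name `all_a_system` is hypothetical.

```python
R = {1: 1.0, 2: 2.0, 3: 0.0}  # rewards as stated in the question


def all_a_system():
    """Coefficients of the value equations when action A is used in states 1 and 2.

    u1 = R1 + 0.2*u1 + 0.8*u2  ->  0.8*u1 - 0.8*u2 = R1
    u2 = R2 + 0.8*u1 + 0.2*u2  -> -0.8*u1 + 0.8*u2 = R2
    Returns ((a11, a12, b1), (a21, a22, b2)).
    """
    row1 = (1.0 - 0.2, -0.8, R[1])
    row2 = (-0.8, 1.0 - 0.2, R[2])
    return row1, row2


(a11, a12, b1), (a21, a22, b2) = all_a_system()
det = a11 * a22 - a12 * a21
print("singular:", abs(det) < 1e-12)
# Adding the two equations cancels the left-hand sides and leaves 0 = b1 + b2.
print("sum of right-hand sides:", b1 + b2)
```

The determinant vanishes, and adding the two equations gives 0 = 3, so the system has no solution. This is consistent with the observation that under action A the agent never reaches terminal state 3, so in the undiscounted setting the accumulated reward does not stay finite and value determination cannot be completed from this starting policy.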
