# Question: Consider an undiscounted MDP having three states 1, 2, 3

Consider an undiscounted MDP having three states, (1, 2, 3), with rewards -1, -2, 0, respectively. State 3 is a terminal state. In states 1 and 2 there are two possible actions: a and b. The transition model is as follows:

• In state 1, action a moves the agent to state 2 with probability 0.8 and makes the agent stay put with probability 0.2.

• In state 2, action a moves the agent to state 1 with probability 0.8 and makes the agent stay put with probability 0.2.

• In either state 1 or state 2, action b moves the agent to state 3 with probability 0.1 and makes the agent stay put with probability 0.9. Answer the following questions:

a. What can be determined qualitatively about the optimal policy in states 1 and 2?

b. Apply policy iteration, showing each step in full, to determine the optimal policy and the values of states 1 and 2. Assume that the initial policy has action b in both states.

c. What happens to policy iteration if the initial policy has action a in both states? Does discounting help? Does the optimal policy depend on the discount factor?
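For part (b), the policy-iteration steps can be checked mechanically. The sketch below encodes the transition model given above and alternates policy evaluation with greedy policy improvement; the names (`P`, `R`, `evaluate`, `policy_iteration`) are my own, not from the exercise. Note that, as part (c) hints, starting from action a in both states with no discounting would make policy evaluation diverge (there is no path to the terminal state), so the sketch is run from the given initial policy b.

```python
# Policy iteration for the 3-state MDP above (state 3 is terminal).
# Illustrative sketch; identifiers are assumptions, not from the text.

R = {1: -1.0, 2: -2.0, 3: 0.0}        # reward collected in each state
# Transition model: P[(state, action)] -> list of (next_state, probability)
P = {
    (1, 'a'): [(2, 0.8), (1, 0.2)],
    (1, 'b'): [(3, 0.1), (1, 0.9)],
    (2, 'a'): [(1, 0.8), (2, 0.2)],
    (2, 'b'): [(3, 0.1), (2, 0.9)],
}

def evaluate(policy, gamma=1.0, tol=1e-10):
    """Iterative policy evaluation; V(3) stays 0 because state 3 is terminal."""
    V = {1: 0.0, 2: 0.0, 3: 0.0}
    while True:
        delta = 0.0
        for s in (1, 2):
            v = R[s] + gamma * sum(p * V[s2] for s2, p in P[(s, policy[s])])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

def policy_iteration(policy, gamma=1.0):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    while True:
        V = evaluate(policy, gamma)
        stable = True
        for s in (1, 2):
            q = {a: R[s] + gamma * sum(p * V[s2] for s2, p in P[(s, a)])
                 for a in ('a', 'b')}
            best = max(q, key=q.get)
            if q[best] > q[policy[s]] + 1e-9:   # switch only on strict improvement
                policy[s] = best
                stable = False
        if stable:
            return policy, V

pi, V = policy_iteration({1: 'b', 2: 'b'})
print(pi, V)   # -> {1: 'b', 2: 'a'}, V(1) = -10, V(2) = -12.5
```

Run by hand, the first evaluation gives V(1) = -10, V(2) = -20 under policy (b, b); improvement then switches state 2 to a (Q(2, a) = -14 > -20), and a second evaluation yields V(1) = -10, V(2) = -12.5, at which point the policy is stable.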

