Problem 3. (25 points) Consider the MDP shown below. There are five states {A, B, C, D, T} in the MDP, where T is the target state, and two actions {a1, a2}. The numbers on each transition show the probability of moving to the next state and the reward of the transition, respectively. For example, if the agent takes action a1 at state A, it ends up at state B with probability 0.8 and receives a reward of -10, and with probability 0.2 it moves to state C and receives a reward of -10.
a) For a policy which always takes action a1 at every state, write down the Bellman recursive value function for each state, i.e., v(A), v(B), v(C), v(D), v(T), and compute the final state values when γ = 1/2.
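As a reference for the form being asked for, the standard Bellman evaluation equation for the fixed policy that always takes a1 can be sketched as follows (P and R denote the transition probabilities and rewards from the diagram, which is not reproduced in text):

    v(s) = Σ_{s'} P(s' | s, a1) [ R(s, a1, s') + γ v(s') ]

Using the transitions the problem statement gives for state A, this instantiates to

    v(A) = 0.8 (-10 + γ v(B)) + 0.2 (-10 + γ v(C)) = -10 + γ (0.8 v(B) + 0.2 v(C)),

with γ = 1/2; the equations for B, C, D, and T follow the same pattern from their own outgoing transitions.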
b) Consider a random policy which uniformly selects actions at each state (the probability of taking each of the two actions under this policy is 1/2). Apply one iteration of the Value Iteration algorithm (one-step policy evaluation followed by policy greedification) on this MDP with γ = 1 and show the new improved policy.
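For reference, one such iteration can be sketched in standard notation (again assuming P and R are read off the diagram): the one-step evaluation under the uniform random policy is

    v_new(s) = Σ_{a ∈ {a1, a2}} (1/2) Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ v(s') ],

and the greedified policy is

    π'(s) = argmax_{a ∈ {a1, a2}} Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ v_new(s') ],

with γ = 1.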
c) Consider the following episode generated by an arbitrary policy π. Assume the current state values are: v(A) = 0, v(B) = 5, v(C) = 2, v(D) = 10, and v(T) = 0.
Please, i) write down the Temporal Difference (TD) evaluation equation for updating the state values, and ii) compute the final values after processing the episode shown in the figure with α = γ = 1/2.
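For reference, the standard tabular TD(0) update applied after each observed transition (s_t, r_{t+1}, s_{t+1}) is

    v(s_t) ← v(s_t) + α [ r_{t+1} + γ v(s_{t+1}) - v(s_t) ].

A minimal Python sketch of processing an episode this way is given below. Since the episode figure is not reproduced in text, the transition list used here is a hypothetical placeholder; the initial values and the settings α = γ = 1/2 are the ones stated above.

    # Tabular TD(0) evaluation sketch.
    alpha = 0.5   # learning rate, alpha = 1/2 as in the problem
    gamma = 0.5   # discount factor, gamma = 1/2 as in the problem

    # Initial state values given in part (c).
    v = {"A": 0.0, "B": 5.0, "C": 2.0, "D": 10.0, "T": 0.0}

    # Hypothetical placeholder episode: replace with the (state, reward, next_state)
    # transitions shown in the problem's figure.
    episode = [("A", -10.0, "B"), ("B", -1.0, "D"), ("D", 10.0, "T")]

    for s, r, s_next in episode:
        td_target = r + gamma * v[s_next]      # bootstrap target
        v[s] += alpha * (td_target - v[s])     # TD(0) update

    print(v)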