You're trying to get to work. Home is at state 1 and work is at state...

Question:

You're trying to get to work. Home is at state 1 and work is at state N = 10. In between home and work are states [2, 9]. Naturally, you're trying to get from state 1 to state 10. At each state, you have the option to walk or to take the subway.

If you walk, you will get to the next sequential state with a probability of 1.0, and it will take you 1 minute; e.g. if at state 3, you'll get to state 4 with a probability of 1.0 and lose 1 minute.

You also have the option of taking the subway, but this is rife with complexities: taking the subway will take you 2 minutes (because it is presumably more time consuming to go underground and get to the station); taking the subway has a 0.5 probability of failing (e.g. the train isn't available), in which case you'll stay in the same state; BUT it also has a 0.5 probability of doubling your state, e.g. if you're currently at state 4, you'll jump to state 8.

Note that an action can only be taken if it results in a valid state, e.g. you can't take the subway at state 6 and beyond (because you'd end up at a non-existent state).

a) Write down the transition function T(s, a, s') and reward function R(s, a, s') for this MDP. [3 marks]

b) There are too many states (almost all of which have a non-zero reward) in this problem to feasibly carry out (manual) value iteration for a meaningful number of iterations. Rather, we'll opt to use policy iteration. Starting with the policy below, carry out one step of policy evaluation, and then a subsequent step of policy extraction using the policy values to update the policy. [9 marks]

State   π_t(s)   V^{π_t}(s)   π_{t+1}(s)
1       walk
2       walk
3       subway
4       subway
5       subway
6       walk
7       walk
8       walk
9       walk
10      -        0
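
The MDP described above can be written down compactly in code, which may help when checking a hand-worked answer. The following is only a sketch under stated assumptions, not an official solution: the question gives no discount factor, so `GAMMA = 1.0` is assumed; rewards are taken to be the negative of the minutes spent; and "one step of policy evaluation" is read as computing the exact values of the fixed starting policy. The names `actions`, `transitions`, `evaluate` and `extract` are hypothetical helpers introduced for illustration.

```python
# Sketch of the commute MDP. Assumptions (not in the question):
# GAMMA = 1.0 (no discount is given) and reward = -(minutes spent).
N = 10
GAMMA = 1.0

def actions(s):
    """Valid actions: none at the goal; subway only if 2*s is a real state."""
    if s == N:
        return []
    return ["walk", "subway"] if 2 * s <= N else ["walk"]

def transitions(s, a):
    """Return [(next_state, probability, reward)] for taking action a in s."""
    if a == "walk":
        return [(s + 1, 1.0, -1.0)]
    # Subway: stay put or double the state, 2 minutes either way.
    return [(s, 0.5, -2.0), (2 * s, 0.5, -2.0)]

def evaluate(policy):
    """Exact V^pi by back-substitution from state N (moves never go backwards)."""
    V = {N: 0.0}
    for s in range(N - 1, 0, -1):
        p_self, rest = 0.0, 0.0
        for s2, p, r in transitions(s, policy[s]):
            if s2 == s:
                # Self-loop: keep the reward, move the V(s) term to the left side.
                p_self += p
                rest += p * r
            else:
                rest += p * (r + GAMMA * V[s2])
        V[s] = rest / (1.0 - GAMMA * p_self)
    return V

def extract(V):
    """Greedy one-step lookahead policy with respect to V."""
    new_policy = {}
    for s in range(1, N):
        def q(a):
            return sum(p * (r + GAMMA * V[s2]) for s2, p, r in transitions(s, a))
        new_policy[s] = max(actions(s), key=q)
    return new_policy

# Starting policy from the table in the question.
policy = {1: "walk", 2: "walk", 3: "subway", 4: "subway", 5: "subway",
          6: "walk", 7: "walk", 8: "walk", 9: "walk"}
V = evaluate(policy)
new_policy = extract(V)
for s in range(1, N):
    print(s, policy[s], V[s], new_policy[s])
```

Each printed row corresponds to one row of the table: state, current policy, evaluated value, and extracted policy. If the intended reading of "one step" is a single Bellman backup from zero-initialised values rather than exact evaluation, `evaluate` would need to be replaced accordingly.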

Posted Date: Dec 03, 2023 08:52 AM
