Question: 3. Efficient Routing MDP

You are leading a routing and planning team at a self-driving car company and have decided to model your latest urban navigation problem as an MDP. Consider the following environment (Fig. 1). Your car must navigate along the road (gray squares) while avoiding obstacles (red squares) to reach the rider's destination (the green square). Because the road is gridlocked, your car must change lanes whenever it wishes to move forward. From any gray square, your car can either move right & up or right & down. For example, starting from state 3, your car can move to state 8 or 10. Note that it is not possible to reach the green square from every state. Actions are deterministic and always succeed unless they would cause you to run into an impassable barrier. The thick outer edge indicates an impassable barrier, and attempting to move in the direction of a barrier from a gray square results in your car moving up one square (e.g., taking any action from state 32 moves the car to state 31).

[Figure 1: (a) Grid World, a 6 x 6 grid of states numbered 1-36 column by column; (b) a successful run in Grid World.]

A successful run in Grid World is shown in Figure 1b. Taking any action from the green destination square (no. 33) earns a reward of rg and ends the episode. Taking any action from the red squares that depict obstacles (no. 1, 7, 13, ...) earns a reward of rr and ends the episode. Otherwise, from every other square, taking any action is associated with a reward of rs. Assume the discount factor γ = 0.9, rg = +5, and rr = -5 unless otherwise specified. Notice the horizon is technically infinite.

(a) Let rs ∈ {-5, -0.5, 0, 2}. Starting in square 2, for each of the possible values of rs, briefly explain what the optimal policy would be in Grid World. In each case, is the optimal policy unique, and does the optimal policy depend on the value of the discount factor γ? Explain your answer.

(b) Which values of rs ∈ {-5, -0.5, 0, 2} will yield a policy that returns the shortest path to the green square? (Hint: at least one does.) Explain which ones do, pick the minimum of this set of rewards that does, and then find the optimal value function for states 2, 13, 21, and 32.
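Both parts can be checked numerically with value iteration, i.e., by iterating the Bellman optimality backup V(s) <- max_a [ r(s, a) + γ V(s') ] until it reaches a fixed point. Below is a minimal Python sketch, not part of the original problem. The grid layout it encodes is an assumption reconstructed from the text: states numbered 1-36 column-major in a 6 x 6 grid (column 1 is states 1-6, top to bottom), red obstacles on the top row (1, 7, 13, 19, 25, 31), and the green square at 33; R_STEP (= rs) is set to 0 purely for illustration. Adjust RED, GREEN, and R_STEP if Figure 1 differs.

```python
import numpy as np

# Minimal value-iteration sketch for the routing MDP above -- a sanity check,
# not an official solution. Layout assumptions (hypothetical; adjust to match
# Figure 1): states 1-36 numbered column-major in a 6x6 grid, red obstacles on
# the top row, green destination at square 33.

N = 6                                  # grid is N x N
GREEN = 33
RED = {1, 7, 13, 19, 25, 31}           # assumed obstacle squares
GAMMA, R_GREEN, R_RED = 0.9, +5.0, -5.0
R_STEP = 0.0                           # r_s; try each of {-5, -0.5, 0, 2}

def to_rc(s):
    """State number (1-36) -> (row, col), 0-indexed, row 0 at the top."""
    return (s - 1) % N, (s - 1) // N

def to_state(r, c):
    return c * N + r + 1

def step(s, a):
    """Deterministic move: a = -1 is 'right & up', a = +1 is 'right & down'.
    A move into the outer barrier instead shifts the car up one square,
    matching the example in the problem (any action from 32 leads to 31)."""
    r, c = to_rc(s)
    nr, nc = r + a, c + 1
    if nc >= N or nr < 0 or nr >= N:   # would cross the thick outer edge
        return to_state(max(r - 1, 0), c)
    return to_state(nr, nc)

V = np.zeros(N * N + 1)                # V[1..36]; index 0 unused
while True:                            # Bellman optimality backup to a fixed point
    V_new = V.copy()
    for s in range(1, N * N + 1):
        if s == GREEN:
            V_new[s] = R_GREEN         # any action pays r_g and ends the episode
        elif s in RED:
            V_new[s] = R_RED           # any action pays r_r and ends the episode
        else:
            V_new[s] = max(R_STEP + GAMMA * V[step(s, a)] for a in (-1, +1))
    if np.max(np.abs(V_new - V)) < 1e-12:
        break
    V = V_new

for s in (2, 13, 21, 32):              # the states asked about in part (b)
    print(f"V*({s}) = {V[s]:.4f}")
```

Rerunning this with each rs in {-5, -0.5, 0, 2} and reading off the greedy policy shows how the optimal behavior changes with the step reward, which is exactly what parts (a) and (b) probe.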
