1.2 Reward Functions (20 pts)

For this problem, consider the MDP shown in Figure 1. The numbers in each square represent the reward the agent receives for entering the square. In the event the agent bumps into a wall and stays put, this also counts as entering the square, and the agent receives that square's reward. When the agent enters the terminal state, it will then, no matter what action is taken, transition to the absorbing state and receive a reward of 0 every time-step thereafter, regardless of the reward function.

[Figure 1: Gridworld MDP. The squares are marked with reward -.04; the grid contains a START square, a Terminal square, and an Absorbing State. The action diagram shows an 80% probability of taking the desired action and a 10% probability of "strafing" left (or right) instead; if you bump into a wall, you just stay where you were.]

1.3 Choosing a reward function

For the MDP in Figure 1, consider replacing the reward function with one of the following:

1. Reward 0 everywhere
2. Reward 1 everywhere
3. Reward 0 everywhere, except (0, 3) where the reward is 1
4. Reward -1 everywhere, except (0, 3) where the reward is 1
5. Reward 0 everywhere, except (0, 3) where the reward is 1000 and (3, 1) where the reward is 999
6. Reward 1 everywhere, except (3, 1) where the reward is -1

1.3.1 Behavior of Reward Functions (15 pts)

For each of the above reward functions, assume the agent has an optimal policy. Describe the behavior of an agent that follows an optimal policy for each reward function. Some things you might want to consider: Does the agent head to a terminal state? Does it try to get somewhere as fast as possible? Will it avoid terminal states? Recall that the discount factor is γ = 1 and an optimal policy is any policy that maximizes the expected discounted sum of rewards.

1. Answer:
2. Answer:
3. Answer:
4. Answer:
5. Answer:
6. Answer:

1.3.2 Creating Reward Functions (5 pts)

Designing reward functions can be tricky and needs to be done carefully so the agent doesn't learn a policy that has undesired behavior. Create a reward function that incentivizes the agent to navigate to state (3, 1) but avoids the state (2, 2). Keep in mind that (3, 0) is still a terminal state.

Answer:
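The following is a minimal Python sketch of the movement dynamics described in the problem statement (80% desired action, 10% strafe to either side, and wall bumps leave the agent in place). The grid size, the (row, column) coordinate convention, and the absence of interior walls are assumptions for illustration only, not part of the assignment.

```python
import random

# Hypothetical sketch of the gridworld dynamics; the grid size and the
# (row, col) coordinate convention are assumptions, not the assignment's spec.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
STRAFE = {"up": ("left", "right"), "down": ("left", "right"),
          "left": ("up", "down"), "right": ("up", "down")}

def step(state, action, n_rows=3, n_cols=4):
    """One stochastic step: 80% intended move, 10% strafe left, 10% strafe right.
    Bumping into a wall leaves the agent in place, which still counts as
    entering that square for reward purposes."""
    roll = random.random()
    if roll < 0.8:
        move = action
    elif roll < 0.9:
        move = STRAFE[action][0]
    else:
        move = STRAFE[action][1]
    dr, dc = ACTIONS[move]
    r, c = state[0] + dr, state[1] + dc
    if 0 <= r < n_rows and 0 <= c < n_cols:
        return (r, c)
    return state  # bumped into the boundary: stay put

# Example: take one step up from an (assumed) start square.
next_state = step((2, 0), "up")
```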
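For 1.3.1, one way to sanity-check intuitions (not required by the assignment) is to run a few value-iteration sweeps for a candidate reward function and inspect the greedy policy. The sketch below is purely illustrative: the 3x4 layout, the terminal cell at (0, 3), and the example reward function are assumptions, and because γ = 1 with strictly positive rewards everywhere can make values grow without bound, the number of sweeps is capped rather than iterating to convergence.

```python
# Hypothetical sketch: capped value iteration for a candidate reward function.
# Grid size, terminal cell, and example reward are assumptions for illustration.
N_ROWS, N_COLS = 3, 4                              # assumed grid size
TERMINAL = {(0, 3)}                                # assumed terminal cell
GAMMA = 1.0
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
STRAFES = {"up": ("left", "right"), "down": ("left", "right"),
           "left": ("up", "down"), "right": ("up", "down")}

def transition_dist(state, action):
    """Return {next_state: probability} under the 80/10/10 dynamics."""
    dist = {}
    for prob, move in [(0.8, action), (0.1, STRAFES[action][0]), (0.1, STRAFES[action][1])]:
        dr, dc = MOVES[move]
        r, c = state[0] + dr, state[1] + dc
        nxt = (r, c) if 0 <= r < N_ROWS and 0 <= c < N_COLS else state  # wall: stay put
        dist[nxt] = dist.get(nxt, 0.0) + prob
    return dist

def value_iteration(reward_of, sweeps=50):
    """Capped value-iteration sweeps; returns state values and the greedy policy."""
    states = [(r, c) for r in range(N_ROWS) for c in range(N_COLS)]
    V = {s: 0.0 for s in states}

    def q(s, a):
        # Reward is received for *entering* the next square, so it is attached to nxt.
        return sum(p * (reward_of(nxt) + GAMMA * V[nxt])
                   for nxt, p in transition_dist(s, a).items())

    for _ in range(sweeps):
        for s in states:
            if s not in TERMINAL:        # terminal -> absorbing state, future return is 0
                V[s] = max(q(s, a) for a in MOVES)
    policy = {s: max(MOVES, key=lambda a: q(s, a))
              for s in states if s not in TERMINAL}
    return V, policy

# Example: candidate reward function 3 (reward 0 everywhere except (0, 3), which gives 1).
values, policy = value_iteration(lambda s: 1.0 if s == (0, 3) else 0.0)
```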
