Question:

Problem 5 (30 marks) Re-implement in Python the results presented in Figure 6.4 of the Sutton & Barto book comparing SARSA and Q-learning in the cliff-walking task. Investigate the effect of choosing different values of the exploration parameter ε for both methods. Present your code and results. In your discussion, clearly describe the main difference between SARSA and Q-learning in relation to your findings.

Note: For this problem, use α = 0.1 and γ = 1 for both algorithms. The "smoothing" mentioned in the caption of Figure 6.4 is the result of 1) averaging over 10 runs, and 2) plotting a moving average over the last 10 episodes.

Example 6.6: Cliff Walking. This gridworld example compares Sarsa and Q-learning, highlighting the difference between on-policy (Sarsa) and off-policy (Q-learning) methods. Consider the gridworld shown in the upper part of Figure 6.4. This is a standard undiscounted, episodic task, with start and goal states, and the usual actions causing movement up, down, right, and left. Reward is -1 on all transitions except those into the region marked "The Cliff." Stepping into this region incurs a reward of -100 and sends the agent instantly back to the start.

The lower part of Figure 6.4 shows the performance of the Sarsa and Q-learning methods with ε-greedy action selection, ε = 0.1. After an initial transient, Q-learning learns values for the optimal policy, the one that travels right along the edge of the cliff. Unfortunately, this results in its occasionally falling off the cliff because of the ε-greedy action selection. Sarsa, on the other hand, takes the action selection into account and learns the longer but safer path through the upper part of the grid. Although Q-learning actually learns the values of the optimal policy, its on-line performance is worse than that of Sarsa, which learns the roundabout policy. Of course, if ε were gradually reduced, then both methods would asymptotically converge to the optimal policy.

[Figure 6.4: The cliff-walking task. Upper part: the gridworld with start S, goal G, the cliff region (R = -100) along the bottom edge, R = -1 elsewhere, and the safe and optimal paths marked. Lower part: sum of rewards during each episode for Sarsa and Q-learning over 500 episodes. The results are from a single run, but smoothed by averaging the reward sums from 10 successive episodes.]
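As a starting point, below is a minimal Python sketch of the cliff-walking environment and the two tabular agents, assuming the standard 4 x 12 grid of Example 6.6 with the start in the bottom-left corner and the goal in the bottom-right. The function names (step, eps_greedy, run) and the NumPy-based layout are illustrative choices, not taken from the book or from any provided solution.

import numpy as np

# Cliff-walking gridworld from Example 6.6: a 4 x 12 grid, start S at (3, 0),
# goal G at (3, 11). Every transition gives reward -1, except stepping into the
# cliff (row 3, columns 1-10), which gives -100 and sends the agent back to S.

ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, action):
    # Apply an action, clipping moves that would leave the grid.
    r = min(max(state[0] + ACTIONS[action][0], 0), ROWS - 1)
    c = min(max(state[1] + ACTIONS[action][1], 0), COLS - 1)
    if r == 3 and 1 <= c <= 10:                # fell into the cliff
        return START, -100.0, False
    return (r, c), -1.0, (r, c) == GOAL

def eps_greedy(Q, state, eps, rng):
    # Epsilon-greedy action selection, breaking ties between greedy actions at random.
    if rng.random() < eps:
        return rng.integers(len(ACTIONS))
    q = Q[state]
    return rng.choice(np.flatnonzero(q == q.max()))

def run(method, eps=0.1, alpha=0.1, gamma=1.0, episodes=500, seed=0):
    # Train one agent ('sarsa' or 'q') and return the sum of rewards per episode.
    rng = np.random.default_rng(seed)
    Q = np.zeros((ROWS, COLS, len(ACTIONS)))
    returns = np.zeros(episodes)
    for ep in range(episodes):
        s, done, total = START, False, 0.0
        a = eps_greedy(Q, s, eps, rng)
        while not done:
            s2, r, done = step(s, a)
            total += r
            a2 = eps_greedy(Q, s2, eps, rng)
            if method == "sarsa":
                target = 0.0 if done else Q[s2][a2]    # on-policy: value of the action actually taken next
            else:
                target = 0.0 if done else Q[s2].max()  # off-policy: value of the greedy action
            Q[s][a] += alpha * (r + gamma * target - Q[s][a])
            s, a = s2, a2
        returns[ep] = total
    return returns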

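To reproduce the smoothing described in the note (average over 10 runs, then a moving average over the last 10 episodes) and to investigate the exploration parameter, the sketch above could be wrapped as follows. The matplotlib plotting and the particular ε values (0.01, 0.1, 0.3) are illustrative assumptions, not requirements from the problem.

import matplotlib.pyplot as plt

def smoothed_returns(method, eps, runs=10, episodes=500):
    # Average the per-episode reward sums over independent runs, then apply a
    # moving average over the last 10 episodes, as described in the note above.
    per_run = [run(method, eps=eps, episodes=episodes, seed=s) for s in range(runs)]
    avg = np.mean(per_run, axis=0)
    return np.convolve(avg, np.ones(10) / 10.0, mode="valid")

for eps in (0.01, 0.1, 0.3):                   # example exploration-parameter sweep
    plt.plot(smoothed_returns("sarsa", eps), label=f"SARSA, eps={eps}")
    plt.plot(smoothed_returns("q", eps), label=f"Q-learning, eps={eps}")
plt.xlabel("Episodes")
plt.ylabel("Sum of rewards during episode (smoothed)")
plt.ylim(-100, 0)
plt.legend()
plt.show()

The defaults in run (alpha=0.1, gamma=1.0, 500 episodes) follow the problem note; the seed argument only serves to make the averaging over runs reproducible.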