Question: Problem 1 (24 marks)

Re-implement in Python the results presented in Figure 2.2 of the Sutton & Barto book comparing a greedy method with two ε-greedy methods (ε = 0.01 and ε = 0.1) on the 10-armed testbed, and present your code and results. Include a discussion of the exploration-exploitation dilemma in relation to your findings.

[Figure 2.2: Average performance of ε-greedy action-value methods on the 10-armed testbed, shown as average reward and % optimal action over 1000 steps for ε = 0 (greedy), ε = 0.01, and ε = 0.1. These data are averages over 2000 runs with different bandit problems. All methods used sample averages as their action-value estimates.]

Book excerpt included with the question: "... value of each action after trying it once. In this case the greedy method might actually perform best because it would soon find the optimal action and then never explore. But even in the deterministic case there is a large advantage to exploring if we weaken some of the other assumptions. For example, suppose the bandit task were nonstationary, that is, the true values of the actions changed over time. In this case exploration is needed even in the deterministic case to make sure one of the nongreedy actions has not changed to become better than the greedy one. As we shall see in the next few chapters, nonstationarity is the case most commonly encountered in reinforcement learning. Even if the underlying task is stationary and deterministic, the learner faces a set of bandit-like decision tasks each of which changes over time as learning proceeds and the agent's policy changes. Reinforcement learning requires a balance between exploration and exploitation."
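A minimal sketch of one way to reproduce the Figure 2.2 experiment, assuming the standard testbed setup described in the book: 2000 independent 10-armed bandit problems, 1000 steps each, true action values q*(a) drawn from a unit-variance normal distribution, unit-variance Gaussian rewards, and sample-average value estimates. The helper name run_bandit, the random seed, and the plotting layout are illustrative choices, not prescribed by the problem; the inner loops are left un-vectorized for clarity rather than speed.

```python
# Sketch of the 10-armed testbed comparison (greedy vs. epsilon-greedy),
# in the spirit of Sutton & Barto, Figure 2.2.
import numpy as np
import matplotlib.pyplot as plt

def run_bandit(epsilon, n_runs=2000, n_steps=1000, k=10, seed=0):
    """Return (average reward, fraction optimal action) per step, averaged over runs."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(n_steps)
    optimal_frac = np.zeros(n_steps)
    for _ in range(n_runs):
        q_true = rng.normal(0.0, 1.0, k)   # true action values q*(a) ~ N(0, 1)
        best_action = np.argmax(q_true)
        q_est = np.zeros(k)                # sample-average estimates Q(a)
        counts = np.zeros(k)               # N(a): times each action was taken
        for t in range(n_steps):
            if rng.random() < epsilon:     # explore: random action
                a = rng.integers(k)
            else:                          # exploit: greedy action, ties broken randomly
                a = rng.choice(np.flatnonzero(q_est == q_est.max()))
            r = rng.normal(q_true[a], 1.0)             # reward ~ N(q*(a), 1)
            counts[a] += 1
            q_est[a] += (r - q_est[a]) / counts[a]     # incremental sample average
            avg_reward[t] += r
            optimal_frac[t] += (a == best_action)
    return avg_reward / n_runs, optimal_frac / n_runs

if __name__ == "__main__":
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 8))
    for eps, label in [(0.0, "ε = 0 (greedy)"), (0.01, "ε = 0.01"), (0.1, "ε = 0.1")]:
        reward, optimal = run_bandit(eps)
        ax1.plot(reward, label=label)
        ax2.plot(100 * optimal, label=label)
    ax1.set_xlabel("Steps"); ax1.set_ylabel("Average reward"); ax1.legend()
    ax2.set_xlabel("Steps"); ax2.set_ylabel("% Optimal action"); ax2.legend()
    plt.tight_layout()
    plt.show()
```

For the discussion the problem asks for: as in the book's Figure 2.2, the greedy method tends to lock onto a suboptimal action early and plateau at a lower average reward, ε = 0.1 explores enough to find the optimal action quickly but keeps paying a small ongoing exploration cost, and ε = 0.01 improves more slowly but would eventually overtake ε = 0.1 on both measures. This is the exploration-exploitation trade-off: exploiting the current best estimate earns reward now, while exploring improves the estimates and can earn more reward later.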
