Exercise 11.9 Consider four different ways to derive the value of αk from k in Q-learning
(note that for Q-learning with varying αk, there must be a different count k for each state–action pair).
i) Let αk = 1/k.
ii) Let αk = 10/(9 + k).
iii) Let αk = 0.1.
iv) Let αk = 0.1 for the first 10,000 steps, αk = 0.01 for the next 10,000 steps,
αk = 0.001 for the next 10,000 steps, αk = 0.0001 for the next 10,000 steps, and so on.
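The four schedules can be written as functions of the per-pair count k. This is a minimal Python sketch. It assumes k starts at 1 and reads the "steps" in (iv) as that same per-pair count, which the exercise does not pin down. `partial_sums` is a hypothetical helper that accumulates Σαk and Σαk² so the growth of each series can be compared.

```python
def alpha_i(k: int) -> float:
    """Schedule (i): alpha_k = 1/k."""
    return 1.0 / k


def alpha_ii(k: int) -> float:
    """Schedule (ii): alpha_k = 10/(9 + k); starts at 1 like (i) but decays more slowly."""
    return 10.0 / (9 + k)


def alpha_iii(k: int) -> float:
    """Schedule (iii): constant alpha_k = 0.1, independent of k."""
    return 0.1


def alpha_iv(k: int, block: int = 10_000) -> float:
    """Schedule (iv): 0.1 for k = 1..10,000, 0.01 for the next 10,000, and so on.

    Assumption: the exercise's "steps" are read as the per-pair count k.
    """
    return 0.1 * 10.0 ** (-((k - 1) // block))


def partial_sums(alpha, checkpoints):
    """Hypothetical helper: map each n in `checkpoints` to
    (sum of alpha_k, sum of alpha_k**2) over k = 1..n."""
    wanted = set(checkpoints)
    result, s, s2 = {}, 0.0, 0.0
    for k in range(1, max(wanted) + 1):
        a = alpha(k)
        s += a
        s2 += a * a
        if k in wanted:
            result[k] = (s, s2)
    return result


if __name__ == "__main__":
    schedules = [("i", alpha_i), ("ii", alpha_ii), ("iii", alpha_iii), ("iv", alpha_iv)]
    for name, alpha in schedules:
        for n, (s, s2) in sorted(partial_sums(alpha, (1_000, 10_000, 100_000)).items()):
            print(f"({name:>3}) n={n:>7,}  sum={s:10.3f}  sum of squares={s2:8.4f}")
```

For (iv), the running sum of step sizes levels off near 10000/9 ≈ 1111.1. For (i), it keeps growing, roughly like ln n. Whether these sums diverge is exactly what the usual convergence conditions for Q-learning examine.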
(a) Which of these will converge to the true Q-value in theory?
(b) Which converges to the true Q-value in practice (i.e., in a reasonable number of steps)? Try it for more than one domain.
(c) Which can adapt when the environment adapts slowly?
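For (a), one common route is to check the step-size conditions that standard convergence results for tabular Q-learning rely on. These results require every state–action pair to be updated infinitely often, with its step sizes αk satisfying the two sums in the first row below (the Robbins–Monro conditions). The conditions are sufficient rather than necessary, so this check is a sketch of the argument, not a complete proof.

```latex
\begin{align*}
&\text{Conditions:} && \sum_{k=1}^{\infty}\alpha_k = \infty, && \sum_{k=1}^{\infty}\alpha_k^2 < \infty \\
&\text{(i) } \alpha_k = \tfrac{1}{k}: && \sum_{k}\tfrac{1}{k} = \infty, && \sum_{k}\tfrac{1}{k^2} = \tfrac{\pi^2}{6} < \infty \\
&\text{(ii) } \alpha_k = \tfrac{10}{9+k}: && \sum_{k}\tfrac{10}{9+k} = 10\sum_{m=10}^{\infty}\tfrac{1}{m} = \infty, && \sum_{k}\tfrac{100}{(9+k)^2} < \infty \\
&\text{(iii) } \alpha_k = 0.1: && \sum_{k} 0.1 = \infty, && \sum_{k} 0.01 = \infty \\
&\text{(iv) blocks of } 10^4 \text{ steps}: && \sum_{k}\alpha_k = \sum_{j=0}^{\infty} 10^4 \cdot 0.1 \cdot 10^{-j} = \tfrac{1000}{1-0.1} = \tfrac{10000}{9} < \infty, && \sum_{k}\alpha_k^2 = \tfrac{100}{1-0.01} < \infty
\end{align*}
```

(i) and (ii) meet both conditions. (iii) fails the second: with noisy rewards or transitions, the estimates keep fluctuating around the true values instead of settling. (iv) fails the first: the step sizes shrink so quickly that the estimates effectively freeze after a few blocks at whatever noisy values they have reached.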
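For (b), a small experiment is one way to compare the schedules in practice. The sketch below runs tabular Q-learning with a uniformly random behaviour policy on two toy domains whose true Q-values are known in closed form: a one-step noisy bandit and a two-state chain with discount γ = 0.9. Several things here are assumptions made for illustration: both domains, the Gaussian reward noise, γ, and the helper names (`run_q_learning`, `bandit_step`, `chain_step`, `max_error`). Schedule (iv) is read per state–action pair.

```python
import random

# The exercise's four schedules as functions of the per-pair visit count k >= 1.
# Schedule (iv) is read per pair: 0.1 for k <= 10,000, then 0.01, and so on.
SCHEDULES = {
    "i: 1/k": lambda k: 1.0 / k,
    "ii: 10/(9+k)": lambda k: 10.0 / (9 + k),
    "iii: 0.1": lambda k: 0.1,
    "iv: 0.1, 0.01, ...": lambda k: 0.1 * 10.0 ** (-((k - 1) // 10_000)),
}

GAMMA = 0.9  # discount factor (assumed)


def bandit_step(state, action, rng):
    """Hypothetical domain 1: a single state; every action ends the episode with
    a Gaussian-noise reward of mean 1.0 (action 0) or 0.5 (action 1)."""
    mean = 1.0 if action == 0 else 0.5
    return mean + rng.gauss(0.0, 1.0), 0, True


BANDIT_TRUE_Q = [[1.0, 0.5]]


def chain_step(state, action, rng):
    """Hypothetical domain 2: from state 0 any action moves to state 1 with a
    zero-mean noisy reward; in state 1 any action ends the episode with mean
    reward 1.0 (action 0) or 0.0 (action 1)."""
    if state == 0:
        return rng.gauss(0.0, 1.0), 1, False
    mean = 1.0 if action == 0 else 0.0
    return mean + rng.gauss(0.0, 1.0), 0, True


# Q*(1, .) is the mean terminal reward; Q*(0, .) = 0 + GAMMA * max_a Q*(1, a).
CHAIN_TRUE_Q = [[GAMMA * 1.0, GAMMA * 1.0], [1.0, 0.0]]


def run_q_learning(step_fn, n_states, n_actions, alpha_fn, n_steps, gamma=GAMMA, seed=0):
    """Tabular Q-learning with a uniformly random behaviour policy.

    step_fn(state, action, rng) -> (reward, next_state, done); alpha_fn(k) gives
    the step size for the k-th update of a state-action pair.
    """
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    counts = [[0] * n_actions for _ in range(n_states)]
    state = 0
    for _ in range(n_steps):
        action = rng.randrange(n_actions)  # explore uniformly; Q-learning is off-policy
        reward, next_state, done = step_fn(state, action, rng)
        counts[state][action] += 1         # separate count k per state-action pair
        alpha = alpha_fn(counts[state][action])
        target = reward if done else reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = 0 if done else next_state  # restart in state 0 after a terminal step
    return q


def max_error(q, true_q):
    """Largest absolute difference between estimated and true Q-values."""
    return max(abs(a - b) for row, true_row in zip(q, true_q) for a, b in zip(row, true_row))


if __name__ == "__main__":
    domains = {
        "bandit": (bandit_step, 1, 2, BANDIT_TRUE_Q),
        "chain": (chain_step, 2, 2, CHAIN_TRUE_Q),
    }
    for dname, (step_fn, n_s, n_a, true_q) in domains.items():
        for sname, alpha_fn in SCHEDULES.items():
            for n_steps in (1_000, 10_000, 100_000):
                q = run_q_learning(step_fn, n_s, n_a, alpha_fn, n_steps)
                err = max_error(q, true_q)
                print(f"{dname:6s} {sname:18s} steps={n_steps:>7,}  max|Q - Q*|={err:.3f}")
```

Running it prints the largest error |Q − Q*| for each domain, schedule, and run length. Which schedule looks best can change with the noise level, γ, and the domain, which is why the exercise asks for more than one domain.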
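For (c), take a state–action pair whose transition always ends the episode, so the Q-learning update reduces to q ← q + αk (r − q). The sketch below feeds that update rewards whose mean rises slowly and steadily. The linear drift, the Gaussian noise, and the helper name `track_drifting_mean` are hypothetical choices standing in for "an environment that changes slowly".

```python
import random

# The exercise's four schedules as functions of the visit count k >= 1.
# Schedule (iv) is read per pair: 0.1 for k <= 10,000, then 0.01, and so on.
SCHEDULES = {
    "i: 1/k": lambda k: 1.0 / k,
    "ii: 10/(9+k)": lambda k: 10.0 / (9 + k),
    "iii: 0.1": lambda k: 0.1,
    "iv: 0.1, 0.01, ...": lambda k: 0.1 * 10.0 ** (-((k - 1) // 10_000)),
}


def track_drifting_mean(alpha_fn, n_steps=50_000, drift=4e-5, noise=0.5, seed=1):
    """Run q <- q + alpha_k * (r - q) on rewards whose mean rises by `drift` per step.

    Returns (final estimate q, final true mean). The linear drift and Gaussian
    noise are hypothetical choices standing in for a slowly changing environment.
    """
    rng = random.Random(seed)
    q, mean = 0.0, 0.0
    for k in range(1, n_steps + 1):
        mean = drift * k                   # true expected reward at step k
        r = mean + rng.gauss(0.0, noise)   # observed noisy reward
        q += alpha_fn(k) * (r - q)         # terminal-transition Q-learning update
    return q, mean


if __name__ == "__main__":
    for name, alpha_fn in SCHEDULES.items():
        q, mean = track_drifting_mean(alpha_fn)
        print(f"{name:18s} estimate={q:6.3f}  true mean={mean:.3f}")
```

Under (i), the update computes the plain average of every reward seen so far, so with a linear drift the estimate ends up near half the current mean. A constant step size such as (iii) weights recent rewards geometrically and keeps following the drift, at the price of never settling down. Schedules (ii) and (iv) shrink their step sizes toward zero, so they respond less and less to change the longer they run.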
