Question:
Figure 17.13 shows two MDPs: one, M, represents a two-armed bandit where one has the choice to continue with the first arm or to switch permanently to a second arm with fixed reward λ; the other, a restart MDP M^s, gives one a choice to continue with the first arm or restart the sequence. The figure illustrates the construction of M^s for the case where M has a deterministic reward sequence and just two arms (including the λ-arm). Explain how to construct M^s when M has k + 1 arms (including the λ-arm) and each arm is a general MRP. Show that the value of M^s equals the minimum value of λ such that one would be indifferent in M between pulling the best arm and switching to the λ-arm forever.
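As a concrete reference point for the two-arm deterministic case that the question describes, the following minimal Python sketch compares the value of the restart MDP with the indifference value of λ. It is an illustration under stated assumptions, not the book's solution: the reward sequence and discount factor are hypothetical, rewards are assumed nonnegative and zero once the listed sequence ends, and the λ-arm is assumed to pay λ on every step, so its total discounted value is λ / (1 - γ). The function names `gittins_by_ratio` and `restart_value` are made up for this example.

```python
def gittins_by_ratio(rewards, gamma):
    """Indifference lambda for the first arm, found by trying every stopping time T:
    max over T >= 1 of (sum_{t<T} gamma^t R_t) / (sum_{t<T} gamma^t).
    Assumes nonnegative rewards that are zero after the listed sequence,
    so stopping later than len(rewards) can never raise the ratio."""
    best = float("-inf")
    num = 0.0  # discounted reward collected before stopping
    den = 0.0  # discounted number of steps before stopping
    for t, r in enumerate(rewards):
        num += gamma ** t * r
        den += gamma ** t
        best = max(best, num / den)
    return best


def restart_value(rewards, gamma, sweeps=200):
    """Value of the start state of the restart MDP, by value iteration.
    State i means 'i rewards of the sequence have been received'.
    Action 'continue' yields rewards[i] and moves to state i + 1.
    Action 'restart' behaves like 'continue' taken from state 0.
    State n = len(rewards) is the assumed all-zero tail of the sequence."""
    n = len(rewards)
    v = [0.0] * (n + 1)
    for _ in range(sweeps):
        restart = rewards[0] + gamma * v[1]
        new_v = []
        for i in range(n):
            cont = rewards[i] + gamma * v[i + 1]
            new_v.append(max(cont, restart))
        # Tail state: continuing earns 0 and stays in the tail.
        new_v.append(max(gamma * v[n], restart))
        v = new_v
    return v[0]


# Hypothetical example data, not taken from the figure.
rewards = [0.0, 2.0, 0.0, 7.2]
gamma = 0.5

lam = gittins_by_ratio(rewards, gamma)
v_restart = restart_value(rewards, gamma)
print(lam, (1 - gamma) * v_restart)
```

Under these assumptions the two printed numbers agree: the restart MDP's value equals λ / (1 - γ), which is the value of taking the λ-arm forever at the indifference point. The general case asked for in the question (k + 1 arms, each a general MRP) needs expectations over stochastic transitions, which this sketch does not cover.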
Step by Step Answer:
Artificial Intelligence A Modern Approach
ISBN: 9780134610993
4th Edition
Authors: Stuart Russell, Peter Norvig