Question:
Consider a bandit problem in which you know the set of expected payoffs for the arms, but you do not know which arm maps to which expected payoff. For example, consider a 5-armed bandit problem in which you know that arms 1 through 5 have expected payoffs 3.1, 2.3, 4.6, 1.2, and 0.9, but not necessarily in that order.
a) Can you design a regret-minimizing algorithm that achieves better regret bounds than UCB? What makes you believe that it is possible?
b) Which parts of the analysis of UCB would you modify to achieve better bounds? Note that you are not asked for a complete algorithm, only the intuition.
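The setup described in the question can be sketched in a few lines of Python. This is only an illustration of the problem statement, not a solution: the helper names (`make_hidden_assignment`, `pull`, `known_gaps`) are hypothetical, and the Gaussian noise with unit standard deviation is an assumption, since the question does not specify a reward distribution.

```python
import random

# Known multiset of expected payoffs, taken from the question.
KNOWN_MEANS = [3.1, 2.3, 4.6, 1.2, 0.9]


def make_hidden_assignment(means, seed=0):
    """Hypothetical helper: shuffle the known means into a hidden arm order."""
    rng = random.Random(seed)
    hidden = list(means)
    rng.shuffle(hidden)
    return hidden


def pull(hidden, arm, rng, noise_sd=1.0):
    """Sample a reward for `arm`.

    Assumption: rewards are Gaussian around the arm's hidden mean;
    the question itself does not state the noise model.
    """
    return rng.gauss(hidden[arm], noise_sd)


def known_gaps(means):
    """Gaps between the best mean and every other mean, in ascending order.

    These are available to the learner before any arm is pulled.
    """
    best = max(means)
    return sorted(best - m for m in means if m != best)


hidden = make_hidden_assignment(KNOWN_MEANS)
gaps = known_gaps(KNOWN_MEANS)
print("best mean:", max(KNOWN_MEANS))
print("smallest gap:", round(gaps[0], 2))
```

The point of the sketch is that the best mean (4.6) and the smallest gap to it (4.6 - 3.1 = 1.5) are fixed by the problem statement, whereas standard UCB treats both as unknown and has to learn them from samples.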
