Question:

An epsilon-greedy strategy for the stochastic multi-armed bandit setup exploits the current best arm with probability 1 − ε and explores with a small probability ε. Consider a problem instance with 10 arms where the reward for the i-th arm (i = 1, ..., 10) is Beta distributed with parameters α_i = 5, β_i = 5i. Implement the epsilon-greedy algorithm and compare its performance with that of the UCB and EXP3 algorithms. Plot the regret bounds and comment on your observations. (Bonus: Can you formally show a regret guarantee for the epsilon-greedy algorithm?)
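A minimal simulation sketch in Python (standard library only) for the comparison the question asks for. It rests on several assumptions not stated in the source: β_i = 5i is the intended reading of the parameters, UCB1 stands in for "UCB", ε-greedy explores uniformly over all K arms, ε-greedy and UCB1 pull each arm once before their main loop, EXP3's exploration rate γ uses the standard tuning with g = T, and regret is measured as pseudo-regret against the known arm means. All function names are hypothetical.

```python
import math
import random

# Illustrative sketch; helper names are hypothetical, not from the source.
# Problem instance, assuming the statement means alpha_i = 5 and beta_i = 5*i.
K = 10
ALPHA = [5.0] * K
BETA = [5.0 * i for i in range(1, K + 1)]
MEANS = [a / (a + b) for a, b in zip(ALPHA, BETA)]  # mu_i = 1 / (1 + i)
MU_STAR = max(MEANS)                                # arm 1 is optimal


def pull(arm, rng):
    """Sample a Beta(alpha_i, beta_i) reward; it always lies in [0, 1]."""
    return rng.betavariate(ALPHA[arm], BETA[arm])


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def epsilon_greedy(T, rng, eps=0.1):
    counts, sums, arms = [0] * K, [0.0] * K, []
    for t in range(T):
        if t < K:
            arm = t                                   # pull each arm once
        elif rng.random() < eps:
            arm = rng.randrange(K)                    # explore: uniform over all arms
        else:
            arm = argmax([s / n for s, n in zip(sums, counts)])  # exploit
        counts[arm] += 1
        sums[arm] += pull(arm, rng)
        arms.append(arm)
    return arms


def ucb1(T, rng):
    counts, sums, arms = [0] * K, [0.0] * K, []
    for t in range(T):
        if t < K:
            arm = t                                   # pull each arm once
        else:
            # UCB1 index: empirical mean + sqrt(2 ln(total pulls) / pulls of arm)
            arm = argmax([sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
                          for i in range(K)])
        counts[arm] += 1
        sums[arm] += pull(arm, rng)
        arms.append(arm)
    return arms


def exp3(T, rng, gamma=None):
    if gamma is None:  # standard EXP3 tuning, using g = T as the reward bound
        gamma = min(1.0, math.sqrt(K * math.log(K) / ((math.e - 1) * T)))
    log_w, arms = [0.0] * K, []
    for _ in range(T):
        top = max(log_w)
        w = [math.exp(lw - top) for lw in log_w]      # rescaled to avoid overflow
        total = sum(w)
        probs = [(1 - gamma) * wi / total + gamma / K for wi in w]
        arm = rng.choices(range(K), weights=probs)[0]
        x_hat = pull(arm, rng) / probs[arm]           # importance-weighted reward
        log_w[arm] += gamma * x_hat / K               # w_arm *= exp(gamma * x_hat / K)
        arms.append(arm)
    return arms


def cumulative_regret(arms):
    """Pseudo-regret: running sum of mu* - mu_{A_t}."""
    curve, total = [], 0.0
    for a in arms:
        total += MU_STAR - MEANS[a]
        curve.append(total)
    return curve


def average_regret(algo, T=5000, runs=20, seed=0, **kwargs):
    """Cumulative regret averaged over independent runs."""
    rng = random.Random(seed)
    avg = [0.0] * T
    for _ in range(runs):
        for t, r in enumerate(cumulative_regret(algo(T, rng, **kwargs))):
            avg[t] += r / runs
    return avg
```

To plot, the averaged curves can be handed to any plotting library, for example `plt.plot(average_regret(ucb1), label="UCB1")` with matplotlib for each algorithm, followed by `plt.legend()`. The standard analyses give linear regret for constant-ε greedy (it never stops exploring), O(log T) for UCB1 and O(√(TK ln K)) for EXP3. Which curve is lowest at a given horizon still depends on ε and T, so it is worth trying several values of `eps` and a long horizon before drawing conclusions.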

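For the bonus, one simple guarantee can be written down directly. The derivation below assumes β_i = 5i, that ε-greedy explores uniformly over all K arms, and pseudo-regret R_T = Σ_t (μ* − μ_{A_t}). It shows that a constant ε forces regret that grows linearly in T.

```latex
\begin{align*}
\mu_i &= \frac{\alpha_i}{\alpha_i + \beta_i} = \frac{5}{5 + 5i} = \frac{1}{1+i},
\qquad \mu^* = \mu_1 = \tfrac{1}{2},
\qquad \Delta_i = \mu^* - \mu_i = \frac{i-1}{2(i+1)}, \\
\mathbb{E}[R_T] &= \sum_{t=1}^{T} \mathbb{E}\big[\Delta_{A_t}\big]
  \;\ge\; \sum_{t=1}^{T} \varepsilon \cdot \frac{1}{K}\sum_{i=1}^{K}\Delta_i
  = \frac{\varepsilon T}{K}\sum_{i=1}^{K}\Delta_i \approx 0.298\,\varepsilon T .
\end{align*}
```

The inequality only drops the non-negative regret of exploitation rounds, so it holds whatever the greedy step does; here Σ_i Δ_i ≈ 2.98 with K = 10. Sublinear regret therefore needs a decaying ε. Auer, Cesa-Bianchi and Fischer (2002) analyze ε_t = min{1, cK/(d²t)} with 0 < d ≤ min_{i: Δ_i > 0} Δ_i (here d ≤ Δ_2 = 1/6) and show, roughly, that for c > 5 the probability of pulling a suboptimal arm at round t is of order c/(d²t), which sums to O(log T) regret.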