Question:

Part 1:
Write code for a multi-armed bandit algorithm that has the following characteristics:
A: number of arms
P: distribution of rewards on [0, 1]. Use the beta distribution so that you can tune each arm's reward distribution with two parameters. Choose your own parameter settings and graph the distributions in one plot.
r_i: reward (0 or 1) drawn from probability distribution P_i
T: number of rounds played (gambles)
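One way to set up Part 1 is sketched below. It is a minimal sketch under an assumed reading of the question: on each pull of arm i, a success probability p is drawn from Beta(a_i, b_i) and the reward is 1 with probability p, otherwise 0. The names `ARM_PARAMS`, `BetaBandit`, `beta_pdf`, and `plot_arm_distributions`, as well as the parameter values, are hypothetical choices and not part of the question. Plotting assumes matplotlib is installed.

```python
import math
import random

# Hypothetical parameter choices: one (alpha, beta) pair per arm, so A = 4.
ARM_PARAMS = [(2.0, 8.0), (4.0, 6.0), (5.0, 5.0), (8.0, 3.0)]


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


class BetaBandit:
    """A-armed bandit; arm i pays 1 with probability p drawn from Beta(a_i, b_i)."""

    def __init__(self, params, seed=None):
        self.params = list(params)
        self.n_arms = len(self.params)
        self.rng = random.Random(seed)
        # The mean of Beta(a, b) is a / (a + b), which is also E[r_i] here.
        self.means = [a / (a + b) for a, b in self.params]

    def pull(self, arm):
        """Play one round on `arm` and return a reward of 0 or 1."""
        a, b = self.params[arm]
        p = self.rng.betavariate(a, b)
        return 1 if self.rng.random() < p else 0


def plot_arm_distributions(params=ARM_PARAMS):
    """Draw every arm's Beta density on one set of axes (needs matplotlib)."""
    import matplotlib.pyplot as plt

    xs = [i / 500 for i in range(1, 500)]
    for i, (a, b) in enumerate(params):
        plt.plot(xs, [beta_pdf(x, a, b) for x in xs], label=f"arm {i}: Beta({a}, {b})")
    plt.xlabel("reward probability")
    plt.ylabel("density")
    plt.legend()
    plt.show()
```

Playing T rounds then amounts to calling `pull` T times, for example `bandit = BetaBandit(ARM_PARAMS, seed=0)` followed by `bandit.pull(arm)` inside a loop. The `means` attribute is kept only for scoring regret later; a strategy should not read it.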
Part 2:
Suppose you have 4 arms (A=4). Implement a random, a greedy, an epsilon-first greedy, an epsilon-greedy, and an upper confidence bound (UCB1) approach to selecting the best arm to play. Ensure the strategies only use the rewards when determining which arm to play.
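The five strategies in Part 2 could be sketched as follows. This is an illustrative sketch, not a reference solution: the shared signature `policy(t, counts, sums, rng)`, the helper `run`, and the factory names are assumptions introduced here. Each policy sees only the per-arm play counts and reward totals, never the underlying distributions. The UCB1 index used is the empirical mean plus sqrt(2 ln n / n_j), where n is the total number of plays so far and n_j the number of plays of arm j.

```python
import math
import random


def random_policy(t, counts, sums, rng):
    # Ignore all feedback and pick an arm uniformly at random.
    return rng.randrange(len(counts))


def _best_empirical(counts, sums, rng):
    # Unplayed arms are scored as mean 0 here; ties are broken at random.
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    top = max(means)
    return rng.choice([i for i, m in enumerate(means) if m == top])


def greedy(t, counts, sums, rng):
    # Play each arm once, then always exploit the best empirical mean.
    for i, c in enumerate(counts):
        if c == 0:
            return i
    return _best_empirical(counts, sums, rng)


def make_epsilon_first(epsilon, horizon):
    # Explore at random for the first epsilon * horizon rounds, then exploit.
    n_explore = int(epsilon * horizon)

    def policy(t, counts, sums, rng):
        if t < n_explore:
            return rng.randrange(len(counts))
        return _best_empirical(counts, sums, rng)

    return policy


def make_epsilon_greedy(epsilon):
    # In every round, explore with probability epsilon, otherwise exploit.
    def policy(t, counts, sums, rng):
        if rng.random() < epsilon:
            return rng.randrange(len(counts))
        return _best_empirical(counts, sums, rng)

    return policy


def ucb1(t, counts, sums, rng):
    # Play each arm once, then maximise mean + sqrt(2 ln n / n_j).
    for i, c in enumerate(counts):
        if c == 0:
            return i
    total = sum(counts)
    scores = [s / c + math.sqrt(2 * math.log(total) / c) for s, c in zip(sums, counts)]
    return scores.index(max(scores))


def run(policy, pull, n_arms, horizon, rng):
    """Play `horizon` rounds; `pull(arm)` must return a reward of 0 or 1."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    chosen = []
    for t in range(horizon):
        arm = policy(t, counts, sums, rng)
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
        chosen.append(arm)
    return chosen, counts
```

A strategy is run with, for example, `run(make_epsilon_greedy(0.1), pull, 4, 1000, random.Random(0))`, where `pull` is any function that maps an arm index to a 0/1 reward. The epsilon values shown are placeholders to be tuned.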
Part 3:
Evaluate the performance of the 5 strategies by 1) plotting the regret of each round [i.e., plot Regret(round#) versus round] and 2) plotting the expected regret averaged over 50 rounds [i.e., plot average Regret(round#) versus round]. Regret is the difference between the reward you would have received by playing optimally and the reward you actually received.
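The regret evaluation in Part 3 might be sketched as below, under several stated assumptions. First, "averaged over 50" is read here as averaging over 50 independent runs of the whole experiment. Second, regret in a round is scored as the best arm's expected reward minus the chosen arm's expected reward, which is less noisy than using realized 0/1 rewards; swap in the realized reward if that is the intended definition. Third, each arm is simulated as a Bernoulli draw with the arm's mean, which gives the same reward distribution as drawing p from a beta distribution and then a 0/1 reward with probability p. `ARM_MEANS`, the example policy, and all function names are hypothetical.

```python
import random

ARM_MEANS = [0.2, 0.4, 0.5, 0.73]  # hypothetical expected rewards for A = 4 arms


def epsilon_greedy(counts, sums, rng, epsilon=0.1):
    """Example policy; any function with this signature can be swapped in."""
    if rng.random() < epsilon or not any(counts):
        return rng.randrange(len(counts))
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return means.index(max(means))


def regret_per_round(policy, means, horizon, rng):
    """Return the list of per-round regrets for one run of `horizon` rounds."""
    best = max(means)
    counts = [0] * len(means)
    sums = [0.0] * len(means)
    regrets = []
    for _ in range(horizon):
        arm = policy(counts, sums, rng)
        reward = 1 if rng.random() < means[arm] else 0
        counts[arm] += 1
        sums[arm] += reward
        # The policy never sees `means`; they are used only for scoring.
        regrets.append(best - means[arm])
    return regrets


def average_regret(policy, means, horizon, n_runs=50, seed=0):
    """Average the per-round regret over n_runs independent runs."""
    totals = [0.0] * horizon
    for run in range(n_runs):
        rng = random.Random(seed + run)
        for t, r in enumerate(regret_per_round(policy, means, horizon, rng)):
            totals[t] += r
    return [x / n_runs for x in totals]


def plot_regret(curves):
    """Plot regret versus round; `curves` maps strategy name to a regret list."""
    import matplotlib.pyplot as plt

    for name, ys in curves.items():
        plt.plot(range(1, len(ys) + 1), ys, label=name)
    plt.xlabel("round")
    plt.ylabel("regret")
    plt.legend()
    plt.show()
```

Calling `plot_regret` once with the output of `regret_per_round` for each strategy gives the first plot, and once with the output of `average_regret` gives the second. Plotting assumes matplotlib is installed.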
