Question: Q3 (45 points). Upper Confidence Bound (UCB) algorithm

Q3 (45 points). Upper Confidence Bound (UCB) algorithm: In this problem, we will run the UCB algorithm that we saw in class for a few rounds. We will assume that the desired probability of failure is p = 0.05.

We consider a setting with 4 different arms. At each time step, we must decide i) which arm to pick and ii) how to update our belief about this arm. We do so in a partial information setting; i.e., at each time step, we only get to observe the reward r_t(i) of the arm i we decide to pick.

We saw in class that UCB must explore each arm once at the start, i.e., from time steps 1 to 4. Here, we assume that in the first 4 time steps, UCB obtains rewards 0.95 for arm 1, 0.88 for arm 2, 0.60 for arm 3, and 0.70 for arm 4. The table below gives the reward of each arm at each time step starting at t = 5:

Table 1: Rewards at each time step

Remember once again that we can only use the rewards of the arms we pick. You should act as if you only know the reward of the arm you pulled at each time step. For example, if at time t you pull arm 3, you cannot use information about the rewards of arms 1, 2, and 4 at that time step (i.e., r_t(1), r_t(2), r_t(4)) in the subsequent steps of the algorithm; you only have access to r_t(3) as your information from time step t.

(a) (5 points) We saw that UCB works by first computing an upper confidence bound for each arm. This bound, for a given arm i, depends on how many times N(i) we have pulled that arm, and is computed by adding a number U(i) to the average reward. What is the value of U(i) for N(i) = 1? N(i) = 2? N(i) = 3? N(i) = 4? N(i) = 5?

(b) (10 points) At the beginning of t = 5, what is Q^5(i) + U^5(i) for each arm? Which arm should I pull next?

(c) (10 points) What is the new N(i) + U(i) for each arm at the end of day 5 / beginning of day 6?

(d) (20 points) Keep running the algorithm until time step t = 9. (Your answer should include the choice of arm at the beginning of each time step and the upper confidence bounds at the end of each time step, for t = 5 to t = 9; do not forget the upper confidence bounds at the end of t = 9.)
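
The bookkeeping asked for in parts (a) to (d) can be sketched in a few lines of Python. This is only an illustration under a stated assumption: the question does not reproduce the formula for U(i) used in class, so the sketch assumes a Hoeffding-style width U(i) = sqrt(ln(2/p) / (2 N(i))) for rewards in [0, 1]. If the course used a different constant, only `bonus` needs to change. The function names (`bonus`, `ucb_indices`, `choose_arm`, `update`) are hypothetical helpers, and no rewards from Table 1 are used because the table is not included here.

```python
import math

P_FAIL = 0.05  # desired failure probability p from the problem statement


def bonus(n, p=P_FAIL):
    """ASSUMED Hoeffding-style width U(i) for an arm pulled n times (rewards in [0, 1])."""
    return math.sqrt(math.log(2.0 / p) / (2.0 * n))


def ucb_indices(means, counts, p=P_FAIL):
    """Return the upper confidence bound Q(i) + U(i) for every arm."""
    return {arm: means[arm] + bonus(counts[arm], p) for arm in means}


def choose_arm(means, counts, p=P_FAIL):
    """Pick the arm with the largest upper confidence bound."""
    idx = ucb_indices(means, counts, p)
    return max(idx, key=idx.get)


def update(means, counts, arm, reward):
    """Fold the observed reward of the pulled arm into its running average."""
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]


# State after the initial exploration (t = 1..4), taken from the problem statement.
means = {1: 0.95, 2: 0.88, 3: 0.60, 4: 0.70}
counts = {1: 1, 2: 1, 3: 1, 4: 1}

widths = {n: round(bonus(n), 4) for n in range(1, 6)}  # part (a), under the assumed formula
indices_t5 = ucb_indices(means, counts)                # part (b)
arm_t5 = choose_arm(means, counts)                     # arm to pull at t = 5

print(widths)
print(indices_t5, arm_t5)
```

At the beginning of t = 5 every arm has N(i) = 1, so all arms share the same U(i) and the arm with the highest observed reward (arm 1) has the largest upper confidence bound, whichever constant is used in the width. For t = 5 to t = 9, one would call `update` with only the reward of the pulled arm read from Table 1, then recompute `ucb_indices`.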
