Q3 (45 points). Upper Confidence Bound (UCB) algorithm: In this problem, we will run the UCB algorithm that we saw in class for a few rounds. We will assume that the desired probability of failure is p=0.05.

We consider a setting with 4 different arms. At each time step, we must decide i) which arm to pick and ii) how to update our belief about this arm. We do so in a partial information setting; i.e., at each time step t, we only get to observe the reward r_i(t) of the arm i we decide to pick.

We saw in class that UCB must explore each arm once at the start, i.e., during time steps 1 to 4. Here, we assume that in the first 4 time steps, UCB obtains rewards 0.95 for arm 1, 0.88 for arm 2, 0.60 for arm 3, and 0.70 for arm 4. The table below gives the reward of each arm at each time step starting at t=5:

Table 1: Rewards at each time step

Remember once again that we can only use the rewards of the arms we pick. You should act as if you only know the reward of the arm you pulled at each time step. For example, if at time t you pull arm 3, you cannot use the rewards of arms 1, 2, and 4 at that time step (i.e., r_1(t), r_2(t), and r_4(t)) in the subsequent steps of the algorithm; your only information from time step t is r_3(t).

(a) (5 points) We saw that UCB works by first computing an upper confidence bound for each arm. For a given arm i, this bound depends on how many times N(i) we have pulled that arm, and is computed by adding a number U(i) to the average reward. What is the value of U(i) for N(i)=1? N(i)=2? N(i)=3? N(i)=4? N(i)=5?

(b) (10 points) At the beginning of t=5, what is Q^5(i)+U^5(i) for each arm? Which arm should I pull next?

(c) (10 points) What is the new Q(i)+U(i) for each arm at the end of t=5 (the beginning of t=6)?

(d) (20 points) Keep running the algorithm until time step t=9. (Your answer should include the choice of arm at the beginning of each time step, as well as the upper confidence bounds at the end of each time step, for t=5 to t=9; do not forget the upper confidence bounds at the end of t=9.)
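The bookkeeping in parts (a) to (c) can be sketched in a few lines of Python. This is only an illustrative sketch, not the course's reference solution: the question does not state the formula for U(i), so the code assumes a Hoeffding-style bonus U(i) = sqrt(ln(2/p) / (2 N(i))), and the names bonus, ucb_index, choose_arm, and update are hypothetical helpers. If the class used a different bonus, only the body of bonus needs to change. Because Table 1 is not reproduced here, the sketch stops at the choice for t=5 and leaves later rewards to be supplied by hand.

```python
import math

P_FAIL = 0.05  # desired probability of failure, from the problem statement


def bonus(n, p=P_FAIL):
    """Exploration bonus U(i) for an arm pulled n times.

    ASSUMPTION: Hoeffding-style bonus sqrt(ln(2/p) / (2n)). The formula
    from class is not given in the question, so swap this in if it differs.
    """
    return math.sqrt(math.log(2.0 / p) / (2.0 * n))


# Rewards observed during the initial exploration (time steps 1 to 4).
initial_rewards = {1: 0.95, 2: 0.88, 3: 0.60, 4: 0.70}

counts = {arm: 1 for arm in initial_rewards}  # N(i): each arm pulled once
means = dict(initial_rewards)                 # Q(i): average observed reward


def ucb_index(arm):
    """Upper confidence bound Q(i) + U(i) for one arm."""
    return means[arm] + bonus(counts[arm])


def choose_arm():
    """Pick the arm with the largest upper confidence bound."""
    return max(counts, key=ucb_index)


def update(arm, reward):
    """Record the observed reward of the pulled arm only."""
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]  # running average


# Part (a): U(i) for N(i) = 1, ..., 5 under the assumed formula.
bonuses = {n: bonus(n) for n in range(1, 6)}

# Part (b): the arm chosen at the beginning of t=5.
arm_t5 = choose_arm()

for n, u in bonuses.items():
    print(f"N(i)={n}: U(i)={u:.4f}")
print("arm pulled at t=5:", arm_t5)
```

At the beginning of t=5 every arm has N(i)=1, so all arms receive the same bonus and the choice is decided by the average reward alone; this holds for any bonus that depends only on N(i). For parts (c) and (d), call update(arm, reward) with the reward read from Table 1 for the pulled arm, then recompute ucb_index for every arm.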

