
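Consider the multi-armed bandit problem discussed in class. Suppose we receive return $G_n$ the $n$-th time we take action $a$, with $\mathbb{E}[G_n] = r$ for $n = 1, 2, \dots$. Let $Q_{n+1}$ be our estimate of $r$ after taking action $a$ for the $n$-th time, with the update rule

$$Q_{n+1} = Q_n + \alpha_n (G_n - Q_n), \qquad Q_1 = 0.$$

Define $V_n = \mathbb{E}[(Q_n - r)^2]$.

(a) (Decreasing step size) Let $\alpha_n = n^{-1}$. Show that
i. (5 points) $Q_{n+1} = n^{-1} \sum_{i=1}^{n} G_i$, $n = 1, 2, \dots$
ii. (10 points) $\lim_{n \to \infty} V_n = 0$.

(b) (Constant step size) Let $\alpha_n = \alpha$, $0 < \alpha < 2$. Show that
i. (15 points) $V_{n+1} = (1 - \alpha)^2 V_n + \alpha^2 \operatorname{Var}[G_n]$, where $\operatorname{Var}[G_n] = \mathbb{E}[(G_n - r)^2]$.
ii. (20 points) $\lim_{n \to \infty} \left( V_{n+1} - \frac{\alpha}{2 - \alpha} \operatorname{Var}[G_n] \right) = 0$.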
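The claims in the question can be sanity-checked numerically before attempting the proofs. The sketch below is only an illustration, not a solution. It assumes the returns are i.i.d. Gaussian with mean r = 1 and standard deviation 2, which the question does not specify, and the helper names `run_updates` and `variance_recursion` are made up for this example. It compares the 1/n update against the sample mean, iterates the recursion from (b)(i), and compares a Monte Carlo estimate of the mean squared error with the recursion's fixed point, alpha / (2 - alpha) * Var[G].

```python
import random


def run_updates(returns, step_size):
    """Apply Q_{n+1} = Q_n + alpha_n * (G_n - Q_n), starting from Q_1 = 0."""
    q = 0.0
    for n, g in enumerate(returns, start=1):
        q += step_size(n) * (g - q)
    return q


def variance_recursion(alpha, var_g, v1, steps):
    """Iterate V_{n+1} = (1 - alpha)^2 * V_n + alpha^2 * Var[G] from V_1 = v1."""
    v = v1
    for _ in range(steps):
        v = (1 - alpha) ** 2 * v + alpha ** 2 * var_g
    return v


rng = random.Random(0)   # fixed seed so the run is reproducible
r, sigma = 1.0, 2.0      # assumed true mean and standard deviation of the returns

# Part (a): with alpha_n = 1/n the estimate should equal the sample mean.
returns = [rng.gauss(r, sigma) for _ in range(1000)]
q_decreasing = run_updates(returns, lambda n: 1.0 / n)
sample_mean = sum(returns) / len(returns)

# Part (b): with constant alpha the recursion settles at its fixed point.
alpha = 0.5
v_limit = alpha / (2 - alpha) * sigma ** 2
v_iterated = variance_recursion(alpha, sigma ** 2, v1=r ** 2, steps=200)

# Monte Carlo estimate of V_n = E[(Q_n - r)^2] after 40 constant-step updates.
trials, steps = 5000, 40
squared_errors = []
for _ in range(trials):
    sample = [rng.gauss(r, sigma) for _ in range(steps)]
    q = run_updates(sample, lambda n: alpha)
    squared_errors.append((q - r) ** 2)
v_monte_carlo = sum(squared_errors) / trials

print("1/n estimate vs sample mean:", q_decreasing, sample_mean)
print("recursion vs fixed point:   ", v_iterated, v_limit)
print("Monte Carlo estimate of V_n:", v_monte_carlo)
```

With a constant step size the error does not vanish: the estimate keeps fluctuating around r with a variance that grows with alpha, which is the contrast with part (a) that the exercise is after.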
