Problem 1. (50 pt) Given a Markov stationary policy $\pi$, consider the policy evaluation problem of computing $v^\pi$. For example, we can apply the temporal-difference (TD) learning algorithm given by
\[
v_{t+1}(s) = v_t(s) + \alpha\, \delta_t\, \mathbb{1}_{\{s_t = s\}},
\]
where $\delta_t := r_t + \gamma v_t(s_{t+1}) - v_t(s_t)$ is known as the TD error. Alternatively, we can apply the $n$-step TD learning algorithm given by
\[
v_{t+1}(s) = v_t(s) + \alpha\, \bigl(G_t^{(n)} - v_t(s)\bigr)\, \mathbb{1}_{\{s_t = s\}},
\]
where $G_t^{(n)} := r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n v_t(s_{t+n})$ for $n = 1, 2, \ldots$. Note that $\delta_t = G_t^{(1)} - v_t(s_t)$. The $n$-step TD algorithms for $n < \infty$ use bootstrapping and therefore rely on a biased estimate of $v^\pi$. On the other hand, as $n \to \infty$, the $n$-step TD algorithm becomes a Monte Carlo method, which uses an unbiased estimate of $v^\pi$. However, these approaches delay the update for $n$ stages, and we update the value-function estimate only for the current state. As an intermediate step toward addressing these challenges, we first introduce the $\lambda$-return algorithm given by
\[
v_{t+1}(s) = v_t(s) + \alpha\, \bigl(G_t^{\lambda} - v_t(s)\bigr)\, \mathbb{1}_{\{s_t = s\}},
\]
where, given $\lambda \in [0,1]$, we define
\[
G_t^{\lambda} := (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)},
\]
a weighted average of the $G_t^{(n)}$'s.

(a) By the definition of $G_t^{(n)}$, we can show that $G_t^{(n)} = r_t + \gamma G_{t+1}^{(n-1)}$. Derive an analogous recursive relationship between $G_t^{\lambda}$ and $G_{t+1}^{\lambda}$.

(b) Show that the term $G_t^{\lambda} - v_t(s)$ in the $\lambda$-return update can be written as a sum of TD errors.

The TD algorithm, the Monte Carlo method, and the $\lambda$-return algorithm look forward to approximate $v^\pi$. Alternatively, we can look backward via the eligibility-trace method. The TD($\lambda$) algorithm is given by
\[
z_t(s) = \gamma\lambda\, z_{t-1}(s) + \mathbb{1}_{\{s = s_t\}}, \qquad s \in S,
\]
\[
v_{t+1}(s) = v_t(s) + \alpha\, \delta_t\, z_t(s), \qquad s \in S,
\]
where $z_t \in \mathbb{R}^{|S|}$ is called the eligibility vector and the initial condition is $z_{-1}(s) = 0$ for all $s$.

(c) In the TD($\lambda$) algorithm, $z_t$ is computed recursively. Express $z_t$ only in terms of the states visited in the past. This representation of the eligibility vector shows that eligibility traces combine the frequency heuristic and the recency heuristic to address the credit-assignment problem: for the rewards received, the frequency heuristic assigns higher credit to frequently visited states, while the recency heuristic assigns higher credit to recently visited states. The eligibility vector assigns higher credit to states that are visited both frequently and recently. Note that in the TD($\lambda$) algorithm the value-function estimate of every state gets updated, unlike in the $n$-step TD algorithms, where only the estimate for the current state is updated. If a state has not been visited recently and frequently, then its eligibility (i.e., the associated entry of the eligibility vector) will be close to zero, so the updates driven by the TD error take very small steps for such states.
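For concreteness, the following is a minimal Python sketch of the tabular TD($\lambda$) update described above. The `env.reset()` / `env.step()` / `policy(s)` interface, the episodic termination handling, and all parameter values are assumptions made purely for illustration, not part of the problem.

```python
import numpy as np

def td_lambda(env, policy, n_states, gamma=0.99, lam=0.9, alpha=0.05, episodes=500):
    """Tabular TD(lambda) policy evaluation with an accumulating eligibility trace.

    Assumed (hypothetical) interfaces: env.reset() -> state index,
    env.step(action) -> (next_state, reward, done), policy(state) -> action.
    """
    v = np.zeros(n_states)                 # value-function estimate v_t
    for _ in range(episodes):
        z = np.zeros(n_states)             # eligibility vector, z_{-1}(s) = 0
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            # TD error: delta_t = r_t + gamma * v_t(s_{t+1}) - v_t(s_t)
            bootstrap = 0.0 if done else v[s_next]
            delta = r + gamma * bootstrap - v[s]
            # trace update: z_t(s) = gamma * lam * z_{t-1}(s) + 1{s = s_t}
            z *= gamma * lam
            z[s] += 1.0
            # every state's estimate moves, scaled by its eligibility
            v += alpha * delta * z
            s = s_next
    return v
```

The line `v += alpha * delta * z` updates every entry of the value-function estimate at once, with step sizes scaled by the eligibilities, matching the update for all $s \in S$ above.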
Though the $\lambda$-return algorithm is forward-looking while TD($\lambda$) is backward-looking, the two are equivalent, as you will show next for the finite-horizon problem with horizon length $T < \infty$.

(d) Assume that the initial value-function estimates are zero, i.e., $v_0(s) = 0$ for all $s$. Then the recursive update in the $\lambda$-return algorithm yields that $v_T(s)$ can be written as
\[
v_T(s) = \sum_{t=0}^{T-1} \alpha\, \bigl(G_t^{\lambda} - v_t(s_t)\bigr)\, \mathbb{1}_{\{s_t = s\}}.
\]
Correspondingly, the recursive update in the TD($\lambda$) algorithm yields that $v_T(s)$ can be written as
\[
v_T(s) = \sum_{t=0}^{T-1} \alpha\, \delta_t\, z_t(s).
\]
Show that
\[
\sum_{t=0}^{T-1} \alpha\, \delta_t\, z_t(s) = \sum_{t=0}^{T-1} \alpha\, \bigl(G_t^{\lambda} - v_t(s_t)\bigr)\, \mathbb{1}_{\{s_t = s\}} \quad \text{for all } s.
\]
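As a numerical sanity check on the forward/backward correspondence in part (d) (not a substitute for the requested derivation), the sketch below compares the two sets of accumulated increments on a randomly generated episode. It makes two simplifying assumptions not stated in the problem: the value estimates are held fixed within the episode, so the same $v$ enters every $G_t^{(n)}$ and every $\delta_t$, and the post-episode state $s_T$ is treated as terminal with value zero. The episode, rewards, and parameter values are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic episode (illustrative only): states s_0..s_T, rewards r_0..r_{T-1}.
n_states, T = 5, 8
gamma, lam, alpha = 0.9, 0.7, 0.1
states = rng.integers(0, n_states, size=T + 1)
rewards = rng.normal(size=T)
v = rng.normal(size=n_states)      # value estimates, held fixed over the episode
v_terminal = 0.0                   # assumption: s_T is terminal with value zero

def value(t):
    return v_terminal if t == T else v[states[t]]

def n_step_return(t, n):
    # G_t^{(n)} = r_t + gamma r_{t+1} + ... + gamma^{n-1} r_{t+n-1} + gamma^n v(s_{t+n}),
    # truncated at the episode end.
    n = min(n, T - t)
    return sum(gamma**k * rewards[t + k] for k in range(n)) + gamma**n * value(t + n)

def lambda_return(t):
    # G_t^lambda = (1-lam) sum_{n=1}^{T-t-1} lam^{n-1} G_t^{(n)} + lam^{T-t-1} G_t^{(T-t)},
    # since G_t^{(n)} = G_t^{(T-t)} for all n >= T-t in an episode of length T.
    head = sum((1 - lam) * lam**(n - 1) * n_step_return(t, n) for n in range(1, T - t))
    return head + lam**(T - t - 1) * n_step_return(t, T - t)

# Forward view: lambda-return increments, accumulated per state.
forward = np.zeros(n_states)
for t in range(T):
    forward[states[t]] += alpha * (lambda_return(t) - v[states[t]])

# Backward view: TD(lambda) increments via the eligibility trace.
backward = np.zeros(n_states)
z = np.zeros(n_states)
for t in range(T):
    delta = rewards[t] + gamma * value(t + 1) - v[states[t]]
    z = gamma * lam * z
    z[states[t]] += 1.0
    backward += alpha * delta * z

print(np.allclose(forward, backward))   # True: the accumulated increments coincide
```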
