Assume a reinforcement learning agent has the following policy: $\pi_\theta(a_t \mid s_t) = \exp\left(-0.5\,\lVert s_t^\top \theta - a_t \rVert^2\right)$ (2)

Question:

Assume a reinforcement learning agent has the following policy:

$$\pi_\theta(a_t \mid s_t) = \exp\left(-0.5\,\lVert s_t^\top \theta - a_t \rVert^2\right) \tag{2}$$

(i) Let $\theta = [1, 1, 3]^\top$ be the current parameters, and let $\tau_1$ and $\tau_2$ be two trajectories sampled from the current policy:

$$\tau_1 = (s = [1, 0, 2]^\top,\ a = 0,\ r = 0.1),\ (s = [0, 2, 3]^\top,\ a = 1,\ r = 0.1) \tag{3}$$

$$\tau_2 = (s = [1, 1, 2]^\top,\ a = 1,\ r = 0),\ (s = [4, 1, 0]^\top,\ a = 0,\ r = 0.1) \tag{4}$$

Show how you would update the reinforcement learning agent using the policy gradient algorithm.

(ii) Describe two ways to reduce the variance of a policy gradient algorithm.
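
As a starting point for part (i), the update can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a definitive solution. It assumes the policy reads as exp(-0.5 * (s . theta - a)^2), so that the score function is grad_theta log pi = (a - s . theta) * s. It also assumes undiscounted returns, an average over the two trajectories, and an arbitrary learning rate of 0.1, none of which the question specifies. The function names are hypothetical. The optional `reward_to_go` and `baseline` arguments correspond to two standard variance-reduction techniques relevant to part (ii).

```python
# Hypothetical sketch of a REINFORCE-style update on the two sampled trajectories.
# ASSUMPTION: pi_theta(a|s) = exp(-0.5 * (s . theta - a)^2), hence
#             grad_theta log pi(a|s) = (a - s . theta) * s.
# ASSUMPTION: undiscounted returns and a learning rate chosen only for illustration.

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def grad_log_pi(theta, s, a):
    # Score function of the assumed policy.
    coeff = a - dot(s, theta)
    return [coeff * x for x in s]

def policy_gradient(theta, trajectories, reward_to_go=False, baseline=0.0):
    """Monte Carlo policy gradient estimate, averaged over trajectories."""
    grad = [0.0] * len(theta)
    for traj in trajectories:
        rewards = [r for (_, _, r) in traj]
        for t, (s, a, _) in enumerate(traj):
            # Weight each step by the full return, or only by rewards from step t on.
            ret = sum(rewards[t:]) if reward_to_go else sum(rewards)
            g = grad_log_pi(theta, s, a)
            for i in range(len(theta)):
                grad[i] += (ret - baseline) * g[i]
    return [g / len(trajectories) for g in grad]

def update(theta, grad, lr):
    # Gradient ascent step on the expected return.
    return [th + lr * g for th, g in zip(theta, grad)]

theta = [1.0, 1.0, 3.0]
# Each step is (state, action, reward), copied from the question.
tau1 = [([1, 0, 2], 0, 0.1), ([0, 2, 3], 1, 0.1)]
tau2 = [([1, 1, 2], 1, 0.0), ([4, 1, 0], 0, 0.1)]

grad = policy_gradient(theta, [tau1, tau2])
new_theta = update(theta, grad, lr=0.1)  # lr = 0.1 is an illustrative choice
print(grad)
print(new_theta)
```

Under these assumptions the gradient estimate comes out to roughly [-2.05, -2.6, -5.1], and one step with the illustrative learning rate moves the parameters to roughly [0.795, 0.74, 2.49]. Calling `policy_gradient(theta, [tau1, tau2], reward_to_go=True)` or passing a nonzero `baseline` shows how the two variance-reduction options change the estimate.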