
Problem 3 REINFORCE: MC Policy-Gradient Control (4pt)

Suppose that we use the softmax policy function parameterized by $\theta$:

$$\pi_\theta(s, a) = \frac{e^{\phi(s, a)^\top \theta}}{\sum_{k=1}^{K} e^{\phi(s, a_k)^\top \theta}},$$

where $\phi(s, a)$ is a feature vector of a state-action pair $(s, a)$. Initially, $\theta$ is set to $[0.6, 0.4]^\top$. Now, you are given the feature vectors of state-action pairs as follows:

$$\phi(S_1, A_1) = (1, 1), \quad \phi(S_1, A_2) = (1, 2), \quad \phi(S_2, A_1) = (2, 3), \quad \phi(S_2, A_2) = (2, 1).$$

When you experience the episode $S_2, A_2, 5, S_1, A_2, 20, S_3$ with a step size $\alpha$ of 0.02 and a discount factor $\gamma$ of 0.5, how is the policy parameter $\theta$ updated? Answer $\theta$ and $\pi_\theta(S_2, A_2)$ after every update. Note that $\nabla_\theta \ln \pi_\theta(s, a)$ of the softmax policy is:

$$\nabla_\theta \ln \pi_\theta(s, a) = \phi(s, a) - \mathbb{E}_{\pi_\theta}[\phi(s, \cdot)] = \phi(s, a) - \sum_{a'} \pi_\theta(a' \mid s)\, \phi(s, a').$$

Step by Step Solution
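A sketch of how the updates go, assuming the plain per-step REINFORCE rule $\theta \leftarrow \theta + \alpha\, G_t\, \nabla_\theta \ln \pi_\theta(S_t, A_t)$; some courses use the Sutton and Barto form with an extra $\gamma^t$ factor in the step, which leaves the first update unchanged but halves the second one here.

The episode has two decision steps, $(S_2, A_2)$ followed by reward 5 and $(S_1, A_2)$ followed by reward 20, so with $\gamma = 0.5$ the returns are

$$G_1 = 20, \qquad G_0 = 5 + 0.5 \cdot 20 = 15.$$

For the first update, the action preferences at $S_2$ under $\theta_0 = [0.6, 0.4]^\top$ are $\phi(S_2, A_1)^\top \theta_0 = 2.4$ and $\phi(S_2, A_2)^\top \theta_0 = 1.6$, so $\pi_\theta(A_1 \mid S_2) \approx 0.690$ and $\pi_\theta(A_2 \mid S_2) \approx 0.310$. Plugging into the given gradient:

$$\nabla_\theta \ln \pi_\theta(S_2, A_2) = (2, 1) - \big[0.690 \cdot (2, 3) + 0.310 \cdot (2, 1)\big] \approx (0, -1.380),$$

$$\theta \leftarrow [0.6, 0.4]^\top + 0.02 \cdot 15 \cdot (0, -1.380) \approx [0.6, -0.014]^\top.$$

The second update repeats the same computation at $(S_1, A_2)$ with $G_1 = 20$ and the new $\theta$; the code sketch below carries out both steps numerically.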

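A minimal Python sketch of the full computation, under the same assumption about the update rule. The feature table and helper names (`policy`, `grad_log_pi`) are illustrative, not part of the problem:

```python
import numpy as np

# Feature vectors phi(s, a) from the problem statement.
phi = {
    ("S1", "A1"): np.array([1.0, 1.0]),
    ("S1", "A2"): np.array([1.0, 2.0]),
    ("S2", "A1"): np.array([2.0, 3.0]),
    ("S2", "A2"): np.array([2.0, 1.0]),
}
actions = ["A1", "A2"]

def policy(state, theta):
    """Softmax policy pi_theta(. | state) over the two actions."""
    prefs = np.array([phi[(state, a)] @ theta for a in actions])
    e = np.exp(prefs - prefs.max())  # subtract max for numerical stability
    return e / e.sum()

def grad_log_pi(state, action, theta):
    """grad_theta ln pi = phi(s, a) - sum_a' pi(a'|s) * phi(s, a')."""
    probs = policy(state, theta)
    expected_phi = sum(p * phi[(state, a)] for p, a in zip(probs, actions))
    return phi[(state, action)] - expected_phi

# Episode as (S_t, A_t, R_{t+1}) triples; S3 is terminal.
episode = [("S2", "A2", 5.0), ("S1", "A2", 20.0)]
alpha, gamma = 0.02, 0.5

# Returns G_t = R_{t+1} + gamma * G_{t+1}: G_1 = 20, G_0 = 5 + 0.5 * 20 = 15.
G, returns = 0.0, []
for _, _, r in reversed(episode):
    G = r + gamma * G
    returns.insert(0, G)

theta = np.array([0.6, 0.4])
for t, ((s, a, _), G_t) in enumerate(zip(episode, returns)):
    # Plain REINFORCE step; multiply by gamma**t here as well if your
    # course uses the discounted (Sutton & Barto) form of the update.
    theta = theta + alpha * G_t * grad_log_pi(s, a, theta)
    print(f"t={t}: theta = {theta.round(3)}, "
          f"pi(A2|S2) = {policy('S2', theta)[1]:.3f}")
```

Under these assumptions it prints $\theta \approx [0.6, -0.014]$ with $\pi_\theta(A_2 \mid S_2) \approx 0.507$ after the first update, and $\theta \approx [0.6, 0.187]$ with $\pi_\theta(A_2 \mid S_2) \approx 0.407$ after the second; with the $\gamma^t$ variant the second step is halved, giving $\theta \approx [0.6, 0.087]$.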
