Question: If we model the policy as a softmax over some action preferences that do not explicitly model the state-action values, and run a policy gradient algorithm (for example, REINFORCE) to update it, then, if the policy gradient converges, is it true that these preferences match the optimal state-action values, i.e. the
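One way to probe the question is a small numerical sketch. The example below is only an illustration under stated assumptions: a one-state problem (a two-armed bandit) with hypothetical action values `q = [1.0, 0.0]`, a hypothetical step size of 0.5, and the exact expected REINFORCE update instead of sampled returns, so that the run is deterministic. None of these names or numbers come from the question itself.

```python
import math


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(h - m) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def expected_pg_step(prefs, q, lr):
    """One exact (expected) policy-gradient step on a one-state problem.

    For a softmax policy, dJ/dh_a = pi(a) * (q(a) - J), where J is the
    expected reward under the current policy. This is the expectation of
    the REINFORCE update, used here in place of sampling (an assumption).
    """
    pi = softmax(prefs)
    value = sum(p * r for p, r in zip(pi, q))
    return [h + lr * p * (r - value) for h, p, r in zip(prefs, pi, q)]


# Hypothetical bandit: action 0 is worth 1.0, action 1 is worth 0.0.
q = [1.0, 0.0]
prefs = [0.0, 0.0]
for _ in range(5000):
    prefs = expected_pg_step(prefs, q, lr=0.5)

policy = softmax(prefs)
gap = prefs[0] - prefs[1]  # compare with the value gap q[0] - q[1] = 1.0

# Adding a constant to every preference leaves the policy unchanged.
shifted_policy = softmax([h + 10.0 for h in prefs])

print("preferences:", prefs)
print("policy:", policy)
print("preference gap:", gap)
print("policy after shifting preferences:", shifted_policy)
```

In this sketch the policy moves toward the better action, but the preference gap keeps growing past the value gap of 1, and shifting all preferences by a constant does not change the policy. Under these assumptions, the learned preferences are not tied to the action values.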
