Question:

Suppose we model the policy as a soft-max over some action preferences that do not explicitly model the state-action values, and we run a policy gradient algorithm (for example, REINFORCE) to update it. If the policy gradient converges, is it true that these preferences match the optimal state-action values, i.e. the ... ?
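
To make the setup concrete, here is a minimal sketch of a soft-max policy over action preferences, updated with a REINFORCE-style rule. Everything specific in it is an assumption made for illustration and is not part of the question: the single-state (bandit) setting, the reward means in `true_means`, the Gaussian reward noise, the step size `alpha`, and the function names. It uses no baseline and is not meant as a definitive implementation.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into probabilities (shifted for numerical stability)."""
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(true_means, steps=5000, alpha=0.1, seed=0):
    """REINFORCE on a hypothetical one-state problem with a soft-max policy.

    true_means: assumed mean reward of each action (illustrative values only).
    Returns the learned preferences and the resulting action probabilities.
    """
    rng = random.Random(seed)
    prefs = [0.0] * len(true_means)  # preferences start at zero
    for _ in range(steps):
        probs = softmax(prefs)
        action = rng.choices(range(len(prefs)), weights=probs)[0]
        reward = rng.gauss(true_means[action], 1.0)  # assumed unit-variance noise
        # For a soft-max policy, d log pi(action) / d h_b = 1[b == action] - pi(b).
        for b in range(len(prefs)):
            grad_log_pi = (1.0 if b == action else 0.0) - probs[b]
            prefs[b] += alpha * reward * grad_log_pi
    return prefs, softmax(prefs)


true_means = [1.0, 2.0]  # hypothetical action values
prefs, probs = reinforce_bandit(true_means)
print("preferences:", prefs)
print("policy:", probs)
print("sum of preferences:", sum(prefs))
```

Two properties of this sketch are worth checking when comparing preferences with action values. The per-step gradient terms sum to zero over actions, so the sum of the preferences stays at its initial value (zero here, up to rounding), whatever the rewards are. Also, adding the same constant to every preference leaves the soft-max policy unchanged, so the policy alone can pin the preferences down only up to such a shift.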
