Question:
(a) Why does AlphaGo use a separate policy network and a separate value network? [1.0 M]
(b) How does MCTS ensure that an action with the highest value is found in real time? If the best action can be selected by MCTS alone, why is any prior learning of Q(s, a) required? [M]
(c) We have learned that supervised learning, which learns from samples drawn from a given distribution, does not capture the online nature of interactions required for reinforcement learning very well.
    (i) Why does AlphaGo use supervised learning to learn the initial policy and even further [M]
    (ii) In what ways are the shortcomings of supervised learning mitigated in AlphaGo? [M]
(d) How does DQN handle the challenges referred to in part (c) of this question?
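As background for part (b), the following is a minimal sketch of a PUCT-style selection step of the kind used in AlphaGo-style MCTS, where each action is scored by its current value estimate Q(s, a) plus an exploration bonus proportional to the policy prior P(s, a) and shrinking with the visit count N(s, a). The function name `select_action`, the list-based inputs, and the default `c_puct=1.0` are illustrative assumptions, not part of the question, and the sketch covers only the selection step, not a full search.

```python
import math


def select_action(prior, q_value, visit_count, c_puct=1.0):
    """Return the index of the action maximizing Q(s, a) + U(s, a).

    prior[a]       -- P(s, a), prior probability from a policy network (assumed given)
    q_value[a]     -- Q(s, a), mean value of simulations through action a
    visit_count[a] -- N(s, a), number of simulations through action a
    c_puct         -- exploration constant (hypothetical default)
    """
    total_visits = sum(visit_count)
    best_action, best_score = 0, -math.inf
    for a in range(len(prior)):
        # Exploration bonus: large for high-prior, rarely visited actions.
        u = c_puct * prior[a] * math.sqrt(total_visits) / (1 + visit_count[a])
        score = q_value[a] + u
        if score > best_score:
            best_action, best_score = a, score
    return best_action


# Action 0 has the best Q so far, but action 1 is unvisited with a decent
# prior, so its exploration bonus dominates and it is tried next.
print(select_action([0.6, 0.3, 0.1], [0.1, 0.0, 0.0], [10, 0, 0]))
```

With equal priors and equal visit counts the bonus is the same for every action, so the choice falls back to the action with the highest Q(s, a). This illustrates why learned priors and values matter: they steer a limited simulation budget toward promising moves instead of searching uniformly.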
