Question: ( a ) Why AlphaGo use a separate policy network and a separate value network? [ 1 . 0 M ] ( b ) How

(a) Why AlphaGo use a separate policy network and a separate value network? [1.0 M]
(b) How does the MCTS ensure an action with the highest value is found in real-time? If the
best action can be selected only by MCTS, why is any prior learning of Q(s,a) required?
[2.0 M]
(c) We have learned that Supervised Learning that learns with samples from a given
distribution does not capture the online nature of interactions as required for
reinforcement learning quite well.
(i) Why does AlphaGo use supervised learning to learn the initial policy ( and even
further)?[1.5 M]
(ii) In what ways the shortcomings of supervised learning are mitigated in AlphaGo?
[2.0 M]
(d) How does DQN handle the challenges referred to in the c part of this question?

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Programming Questions!