Question:

Consider the MCTS example below. Each node here is a state, the connections between nodes are actions, and transitions are deterministic.
Each node shows two numbers: the number of rollouts ending in success / the total number of rollouts experienced from that node. The value of a node is the ratio of these two numbers (i.e., an estimate of the probability of success). In the example below, the value of the root node is 0.5.
If an action from a particular node has not been experienced, we show it ending in a black dot. Because we start with an empty root node, the number of rollouts at the root is simply the sum of the rollouts of all its children.
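The bookkeeping described above can be sketched in a few lines of Python. The figure is not reproduced here, so the counts below are hypothetical; they were chosen only so that the root value comes out to 0.5, as stated in the question. The `Node` class and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical MCTS node holding (successes / total rollouts)."""
    wins: int = 0
    visits: int = 0
    children: List["Node"] = field(default_factory=list)

    @property
    def value(self) -> float:
        # Value is the success ratio; an unvisited node is given 0.0 here.
        return self.wins / self.visits if self.visits else 0.0


# Hypothetical counts, NOT taken from the figure.
left = Node(wins=2, visits=3)
right = Node(wins=1, visits=3)

# The root starts empty, so its statistics are the sums over its children.
root = Node(
    wins=left.wins + right.wins,
    visits=left.visits + right.visits,
    children=[left, right],
)

print(root.wins, root.visits, root.value)
```

With these assumed counts the root holds 3/6, so its value is 0.5 and its rollout count equals the sum over its children.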
Each state below is indexed with a number at its top left.
Now, assume that a greedy tree policy is used to traverse the tree to select the next node to expand, and that a simulation is run from that node, resulting in success (+1). Compute the values of each of the nodes 1-5 after the tree is updated by this return being backed up.
Think about how the value will get propagated back up the tree.
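To make the propagation concrete, here is a minimal Python sketch of greedy selection, expansion, and backup. Because the figure is not available, the tree and all counts are hypothetical, so the printed numbers are not the answer to the exercise. The names `Node`, `select`, `expand`, `backup`, and the `untried` field (the number of black dots at a node) are assumptions made for illustration, and ties in the greedy choice are broken by taking the first child.

```python
class Node:
    """Hypothetical MCTS node: wins / visits, plus a count of untried actions."""

    def __init__(self, wins=0, visits=0, untried=0, parent=None):
        self.wins = wins
        self.visits = visits
        self.untried = untried  # number of black dots at this node
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    @property
    def value(self):
        return self.wins / self.visits if self.visits else 0.0


def select(node):
    # Greedy tree policy: descend to the highest-value child until we
    # reach a node that still has an untried action (or has no children).
    while node.untried == 0 and node.children:
        node = max(node.children, key=lambda c: c.value)
    return node


def expand(node):
    # Turn one untried action (black dot) into a new, empty child node.
    node.untried -= 1
    return Node(parent=node)


def backup(node, reward):
    # Every node on the path from the new leaf to the root gets one more
    # rollout; the success count grows by the reward (1 for success).
    while node is not None:
        node.visits += 1
        node.wins += reward
        node = node.parent


# Hypothetical tree, NOT the one in the figure.
root = Node(wins=3, visits=6)
a = Node(wins=2, visits=3, parent=root)
b = Node(wins=1, visits=3, untried=1, parent=root)
c = Node(wins=1, visits=1, untried=1, parent=a)
d = Node(wins=0, visits=1, untried=1, parent=a)

leaf = expand(select(root))  # greedy path: root -> a -> c, then expand c
backup(leaf, reward=1)       # the simulation ended in success (+1)

for name, n in [("root", root), ("a", a), ("b", b), ("c", c), ("d", d), ("new", leaf)]:
    print(name, f"{n.wins}/{n.visits}")
```

In this assumed tree only the nodes on the selected path change: the root goes from 3/6 to 4/7, `a` from 2/3 to 3/4, `c` from 1/1 to 2/2, and the new leaf becomes 1/1. Nodes off the path (`b` and `d`) keep their old counts, which is the pattern to look for when working through the actual figure.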
