Question:

(25 points) We have argued in class that a deterministic policy may result in extremely suboptimal outcomes.
This is especially true when the state transition model is adversarial, i.e. the next state is chosen by an
adversary who wants to minimize your reward. Consider the game of scissors/paper/stone played repeatedly
(infinitely many times). At each turn, Player 1 and Player 2 pick either "scissors", "paper", or "stone". The
states and rewards are given as state : reward below:
(scissors, paper) : 1
(scissors, stone) : -1
(paper, scissors) : -1
(paper, stone) : 1
(stone, paper) : -1
(stone, scissors) : 1
When Player 2 picks the same action as Player 1, Player 1's reward is 0. Player 1 picks the left-hand action of
the tuple, whereas Player 2 picks the right-hand action.
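For concreteness, the table can be read as a reward function for Player 1. A minimal Python sketch (the names REWARDS and reward are illustrative, not part of the question):

    # Player 1's reward for a single round, following the state : reward
    # table above; states are (Player 1 action, Player 2 action) pairs.
    REWARDS = {
        ("scissors", "paper"): 1,
        ("scissors", "stone"): -1,
        ("paper", "scissors"): -1,
        ("paper", "stone"): 1,
        ("stone", "paper"): -1,
        ("stone", "scissors"): 1,
    }

    def reward(p1_action, p2_action):
        """Player 1's reward; 0 whenever both players pick the same action."""
        return REWARDS.get((p1_action, p2_action), 0)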
(a) (10 points) Suppose that Player 1 follows a deterministic policy to pick their next move at time t.
Assuming that Player 2 observes everything that Player 1 does and knows the policy that Player 1 uses,
what is the optimal (reward-maximizing) policy for Player 2 to follow?
(b) (15 points) What is the optimal randomized policy for Player 1, assuming that Player 2 will choose their
action adversarially? (i.e. to minimize Player 1's reward, under the worst-case assumption that Player 2
knows exactly Player 1's policy)

Question: (10 points) Consider the following modification of the greedy algorithm for learning a regret-minimizing
policy in the expert systems domain. Instead of picking an action out of the set of best-performing actions
so far (as is the case for the simple greedy algorithm), we pick the best performing action in the previous k
time-steps; if there is more than one k-best performing action, we pick the one with the lowest index. For
example, if k=1, we simply pick any one of the actions that offered the highest reward in the previous
round; if k=3, we pick the action that had the highest cumulative reward in the last three rounds, etc. More
formally, the k-greedy algorithm works as follows. At time t, let S_t be the set of actions that had maximal
total reward at time steps t − k, …, t − 1; we pick the action with the lowest index out of S_t (if t ≤ k, we run
the original greedy algorithm). What is the worst-case regret of the k-greedy algorithm? You must provide a
formal proof of its regret guarantees with respect to the best action.
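A short Python sketch of the selection rule just described may help pin down the tie-breaking and the fallback when fewer than k rounds have been played (the function name and data layout are illustrative assumptions, not part of the question):

    def k_greedy_choice(history, k, n_actions):
        """history[i][a] is the reward action a earned in past round i.
        Pick the lowest-indexed action with maximal total reward over the
        last k rounds; with fewer than k past rounds, use all of them
        (i.e. fall back to the original greedy rule)."""
        window = history[-k:]  # last k rounds, or all rounds if fewer exist
        if not window:
            return 0  # no history yet: pick the first action by convention
        totals = [sum(round_rewards[a] for round_rewards in window)
                  for a in range(n_actions)]
        best = max(totals)
        # break ties toward the lowest index, as the question specifies
        return min(a for a in range(n_actions) if totals[a] == best)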