Question:

(25 points) We have argued in class that a deterministic policy may result in extremely suboptimal outcomes.
This is especially true when the state transition model is adversarial, i.e. the next state is chosen by an
adversary who wants to minimize your reward. Consider the game of scissors/paper/stone played repeatedly
(infinitely many times). At each turn, Player 1 and Player 2 pick either "scissors", "paper", or "stone". The
states and rewards are given as state : reward below:
(scissors, paper) : 1
(scissors, stone) : -1
(paper, scissors) : -1
(paper, stone) : 1
(stone, paper) : -1
(stone, scissors) : 1
When Player 2 picks the same action as Player 1, Player 1's reward is 0. Player 1 picks the left-hand action of
the tuple, whereas Player 2 picks the right-hand action.
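For concreteness, the table can be read as a reward function for Player 1. A minimal Python sketch (the names REWARDS and reward are illustrative, not part of the question):

    # Player 1's reward for a single round, following the state : reward
    # table above; states are (Player 1 action, Player 2 action) pairs.
    REWARDS = {
        ("scissors", "paper"): 1,
        ("scissors", "stone"): -1,
        ("paper", "scissors"): -1,
        ("paper", "stone"): 1,
        ("stone", "paper"): -1,
        ("stone", "scissors"): 1,
    }

    def reward(p1_action, p2_action):
        """Player 1's reward; 0 whenever both players pick the same action."""
        return REWARDS.get((p1_action, p2_action), 0)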
(a) (10 points) Suppose that Player 1 follows a deterministic policy to pick their next move at time t.
Assuming that Player 2 observes everything that Player 1 does and knows the policy that Player 1 uses,
what is the optimal (reward-maximizing) policy for Player 2 to follow?
(b) (15 points) What is the optimal randomized policy for Player 1, assuming that Player 2 will choose their
action adversarially? (i.e. to minimize Player 1's reward, under the worst-case assumption that Player 2
knows exactly Player 1's policy)

Question: (10 points) Consider the following modification of the greedy algorithm for learning a regret-minimizing
policy in the expert systems domain. Instead of picking an action out of the set of best-performing actions
so far (as is the case for the simple greedy algorithm), we pick the best performing action in the previous k
time-steps; if there is more than one k-best performing action, we pick the one with the lowest index. For
example, if k=1, we simply pick any one of the actions that offered the highest reward in the previous
round; if k=3, we pick the action that had the highest cumulative reward in the last three rounds, etc. More
formally, the k-greedy algorithm works as follows. At time t, let S_t be the set of actions that had maximal
total reward at time steps t − k, …, t − 1; we pick the action with the lowest index out of S_t (if t ≤ k, we run
the original greedy algorithm). What is the worst-case regret of the k-greedy algorithm? You must provide a
formal proof of its regret guarantees with respect to the best action.
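A short Python sketch of the selection rule just described may help pin down the tie-breaking and the fallback when fewer than k rounds have been played (the function name and data layout are illustrative assumptions, not part of the question):

    def k_greedy_choice(history, k, n_actions):
        """history[i][a] is the reward action a earned in past round i.
        Pick the lowest-indexed action with maximal total reward over the
        last k rounds; with fewer than k past rounds, use all of them
        (i.e. fall back to the original greedy rule)."""
        window = history[-k:]  # last k rounds, or all rounds if fewer exist
        if not window:
            return 0  # no history yet: pick the first action by convention
        totals = [sum(round_rewards[a] for round_rewards in window)
                  for a in range(n_actions)]
        best = max(totals)
        # break ties toward the lowest index, as the question specifies
        return min(a for a in range(n_actions) if totals[a] == best)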