Question:

Step 1
We start in the START state (in the rotunda), and we have four action options that represent the four paths that we can take through the caves: "Gold Vault", "Escape Path", "Cave Troll", and "Beer Cellar". Because our initial value estimate of Q(start, Gold Vault)=4 is greater than our initial estimates of Q(start, Escape Path)=2, Q(start, Cave Troll)=1, and Q(start, Beer Cellar)=3, we choose the action "Gold Vault". We move to the state s' = "in vault", and upon seeing the dragon in the gold vault (SCARY!) we receive a reward of -7 (which was not quite what we expected!).
Next, we consider which action to perform from the state "in vault". The Q-value estimates we have for these state-action pairs are:
Q(In Vault, Fight Dragon)=2
Q(In Vault, RUN AWAY!)=1
Given that we love a battle, we see that the highest Q value (i.e. max_{a'} Q(s', a')) is given by Q(In Vault, Fight Dragon)=2. We thus update our initial Q value for choosing to go into the Gold Vault like so:
prediction error =[-7+2]-4=-9
Q(start, Gold Vault)=4+(-9)=-5
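Both updates in this walkthrough are instances of the standard tabular Q-learning rule, written here for reference (with the learning rate \alpha and the discount \gamma both equal to 1, as stipulated in question 5 below):

prediction error = [r + \gamma max_{a'} Q(s', a')] - Q(s, a)
Q(s, a) <- Q(s, a) + \alpha * prediction error

In Step 1 above, r = -7, max_{a'} Q(s', a') = Q(In Vault, Fight Dragon) = 2, and the old estimate Q(start, Gold Vault) = 4, which gives the prediction error of -9 and the updated value of -5.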
Step 2
Now we are in the state "in vault", and we have two action options: "fight dragon" and "RUN AWAY!". Because our current estimate of Q(in vault, fight dragon)=2 is greater than our current estimate of Q(in vault, RUN AWAY!)=1, we choose to "fight the dragon". This moves us to the terminal state "end of battle" (a state in which there are no further actions we can take), and gives a reward of -10. (That dragon sure messed you up good!)
Note: When you are updating Q(s, a) after moving from state s to a terminal state s', then max_{a'} Q(s', a')=0 because there are no further possible actions to take in s'. There are no further actions available once you have chosen to fight the dragon, so the value of this term is 0.
We thus update our Q value like so:
prediction error =[-10+0]-2=-12
Q(in vault, fight dragon)=2+(-12)=-10
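If you want to check these hand calculations in code, here is a minimal Python sketch of the same update rule (the function name q_update and the dictionary-based Q table are illustrative choices, not part of the original exercise):

def q_update(Q, s, a, reward, next_action_values, alpha=1.0, gamma=1.0):
    # Best available Q(s', a'); an empty list means s' is terminal, so use 0.
    best_next = max(next_action_values) if next_action_values else 0.0
    prediction_error = (reward + gamma * best_next) - Q[(s, a)]
    Q[(s, a)] += alpha * prediction_error
    return Q[(s, a)]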
We define an "iteration" as starting at the START node and reaching a terminal node. After each iteration, you go back to the START state. After this first iteration, here are the new, updated Q values, which reflect what you learned based on the actions you took this time around:
Q(start, Gold Vault)=-5
Q(start, Escape Path)=2
Q(start, Cave Troll)=1
Q(start, Beer Cellar)=3
Q(in vault, Fight Dragon)=-10
Q(in vault, RUN AWAY!)=1
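Starting from the initial Q values given in the walkthrough and applying the q_update sketch above to the two moves of this first iteration reproduces the same numbers:

Q = {("start", "Gold Vault"): 4, ("start", "Escape Path"): 2,
     ("start", "Cave Troll"): 1, ("start", "Beer Cellar"): 3,
     ("in vault", "Fight Dragon"): 2, ("in vault", "RUN AWAY!"): 1}

# Step 1: start --"Gold Vault"--> "in vault", reward -7
q_update(Q, "start", "Gold Vault", -7,
         [Q[("in vault", "Fight Dragon")], Q[("in vault", "RUN AWAY!")]])  # returns -5.0

# Step 2: "in vault" --"Fight Dragon"--> terminal "end of battle", reward -10
q_update(Q, "in vault", "Fight Dragon", -10, [])  # returns -10.0

Passing an empty list for the terminal state is what makes max_{a'} Q(s', a') equal 0, matching the note in Step 2.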
(The context for the questions is in the photos)
1. Using the Q values that you learned after the FIRST iteration, record the updated Q values after the SECOND iteration below:
Q(start, Gold Vault)=
Q(start, Escape Path)=
Q(start, Cave Troll)=
Q(start, Beer Cellar)=
Q(in cellar, Have a mead)=
Q(in cellar, Have a pint)=
Hint: When you transition from (s,a) to (s',a'), you'll only update Q(s,a) to reflect what you learned after performing your chosen action and moving to the next state. Not every Q value gets updated every time!
2. Compare the first iteration with the second iteration, and consider what did and didn't change. Which of the following is true?
Some of the Q values change. True or False
The rewards change. True or False
The actions available from the Start state change. True or False
3. Using the new Q values from the SECOND iteration, run a THIRD iteration of the simulation and report the latest updated Q values below:
Q(start, Gold Vault)=
Q(start, Escape Path)=
Q(start, Cave Troll)=
Q(start, Beer Cellar)=
Q(in cellar, Have a mead)=
Q(in cellar, Have a pint)=
4. Using the new Q values from the THIRD iteration, run a FOURTH (and final) iteration of the simulation.
Now select from the choice below the latest Q values for the following state/action pairs:
Q(start, Gold Vault)
Q(start, Escape Path)
Q(start, Cave Troll)
Q(start, Beer Cellar)
a) -5, 5, 1, 1
b) -5, 5, 1, -1
c) -10, 1, -2, -1
d) 4, 2, 1, 3
5. For this RL simulation, we stipulated that \alpha =1 and that \gamma =1.
But let's imagine (just for this question) that when you chose 'Have a mead' during the 2nd iteration, drinking the mead changed your learning rate, so now \alpha =0.5 while the discount factor remains the same: \gamma =1. What effect would this have on your Q-learning and updating process in later iterations?
a) You would learn more slowly and make small changes to your value predictions.
b) You would learn quickly and make large changes to your value predictions.
c) You would care about future reward much less than present reward.
6. Which of the following claims is/are true about a value function?
A. A value function maps from a state to the actual reward received in that state.
B. A value function is a prediction about future discounted cumulative reward.
C. A value function can be represented as a function Q(s, a) that maps a state and action pair to a predicted future (discounted) sum of rewards.
D. B & C.
E. A, B, & C.