Question:

Step 1
We start in the START state (in the rotunda), and we have four action options that represent the four paths that we can take through the caves: "Gold Vault", "Escape Path", "Cave Troll", and "Beer Cellar". Because our initial value estimate of Q(start, Gold Vault)=4 is greater than our initial estimates of Q(start, Escape Path)=2, Q(start, Cave Troll)=1, and Q(start, Beer Cellar)=3, we choose the action "Gold Vault". We move to the state s' = "in vault", and upon seeing the dragon in the gold vault (SCARY!) we receive a reward of -7 (which was not quite what we expected!).
Next, we consider which action to perform from the state "in vault". The Q-value estimates we have for these state-action pairs are:
Q(In Vault, Fight Dragon)=2
Q(In Vault, RUN AWAY!)=1
Given that we love a battle, we see that the highest Q value (i.e. max_{a'} Q(s', a')) is given by Q(In Vault, Fight Dragon)=2. We thus update our initial Q value for choosing to go into the Gold Vault like so:
prediction error =[-7+2]-4=-9
Q(start, Gold Vault)=4+(-9)=-5
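Both updates in this walkthrough are instances of the standard tabular Q-learning rule, written here for reference (with the learning rate \alpha and the discount \gamma both equal to 1, as stipulated in question 5 below):

prediction error = [r + \gamma max_{a'} Q(s', a')] - Q(s, a)
Q(s, a) <- Q(s, a) + \alpha * prediction error

In Step 1 above, r = -7, max_{a'} Q(s', a') = Q(In Vault, Fight Dragon) = 2, and the old estimate Q(start, Gold Vault) = 4, which gives the prediction error of -9 and the updated value of -5.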
Step 2
Now we are in the state "in vault", and we have two action options: "fight dragon" and "RUN AWAY!". Because our current estimate of Q(in vault, fight dragon)=2 is greater than our current estimate of Q(in vault, RUN AWAY!)=1, we choose to "fight the dragon". This moves us to the terminal state "end of battle" (a state in which there are no further actions we can take), and gives a reward of -10. (That dragon sure messed you up good!)
Note: When you are updating Q(s, a) after moving from state s to a terminal state s', then max_{a'} Q(s', a')=0 because there are no further possible actions to take in s'. There are no further actions available once you have chosen to fight the dragon, so the value of this term is 0.
We thus update our Q value like so:
prediction error =[-10+0]-2=-12
Q(in vault, fight dragon)=2+(-12)=-10
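If you want to check these hand calculations in code, here is a minimal Python sketch of the same update rule (the function name q_update and the dictionary-based Q table are illustrative choices, not part of the original exercise):

def q_update(Q, s, a, reward, next_action_values, alpha=1.0, gamma=1.0):
    # Best available Q(s', a'); an empty list means s' is terminal, so use 0.
    best_next = max(next_action_values) if next_action_values else 0.0
    prediction_error = (reward + gamma * best_next) - Q[(s, a)]
    Q[(s, a)] += alpha * prediction_error
    return Q[(s, a)]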
We define an "iteration" as starting at the START node and reaching a terminal node. After each iteration, you go back to the START state. After this first iteration, here are the new, updated Q values, which reflect what you learned based on the actions you took this time around:
Q(start, Gold Vault)=-5
Q(start, Escape Path)=2
Q(start, Cave Troll)=1
Q(start, Beer Cellar)=3
Q(in vault, Fight Dragon)=-10
Q(in vault, RUN AWAY!)=1
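Starting from the initial Q values given in the walkthrough and applying the q_update sketch above to the two moves of this first iteration reproduces the same numbers:

Q = {("start", "Gold Vault"): 4, ("start", "Escape Path"): 2,
     ("start", "Cave Troll"): 1, ("start", "Beer Cellar"): 3,
     ("in vault", "Fight Dragon"): 2, ("in vault", "RUN AWAY!"): 1}

# Step 1: start --"Gold Vault"--> "in vault", reward -7
q_update(Q, "start", "Gold Vault", -7,
         [Q[("in vault", "Fight Dragon")], Q[("in vault", "RUN AWAY!")]])  # returns -5.0

# Step 2: "in vault" --"Fight Dragon"--> terminal "end of battle", reward -10
q_update(Q, "in vault", "Fight Dragon", -10, [])  # returns -10.0

Passing an empty list for the terminal state is what makes max_{a'} Q(s', a') equal 0, matching the note in Step 2.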
(The context for the questions is in the photos)
1. Using the Q values that you learned after the FIRST iteration, record the updated Q values after the SECOND iteration below:
Q(start, Gold Vault)=
Q(start, Escape Path)=
Q(start, Cave Troll)=
Q(start, Beer Cellar)=
Q(in cellar, Have a mead)=
Q(in cellar, Have a pint)=
Hint: When you transition from (s,a) to (s',a'), you'll only update Q(s,a) to reflect what you learned after performing your chosen action and moving to the next state. Not every Q value gets updated every time!
2. Compare the first iteration with the second iteration, and consider what did and didn't change. Which of the following is true?
Some of the Q values change. True or False
The rewards change. True or False
The actions available from the Start state change. True or False
3. Using the new Q values from the SECOND iteration, run a THIRD iteration of the simulation and report the latest updated Q values below:
Q(start, Gold Vault)=
Q(start, Escape Path)=
Q(start, Cave Troll)=
Q(start, Beer Cellar)=
Q(in cellar, Have a mead)=
Q(in cellar, Have a pint)=
4. Using the new Q values from the THIRD iteration, run a FOURTH (and final) iteration of the simulation.
Now select from the choice below the latest Q values for the following state/action pairs:
Q(start, Gold Vault)
Q(start, Escape Path)
Q(start, Cave Troll)
Q(start, Beer Cellar)
a) -5, 5, 1, 1
b) -5, 5, 1, -1
c) -10, 1, -2, -1
d) 4, 2, 1, 3
5. For this RL simulation, we stipulated that \alpha =1 and that \gamma =1.
But let's imagine (just for this question) that when you chose 'Have a mead' during the 2nd iteration, drinking the mead changed your learning rate, so now \alpha =0.5 while the discount factor remains the same: \gamma =1. What effect would this have on your Q-learning and updating process in later iterations?
a) You would learn more slowly and make small changes to your value predictions.
b) You would learn quickly and make large changes to your value predictions.
c) You would care about future reward much less than present reward.
6. Which of the following claims is/are true about a value function?
A. A value function maps from a state to the actual reward received in that state.
B. A value function is a prediction about future discounted cumulative reward.
C. A value function can be represented as a function Q(s, a) that maps a state and action pair to a predicted future (discounted) sum of rewards.
D. B & C.
E. A, B, & C.