Question 1: Q-Learning [35 Points]

This time, although the Gridworld looks similar, it is not an MDP anymore. That means the only information you get from the game object is game.get_actions(state: State), which returns a set of Actions. All other methods have been removed.
You will also see that the GUI adds a new episode label. Each iteration is a single update, and each episode is a collection of iterations such that the agent starts from the starting state at the beginning of an episode and reaches a terminal state at the end of an episode.
A stub of a Q-learner is specified in QLearningAgent. A QLearningAgent takes in:
- game, an object used to get the set of available actions for a state s
- discount, the discount factor gamma
- learning_rate, the learning rate alpha
- explore_prob, the probability of exploration for each iteration
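For reference, here is a minimal sketch of what the constructor might store, assuming a dictionary-based Q-table; the attribute names below are illustrative choices, not dictated by the skeleton:

    from collections import defaultdict

    def __init__(self, game, discount, learning_rate, explore_prob):
        self.game = game                    # provides get_actions(state)
        self.discount = discount            # gamma
        self.learning_rate = learning_rate  # alpha
        self.explore_prob = explore_prob    # epsilon
        # defaultdict(float) makes never-seen (state, action) pairs default to 0.0
        self.q_table = defaultdict(float)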
Note that the states in Gridworld aren't given in advance either. Your code should be able to deal with a new state and a new state-action pair.
Then you need to implement the following methods for this part:
- agent.get_q_value(state, action) returns the Q-value Q(s, a) for the state-action pair, looked up from the Q-table. For a never-seen pair, the Q-value should be 0.
- agent.get_value(state) returns the value of the state, V(s).
- agent.get_best_policy(state) returns the best policy (action) for the state, π(s).
- agent.update(state, action, next_state, reward) updates the Q-value for the state-action pair, based on the value of next_state and the reward given.
Note: For get_best_policy, you should break ties randomly for better behavior. The random.choice function will help.
For more instructions, refer to the lecture slides and comments in the skeleton file.
Important: Make sure that in your get_value and get_best_policy functions, you only access Q-values by calling get_q_value. This abstraction will be useful for the Approximate Q-learning you will implement later, when you override get_q_value to use features of state-action pairs rather than the state-action pairs directly.
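For illustration only, here is one way get_value and get_best_policy could be written so that they only touch Q-values through get_q_value, with tie-breaking as described in the note above. This is a sketch rather than the required solution; self.game is assumed to be the stored game object.

    import random

    def get_value(self, state):
        # V(s) = max_a Q(s, a); assumption: a state with no available actions is worth 0
        actions = self.game.get_actions(state)
        if not actions:
            return 0.0
        return max(self.get_q_value(state, a) for a in actions)

    def get_best_policy(self, state):
        # pi(s) = argmax_a Q(s, a), breaking ties uniformly at random
        actions = list(self.game.get_actions(state))
        best = max(self.get_q_value(state, a) for a in actions)
        tied = [a for a in actions if self.get_q_value(state, a) == best]
        return random.choice(tied)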
With the Q-learning update in place, you can watch your Q-learner learn under manual control, using the keyboard arrow keys:
python gridworld.py
# template
import random


class QLearningAgent:
    """Implement Q Reinforcement Learning Agent using Q-table."""

    def __init__(self, game, discount, learning_rate, explore_prob):
        """Store any needed parameters into the agent object.
        Initialize Q-table."""
        # TODO

    def get_q_value(self, state, action):
        """Retrieve Q-value from Q-table.
        For a never-seen (s, a) pair, the Q-value is 0 by default."""
        return 0  # TODO

    def get_value(self, state):
        """Compute state value from Q-values using Bellman Equation.
        V(s) = max_a Q(s, a)"""
        return 0  # TODO

    def get_best_policy(self, state):
        """Compute the best action to take in the state using Policy Extraction.
        pi(s) = argmax_a Q(s, a)
        If there are ties, return a random one for better performance.
        Hint: use random.choice()."""
        return None  # TODO

    def update(self, state, action, next_state, reward):
        """Update Q-values using running average.
        Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (R + gamma * V(s'))
        where alpha is the learning rate and gamma is the discount.
        Note: You should not call this function in your code."""
        # TODO

    # Epsilon Greedy
    def get_action(self, state):
        """Compute the action to take for the agent, incorporating exploration.
        That is, with probability epsilon, act randomly.
        Otherwise, act according to the best policy.
        Hint: use random.random() < epsilon to check if exploration is needed."""
        return None  # TODO
# Bridge Crossing Revisited
def question():
    epsilon = ...
    learning_rate = ...
    return epsilon, learning_rate
    # If not possible, return 'NOT POSSIBLE'
# Approximate Q-Learning
class ApproximateQAgent(QLearningAgent):
    """Implement Approximate Q Learning Agent using weights."""

    def __init__(self, *args, extractor):
        """Initialize parameters and store the feature extractor.
        Initialize weights table."""
        super().__init__(*args)
        # TODO

    def get_weight(self, feature):
        """Get weight of a feature.
        A never-seen feature should have a weight of 0."""
        return 0  # TODO

    def get_q_value(self, state, action):
        """Compute Q-value based on the dot product of feature components and weights.
        Q(s, a) = w_1 * f_1(s, a) + w_2 * f_2(s, a) + ... + w_n * f_n(s, a)"""
        return 0  # TODO

    def update(self, state, action, next_state, reward):
        """Update weights using least-squares approximation.
        Delta = R + gamma * V(s') - Q(s, a)
        Then update weights: w_i = w_i + alpha * Delta * f_i(s, a)"""
        # TODO
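To make the two update rules above concrete, here is a hedged sketch of the arithmetic as standalone helper functions. The function names, the dict-based Q-table and weights, and the example states are all invented for illustration; they are not part of the provided skeleton.

    import random

    def tabular_q_update(q_table, state, action, next_state, reward, alpha, gamma, get_value):
        # Running average: Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (R + gamma * V(s'))
        sample = reward + gamma * get_value(next_state)
        old_q = q_table.get((state, action), 0.0)
        q_table[(state, action)] = (1 - alpha) * old_q + alpha * sample

    def epsilon_greedy(actions, best_action, epsilon):
        # With probability epsilon act randomly, otherwise follow the best policy
        if random.random() < epsilon:
            return random.choice(list(actions))
        return best_action

    def approximate_q_update(weights, features, reward, next_value, q_value, alpha, gamma):
        # Delta = R + gamma * V(s') - Q(s, a); then w_i <- w_i + alpha * Delta * f_i(s, a)
        delta = reward + gamma * next_value - q_value
        for name, value in features.items():
            weights[name] = weights.get(name, 0.0) + alpha * delta * value

    # Tiny example with made-up states; with a fresh table, V('B') is taken as 0
    q = {}
    tabular_q_update(q, 'A', 'right', 'B', reward=1.0, alpha=0.5, gamma=0.9, get_value=lambda s: 0.0)
    print(q)  # {('A', 'right'): 0.5}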
