Question: For the Markov Decision Process (MDP), this method is called in a loop and is supposed to update the state value of each cell. Since it is already called in a loop, I did not think it needs another loop over iterations inside the method itself. I was also not sure whether I am calling computeQValue correctly.
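In other words (this is my reading of the assignment, written with the constants defined below): each call should apply the standard value-iteration backup Q(s, a) = ACTION_REWARD + GAMMA * (TRANSITION_SUCCEED * V(intended neighbor) + TRANSITION_FAIL/2 * V(one perpendicular neighbor) + TRANSITION_FAIL/2 * V(other perpendicular neighbor)) for each of the four actions, and then set the cell's state value to the maximum of its four Q-values.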
ACTION_EAST=0
ACTION_SOUTH=1
ACTION_WEST=2
ACTION_NORTH=3
TRANSITION_SUCCEED=0.8 #The probability that taking action A moves the agent to the intended destination state S' (the state the action aims to reach)
TRANSITION_FAIL=0.2 #The probability that taking action A moves the agent to an unintended neighboring state. For example, taking action East may move you to the north or south neighbor instead, each with probability 0.1; we assume the two perpendicular directions split TRANSITION_FAIL (0.2) evenly
GAMMA=0.9 #the discount factor
ACTION_REWARD=-0.1 #The instantaneous reward for taking each action (we assume all four actions (N/E/W/S) have the same reward)
CONVERGENCE=0.0000001 #The convergence threshold used as the stopping criterion
cur_convergence=100
#the function that calculates the Q-value for (s, action) and updates the state's data with it
#s is the state (cell)
#action is an integer 0-3: 0-east, 1-south, 2-west, 3-north
def computeQValue(s,action):
    pass #body not shown in the question
def valueIteration():
    print('Value Iteration.')
    #called in a loop
    #use computeQValue and update the state value of each cell
    #ideally the policy should be obtained in fewer than 100 iterations if possible
    #use cur_convergence and CONVERGENCE
    for i in range(3):
        states.q_value[i] = computeQValue(states, states.q_value)
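For reference, here is a minimal sketch of what computeQValue could look like, written against the constants above. It assumes the cells live in a global 2-D list states indexed as states[x][y] (the question's code also refers to a states variable but never shows how it is built), and it uses a helper neighbor_value and an offset table DELTAS that are not part of the original code; moves that would leave the grid are treated as leaving the agent in place.

#--- sketch, not from the original question ---
#(dx, dy) offsets for the four actions; assumes x grows to the east and y grows to the north
DELTAS = {ACTION_EAST: (1, 0), ACTION_SOUTH: (0, -1), ACTION_WEST: (-1, 0), ACTION_NORTH: (0, 1)}

def neighbor_value(s, direction):
    #hypothetical helper: state value of the neighbor in 'direction',
    #or of s itself when the move would leave the grid (the agent stays put)
    dx, dy = DELTAS[direction]
    x, y = s.location[0] + dx, s.location[1] + dy
    if 0 <= x < len(states) and 0 <= y < len(states[0]):
        return states[x][y].state_value
    return s.state_value

def computeQValue(s, action):
    #Bellman backup for one (state, action) pair:
    #TRANSITION_SUCCEED (0.8) to the intended neighbor,
    #TRANSITION_FAIL/2 (0.1) to each of the two perpendicular neighbors
    side_a = (action + 1) % 4   #with the E/S/W/N numbering, +/-1 are the perpendicular directions
    side_b = (action - 1) % 4
    expected = (TRANSITION_SUCCEED * neighbor_value(s, action)
                + (TRANSITION_FAIL / 2) * neighbor_value(s, side_a)
                + (TRANSITION_FAIL / 2) * neighbor_value(s, side_b))
    q = ACTION_REWARD + GAMMA * expected
    s.q_values[action] = q   #store the Q-value on the cell, as the original comment describes
    return q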
Here is the Cell class used for each grid cell:
class Cell:
    def __init__(self,x,y):
        self.q_values=[0.0,0.0,0.0,0.0]
        self.location=(x,y)
        self.state_value=max(self.q_values)
        self.policy=0
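And a sketch of how valueIteration and the surrounding convergence loop could fit together, again only as one possible reading of the assignment: the 4x3 grid size is a placeholder, and the sweep updates values in place, so later cells in a sweep already see their neighbors' new values.

#--- sketch, not from the original question ---
GRID_WIDTH, GRID_HEIGHT = 4, 3   #placeholder grid size
states = [[Cell(x, y) for y in range(GRID_HEIGHT)] for x in range(GRID_WIDTH)]

def valueIteration():
    #one sweep over all cells: recompute all four Q-values, then take the max
    global cur_convergence
    largest_change = 0.0
    for column in states:
        for s in column:
            old_value = s.state_value
            for action in range(4):              #all four actions, not range(3)
                computeQValue(s, action)         #fills s.q_values[action]
            s.state_value = max(s.q_values)
            s.policy = s.q_values.index(s.state_value)   #greedy action for this cell
            largest_change = max(largest_change, abs(s.state_value - old_value))
    cur_convergence = largest_change             #how much the values moved in this sweep

#outer loop: keep sweeping until the largest change drops below CONVERGENCE
iteration = 0
while cur_convergence > CONVERGENCE and iteration < 100:
    valueIteration()
    iteration += 1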
