Question: I need help figuring out if my thought process is right for this MDP (Markov Decision Process). These are the constants:
ACTION_EAST = 0
ACTION_SOUTH = 1
ACTION_WEST = 2
ACTION_NORTH = 3

TRANSITION_SUCCEED = 0.8  # probability that taking action A moves to the intended destination state S'
TRANSITION_FAIL = 0.2     # probability that taking action A moves to an unintended state S'; e.g. taking East
                          # may move North or South instead, 0.1 each (we assume the two perpendicular
                          # directions evenly split TRANSITION_FAIL = 0.2)
GAMMA = 0.9               # the discount factor
ACTION_REWARD = -0.1      # the instantaneous reward for taking each action (we assume the four
                          # actions N/E/W/S have the same reward)
CONVERGENCE = 0.0000001   # the threshold for convergence, used to decide when to stop
cur_convergence = 100
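For reference, a worked example of the one-step backup these constants imply (my own illustration, not text from the assignment): for taking East in a state s,

    Q(s, East) = ACTION_REWARD + GAMMA * (0.8 * V(east neighbour) + 0.1 * V(north neighbour) + 0.1 * V(south neighbour))
               = -0.1 + 0.9 * (0.8 * V_east + 0.1 * V_north + 0.1 * V_south)

where V(.) is the current state value of the neighbouring cell; what happens when a move would leave the grid (commonly, the agent stays in place) is not stated above and is an assumption here.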
Also, here is the Cell class that is implemented:
class Cell:
    def __init__(self, x, y):
        self.q_values = [0.0, 0.0, 0.0, 0.0]   # one Q-value per action (E, S, W, N)
        self.location = (x, y)
        self.state_value = max(self.q_values)  # V(s) = max over the Q-values
        self.policy = 0                        # index of the action currently considered best
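One detail worth flagging about this class (an observation added here, not part of the original post): state_value is computed once in __init__ from the initial q_values and is not recomputed automatically when q_values changes later, so it has to be refreshed explicitly. A minimal illustration:

    c = Cell(0, 0)
    c.q_values[0] = 1.0                         # change a Q-value...
    print(c.state_value)                        # ...still 0.0: state_value was set only in __init__
    c.state_value = max(c.q_values)             # refresh V(s) = max_a Q(s, a)
    c.policy = c.q_values.index(c.state_value)  # greedy action index (0 = ACTION_EAST here)
    print(c.state_value, c.policy)              # 1.0 0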
#####Implement the below functions ############################

def computeQValue(s, action):
    print('Compute Q Values')
    # s is the state of each cell
    # action takes a value 0-3: 0-east, 1-south, 2-west, 3-north
    # For each cell, the Q-value is calculated based on the action taken
    # and the state data is updated with the Q-value
    # Do I need this: state_old_value = s.state_value.copy() ?
    if action == ACTION_EAST:
        s.q_values[0] = ACTION_REWARD + GAMMA*(s.state_value + ACTION_EAST)
        s.state_value = ACTION_REWARD + GAMMA*(TRANSITION_SUCCEED*s.q_values[0] + TRANSITION_FAIL*s.q_values[1] + TRANSITION_FAIL*s.q_values[3])
    elif action == ACTION_SOUTH:
        s.q_values[1] = ACTION_REWARD + GAMMA*(s.state_value + ACTION_SOUTH)
        s.state_value = ACTION_REWARD + GAMMA*(TRANSITION_SUCCEED*s.q_values[1] + TRANSITION_FAIL*s.q_values[0] + TRANSITION_FAIL*s.q_values[2])
    elif action == ACTION_WEST:
        s.q_values[2] = ACTION_REWARD + GAMMA*(s.state_value + ACTION_WEST)
        s.state_value = ACTION_REWARD + GAMMA*(TRANSITION_SUCCEED*s.q_values[2] + TRANSITION_FAIL*s.q_values[1] + TRANSITION_FAIL*s.q_values[3])
    else:
        s.q_values[3] = ACTION_REWARD + GAMMA*(s.state_value + ACTION_NORTH)
        s.state_value = ACTION_REWARD + GAMMA*(TRANSITION_SUCCEED*s.q_values[3] + TRANSITION_FAIL*s.q_values[2] + TRANSITION_FAIL*s.q_values[0])

def valueIteration():
    print('Value Iteration.')
    # called in a loop
    # use computeQValue and update the state value of each cell
    # ideally the policy should be obtained in less than 100 iterations if possible
    # use cur_convergence and CONVERGENCE together with states.state_value

def policyEvaluation():
    print('Policy Evaluation')
    # updating the state values of each cell based on the current policy
    # for i in range(policy):
    #     states.state_value = ACTION_REWARD + GAMMA*old_state_value[i + states.policy]

def policyImprovement():
    print('Policy Improvement.')
    # responsible for updating the policy of each cell
    # getting the max q value: states.policy = +1 ?
    # the least q value: states.policy = -1 ?
    # policy iteration should need fewer iterations than value iteration
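For comparison, below is a minimal sketch of how computeQValue and valueIteration are often structured for a grid MDP with this transition model. It is only a sketch under stated assumptions, not the assignment's reference solution: the grid variable, its dimensions (GRID_WIDTH, GRID_HEIGHT), the neighbor_value helper, the coordinate convention, and the "stay in place at the border" rule are all introduced here for illustration. The key point it shows is that Q(s, a) is an expectation over the current values of the neighbouring states reachable from s, rather than a function of s.state_value plus the action index.

    # Assumption: the world is a GRID_WIDTH x GRID_HEIGHT grid of Cell objects,
    # with x growing East and y growing North.
    GRID_WIDTH, GRID_HEIGHT = 4, 3
    grid = [[Cell(x, y) for y in range(GRID_HEIGHT)] for x in range(GRID_WIDTH)]

    # Displacement per action, and the two perpendicular "slip" directions per action.
    MOVES = {ACTION_EAST: (1, 0), ACTION_SOUTH: (0, -1),
             ACTION_WEST: (-1, 0), ACTION_NORTH: (0, 1)}
    SLIPS = {ACTION_EAST: (ACTION_NORTH, ACTION_SOUTH),
             ACTION_WEST: (ACTION_NORTH, ACTION_SOUTH),
             ACTION_NORTH: (ACTION_EAST, ACTION_WEST),
             ACTION_SOUTH: (ACTION_EAST, ACTION_WEST)}

    def neighbor_value(s, action):
        # Current value of the cell reached from s by 'action'; stay in place at the border.
        x, y = s.location
        dx, dy = MOVES[action]
        nx, ny = x + dx, y + dy
        if 0 <= nx < GRID_WIDTH and 0 <= ny < GRID_HEIGHT:
            return grid[nx][ny].state_value
        return s.state_value

    def computeQValue(s, action):
        # Q(s, a) = R + GAMMA * sum over s' of P(s' | s, a) * V(s')
        slip_a, slip_b = SLIPS[action]
        expected = (TRANSITION_SUCCEED * neighbor_value(s, action)
                    + TRANSITION_FAIL / 2 * neighbor_value(s, slip_a)
                    + TRANSITION_FAIL / 2 * neighbor_value(s, slip_b))
        s.q_values[action] = ACTION_REWARD + GAMMA * expected

    def valueIteration():
        # Sweep all cells until the largest change in any state value drops below CONVERGENCE.
        cur_convergence = float('inf')
        while cur_convergence > CONVERGENCE:
            cur_convergence = 0.0
            for column in grid:
                for s in column:
                    old_value = s.state_value          # kept only to measure the change
                    for a in range(4):
                        computeQValue(s, a)
                    s.state_value = max(s.q_values)    # V(s) = max_a Q(s, a)
                    s.policy = s.q_values.index(s.state_value)
                    cur_convergence = max(cur_convergence, abs(s.state_value - old_value))

Under the same assumptions, policy evaluation would repeatedly back up only the Q-value of each cell's current policy action, and policy improvement would set policy to the argmax over the freshly computed Q-values; the alternation typically needs fewer sweeps than value iteration.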
