Question: I WILL GIVE POSITIVE FEEDBACK!!
Modify the values for the exploration factor, discount factor, and learning rates in the code to understand how those values affect the performance of the algorithm. Be sure to place each experiment in a different code block so that your instructor can view all of your changes. Note: Discount factor = GAMMA, learning rate = LEARNING_RATE, exploration factor = combination of EXPLORATION_MAX, EXPLORATION_MIN, and EXPLORATION_DECAY.
- Create a Markdown cell in your Jupyter Notebook after the code and its outputs. In this cell, you will be asked to analyze the code and relate it to the concepts from your readings. You are expected to include resources to support your answers, and must include citations for those resources. Specifically, you must address the following rubric criteria:
- Explain how reinforcement learning concepts apply to the cartpole problem.
- What is the goal of the agent in this case?
- What are the various state values?
- What are the possible actions that can be performed?
- What reinforcement algorithm is used for this problem?
- Analyze how experience replay is applied to the cartpole problem.
- How does experience replay work in this algorithm?
- What is the effect of introducing a discount factor for calculating the future rewards?
- Analyze how neural networks are used in deep Q-learning.
- Explain the neural network architecture that is used in the cartpole problem.
- How does the neural network make the Q-learning algorithm more efficient?
- What difference do you see in the algorithm performance when you increase or decrease the learning rate?
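The state values, actions, reward signal, and discount factor asked about in the rubric above can be checked directly against the environment. The short sketch below is illustrative only and is not part of the assignment code; it assumes the classic Gym API (gym < 0.26), the same API the DQN code further down uses, and the max_next_q number near the end is a made-up value used only to show how GAMMA down-weights future rewards.

import gym

env = gym.make("CartPole-v1")

# State: 4 continuous values - cart position, cart velocity,
# pole angle, and pole angular velocity.
print(env.observation_space.shape)   # (4,)
print(env.observation_space.high)
print(env.observation_space.low)

# Actions: 2 discrete choices - 0 = push cart left, 1 = push cart right.
print(env.action_space.n)            # 2

# Goal: keep the pole balanced as long as possible; the environment
# returns a reward of +1 for every step the pole stays up.
state = env.reset()
next_state, reward, done, info = env.step(env.action_space.sample())
print(reward, done)

# Effect of the discount factor on the Q-learning target:
#   target = reward + GAMMA * max_a' Q(next_state, a')
GAMMA = 0.95                          # same value as in the code below
max_next_q = 10.0                     # hypothetical value, for illustration only
print(1.0 + GAMMA * max_next_q)       # 10.5 - future reward counts less than immediate reward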
...CODE...
import random
import gym
import numpy as np
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
from scores.score_logger import ScoreLogger

ENV_NAME = "CartPole-v1"

# Hyperparameters: discount factor, learning rate, replay memory,
# and exploration (epsilon) schedule.
GAMMA = 0.95
LEARNING_RATE = 0.001

MEMORY_SIZE = 1000000
BATCH_SIZE = 20

EXPLORATION_MAX = 1.0
EXPLORATION_MIN = 0.01
EXPLORATION_DECAY = 0.995


class DQNSolver:

    def __init__(self, observation_space, action_space):
        self.exploration_rate = EXPLORATION_MAX

        self.action_space = action_space
        self.memory = deque(maxlen=MEMORY_SIZE)

        # Q-network: two hidden layers of 24 ReLU units, linear output
        # with one Q-value per action.
        self.model = Sequential()
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))
        self.model.add(Dense(24, activation="relu"))
        self.model.add(Dense(self.action_space, activation="linear"))
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        # Epsilon-greedy action selection: explore with probability exploration_rate.
        if np.random.rand() < self.exploration_rate:
            return random.randrange(self.action_space)
        q_values = self.model.predict(state)
        return np.argmax(q_values[0])

    def experience_replay(self):
        if len(self.memory) < BATCH_SIZE:
            return
        # Sample a random minibatch of stored transitions and fit the
        # network toward the Bellman targets.
        batch = random.sample(self.memory, BATCH_SIZE)
        for state, action, reward, state_next, terminal in batch:
            q_update = reward
            if not terminal:
                # Target: reward plus discounted best Q-value of the next state.
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))
            q_values = self.model.predict(state)
            q_values[0][action] = q_update
            self.model.fit(state, q_values, verbose=0)
        # Decay exploration toward its minimum after each replay step.
        self.exploration_rate *= EXPLORATION_DECAY
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)


def cartpole():
    # Written for the classic gym API (gym < 0.26): reset() returns the state
    # and step() returns (state, reward, done, info).
    env = gym.make(ENV_NAME)
    score_logger = ScoreLogger(ENV_NAME)
    observation_space = env.observation_space.shape[0]
    action_space = env.action_space.n
    dqn_solver = DQNSolver(observation_space, action_space)
    run = 0
    while True:
        run += 1
        state = env.reset()
        state = np.reshape(state, [1, observation_space])
        step = 0
        while True:
            step += 1
            #env.render()
            action = dqn_solver.act(state)
            state_next, reward, terminal, info = env.step(action)
            reward = reward if not terminal else -reward
            state_next = np.reshape(state_next, [1, observation_space])
            dqn_solver.remember(state, action, reward, state_next, terminal)
            state = state_next
            if terminal:
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))
                score_logger.add_score(step, run)
                break
            dqn_solver.experience_replay()
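As a starting point for the modification task in the prompt, the cell below sketches how one experiment block might look. The specific values (GAMMA = 0.80, EXPLORATION_DECAY = 0.9) are arbitrary examples, not recommended settings, and the cell assumes the code above has already been run in the same notebook kernel so that cartpole() and the hyperparameter globals exist. Repeating this pattern in separate cells, one changed value per cell, keeps every experiment visible to the instructor.

# Experiment cell sketch: override the module-level hyperparameters defined
# above, then retrain. Because DQNSolver reads GAMMA, LEARNING_RATE, and the
# EXPLORATION_* constants from the notebook's global namespace, reassigning
# them here takes effect for the new training run.
GAMMA = 0.80               # default above is 0.95; future rewards matter less
LEARNING_RATE = 0.001      # unchanged here; try 0.01 or 0.0001 in other cells
EXPLORATION_DECAY = 0.9    # default above is 0.995; exploration shrinks sooner

# Note: cartpole() runs an endless training loop, so interrupt the kernel
# (or modify the loop to stop after a fixed number of runs) once enough
# scores have been logged for comparison.
cartpole()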
