Question: Below I have attached the training.py code. Give the DQN Architecture 1 code, which takes the state and the action as input and returns only one Q-value. Integrate that code with training.py and give the result. Make sure you are getting an increasing reward trend as episodes increase, and do not use the offline data, as mentioned in the image. The other guidelines are in the image.
import numpy as np
from keras.models import Model
from keras.layers import Input, Dense, Lambda
from keras.optimizers import Adam
import keras.backend as K
from collections import deque
import random
# Constants
BATCH_SIZE = 64
GAMMA = 0.99
EPSILON_START = 1.0
EPSILON_MIN = 0.01
EPSILON_DECAY = 0.995
LEARNING_RATE = 0.001

# Dueling DQN Model Architecture
def create_dueling_dqn_model(input_shape, action_space):
    state_input = Input(shape=(input_shape,))
    x = Dense(512, activation='relu')(state_input)
    x = Dense(256, activation='relu')(x)
    x = Dense(64, activation='relu')(x)
    # State-value stream: a single scalar V(s), broadcast across actions when added to the advantages
    state_value = Dense(1)(x)
    state_value = Lambda(lambda s: K.expand_dims(s[:, 0], -1), output_shape=(action_space,))(state_value)
    # Advantage stream: one advantage per action, centred by subtracting the mean advantage
    action_advantage = Dense(action_space)(x)
    action_advantage = Lambda(lambda a: a[:, :] - K.mean(a[:, :], keepdims=True), output_shape=(action_space,))(action_advantage)
    # Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))
    q_values = Lambda(lambda w: w[0] + w[1], output_shape=(action_space,))([state_value, action_advantage])
    model = Model(inputs=state_input, outputs=q_values)
    model.compile(loss='mse', optimizer=Adam(lr=LEARNING_RATE))
    return model
# DQN Training Function
def DQN_training(env, offline_data, use_offline_data=False):
    state_size = env.observation_space.shape[0]
    action_size = env.action_space.n
    model = create_dueling_dqn_model(state_size, action_size)
    replay_buffer = deque(maxlen=2000)
    epsilon = EPSILON_START
    total_reward_per_episode = []
    for episode in range(1000):  # Number of episodes
        state = env.reset()
        state = np.reshape(state, [1, state_size])
        total_reward = 0
        for time_step in range(500):  # Max steps in an episode
            if np.random.rand() <= epsilon:
                action = env.action_space.sample()  # Explore action space
            else:
                q_values = model.predict(state)
                action = np.argmax(q_values[0])  # Exploit learned values
            next_state, reward, done, _ = env.step(action)
            next_state = np.reshape(next_state, [1, state_size])
            total_reward += reward
            if not use_offline_data:  # Only save and learn if not using offline data
                replay_buffer.append((state, action, reward, next_state, done))
                if len(replay_buffer) > BATCH_SIZE:
                    minibatch = random.sample(replay_buffer, BATCH_SIZE)
                    for s, a, r, n_s, d in minibatch:
                        target = r
                        if not d:
                            target = r + GAMMA * np.amax(model.predict(n_s)[0])
                        target_f = model.predict(s)
                        target_f[0][a] = target
                        model.fit(s, target_f, epochs=1, verbose=0)
            state = next_state
            if done:
                break
        total_reward_per_episode.append(total_reward)
        # Update epsilon
        epsilon = max(EPSILON_MIN, epsilon * EPSILON_DECAY)
    return model, np.array(total_reward_per_episode)
# Replace this line with any initialization of the environment required before training
# env = gym.make('LunarLander-v2')
# Do not load offline data
use_offline_data = False
# Now you would call DQN_training like this:
# final_model, total_reward_per_episode = DQN_training(env, None, use_offline_data)
# After training, you'd save your model and plot the rewards.
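Note that the attached create_dueling_dqn_model above is an Architecture type 2 network: it takes only the state as input and returns one Q-value per action. The question (and the assignment text below) asks for Architecture type 1, where the state and the action are both inputs and the output is a single Q-value. A minimal sketch of such a network follows; the function name create_architecture1_dqn_model, the one-hot action encoding and the layer sizes are illustrative assumptions, not something fixed by the given skeleton.

from keras.models import Model
from keras.layers import Input, Dense, Concatenate
from keras.optimizers import Adam

def create_architecture1_dqn_model(state_size, action_size, learning_rate=0.001):
    # Two inputs: the state vector and a one-hot encoding of the chosen action
    state_input = Input(shape=(state_size,))
    action_input = Input(shape=(action_size,))
    x = Concatenate()([state_input, action_input])
    x = Dense(512, activation='relu')(x)
    x = Dense(256, activation='relu')(x)
    x = Dense(64, activation='relu')(x)
    # A single output neuron: the Q-value of the (state, action) pair
    q_value = Dense(1, activation='linear')(x)
    model = Model(inputs=[state_input, action_input], outputs=q_value)
    model.compile(loss='mse', optimizer=Adam(lr=learning_rate))
    return model

Such a model is fitted against a single scalar target per sample, e.g. model.fit([state_batch, one_hot_action_batch], q_target_batch), rather than against a full vector of Q-values.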
Section 3: Train DQN Model

In this section you will train two DQN models of Architecture type 1, i.e. the DQN model should accept the state and the action as input, and the output of the model should be the Q-value of the state-action pair given in the input. The first DQN model should be trained without the data collected in step 3 and the second one uses that data.

VERY IMPORTANT: If you code a DQN model of Architecture type 2 (i.e. a DQN model that accepts the state as input and outputs the Q-values of all state-action pairs), you will get a ZERO for this section. There will be NO MERCY in this regard.
Deliverables (75 marks): You are given a Python script training.py. This script contains the bare basic skeleton of the DQN training code along with a function that loads the data collected in step 3. You must NOT change the overall structure of the skeleton. There are two functions in training.py: DQN_training and plot_reward. Your task is to write the code for these two functions. A few additional instructions: this function MUST train a DQN of Architecture 1 (the DQN model should accept the state and the action as input and the output of the model should be the Q-value of the state-action pair given in the input).
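With an Architecture-1 network, the epsilon-greedy action selection and the TD target inside DQN_training can no longer come from a single model.predict(state) call followed by argmax; the network has to be evaluated once per candidate action. The sketch below shows one way this could be wired into the loop above, under the same old-Gym API assumptions; the helpers one_hot and q_values_for_all_actions are hypothetical names, not part of the provided training.py.

import numpy as np

def one_hot(action, action_size):
    # Hypothetical helper: one-hot encode a discrete action index as a (1, action_size) batch
    v = np.zeros((1, action_size))
    v[0, action] = 1.0
    return v

def q_values_for_all_actions(model, state, action_size):
    # Evaluate the Architecture-1 model once per candidate action and
    # collect the scalar Q-values into one vector of length action_size.
    return np.array([
        model.predict([state, one_hot(a, action_size)])[0, 0]
        for a in range(action_size)
    ])

# Inside the time-step loop of DQN_training (sketch only):
# if np.random.rand() <= epsilon:
#     action = env.action_space.sample()
# else:
#     action = int(np.argmax(q_values_for_all_actions(model, state, action_size)))
#
# TD target for a sampled transition (s, a, r, n_s, d):
# target = r if d else r + GAMMA * np.max(q_values_for_all_actions(model, n_s, action_size))
# model.fit([s, one_hot(a, action_size)], np.array([[target]]), epochs=1, verbose=0)

Rewards per episode can still be collected exactly as in the skeleton above, so the increasing reward trend can be checked by plotting total_reward_per_episode from plot_reward.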