Question:
def create_mlp(input_dim: int, output_dim: int, architecture: List[int],
               squash=False, activation: Type[nn.Module] = nn.ReLU) -> List[nn.Module]:
    '''Creates a list of modules that define an MLP.'''
    if len(architecture) > 0:
        layers = [nn.Linear(input_dim, architecture[0]), activation()]
    else:
        layers = []
    for i in range(len(architecture) - 1):
        layers.append(nn.Linear(architecture[i], architecture[i + 1]))
        layers.append(activation())
    if output_dim > 0:
        last_dim = architecture[-1] if len(architecture) > 0 else input_dim
        layers.append(nn.Linear(last_dim, output_dim))
    if squash:
        # squashes output down to (-1, 1)
        layers.append(nn.Tanh())
    return layers

def create_net(input_dim: int, output_dim: int, squash=False):
    layers = create_mlp(input_dim, output_dim, architecture=[64, 64], squash=squash)
    net = nn.Sequential(*layers)
    return net
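For reference, a hypothetical usage of create_net for this assignment's Q network, assuming LunarLander-v2's 8-dimensional observations and 4 discrete actions (the variable name q_net is illustrative, not from the notebook):

    # Q network for LunarLander-v2: 8-dimensional state in, 4 action logits out.
    q_net = create_net(input_dim=8, output_dim=4)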

def argmax_policy(net):
    # TODO: Return a FUNCTION that takes in a state, and outputs the maximum Q value of said state.
    # Inputs:
    # - net: (type nn.Module). A neural network module, going from state dimension to number of actions. Q network.
    # Wanted output:
    # - argmax_fn: A function which takes in a state, and outputs the maximum Q value of said state.
    pass

def expert_policy(expert, s):
    '''Returns a one-hot encoded action of what the expert predicts at state s.'''
    action = expert.predict(s)[0]
    one_hot_action = np.eye(4)[action]
    return one_hot_action

We first ask that you implement some simple utilities that will go toward training all of our policies. Because LunarLander-v2 is an environment with a finite number of actions, we can represent our policy by a neural network that takes in the state and outputs a vector whose dimension is the number of actions. In imitation learning, we want to be able to match the expert actions at all states. In the discrete-action case, this boils down to maximizing the log likelihood of the expert actions taken in those particular states (i.e. by maximizing logits). But in evaluation/deployment, we want to use a greedy version of our learnt policy: we want to exploit what we have learned throughout training by choosing the action the learner thinks is best (i.e. the maximum logit). Please implement the argmax policy method in the Part 1: Utils section of the notebook, following the instructions in the argmax_policy stub shown above.

Behavioral cloning: Behavioral cloning is the simplest imitation learning algorithm, where we perform supervised learning on the given (offline) expert dataset. We do this either via log likelihood maximization (cross-entropy minimization) in the discrete-action case, or mean-squared-error minimization (MLE is also possible) in the continuous-control setting. Please implement the following learn() function for BC.

def learn(self, env, states, actions, n_steps=1e4, ...):
    # TODO: Implement this method. Return the final greedy policy (argmax policy).
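One possible way to fill in argmax_policy is sketched below. This is not the notebook's reference solution: it assumes states arrive as NumPy arrays (or anything torch.as_tensor accepts) and interprets "maximum Q value" as selecting the greedy action, i.e. the index of the largest output logit, as the surrounding prose describes.

    import torch

    def argmax_policy(net):
        # Returns a function that maps a state to the greedy (argmax-Q) action.
        def argmax_fn(state):
            state_t = torch.as_tensor(state, dtype=torch.float32)
            if state_t.dim() == 1:
                state_t = state_t.unsqueeze(0)   # add a batch dimension
            with torch.no_grad():
                q_values = net(state_t)          # shape: (1, num_actions)
            return q_values.argmax(dim=-1).item()
        return argmax_fn

Likewise, here is a minimal sketch of the BC learn() step. It assumes the policy network is stored as self.net, that the expert actions may be one-hot encoded (as expert_policy suggests), and that a plain Adam + cross-entropy loop with an illustrative batch_size parameter is acceptable; these attribute and parameter names are assumptions, not part of the original notebook.

    import numpy as np
    import torch
    import torch.nn as nn

    def learn(self, env, states, actions, n_steps=1e4, batch_size=64):
        # Supervised learning on the expert dataset: maximize the log likelihood
        # of the expert actions, i.e. minimize cross-entropy on the logits.
        states_t = torch.as_tensor(states, dtype=torch.float32)
        actions_t = torch.as_tensor(actions, dtype=torch.long)
        if actions_t.dim() > 1:                  # handle one-hot encoded expert actions
            actions_t = actions_t.argmax(dim=-1)

        optimizer = torch.optim.Adam(self.net.parameters(), lr=1e-3)  # assumed hyperparameters
        loss_fn = nn.CrossEntropyLoss()

        for _ in range(int(n_steps)):
            idx = np.random.randint(0, len(states_t), size=batch_size)
            logits = self.net(states_t[idx])
            loss = loss_fn(logits, actions_t[idx])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Deploy the greedy version of the learned policy.
        return argmax_policy(self.net)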
