Question:

Useful code:
```python
from typing import List, Type

import numpy as np
import torch.nn as nn


def create_mlp(input_dim: int, output_dim: int, architecture: List[int],
               squash=False, activation: Type[nn.Module] = nn.ReLU) -> List[nn.Module]:
    '''Creates a list of modules that define an MLP.'''
    if len(architecture) > 0:
        layers = [nn.Linear(input_dim, architecture[0]), activation()]
    else:
        layers = []
    for i in range(len(architecture) - 1):
        layers.append(nn.Linear(architecture[i], architecture[i + 1]))
        layers.append(activation())
    if output_dim > 0:
        last_dim = architecture[-1] if len(architecture) > 0 else input_dim
        layers.append(nn.Linear(last_dim, output_dim))
    if squash:
        # squashes output down to (-1, 1)
        layers.append(nn.Tanh())
    return layers


def create_net(input_dim: int, output_dim: int, squash=False):
    layers = create_mlp(input_dim, output_dim, architecture=[64, 64], squash=squash)
    net = nn.Sequential(*layers)
    return net


def argmax_policy(net):
    # TODO: Return a FUNCTION that takes in a state, and outputs the maximum Q value of said state.
    # Inputs:
    # - net: (type nn.Module). A neural network module, going from state dimension to
    #   number of actions. Q network.
    # Wanted output:
    # - argmax_fn: A function which takes in a state, and outputs the maximum Q value of said state.
    pass


def expert_policy(expert, s):
    '''Returns a one-hot encoded action of what the expert predicts at state s.'''
    action = expert.predict(s)[0]
    one_hot_action = np.eye(4)[action]
    return one_hot_action
```
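For context, here is a minimal sketch of how these helpers might be used to build the Q-network for LunarLander-v2. The environment name comes from the question; the `gym` import (with Box2D installed) and the variable names are assumptions for illustration.

```python
import gym

env = gym.make("LunarLander-v2")
state_dim = env.observation_space.shape[0]  # 8-dimensional observation for LunarLander-v2
n_actions = env.action_space.n              # 4 discrete actions
q_net = create_net(state_dim, n_actions)    # MLP with two hidden layers of 64 units
```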
We first ask that you implement some simple utilities that will go toward training all of our policies. Because LunarLander-v2 is an environment with a finite number of actions, we can represent our policy by a neural network that takes in the state and outputs a vector whose dimension is the number of actions. In imitation learning, we want to be able to match the expert actions at all states. In the discrete-action case, this boils down to maximizing the log likelihood of the expert actions taken in those particular states (i.e. by maximizing logits). But in evaluation/deployment, we want to use a greedy version of our learnt policy: we want to exploit what we have learned throughout the training process by choosing the action that the learner thinks is best (i.e. the maximum logit). Please implement the argmax policy method in the Part 1: Utils section of the notebook, following the instructions in the `argmax_policy` stub above.

Behavioral cloning: Behavioral cloning is the simplest imitation learning algorithm, where we perform supervised learning on the given (offline) expert dataset. We do this either via log likelihood maximization (cross-entropy minimization) in the discrete-action case, or via mean-squared-error minimization (MLE is also possible) in the continuous control setting. Please implement the following `learn()` function for BC:

```python
def learn(self, env, states, actions, n_steps=1e4):
    # TODO: Implement this method. Return the final greedy policy (argmax_policy).
    pass
```
Step by Step Solution
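Step 1, a possible sketch of `argmax_policy`. Following the prose above, the greedy policy returns the index of the action with the highest Q value (logit). The handling of NumPy inputs and the single-state batching below are assumptions about how the notebook calls the policy.

```python
import torch

def argmax_policy(net):
    '''Returns a function that maps a state to the greedy (argmax-Q) action.'''
    def argmax_fn(s):
        s = torch.as_tensor(s, dtype=torch.float32)
        if s.dim() == 1:
            s = s.unsqueeze(0)              # add a batch dimension for a single state
        with torch.no_grad():
            q_values = net(s)               # shape: (1, n_actions)
        return q_values.argmax(dim=-1).item()
    return argmax_fn
```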
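Step 2, a sketch of a behavioral-cloning `learn()` method. This assumes the BC class (not shown in the question) stores its Q-network as `self.net`, that expert actions may arrive one-hot encoded as produced by `expert_policy`, and that the Adam learning rate and minibatch size below are acceptable; none of these details are specified in the question.

```python
import numpy as np
import torch
import torch.nn as nn

def learn(self, env, states, actions, n_steps=1e4):
    '''Behavioral cloning: supervised learning on the expert (state, action) dataset.'''
    states = torch.as_tensor(np.asarray(states), dtype=torch.float32)
    actions = torch.as_tensor(np.asarray(actions))
    if actions.dim() > 1:                    # one-hot expert actions -> class indices
        actions = actions.argmax(dim=-1)
    actions = actions.long()

    optimizer = torch.optim.Adam(self.net.parameters(), lr=1e-3)   # assumed hyperparameters
    loss_fn = nn.CrossEntropyLoss()          # log-likelihood maximization for discrete actions

    for _ in range(int(n_steps)):
        idx = torch.randint(0, states.shape[0], (64,))   # random minibatch of 64 (assumed)
        logits = self.net(states[idx])
        loss = loss_fn(logits, actions[idx])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return argmax_policy(self.net)           # final greedy policy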

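Step 3, checking the learnt policy. The question only says the greedy policy is used at evaluation/deployment time, so this rollout helper is purely illustrative; it assumes the classic `gym` API in which `env.step` returns a 4-tuple, and the name `evaluate` is hypothetical.

```python
def evaluate(policy_fn, env, n_episodes=10):
    '''Rolls out a greedy policy and returns the average episodic return.'''
    returns = []
    for _ in range(n_episodes):
        s = env.reset()
        done, total = False, 0.0
        while not done:
            a = policy_fn(s)                 # greedy action from argmax_policy(net)
            s, r, done, _ = env.step(a)
            total += r
        returns.append(total)
    return sum(returns) / len(returns)
```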