Part 0 (not graded): Implement Transformer and TransformerLayer for the BEFOREAFTER version of the task. You should identify the number of other letters of the same type in the sequence. This will require implementing both Transformer and TransformerLayer, as well as training in train_classifier.
Your solutions for this part should not use nn.TransformerEncoder, nn.TransformerDecoder, or any other off-the-shelf self-attention layers. You should only use Linear layers, softmax, and standard nonlinearities to implement the Transformer from scratch.
TransformerLayer: This layer should follow the format discussed in class: self-attention (single-headed is fine; you can use either backward-only or bidirectional attention); a residual connection; a Linear layer, a nonlinearity, and another Linear layer; and a final residual connection. With a shallow network like this, you likely don't need layer normalization, which is a bit more complicated to implement. Because this task is relatively simple, you don't need a very well-tuned architecture to make this work. You will implement all of these components from scratch.
You will want to form queries, keys, and values matrices with linear layers, then use the queries and keys to compute attention over the sentence, then combine the result with the values. You'll want to use matmul for this purpose, and you may need to transpose matrices as well. Double-check your dimensions and make sure everything is happening over the correct dimension. Furthermore, the division by √d_k from the attention paper may help stabilize and improve training, so don't forget it. A rough sketch of such a layer follows this paragraph.
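To make this concrete, here is a minimal single-headed layer along these lines. It is only a sketch: the names d_model (embedding size) and d_internal (attention/feedforward width) are assumptions, and the provided skeleton's constructor arguments may differ.

```python
import math
import torch
import torch.nn as nn


class TransformerLayer(nn.Module):
    """Single-headed self-attention + feedforward block built only from Linear
    layers, softmax, and a standard nonlinearity. Names are illustrative."""

    def __init__(self, d_model, d_internal):
        super().__init__()
        # Linear maps that form queries, keys, and values from the input
        self.w_q = nn.Linear(d_model, d_internal)
        self.w_k = nn.Linear(d_model, d_internal)
        self.w_v = nn.Linear(d_model, d_model)
        # Position-wise feedforward: Linear -> nonlinearity -> Linear
        self.ff = nn.Sequential(nn.Linear(d_model, d_internal),
                                nn.ReLU(),
                                nn.Linear(d_internal, d_model))

    def forward(self, input_vecs):
        # input_vecs: [seq_len, d_model]
        q = self.w_q(input_vecs)                      # [seq_len, d_internal]
        k = self.w_k(input_vecs)                      # [seq_len, d_internal]
        v = self.w_v(input_vecs)                      # [seq_len, d_model]
        # Scaled dot-product attention: divide by sqrt(d_k) for stability
        scores = torch.matmul(q, k.transpose(0, 1)) / math.sqrt(q.shape[-1])
        attn = torch.softmax(scores, dim=-1)          # [seq_len, seq_len]
        attended = torch.matmul(attn, v)              # [seq_len, d_model]
        x = input_vecs + attended                     # first residual connection
        out = x + self.ff(x)                          # final residual connection
        return out, attn
```

Returning the attention matrix alongside the output makes it easy to collect the attention maps that get plotted later.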
Transformer: Building the Transformer will involve: adding positional encodings to the input (see the PositionalEncoding class, though we recommend leaving these out for now); using one or more of your TransformerLayers; and using Linear and softmax layers to make the prediction. Unlike the previous assignment, you are simultaneously making predictions over each position in the sequence. Your network should return the log probabilities at the output layer (a matrix with one row per character position and one column per output class), as well as the attentions you compute, which are then plotted for you for visualization purposes in plots.
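Continuing the sketch above, a wrapper along these lines would tie the pieces together. All constructor arguments here (vocab_size, num_positions, and so on) are assumed names, not the skeleton's exact signature, and the positional-embedding line can simply be dropped for the Part 0 variant.

```python
class Transformer(nn.Module):
    """Maps a sequence of character indices to per-position log probabilities.
    Reuses the TransformerLayer sketch above; names are illustrative."""

    def __init__(self, vocab_size, d_model, d_internal, num_classes,
                 num_layers, num_positions):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(num_positions, d_model)
        self.layers = nn.ModuleList([TransformerLayer(d_model, d_internal)
                                     for _ in range(num_layers)])
        self.out_proj = nn.Linear(d_model, num_classes)

    def forward(self, indices):
        # indices: [seq_len] LongTensor of character indices
        x = self.char_emb(indices)
        positions = torch.arange(indices.shape[0])
        x = x + self.pos_emb(positions)   # omit this line for the Part 0 variant
        attn_maps = []
        for layer in self.layers:
            x, attn = layer(x)
            attn_maps.append(attn)
        log_probs = torch.log_softmax(self.out_proj(x), dim=-1)  # [seq_len, num_classes]
        return log_probs, attn_maps
```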
Training follows previous assignments. A skeleton is provided in train_classifier. We have already formed input/output tensors inside LetterCountingExample, so you can use these as your inputs and outputs. Whatever training code you used for the previous assignment should likely work here too, with the major change being the need to make simultaneous predictions at all timesteps and accumulate losses over all of them simultaneously. NLLLoss can help with computing a bulk loss over the entire sequence.
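A hedged sketch of such a loop is shown below. The function signature, hyperparameter values, and the assumption that each example exposes input_tensor and output_tensor fields (as suggested by LetterCountingExample) are illustrative, not the skeleton's exact interface; the data-dependent sizes are passed in rather than hardcoded.

```python
import random


def train_classifier(train_examples, vocab_size, num_classes, seq_len):
    """Illustrative training loop; the provided skeleton's signature differs."""
    # d_model, d_internal, learning rate, and epoch count are placeholder choices
    model = Transformer(vocab_size=vocab_size, d_model=64, d_internal=32,
                        num_classes=num_classes, num_layers=1,
                        num_positions=seq_len)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.NLLLoss()   # expects log probabilities; averages over positions
    for epoch in range(10):
        random.shuffle(train_examples)
        total_loss = 0.0
        for ex in train_examples:
            model.zero_grad()
            log_probs, _ = model(ex.input_tensor)      # [seq_len, num_classes]
            # One prediction per position; one bulk loss over the whole sequence
            loss = loss_fn(log_probs, ex.output_tensor)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"Epoch {epoch}: total loss {total_loss:.2f}")
    model.eval()
    return model
```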
Without positional encodings, your model may struggle a bit, but you should be able to reach high accuracy with a single-layer Transformer in a few epochs of training. The attention maps should also show some evidence of the model attending to the characters in context.
Part 1: Now extend your Transformer classifier with positional encodings and address the main task: identifying the number of letters of the same type preceding that letter. Run this with python letter_counting.py (no other arguments). Without positional encodings, the model simply sees a bag of characters and cannot distinguish letters occurring later or earlier in the sentence (although loss will still decrease and something can still be learned).
We provide a PositionalEncoding module that you can use: this initializes an nn.Embedding layer, embeds the position index of each character in the sequence, then adds these to the actual character embeddings. If the input sequence is "the", then the embedding of the first token would be embed_char(t) + embed_pos(0) and the embedding of the second token would be embed_char(h) + embed_pos(1).
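The provided module already implements this for you; purely as an illustration of the idea, a learned positional encoding can be sketched as follows (details such as batching may differ from the provided code).

```python
class PositionalEncoding(nn.Module):
    """Learned positional embeddings: embed each position index and add it to
    the character embedding at that position. A sketch of the idea only."""

    def __init__(self, d_model, num_positions):
        super().__init__()
        self.pos_emb = nn.Embedding(num_positions, d_model)

    def forward(self, char_embeddings):
        # char_embeddings: [seq_len, d_model]
        seq_len = char_embeddings.shape[0]
        positions = torch.arange(seq_len)   # 0, 1, ..., seq_len - 1
        # e.g. for "the": embed_char(t) + embed_pos(0), embed_char(h) + embed_pos(1), ...
        return char_embeddings + self.pos_emb(positions)
```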
Your final implementation should achieve high accuracy on this task. Our reference implementation, which uses single-head Transformer layers, reaches this within a few epochs of training, with each epoch taking a matter of seconds (there is some variance, and results can depend on initialization). Also note that the autograder trains your model on an additional task as well. You will fail this hidden test if your model uses anything hardcoded about these labels, or if you try to cheat and just return the correct answer that you computed by directly counting letters yourself; but any implementation that works for this problem will work for the hidden test.