
Part 0 (not graded) Implement Transformer and TransformerLayer for the BEFOREAFTER version of the task. You should identify the number of other letters of the same type in the sequence. This will require implementing both Transformer and TransformerLayer, as well as training in train_classifier.
Your Part 1 solutions should not use nn.TransformerEncoder, nn.TransformerDecoder, or
any other off-the-shelf self-attention layers. You should only use Linear, softmax, and standard nonlinearities to implement Transformers from scratch.
TransformerLayer This layer should follow the format discussed in class: (1) self-attention (single-headed is fine; you can use either backward-only or bidirectional attention); (2) residual connection; (3) Linear layer, nonlinearity, and Linear layer; (4) final residual connection. With a shallow network like this, you likely don't need layer normalization, which is a bit more complicated to implement. Because this task is relatively simple, you don't need a very well-tuned architecture to make this work. You will implement all of these components from scratch.
You will want to form query, key, and value matrices with linear layers, then use the queries and keys to compute attention over the sentence, then combine with the values. You'll want to use matmul for this purpose, and you may need to transpose matrices as well. Double-check your dimensions and make sure everything is happening over the correct dimension. Furthermore, the division by √d_k from the attention paper may help stabilize and improve training, so don't forget it!
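As a minimal sketch of such a layer (the constructor arguments d_model and d_internal and the choice of ReLU are assumptions here, not part of the provided skeleton):

    import math
    import torch
    import torch.nn as nn

    class TransformerLayer(nn.Module):
        def __init__(self, d_model, d_internal):
            super().__init__()
            # Linear maps that form queries, keys, and values from the input
            self.W_q = nn.Linear(d_model, d_internal)
            self.W_k = nn.Linear(d_model, d_internal)
            self.W_v = nn.Linear(d_model, d_model)
            # Position-wise feedforward: Linear -> nonlinearity -> Linear
            self.ff = nn.Sequential(nn.Linear(d_model, d_internal),
                                    nn.ReLU(),
                                    nn.Linear(d_internal, d_model))

        def forward(self, x):
            # x: [seq_len, d_model]
            q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
            # Scaled dot-product attention over the whole sequence (bidirectional here)
            scores = torch.matmul(q, k.transpose(0, 1)) / math.sqrt(q.shape[-1])
            attn = torch.softmax(scores, dim=-1)   # [seq_len, seq_len]
            x = x + torch.matmul(attn, v)          # first residual connection
            x = x + self.ff(x)                     # second residual connection
            return x, attn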
Transformer Building the Transformer will involve: (1) adding positional encodings to the input (see the PositionalEncoding class, though we recommend leaving these out for now); (2) using one or more of your TransformerLayers; (3) using Linear and softmax layers to make the prediction. Unlike Assignment 2, you are simultaneously making predictions over every position in the sequence. Your network should return the log probabilities at the output layer (a 20x3 matrix) as well as the attentions you compute, which are then plotted for you for visualization purposes in plots/.
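A corresponding sketch of the full model, reusing the TransformerLayer sketch above (constructor argument names such as vocab_size and num_positions are assumed rather than the exact skeleton signature, and the positional-embedding line can be dropped for Part 0):

    class Transformer(nn.Module):
        def __init__(self, vocab_size, num_positions, d_model, d_internal,
                     num_classes, num_layers):
            super().__init__()
            self.char_embed = nn.Embedding(vocab_size, d_model)
            self.pos_embed = nn.Embedding(num_positions, d_model)  # stand-in for PositionalEncoding
            self.layers = nn.ModuleList(
                [TransformerLayer(d_model, d_internal) for _ in range(num_layers)])
            self.out = nn.Linear(d_model, num_classes)

        def forward(self, indices):
            # indices: [seq_len] LongTensor of character indices
            positions = torch.arange(indices.shape[0])
            x = self.char_embed(indices) + self.pos_embed(positions)
            attn_maps = []
            for layer in self.layers:
                x, attn = layer(x)
                attn_maps.append(attn)
            # Log probabilities at every position, plus attention maps for plotting
            log_probs = torch.log_softmax(self.out(x), dim=-1)  # [seq_len, num_classes]
            return log_probs, attn_maps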
Training follows previous assignments. A skeleton is provided in train_classifier. We have already formed input/output tensors inside LetterCountingExample, so you can use these as your inputs and outputs. Whatever training code you used for Assignment 2 should likely work here too, with the major change being the need to make predictions at all timesteps simultaneously and accumulate losses over all of them at once. NLLLoss can help with computing a bulk loss over the entire sequence.
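A rough training-loop sketch is below; the function signature, the .input_tensor/.output_tensor field names, and the hyperparameters are placeholders rather than the exact skeleton interface:

    import random
    import torch.optim as optim

    def train_classifier(train_examples, model, num_epochs=10, lr=1e-4):
        optimizer = optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.NLLLoss()  # consumes log probabilities
        for epoch in range(num_epochs):
            random.shuffle(train_examples)
            total_loss = 0.0
            for ex in train_examples:
                optimizer.zero_grad()
                log_probs, _ = model(ex.input_tensor)      # [seq_len, num_classes]
                # One bulk loss over every position in the sequence at once
                loss = loss_fn(log_probs, ex.output_tensor)  # target: [seq_len] labels
                loss.backward()
                optimizer.step()
                total_loss += loss.item()
            print("epoch %d: total loss %.4f" % (epoch, total_loss))
        return model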
Without positional encodings, your model may struggle a bit, but you should be able to get at least 85%
accuracy with a single-layer Transformer in a few epochs of training. The attention maps should also show
some evidence of the model attending to the characters in context.
Part 1 (50 points) Now extend your Transformer classifier with positional encodings and address the main task: identifying, for each letter, the number of letters of the same type preceding it. Run this with python letter_counting.py, with no other arguments. Without positional encodings, the model simply sees a bag of characters and cannot distinguish letters occurring later or earlier in the sentence (although loss will still decrease and something can still be learned).
We provide a PositionalEncoding module that you can use: this initializes an nn.Embedding layer, embeds the index of each character, then adds these to the actual character embeddings. If the input sequence is "the", then the embedding of the first token would be embed_char(t) + embed_pos(0), and the embedding of the second token would be embed_char(h) + embed_pos(1).
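The described behavior amounts to something like the following sketch (illustration only; the provided module may differ in details such as batching or dropout):

    class PositionalEncoding(nn.Module):
        def __init__(self, d_model, num_positions=20):
            super().__init__()
            self.pos_embed = nn.Embedding(num_positions, d_model)

        def forward(self, x):
            # x: [seq_len, d_model] character embeddings
            positions = torch.arange(x.shape[0])
            # embed_char(c_i) + embed_pos(i) for each position i
            return x + self.pos_embed(positions)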
Your final implementation should get over 95% accuracy on this task. Our reference implementation achieves over 98% accuracy in 5-10 epochs of training (about 20 seconds each) using 1-2 single-head Transformer layers; there is some variance, and results can depend on initialization. Also note that the
autograder trains your model on an additional task as well. You will fail this hidden test if your model
uses anything hardcoded about these labels (or if you try to cheat and just return the correct answer that you
computed by directly counting letters yourself), but any implementation that works for this problem will
work for the hidden test.
