Part 2: Transformer for Language Modeling (50 points)
In this second part, you will implement a Transformer language model. This should build heavily off of
what you did for Part 1, although for this part you are allowed to use off-the-shelf Transformer components.
For this part, we use the first 100,000 characters of text8 as the training set. The development set is
500 characters taken from elsewhere in the collection. Your model will need to be able to consume a chunk
of characters and make predictions of the next character at each position simultaneously. Structurally, this
looks exactly like Part 1, although with 27 output classes instead of 3.
Getting started Run:
python lm.py
This loads the data, instantiates a UniformLanguageModel which assigns each character an equal 1/27 probability, and evaluates it on the development set. This model achieves a total log probability of -1644, an average log probability (per token) of -3.296, and a perplexity of 27. Note that exponentiating the average log probability gives you 1/27 in this case, which is the inverse of perplexity.
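To see how these quantities relate, here is a quick check (values match the uniform model's printout up to rounding):

```python
import math

# Uniform model over 27 characters: every prediction has probability 1/27.
avg_log_prob = math.log(1.0 / 27.0)   # ~ -3.296, the per-token average log probability
perplexity = math.exp(-avg_log_prob)  # = 27.0, the inverse of exp(avg_log_prob)
print(avg_log_prob, perplexity)
```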
The NeuralLanguageModel class you are given has one method: get_next_char_log_probs.
It takes a context and returns the log probability distribution over the next characters given that context as a
numpy vector of length equal to the vocabulary size.
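For instance, completing this class might look roughly like the sketch below. It assumes a trained module stored in self.model that maps a [1, seq_len] tensor of character indices to [1, seq_len, vocab_size] log probabilities, and an Indexer-style vocab_index with an index_of method; these are assumptions about how you wire things up, not requirements.

```python
import torch

class NeuralLanguageModel(LanguageModel):
    def __init__(self, model, vocab_index):
        self.model = model              # your trained Transformer LM module
        self.vocab_index = vocab_index  # character <-> index mapping from the starter code

    def get_next_char_log_probs(self, context):
        self.model.eval()
        with torch.no_grad():
            # Prepend space as the start-of-sequence character (see "Start of sequence" below).
            chars = " " + context
            indices = torch.LongTensor([self.vocab_index.index_of(c) for c in chars])
            log_probs = self.model(indices.unsqueeze(0))  # assumed shape: [1, len, vocab_size]
            # The distribution at the final position is over the character following the context.
            return log_probs[0, -1].numpy()
```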
Part 2 Deliverable Implement a Transformer language model. This will require: defining a PyTorch module to handle language model prediction, implementing training of that module in train_lm, and finally
completing the definition of NeuralLanguageModel appropriately to use this module for prediction.
Your network should take a chunk of indexed characters as input, embed them, put them through a Transformer, and make predictions from the final layer outputs.
Your final model must pass the sanity and normalization checks, get a perplexity value less than or
equal to 7, and train in less than 10 minutes. Our Transformer reference implementation gets a perplexity
of 6.3 in about 6 minutes of training. However, this is an unoptimized, unbatched implementation and you
can likely do better.
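One rough sketch of the training side is shown below. It reuses the TransformerLM module sketched in the next section and the NeuralLanguageModel wrapper above; the train_lm signature, chunk length, epoch count, and learning rate are illustrative and should be checked against the starter code rather than taken as given.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train_lm(args, train_text, dev_text, vocab_index):
    # Illustrative trainer: chunk the stream, then do teacher-forced next-character prediction.
    chunk_len = 20
    model = TransformerLM(vocab_size=27)  # 27 characters: a-z plus space
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.NLLLoss()                # the model outputs log probabilities
    model.train()
    for epoch in range(5):
        for start in range(0, len(train_text) - chunk_len, chunk_len):
            chunk = train_text[start:start + chunk_len]
            # Input is space + first chunk_len-1 characters; target is the full chunk.
            inp = torch.LongTensor([vocab_index.index_of(c) for c in " " + chunk[:-1]]).unsqueeze(0)
            tgt = torch.LongTensor([vocab_index.index_of(c) for c in chunk]).unsqueeze(0)
            log_probs = model(inp)        # [1, chunk_len, vocab_size]
            loss = loss_fn(log_probs.squeeze(0), tgt.squeeze(0))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return NeuralLanguageModel(model, vocab_index)
```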
Network structure You can use a similar input layer (Embedding followed by the PositionalEncoding from Part 1) to encode the character indices. You can then use your Transformer architecture from Part 1, or you can use a real nn.TransformerEncoder, which is made up of TransformerEncoderLayers.
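A minimal sketch along these lines is shown below; it already includes the causal mask discussed next. PositionalEncoding is your class from Part 1, and the hyperparameter values are illustrative rather than prescribed.

```python
import torch
import torch.nn as nn

class TransformerLM(nn.Module):
    def __init__(self, vocab_size=27, d_model=128, nhead=4, num_layers=3, d_ff=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model)  # your class from Part 1
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=d_ff, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.output = nn.Linear(d_model, vocab_size)
        self.log_softmax = nn.LogSoftmax(dim=-1)

    def forward(self, indices):
        # indices: [batch, seq_len] of character indices
        x = self.pos_encoding(self.embedding(indices))
        # Causal mask: -inf above the diagonal so position i cannot attend to positions > i.
        seq_len = indices.shape[1]
        mask = torch.triu(torch.full((seq_len, seq_len), float('-inf'), device=indices.device), diagonal=1)
        x = self.encoder(x, mask=mask)
        return self.log_softmax(self.output(x))  # [batch, seq_len, vocab_size]
```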
Note that unlike the Transformer encoder you used in Part 1, for Part 2 you must be careful to use a causal mask for the attention: tokens should not be able to attend to tokens occurring after them in the sentence, or else the model can easily cheat (consider that if token n attends to token n+1, the model can store the identity of token n+1 in the nth position and predict it at the output layer). Fortunately it should be very easy to spot this, as your perplexity will get very close to 1 very quickly and you will fail the sanity check.
You can use the mask argument in TransformerEncoder and pass in a triangular matrix of zeros /
negative infinities to prevent this.
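As one way to build such a matrix (a sketch using torch.triu, which keeps only the strictly upper triangle), the additive mask for a length-4 chunk looks like this:

```python
import torch

def causal_mask(seq_len):
    # Strictly upper-triangular entries are -inf (blocked); everything else is 0 (allowed).
    return torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)

print(causal_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0.,   0., -inf, -inf],
#         [0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.]])
```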
Training on chunks Unlike in Part 1, you are presented with data in a long, continuous stream of characters. Nevertheless, your network should process a chunk of characters at a time, simultaneously predicting
the next character at each index in the chunk.
You'll have to decide how you want to chunk the data for both training and inference. Given a chunk,
you can either train just on that chunk or include a few extra tokens for context and not compute loss over
those positions. This can improve performance a bit because every prediction now has meaningful context,
but may only make a minor difference in the end.
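If you do include extra left context, one way to handle it (a sketch with illustrative sizes and stand-in tensors) is to feed the longer sequence through the model but compute the loss only over the positions belonging to the chunk itself:

```python
import torch
import torch.nn as nn

context_len, chunk_len, vocab_size = 5, 20, 27
loss_fn = nn.NLLLoss()
# Stand-ins for real model outputs and gold labels on context + chunk:
log_probs = torch.randn(1, context_len + chunk_len, vocab_size).log_softmax(dim=-1)
targets = torch.randint(vocab_size, (1, context_len + chunk_len))
# Score only the final chunk_len positions, so the context tokens provide context but no loss.
loss = loss_fn(log_probs[0, context_len:], targets[0, context_len:])
```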
Start of sequence In general, the beginning of any sequence is represented to the language model by a
special start-of-sequence token. For simplicity, we are going to overload space and use that as the start-of-sequence character. That is, when given a chunk of 20 characters, you want to feed space plus the first 19 characters into the model and predict all 20 characters.
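Concretely, for one hypothetical 20-character chunk the input/target strings line up as follows:

```python
chunk = "the quick brown fox "   # 20 characters taken from the training stream
input_chars = " " + chunk[:-1]   # space (start of sequence) + the first 19 characters
target_chars = chunk             # the model predicts all 20 characters, one per position
assert len(input_chars) == len(target_chars) == 20
```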
Evaluation Unlike past assignments where you are evaluated on correctness of predictions, in this case
your model is evaluated on perplexity and likelihood, which rely on the probabilities that your model returns.
Your model must be a correct implementation of a language model. Correct in this case means that it must represent a probability distribution P(w_i | w_1, ..., w_{i-1}). You should be sure to check that your model's output is indeed a legal probability distribution over the next character.
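On top of the provided sanity and normalization checks, a quick check like the following can catch bugs early (lm here is assumed to be your trained NeuralLanguageModel):

```python
import numpy as np

log_probs = lm.get_next_char_log_probs("the quick brown fox ")
assert log_probs.shape == (27,)                  # one entry per character in the vocabulary
assert np.isclose(np.exp(log_probs).sum(), 1.0)  # exponentiated log probs must sum to 1
```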
Batching Batching across multiple sequences can further increase the speed of training. While you do not
need to do this to complete the assignment, you may find the speedups helpful. As in Assignment 2, you
should be able to do this by increasing the dimension of your tensors by 1.
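As a sketch, batching just adds a leading batch dimension to the shapes the model already handles (model here is the module from the earlier sketch, and the sizes are illustrative):

```python
import torch

batch_size, chunk_len, vocab_size = 16, 20, 27
batch_inputs = torch.randint(vocab_size, (batch_size, chunk_len))  # [batch, seq_len]
log_probs = model(batch_inputs)                                    # [batch, seq_len, vocab_size]
```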
