Part 2: Transformer for Language Modeling (50 points)
In this second part, you will implement a Transformer language model. This should build heavily off of
what you did for Part 1, although for this part you are allowed to use off-the-shelf Transformer components.
For this part, we use the first 100,000 characters of text8 as the training set. The development set is
500 characters taken from elsewhere in the collection. Your model will need to be able to consume a chunk
of characters and make predictions of the next character at each position simultaneously. Structurally, this
looks exactly like Part 1, although with 27 output classes instead of 3.
Getting started Run:
python lm.py
This loads the data, instantiates a UniformLanguageModel which assigns each character an equal 1/27 probability, and evaluates it on the development set. This model achieves a total log probability of -1644, an average log probability (per token) of -3.296, and a perplexity of 27. Note that exponentiating the average log probability gives you 1/27 in this case, which is the inverse of perplexity.
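To see how these quantities relate, here is a quick check (values match the uniform model's printout up to rounding):

```python
import math

# Uniform model over 27 characters: every prediction has probability 1/27.
avg_log_prob = math.log(1.0 / 27.0)   # ~ -3.296, the per-token average log probability
perplexity = math.exp(-avg_log_prob)  # = 27.0, the inverse of exp(avg_log_prob)
print(avg_log_prob, perplexity)
```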
The NeuralLanguageModel class you are given has one method: get_next_char_log_probs.
It takes a context and returns the log probability distribution over the next characters given that context as a
numpy vector of length equal to the vocabulary size.
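For instance, completing this class might look roughly like the sketch below. It assumes a trained module stored in self.model that maps a [1, seq_len] tensor of character indices to [1, seq_len, vocab_size] log probabilities, and an Indexer-style vocab_index with an index_of method; these are assumptions about how you wire things up, not requirements.

```python
import torch

class NeuralLanguageModel(LanguageModel):
    def __init__(self, model, vocab_index):
        self.model = model              # your trained Transformer LM module
        self.vocab_index = vocab_index  # character <-> index mapping from the starter code

    def get_next_char_log_probs(self, context):
        self.model.eval()
        with torch.no_grad():
            # Prepend space as the start-of-sequence character (see "Start of sequence" below).
            chars = " " + context
            indices = torch.LongTensor([self.vocab_index.index_of(c) for c in chars])
            log_probs = self.model(indices.unsqueeze(0))  # assumed shape: [1, len, vocab_size]
            # The distribution at the final position is over the character following the context.
            return log_probs[0, -1].numpy()
```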
Part 2 Deliverable Implement a Transformer language model. This will require: defining a PyTorch module to handle language model prediction, implementing training of that module in train_lm, and finally
completing the definition of NeuralLanguageModel appropriately to use this module for prediction.
Your network should take a chunk of indexed characters as input, embed them, put them through a Transformer, and make predictions from the final layer outputs.
Your final model must pass the sanity and normalization checks, get a perplexity value less than or
equal to 7, and train in less than 10 minutes. Our Transformer reference implementation gets a perplexity
of 6.3 in about 6 minutes of training. However, this is an unoptimized, unbatched implementation and you
can likely do better.
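One rough sketch of the training side is shown below. It reuses the TransformerLM module sketched in the next section and the NeuralLanguageModel wrapper above; the train_lm signature, chunk length, epoch count, and learning rate are illustrative and should be checked against the starter code rather than taken as given.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train_lm(args, train_text, dev_text, vocab_index):
    # Illustrative trainer: chunk the stream, then do teacher-forced next-character prediction.
    chunk_len = 20
    model = TransformerLM(vocab_size=27)  # 27 characters: a-z plus space
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.NLLLoss()                # the model outputs log probabilities
    model.train()
    for epoch in range(5):
        for start in range(0, len(train_text) - chunk_len, chunk_len):
            chunk = train_text[start:start + chunk_len]
            # Input is space + first chunk_len-1 characters; target is the full chunk.
            inp = torch.LongTensor([vocab_index.index_of(c) for c in " " + chunk[:-1]]).unsqueeze(0)
            tgt = torch.LongTensor([vocab_index.index_of(c) for c in chunk]).unsqueeze(0)
            log_probs = model(inp)        # [1, chunk_len, vocab_size]
            loss = loss_fn(log_probs.squeeze(0), tgt.squeeze(0))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return NeuralLanguageModel(model, vocab_index)
```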
Network structure You can use a similar input layer (Embedding followed by the PositionalEncoding from Part 1) to encode the character indices. You can then use your Transformer architecture from Part 1, or you can use a real nn.TransformerEncoder, which is made up of TransformerEncoderLayers.
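A minimal sketch along these lines is shown below; it already includes the causal mask discussed next. PositionalEncoding is your class from Part 1, and the hyperparameter values are illustrative rather than prescribed.

```python
import torch
import torch.nn as nn

class TransformerLM(nn.Module):
    def __init__(self, vocab_size=27, d_model=128, nhead=4, num_layers=3, d_ff=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model)  # your class from Part 1
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=d_ff, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.output = nn.Linear(d_model, vocab_size)
        self.log_softmax = nn.LogSoftmax(dim=-1)

    def forward(self, indices):
        # indices: [batch, seq_len] of character indices
        x = self.pos_encoding(self.embedding(indices))
        # Causal mask: -inf above the diagonal so position i cannot attend to positions > i.
        seq_len = indices.shape[1]
        mask = torch.triu(torch.full((seq_len, seq_len), float('-inf'), device=indices.device), diagonal=1)
        x = self.encoder(x, mask=mask)
        return self.log_softmax(self.output(x))  # [batch, seq_len, vocab_size]
```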
Note that unlike the Transformer encoder you used in Part 1, for Part 2 you must be careful to use a causal mask for the attention: tokens should not be able to attend to tokens occurring after them in the sentence, or else the model can easily cheat (consider that if token n attends to token n+1, the model can store the identity of token n+1 in the nth position and predict it at the output layer). Fortunately it should be very easy to spot this, as your perplexity will get very close to 1 very quickly and you will fail the sanity check.
You can use the mask argument in TransformerEncoder and pass in a triangular matrix of zeros /
negative infinities to prevent this.
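As one way to build such a matrix (a sketch using torch.triu, which keeps only the strictly upper triangle), the additive mask for a length-4 chunk looks like this:

```python
import torch

def causal_mask(seq_len):
    # Strictly upper-triangular entries are -inf (blocked); everything else is 0 (allowed).
    return torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)

print(causal_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0.,   0., -inf, -inf],
#         [0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.]])
```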
Training on chunks Unlike in Part 1, you are presented with data in a long, continuous stream of characters. Nevertheless, your network should process a chunk of characters at a time, simultaneously predicting
the next character at each index in the chunk.
You'll have to decide how you want to chunk the data for both training and inference. Given a chunk,
you can either train just on that chunk or include a few extra tokens for context and not compute loss over
those positions. This can improve performance a bit because every prediction now has meaningful context,
but may only make a minor difference in the end.
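If you do include extra left context, one way to handle it (a sketch with illustrative sizes and stand-in tensors) is to feed the longer sequence through the model but compute the loss only over the positions belonging to the chunk itself:

```python
import torch
import torch.nn as nn

context_len, chunk_len, vocab_size = 5, 20, 27
loss_fn = nn.NLLLoss()
# Stand-ins for real model outputs and gold labels on context + chunk:
log_probs = torch.randn(1, context_len + chunk_len, vocab_size).log_softmax(dim=-1)
targets = torch.randint(vocab_size, (1, context_len + chunk_len))
# Score only the final chunk_len positions, so the context tokens provide context but no loss.
loss = loss_fn(log_probs[0, context_len:], targets[0, context_len:])
```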
Start of sequence In general, the beginning of any sequence is represented to the language model by a
special start-of-sequence token. For simplicity, we are going to overload space and use that as the start-of-sequence character. That is, when given a chunk of 20 characters, you want to feed space plus the first 19 characters into the model and predict all 20 characters.
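Concretely, for one hypothetical 20-character chunk the input/target strings line up as follows:

```python
chunk = "the quick brown fox "   # 20 characters taken from the training stream
input_chars = " " + chunk[:-1]   # space (start of sequence) + the first 19 characters
target_chars = chunk             # the model predicts all 20 characters, one per position
assert len(input_chars) == len(target_chars) == 20
```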
Evaluation Unlike past assignments where you are evaluated on correctness of predictions, in this case
your model is evaluated on perplexity and likelihood, which rely on the probabilities that your model returns.
Your model must be a correct implementation of a language model. Correct in this case means that it must represent a probability distribution P(w_i | w_1, ..., w_{i-1}). You should be sure to check that your model's output is indeed a legal probability distribution over the next character.
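On top of the provided sanity and normalization checks, a quick check like the following can catch bugs early (lm here is assumed to be your trained NeuralLanguageModel):

```python
import numpy as np

log_probs = lm.get_next_char_log_probs("the quick brown fox ")
assert log_probs.shape == (27,)                  # one entry per character in the vocabulary
assert np.isclose(np.exp(log_probs).sum(), 1.0)  # exponentiated log probs must sum to 1
```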
Batching Batching across multiple sequences can further increase the speed of training. While you do not
need to do this to complete the assignment, you may find the speedups helpful. As in Assignment 2, you
should be able to do this by increasing the dimension of your tensors by 1.
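As a sketch, batching just adds a leading batch dimension to the shapes the model already handles (model here is the module from the earlier sketch, and the sizes are illustrative):

```python
import torch

batch_size, chunk_len, vocab_size = 16, 20, 27
batch_inputs = torch.randint(vocab_size, (batch_size, chunk_len))  # [batch, seq_len]
log_probs = model(batch_inputs)                                    # [batch, seq_len, vocab_size]
```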
