
Part 0 (not graded) Implement Transformer and TransformerLayer for the BEFOREAFTER version of the task. You should identify the number of other letters of the same type in the sequence. This will require implementing both Transformer and TransformerLayer, as well as training in train_classifier.
Your Part 1 solutions should not use nn.TransformerEncoder, nn.TransformerDecoder, or
any other off-the-shelf self-attention layers. You should only use Linear, softmax, and standard nonlinearities to implement Transformers from scratch.
TransformerLayer This layer should follow the format discussed in class: (1) self-attention (single-headed is fine; you can use either backward-only or bidirectional attention); (2) residual connection; (3) Linear layer, nonlinearity, and Linear layer; (4) final residual connection. With a shallow network like this, you likely don't need layer normalization, which is a bit more complicated to implement. Because this task is relatively simple, you don't need a very well-tuned architecture to make this work. You will implement all of these components from scratch.
You will want to form query, key, and value matrices with linear layers, then use the queries and keys to compute attention over the sentence, then combine with the values. You'll want to use matmul for this purpose, and you may need to transpose matrices as well. Double-check your dimensions and make sure everything is happening over the correct dimension. Furthermore, the division by √d_k from the attention paper may help stabilize and improve training, so don't forget it!
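As a minimal sketch of such a layer (the constructor arguments d_model and d_internal and the choice of ReLU are assumptions here, not part of the provided skeleton):

    import math
    import torch
    import torch.nn as nn

    class TransformerLayer(nn.Module):
        def __init__(self, d_model, d_internal):
            super().__init__()
            # Linear maps that form queries, keys, and values from the input
            self.W_q = nn.Linear(d_model, d_internal)
            self.W_k = nn.Linear(d_model, d_internal)
            self.W_v = nn.Linear(d_model, d_model)
            # Position-wise feedforward: Linear -> nonlinearity -> Linear
            self.ff = nn.Sequential(nn.Linear(d_model, d_internal),
                                    nn.ReLU(),
                                    nn.Linear(d_internal, d_model))

        def forward(self, x):
            # x: [seq_len, d_model]
            q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
            # Scaled dot-product attention over the whole sequence (bidirectional here)
            scores = torch.matmul(q, k.transpose(0, 1)) / math.sqrt(q.shape[-1])
            attn = torch.softmax(scores, dim=-1)   # [seq_len, seq_len]
            x = x + torch.matmul(attn, v)          # first residual connection
            x = x + self.ff(x)                     # second residual connection
            return x, attn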
Transformer Building the Transformer will involve: (1) adding positional encodings to the input (see the PositionalEncoding class, though we recommend leaving these out for now); (2) using one or more of your TransformerLayers; (3) using Linear and softmax layers to make the prediction. Unlike Assignment 2, you are simultaneously making predictions over every position in the sequence. Your network should return the log probabilities at the output layer (a 20x3 matrix) as well as the attentions you compute, which are then plotted for you for visualization purposes in plots/.
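A corresponding sketch of the full model, reusing the TransformerLayer sketch above (constructor argument names such as vocab_size and num_positions are assumed rather than the exact skeleton signature, and the positional-embedding line can be dropped for Part 0):

    class Transformer(nn.Module):
        def __init__(self, vocab_size, num_positions, d_model, d_internal,
                     num_classes, num_layers):
            super().__init__()
            self.char_embed = nn.Embedding(vocab_size, d_model)
            self.pos_embed = nn.Embedding(num_positions, d_model)  # stand-in for PositionalEncoding
            self.layers = nn.ModuleList(
                [TransformerLayer(d_model, d_internal) for _ in range(num_layers)])
            self.out = nn.Linear(d_model, num_classes)

        def forward(self, indices):
            # indices: [seq_len] LongTensor of character indices
            positions = torch.arange(indices.shape[0])
            x = self.char_embed(indices) + self.pos_embed(positions)
            attn_maps = []
            for layer in self.layers:
                x, attn = layer(x)
                attn_maps.append(attn)
            # Log probabilities at every position, plus attention maps for plotting
            log_probs = torch.log_softmax(self.out(x), dim=-1)  # [seq_len, num_classes]
            return log_probs, attn_maps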
Training follows previous assignments. A skeleton is provided in train_classifier. We have already formed input/output tensors inside LetterCountingExample, so you can use these as your inputs and outputs. Whatever training code you used for Assignment 2 should likely work here too, with the major change being the need to make predictions at all timesteps simultaneously and accumulate losses over all of them at once. NLLLoss can help with computing a bulk loss over the entire sequence.
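A rough training-loop sketch is below; the function signature, the .input_tensor/.output_tensor field names, and the hyperparameters are placeholders rather than the exact skeleton interface:

    import random
    import torch.optim as optim

    def train_classifier(train_examples, model, num_epochs=10, lr=1e-4):
        optimizer = optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.NLLLoss()  # consumes log probabilities
        for epoch in range(num_epochs):
            random.shuffle(train_examples)
            total_loss = 0.0
            for ex in train_examples:
                optimizer.zero_grad()
                log_probs, _ = model(ex.input_tensor)      # [seq_len, num_classes]
                # One bulk loss over every position in the sequence at once
                loss = loss_fn(log_probs, ex.output_tensor)  # target: [seq_len] labels
                loss.backward()
                optimizer.step()
                total_loss += loss.item()
            print("epoch %d: total loss %.4f" % (epoch, total_loss))
        return model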
Without positional encodings, your model may struggle a bit, but you should be able to get at least 85%
accuracy with a single-layer Transformer in a few epochs of training. The attention maps should also show
some evidence of the model attending to the characters in context.
Part 1 (50 points) Now extend your Transformer classifier with positional encodings and address the main task: identifying, for each letter, the number of letters of the same type preceding it. Run this with python letter_counting.py, with no other arguments. Without positional encodings, the model simply sees a bag of characters and cannot distinguish letters occurring later or earlier in the sentence (although loss will still decrease and something can still be learned).
We provide a PositionalEncoding module that you can use: this initializes an nn.Embedding layer, embeds the index of each character, then adds these to the actual character embeddings. If the input sequence is "the", then the embedding of the first token would be embed_char(t) + embed_pos(0), and the embedding of the second token would be embed_char(h) + embed_pos(1).
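The described behavior amounts to something like the following sketch (illustration only; the provided module may differ in details such as batching or dropout):

    class PositionalEncoding(nn.Module):
        def __init__(self, d_model, num_positions=20):
            super().__init__()
            self.pos_embed = nn.Embedding(num_positions, d_model)

        def forward(self, x):
            # x: [seq_len, d_model] character embeddings
            positions = torch.arange(x.shape[0])
            # embed_char(c_i) + embed_pos(i) for each position i
            return x + self.pos_embed(positions)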
Your final implementation should get over 95% accuracy on this task. Our reference implementation achieves over 98% accuracy in 5-10 epochs of training (about 20 seconds each) using 1-2 single-head Transformer layers; there is some variance, and results can depend on initialization. Also note that the
autograder trains your model on an additional task as well. You will fail this hidden test if your model
uses anything hardcoded about these labels (or if you try to cheat and just return the correct answer that you
computed by directly counting letters yourself), but any implementation that works for this problem will
work for the hidden test.
