1 Problem 1: Transformer Questions (60 points)
The Transformer model comes from the paper "Attention Is All You Need" [1]
and has achieved great success in the Natural Language Processing (NLP)
domain. It offers several advantages over earlier RNN models, such as parallel
processing and long-range context dependencies. BERT [2] and GPT-2 [3] are
two famous extensions of the Transformer. You are going to answer several
questions testing your understanding of the Transformer, BERT, and GPT.
The core idea of the self-attention operation is scaled dot-product attention.
The input features are first transformed into three matrices Q, K, and V,
and self-attention is then calculated as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
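For concreteness, here is a minimal NumPy sketch of the formula above,
assuming toy single-head shapes with no batching or masking (the function name
and dimensions are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention, as in the formula above.

    Toy single-head shapes: Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # compatibility scores, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # each output row is a weighted sum of value rows

# Toy usage: 4 query positions, 6 key/value positions, d_k = d_v = 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```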
Q1 (10 points): In the above self-attention operation, why do we need to
incorporate the scaling factor $\frac{1}{\sqrt{d_k}}$ into the calculation?
Q2 (10 points): When we train the Transformer on word sequences, we usually
need to add an additional positional embedding for each word. Why is this
necessary?
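As background for this question, the fixed sinusoidal positional encoding used
in [1] can be sketched as follows (learned positional embeddings are an
alternative; the function and variable names here are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed positional encoding from the Transformer paper:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    Assumes d_model is even.
    """
    pos = np.arange(max_len)[:, None]      # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe  # added element-wise to the word embeddings

print(sinusoidal_positional_encoding(max_len=50, d_model=16).shape)  # (50, 16)
```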
Q3 (10 points): In the Transformer framework, there are two types of attention
modules: self-attention and encoder-decoder attention. What is the difference
between these two modules in terms of functionality and technical
implementation?
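As a shape-level illustration of the distinction this question asks about, the
toy sketch below contrasts where Q, K, and V come from in each module (the
projection matrices and variable names are illustrative, not taken from [1]):

```python
import numpy as np

def attention(Q, K, V):
    # Same scaled dot-product attention as in the first sketch.
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
d = 8
enc = rng.normal(size=(6, d))  # encoder output (source sequence, 6 tokens)
dec = rng.normal(size=(4, d))  # decoder states (target sequence, 4 tokens)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Self-attention: Q, K, V are all projections of the SAME sequence.
self_attn = attention(dec @ Wq, dec @ Wk, dec @ Wv)   # (4, d)

# Encoder-decoder attention: queries come from the decoder,
# keys and values come from the encoder output.
cross_attn = attention(dec @ Wq, enc @ Wk, enc @ Wv)  # (4, d)
print(self_attn.shape, cross_attn.shape)
```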
Q4 (10 points): There are also other types of attention calculation, such as
additive attention [4]. Additive attention computes the compatibility function
using a feed-forward network with a single hidden layer. In the Transformer
model, why did the authors choose scaled dot-product attention instead of
additive attention, and what are its main advantages?
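For context, here is a minimal sketch of the additive (Bahdanau-style)
compatibility function described in [4], computed with a one-hidden-layer
feed-forward network (the parameter names W1, W2, and v are illustrative):

```python
import numpy as np

def additive_attention_scores(Q, K, W1, W2, v):
    # One-hidden-layer MLP compatibility: score(q, k) = v . tanh(W1 q + W2 k)
    hidden = np.tanh(Q[:, None, :] @ W1.T + K[None, :, :] @ W2.T)  # (n_q, n_k, d_h)
    return hidden @ v                                              # (n_q, n_k)

rng = np.random.default_rng(0)
d, d_h = 8, 16
Q, K = rng.normal(size=(4, d)), rng.normal(size=(6, d))
W1, W2 = rng.normal(size=(d_h, d)), rng.normal(size=(d_h, d))
v = rng.normal(size=d_h)
# Scores would then be normalized with a softmax over the key dimension.
print(additive_attention_scores(Q, K, W1, W2, v).shape)  # (4, 6)
```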
Q5 (10 points): BERT and GPT pretrain their models on large-scale datasets in
a self-supervised way. Please describe their pretraining tasks and discuss why
they are useful.
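As one concrete reference point, BERT's masked-language-model objective
corrupts roughly 15% of input tokens using the 80/10/10 scheme described in
[2] (GPT, by contrast, is pretrained with left-to-right next-token
prediction). The toy snippet below illustrates the corruption step only; it is
not the authors' code:

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary

def mask_for_mlm(tokens, mask_prob=0.15, seed=0):
    """Toy masked-LM corruption: select ~15% of tokens; of those,
    80% -> [MASK], 10% -> random token, 10% -> unchanged. The model is
    trained to predict the ORIGINAL token at each selected position."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok  # prediction target at this position
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)  # random replacement
            # else: keep the original token
    return corrupted, targets

print(mask_for_mlm(["the", "cat", "sat", "on", "the", "mat"]))
```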
Q6 (10 points): In the BERT model design, there are two special tokens, [CLS]
and [SEP]. What is the purpose of these two special tokens, and how are they
used during training and evaluation?
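To make the setup concrete, here is a toy sketch of how a single- or
two-sentence input is packed with these tokens, following the input format
described in [2] (the helper name is hypothetical):

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Toy illustration of BERT's input packing:
    [CLS] sentence A [SEP] (sentence B [SEP]), plus segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # segment 0 = first sentence
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # segment 1 = second sentence
    return tokens, segment_ids

print(build_bert_input(["my", "dog", "barks"], ["it", "is", "loud"]))
# (['[CLS]', 'my', 'dog', 'barks', '[SEP]', 'it', 'is', 'loud', '[SEP]'],
#  [0, 0, 0, 0, 0, 1, 1, 1, 1])
```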