1 Problem 1: Transformer Questions (60 points)
The Transformer model comes from the paper "Attention Is All You Need" [1]
and has achieved great success in the Natural Language Processing (NLP)
domain. It offers several advantages over earlier RNN models, such as parallel
processing and long-range context dependencies. BERT [2] and GPT-2 [3] are
two famous extensions of the Transformer. You are going to answer several
questions testing your understanding of the Transformer, BERT, and GPT.
The core idea of the self-attention operation is scaled dot-product attention.
The input features are first transformed into three matrices Q, K, and V,
and self-attention is then calculated as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
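For concreteness, here is a minimal NumPy sketch of the formula above,
assuming toy single-head shapes with no batching or masking (the function name
and dimensions are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention, as in the formula above.

    Toy single-head shapes: Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # compatibility scores, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # each output row is a weighted sum of value rows

# Toy usage: 4 query positions, 6 key/value positions, d_k = d_v = 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```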
Q1 (10 points): In the above self-attention operation, why do we need to
incorporate the scaling factor $\frac{1}{\sqrt{d_k}}$ into the calculation?
Q2 (10 points): When we train the Transformer on word sequences, we usually
need to add an additional positional embedding for each word. Why is this
necessary?
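As background for this question, the fixed sinusoidal positional encoding used
in [1] can be sketched as follows (learned positional embeddings are an
alternative; the function and variable names here are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed positional encoding from the Transformer paper:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    Assumes d_model is even.
    """
    pos = np.arange(max_len)[:, None]      # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe  # added element-wise to the word embeddings

print(sinusoidal_positional_encoding(max_len=50, d_model=16).shape)  # (50, 16)
```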
Q3 (10 points): In the Transformer framework, there are two types of attention
modules: self-attention and encoder-decoder attention. What is the difference
between these two modules in terms of functionality and technical
implementation?
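As a shape-level illustration of the distinction this question asks about, the
toy sketch below contrasts where Q, K, and V come from in each module (the
projection matrices and variable names are illustrative, not taken from [1]):

```python
import numpy as np

def attention(Q, K, V):
    # Same scaled dot-product attention as in the first sketch.
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
d = 8
enc = rng.normal(size=(6, d))  # encoder output (source sequence, 6 tokens)
dec = rng.normal(size=(4, d))  # decoder states (target sequence, 4 tokens)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Self-attention: Q, K, V are all projections of the SAME sequence.
self_attn = attention(dec @ Wq, dec @ Wk, dec @ Wv)   # (4, d)

# Encoder-decoder attention: queries come from the decoder,
# keys and values come from the encoder output.
cross_attn = attention(dec @ Wq, enc @ Wk, enc @ Wv)  # (4, d)
print(self_attn.shape, cross_attn.shape)
```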
Q4 (10 points): There are also other types of attention calculation, such as
additive attention [4]. Additive attention computes the compatibility function
using a feed-forward network with a single hidden layer. In the Transformer
model, why did the authors choose scaled dot-product attention instead of
additive attention, and what are its main advantages?
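For context, here is a minimal sketch of the additive (Bahdanau-style)
compatibility function described in [4], computed with a one-hidden-layer
feed-forward network (the parameter names W1, W2, and v are illustrative):

```python
import numpy as np

def additive_attention_scores(Q, K, W1, W2, v):
    # One-hidden-layer MLP compatibility: score(q, k) = v . tanh(W1 q + W2 k)
    hidden = np.tanh(Q[:, None, :] @ W1.T + K[None, :, :] @ W2.T)  # (n_q, n_k, d_h)
    return hidden @ v                                              # (n_q, n_k)

rng = np.random.default_rng(0)
d, d_h = 8, 16
Q, K = rng.normal(size=(4, d)), rng.normal(size=(6, d))
W1, W2 = rng.normal(size=(d_h, d)), rng.normal(size=(d_h, d))
v = rng.normal(size=d_h)
# Scores would then be normalized with a softmax over the key dimension.
print(additive_attention_scores(Q, K, W1, W2, v).shape)  # (4, 6)
```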
Q5 (10 points): BERT and GPT pretrain their models on large-scale datasets in
a self-supervised way. Please describe their pretraining tasks and discuss why
they are useful.
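As one concrete reference point, BERT's masked-language-model objective
corrupts roughly 15% of input tokens using the 80/10/10 scheme described in
[2] (GPT, by contrast, is pretrained with left-to-right next-token
prediction). The toy snippet below illustrates the corruption step only; it is
not the authors' code:

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary

def mask_for_mlm(tokens, mask_prob=0.15, seed=0):
    """Toy masked-LM corruption: select ~15% of tokens; of those,
    80% -> [MASK], 10% -> random token, 10% -> unchanged. The model is
    trained to predict the ORIGINAL token at each selected position."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok  # prediction target at this position
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)  # random replacement
            # else: keep the original token
    return corrupted, targets

print(mask_for_mlm(["the", "cat", "sat", "on", "the", "mat"]))
```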
Q6 (10 points): In the BERT model design, there are two special tokens, [CLS]
and [SEP]. What is the purpose of these two special tokens, and how are they
used during training and evaluation?
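To make the setup concrete, here is a toy sketch of how a single- or
two-sentence input is packed with these tokens, following the input format
described in [2] (the helper name is hypothetical):

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Toy illustration of BERT's input packing:
    [CLS] sentence A [SEP] (sentence B [SEP]), plus segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # segment 0 = first sentence
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # segment 1 = second sentence
    return tokens, segment_ids

print(build_bert_input(["my", "dog", "barks"], ["it", "is", "loud"]))
# (['[CLS]', 'my', 'dog', 'barks', '[SEP]', 'it', 'is', 'loud', '[SEP]'],
#  [0, 0, 0, 0, 0, 1, 1, 1, 1])
```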