Question 2: Parameter-Efficient Transfer Learning [Jiaoda] (30 pts)
Consider a vanilla encoder-decoder transformer [2].
a) (1 pt) Given the vocabulary size V and embedding dimension D, compute the number of parameters in an embedding layer (ignore positional encodings).
b) (2 pts) How many embedding layers are there in an encoder-decoder transformer architecture? What is the total number of parameters in the embedding layers? Is it larger than your answer in a)? Why or why not?
In an encoder layer, there are two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. A residual connection is applied around each sub-layer, followed by layer normalization; a minimal code sketch of this arrangement is given below.
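For concreteness, here is a minimal PyTorch sketch of one such encoder layer under the post-norm arrangement described above. The `d_model` argument plays the role of D, the head count is an illustrative choice (not given in the question), and the 4D intermediate size anticipates part d).

```python
import torch.nn as nn

# A minimal sketch of one encoder layer, assuming PyTorch.
# Post-norm ordering as described above: residual, then LayerNorm.
class EncoderLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # intermediate size 4D, as in d)
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)       # residual + layer norm
        x = self.norm2(x + self.ffn(x))    # residual + layer norm
        return x
```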
c) (2 pts) Compute the number of parameters in a multi-head self-attention sub-layer. Write down all the intermediate steps and assumptions you make.
d) (2 pts) Given that the dimensionality of the intermediate layer is 4D, compute the number of parameters in a feed-forward network.
e) (1 pt) In a decoder layer, there is an additional sub-layer: multi-head encoder-decoder attention. Compute the number of parameters in one such sub-layer.
f) (2 pts) There is an output layer made up of a linear transformation and a softmax function that produces next-token probabilities. Does it introduce extra parameters? Why or why not?
g) (2 pts) Given that both the encoder and the decoder have L layers, compute the total number of parameters in the transformer.
Consider the adapter network described in [1].
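As a reminder of its structure, the bottleneck adapter of [1] is a down-projection from D to a bottleneck dimension M, a nonlinearity, an up-projection back to D, and a residual skip connection. The sketch below assumes PyTorch; the GELU is an illustrative choice of nonlinearity.

```python
import torch.nn as nn

# A minimal sketch of a bottleneck adapter in the style of [1]:
# down-project D -> M, nonlinearity, up-project M -> D, plus a skip.
class Adapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # D -> M
        self.up = nn.Linear(bottleneck, d_model)    # M -> D
        self.act = nn.GELU()  # illustrative nonlinearity

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual connection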
h) (2 pts) Given the bottleneck dimension of the adapter M, compute the number of parameters in a single adapter module.
i) (2 pts) If we insert an adapter after each sub-layer, how many adapters are inserted in the encoder-decoder transformer described above? Compute the total number of newly added parameters.
j) (2 pts) If we perform adapter tuning on a downstream binary classification task, which components are trained? Compute the total number of trainable parameters.
k) (2 pts) Under what condition is adapter tuning more parameter-efficient than fine-tuning?
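If you want to sanity-check your hand computations, counting parameters directly on a module is straightforward. The sketch below assumes PyTorch, and the D = 512, M = 64 values are illustrative assumptions, not values given in the question.

```python
import torch.nn as nn

def count_params(module: nn.Module, trainable_only: bool = False) -> int:
    """Sum the element counts of a module's parameter tensors."""
    params = module.parameters()
    if trainable_only:
        params = (p for p in params if p.requires_grad)
    return sum(p.numel() for p in params)

# Illustrative check for a single adapter's linear maps with D = 512, M = 64:
adapter = nn.Sequential(nn.Linear(512, 64), nn.GELU(), nn.Linear(64, 512))
print(count_params(adapter))  # 512*64 + 64 + 64*512 + 512 = 66112
```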