Question 2: Parameter-Efficient Transfer Learning [Jiaoda] (30 pts)
Consider a vanilla encoder-decoder transformer [2].
a) (1 pt) Given the vocabulary size V and embedding dimension D, compute the number of parameters in an embedding layer (ignore positional encodings).
b) (2 pts) How many embedding layers are there in an encoder-decoder transformer architecture? What is the total number of parameters in the embedding layers? Is it larger than your answer in a)? Why or why not?
In an encoder layer, there are two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. A residual connection is applied around each sub-layer, followed by layer normalization; a minimal code sketch of this arrangement is given below.
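For concreteness, here is a minimal PyTorch sketch of one such encoder layer under the post-norm arrangement described above. The `d_model` argument plays the role of D, the head count is an illustrative choice (not given in the question), and the 4D intermediate size anticipates part d).

```python
import torch.nn as nn

# A minimal sketch of one encoder layer, assuming PyTorch.
# Post-norm ordering as described above: residual, then LayerNorm.
class EncoderLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # intermediate size 4D, as in d)
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)       # residual + layer norm
        x = self.norm2(x + self.ffn(x))    # residual + layer norm
        return x
```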
c) (2 pts) Compute the number of parameters in a multi-head self-attention sub-layer. Write down all the intermediate steps and assumptions you make.
d) (2 pts) Given that the dimensionality of the intermediate layer is 4D, compute the number of parameters in a feed-forward network.
e) (1 pt) In a decoder layer, there is an additional sub-layer: multi-head encoder-decoder attention. Compute the number of parameters in one such sub-layer.
f) (2 pts) There is an output layer made up of a linear transformation and a softmax function that produces next-token probabilities. Does it introduce extra parameters? Why or why not?
g) (2 pts) Given that both the encoder and the decoder have L layers, compute the total number of parameters in the transformer.
Consider the adapter network described in [1].
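As a reminder of its structure, the bottleneck adapter of [1] is a down-projection from D to a bottleneck dimension M, a nonlinearity, an up-projection back to D, and a residual skip connection. The sketch below assumes PyTorch; the GELU is an illustrative choice of nonlinearity.

```python
import torch.nn as nn

# A minimal sketch of a bottleneck adapter in the style of [1]:
# down-project D -> M, nonlinearity, up-project M -> D, plus a skip.
class Adapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # D -> M
        self.up = nn.Linear(bottleneck, d_model)    # M -> D
        self.act = nn.GELU()  # illustrative nonlinearity

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual connection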
h) (2 pts) Given the bottleneck dimension of the adapter M, compute the number of parameters in a single adapter module.
i) (2 pts) If we insert an adapter after each sub-layer, how many adapters are inserted in the encoder-decoder transformer described above? Compute the total number of newly added parameters.
j) (2 pts) If we perform adapter tuning on a downstream binary classification task, which components are trained? Compute the total number of trainable parameters.
k) (2 pts) Under what condition is adapter tuning more parameter-efficient than fine-tuning?
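If you want to sanity-check your hand computations, counting parameters directly on a module is straightforward. The sketch below assumes PyTorch, and the D = 512, M = 64 values are illustrative assumptions, not values given in the question.

```python
import torch.nn as nn

def count_params(module: nn.Module, trainable_only: bool = False) -> int:
    """Sum the element counts of a module's parameter tensors."""
    params = module.parameters()
    if trainable_only:
        params = (p for p in params if p.requires_grad)
    return sum(p.numel() for p in params)

# Illustrative check for a single adapter's linear maps with D = 512, M = 64:
adapter = nn.Sequential(nn.Linear(512, 64), nn.GELU(), nn.Linear(64, 512))
print(count_params(adapter))  # 512*64 + 64 + 64*512 + 512 = 66112
```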