Question: 3. (3 points) Understanding self-attention. Let us assume the basic definition of self-attention (without any weight matrices), where all the queries, keys, and values are the data points themselves (i.e., v_i = q_i = k_i = x_i). We will see how self-attention lets the network select different parts of the data to be the "content" (value) and other parts to determine where to "pay attention" (queries and keys). Consider 4 orthogonal "base" vectors a, b, c, d, all of equal ℓ2 norm. (Suppose that their norm is β, which is some very, very large number.) Out of these base vectors, construct 3 tokens: x_1 = d + b, x_2 = a, x_3 = c + b.

a. (0.5 points) What are the norms of x_1, x_2, x_3?

b. (2 points) Compute (y_1, y_2, y_3) = SelfAttention(x_1, x_2, x_3). Identify which tokens (or combinations of tokens) are approximated by the outputs y_1, y_2, y_3.

c. (0.5 points) Using the above example, describe in a couple of sentences how self-attention allows networks to "copy" an input value to the output.
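To make the setup concrete, here is a minimal NumPy sketch (not part of the original question) of self-attention without weight matrices, y_i = sum_j softmax_j(x_i . x_j) x_j. It assumes one hypothetical instantiation of the problem: a, b, c, d are taken to be the four standard basis directions of R^4 scaled to norm β, with β = 10 standing in for the "very, very large number".

    import numpy as np

    # Hypothetical instantiation: a, b, c, d are the scaled standard basis
    # vectors of R^4, mutually orthogonal, each with l2 norm beta.
    beta = 10.0
    a, b, c, d = beta * np.eye(4)

    # The three tokens from the problem statement, as rows x_1, x_2, x_3.
    X = np.stack([d + b, a, c + b])

    def self_attention(X):
        """Basic self-attention with no weight matrices: q_i = k_i = v_i = x_i."""
        scores = X @ X.T                              # pairwise dot products x_i . x_j
        scores = scores - scores.max(axis=1, keepdims=True)  # stable softmax
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=1, keepdims=True)
        return weights @ X                            # y_i = sum_j softmax_j(x_i . x_j) x_j

    Y = self_attention(X)
    print(np.linalg.norm(X, axis=1))  # part (a): beta*sqrt(2), beta, beta*sqrt(2)
    print(np.round(Y - X, 6))         # part (b): ~zero, i.e. y_i is numerically ~ x_i

Under this instantiation, each diagonal score x_i . x_i (β² or 2β²) strictly exceeds every off-diagonal score in its row, so the softmax saturates as β grows and puts essentially all of its mass on the token itself. Running the sketch makes that saturation visible, which is the behavior parts (b) and (c) ask about.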
