Question: In class, we showed that computing a regular self-attention layer takes O(T²) running time for an input with T tokens. One alternative is to use linear self-attention. In its simplest form, this is identical to the standard dot-product self-attention discussed in class and in the lecture notes, except that the exponentials in the row-wise softmax operation softmax(QKᵀ) are dropped.
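A minimal NumPy sketch of the idea, assuming the usual (T, d) query/key/value matrices: dropping the exponentials means each attention row is normalized by the plain sum of its scores, and matrix associativity then lets Q(KᵀV) be computed in O(T·d²) instead of the O(T²·d) needed for (QKᵀ)V. The function names and the positive random inputs (chosen so the row normalizers stay nonzero) are illustrative, not from the original question.

```python
import numpy as np

def standard_attention(Q, K, V):
    # Row-wise softmax over QK^T: materializes a (T, T) matrix, so O(T^2).
    scores = Q @ K.T                                    # (T, T)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V                                  # (T, d_v)

def linear_attention(Q, K, V):
    # Exponentials dropped: normalize each row by the sum of raw scores.
    # Associativity: Q @ (K^T V) avoids ever forming the (T, T) matrix.
    kv = K.T @ V                                        # (d, d_v), O(T d d_v)
    k_sum = K.sum(axis=0)                               # (d,)
    numer = Q @ kv                                      # (T, d_v)
    denom = Q @ k_sum                                   # (T,) row normalizers
    return numer / denom[:, None]

rng = np.random.default_rng(0)
T, d = 6, 4
Q = rng.random((T, d))    # positive entries keep denominators nonzero
K = rng.random((T, d))
V = rng.random((T, d))

# Sanity check: the associative form equals softmax-without-exp directly.
scores = Q @ K.T
naive = (scores / scores.sum(axis=1, keepdims=True)) @ V
assert np.allclose(linear_attention(Q, K, V), naive)
```

The key point is that once exp(·) is gone, the normalizer is linear in K, so both the numerator and denominator factor through d-dimensional summaries of the keys, giving running time linear in T.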
