Question: In class, we showed that computing a regular self-attention layer takes O(T^2) running time for an input with T tokens. One alternative is to use linear self-attention. In its simplest form, this is identical to the standard dot-product self-attention discussed in class and in the lecture notes, except that the exponentials in the row-wise softmax operation softmax(QK^T) are dropped;
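A minimal sketch of the idea, assuming NumPy; the function name linear_self_attention, the shapes, and the omission of the 1/sqrt(d) scaling and multi-head details are illustrative assumptions, not part of the original question. With the exponentials dropped, the row normalization becomes a plain sum of dot products, so (QK^T)V can be reassociated as Q(K^T V) and the T x T attention matrix is never formed, cutting the cost from O(T^2 d) to O(T d^2):

```python
import numpy as np

def linear_self_attention(Q, K, V):
    """Softmax-free self-attention computed in time linear in T.

    Q, K: (T, d); V: (T, d_v). Each output row is
    sum_j (q_i . k_j) v_j / sum_j (q_i . k_j), i.e. softmax with the
    exponentials dropped. By associativity, Q @ (K.T @ V) equals
    (Q @ K.T) @ V, but the left form costs O(T d d_v) instead of O(T^2 d).
    """
    kv = K.T @ V                 # (d, d_v), built once in O(T d d_v)
    numer = Q @ kv               # (T, d_v)
    denom = Q @ K.sum(axis=0)    # row sums of Q @ K.T, in O(T d)
    # Caveat (an assumption worth flagging): without exponentials the
    # weights can be negative and denom can vanish; practical linear
    # attention applies a positive feature map to Q and K first.
    return numer / denom[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d = 8, 4
    # Nonnegative entries keep the row sums safely positive for the demo.
    Q, K, V = (rng.random((T, d)) for _ in range(3))
    A = Q @ K.T                  # the explicit O(T^2) path, for checking
    quadratic = (A @ V) / A.sum(axis=1, keepdims=True)
    assert np.allclose(linear_self_attention(Q, K, V), quadratic)
```

The reassociation is the entire trick: K^T V is a small d x d_v matrix computed once, so the per-token work no longer depends on T.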
