(NLP Research) The Long-Document Transformer 03.10
앨런튜링_
2022. 3. 24. 09:09
- Let's read Efficient Transformers: A Survey
- https://arxiv.org/pdf/2009.06732.pdf
- Introduction
- For on-device applications, models are supposed to be able to operate with a limited computational budget
- In this paper, we propose a taxonomy of efficient Transformer models, characterizing them by the technical innovation and primary use case
- The efficiency might also refer to computational costs, e.g. number of FLOPs, both during training and inference.
- FLOPS (FLoating point Operations Per Second) is the unit commonly used to quantify computer performance as a throughput rate; note that the survey's "number of FLOPs" counts the total number of floating-point operations performed, not a per-second rate.
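Since the survey measures cost in the number of FLOPs, a quick back-of-the-envelope count for a single attention head helps make the unit concrete. This is a minimal sketch assuming a multiply-add counts as two FLOPs and using illustrative sizes (N = 4096 tokens, head dimension d_k = 64) that are not taken from the paper:

```python
# Rough FLOP count for one self-attention head (assumption: a multiply-add
# is counted as 2 FLOPs; N and d_k are illustrative, not from the survey).
N, d_k = 4096, 64

qk_flops = 2 * N * N * d_k   # Q @ K^T -> (N, N) score matrix
av_flops = 2 * N * N * d_k   # softmax(scores) @ V -> (N, d_k) output
print(f"{(qk_flops + av_flops) / 1e9:.1f} GFLOPs per head")  # ~4.3 GFLOPs
```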
- Background on Transformers
- Multi-Head Self-Attention
- The operation for a single head is defined as the scaled dot-product attention Softmax(QKᵀ / √d_k) V, computed over that head's query, key, and value projections
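A minimal PyTorch sketch of that single-head operation; the function name and toy shapes are my own choices, not the paper's reference code:

```python
import torch

def single_head_attention(Q, K, V):
    """Scaled dot-product attention for one head.
    Q, K, V: (N, d_k) tensors; returns an (N, d_k) tensor."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (N, N) attention logits
    weights = torch.softmax(scores, dim=-1)        # row-wise normalization
    return weights @ V

# Toy usage: N = 8 tokens, d_k = 4
Q, K, V = (torch.randn(8, 4) for _ in range(3))
print(single_head_attention(Q, K, V).shape)  # torch.Size([8, 4])
```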
- On the scalability of Self-Attention
- At this point, it is apparent that the memory and computational complexity required to compute the attention matrix is quadratic in the input sequence length, i.e., N × N. In particular, the QKᵀ matrix multiplication operation alone consumes N² time and memory. This restricts the overall utility of self-attentive models in applications which demand the processing of long sequences. In subsequent sections, we discuss methods that reduce the cost of self-attention.
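To see how quickly that N × N term grows, here is a small loop over a few sequence lengths (the lengths are chosen purely for illustration, not quoted from the survey):

```python
# Memory needed just to hold the (N, N) attention matrix in fp32
# for a single head (illustrative sequence lengths).
for N in (1_024, 4_096, 16_384, 65_536):
    print(f"N = {N:>6}: {N * N * 4 / 2**30:8.3f} GiB")
# Each 4x increase in sequence length costs 16x the memory.
```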
Local attention code implementation (see the sketch below)
Check the tendency of each aggregation -> looking at the whole sentence
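For the local-attention implementation planned above, here is a minimal masked sliding-window sketch. The window size, masking strategy, and function name are my own assumptions rather than any specific model's API, and the full N × N score matrix is still materialized, so this shows the attention pattern rather than an O(N·w) optimized kernel:

```python
import torch

def local_attention(Q, K, V, window: int = 4):
    """Sliding-window self-attention: each query attends only to keys
    within `window` positions on either side."""
    N, d_k = Q.shape
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5           # (N, N) logits
    idx = torch.arange(N)
    blocked = (idx[None, :] - idx[:, None]).abs() > window  # True = masked out
    scores = scores.masked_fill(blocked, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ V, weights

Q, K, V = (torch.randn(16, 8) for _ in range(3))
out, attn = local_attention(Q, K, V, window=2)
print(attn[0].nonzero().flatten())  # token 0 attends only to positions 0, 1, 2
```

This masked formulation is only meant for inspecting the attention pattern, e.g. for the note above about whether attention spreads toward the whole sentence; efficient long-document models such as Longformer compute only the in-window scores, so their cost scales linearly with sequence length.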