
(NLP Research) The Long-Document Transformer 03.10

앨런튜링_ 2022. 3. 24. 09:09
  • Let's read Efficient Transformers: A Survey
  • https://arxiv.org/pdf/2009.06732.pdf
    • Introduction
      • For on-device applications, models are supposed to be able to operate within a limited computational budget
      • In this paper, we propose a taxonomy of efficient Transformer models, characterizing them by the technical innovation and primary use case
      • The efficiency might also refer to computational costs, e.g. number of FLOPs, both during training and inference.
        • FLOPS (FLoating point Operations Per Second) is a unit commonly used to quantify computer performance; note that the "number of FLOPs" in the paper counts floating-point operations themselves, not operations per second

  • Background on Transformers
    • Multi-Head Self-Attention
      • The operation for a single head is defined as Softmax(QKᵀ / √d_k) · V, i.e. scaled dot-product attention (see the code sketch below)
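
A minimal sketch of that single-head operation in PyTorch (the name `single_head_attention`, `d_k`, and the toy shapes are illustrative assumptions, not the survey's code):

```python
import math
import torch

def single_head_attention(Q, K, V):
    """Scaled dot-product attention for one head: Softmax(QKᵀ / √d_k) · V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (N, N) attention scores
    weights = torch.softmax(scores, dim=-1)            # row-wise normalization
    return weights @ V                                 # (N, d_v) per-head output

# Toy usage: N = 8 tokens, d_k = d_v = 16
Q, K, V = (torch.randn(8, 16) for _ in range(3))
out = single_head_attention(Q, K, V)  # shape: (8, 16)
```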

  • On the scalability of Self-Attention
    • At this point, it is apparent that the memory and computational complexity required to compute the attention matrix is quadratic in the input sequence length, i.e., N × N. In particular, the QKᵀ matrix multiplication operation alone consumes N² time and memory. This restricts the overall utility of self-attentive models in applications which demand the processing of long sequences. In subsequent sections, we discuss methods that reduce the cost of self-attention.
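
To make the quadratic cost concrete, a back-of-the-envelope sketch of the attention matrix alone (per head, per example), assuming float32 storage; the sequence lengths are just examples:

```python
# Memory of the N x N attention matrix in float32 (4 bytes per entry).
for n in (512, 2048, 8192, 32768):
    mib = n * n * 4 / 2**20
    print(f"N = {n:>5}: ~{mib:,.0f} MiB")
# Doubling N quadruples both the memory and the time of the QKᵀ product.
```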

Local attention code implementation
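
As a starting point, here is a hedged sketch of sliding-window (local) self-attention with a band mask, assuming PyTorch. This naive version still materializes the full N × N score matrix, so it shows the attention pattern rather than the memory savings of a real Longformer-style kernel:

```python
import math
import torch

def local_attention(Q, K, V, window: int):
    """Sliding-window self-attention: each token attends only to tokens
    within `window` positions on either side (a band mask over QKᵀ)."""
    N, d_k = Q.shape
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)      # (N, N) scores
    idx = torch.arange(N)
    band = (idx[None, :] - idx[:, None]).abs() <= window   # True inside the window
    scores = scores.masked_fill(~band, float("-inf"))      # forbid attention outside it
    return torch.softmax(scores, dim=-1) @ V

# Toy usage: 16 tokens, head dim 8, window of ±2 positions
Q, K, V = (torch.randn(16, 8) for _ in range(3))
out = local_attention(Q, K, V, window=2)  # shape: (16, 8)
```

Efficient implementations never build the full matrix, which is what drops the cost from O(N²) to roughly O(N·w) for window size w.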

 

Check the tendency for each local aggregation to build toward a view of the whole sentence
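
One way to read that note: with sliding-window attention, stacking layers widens the receptive field, so purely local aggregation eventually covers the whole sentence. The layers × window estimate below follows the Longformer paper; the concrete numbers are illustrative:

```python
# Rough receptive field of stacked sliding-window attention:
# with window size w and l layers, the top layer "sees" about l * w tokens.
def receptive_field(num_layers: int, window: int) -> int:
    return num_layers * window

print(receptive_field(num_layers=12, window=512))  # 6144 tokens reachable at the top layer
```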