
30살 인공지능 도전기 (A 30-Year-Old's AI Challenge Log)


(NLP Research) The Long-Document Transformer 03.01

앨런튜링_ 2022. 3. 24. 08:48

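Are attention weights similar to feature importance?
Does input length affect the cue-amplification effect of attention?
- That is, as the input text gets longer, do more cues become available to the downstream task?
Is it right to apply the attention layer identically to every token?
- Wouldn't it be better to apply full attention to some tokens and sparse attention to others? In other words, could individual tokens be overfit or underfit?
- Would "attention optimization" be the right term for this in a paper?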

Motivation

BigBird increases the maximum document length from 512 tokens to 4096 tokens
- BigBird (NeurIPS 2020, 358 citations / Google Research)
Attention mechanism
- Compound Sparse Attention
- Applied identically across all 12 BertLayer modules (open question: does the random attention pattern differ from layer to layer?)
- BigBird: Band Attention + Global-Node Attention + Random Attention
SOTA on QA tasks
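
The combination listed above (band + global-node + random) can be pictured as a boolean attention mask. What follows is only a minimal token-level sketch under stated assumptions, not BigBird's actual implementation, which operates on blocks of tokens. The function name `bigbird_style_mask` and the parameters `window`, `num_global`, `num_random`, and `seed` are hypothetical and chosen purely for illustration.

```python
import random


def bigbird_style_mask(seq_len, window=3, num_global=2, num_random=2, seed=0):
    """Build a seq_len x seq_len boolean mask.

    mask[i][j] is True when query token i may attend to key token j.
    Token-level simplification; all parameter names are illustrative.
    """
    rng = random.Random(seed)
    half = window // 2
    mask = [[False] * seq_len for _ in range(seq_len)]

    for i in range(seq_len):
        # Band attention: each token sees a local window around itself.
        for j in range(max(0, i - half), min(seq_len, i + half + 1)):
            mask[i][j] = True

        # Global-node attention: the first num_global tokens attend to
        # every token and are attended to by every token.
        for g in range(min(num_global, seq_len)):
            mask[i][g] = True
            mask[g][i] = True

        # Random attention: up to num_random sampled keys per query
        # (a sample may coincide with a band or global position).
        for j in rng.sample(range(seq_len), min(num_random, seq_len)):
            mask[i][j] = True

    return mask


def density(mask):
    """Fraction of (query, key) pairs that are allowed by the mask."""
    n = len(mask)
    return sum(sum(row) for row in mask) / (n * n)


if __name__ == "__main__":
    for n in (16, 64, 256):
        print(n, round(density(bigbird_style_mask(n)), 4))
```

With a fixed window and a fixed number of global and random keys, each non-global row allows only a bounded number of keys, so the number of attended pairs grows roughly linearly with sequence length instead of quadratically. On the open question in the notes: in this sketch, passing a different `seed` per layer would give each layer its own random pattern, but whether BigBird does this is not established here.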

Method

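Does input length affect the cue-amplification effect of attention?
- Dataset: IMDb
  - Review text is used as input; labels are negative (0) and positive (1)
  - Train: 25,000, Test: 25,000
  - Minibatch size: 24
- Model selection
  - Basic Transformer
Accuracy measured as a function of input length (num_epochs = 20)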
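
The setup above can be written down as a small configuration plus the helpers needed to vary input length. This is a rough sketch under assumptions rather than the code used for the experiment: the dataset sizes, batch size, and epoch count come from the notes, while `ExperimentConfig`, `fit_to_length`, `pad_id`, and the values in `input_lengths` are hypothetical (the notes do not say which lengths were tested).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    train_size: int = 25_000  # IMDb training reviews (from the notes)
    test_size: int = 25_000   # IMDb test reviews (from the notes)
    batch_size: int = 24      # minibatch size (from the notes)
    num_epochs: int = 20      # from the notes
    # Hypothetical sweep values; the lengths actually tested are not given.
    input_lengths: tuple = (128, 256, 512)


def steps_per_epoch(cfg):
    """Number of minibatches needed to cover the training set once."""
    return math.ceil(cfg.train_size / cfg.batch_size)


def fit_to_length(token_ids, max_len, pad_id=0):
    """Truncate or right-pad a list of token ids to exactly max_len."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


def accuracy(predictions, labels):
    """Share of predictions that match the 0/1 sentiment labels."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


if __name__ == "__main__":
    cfg = ExperimentConfig()
    print("steps per epoch:", steps_per_epoch(cfg))
    for max_len in cfg.input_lengths:
        example = fit_to_length(list(range(1, 301)), max_len)
        print(max_len, len(example))
```

Training one model per entry in `input_lengths` and recording `accuracy` on the test set for each would give the accuracy-versus-length curve the notes describe. Keeping every other setting fixed across runs is what makes differences attributable to input length alone.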
