BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  • Link: https://arxiv.org/abs/1810.04805

💡 What's the core idea?

  • BERT (Bidirectional Encoder Representations from Transformers) pre-trains a multi-layer bidirectional Transformer encoder with the MLM (Masked Language Model) and NSP (Next Sentence Prediction) objectives so that it understands each word's context in an input sentence in both directions; the pre-trained model can then be fine-tuned for a wide variety of language tasks.
  • By pre-training deep bidirectional representations on unlabeled text and fine-tuning the resulting model, BERT achieved SOTA on a broad range of language tasks.

BERT๋Š” ํฌ๊ฒŒ 4๊ฐ€์ง€ ์ธก๋ฉด์—์„œ ์‚ดํŽด๋ณผ ์ˆ˜ ์žˆ๋‹ค.

Model Architecture


  • BERT is based on a multi-layer bidirectional Transformer encoder.
  • BERT's Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention, so GPT can only attend to the context to the left of each token.

Input/Output Representations


  • BERT์˜ input embeddings์€ 3๊ฐ€์ง€ embedding vector์ธ Token Embedding, Segment Embeddings, Position Embeddings์˜ ํ•ฉ์ด๋‹ค.
  • ๋ชจ๋“  input sequence์˜ ์ฒซ๋ฒˆ์งธ ํ† ํฐ์€ classification token์ธ [CLS]์ด๋ฉฐ, [CLS] ํ† ํฐ์˜ ์ตœ์ข… hidden state๋Š” classification task๋ฅผ ์œ„ํ•œ ์ข…ํ•ฉ sequence representation์ด๋‹ค.
  • Token Embeddings์€ WordPiece embedding์„ ์‚ฌ์šฉํ•œ๋‹ค.
    • WordPiece embedding์ด๋ž€? : ๊ธฐ์กด์˜ ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ ๊ธฐ๋ฒ•์€ ๋‹จ์–ด ๋‹จ์œ„๋กœ ์ž„๋ฒ ๋”ฉ์„ ์ƒ์„ฑํ•œ๋‹ค. ์ด๋Ÿฐ ๋ฐฉ์‹์€ ์‚ฌ์ „์— ์—†๋Š” ๋‹จ์–ด(OOV, Out Of Vocabulary)๊ฐ€ ๋“ฑ์žฅํ•  ๋•Œ ์ฒ˜๋ฆฌ๊ฐ€ ์–ด๋ ต๊ณ , ์–ธ์–ด์—๋Š” ๋„ˆ๋ฌด ๋งŽ์€ ๋‹จ์–ด๊ฐ€ ์กด์žฌํ•˜๊ธฐ์— ๋ชจ๋“  ๋‹จ์–ด๋ฅผ ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ๋กœ ํ‘œํ˜„ํ•˜๊ธฐ๋Š” ์–ด๋ ต๋‹ค. ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด, WordPiece Embedding์€ ๋‹จ์–ด๋ฅผ subword ๋‹จ์œ„๋กœ, ์šฐ๋ฆฌ๋ง๋กœ ํ•˜๋ฉด ํ˜•ํƒœ์†Œ ๋‹จ์œ„๋กœ, ๋‹จ์–ด๋ฅผ ๋ถ„ํ•ดํ•œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, โ€œunhappinessโ€๋ผ๋Š” ๋‹จ์–ด๋Š” โ€œunโ€, โ€œhappiโ€, โ€œnessโ€์™€ ๊ฐ™์ด ๋‚˜๋ˆŒ ์ˆ˜ ์žˆ๋‹ค. ์ด๋ ‡๊ฒŒ ํ•™์Šตํ•˜๋ฉด, ์ƒˆ๋กœ ๋“ฑ์žฅํ•œ ๋‹จ์–ด๋„ subword์˜ ์กฐํ•ฉ์œผ๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๊ณ , ๊ณต๊ฐ„์˜ ํšจ์œจ์„ฑ์ด ๋†’์•„์ง€๊ณ , ๋‹ค์–‘ํ•œ ์ ‘๋‘์‚ฌ, ์ ‘๋ฏธ์‚ฌ ๋“ฑ์„ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์ข‹๋‹ค.
  • ๋˜ํ•œ input sequence๋Š” ํ•œ ์Œ์˜ ๋ฌธ์žฅ์œผ๋กœ ๊ตฌ์„ฑ๋˜๋Š”๋ฐ, ๊ฐ ๋ฌธ์žฅ์€ [SEP] ํ† ํฐ์œผ๋กœ ๋ถ„๋ฆฌ๋œ๋‹ค. ์ด๋•Œ, ๋ถ„๋ฆฌ๋œ ๊ฐ ๋ฌธ์žฅ์„ ๊ตฌ๋ถ„ํ•˜๊ธฐ ์œ„ํ•ด Segment Embeddings๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, A ๋ฌธ์žฅ์˜ ํ† ํฐ์€ $E_A$๋กœ ์ž„๋ฒ ๋”ฉํ•˜๊ณ , B ๋ฌธ์žฅ์˜ ํ† ํฐ์€ $E_B$๋กœ ์ž„๋ฒ ๋”ฉํ•˜๋Š” ์‹์ด๋‹ค.
  • Position Embeddings๋Š” ๋‹จ์–ด์˜ ์ˆœ์„œ๋ฅผ ๋ถ€์—ฌํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉํ•œ๋‹ค. Transformer์˜ attention์€ ๋ฌธ์žฅ ๋‚ด ๋‹จ์–ด์˜ ์ˆœ์„œ๋ฅผ ๊ณ ๋ คํ•˜์ง€ ์•Š๊ธฐ์—, ๋‹จ์–ด์˜ ์ˆœ์„œ๋ฅผ ๋ถ€์—ฌํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉํ•œ๋‹ค. ๋งจ์ฒ˜์Œ ํ† ํฐ๋ถ€ํ„ฐ $E_0, E_1,E_2โ€ฆ$์‹์œผ๋กœ ๋ถ€์—ฌํ•œ๋‹ค.
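
To make the subword idea concrete, here is a toy greedy longest-match-first tokenizer in the spirit of WordPiece. The tiny vocabulary and the `##` continuation prefix follow common BERT conventions, but everything here is illustrative rather than the actual algorithm or vocabulary used to train BERT.

```python
def wordpiece(word, vocab):
    """Greedily split a word into the longest matching subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Non-initial pieces carry the "##" continuation prefix.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:          # no piece matched: the word is unknown
            return ["[UNK]"]
        start = end
    return pieces

vocab = {"un", "##happi", "##ness"}
print(wordpiece("unhappiness", vocab))  # ['un', '##happi', '##ness']
```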
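
And a minimal PyTorch sketch of how the three embeddings combine by simple element-wise addition. The sizes (768-dim hidden states, 30,522-token vocabulary, 512 positions) are the commonly cited BERT-Base values, and the token ids in the example are made up.

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Sum of token, segment, and position embeddings (a simplified sketch)."""
    def __init__(self, vocab_size=30522, hidden_size=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden_size)    # WordPiece token ids
        self.segment = nn.Embedding(2, hidden_size)           # 0 = sentence A, 1 = sentence B
        self.position = nn.Embedding(max_len, hidden_size)    # learned positions E_0, E_1, ...

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        positions = positions.unsqueeze(0).expand_as(token_ids)
        # The input representation is the element-wise sum of the three embeddings.
        return self.token(token_ids) + self.segment(segment_ids) + self.position(positions)

# Made-up ids standing in for "[CLS] the way that [SEP] love me [SEP]":
token_ids = torch.tensor([[101, 1996, 2126, 2008, 102, 2293, 2033, 102]])
segment_ids = torch.tensor([[0, 0, 0, 0, 0, 1, 1, 1]])
emb = BertInputEmbeddings()(token_ids, segment_ids)
print(emb.shape)  # torch.Size([1, 8, 768])
```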

Pre-Training

Masked Language Model, MLM

  • Randomly select 15% of the tokens in the input sentence, replace them with the [MASK] token, and train the model to predict the masked tokens.
    • For example, the sentence "I don't think that I like her" becomes "I don't [MASK] that I like her".
  • Self-attention relates the [MASK] token to every other token in order to predict it; this is what lets a deep bidirectional Transformer learn the left and right context of a sentence at the same time.
  • However, when the pre-trained model is fine-tuned on a downstream task, the input sequences contain no [MASK] tokens, unlike during pre-training. To reduce this mismatch, a selected token is not always replaced by [MASK] (only 80% of the time); 10% of the time it is replaced by a random token, and 10% of the time it is left unchanged (the sketch after this list illustrates the rule).
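
A sketch of the 15% / 80-10-10 selection rule under simplifying assumptions: integer token ids, a made-up [MASK] id, and no special handling of [CLS]/[SEP], which a real pipeline would skip. The -100 label convention (ignore this position in the loss) is borrowed from common PyTorch practice, not from the paper.

```python
import random

MASK_ID = 103          # assumed id of [MASK]
VOCAB_SIZE = 30522     # assumed WordPiece vocabulary size

def mask_tokens(token_ids, mask_prob=0.15):
    """Return (corrupted input, labels); label is -100 where no prediction is needed."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:        # select ~15% of positions
            labels[i] = tok                    # the model must recover the original token
            r = random.random()
            if r < 0.8:                        # 80%: replace with [MASK]
                inputs[i] = MASK_ID
            elif r < 0.9:                      # 10%: replace with a random token
                inputs[i] = random.randrange(VOCAB_SIZE)
            # remaining 10%: keep the original token unchanged
    return inputs, labels

random.seed(0)
print(mask_tokens([7, 42, 99, 13, 5, 77, 21, 8]))
```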

Next Sentence Prediction, NSP

  • ์ถ”๊ฐ€์ ์œผ๋กœ, ๋ฌธ์žฅ ๊ฐ„ ๊ด€๊ณ„๋ฅผ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•ด Next Sentence Prediction์„ ํ•™์Šตํ•œ๋‹ค.
  • ์งˆ์˜์‘๋‹ต(Question Answering, QA)๋‚˜ ์ž์—ฐ์–ด ์ถ”๋ก (Natural Language Inference, NLI)์™€ ๊ฐ™์€ task๋Š” ๋ฌธ์žฅ ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•˜๋‹ค.
  • ๋”ฐ๋ผ์„œ ๋ชจ๋ธ์ด ๋ฌธ์žฅ ๊ฐ„์˜ ๋‹จ์–ด๋ฅผ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋„๋ก ์‚ฌ์ „ํ•™์Šต ์‹œ, ๋ฌธ์žฅ A์™€ ์ด์–ด์งˆ ๋ฌธ์žฅ B๋ฅผ ๋‹ฌ๋ฆฌ ํ•˜์—ฌ ํ•™์Šตํ•œ๋‹ค. ๋ฌธ์žฅ B๊ฐ€ ์‹ค์ œ๋กœ ๋ฌธ์žฅ A ๋’ค์— ์ด์–ด์ง€๋Š” ๋ฌธ์žฅ(labeled as IsNext)์ธ์ง€ 50%, ์ด์–ด์ง€์ง€ ์•Š๋Š” ๋ฌธ์žฅ(labeled as NotNext)์ธ์ง€ 50%๋กœ ๊ตฌ์„ฑํ•˜์—ฌ ํ•™์Šตํ•œ๋‹ค.

Fine-tuning


  • Fine-tuning plugs task-specific inputs into the pre-trained model, depending on the downstream task:

    1) Paraphrasing : sentence pairs

    2) Entailment : hypothesis-premise pairs

    3) Question Answering : question-passage pairs

    4) Text Classification or Sequence Tagging : a single sentence (no pair)

  • Likewise, the output fed into the task-specific output layer differs by task (see the sketch after this list):

    1) For token-level tasks such as sequence tagging or question answering : the token representations

    2) For classification tasks such as entailment or sentiment analysis : the [CLS] representation
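
A sketch of these two kinds of output heads on top of the encoder's final hidden states. `HIDDEN`, the class names, and the random tensors standing in for BERT's output are all this sketch's assumptions.

```python
import torch
import torch.nn as nn

HIDDEN = 768  # assumed encoder hidden size

class ClassificationHead(nn.Module):
    """Sentence-level tasks (entailment, sentiment): use the [CLS] representation."""
    def __init__(self, num_labels):
        super().__init__()
        self.linear = nn.Linear(HIDDEN, num_labels)

    def forward(self, hidden_states):      # (batch, seq_len, HIDDEN)
        cls = hidden_states[:, 0]          # final hidden state of the [CLS] token
        return self.linear(cls)            # (batch, num_labels)

class TaggingHead(nn.Module):
    """Token-level tasks (sequence tagging, QA): use every token representation."""
    def __init__(self, num_labels):
        super().__init__()
        self.linear = nn.Linear(HIDDEN, num_labels)

    def forward(self, hidden_states):
        return self.linear(hidden_states)  # (batch, seq_len, num_labels)

hidden = torch.randn(2, 16, HIDDEN)        # stand-in for BERT's output
print(ClassificationHead(3)(hidden).shape) # torch.Size([2, 3])
print(TaggingHead(9)(hidden).shape)        # torch.Size([2, 16, 9])
```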

💡 How does it differ from prior ideas?


  • Given the sentence "I don't OOOOO that I like her", when we guess the word behind OOOOO, we naturally use the context on both sides of it. Intuitively, considering both the left and the right context should help predict the word more accurately.
  • Bidirectional training was used before in RNNs and LSTMs, but those models train the forward (left-to-right) and backward (right-to-left) directions separately and concatenate the two outputs to form a word's representation. Each direction therefore cannot see the other direction's information while encoding: reading left to right, the model never uses the context to the right of the current word. ELMo, which this paper also discusses, is a representative example.
  • BERT does not process the two directions sequentially like an RNN or LSTM. It takes the whole sentence as input at once and, to predict the masked words, learns the interactions among all words in the sentence.
  • Through the Transformer's self-attention mechanism, each word can compute its relation to every other word across the whole sentence; that is, both the forward and the backward context are considered at once.
    • For example, in "The way that you love me", the word "you" can attend to every word in the sentence to understand its context: both its left context "The way that" and its right context "love me".

💡 Then what about GPT-1? Didn't it use a Transformer too?

  • GPT-1์—์„œ ๋˜ํ•œ transformer์˜ self-attention์„ ์‚ฌ์šฉํ–ˆ์ง€๋งŒ, GPT-1์—์„œ๋Š” ์˜ˆ์ธกํ•˜๊ณ ์ž ํ•˜๋Š” ํ† ํฐ์˜ ์ขŒ์ธก ํ† ํฐ๋“ค๋งŒ์„ ์ด์šฉํ•ด ์˜ˆ์ธก์„ ์ˆ˜ํ–‰ํ–ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ํ…์ŠคํŠธ๋ฅผ ์ƒ์„ฑํ•  ๋•Œ ์ž์—ฐ์Šค๋Ÿฌ์šด ์ˆœ์„œ๋ฅผ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ์ง€๋งŒ, ํŠน์ • ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•  ๋•Œ๋Š” ํ•œ์ชฝ ๋ฐฉํ–ฅ์˜ ๋ฌธ๋งฅ๋งŒ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.

Figure: GPT-1, Improving Language Understanding by Generative Pre-Training
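
The difference is easy to state as attention masks. A minimal sketch, assuming the convention that True means "may attend"; masked positions receive -inf before the softmax so their attention weights become exactly zero.

```python
import torch

seq_len = 5

# BERT: bidirectional self-attention, every position may attend to every position.
bert_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# GPT: constrained (causal) self-attention, position i may attend only to j <= i.
gpt_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Applying the causal mask to raw attention scores (QK^T / sqrt(d)):
scores = torch.randn(seq_len, seq_len)
causal = scores.masked_fill(~gpt_mask, float("-inf"))
attn = torch.softmax(causal, dim=-1)  # weights on future positions are exactly 0
print(gpt_mask.int())
```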

Experiments

SQuAD v1.1, SQuAD v2.0

SQuAD is the flagship dataset for the MRC task. SQuAD v1.1 consists entirely of answerable questions, while v2.0 also includes questions with no answer in the passage.

💡 What is MRC?

  • MRC stands for Machine Reading Comprehension, a question-answering task based on machine reading: the model is given a question and a document containing the answer, and it must locate the answer within that document.

BERT performs the MRC task as follows (a span-prediction sketch follows this list):

  1. The question and the given passage are concatenated into a single sequence using the [SEP] token.
  2. BERT computes an embedding vector for every token in the sequence and predicts the start and end positions of the answer to the given question.

SWAG

SWAG์€ ํ›„๋ณด ๋ฌธ์žฅ๋“ค ์ค‘ ์ฃผ์–ด์ง„ ๋ฌธ์žฅ๊ณผ ์ด์–ด์งˆ ๋ฌธ์žฅ์„ ๊ณ ๋ฅด๋Š” ๋ฐ์ดํ„ฐ์…‹์ด๋‹ค.

BERT๋Š” ์ฃผ์–ด์ง„ ๋ฌธ์žฅ๊ณผ ํ›„๋ณด ๋ฌธ์žฅ๋“ค์„ ๊ฐ๊ฐ ํ•˜๋‚˜์˜ sequence๋กœ ์ด์–ด๋ถ™์ธ๋‹ค. ๊ทธ ํ›„, ๊ฐ sequence๋“ค์„ BERT์— ๋„ฃ์–ด [CLS] ํ† ํฐ์˜ ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ๋ฅผ ๊ตฌํ•˜๊ณ , ๊ตฌํ•ด์ง„ ๊ฐ’๋“ค์„ softmax layer์— ๋„ฃ์–ด ๊ฐ€์žฅ ํ™•๋ฅ ์ด ๋†’์€ ์„ ํƒ์ง€๋ฅผ ๊ณ ๋ฅธ๋‹ค.
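
A sketch of the multiple-choice scoring, assuming four candidates and random vectors in place of the four [CLS] hidden states; the linear scorer that maps each sequence's [CLS] vector to one logit matches the paper's use of a task-specific scoring vector.

```python
import torch
import torch.nn as nn

HIDDEN = 768
NUM_CHOICES = 4  # SWAG offers four candidate continuations

score = nn.Linear(HIDDEN, 1)  # one scalar score per (sentence, candidate) sequence

# One sequence per choice; random vectors stand in for the [CLS] hidden states.
cls_states = torch.randn(NUM_CHOICES, HIDDEN)

logits = score(cls_states).squeeze(-1)   # (NUM_CHOICES,)
probs = torch.softmax(logits, dim=-1)    # softmax over the choices
print(probs, probs.argmax().item())      # pick the highest-probability candidate
```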

Ablation Study

An ablation study checks how a proposed idea affects the model by comparing the model with the idea applied against an otherwise identical model with only that idea removed.

Figure: ablation on NSP

Figure: ablation on model size

