Improving Language Understanding by Generative Pre-Training

  • Link : https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

Abstract

์ž์—ฐ์–ด๋ฅผ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์€ textual entailment, question answering(์งˆ์˜์‘๋‹ต), semantic similarity assessment(์˜๋ฏธ ์œ ์‚ฌ๋„ ํ‰๊ฐ€), document classification(๋ฌธ์„œ ๋ถ„๋ฅ˜)์™€ ๊ฐ™์€ ์—ฌ๋Ÿฌ task๋ฅผ ํฌํ•จํ•œ๋‹ค.

ํ•˜์ง€๋งŒ, unlabeled๋œ ๋ฐ์ดํ„ฐ๋Š” ์ถฉ๋ถ„ํ•ด๋„, ํŠน์ • task๋ฅผ ์œ„ํ•œ labeled ๋ฐ์ดํ„ฐ๊ฐ€ ๋ถ€์กฑํ•˜์—ฌ ํŒ๋ณ„ ๋ชจ๋ธ์„ ์ถฉ๋ถ„ํžˆ ํ•™์Šตํ•˜๊ธฐ์—๋Š” ์–ด๋ ค์›€์ด ์žˆ๋‹ค. ๊ฐ€๋ น ์ธํ„ฐ๋„ท ๊ธฐ์‚ฌ๋“ค์€ ๋งŽ์ง€๋งŒ, ์ธํ„ฐ๋„ท ๊ธฐ์‚ฌ๋“ค์„ ์ฃผ์ œ๋ณ„๋กœ ๋ถ„๋ฅ˜ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ์ฐพ๊ธฐ๋Š” ์‰ฝ์ง€ ์•Š์•„, ์ธํ„ฐ๋„ท ๊ธฐ์‚ฌ๋ฅผ ์ฃผ์ œ๋ณ„๋กœ ๋ถ„๋ฅ˜ํ•˜๋Š” ๋ชจ๋ธ์„ ํ•™์Šตํ•˜๊ธฐ์— ์–ด๋ ค์›€์„ ๊ฒช๋Š”๋‹ค.

์ด ์—ฐ๊ตฌ์—์„œ๋Š” ์–ธ์–ด ๋ชจ๋ธ์„ unlabeled text๋ฅผ ์ด์šฉํ•ด generative pre-trainingํ•œ ํ›„, ๊ฐ task์— ๋งž๊ฒŒ discriminative fine-tuning์„ ํ•˜๋Š” ๋ฐฉ์‹์„ ํ†ตํ•ด ์—ฌ๋Ÿฌ task์—์„œ ํฐ ์„ฑ๊ณผ๋ฅผ ๋‚ผ ์ˆ˜ ์žˆ์Œ์„ ๋ณด์ธ๋‹ค

์ฆ‰, NLP ์•ˆ์—์„œ์˜ ์—ฌ๋Ÿฌ ์„ธ๋ถ€ task๋“ค, ์˜ˆ๋ฅผ ๋“ค์–ด ์˜๋ฏธ ์œ ์‚ฌ๋„๋ฅผ ํ‰๊ฐ€ํ•˜๋Š” task๋งŒ์„ ์œ„ํ•œ ๋ชจ๋ธ์„ ๋งŒ๋“œ๋Š” ๊ฒƒ๋ณด๋‹ค ๋‹ค์–‘ํ•œ text ๋ฐ์ดํ„ฐ๋“ค์„ ๊ฐ€์ง€๊ณ  ๋ชจ๋ธ์„ ํ•™์Šตํ•œ ํ›„, ์„ธ๋ถ€ task์— ๋งž๊ฒŒ fine-tuning์„ ๊ฑฐ์ณ ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ผ ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์ธ๋‹ค.

Large Language Model์˜ ์‹œ์ดˆ GPT-1์— ๊ด€ํ•œ ๋…ผ๋ฌธ์ด๋‹ค.

1. Introduction

NLP์—์„œ raw text๋กœ๋ถ€ํ„ฐ ํšจ๊ณผ์ ์œผ๋กœ ํ•™์Šตํ•˜๋Š” ๋Šฅ๋ ฅ์€, supervised learning์˜ ์˜์กด๋„๋ฅผ ๋‚ฎ์ถœ ์ˆ˜ ์žˆ๊ธฐ์— ํ•ต์‹ฌ ๋Šฅ๋ ฅ์ด๋ผ ํ•  ์ˆ˜ ์žˆ๋‹ค.

๋Œ€๋ถ€๋ถ„์˜ deep leaerning ๋ฐฉ์‹์€ labeled๋œ ๋ฐ์ดํ„ฐ๊ฐ€ ์ƒ๋‹นํžˆ ํ•„์š”ํ•œ๋ฐ, ํ˜„์‹ค์—์„œ๋Š” labeled๋œ ๋ฐ์ดํ„ฐ๋“ค์ด ๋ถ€์กฑํ•˜๊ธฐ์— ๋ชจ๋ธ์„ ๋‹ค๋ฅธ ๋„๋ฉ”์ธ์— ์ ์šฉํ•˜๊ธฐ์—๋Š” ์ œ์•ฝ์ด ์ƒ๊ธด๋‹ค.

๋งŒ์•ฝ, ๋ชจ๋ธ์ด unlabeled ๋ฐ์ดํ„ฐ์˜ ์–ธ์–ด ์ •๋ณด๋ฅผ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค๋ฉด, ์‹œ๊ฐ„๊ณผ ๋น„์šฉ์„ ์žก์•„๋จน๋Š” ๋ฐ์ดํ„ฐ์˜ label์„ ํš๋“ํ•˜๋Š” ๋ฐฉ๋ฒ•์˜ ๋Œ€์•ˆ์ด ๋  ์ˆ˜ ์žˆ๋‹ค.

๋˜ํ•œ supervision์ด ๊ฐ€๋Šฅํ•œ ๊ฒฝ์šฐ์—๋„, unsupervised ๋ฐฉ์‹์œผ๋กœ ์ข‹์€ representation์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์€ ์ƒ๋‹นํ•œ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ๋‹ค.

ํ•˜์ง€๋งŒ, unlabeled๋œ text์—์„œ word-level ์ด์ƒ์˜ ์ •๋ณด๋ฅผ ํ™œ์šฉํ•˜๊ธฐ๋ž€ ์–ด๋ ต๋‹ค.

๋จผ์ €, ์–ด๋–ค ์ตœ์ ํ™” ํ•จ์ˆ˜๊ฐ€ transfer task์— ํšจ๊ณผ์ ์ธ text representations์„ ํ•™์Šตํ•˜๊ธฐ์— ์ ํ•ฉํ•œ์ง€ ๋ถˆํ™•์‹คํ•˜๋‹ค.

๋‘ ๋ฒˆ์งธ, ํ•™์Šตํ•œ representations๋ฅผ target task์— ๋งž๊ฒŒ ์–ด๋–ป๊ฒŒ transferํ•˜๋Š”๊ฒŒ ๊ฐ€์žฅ ํšจ๊ณผ์ ์ธ์ง€ ๋ถˆํ™•์‹คํ•˜๋‹ค.

์ด๋Ÿฌํ•œ ๋ถˆํ™•์‹ค์„ฑ์œผ๋กœ ์ธํ•ด language processing์„ ์œ„ํ•œ ํšจ๊ณผ์ ์ธ semi-supervised learning approaches๋ฅผ ์ฐพ๊ธฐ ์–ด๋ ค์› ๋‹ค.

์ด ๋…ผ๋ฌธ์—์„œ๋Š”, unsupervised pre-training๊ณผ supervised fine-tuning์„ ์กฐํ•ฉํ•˜์—ฌ language understanding tasks๋ฅผ ์œ„ํ•œ semi-supervised approach๋ฅผ ํƒ์ƒ‰ํ•ด๋ณธ๋‹ค. ์ด ์—ฐ๊ตฌ์˜ ๋ชฉํ‘œ๋Š” ๊ด‘๋ฒ”์œ„ํ•œ ๊ณผ์ œ์— ์•ฝ๊ฐ„์˜ ์ ์‘๋งŒ์œผ๋กœ task์— ๋งž๊ฒŒ ๋ณ€ํ˜•์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” ๋ณดํŽธ์ ์ธ ํ‘œํ˜„(universal representation)์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

3. Framework

ํ›ˆ๋ จ ๊ณผ์ •์€ ๋‘ ๋‹จ๊ณ„๋กœ ์ด๋ฃจ์›Œ์ง„๋‹ค. ์ฒซ ๋‹จ๊ณ„๋Š” ๊ฑฐ๋Œ€ํ•œ text ์ž๋ฃŒ๋ฅผ ๊ฐ€์ง€๊ณ  ๋Œ€์šฉ๋Ÿ‰ ์–ธ์–ด ๋ชจ๋ธ์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ๊ทธ ํ›„ fine-tuning ๋‹จ๊ณ„์—์„œ labeled data๋ฅผ ์ด์šฉํ•œ discriminative task์— ๋ชจ๋ธ์„ ์ ์šฉ์‹œํ‚จ๋‹ค.

3.1 Unsupervised pre-training

์ผ๋ฐ˜์ ์ธ ์–ธ์–ด ๋ชจ๋ธ์ฒ˜๋Ÿผ $k$๊ฐœ์˜ ์ด์ „ ํ† ํฐ๋“ค์ด ์ฃผ์›Œ์กŒ์„ ๋•Œ, ํ˜„์žฌ ํ† ํฐ์ด ๋‚˜์˜ฌ ํ™•๋ฅ ์„ ๊ตฌํ•˜๋Š” likelihood ํ•จ์ˆ˜๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๋„๋ก ํ•œ๋‹ค.

$L_1(U)=\sum_i \log P(u_i \mid u_{i-k},\ldots,u_{i-1};\Theta)$

์—ฌ๊ธฐ์„œ $k$๋Š” context window์˜ ํฌ๊ธฐ์ด๊ณ , ์กฐ๊ฑด๋ถ€ํ™•๋ฅ  $P$๋Š” parameters $\Theta$๋ฅผ ๊ฐ€์ง€๋Š” neural network๋ฅผ ์ด์šฉํ•˜์—ฌ ๋ชจ๋ธ๋ง๋œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด parameters๋Š” SGD(Stochastic Gradient Descent)๋ฅผ ํ†ตํ•ด ํ•™์Šต๋œ๋‹ค. ๊ฐ€๋ น $k$๊ฐ’์ด 4์ด๋ฉด, ํ˜„์žฌ ์šฐ๋ฆฌ๊ฐ€ ์˜ˆ์ธกํ•˜๊ณ ์ž ํ•˜๋Š” ํ† ํฐ $u_i$ ์ด์ „ 4๊ฐœ์˜ ํ† ํฐ $u_{i-4}, u_{i-3}, u_{i-2}, u_{i-1}$๊ฐ€ ์ฃผ์›Œ์กŒ์„ ๋•Œ, ๋‹ค์Œ์œผ๋กœ $u_i$๊ฐ€ ๋‚˜์˜ฌ ํ™•๋ฅ ์„ ์ตœ๋Œ€ํ™”ํ•˜๋„๋ก ๋ชจ๋ธ์„ ํ•™์Šต์‹œํ‚จ๋‹ค.

์ด ์—ฐ๊ตฌ์—์„œ๋Š” ์–ธ์–ด ๋ชจ๋ธ๋กœ multi-layer Transformer decoder๋ฅผ ์‚ฌ์šฉํ–ˆ๋‹ค.

Pre-training ๊ณผ์ •์€ ์ด๋ ‡๋‹ค.

$h_0=UW_e+W_p$

  • input ํ† ํฐ U์— ๋Œ€ํ•ด token embedding๊ณผ position embedding์„ ์ˆ˜ํ–‰ํ•ด $h_0$๋ฅผ ๊ตฌํ•œ๋‹ค.

$h_l=\text{transformer\_block}(h_{l-1}) \quad \forall l \in [1,n]$

  • hidden state๋ฅผ transformer block์— ๋„ฃ๊ณ  ํ•™์Šต์‹œํ‚ค๋Š” ๊ณผ์ •์„ n๊ฐœ์˜ layer์— ๋ฐ˜๋ณตํ•œ๋‹ค.

$P(u) = \text{softmax}(h_n W_e^T)$

  • ์ตœ์ข…์ ์œผ๋กœ ๊ตฌํ•œ hidden state๋ฅผ ์ด์šฉํ•ด ํ™•๋ฅ ์„ ๊ตฌํ•œ๋‹ค.

3.2 Supervised fine-tuning

์–ธ์–ด ๋ชจ๋ธ์„ ํ•™์Šตํ•œ ํ›„, ๋ชจ๋ธ์˜ parameters๋ฅผ supervised target task์— ๋งž๊ฒŒ ์กฐ์ •ํ•œ๋‹ค.

ํ† ํฐ $x^1,โ€ฆ,x^m$์œผ๋กœ ์ด๋ค„์ง„ ๋ฌธ์žฅ๊ณผ ์šฐ๋ฆฌ๊ฐ€ ๊ตฌํ•˜๊ณ ์ž ํ•˜๋Š” label $y$๊ฐ€ ์žˆ์„ ๋•Œ, ๋จผ์ € input ๋ฐ์ดํ„ฐ๋ฅผ ์–ธ์–ด ๋ชจ๋ธ์— ๋„ฃ์–ด ์ตœ์ข… transformer ๋ธ”๋ก์˜ activation $h_l^m$์„ ๊ตฌํ•œ๋‹ค. ๊ทธ ํ›„ $W_y$๋ฅผ parameters๋กœ ๊ฐ–๋Š” ์ถ”๊ฐ€์ ์ธ linear output layer์— ๋„ฃ์–ด $y$๋ฅผ ์˜ˆ์ธกํ•œ๋‹ค.

$P(y x^1,โ€ฆ,x^m)=softmax(h_l^mW_y)$

๊ทธ๋ฆฌ๊ณ  ๋‹ค์Œ๊ณผ ๊ฐ™์€ likelihood ํ•จ์ˆ˜๋ฅผ maximizeํ•˜๋„๋ก ๋ชจ๋ธ์„ ํ•™์Šต์‹œํ‚จ๋‹ค.

$L_2(C)=\sum_{(x,y)} \log P(y \mid x^1,\ldots,x^m)$

์ด ์—ฐ๊ตฌ์—์„œ๋Š” ์ถ”๊ฐ€์ ์œผ๋กœ ์–ธ์–ด ๋ชจ๋ธ์„ ํ•™์Šตํ•  ๋•Œ ์‚ฌ์šฉํ•œ $L_1(U)$ ํ•จ์ˆ˜๋ฅผ $L_2(C)$์™€ ๊ฐ™์ด ์‚ฌ์šฉํ•˜์˜€๋”๋‹ˆ, supervised model์˜ ์ผ๋ฐ˜ํ™”๋ฅผ ๊ฐœ์„ ํ•  ์ˆ˜ ์žˆ์—ˆ๊ณ , ๋ชจ๋ธ์˜ ์ˆ˜๋ ด์„ ๊ฐ€์†ํ™”ํ•จ์„ ๋ฐœ๊ฒฌํ–ˆ๋‹ค.

๋”ฐ๋ผ์„œ $L_3(C)=L_2(C)+\lambda*L_1(c)$๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๋„๋ก fine-tuning ํ•™์Šต์„ ์ง„ํ–‰ํ•œ๋‹ค.

3.3 Task-specific input transformations

๊ฐ task์— ๋งž๊ฒŒ input ๊ตฌ์กฐ๋ฅผ ๋‹ค๋ฅด๊ฒŒ ํ•˜์—ฌ fine-tuning ํ•™์Šต์„ ์ง„ํ–‰ํ•œ๋‹ค.

  • Classification : the text is fed in as input as-is.
  • Textual Entailment : the premise and hypothesis are concatenated, separated by a delimiter token ($), and fed in as input.
  • Similarity : the two sentences being compared are likewise concatenated with a delimiter token between them. Because there is no inherent ordering between the two sentences, both orderings are processed independently. The outputs are then normalized through a softmax layer to obtain the most appropriate answer.
  • Question Answering and Commonsense Reasoning : the context document $z$ and question $q$ are concatenated with each possible answer in $\{a_k\}$ and processed independently, then normalized with a softmax layer to produce an output distribution over the possible answers.
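The transformations above can be sketched as plain token-list manipulations. The special token strings here are hypothetical stand-ins for the learned start, delimiter, and extract embeddings the paper uses:

```python
# Hypothetical placeholder strings for the learned special embeddings.
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"

def classification_input(text):
    # Single text span, wrapped in start/extract tokens.
    return [START] + text + [EXTRACT]

def entailment_input(premise, hypothesis):
    # Premise and hypothesis joined by the delimiter token.
    return [START] + premise + [DELIM] + hypothesis + [EXTRACT]

def similarity_inputs(text1, text2):
    # No inherent ordering: both orderings are processed independently,
    # and their final hidden states are combined before the output layer.
    return (entailment_input(text1, text2), entailment_input(text2, text1))

def multiple_choice_inputs(context, question, answers):
    # One sequence per candidate answer; a softmax over the per-sequence
    # scores gives the output distribution over answers.
    return [[START] + context + question + [DELIM] + a + [EXTRACT]
            for a in answers]
```

Each resulting token list is fed through the same pre-trained Transformer, so no task-specific architecture changes are needed beyond the linear output layer.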

Conclusion

GPT๋Š” unlabeled ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•˜์—ฌ unsupervised pre-training์„ ์ง„ํ–‰ํ•˜๊ณ , ๊ทธ ํ›„ ํŠน์ • task์— ๋งž๊ฒŒ labeled ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•˜์—ฌ supervised fine-tuning์„ ์ง„ํ–‰ํ•˜์˜€๊ณ , ์ด๋ฅผ ํ†ตํ•ด ๊ฐ๊ฐ์˜ ๋ชฉ์ ์— ๋งž๊ฒŒ task๋ฅผ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋„๋ก ๋งŒ๋“ค์–ด์ง„ ๋ชจ๋ธ์ด๋‹ค.

ํŠน์ • ๊ณผ์ œ์— ์ ํ•ฉํ•œ ๋ชจ๋ธ ๊ตฌ์กฐ๋ฅผ ๋งŒ๋“ค์–ด๋‚ด๋Š” ๋ฐฉ์‹์ด ์•„๋‹Œ, semi-supervised learning์„ ํ†ตํ•ด Language Model์„ ํ•™์Šต์‹œํ‚จ ํ›„, fine-tuning์„ ํ†ตํ•ด ๊ฐ task๋ฅผ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋„๋ก ๋ชจ๋ธ์„ ํ•™์Šต์‹œํ‚จ ์ ์ด ์ธ์ƒ์ ์ด์—ˆ๋‹ค.

