Attention Is All You Need

  • Link: https://arxiv.org/abs/1706.03762

💡 Background: Why Attention Emerged

Previously, models used the Seq2Seq architecture to understand natural language. A Seq2Seq model corresponds to the many-to-many configuration of an RNN. The part that reads in the input sentence is called the 'Encoder', and the part that generates the output sentence is called the 'Decoder'.

[Figure: Seq2Seq encoder-decoder structure]

๋ชจ๋ธ์ด ๋ฌธ์žฅ์„ ์ฝ์–ด์˜ฌ ๋•Œ, ์ธ์ฝ”๋”์—์„œ๋Š” ๋ฌธ์žฅ์˜ ๋งจ ์•ž ๋‹จ์–ด๋ถ€ํ„ฐ ์ˆœ์ฐจ์ ์œผ๋กœ ์ฝ์–ด์™€ ๋งˆ์ง€๋ง‰ hidden state ๋ฒกํ„ฐ์— ๋ชจ๋“  ์ธ์ฝ”๋”ฉ๋œ ์ •๋ณด๋ฅผ ์šฐ๊ฒจ ๋„ฃ์Šต๋‹ˆ๋‹ค. ์ด๋กœ ์ธํ•ด, ์•ž์— ๋‚˜์˜จ ๋‹จ์–ด์— ๋Œ€ํ•œ ์ •๋ณด๋Š” ์ ์ฐจ ์‚ฌ๋ผ์ง€๊ณ , ์ž…๋ ฅ ๋ฌธ์žฅ์˜ ๊ธธ์ด๊ฐ€ ๊ธธ์–ด์ง€๋ฉด Vanishing Gradient์™€ ๊ฐ™์€ Long-Term problem์ด ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค.

The concept of Attention was introduced to solve this.

💡 Attention

When we understand a sentence, we do not concentrate equally on every word in it. Given the sentence "Attention Is All You Need.", we focus more on the word 'Attention' than on the word 'Is'.

In other words, Attention is the idea that, in order to produce the predicted output word, the model should focus (attend) only on the words in the input sentence that are most relevant to that output, i.e., the most important ones.

Concretely, Attention expresses as a score how much the decoder, when generating the predicted word at each time step, should focus on each time step of the encoder.

๋””์ฝ”๋”์˜ ๊ฐ ํƒ€์ž„ ์Šคํ…๋งˆ๋‹ค ์ธ์ฝ”๋”์˜ hidden state ๋ฒกํ„ฐ์™€์˜ ์œ ์‚ฌ๋„๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ, ์ธ์ฝ”๋”์˜ ๋ช‡ ๋ฒˆ์งธ hidden state ๋ฒกํ„ฐ๊ฐ€ ๋” ํ•„์š”ํ•œ์ง€(์ค‘์š”ํ•œ์ง€)๋ฅผ ๊ณ ๋ คํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

💡 Seq2Seq with Attention

What happens if we apply the Attention technique to the classic Seq2Seq model?

In the vanilla RNN-based Seq2Seq architecture, the decoder's hidden state is computed from the previous time step's hidden state vector (its output) and the decoder input at the current time step.

In Seq2Seq with an added Attention structure, the attention scores are obtained by taking the dot product of the decoder's hidden state at the current time step with each encoder time step's hidden state vector.

These attention scores are used as weights over the encoder hidden state vectors: a weighted average of them yields the Attention vector, a single output vector. This Attention vector is then concatenated with the decoder's hidden state at the current time step and fed into the final output layer.
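
To make the mechanics concrete, here is a minimal NumPy sketch of a single decoder step with dot-product attention. The shapes (5 encoder steps, hidden size 8) and all the tensors are illustrative toy values, not from the paper:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy setup: 5 encoder time steps, hidden size 8 (illustrative shapes).
enc_hidden = np.random.randn(5, 8)   # encoder hidden states h_1 .. h_5
dec_hidden = np.random.randn(8)      # decoder hidden state at the current step

# 1) Attention scores: dot product of the decoder state with each encoder state.
scores = enc_hidden @ dec_hidden              # shape (5,)

# 2) Softmax turns the scores into weights that sum to 1.
weights = softmax(scores)                     # shape (5,)

# 3) Attention vector: weighted average of the encoder hidden states.
attention_vec = weights @ enc_hidden          # shape (8,)

# 4) Concatenate with the decoder state; this goes into the output layer.
output_layer_input = np.concatenate([attention_vec, dec_hidden])  # shape (16,)
```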

[Figure: Seq2Seq with Attention]

Source: Wikidocs, Introduction to Natural Language Processing with Deep Learning

💡 Transformer

The Transformer is a simple network architecture that relies entirely on the attention mechanism.

Attention mechanisms can model dependencies regardless of how far apart positions in the input and output sequences are, but, as in the model described above, they had mostly been used together with a recurrent network.

Recurrent models cannot be parallelized, by construction, and memory constraints make training problematic as sequences get longer.

๋”ฐ๋ผ์„œ, Transformer๋Š” recurrence์—†์ด ์˜ค๋กœ์ง€ attention mechanism์—๋งŒ ์˜์กดํ•˜์—ฌ ์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ ๊ฐ„์˜ ์ „์—ญ ์˜์กด์„ฑ(Global Dependency)๋ฅผ ๋ชจ๋ธ๋ง ํ•  ์ˆ˜ ์žˆ๋Š” ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.

๋” ์ด์ƒ RNN์ด๋‚˜ CNN ๋ชจ๋“ˆ์€ ํ•„์š”ํ•˜์ง€ ์•Š๊ณ , Attention Mechanism๋งŒ ์žˆ์œผ๋ฉด ๋˜๊ธฐ์— ๋…ผ๋ฌธ์˜ ์ œ๋ชฉ ๋˜ํ•œ Attention Is All You Need๋ผ๋Š” ์ ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

💡 Computing the Attention Vector from Query, Key, and Value

The vector whose similarities we want to compute is called the Query, and the other vectors against which the Query's similarity is measured are called Keys.

Applying Softmax to these similarity scores gives the attention weights, which are used to take a weighted average of the corresponding Values (so there are exactly as many values as keys) to obtain the Attention vector. This Attention vector becomes the hidden-state value for the Query.

Query, Key, and Value are obtained by multiplying the input word embeddings (X) by their respective weight matrices.

  • $XW^Q=Q,\ XW^K=K,\ XW^V=V$

    [Figure: obtaining Q, K, V by multiplying X with each weight matrix]

    Source: Wikidocs, Introduction to Natural Language Processing with Deep Learning
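
As a quick sketch of these projections (illustrative sizes; random matrices standing in for the learned weights):

```python
import numpy as np

seq_len, d_model, d_k = 4, 8, 8          # illustrative sizes
X = np.random.randn(seq_len, d_model)    # embeddings of the input words

# One weight matrix per role (learned during training, random here).
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # XW^Q = Q,  XW^K = K,  XW^V = V
```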

๋‹จ์ผ query q์— ๋Œ€ํ•ด, key๋“ค์˜ ํ–‰๋ ฌ์ธ K์™€ value๋“ค์˜ ํ–‰๋ ฌ์ธ V๊ฐ€ ์žˆ์„ ๋•Œ Attention ๋ฒกํ„ฐ๋ฅผ ๊ตฌํ•˜๋Š” ๊ณผ์ •์„ ์ˆ˜์‹์œผ๋กœ ํ‘œํ˜„ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

$A(q,K,V)= \sum_i \frac{\exp(q\cdot k_i)}{\sum_j\exp(q\cdot k_j)}v_i$

In fact, rather than computing an Attention vector word by word at the vector level, we can work at the matrix level: stacking the queries for every word in the sentence into a matrix yields the entire Attention matrix through a single matrix computation.

[Figures: computing the Attention matrix through matrix multiplications]

Source: Wikidocs, Introduction to Natural Language Processing with Deep Learning

$A(Q,K,V) = softmax(QK^T)V$

Thanks to this matrix formulation (which the paper credits to 'highly optimized matrix multiplication code'), the model gains both speed and memory advantages over RNN-family models.
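
A minimal NumPy version of the formula above, computing attention for all queries at once (shapes are illustrative, and `attention` is just a helper name used here):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # A(Q, K, V) = softmax(QK^T)V, for all queries at once.
    scores = Q @ K.T             # (seq_len, seq_len): every query against every key
    return softmax(scores) @ V   # each row: weighted average of the value vectors

seq_len, d_k = 4, 8
Q, K, V = (np.random.randn(seq_len, d_k) for _ in range(3))
out = attention(Q, K, V)         # (seq_len, d_k): one Attention vector per word
```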

💡 Scaled Dot-Product Attention

When computing attention scores, the variance of the dot products depends on the dimension of the query and key vectors, and this can skew the Softmax distribution. To compensate, we divide by the standard deviation, which keeps the variance at 1: if Q and K consist of vectors whose components have mean 0 and variance 1, then statistically the variance of $q\cdot k$ equals $d_k$. Therefore, dividing the dot products of Q and K by $\sqrt{d_k}$, the square root of the key dimension (hence 'scaled'), yields the final Attention vector:

$A(Q,K,V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
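
Here is a quick empirical check of that variance claim, followed by the scaled formula as code. This is a sketch with toy random data; `scaled_dot_product_attention` is an illustrative name, not the paper's implementation:

```python
import numpy as np

# Empirical check: with components of mean 0 and variance 1, the variance of
# q . k grows like d_k, and dividing by sqrt(d_k) brings it back to ~1.
d_k = 64
q = np.random.randn(100_000, d_k)
k = np.random.randn(100_000, d_k)
dots = (q * k).sum(axis=1)
print(dots.var())                    # ~ 64  (= d_k)
print((dots / np.sqrt(d_k)).var())   # ~ 1 after scaling

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # softmax(QK^T / sqrt(d_k)) V -- the scaled variant of the earlier formula.
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
```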

💡 Multi-Head Attention

Multi-Head Attention runs several versions of attention at the same time. A single head computes Q, K, and V through one set of weight matrices ($W_i^Q, W_i^K, W_i^V$) and from them produces one Attention vector. What if there are several such heads? With several sets of weight matrices ($head_0=Attention(QW_0^Q,KW_0^K,VW_0^V),\ head_1=Attention(QW_1^Q,KW_1^K,VW_1^V), \dots$) we obtain several versions of the Attention vector. The Attention vectors from the individual heads are concatenated (and, in the paper, projected once more with an output matrix $W^O$) to form the final result vector. This lets the model look at the sentence from several perspectives at once.
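
A compact sketch of the idea follows, with toy sizes and random matrices standing in for the learned $W_i^Q, W_i^K, W_i^V$ and $W^O$:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

seq_len, d_model, h = 4, 64, 8
d_k = d_model // h                       # per-head dimension, as in the paper
X = np.random.randn(seq_len, d_model)

heads = []
for i in range(h):
    # Each head has its own projections (random stand-ins for learned weights).
    W_Q, W_K, W_V = (np.random.randn(d_model, d_k) for _ in range(3))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))   # (seq_len, d_k)

# Concatenate the per-head results and project back to d_model with W^O.
W_O = np.random.randn(h * d_k, d_model)
multi_head_out = np.concatenate(heads, axis=-1) @ W_O    # (seq_len, d_model)
```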

💡 Three Kinds of Attention

[Figure: where the three Multi-Head Attention blocks sit in the Transformer]

The Transformer uses Multi-Head Attention in three different places.

  • Encoder Self-Attention: Self-attention means attending to oneself; simply put, an Attention vector is computed for every vector of the input sentence that enters the encoder. Here Q, K, and V all come from the vectors of the input sentence. Through self-attention, we can obtain the similarities between the words within the input sentence.

    [Figure: Encoder Self-Attention]

  • Masked Decoder Self-Attention: An RNN, by construction, could only consult the words entered so far when predicting the next word. The Transformer, however, takes the whole sentence matrix as input, so when predicting the next word it could also consult the words that come after it. To prevent this, masking is applied so that, in the attention score matrix, each position can only attend to itself (the embedding entering the decoder) and the words that came before it; a sketch of this masking appears after this list.

    [Figures: masking the attention score matrix so future words are hidden]

  • Encoder-Decoder Attention: This one is not self-attention: attention is performed with Queries from the decoder against the Keys and Values that come out of the encoder's final layer.

    [Figure: Encoder-Decoder Attention]
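
Following up on the masked decoder self-attention above, here is a minimal NumPy sketch of the causal mask. Shapes and the helper name `masked_self_attention` are illustrative; the mask sets future positions to -inf so Softmax assigns them zero weight:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_self_attention(X, W_Q, W_K, W_V):
    # Decoder self-attention: position i may only attend to positions <= i.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Future (strictly upper-triangular) entries get -inf,
    # so softmax assigns them a weight of exactly 0.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[future] = -np.inf
    return softmax(scores) @ V

seq_len, d_model = 4, 8
X = np.random.randn(seq_len, d_model)
W_Q, W_K, W_V = (np.random.randn(d_model, d_model) for _ in range(3))
out = masked_self_attention(X, W_Q, W_K, W_V)   # (seq_len, d_model)
```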

💡 Residual Connection & Layer Normalization (Add & Norm)

[Figure: a Transformer sub-layer followed by Add & Norm]

Looking at a block (module), three vectors enter Multi-Head Attention as query, key, and value, and the embedding vector that comes out of attention is added back to the original embedding vector (Add); this is called a residual connection. Residual connections alleviate the problem of gradients growing or shrinking uncontrollably as the layers get deeper.

Afterwards, Layer Normalization computes the mean and variance over each input's features and uses them to normalize each individual input in the batch.
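
A minimal sketch of the Add & Norm step; `layer_norm` and `add_and_norm` are illustrative names, and the learnable gain and bias that real layer-norm implementations include are omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Mean and variance are taken over each input's feature axis,
    # so every position is normalized independently of the rest of the batch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_out):
    # Residual connection (Add), then layer normalization (Norm).
    return layer_norm(x + sublayer_out)

x = np.random.randn(4, 8)              # embeddings entering the sub-layer
sublayer_out = np.random.randn(4, 8)   # e.g. the Multi-Head Attention output
y = add_and_norm(x, sublayer_out)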

💡 Positional Encoding

An RNN learns word-order information naturally through its structure, but an attention model by itself does not learn order information. Positional Encoding is therefore the technique of adding, at each word's position, information that represents that position. It uses the sinusoidal functions below, applying sin or cos depending on the dimension index within the embedding vector:

$PE_{(pos,\,2i)}=\sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right),\quad PE_{(pos,\,2i+1)}=\cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$
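
As a small sketch, the table of encodings can be built once from the formulas above and added to the embeddings; `positional_encoding` is an illustrative helper and the sizes are toy values:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions use sin
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions use cos
    return pe

embeddings = np.random.randn(4, 8)               # (seq_len, d_model)
x = embeddings + positional_encoding(4, 8)       # added to, not concatenated with
```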

💡 Advantages of the Transformer

The Transformer model offers the following advantages:

  • Parallelization: computation can be parallelized, improving computational efficiency and training speed;
  • Long-range Dependencies: relationships between any two positions (long-term dependencies) are easy to learn;
  • Interpretability: attention scores can be visualized to inspect the relationships between the individual elements.

[Figure: visualization of attention scores between words]
