End-to-End Object Detection with Transformers

  • Link: https://arxiv.org/abs/2005.12872


💡 What are the strengths of this paper?

  • Compared with existing object detection pipelines, DETR is far simpler yet achieves competitive performance.

💡 What does the paper propose?

  • DEtection TRansformer (DETR) introduces a bipartite matching loss function and performs the object detection task with a Transformer.

Overall training procedure

  1. Extract feature map by CNN backbone
    • A CNN backbone network extracts a feature map from the input image.
  2. Add Positional Encoding
    • A positional encoding is added so that the flattened features keep their relative spatial position before entering the encoder.
  3. Generate object queries
  4. Output encoder memory by Transformer encoder
  5. Output embedding by Transformer decoder
  6. Class prediction by Class head
  7. Bounding box prediction by Bounding box head
  8. Match prediction with ground truth by Hungarian Matcher
  9. Compute losses
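
The nine steps above map fairly directly onto code. The sketch below covers steps 1-7 in PyTorch, using the paper's default sizes (d = 256, N = 100 object queries) but a simplified layout: DETRSketch, its learned positional embedding, and the single-layer box head are illustrative stand-ins for the official implementation (which uses a 2-D sinusoidal encoding and a 3-layer MLP box head).

```python
import torch
import torch.nn as nn
import torchvision

class DETRSketch(nn.Module):
    """Simplified DETR-style model: CNN backbone -> Transformer -> class/box heads."""

    def __init__(self, num_classes, d_model=256, num_queries=100):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])       # step 1: CNN feature map
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)                # project channels down to d
        self.pos_embed = nn.Parameter(torch.rand(2500, d_model))           # step 2: positional encoding (learned here for brevity)
        self.query_embed = nn.Parameter(torch.rand(num_queries, d_model))  # step 3: object queries
        self.transformer = nn.Transformer(d_model, nhead=8, batch_first=True)  # steps 4-5
        self.class_head = nn.Linear(d_model, num_classes + 1)              # step 6: classes + "no object"
        self.bbox_head = nn.Linear(d_model, 4)                             # step 7: normalized (cx, cy, w, h)

    def forward(self, images):                            # images: (B, 3, H, W)
        feat = self.proj(self.backbone(images))           # (B, d, h, w)
        B, d, h, w = feat.shape
        src = feat.flatten(2).permute(0, 2, 1)            # (B, h*w, d) token sequence
        src = src + self.pos_embed[: h * w]               # step 2: add positional encoding
        tgt = self.query_embed.unsqueeze(0).expand(B, -1, -1)      # (B, N, d)
        hs = self.transformer(src, tgt)                   # steps 4-5: encoder memory -> decoder embeddings
        return self.class_head(hs), self.bbox_head(hs).sigmoid()   # steps 6-7

# A fixed set of N = 100 class/box predictions is produced for every image.
model = DETRSketch(num_classes=91)
logits, boxes = model(torch.rand(1, 3, 224, 224))
print(logits.shape, boxes.shape)   # torch.Size([1, 100, 92]) torch.Size([1, 100, 4])
```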

Background

  • Conventional object detectors mostly rely on pre-defined anchors. Anchors with various scales and aspect ratios are generated at fixed locations across the image, and the predicted bounding boxes derived from these anchors are matched to the ground-truth boxes. A prediction whose IoU with a ground truth is above a certain threshold is treated as a positive sample, and bounding box regression is performed only on the positive samples. As a result, multiple predicted boxes can be matched to a single ground-truth box, i.e. a many-to-one relationship holds between predicted boxes and ground truth.
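
For contrast with DETR's approach, here is a rough sketch of the anchor-matching rule just described; iou_matrix, match_anchors, and the 0.5 threshold are illustrative choices rather than any particular detector's implementation.

```python
import torch

def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU between anchors (A, 4) and ground-truth boxes (G, 4) in (x1, y1, x2, y2)."""
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    lt = torch.max(anchors[:, None, :2], gt_boxes[None, :, :2])   # top-left of each intersection
    rb = torch.min(anchors[:, None, 2:], gt_boxes[None, :, 2:])   # bottom-right of each intersection
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_g[None, :] - inter)    # (A, G)

def match_anchors(anchors, gt_boxes, pos_thresh=0.5):
    """Threshold-based matching: every anchor above the IoU threshold is a positive sample,
    so several anchors may be assigned to the same ground truth (many-to-one)."""
    iou = iou_matrix(anchors, gt_boxes)
    best_iou, best_gt = iou.max(dim=1)         # best ground truth for each anchor
    positive = best_iou >= pos_thresh
    return positive, best_gt
```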

Bipartite Matching


  • Existing object detection methods are complex and depend on a variety of libraries. They also require prior knowledge, such as the typical shape of bounding boxes and how to handle overlapping boxes; for example, if the target object is elongated like a train, the boxes are designed to be elongated as well.
  • Moreover, because multiple bounding boxes predict the same ground truth, a post-processing step such as NMS (Non-Maximum Suppression) is required to remove these near-duplicate, redundant predictions.
  • This paper instead treats detection directly as a set prediction problem solved via bipartite matching. Here, "set" is meant in the mathematical sense: it contains no duplicate elements, and the order of its elements does not matter.
  • If the number of predictions per image is fixed (DETR always predicts a fixed set of N objects, padding the ground truth with "no object" where needed), bipartite matching can be performed.


  • The Hungarian algorithm is used to find the bipartite matching between ground-truth boxes and predictions that minimizes the matching cost, as sketched below.
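
A minimal sketch of this step using SciPy's Hungarian solver (scipy.optimize.linear_sum_assignment). The cost below combines the negative class probability with an L1 distance between boxes; it omits the GIoU term the paper also uses in the matching cost, and the weights are illustrative.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes, w_class=1.0, w_l1=5.0):
    """One-to-one matching between N predictions and M ground truths (M <= N).

    pred_logits: (N, C+1), pred_boxes: (N, 4), gt_labels: (M,), gt_boxes: (M, 4).
    Simplified stand-in for the paper's matching cost (no GIoU term).
    """
    prob = pred_logits.softmax(-1)                      # (N, C+1)
    cost_class = -prob[:, gt_labels]                    # (N, M): higher class probability -> lower cost
    cost_l1 = torch.cdist(pred_boxes, gt_boxes, p=1)    # (N, M): L1 distance between boxes
    cost = w_class * cost_class + w_l1 * cost_l1
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return pred_idx, gt_idx                             # matched (prediction, ground-truth) index pairs

# Example: 100 predictions, 3 ground-truth objects -> exactly 3 matched pairs.
pred_idx, gt_idx = hungarian_match(torch.rand(100, 92), torch.rand(100, 4),
                                   torch.tensor([1, 7, 42]), torch.rand(3, 4))
print(pred_idx, gt_idx)
```

Because the assignment is one-to-one, each ground-truth box is matched to exactly one prediction, which is what removes the need for NMS-style post-processing.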

Generalized Intersection over Union (GIoU)

  • The GIoU loss is based on the IoU between two boxes and is therefore scale-invariant.
  • To compute GIoU, first find the smallest box $B(b_{\sigma(i)},\hat{b}_i)$ that encloses both the predicted box $b_{\sigma(i)}$ and the ground-truth box $\hat{b}_i$. The more the two boxes overlap, the smaller $B(b_{\sigma(i)},\hat{b}_i)$ becomes; the farther apart they are, the larger it becomes.
  • $IoU(b_{\sigma(i)},\hat{b}_i)$ denotes the IoU between the two boxes, and the correction term $\frac{|B(b_{\sigma(i)},\hat{b}_i) \setminus (b_{\sigma(i)} \cup \hat{b}_i)|}{|B(b_{\sigma(i)},\hat{b}_i)|}$ is the area of the enclosing box not covered by the union of the predicted and ground-truth boxes, normalized by the area of the enclosing box; GIoU subtracts this term from the IoU. GIoU takes values between -1 and 1, and the GIoU loss is used in the form $1-GIoU$.
  • $L_{box}(b_{\sigma(i)},\hat{b}_i)=\lambda_{iou}L_{iou}(b_{\sigma(i)},\hat{b}_i)+\lambda_{L1}\|b_{\sigma(i)}-\hat{b}_i\|_1$, where $\lambda_{iou},\lambda_{L1}$ are scalar hyperparameters that balance the two terms.
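
The following is a small sketch of GIoU and the box loss above for a single matched pair, with boxes in (x1, y1, x2, y2) format; the default weights (lambda_iou = 2, lambda_l1 = 5) follow the values used in the paper, but the per-pair formulation here is simplified.

```python
import torch

def giou(box1, box2):
    """Generalized IoU between two boxes in (x1, y1, x2, y2) format; value in [-1, 1]."""
    inter_w = (torch.min(box1[2], box2[2]) - torch.max(box1[0], box2[0])).clamp(min=0)
    inter_h = (torch.min(box1[3], box2[3]) - torch.max(box1[1], box2[1])).clamp(min=0)
    inter = inter_w * inter_h
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - inter
    iou = inter / union
    # Smallest enclosing box B(b, b_hat) containing both boxes
    encl_w = torch.max(box1[2], box2[2]) - torch.min(box1[0], box2[0])
    encl_h = torch.max(box1[3], box2[3]) - torch.min(box1[1], box2[1])
    encl = encl_w * encl_h
    return iou - (encl - union) / encl        # IoU minus the "wasted" area of the enclosing box

def box_loss(pred, gt, lambda_iou=2.0, lambda_l1=5.0):
    """L_box = lambda_iou * (1 - GIoU) + lambda_l1 * ||b - b_hat||_1 for one matched pair."""
    return lambda_iou * (1.0 - giou(pred, gt)) + lambda_l1 * (pred - gt).abs().sum()

print(box_loss(torch.tensor([0.1, 0.1, 0.5, 0.5]), torch.tensor([0.12, 0.1, 0.52, 0.5])))
```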

Transformer


The Transformer used in DETR differs from the Transformer used in NLP tasks in several ways.

  1. The original Transformer encoder takes sentence (token) embeddings as input, whereas DETR's encoder takes an image feature map.
  2. The original Transformer decoder takes target embeddings as input, whereas DETR's decoder takes object queries.
  3. The original Transformer performs masked multi-head attention in the decoder's first attention operation, whereas DETR performs ordinary multi-head self-attention, decoding all object queries in parallel.
  4. The original Transformer has a single head after the decoder, whereas DETR has two heads (a class head and a bounding box head).
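
Differences 2 and 3 show up directly in how the decoder is called: DETR feeds learned object queries instead of target token embeddings and applies no causal mask, so all N outputs are produced in parallel. Below is a minimal sketch with torch.nn.TransformerDecoder; the tensor sizes are illustrative.

```python
import torch
import torch.nn as nn

d_model, num_queries, seq_len = 256, 100, 950    # illustrative sizes: d, N, H*W

decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)

memory = torch.rand(1, seq_len, d_model)         # encoder output over the image feature tokens
queries = torch.rand(1, num_queries, d_model)    # object queries (not target token embeddings)

# DETR-style call: no tgt_mask, so the first attention is plain multi-head
# self-attention over all queries, decoded in parallel.
out = decoder(tgt=queries, memory=memory)

# NLP-style call: a causal mask turns the first attention into "masked" attention,
# as used for autoregressive text decoding.
causal = torch.triu(torch.full((num_queries, num_queries), float("-inf")), diagonal=1)
out_masked = decoder(tgt=queries, memory=memory, tgt_mask=causal)

print(out.shape)    # torch.Size([1, 100, 256])
```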

Encoder


  • The encoder takes as input a flattened feature map of size $d \times HW$, where $d$ is the feature (channel) dimension and $HW$ enumerates the spatial positions of the feature map, so each pixel position becomes one token.
  • Visualizing the encoder's self-attention maps shows that it already separates individual instances reasonably well.
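
A short sketch of how a backbone feature map becomes that $d \times HW$ sequence and is combined with a positional encoding; the 1-D sinusoidal encoding below is a simplified stand-in for DETR's 2-D spatial encoding, and the tensor sizes are illustrative.

```python
import torch

def flatten_feature_map(feat):
    """(B, d, H, W) feature map -> (B, H*W, d): each spatial position becomes one d-dim token."""
    B, d, H, W = feat.shape
    return feat.flatten(2).permute(0, 2, 1)

def sinusoidal_encoding(num_pos, d):
    """Simplified 1-D sinusoidal positional encoding (DETR uses a 2-D variant)."""
    pos = torch.arange(num_pos, dtype=torch.float32).unsqueeze(1)   # (HW, 1)
    i = torch.arange(0, d, 2, dtype=torch.float32)                  # even channel indices
    angles = pos / (10000 ** (i / d))                               # (HW, d/2)
    enc = torch.zeros(num_pos, d)
    enc[:, 0::2] = torch.sin(angles)
    enc[:, 1::2] = torch.cos(angles)
    return enc

feat = torch.rand(1, 256, 25, 38)                  # projected backbone features (d=256, H=25, W=38)
tokens = flatten_feature_map(feat)                 # (1, 950, 256)
tokens = tokens + sinusoidal_encoding(25 * 38, 256)
print(tokens.shape)
```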

Decoder


  • The decoder takes $N$ object queries (learned positional embeddings) as its initial input. After the encoder has separated the instances via global self-attention, the decoder extracts the class and box boundaries of each instance.
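
As a small follow-up to the pipeline sketch earlier, the snippet below shows what happens to the N decoder output embeddings at inference time: each one passes through the class head (which includes a "no object" class) and the box head, and low-confidence predictions are dropped. The 0.7 threshold and tensor sizes are illustrative choices, not values from the paper.

```python
import torch
import torch.nn as nn

d_model, num_queries, num_classes = 256, 100, 91     # illustrative sizes
class_head = nn.Linear(d_model, num_classes + 1)     # +1 for the "no object" class
bbox_head = nn.Linear(d_model, 4)                    # normalized (cx, cy, w, h)

hs = torch.rand(1, num_queries, d_model)             # decoder output embeddings

probs = class_head(hs).softmax(-1)                   # (1, N, C+1)
boxes = bbox_head(hs).sigmoid()                      # (1, N, 4), coordinates in [0, 1]

scores, labels = probs[..., :-1].max(-1)             # best real class per query, ignoring "no object"
keep = scores > 0.7                                  # illustrative confidence cut-off
print(labels[keep].shape, boxes[keep].shape)         # the kept queries are the final detections
```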

Reference

  • https://www.youtube.com/watch?v=hCWUTvVrG7E&t=1207s

  • https://herbwood.tistory.com/26
