Apr 02, 2021 • 4 min read ☕ (Last updated: Apr 02, 2021)

EfficientNetV2 - Smaller Models and Faster Training

TL;DR

The second EfficientNet paper is out, written by the same two authors as the original EfficientNet.

This one again targets efficiency: using NAS, they drastically cut training time and parameter count while achieving comparable or better accuracy.

paper : arXiv

code : github

References closely related to this paper:

  1. EfficientNet : paper

Introduction

Looking at recent convolution-based architectures (e.g. ResNet-RS, NFNet), accuracy is great, but they have so many parameters and such huge FLOPs that training them is painful without serious hardware.

Training Efficiency

Training with very large image sizes is slow

The authors point out that very large image sizes force small batch sizes, which is a major cause of slow training, and they mitigate this by progressively adjusting the image size during training.

Depthwise Convolutions are slow in early layers

The EfficientNet architecture uses a block called MBConv, which relies on depthwise convolutions. The problem is that this operation is poorly accelerated on TPUs/GPUs, so even though it has fewer parameters and FLOPs than a regular convolution, it runs slower.

Because of this, recent work introduced a block called Fused-MBConv, which replaces the Conv 1x1 + depthwise Conv 3x3 pair with a single regular Conv 3x3, as shown in the figure below.

MBConv
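To make the difference concrete, here is a minimal PyTorch sketch of the two blocks (my own simplified version: no SE module, no stride handling, no drop-connect, and the expansion ratio is just a placeholder rather than the paper's setting):

```python
import torch
import torch.nn as nn

def conv_bn_act(in_ch, out_ch, kernel, stride=1, groups=1):
    # Conv -> BatchNorm -> SiLU, the basic unit used by both blocks
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel, stride, kernel // 2, groups=groups, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.SiLU(),
    )

class MBConv(nn.Module):
    """EfficientNet-style block: 1x1 expand -> 3x3 depthwise -> 1x1 project."""
    def __init__(self, ch, expand=4):
        super().__init__()
        mid = ch * expand
        self.block = nn.Sequential(
            conv_bn_act(ch, mid, 1),               # 1x1 expansion
            conv_bn_act(mid, mid, 3, groups=mid),  # 3x3 depthwise convolution
            nn.Conv2d(mid, ch, 1, bias=False),     # 1x1 projection
            nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return x + self.block(x)  # residual (stride 1, same channel count)

class FusedMBConv(nn.Module):
    """The 1x1 expansion + 3x3 depthwise pair fused into one regular 3x3 conv."""
    def __init__(self, ch, expand=4):
        super().__init__()
        mid = ch * expand
        self.block = nn.Sequential(
            conv_bn_act(ch, mid, 3),            # single regular 3x3 convolution
            nn.Conv2d(mid, ch, 1, bias=False),  # 1x1 projection
            nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return x + self.block(x)

x = torch.randn(1, 24, 56, 56)
print(MBConv(24)(x).shape, FusedMBConv(24)(x).shape)  # both preserve the shape
```

The fused variant has more parameters and FLOPs, but a single dense 3x3 convolution maps onto GPU/TPU kernels much better than the 1x1 + depthwise pair, which is the whole point.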

The authors gradually applied Fused-MBConv to EfficientNet-B4 and found that using it only in the early layers (stages 1-3) gives both the fastest training and the best accuracy.

Equally scaling up every stage is sub-optimal

EfficientNet scales according to the compound scaling rule: if the depth coefficient is 2, every stage is scaled by 2. But the stages do not contribute equally to training time and parameter count, so the authors switch to a non-uniform scaling strategy.

Since image size also has a big impact on training time and memory, they modified the scaling rule for image size as well.
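As a rough illustration of the difference (the stage depths and the late-stage bonus below are made-up numbers, not the paper's actual rule): compound scaling multiplies every stage by the same coefficient, while a non-uniform strategy can give proportionally more layers to the later stages.

```python
import math

base_depths = [2, 4, 4, 6, 9, 15]  # layers per stage (illustrative values only)

def uniform_scale(depths, depth_coef):
    # EfficientNet-style compound scaling: every stage gets the same multiplier
    return [math.ceil(d * depth_coef) for d in depths]

def nonuniform_scale(depths, depth_coef, late_bonus=1.3):
    # EfficientNetV2-style idea (illustrative only): scale later stages more,
    # matching the "add more layers to later stages" constraint mentioned below
    out = []
    for i, d in enumerate(depths):
        coef = depth_coef * (late_bonus if i >= len(depths) - 2 else 1.0)
        out.append(math.ceil(d * coef))
    return out

print(uniform_scale(base_depths, 2.0))     # [4, 8, 8, 12, 18, 30]
print(nonuniform_scale(base_depths, 2.0))  # the last two stages get extra layers
```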

Training-Aware NAS and Scaling

To find the best combination for training speed, they propose a training-aware NAS.

It builds on the NAS used for EfficientNet, jointly optimizing the following objectives:

  1. accuracy
  2. parameter-efficiency
  3. training-efficiency (on modern accelerators)

The concrete search settings are in the paper.
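If I remember the paper right, the three objectives are combined into a single search reward as a simple weighted product A · S^w · P^v, where A is accuracy, S the normalized training step time, P the parameter count, and w = -0.07, v = -0.05. A tiny sketch of that reward:

```python
def nas_reward(accuracy, step_time, params, w=-0.07, v=-0.05):
    """Weighted-product search reward: accuracy * step_time^w * params^v.

    step_time and params get negative exponents, so slower or heavier
    candidates are penalized. The w/v values are the ones I recall from
    the paper; the candidate numbers below are made up.
    """
    return accuracy * (step_time ** w) * (params ** v)

# A slightly less accurate but faster and smaller candidate wins here:
print(nas_reward(0.830, step_time=0.8, params=22e6))  # ~0.362
print(nas_reward(0.835, step_time=1.2, params=54e6))  # ~0.339
```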

EfficientNetV2 Architecture

The architecture found with this NAS (EfficientNetV2-S, the baseline) is shown below. It differs from EfficientNet in four main ways:

  1. It mixes MBConv and Fused-MBConv blocks.
  2. It uses smaller expansion ratios (for MBConv), since they incur less memory access overhead.
  3. It prefers 3x3 kernel sizes (but stacks more layers to compensate for the smaller receptive field).
  4. It removes the last stride-1 stage of EfficientNet, again because of memory overhead.

EfficientNetV2-S
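Read as a pattern, the stage layout looks roughly like the sketch below. This is only my paraphrase of the four points above (fused blocks with small expansion ratios early, SE-equipped MBConv with larger expansion ratios late); the real channel and layer counts are in the paper's table, and the numbers here are placeholders.

```python
# Rough sketch of the stage pattern only -- see the paper's table for the real
# EfficientNetV2-S configuration; the counts below are placeholders.
stages = [
    # (block type,     kernel, expansion, layers)
    ("FusedMBConv",    3,      1,         2),   # early stages: fused blocks,
    ("FusedMBConv",    3,      4,         4),   #   small expansion ratios
    ("FusedMBConv",    3,      4,         4),
    ("MBConv + SE",    3,      4,         6),   # later stages: depthwise MBConv
    ("MBConv + SE",    3,      6,         9),   #   with squeeze-and-excitation
    ("MBConv + SE",    3,      6,         15),  #   and larger expansion ratios
]
```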

EfficientNetV2 Scaling

Based on EfficientNetV2-S, they also build M/L versions, scaling with a few restrictions:

  1. cap the maximum inference image size at 480
  2. add more layers to the later stages (stages 5 and 6)

acc_vs_step

Progressive Learning

The image size is used dynamically during training, but earlier work that did this suffered an accuracy drop. This paper conjectures that the drop comes from imbalanced regularization: training with a different image size should be paired with a correspondingly different regularization strength.

Experiments varying the regularization strength (below) confirmed the conjecture: with small image sizes, weak augmentation helps accuracy more, and with large image sizes, strong augmentation helps more.

regularization_strength

To implement progressive learning, they set up a formulation, which goes as follows.

The whole training run takes N steps and is split into M stages, and k indexes the kinds of regularization (e.g. RandAugment, Mixup, Dropout, ...).

progressive_learning
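A minimal sketch of that schedule, assuming plain linear interpolation between the start and end values across the M stages (which is what the formulation amounts to, as far as I can tell); the start/end numbers below are placeholders, not the paper's settings:

```python
def progressive_value(stage, num_stages, start, end):
    """Linearly interpolate a quantity (image size or a regularization
    magnitude) for training stage `stage` out of `num_stages` stages."""
    t = stage / max(num_stages - 1, 1)
    return start + t * (end - start)

M = 4  # number of training stages (each runs N / M steps)
for i in range(M):
    size    = int(progressive_value(i, M, start=128, end=300))
    dropout = progressive_value(i, M, start=0.10, end=0.30)
    randaug = progressive_value(i, M, start=5, end=15)
    print(f"stage {i}: image_size={size}, dropout={dropout:.2f}, "
          f"randaug_magnitude={randaug:.1f}")
```

So early stages train on small images with weak regularization, and both grow together toward the target image size and full-strength regularization in the final stage.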

Each model was trained with the recipe below.

recipe

Benchmark

ImageNet

It beats EfficientNet across the board in both accuracy and training speed.

performance

efficiency

Transfer Learning Performance Comparison

They also report transfer learning results on other datasets, where performance is comparable.

transfer_learning

Conclusion

Rather than a brand-new idea, this is closer to combining several existing techniques and experimenting with training recipes, but I personally enjoy this kind of tuning-oriented work too, and the improvements are huge, so it was a fun read.

Bottom line: good stuff.