
[Week6] CV - Image Classification Ⅱ [Day3]

1. Problems with deeper layers

 

  • AlexNet -> VGGNet
    • Deeper networks learn more powerful features, because of
      • Larger receptive fields
        cf. receptive field: the region of the input image that a single filter (activation) can cover at each stage of a CNN (see the sketch at the end of this section)
      • More capacity and non-linearity
    • But does getting deeper and deeper always work better?
  • Deeper networks are harder to optimize
    • Gradient vanishing / exploding
    • Computationally complex
    • Degradation problem (not overfitting)
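
Below is a minimal sketch of how the receptive field grows with stacked convolutions, using the standard recurrence r_out = r_in + (k - 1) * j_in, where j is the cumulative stride. The function name and the layer lists are illustrative, not from the lecture.

```python
# Minimal sketch: receptive field growth for a stack of small convolutions.
# Recurrence: r_out = r_in + (k - 1) * j_in, j_out = j_in * s,
# where r is the receptive field size and j is the cumulative stride ("jump").

def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, applied in order."""
    r, j = 1, 1  # a single input pixel sees only itself; jump starts at 1
    for k, s in layers:
        r = r + (k - 1) * j
        j = j * s
    return r

# Two stacked 3x3 convs (stride 1) cover a 5x5 region, three cover 7x7 --
# the same receptive field as one 7x7 filter, but with more non-linearity
# and fewer parameters (the VGGNet idea).
print(receptive_field([(3, 1), (3, 1)]))          # 5
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
```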

 

2. CNN architectures for image classification 2

2.1 GoogLeNet (2015)

*Inception module

  • Apply multiple filter operations on input activation from the previous layer:
    • 1x1, 3x3, 5x5 convolution filters
    • 3x3 pooling operation
  • Concatenate all filter outputs together along the channel axis
  • The increased network size increases the use of computational resources -> use 1x1 convolutions!
  • Use 1x1 convolutions as "bottleneck" layers that reduce the number of channels before the expensive 3x3 and 5x5 convolutions (see the sketch at the end of this section)
  • Stem network : vanilla convolution networks
  • Stacked inception modules
  • Auxiliary classifiers
    • The vanishing gradient problem is dealt with by the auxiliary classifier
    • Injecting additional gradients into lower layers
    • Used only during training, removed at testing time
  • Classifier output (a single FC layer)
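
Below is a minimal PyTorch sketch of an Inception-style module with 1x1 bottleneck convolutions, as described above. The class name and channel counts are illustrative assumptions, not the exact GoogLeNet configuration, and ReLU activations are omitted for brevity.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Inception-style block: parallel 1x1 / 3x3 / 5x5 convs and a 3x3 max-pool,
    with 1x1 "bottleneck" convs reducing channels before the expensive filters.
    All branch outputs are concatenated along the channel axis."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, kernel_size=1),          # bottleneck
            nn.Conv2d(c3_red, c3, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, kernel_size=1),          # bottleneck
            nn.Conv2d(c5_red, c5, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1),
        )

    def forward(self, x):
        # Concatenate all branch outputs along the channel axis (dim=1).
        return torch.cat(
            [self.branch1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)],
            dim=1,
        )

x = torch.randn(1, 192, 28, 28)
out = InceptionModule(192, 64, 96, 128, 16, 32, 32)(x)
print(out.shape)  # torch.Size([1, 256, 28, 28])  (64 + 128 + 32 + 32 channels)
```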


2.2 ResNet (2016)

  • The first architecture to stack more than 100 layers, demonstrating that performance keeps improving as the network gets deeper
  • The first to surpass human-level performance
  • Strong performance across classification, localization, detection, and segmentation
  • What makes it hard to build a very deep architecture?
  • Degradation problem
    • As the network depth increases, accuracy gets saturated and then degrades rapidly
    • This is not caused by overfitting. The problem is optimization!
    • Not overfitting, but rather an optimization issue such as gradient vanishing / exploding
  • Hypothesis

    • Plain layer: as the layers get deeper, it is hard to learn the desired mapping H(x) directly
    • Residual block: instead, we learn the residual F(x)
      • Target function: H(x) = F(x) + x
      • Residual function: F(x) = H(x) - x
    • Solution
      • Shortcut connection: even if gradients vanish in some layers during backpropagation, the identity term keeps gradient and information flowing without loss (see the sketch at the end of this section)
    • Analysis of residual connection
      • Residual networks have O(2^n) implicit paths connecting input and output, and adding a block doubles the number of paths
  • Overall architecture
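
Below is a minimal PyTorch sketch of a residual block computing H(x) = F(x) + x through an identity shortcut. It is a simplified basic block with illustrative names and channel counts; the stride-2 / projection-shortcut variants of the actual ResNet are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic residual block: the stacked convs learn the residual F(x),
    and the shortcut adds the identity so the output is H(x) = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Shortcut connection: gradients also flow through this identity path,
        # so deep stacks remain trainable even if the conv path's gradient shrinks.
        return F.relu(out + x)

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```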


2.3 Beyond ResNets

  • DenseNet
    • In ResNet, the input and the output of a layer are added element-wise
    • In Dense blocks, the outputs of all preceding layers are concatenated along the channel axis
      -> information from earlier layers is preserved through concatenation
  • SENet
    • Attention across channels
    • Recalibrates channel-wise responses by modeling interdependencies between channels
    • Squeeze and excitation operations
      • Squeeze: capturing distributions of channel-wise responses by global average pooling
        Global average pooling collapses the spatial dimensions to 1x1, leaving only each channel's average response (the distribution along the channel axis).
      • Excitation: gating channels by channel-wise attention weights obtained by a FC layer
        An FC layer produces attention scores for re-weighting the channels; rescaling the activations with these scores strengthens important channels and weakens less important ones (see the sketch at the end of this section).
  • EfficientNet
    • Building deep, wide, and high resolution networks in an efficient way
    • The individual scaling approaches (b), (c), (d) (in the paper's figure: width, depth, and resolution scaling) all eventually saturate, but at different points
    • (e) Compound scaling: therefore, mixing them appropriately can push performance further
    • Achieves better performance than even NAS-found architectures
  • Deformable convolution
    • 2D spatial offset prediction for irregular convolution
    • Irregular grid sampling with 2D spatial offsets
    • Implemented by standard CNN and grid sampling with 2D offsets
    • How is the offset branch obtained? Randomly? -> It is not random; the offsets are predicted by an additional standard conv branch on the same feature map, learned jointly with the rest of the network
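
Below is a minimal PyTorch sketch of the squeeze-and-excitation operations described under SENet above. The class name and reduction ratio are illustrative assumptions, not the exact SENet configuration.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block.
    Squeeze: global average pooling reduces each channel's spatial map to one value.
    Excitation: a small FC bottleneck produces per-channel attention weights in (0, 1),
    which rescale the input channels (important channels amplified, others suppressed)."""
    def __init__(self, channels, reduction=16):  # reduction ratio is illustrative
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))           # squeeze: (N, C) channel-wise statistics
        w = self.fc(s).view(n, c, 1, 1)  # excitation: per-channel attention weights
        return x * w                     # recalibrate channel responses

x = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```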


3. Summary of image classification

  • AlexNet 
    • simple CNN architecture
    • Simple computation, but heavy memory size
    • Low accuracy
  • VGGNet
    • simple with 3x3 convolutions
    • Highest memory, the heaviest computation
  • GoogLeNet
    • inception module and auxiliary classifier
  • ResNet
    • deeper layers with residual blocks
    • Moderate efficiency (depending on the model)
  • Simple but powerful backbone models
    • GoogLeNet is the most efficient CNN model out of {AlexNet, VGG, ResNet}
    • But, it is complicated to use
    • Instead, VGGNet and ResNet are typically used as a backbone model for many tasks
    • Constructed with simple 3x3 conv layers