1. Problems with deeper layers
- AlexNet -> VGGNet
- Deeper networks learn more powerful features, because of
    - Larger receptive fields
        - cf. receptive field: the region of the input image that a single filter can cover at each stage of a CNN
    - More capacity and non-linearity
- But does getting deeper and deeper always work better?
    - Deeper networks are harder to optimize
        - Gradient vanishing / exploding
        - Computationally complex
        - Degradation problem (not overfitting)
2. CNN architectures for image classification 2
2.1 GoogLeNet (2015)
- Inception module
    - Apply multiple filter operations on the input activation from the previous layer:
        - 1x1, 3x3, 5x5 convolution filters
        - 3x3 pooling operation
    - Concatenate all filter outputs together along the channel axis
    - The increased network size increases the use of computational resources -> use 1x1 convolutions!
        - Use 1x1 convolutions as "bottleneck" layers that reduce the number of channels (see the sketch below)
    - 1x1 convolution
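A minimal PyTorch sketch of such a module, with 1x1 bottleneck convolutions placed before the expensive 3x3 and 5x5 branches; the channel counts are illustrative, not GoogLeNet's exact configuration.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 convolutions and 3x3 max pooling,
    with 1x1 "bottleneck" convs reducing channels before the larger filters.
    Channel counts are illustrative, not GoogLeNet's actual values."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)       # 1x1 conv
        self.branch3 = nn.Sequential(                             # 1x1 bottleneck -> 3x3 conv
            nn.Conv2d(in_ch, 96, kernel_size=1),
            nn.Conv2d(96, 128, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(                             # 1x1 bottleneck -> 5x5 conv
            nn.Conv2d(in_ch, 16, kernel_size=1),
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(                         # 3x3 pool -> 1x1 conv
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1),
        )

    def forward(self, x):
        # Concatenate all branch outputs along the channel axis (dim=1)
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionModule(192)(x).shape)  # torch.Size([1, 256, 28, 28])
```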
- Stem network: vanilla convolution networks
- Stacked inception modules
- Auxiliary classifiers
    - The vanishing gradient problem is dealt with by the auxiliary classifiers
    - Injecting additional gradients into lower layers
    - Used only during training, removed at testing time (see the loss sketch below)
- Classifier output (a single FC layer)
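A rough sketch of how the auxiliary losses could be combined during training; the 0.3 weight follows the GoogLeNet paper, and the logits tensors here are placeholders standing in for the outputs of the main and auxiliary heads.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
target = torch.randint(0, 1000, (8,))  # dummy labels for a batch of 8
# Placeholder logits standing in for the main head and the two auxiliary heads
main_logits = torch.randn(8, 1000, requires_grad=True)
aux1_logits = torch.randn(8, 1000, requires_grad=True)
aux2_logits = torch.randn(8, 1000, requires_grad=True)

# Auxiliary losses are added only during training (weighted 0.3 in the paper),
# injecting extra gradients into lower layers; the aux heads are dropped at test time.
loss = (criterion(main_logits, target)
        + 0.3 * criterion(aux1_logits, target)
        + 0.3 * criterion(aux2_logits, target))
loss.backward()
```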
2.2 ResNet (2016)
- The first architecture to stack more than 100 layers, showing that stacking layers deeper does improve performance
- The first to surpass human-level performance
- Strong performance across classification, localization, detection, and segmentation
- What makes it hard to build a very deep architecture?
    - Degradation problem
        - As the network depth increases, accuracy gets saturated, then degrades rapidly
        - This is not caused by overfitting. The problem is optimization!
            - Not overfitting -> gradient vanishing / exploding
- Hypothesis
    - Plain layer: as the layers get deeper, it is hard to learn a good mapping H(x) directly
    - Residual block: instead, we learn the residual F(x)
        - Target function: H(x) = F(x) + x
        - Residual function: F(x) = H(x) - x
- Solution
    - Shortcut connection: even if gradients vanish in some layers during backpropagation, the identity term keeps the information from being lost (see the sketch below)
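A minimal PyTorch sketch of a basic residual block, assuming the input and output shapes match (so a plain identity shortcut, no projection):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic ResNet-style block: the stacked convs learn the residual F(x),
    and the identity shortcut adds x, so the block outputs H(x) = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))  # F(x)
        return F.relu(residual + x)                                        # H(x) = F(x) + x

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```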
- Analysis of residual connection
    - Residual networks have O(2^n) implicit paths connecting input and output, and adding a block doubles the number of paths
- Overall architecture
2.3 Beyond ResNets
- DenseNet
    - In ResNet, we added the input and the output of the layer element-wise
    - In the dense blocks, the output of every layer is concatenated along the channel axis
        -> information from earlier layers is preserved through concatenation (see the sketch below)
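A minimal sketch of a dense block; the growth rate and number of layers are illustrative, not DenseNet's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseBlock(nn.Module):
    """Each layer's output is concatenated with all previous feature maps
    along the channel axis (vs. element-wise addition in ResNet)."""
    def __init__(self, in_ch, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(in_ch + i * growth_rate, growth_rate, 3, padding=1)
            for i in range(num_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = F.relu(layer(torch.cat(features, dim=1)))  # use all previous features
            features.append(out)                              # preserve them by concatenation
        return torch.cat(features, dim=1)

x = torch.randn(1, 64, 28, 28)
print(DenseBlock(64)(x).shape)  # torch.Size([1, 192, 28, 28]) = 64 + 4 * 32 channels
```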
- SENet
    - Attention across channels
    - Recalibrates channel-wise responses by modeling interdependencies between channels
    - Squeeze and excitation operations (see the sketch below)
        - Squeeze: capturing distributions of channel-wise responses by global average pooling
            - Global average pooling collapses the spatial dimensions to 1x1, so only the per-channel average (the distribution along the channel axis) remains
        - Excitation: gating channels by channel-wise attention weights obtained by an FC layer
            - The FC layer produces attention scores used as new channel weights; rescaling the activations with these weights weakens the less important channels and strengthens the more important ones
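A minimal sketch of a squeeze-and-excitation block; the reduction ratio of 16 is the common choice from the SENet paper, and the other sizes are illustrative.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze: global average pooling collapses HxW to per-channel statistics.
    Excitation: FC layers produce channel-wise attention weights in (0, 1)
    that rescale the input activations."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))            # squeeze: (N, C) channel-wise averages
        w = self.fc(s).view(n, c, 1, 1)   # excitation: channel attention weights
        return x * w                      # rescale: emphasize important channels

x = torch.randn(1, 256, 14, 14)
print(SEBlock(256)(x).shape)  # torch.Size([1, 256, 14, 14])
```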
- EfficientNet
    - Building deep, wide, and high-resolution networks in an efficient way
    - (b), (c), (d): scaling width, depth, or resolution alone eventually saturates, but each at a different point
    - (e): therefore, combining them appropriately (compound scaling) can improve performance further
    - Achieves better performance than even NAS
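A tiny sketch of the compound scaling rule: depth, width, and resolution are scaled together by a single coefficient phi. The base factors below are the ones reported for the B0 baseline in the EfficientNet paper, found by a small grid search subject to alpha * beta^2 * gamma^2 ≈ 2.

```python
# Base scaling factors from the EfficientNet paper (for the B0 baseline)
alpha, beta, gamma = 1.2, 1.1, 1.15

def compound_scaling(phi):
    depth = alpha ** phi        # multiplier for the number of layers
    width = beta ** phi         # multiplier for the number of channels
    resolution = gamma ** phi   # multiplier for the input image size
    return depth, width, resolution

print(compound_scaling(1))  # (1.2, 1.1, 1.15): scale depth, width, and resolution together
```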
- Deformable convolution
    - 2D spatial offset prediction for irregular convolution
    - Irregular grid sampling with 2D spatial offsets
    - Implemented by a standard CNN and grid sampling with 2D offsets
    - How is the offsets branch made? Randomly? (see the note and sketch below)
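For reference, in the Deformable ConvNets paper the offset branch is not random: it is an extra standard conv layer applied to the same input feature map, initialized to zero (so training starts from the regular sampling grid) and learned end-to-end. A rough sketch using torchvision's `deform_conv2d`; the layer sizes are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableConv2d(nn.Module):
    """The offset branch is a standard conv predicting 2 offsets (x, y) per
    kernel sampling location; zero-initialized so it starts as a regular grid."""
    def __init__(self, in_ch, out_ch, k=3, padding=1):
        super().__init__()
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, k, padding=padding)
        nn.init.zeros_(self.offset_conv.weight)
        nn.init.zeros_(self.offset_conv.bias)
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.padding = padding

    def forward(self, x):
        offsets = self.offset_conv(x)  # (N, 2*k*k, H, W): learned 2D sampling offsets
        return deform_conv2d(x, offsets, self.weight, padding=self.padding)

x = torch.randn(1, 64, 32, 32)
print(DeformableConv2d(64, 128)(x).shape)  # torch.Size([1, 128, 32, 32])
```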
3. Summary of image classification
- AlexNet
    - Simple CNN architecture
    - Simple computation, but heavy memory size
    - Low accuracy
- VGGNet
    - Simple architecture with 3x3 convolutions
    - Highest memory use, the heaviest computation
- GoogLeNet
    - Inception module and auxiliary classifier
- ResNet
    - Deeper layers with residual blocks
    - Moderate efficiency (depending on the model)
    - Simple but powerful backbone model
- GoogLeNet is the most efficient CNN model out of {AlexNet, VGG, ResNet}
- But, it is complicated to use
- Instead, VGGNet and ResNet are typically used as a backbone model for many tasks
- Constructed with simple 3x3 conv layers