1. Problems with deeper layers
- AlexNet -> VGGNet
- Deeper networks learn more powerful features, because of
    - Larger receptive fields
        - cf. receptive field: the region of the input image that a single filter can cover at each stage of a CNN
    - More capacity and non-linearity
- But does getting deeper and deeper always work better?
    - Deeper networks are harder to optimize
        - Gradient vanishing / exploding
        - Computationally complex
        - Degradation problem (not overfitting)
2. CNN architectures for image classification 2
2.1 GoogLeNet (2015)
- Inception module
    - Apply multiple filter operations on the input activation from the previous layer:
        - 1x1, 3x3, 5x5 convolution filters
        - 3x3 pooling operation
    - Concatenate all filter outputs together along the channel axis
    - The increased network size increases the use of computational resources -> use 1x1 convolutions!
        - Use 1x1 convolutions as "bottleneck" layers that reduce the number of channels (see the sketch below)
    - 1x1 convolution
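A minimal PyTorch sketch of such a module, with 1x1 bottleneck convolutions placed before the expensive 3x3 and 5x5 branches; the channel counts are illustrative, not GoogLeNet's exact configuration.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 convolutions and 3x3 max pooling,
    with 1x1 "bottleneck" convs reducing channels before the larger filters.
    Channel counts are illustrative, not GoogLeNet's actual values."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)       # 1x1 conv
        self.branch3 = nn.Sequential(                             # 1x1 bottleneck -> 3x3 conv
            nn.Conv2d(in_ch, 96, kernel_size=1),
            nn.Conv2d(96, 128, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(                             # 1x1 bottleneck -> 5x5 conv
            nn.Conv2d(in_ch, 16, kernel_size=1),
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(                         # 3x3 pool -> 1x1 conv
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1),
        )

    def forward(self, x):
        # Concatenate all branch outputs along the channel axis (dim=1)
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionModule(192)(x).shape)  # torch.Size([1, 256, 28, 28])
```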
- Stem network: vanilla convolution networks
- Stacked inception modules
- Auxiliary classifiers
    - The vanishing gradient problem is dealt with by the auxiliary classifiers
    - Injecting additional gradients into lower layers
    - Used only during training, removed at testing time (see the loss sketch below)
- Classifier output (a single FC layer)
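A rough sketch of how the auxiliary losses could be combined during training; the 0.3 weight follows the GoogLeNet paper, and the logits tensors here are placeholders standing in for the outputs of the main and auxiliary heads.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
target = torch.randint(0, 1000, (8,))  # dummy labels for a batch of 8
# Placeholder logits standing in for the main head and the two auxiliary heads
main_logits = torch.randn(8, 1000, requires_grad=True)
aux1_logits = torch.randn(8, 1000, requires_grad=True)
aux2_logits = torch.randn(8, 1000, requires_grad=True)

# Auxiliary losses are added only during training (weighted 0.3 in the paper),
# injecting extra gradients into lower layers; the aux heads are dropped at test time.
loss = (criterion(main_logits, target)
        + 0.3 * criterion(aux1_logits, target)
        + 0.3 * criterion(aux2_logits, target))
loss.backward()
```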
2.2 ResNet (2016)
- The first architecture to stack more than 100 layers, showing that stacking layers deeper does improve performance
- The first to surpass human-level performance
- Strong performance across classification, localization, detection, and segmentation
- What makes it hard to build a very deep architecture?
    - Degradation problem
        - As the network depth increases, accuracy gets saturated, then degrades rapidly
        - This is not caused by overfitting. The problem is optimization!
            - Not overfitting -> gradient vanishing / exploding
- Hypothesis
    - Plain layer: as the layers get deeper, it is hard to learn a good mapping H(x) directly
    - Residual block: instead, we learn the residual F(x)
        - Target function: H(x) = F(x) + x
        - Residual function: F(x) = H(x) - x
- Solution
    - Shortcut connection: even if gradients vanish in some layers during backpropagation, the identity term keeps the information from being lost (see the sketch below)
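A minimal PyTorch sketch of a basic residual block, assuming the input and output shapes match (so a plain identity shortcut, no projection):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic ResNet-style block: the stacked convs learn the residual F(x),
    and the identity shortcut adds x, so the block outputs H(x) = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))  # F(x)
        return F.relu(residual + x)                                        # H(x) = F(x) + x

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```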
- Analysis of residual connection
    - Residual networks have O(2^n) implicit paths connecting input and output, and adding a block doubles the number of paths
- Overall architecture
2.3 Beyond ResNets
- DenseNet
    - In ResNet, we added the input and the output of the layer element-wise
    - In the dense blocks, the output of every layer is concatenated along the channel axis
        -> information from earlier layers is preserved through concatenation (see the sketch below)
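A minimal sketch of a dense block; the growth rate and number of layers are illustrative, not DenseNet's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseBlock(nn.Module):
    """Each layer's output is concatenated with all previous feature maps
    along the channel axis (vs. element-wise addition in ResNet)."""
    def __init__(self, in_ch, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(in_ch + i * growth_rate, growth_rate, 3, padding=1)
            for i in range(num_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = F.relu(layer(torch.cat(features, dim=1)))  # use all previous features
            features.append(out)                              # preserve them by concatenation
        return torch.cat(features, dim=1)

x = torch.randn(1, 64, 28, 28)
print(DenseBlock(64)(x).shape)  # torch.Size([1, 192, 28, 28]) = 64 + 4 * 32 channels
```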
- SENet
    - Attention across channels
    - Recalibrates channel-wise responses by modeling interdependencies between channels
    - Squeeze and excitation operations (see the sketch below)
        - Squeeze: capturing distributions of channel-wise responses by global average pooling
            - Global average pooling collapses the spatial dimensions to 1x1, so only the per-channel average (the distribution along the channel axis) remains
        - Excitation: gating channels by channel-wise attention weights obtained by an FC layer
            - The FC layer produces attention scores used as new channel weights; rescaling the activations with these weights weakens the less important channels and strengthens the more important ones
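A minimal sketch of a squeeze-and-excitation block; the reduction ratio of 16 is the common choice from the SENet paper, and the other sizes are illustrative.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze: global average pooling collapses HxW to per-channel statistics.
    Excitation: FC layers produce channel-wise attention weights in (0, 1)
    that rescale the input activations."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))            # squeeze: (N, C) channel-wise averages
        w = self.fc(s).view(n, c, 1, 1)   # excitation: channel attention weights
        return x * w                      # rescale: emphasize important channels

x = torch.randn(1, 256, 14, 14)
print(SEBlock(256)(x).shape)  # torch.Size([1, 256, 14, 14])
```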
- EfficientNet
    - Building deep, wide, and high-resolution networks in an efficient way
    - (b), (c), (d): scaling width, depth, or resolution alone eventually saturates, but each at a different point
    - (e): therefore, combining them appropriately (compound scaling) can improve performance further
    - Achieves better performance than even NAS
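A tiny sketch of the compound scaling rule: depth, width, and resolution are scaled together by a single coefficient phi. The base factors below are the ones reported for the B0 baseline in the EfficientNet paper, found by a small grid search subject to alpha * beta^2 * gamma^2 ≈ 2.

```python
# Base scaling factors from the EfficientNet paper (for the B0 baseline)
alpha, beta, gamma = 1.2, 1.1, 1.15

def compound_scaling(phi):
    depth = alpha ** phi        # multiplier for the number of layers
    width = beta ** phi         # multiplier for the number of channels
    resolution = gamma ** phi   # multiplier for the input image size
    return depth, width, resolution

print(compound_scaling(1))  # (1.2, 1.1, 1.15): scale depth, width, and resolution together
```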
- Deformable convolution
    - 2D spatial offset prediction for irregular convolution
    - Irregular grid sampling with 2D spatial offsets
    - Implemented by a standard CNN and grid sampling with 2D offsets
    - How is the offsets branch made? Randomly? (see the note and sketch below)
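For reference, in the Deformable ConvNets paper the offset branch is not random: it is an extra standard conv layer applied to the same input feature map, initialized to zero (so training starts from the regular sampling grid) and learned end-to-end. A rough sketch using torchvision's `deform_conv2d`; the layer sizes are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableConv2d(nn.Module):
    """The offset branch is a standard conv predicting 2 offsets (x, y) per
    kernel sampling location; zero-initialized so it starts as a regular grid."""
    def __init__(self, in_ch, out_ch, k=3, padding=1):
        super().__init__()
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, k, padding=padding)
        nn.init.zeros_(self.offset_conv.weight)
        nn.init.zeros_(self.offset_conv.bias)
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.padding = padding

    def forward(self, x):
        offsets = self.offset_conv(x)  # (N, 2*k*k, H, W): learned 2D sampling offsets
        return deform_conv2d(x, offsets, self.weight, padding=self.padding)

x = torch.randn(1, 64, 32, 32)
print(DeformableConv2d(64, 128)(x).shape)  # torch.Size([1, 128, 32, 32])
```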
3. Summary of image classification
- AlexNet
    - Simple CNN architecture
    - Simple computation, but heavy memory size
    - Low accuracy
- VGGNet
    - Simple architecture with 3x3 convolutions
    - Highest memory use, the heaviest computation
- GoogLeNet
    - Inception module and auxiliary classifier
- ResNet
    - Deeper layers with residual blocks
    - Moderate efficiency (depending on the model)
    - Simple but powerful backbone model
- GoogLeNet is the most efficient CNN model out of {AlexNet, VGG, ResNet}
- But, it is complicated to use
- Instead, VGGNet and ResNet are typically used as a backbone model for many tasks
- Constructed with simple 3x3 conv layers