*What is semantic segmentation?
- Classify each pixel of an image into a category
- Don't care about instances. Only care about semantic category
- Applications
- Medical images
- Autonomous driving
- Computational photography
*Semantic segmentation architectures
- Fully convolutional networks
- The first end-to-end architecture for semantic segmentation
- Take an image of arbitrary size as input and output a segmentation map of the same size as the input
- Fully connected vs Fully convolutional
- Fully connected layer: Output a fixed dimensional vector and discard spatial coordinates
- Fully convolutional layer: Output a classification map which has spatial coordinates
- Interpreting fully connected layers as 1x1 convolutions
- A fully connected layer classifies a single feature vector
- A 1x1 convolution layer classifies every feature vector of the convolutional feature map
- Limitation: the predicted score map has a very low resolution
- Why?
- To obtain a large receptive field, several spatial pooling layers are used, which shrinks the feature map
- Solution: enlarge the score map by upsampling! (a minimal sketch follows at the end of this section)
- Upsampling
- The input image is reduced to a smaller feature map
- Upsample it back to the size of the input image
- Unpooling
- Transposed convolution
- Upsample and convolution
- Removing the pooling layers or using a large stride increases the output resolution, but the receptive field shrinks, so the overall context of the image cannot be captured and performance drops
- Therefore, keep the receptive field large and add upsampling layers to recover a high output resolution
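A minimal PyTorch sketch of the two ideas above, not the actual FCN implementation: a 1x1 convolution acts as a per-location fully connected classifier on top of a toy backbone, and the coarse score map is bilinearly upsampled back to the input size. The class name `TinyFCN`, the layer widths, and `num_classes=21` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    """Toy fully convolutional classifier: backbone -> 1x1 conv -> upsample."""
    def __init__(self, num_classes=21):
        super().__init__()
        # Backbone with pooling: large receptive field, but 1/8 resolution.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # A 1x1 conv = a fully connected classifier applied at every location.
        self.classifier = nn.Conv2d(256, num_classes, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        scores = self.classifier(self.backbone(x))   # coarse score map (1/8)
        # Enlarge the low-resolution score map back to the input size.
        return F.interpolate(scores, size=(h, w), mode="bilinear",
                             align_corners=False)

x = torch.randn(1, 3, 224, 224)      # arbitrary input size
print(TinyFCN()(x).shape)            # torch.Size([1, 21, 224, 224])
```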
*Transposed convolution
- Transposed convolutions work by swapping the forward and backward passes of convolution
- Checkerboard artifacts due to uneven overlaps
- The problem is that overlapping contributions occur unevenly (a small sketch follows below)
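A tiny numeric sketch of the uneven-overlap issue: a stride-2 transposed convolution with an all-ones 3x3 kernel applied to an all-ones input. Output positions are covered by different numbers of kernel elements, so the result is an uneven, checkerboard-like pattern. The setup is an illustrative assumption, not code from the lecture.

```python
import torch
import torch.nn as nn

# 3x3 kernel, stride 2: kernel footprints overlap unevenly in the output.
tconv = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, bias=False)
nn.init.ones_(tconv.weight)

with torch.no_grad():
    out = tconv(torch.ones(1, 1, 4, 4))
# Uniform input, non-uniform output: the varying values show the overlap pattern.
print(out[0, 0])
```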
*Upsample convolution
- Better approaches for upsampling
- Avoid overlap issues in transposed convolution
- Decompose into spatial upsampling and feature convolution
- {Nearest-neighbor (NN), Bilinear} interpolation followed by convolution (see the sketch after this section)
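A sketch of the "upsample, then convolve" decomposition; the module name `UpsampleConv` and the channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleConv(nn.Module):
    """NN or bilinear interpolation followed by a learnable convolution."""
    def __init__(self, in_ch, out_ch, mode="nearest"):
        super().__init__()
        self.mode = mode
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        # 1) spatial upsampling (no learned weights, no overlap issue)
        x = F.interpolate(x, scale_factor=2, mode=self.mode)
        # 2) feature convolution
        return self.conv(x)

feat = torch.randn(1, 256, 16, 16)
print(UpsampleConv(256, 128)(feat).shape)   # torch.Size([1, 128, 32, 32])
```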
*Back to FCN
- Low layers have a small receptive field, so they capture fine, local details and are sensitive to small changes
- Conversely, high layers have a low resolution but a large receptive field, so they capture the overall context and global tendencies
- Semantic segmentation needs both
- Therefore, fusion is needed (a minimal sketch follows at the end of this FCN part)
- Feature maps from intermediate layers are upsampled and used
- Integrates activations from lower layers into prediction
- Preserves higher spatial resolution
- Captures lower-level semantics at the same time
- A score map is computed from each upsampled prediction
- Features of FCN
- Faster
- An end-to-end architecture that does not depend on hand-crafted components
- Accurate
- Feature representation and classifiers are jointly optimized
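A sketch of FCN-style skip fusion in the spirit of FCN-16s/8s: 1x1-conv scores from an intermediate (higher-resolution) feature map are added to the upsampled scores from a deeper (coarser) layer, and the fused map is upsampled to the input size. The name `SkipFusionHead`, the channel widths, and the feature-map shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipFusionHead(nn.Module):
    """Fuse coarse deep-layer scores with finer intermediate-layer scores."""
    def __init__(self, deep_ch=512, mid_ch=256, num_classes=21):
        super().__init__()
        self.score_deep = nn.Conv2d(deep_ch, num_classes, 1)  # coarse, global
        self.score_mid = nn.Conv2d(mid_ch, num_classes, 1)    # finer, local

    def forward(self, mid_feat, deep_feat, out_size):
        deep = self.score_deep(deep_feat)
        # Upsample the coarse prediction to the intermediate resolution and fuse.
        deep = F.interpolate(deep, size=mid_feat.shape[-2:], mode="bilinear",
                             align_corners=False)
        fused = deep + self.score_mid(mid_feat)
        # Final upsampling to the input resolution.
        return F.interpolate(fused, size=out_size, mode="bilinear",
                             align_corners=False)

mid = torch.randn(1, 256, 28, 28)    # e.g. a pool4-like feature map
deep = torch.randn(1, 512, 14, 14)   # e.g. a pool5-like feature map
print(SkipFusionHead()(mid, deep, (224, 224)).shape)   # [1, 21, 224, 224]
```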
2.2 Hypercolumns for object segmentation (a similar but different approach)
- Fully convolutional networks
- CNNs typically use only the output of the last layer as the feature representation
- This is too coarse spatially
- A hypercolumn instead stacks the activations from all layers above a pixel into one feature vector (see the sketch at the end of this part)
- Overall architecture
- Very similar to FCN
- Difference: applied to each bounding box
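A sketch of the hypercolumn idea: activations from several layers of a toy backbone are upsampled to the input resolution and concatenated channel-wise, so every pixel gets a stacked multi-layer feature vector. The backbone and sizes are illustrative, and this omits the paper's per-bounding-box application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy three-stage backbone; each stage halves the spatial resolution.
stages = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
    nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
    nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
])

x = torch.randn(1, 3, 64, 64)
feats, h = [], x
for stage in stages:
    h = stage(h)
    # Bring every intermediate map back to the input resolution.
    feats.append(F.interpolate(h, size=x.shape[-2:], mode="bilinear",
                               align_corners=False))
hypercolumn = torch.cat(feats, dim=1)   # stacked features for every pixel
print(hypercolumn.shape)                # torch.Size([1, 224, 64, 64])
```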
2.3 U-Net
- Built upon “fully convolutional networks”
- Share the same FCN property
- Predict a dense map by concatenating feature maps from the contracting path
- Similar to skip connections in FCN
- Yield more precise segmentations
- Overall architecture
- Contracting path
- Repeatedly applying 3x3 convolutions, with 2x2 max pooling for downsampling
- Doubling the number of feature channels at each level
- Used to capture the holistic context
- Expanding path
- Repeatedly applying 2x2 up-convolutions (transposed convolutions)
- Halving the number of feature channels
- Concatenating the corresponding feature maps from the contracting path
- Overall
- Concatenation of feature maps provides localized information
- Important localized information is passed directly to the later layers through the skip connections, so fine object boundaries are captured well
- What if the spatial size of the feature map is an odd number?
- An even number is required for input and feature sizes
- Downsampling typically floors the size, e.g. 7x7 -> 3x3
- Upsampling 3x3 then gives 6x6, so information is lost
- Therefore, care is needed so that odd-sized feature maps do not appear (a simplified sketch follows at the end of this U-Net part)
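A simplified two-level U-Net-style sketch (the original U-Net has four levels and uses unpadded convolutions): the contracting path doubles the channels while pooling, and the expanding path applies 2x2 up-convolutions, halves the channels, and concatenates the matching contracting-path feature map. All names and sizes are illustrative; the input size is chosen divisible by 4 so that no odd-sized feature maps occur.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        def block(cin, cout):   # two padded 3x3 convolutions
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU())
        self.enc1, self.enc2 = block(3, 64), block(64, 128)
        self.bottleneck = block(128, 256)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)
        self.dec2 = block(256, 128)     # 128 (skip) + 128 (upsampled)
        self.up1 = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec1 = block(128, 64)      # 64 (skip) + 64 (upsampled)
        self.head = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)                   # full resolution, 64 channels
        e2 = self.enc2(self.pool(e1))       # 1/2 resolution, 128 channels
        b = self.bottleneck(self.pool(e2))  # 1/4 resolution, 256 channels
        # Expanding path: upsample, concatenate the skip feature map, convolve.
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)

x = torch.randn(1, 3, 128, 128)     # 128 is divisible by 4 (no odd sizes)
print(TinyUNet()(x).shape)          # torch.Size([1, 2, 128, 128])
```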
2.4 DeepLab
- DeepLab v1 (2015): Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. ICLR 2015.
- DeepLab v2 (2017): DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. TPAMI 2017.
- DeepLab v3 (2017): Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017.
- DeepLab v3+ (2018): Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. ECCV 2018.
- Conditional Random Fields (CRFs)
- CRF post-processing refines the segmentation map so that it follows image boundaries
- (Figure: 1st row shows the score map before softmax, 2nd row the belief map after softmax)
- Dilated convolution
- Atrous convolution
- Inflate the kernel by inserting spaces between the kernel elements (controlled by the dilation factor)
- Enables exponential expansion of the receptive field without adding parameters (see the short sketch below)
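A short sketch comparing a 3x3 convolution with dilation 1 and dilation 2: the dilated version covers a 5x5 area (effective kernel size k + (k-1)(d-1)) with exactly the same number of parameters, which is why stacking dilated convolutions grows the receptive field quickly. Layer widths are illustrative.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)
conv_d1 = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)
conv_d2 = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)  # covers 5x5

# Same spatial output size and the same parameter count, larger receptive field.
print(conv_d1(x).shape, conv_d2(x).shape)
print(sum(p.numel() for p in conv_d1.parameters()),
      sum(p.numel() for p in conv_d2.parameters()))
```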
- Depthwise separable convolution (proposed by Howard et al.)
- To reduce computation, a standard convolution is factored into two steps: a depthwise convolution (one filter per channel) followed by a pointwise 1x1 convolution (a sketch follows at the end of this section)
- DeepLab v3+: an encoder-decoder architecture with atrous separable convolution
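A sketch of a depthwise separable convolution: a depthwise 3x3 convolution (groups equal to the number of channels) followed by a pointwise 1x1 convolution; giving the depthwise step a dilation corresponds to the atrous separable convolution used in DeepLab v3+. The class name and channel sizes are assumptions, and the parameter count is compared against a standard 3x3 convolution.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, dilation=1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (spatial filtering only).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                   padding=dilation, dilation=dilation,
                                   groups=in_ch)
        # Pointwise: 1x1 convolution that mixes channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def count(m):
    return sum(p.numel() for p in m.parameters())

x = torch.randn(1, 128, 32, 32)
sep = DepthwiseSeparableConv(128, 256, dilation=2)   # "atrous separable"
std = nn.Conv2d(128, 256, kernel_size=3, padding=1)
print(sep(x).shape)                   # torch.Size([1, 256, 32, 32])
print(count(sep), "vs", count(std))   # far fewer parameters than standard conv
```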