
[Week6] Semantic segmentation [Day4]

*What is semantic segmentation?

  • Classify each pixel of an image into a category
  • Don't care about instances. Only care about semantic category
  • Applications
    • Medical images
    • Autonomous driving
    • Computational photography

 

 

*Semantic segmentation architectures

  • Fully convolutional networks
    • The first end-to-end architecture for semantic segmentation
    • Take an image of arbitrary size as input, and output a segmentation map whose size corresponds to the input
  • Fully connected vs Fully convolutional
    • Fully connected layer: Output a fixed dimensional vector and discard spatial coordinates
    • Fully convolutional layer: Output a classification map which has spatial coordinates
  • Interpreting fully connected layers as 1x1 convolutions
    • A fully connected layer classifies a single feature vector
    • A 1x1 convolution layer classifies every feature vector of the convolutional feature map (see the sketch after this list)
    • Limitation: the predicted score map has very low resolution
    • Why?
      • To obtain a large receptive field, several spatial pooling layers are deployed
    • Solution: enlarge the score map by upsampling!
  • Upsampling
    • The network reduces the input image to a smaller feature map
    • Upsample it back to the size of the input image
      • Unpooling
      • Transposed convolution
      • Upsample and convolution
    • Removing the pooling layers or using a larger stride increases the output resolution, but the receptive field shrinks and the overall context of the image can no longer be captured, so performance drops
    • Therefore, the receptive field is kept large and upsampling layers are added to bring the resolution back up
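
The sketch referenced above is a minimal PyTorch example (the class count and channel width are assumptions, not from the lecture) of reinterpreting a fully connected classifier as a 1x1 convolution: the same weights then classify every spatial location of the feature map, yielding a coarse score map instead of a single vector.

```python
import torch
import torch.nn as nn

num_classes = 21       # assumption: PASCAL VOC-style class count
feat_channels = 512    # assumption: channels of the last conv feature map

# Fully connected head: needs a fixed-size vector, discards spatial coordinates.
fc_head = nn.Linear(feat_channels, num_classes)

# Equivalent 1x1 convolution head: classifies every feature vector of the map.
conv_head = nn.Conv2d(feat_channels, num_classes, kernel_size=1)

# Copy the FC weights into the 1x1 conv so both heads compute identical scores.
with torch.no_grad():
    conv_head.weight.copy_(fc_head.weight.view(num_classes, feat_channels, 1, 1))
    conv_head.bias.copy_(fc_head.bias)

feat_map = torch.randn(1, feat_channels, 7, 7)   # low-resolution conv features
score_map = conv_head(feat_map)                  # (1, 21, 7, 7): coarse score map
vector_score = fc_head(feat_map[0, :, 0, 0])     # FC applied at one location
print(torch.allclose(score_map[0, :, 0, 0], vector_score, atol=1e-6))  # True
```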

 

*Transposed convolution

  • Transposed convolutions work by swapping the forward and backward passes of convolution
  • Checkerboard artifacts due to uneven overlaps
    • Neighboring kernel footprints overlap unevenly, as the sketch below shows
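
A minimal sketch, assuming PyTorch, of the uneven-overlap effect: with kernel_size=3 and stride=2 the transposed kernel footprints overlap at every other output position, so even a constant input produces an alternating (checkerboard-like) output.

```python
import torch
import torch.nn as nn

# kernel size not divisible by the stride -> uneven overlap of kernel footprints
up = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, bias=False)
nn.init.constant_(up.weight, 1.0)   # all-ones kernel makes the overlap count visible

x = torch.ones(1, 1, 4, 4)          # constant input
y = up(x)                           # output values count how many footprints overlap
print(y[0, 0])                      # alternating 1/2/4 pattern -> checkerboard artifact
```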

*Upsample convolution

  • Better approaches for upsampling
  • Avoid overlap issues in transposed convolution
  • Decompose into spatial upsampling and feature convolution
    • {Nearest-neighbor (NN), Bilinear} interpolation followed by convolution
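
A minimal sketch of this decomposition: fixed bilinear (or nearest-neighbor) interpolation handles the spatial upsampling and a regular convolution handles the feature transform, so no learned kernel overlaps unevenly. The module name and channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class UpsampleConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        # interpolation first (no overlap issue), then feature convolution
        return self.conv(self.up(x))

x = torch.randn(1, 64, 16, 16)
print(UpsampleConv(64, 32)(x).shape)   # torch.Size([1, 32, 32, 32])
```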

 

 

*Back to FCN

  • Low layers have a small receptive field, so they are sensitive to fine details and small local changes
  • Conversely, high layers have a lower resolution but a large receptive field, so they capture the overall context and global tendencies
  • Semantic segmentation needs both of these
  • Therefore, a fusion of the two is required

 

  • Feature maps from intermediate layers are upsampled and used (see the sketch after this list)
  • Integrates activations from lower layers into prediction
  • Preserves higher spatial resolution
  • Captures lower-level semantics at the same time
  • Scores are computed from each of the upsampled predictions
  • Features of FCN
    • Faster
      • An end-to-end architecture that does not depend on other hand-crafted components
    • Accurate
      • Feature representation and classifiers are jointly optimized
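
The sketch referenced above illustrates FCN-style skip fusion with assumed VGG-like channel sizes (512-channel pool4 features, 4096-channel deepest features) and 21 classes; bilinear interpolation stands in for the learned upsampling of the actual FCN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 21                                 # assumption
score_pool4 = nn.Conv2d(512, num_classes, 1)     # 1x1 score head on intermediate features
score_final = nn.Conv2d(4096, num_classes, 1)    # 1x1 score head on the deepest features

pool4_feat = torch.randn(1, 512, 28, 28)         # higher resolution, lower-level semantics
deep_feat = torch.randn(1, 4096, 14, 14)         # lower resolution, large receptive field

coarse = score_final(deep_feat)                  # (1, 21, 14, 14) coarse prediction
coarse_up = F.interpolate(coarse, scale_factor=2,
                          mode="bilinear", align_corners=False)   # (1, 21, 28, 28)
fused = coarse_up + score_pool4(pool4_feat)      # fuse with the intermediate-layer scores
out = F.interpolate(fused, scale_factor=16,
                    mode="bilinear", align_corners=False)         # back to input resolution
print(out.shape)                                 # torch.Size([1, 21, 448, 448])
```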

 

 

2.2 Hypercolumns for object segmentation (a similar but different approach)

  • Fully convolutional networks
    • CNN layers typically use the output of the last layer as feature representation
      • Too coarse spatially
  • Overall architecture
    • Very similar to FCN
    • Difference : Apply to each bounding box
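
A minimal sketch of the hypercolumn idea with assumed channel sizes: feature maps from several depths are upsampled to a common resolution and stacked along the channel dimension, so each pixel carries both fine low-level and coarse high-level features.

```python
import torch
import torch.nn.functional as F

# Assumed feature maps from three depths of a CNN for the same input region.
f_low = torch.randn(1, 64, 64, 64)     # fine, low-level features
f_mid = torch.randn(1, 256, 32, 32)    # intermediate features
f_high = torch.randn(1, 512, 16, 16)   # coarse, high-level features

# Upsample everything to the finest resolution and concatenate channel-wise:
# each pixel now holds a "hypercolumn" of features from all selected layers.
target = f_low.shape[-2:]
hypercolumn = torch.cat([
    f_low,
    F.interpolate(f_mid, size=target, mode="bilinear", align_corners=False),
    F.interpolate(f_high, size=target, mode="bilinear", align_corners=False),
], dim=1)
print(hypercolumn.shape)               # torch.Size([1, 832, 64, 64])
```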

 

2.3 U-Net

  • Built upon “fully convolutional networks”
    • Share the same FCN property
  • Predict a dense map by concatenating feature maps from contracting path
    • Similar to skip connections in FCN
  • Yield more precise segmentations
  • Overall architecture
    • Contracting path
      • Repeatedly applying 3x3 convolutions
      • Doubling the number of feature channels
      • Being used to capture holistic context
    • Expanding path

      • Repeatedly applying 2x2 up-convolutions (transposed convolutions)
      • Halving the number of feature channels
      • Concatenating the corresponding feature maps from the contracting path
    • Overall
      • Concatenation of feature maps provides localized information
        • Important localized information is passed directly to the later layers through these skip connections, so fine boundaries are segmented well (see the sketch after this list)
    • What if the spatial size of the feature map is an odd number?
      • An even number is required for input and feature sizes
      • Downsampling typically floors the size, e.g. 7x7 -> 3x3
      • Upsampling 3x3 then yields 6x6, so information is lost
      • Therefore, care must be taken so that odd-sized feature maps do not occur
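
The sketch referenced above shows one U-Net level under simplifying assumptions: padded 3x3 convolutions keep the spatial sizes even and aligned (the original paper uses unpadded convolutions with cropping), channels double on the way down, the 2x2 up-convolution halves them on the way up, and the skip feature map is concatenated before the decoder convolutions.

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # two 3x3 convolutions; padding=1 keeps the spatial size unchanged
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

enc1 = double_conv(3, 64)                                   # contracting path: 3 -> 64
pool = nn.MaxPool2d(2)                                      # halves the spatial resolution
enc2 = double_conv(64, 128)                                 # doubling the channels: 64 -> 128
up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)   # 2x2 up-conv, halves the channels
dec1 = double_conv(128, 64)                                 # 128 = 64 (skip) + 64 (upsampled)

x = torch.randn(1, 3, 256, 256)          # even size: 256 -> 128 -> 256 without rounding loss
f1 = enc1(x)                             # (1, 64, 256, 256), kept for the skip connection
f2 = enc2(pool(f1))                      # (1, 128, 128, 128)
u1 = up(f2)                              # (1, 64, 256, 256)
out = dec1(torch.cat([f1, u1], dim=1))   # concatenate the skip features, then convolve
print(out.shape)                         # torch.Size([1, 64, 256, 256])
```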

 

2.4 DeepLab

 

• DeepLab v1 (2015): Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. ICLR 2015.

• DeepLab v2 (2017): DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. TPAMI 2017.

• DeepLab v3 (2017): Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017.

• DeepLab v3+ (2018): Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. ECCV 2018.

 

  • Conditional Random Fields (CRFs)
    • CRF post-processes a segmentation map to be refined to follow image boundaries
    • In the accompanying figure, the 1st row shows the score map (before softmax) and the 2nd row the belief map (after softmax)
  • Dilated convolution

    • Atrous convolution
    • Inflate the kernel by inserting spaces between the kernel elements (dilation factor)
    • Enable exponential expansion of the receptive field
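
A minimal PyTorch sketch of dilated (atrous) convolution: spacing out the kernel elements enlarges the receptive field of a 3x3 kernel to 5x5 without adding parameters or reducing the output resolution.

```python
import torch
import torch.nn as nn

conv_std = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)     # 3x3 receptive field
conv_atrous = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)  # 5x5 receptive field

x = torch.randn(1, 64, 32, 32)
print(conv_std(x).shape, conv_atrous(x).shape)             # same spatial resolution
print(conv_std.weight.shape == conv_atrous.weight.shape)   # True: same parameter count
```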

 

  • Depthwise separable convolution (proposed by Howard et al.)
    • To reduce computation, the standard convolution is split into two steps: a depthwise convolution followed by a pointwise convolution
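
A minimal sketch comparing a standard convolution with its depthwise separable counterpart (a depthwise 3x3 convolution via groups=in_ch, followed by a pointwise 1x1 convolution); the channel sizes are illustrative.

```python
import torch
import torch.nn as nn

in_ch, out_ch = 64, 128

standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),  # depthwise: per-channel 3x3
    nn.Conv2d(in_ch, out_ch, kernel_size=1),                          # pointwise: 1x1 channel mixing
)

x = torch.randn(1, in_ch, 32, 32)
print(standard(x).shape, separable(x).shape)    # same output shape: (1, 128, 32, 32)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))        # 73856 vs 8960 parameters
```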

 

  • DeepLab v3+