Boostcamp AI Tech / [Week6] Computer Vision

[Week6] Object Detection [Day5]

*Object Detection

 

1.1 Fundamental image recognition tasks

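  • Semantic segmentation < Instance segmentation < Panoptic segmentation
  • Beyond semantic segmentation, individual objects of the same class are also separated from each other
  • Object detection is needed to perform these tasks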

 

 

1.2 What is object detection?

  • classification
  • bounding box

 

1.3 What are the applications of object detection?

  • Autonomous driving
  • Optical Character Recognition(OCR)

 

2. Two-stage detector

 

2.0 Traditional methods: hand-crafted techniques

  • Gradient-based detector (e.g., HOG)
  • Selective search
    • Over-segmentation
    • Iteratively merging similar regions
    • Extracting candidate boxes from all remaining segmentations

 

2.1 R-CNN

R-CNN Architecture

  • region proposal (~2k)
  • warped region (reshape)
  • classifier : SVM
  • Very slow, because every region produced by the region proposal step is passed through the CNN separately

 

2.2 Fast R-CNN

Fast R-CNN Architecture

  • Recycle a pre-computed feature for multiple object detection
  • Conv. feature map from the original image
  • ROI feature extraction from the feature map through ROI pooling
  • Class and box prediction for each ROI
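The ROI pooling step above can be illustrated with a minimal pure-Python sketch. It assumes a single-channel feature map stored as a nested list and an ROI already expressed in integer feature-map coordinates (`x2`, `y2` exclusive, ROI at least 1x1). The name `roi_max_pool` and the 2x2 output size are illustrative choices, not taken from the lecture.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one ROI of a 2D feature map into a fixed out_h x out_w grid."""
    x1, y1, x2, y2 = roi          # feature-map coordinates, x2/y2 exclusive
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Row range of bin i; every bin covers at least one row.
        r0 = y1 + (i * h) // out_h
        r1 = max(r0 + 1, y1 + ((i + 1) * h) // out_h)
        row = []
        for j in range(out_w):
            # Column range of bin j; every bin covers at least one column.
            c0 = x1 + (j * w) // out_w
            c1 = max(c0 + 1, x1 + ((j + 1) * w) // out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# One shared feature map, computed once for the whole image.
fmap = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15]]

# ROIs of different sizes are pooled to the same fixed size.
pooled_full = roi_max_pool(fmap, (0, 0, 4, 4))
pooled_crop = roi_max_pool(fmap, (1, 1, 4, 4))
```

Both ROIs reuse the same feature map and come out with the same fixed shape, which is what allows a single head to predict the class and box for each ROI.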

 

2.3 Faster R-CNN

  • End-to-End object detection by neural region proposal
  • IoU
  • Anchor boxes
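IoU (intersection over union) is the overlap area of two boxes divided by the area of their union. A minimal sketch, assuming boxes are `(x1, y1, x2, y2)` corner tuples; the function name `iou` is illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes sharing a 1x1 corner: 1 / (4 + 4 - 1).
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

During training, an anchor box is typically labeled positive or negative according to its IoU with the ground-truth boxes.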


  • Region Proposal Network (RPN)
    • A single feature map is extracted from the image and fed to the RPN, which produces the region proposals
  • Non-Maximum Suppression (NMS)
    • Step 1: Select the box with the highest objectness score
    • Step 2: Compare the IoU of this box with the other boxes
    • Step 3: Remove the bounding boxes with IoU ≥ 50%
    • Step 4: Move to the box with the next highest objectness score
    • Step 5: Repeat steps 2-4
  • Summary of the R-CNN family
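The NMS steps listed above can be sketched in a few lines of pure Python. This is a minimal sketch, assuming boxes are `(x1, y1, x2, y2)` tuples with one score per box; the names `iou` and `nms` are illustrative, and the 0.5 threshold follows the 50% figure in the steps.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept by non-maximum suppression."""
    # Candidate indices sorted by score, highest first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)            # Step 1 / 4: highest remaining score
        keep.append(best)
        # Steps 2-3: drop every remaining box that overlaps `best` too much.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep                        # Step 5: loop until nothing is left


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first with IoU of about 0.68 and is suppressed, while the third box does not overlap and survives.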

 

 

3. Single-stage detector

 

3.0 Comparison with two-stage detectors

  • One-stage vs. two-stage 
  • No explicit ROI pooling


3.1 You only look once (YOLO)

  • YOLO Architecture

    • The final layer has 30 dimensions per grid cell (length: 5B + C, with B=2, C=20)
    • S×S grid (S=7): the spatial resolution of the final CNN layer
  • Performance
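The output size described under the YOLO architecture follows from simple arithmetic: each of the B boxes in a cell carries 5 numbers (x, y, w, h, confidence), and each cell adds C class probabilities. A small sketch, with the helper name `yolo_output_shape` chosen only for illustration:

```python
def yolo_output_shape(S=7, B=2, C=20):
    """Shape of the YOLO prediction tensor for an S x S grid."""
    per_cell = 5 * B + C   # 5 values per box (x, y, w, h, confidence) + C class scores
    return (S, S, per_cell)


shape = yolo_output_shape()            # values from the notes: S=7, B=2, C=20
total_values = shape[0] * shape[1] * shape[2]
```

With the values from the notes, this gives a 7×7×30 tensor, that is, 1470 predicted numbers per image.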

 

3.2 Single Shot MultiBox Detector (SSD)

  • YOLO is fast, but its localization accuracy is comparatively poor
  • SSD therefore proposes a way to handle multi-scale objects better
  • SSD Architecture
  • Performance
    • Despite the different input sizes, both mAP and FPS improve

 

4. Two-stage detector vs. one-stage detector

 

 

4.1 Focal loss

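Because a single-stage detector has no ROI pooling, a loss is computed at every location, and each location contributes some gradient.

Background regions usually dominate and positive regions are comparatively few, so the loss is flooded with information from many uninformative negative samples, which causes a class imbalance problem.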

  • class imbalance problem

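    • Focal loss puts a probability-dependent factor in front of the cross-entropy term
    • Well-classified examples get a low loss, while poorly classified examples get a high loss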
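The effect of the extra probability term can be checked numerically. This is a minimal sketch of the binary case, FL(p_t) = -(1 - p_t)^γ · log(p_t), where p_t is the predicted probability of the true class. The function names are illustrative, γ = 2 is an assumed default that the notes do not state, and the optional class-weighting factor α is omitted.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for predicted foreground probability p and label y."""
    p_t = p if y == 1 else 1.0 - p     # probability assigned to the true class
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0):
    """Cross-entropy scaled by (1 - p_t)^gamma, which shrinks easy examples."""
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy_ratio = focal_loss(0.9, 1) / cross_entropy(0.9, 1)   # well classified
hard_ratio = focal_loss(0.1, 1) / cross_entropy(0.1, 1)   # badly classified
```

With γ = 2, an easy example (p_t = 0.9) keeps only about 1% of its cross-entropy loss, while a hard one (p_t = 0.1) keeps about 81%, so the many easy background samples no longer dominate the total loss.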

 

4.2 RetinaNet

  • RetinaNet is a one-stage network
  • class subnet , box subnet
  • Performance

 

5. Detection with Transformer

*Transformer

  • Transformer has shown a great success in NLP
  • Why not extend Transformers to computer vision tasks?
    • ViT (Vision Transformer) by Google
    • DeiT (Data-efficient image Transformer) by Facebook
    • DETR (DEtection TRansformer) by Facebook

 

*DETR

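  • Input tokens are built by pairing the CNN features with a multi-dimensional encoding of each position (positional encoding)
  • These tokens are fed to the transformer as input
  • The encoded features are passed to the decoder, which is then queried
  • Predictions (class, bbox) are made from the decoder output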

 

 

*Further reading

  • Another trend in object detection
    • Bounding box can be represented by other ways (left-top, right-bottom, centroid & size)
    • Idea: Let’s detect objects using corresponding points!
    • CornerNet/CenterNet will be covered in Lecture 7
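The box representations mentioned above carry the same information and can be converted into each other. A small sketch, assuming image coordinates with the origin at the top-left; the function names are illustrative.

```python
def corners_to_center(box):
    """(x1, y1, x2, y2) left-top / right-bottom -> (cx, cy, w, h) centroid & size."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


def center_to_corners(box):
    """(cx, cy, w, h) centroid & size -> (x1, y1, x2, y2) left-top / right-bottom."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


center_form = corners_to_center((10, 20, 50, 80))
corner_form = center_to_corners(center_form)   # round trip back to corners
```

Because the conversion is lossless, a detector can predict whichever form is most convenient, for example a pair of corner points or a center point plus size.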