*Object Detection
1.1 Fundamental image recognition tasks
- Semantic < Instance < Panoptic (in order of increasing difficulty)
- Unlike semantic segmentation, instance segmentation separates each individual object even within the same class
- Object detection is required to perform these tasks
1.2 What is object detection?
- classification
- bounding box
1.3 What are the applications of object detection?
- Autonomous driving
- Optical Character Recognition (OCR)
2. Two-stage detector
2.0 Traditional methods: hand-crafted techniques
- Gradient-based detector (e.g., HOG)
- Selective search
- Over-segmentation
- Iteratively merging similar regions
- Extracting candidate boxes from all remaining segmentations
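The merging loop above can be sketched in a few lines. This is a deliberately minimal toy, not the full selective search algorithm: regions are plain pixel-coordinate arrays, the similarity measure is only a color-histogram intersection (real selective search also uses texture, size, and fill, and merges adjacent regions only), and the `selective_search`, `color_hist`, and `similarity` names are made up for this sketch.

```python
import numpy as np

def color_hist(pixels, bins=8):
    """Per-channel color histogram, L1-normalised, as the region feature."""
    h = np.concatenate([np.histogram(pixels[:, c], bins=bins, range=(0, 256))[0]
                        for c in range(pixels.shape[1])]).astype(float)
    return h / (h.sum() + 1e-8)

def similarity(h1, h2):
    """Histogram intersection: larger means more similar regions."""
    return np.minimum(h1, h2).sum()

def selective_search(regions, image):
    """regions: list of (N_i, 2) arrays of (row, col) pixel coordinates
    taken from an initial over-segmentation of `image` (H, W, 3)."""
    feats = [color_hist(image[r[:, 0], r[:, 1]]) for r in regions]

    def bbox(r):  # (x1, y1, x2, y2) of a region
        return (r[:, 1].min(), r[:, 0].min(), r[:, 1].max(), r[:, 0].max())

    boxes = [bbox(r) for r in regions]          # every segment is a candidate
    regions = list(regions)
    while len(regions) > 1:
        # Find the most similar pair of regions ...
        best, pair = -1.0, None
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                s = similarity(feats[i], feats[j])
                if s > best:
                    best, pair = s, (i, j)
        i, j = pair
        # ... merge them and keep the merged region's box as a new candidate
        merged = np.vstack([regions[i], regions[j]])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        feats = [f for k, f in enumerate(feats) if k not in (i, j)]
        feats.append(color_hist(image[merged[:, 0], merged[:, 1]]))
        boxes.append(bbox(merged))
    return boxes
```

Starting from n segments, the loop performs n-1 merges, so 2n-1 candidate boxes come out in total.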
2.1 R-CNN
- region proposal (~2k)
- warped region (reshape)
- classifier: SVM
- Very slow, because every region from the region proposal stage is passed through the CNN independently
2.2 Fast R-CNN
- Recycle a pre-computed feature for multiple object detection
- Conv. feature map from the original image
- ROI feature extraction from the feature map through ROI pooling
- Class and box prediction for each ROI
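The ROI pooling step above can be sketched as follows: crop the shared feature map under the ROI, split the crop into a fixed grid of bins, and max-pool each bin, so every ROI yields a fixed-size feature regardless of its shape. This is a simplified NumPy sketch (the `roi_pool` name and integer feature-map coordinates are assumptions; real implementations also handle sub-pixel coordinates and batching).

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool the feature-map crop under `roi` into a fixed-size grid.
    feature_map: (H, W, C); roi: (x1, y1, x2, y2) in feature-map coordinates."""
    x1, y1, x2, y2 = roi
    crop = feature_map[y1:y2, x1:x2]
    oh, ow = output_size
    h, w = crop.shape[:2]
    # Split the crop into oh x ow roughly equal bins and max-pool each bin
    ys = np.linspace(0, h, oh + 1).astype(int)
    xs = np.linspace(0, w, ow + 1).astype(int)
    out = np.zeros((oh, ow, crop.shape[2]))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = crop[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max(axis=(0, 1))
    return out
```

Because the output is always `output_size` x C, ROIs of any size can be fed to the same fully connected class/box heads.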
2.3 Faster R-CNN
- End-to-End object detection by neural region proposal
- IoU
- Anchor boxes
- Region Proposal Network (RPN)
- A single feature map is extracted from the image and fed into the RPN, which produces the region proposals
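The anchor boxes mentioned above are a fixed set of reference boxes tiled at every feature-map location. A minimal sketch of generating one such set, assuming the common 3 scales x 3 aspect ratios (the `make_anchors` name and the width/height convention `ratio = w/h` are choices made for this sketch):

```python
import numpy as np

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Anchor boxes (x1, y1, x2, y2) centred at the origin,
    one per (scale, ratio) pair; shifted copies tile every location."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)   # w and h chosen so that w * h == s * s
            h = s / np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

anchors = make_anchors()   # 3 scales x 3 ratios = 9 anchors per location
```

The RPN then predicts, for each of these 9 anchors at each location, an objectness score and box offsets.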
- Non-Maximum Suppression (NMS)
- Step 1: Select the box with the highest objectness score
- Step 2: Compute the IoU of this box with each remaining box
- Step 3: Remove the bounding boxes whose IoU is 50% or higher
- Step 4: Move to the box with the next highest objectness score
- Step 5: Repeat steps 2-4
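The five steps above can be sketched directly in NumPy, together with the IoU computation they rely on. A minimal sketch, assuming `(x1, y1, x2, y2)` boxes and a 0.5 IoU threshold:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedily keep the highest-scoring box and drop every remaining
    box that overlaps it by iou_thresh or more (Steps 1-5 above)."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]                                  # Step 1 / Step 4
        keep.append(i)
        if order.size == 1:
            break
        overlaps = iou(boxes[i], boxes[order[1:]])    # Step 2
        order = order[1:][overlaps < iou_thresh]      # Step 3
    return keep
```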
- Summary of the R-CNN family
3. Single-stage detector
3.0 Comparison with two-stage detectors
- One-stage vs. two-stage
- No explicit ROI pooling
3.1 You only look once (YOLO)
- YOLO Architecture
- The last layer outputs 30 dimensions per grid cell (length: 5B + C, with B=2, C=20)
- SxS grid (S=7) -> the resolution of the CNN's last layer
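The shape arithmetic above can be checked concretely: with S=7, B=2, C=20, the output tensor is 7x7x30, and each cell holds B boxes of `(x, y, w, h, confidence)` plus C shared class probabilities. A small sketch (the random tensor stands in for a real network output):

```python
import numpy as np

S, B, C = 7, 2, 20                       # grid size, boxes per cell, classes
out = np.random.rand(S, S, 5 * B + C)    # the 7x7x30 output tensor

# Decode one grid cell: B (x, y, w, h, conf) boxes + C class probabilities
cell = out[3, 3]
boxes = cell[:5 * B].reshape(B, 5)       # 2 boxes, 5 numbers each
class_probs = cell[5 * B:]               # 20 class scores shared by the cell

# Class-specific confidence = box confidence * class probability
scores = boxes[:, 4:5] * class_probs     # shape (B, C) = (2, 20)
```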
- Performance
3.2 Single Shot MultiBox Detector (SSD)
- YOLO is fast, but its localization accuracy is relatively poor
- SSD therefore proposes a way to better handle objects at multiple scales
- SSD Architecture
- Performance
- Although the input sizes differ, both mAP and FPS improve
4. Two-stage detector vs. one-stage detector
4.1 Focal loss
Because a single-stage detector has no ROI pooling, the loss is computed over every location in the image, and every location contributes some gradient.
Background regions are plentiful while positive regions are relatively scarce, so the flood of uninformative negative samples creates a class imbalance problem.
- class imbalance problem
- Focal loss multiplies the cross-entropy by a probability-dependent factor
- Well-classified examples receive a small loss, while misclassified examples receive a large loss
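The behavior described above can be sketched numerically: focal loss scales the cross-entropy by `(1 - p_t)^gamma`, so easy examples (high true-class probability) are down-weighted. A minimal binary sketch with the commonly used `gamma=2`, `alpha=0.25`:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy scaled by (1 - p_t)^gamma.
    p: predicted foreground probability; y: 1 for positive, 0 for negative."""
    p_t = np.where(y == 1, p, 1 - p)             # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# An easy background sample (p=0.01) contributes almost nothing,
# while a hard positive (p=0.1) keeps a large loss
easy = focal_loss(np.array([0.01]), np.array([0]))
hard = focal_loss(np.array([0.1]), np.array([1]))
```

With gamma=0 and alpha=0.5 the modulating factor disappears and this reduces (up to a constant) to plain cross-entropy, which is exactly the class-imbalance-prone baseline described above.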
4.2 RetinaNet
- RetinaNet is a one-stage network
- class subnet , box subnet
- Performance
5. Detection with Transformer
*Transformer
- Transformer has shown great success in NLP
- Why not extend Transformer to computer vision tasks?
- ViT (Vision Transformer) by Google
- DeiT (Data-efficient image Transformer) by Facebook
- DETR (DEtection TRansformer) by Facebook
*DETR
- Input tokens are built by pairing the CNN features with a multi-dimensional positional encoding of each location
- These tokens are fed into the transformer encoder
- The encoded features are passed to the decoder, which queries them with object queries
- The decoder outputs produce the predictions (class, bbox)
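The token-construction step above can be sketched as follows: flatten the H x W x d CNN feature map into H*W tokens and add a positional encoding per location. For simplicity this sketch uses the standard 1D sinusoidal encoding over flattened positions, whereas DETR actually uses a 2D (row/column) encoding; the `positional_encoding` name and the toy shapes are assumptions.

```python
import numpy as np

def positional_encoding(n_pos, d_model):
    """Standard sinusoidal encoding: one d_model-dim vector per position,
    sine on even dimensions, cosine on odd dimensions."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

# A CNN feature map of shape (H, W, d) becomes H*W transformer tokens,
# each combined with the positional encoding of its location
H, W, d = 8, 8, 256
features = np.random.rand(H, W, d)
tokens = features.reshape(H * W, d) + positional_encoding(H * W, d)
```

These 64 tokens are what the encoder consumes; the decoder then attends to them from a fixed set of learned object queries, one prediction slot each.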
*Further reading
- Another trend in object detection
- Bounding box can be represented by other ways (left-top, right-bottom, centroid & size)
- Idea: Let’s detect objects using corresponding points!
- CornerNet/CenterNet will be covered in Lecture 7