YOLO9000: Better, Faster, Stronger review

2022. 4. 6. 21:37

YOLO9000: Better, Faster, Stronger

Preview

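Like YOLOv1, this is a one-stage object detector.
YOLOv2: fixes the weaknesses of YOLOv1, making it both faster and more accurate
YOLO9000: increases the number of classes that can be predicted, which detection datasets limit because they contain few classes
- Joint training with a classification dataset makes it possible to predict object classes that do not exist in the detection dataset
A new classification network, Darknet-19, improves performance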

Better

Batch Normalization
High Resolution Classifier
Convolutional with Anchor Box
Dimension Clusters
Direct Location Prediction
Fine-Grained Features
Multi-Scale Training

Faster

Darknet-19

Stronger

Dataset combination with WordTree
Joint Classification and Detection

Better

1. Batch Normalization

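During training, each mini-batch is normalized with its own mean and variance, then scaled by $\gamma$ and shifted by $\beta$, both of which are learned. Running averages of the batch statistics are stored and used at test time.

Effects

Faster convergence when training with mini-batches
Reduces overfitting (mAP improves by about 2%)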
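
The difference between training and test behavior can be sketched in plain Python for a single scalar feature. This is only an illustration, not the Darknet implementation: the class name `BatchNormFeature`, the `momentum` value, and `eps` are assumptions made for the example, and the optimizer step that would update $\gamma$ and $\beta$ is left out.

```python
import math


class BatchNormFeature:
    """Batch normalization for one scalar feature (illustrative sketch)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma = 1.0          # learned scale (optimizer update not shown)
        self.beta = 0.0           # learned shift (optimizer update not shown)
        self.running_mean = 0.0   # stored during training, used at test time
        self.running_var = 1.0    # stored during training, used at test time
        self.momentum = momentum  # assumed value for the running average
        self.eps = eps            # avoids division by zero

    def forward(self, xs, training):
        if training:
            # Statistics of the current mini-batch.
            mean = sum(xs) / len(xs)
            var = sum((x - mean) ** 2 for x in xs) / len(xs)
            # Keep running averages for later use at test time.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # At test time the stored statistics replace the batch statistics.
            mean, var = self.running_mean, self.running_var
        scale = self.gamma / math.sqrt(var + self.eps)
        return [scale * (x - mean) + self.beta for x in xs]
```

In training mode the output of a batch has zero mean; in test mode a single input can be normalized because no batch statistics are needed.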

2. High Resolution Classifier

YOLOv1

Darknet is first pretrained on 224x224 images,
then detection is run on 448x448 images

Problem: because the model was trained on 224x224 images, it needs time to adapt to 448x448 inputs

YOLOv2

Darknet is pretrained on 224x224 images, then fine-tuned on the classification task with 448x448 images for the last 10 or so epochs,
after which detection uses 416x416 images

Q. Why does detection use 416x416 images rather than 448x448?

→ So that the final feature map (13x13) has an odd size and therefore contains a single center cell
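
The arithmetic behind this choice is easy to check. The sketch below assumes a total downsampling factor of 32, which is what 416 / 13 implies; the function name `final_grid_size` is made up for the example.

```python
def final_grid_size(input_size, stride=32):
    """Spatial size of the final feature map for a square input.

    stride is the total downsampling factor of the network (assumed 32 here).
    """
    if input_size % stride != 0:
        raise ValueError("input size must be a multiple of the stride")
    return input_size // stride


for size in (448, 416):
    grid = final_grid_size(size)
    kind = "odd, one center cell" if grid % 2 == 1 else "even, no single center cell"
    print(f"{size}x{size} input -> {grid}x{grid} grid ({kind})")
```

A 448x448 input would give a 14x14 grid, where the image center falls on the corner shared by four cells instead of inside one cell.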

3. Convolutional with Anchor Box

YOLOv1

YOLOv2

Network

The fc layers are removed and prediction is done with conv layers
One max pooling layer is removed

→ this keeps the output at a higher resolution

Anchor box

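Each grid cell uses 5 anchor boxes

→ YOLOv1 predicts bbox coordinates (values between 0 and 1) directly for each grid cell, starting from random initialization, and has to find good values from scratch. Defining anchor boxes and learning offsets to them through bbox regression is a simpler problem and is easier for the network to learn.

→ Also, YOLOv1 uses 2 bboxes per grid cell, so it predicts objects with 98 bboxes in total, whereas YOLOv2 with anchor boxes predicts objects with many more bboxes.

In YOLOv1 all bboxes in a grid cell share the same class probabilities, whereas in YOLOv2 each anchor box in a grid cell has its own class probabilities.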
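
The box counts and the effect of moving class probabilities from the cell to the anchor box can be worked out with a few lines of arithmetic. The sketch assumes a 7x7 grid for YOLOv1 (implied by 98 boxes at 2 per cell), a 13x13 grid for YOLOv2, and 20 classes as in PASCAL VOC; the function names are made up for the example.

```python
def total_boxes(grid, boxes_per_cell):
    """Number of boxes predicted over a grid x grid feature map."""
    return grid * grid * boxes_per_cell


def yolov1_outputs_per_cell(boxes_per_cell=2, num_classes=20):
    # Each box has x, y, w, h, confidence; class probabilities are shared by the cell.
    return boxes_per_cell * 5 + num_classes


def yolov2_outputs_per_cell(anchors=5, num_classes=20):
    # Each anchor box has x, y, w, h, confidence AND its own class probabilities.
    return anchors * (5 + num_classes)


print(total_boxes(7, 2))          # YOLOv1: 98 boxes
print(total_boxes(13, 5))         # YOLOv2 at 416x416: 845 boxes
print(yolov1_outputs_per_cell())  # 30 values per cell
print(yolov2_outputs_per_cell())  # 125 values per cell
```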

4. Dimension Clusters

Existing anchor box models (Faster R-CNN)

The anchor box ratios and sizes are hand-picked in advance

YOLOv2

To start training from better anchor box ratios and sizes, k-means clustering is used to find the priors

YOLOv2 k-means clustering

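Run on the width and height values of the ground truth boxes
The distance metric is based on IoU instead of Euclidean distance: $d(\text{box}, \text{centroid}) = 1 - \text{IoU}(\text{box}, \text{centroid})$

→ With a plain Euclidean distance, boxes that are not actually similar can end up in the same cluster.

→ In addition, large bboxes produce larger errors than small bboxes.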
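
A minimal sketch of k-means over box dimensions with an IoU-based distance is shown below. It is a plain-Python illustration under stated assumptions, not the authors' code: boxes are `(width, height)` tuples treated as if they shared one center, initial centroids are sampled at random, and `kmeans_iou`, `iou_wh`, `iters`, and `seed` are names chosen for the example.

```python
import random


def iou_wh(box, centroid):
    """IoU of two (width, height) boxes assumed to share the same center."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_iou(boxes, k, iters=100, seed=0):
    """Cluster (width, height) boxes using distance d = 1 - IoU."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Smallest 1 - IoU is the same as largest IoU.
            best = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            clusters[best].append(box)
        updated = []
        for i, members in enumerate(clusters):
            if members:
                mean_w = sum(b[0] for b in members) / len(members)
                mean_h = sum(b[1] for b in members) / len(members)
                updated.append((mean_w, mean_h))
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # assignments no longer change
        centroids = updated
    return centroids


boxes = [(10, 10), (11, 11), (10, 11), (100, 50), (102, 48), (98, 52)]
print(sorted(kmeans_iou(boxes, k=2)))
```

Because the distance depends on overlap rather than on absolute size differences, a 2-pixel mismatch counts for much more on a 10x10 box than on a 100x50 box, which is the behavior the notes above ask for.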

5. Direct Location Prediction

A sigmoid function is applied to the predicted center coordinates ($t_x, t_y$) so that the box center stays inside its grid cell
Why the sigmoid: if the offset function $d$ is unconstrained, as in Faster R-CNN, the predicted box can end up outside its cell
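
The constraint can be sketched as a small decoding function. The formulas follow the YOLOv2 paper ($b_x = \sigma(t_x) + c_x$, $b_y = \sigma(t_y) + c_y$, $b_w = p_w e^{t_w}$, $b_h = p_h e^{t_h}$); the symbols $c_x, c_y$ (cell offset), $p_w, p_h$ (anchor size), and $t_w, t_h$ are not introduced in the notes above, and the function name `decode_box` is made up for the example. Units are grid cells.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Turn raw network outputs into a box, in grid-cell units.

    (c_x, c_y): top-left corner of the responsible grid cell.
    (p_w, p_h): width and height of the anchor box (prior).
    """
    b_x = sigmoid(t_x) + c_x   # center x stays within [c_x, c_x + 1]
    b_y = sigmoid(t_y) + c_y   # center y stays within [c_y, c_y + 1]
    b_w = p_w * math.exp(t_w)  # width scales the anchor width
    b_h = p_h * math.exp(t_h)  # height scales the anchor height
    return b_x, b_y, b_w, b_h


# However extreme t_x and t_y are, the center never leaves cell (6, 3).
for t in (-50.0, 0.0, 50.0):
    print(decode_box(t, t, 0.0, 0.0, c_x=6, c_y=3, p_w=2.5, p_h=1.5))
```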

6. Fine-Grained Features

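The final output feature map of YOLOv2 is 13x13. A feature map this small is good for predicting large objects but makes small objects hard to predict.
So the 26x26 feature map from an earlier layer is split spatially into four 13x13 pieces with every channel kept, the pieces are stacked along the channel dimension to give a 13x13 feature map with 4 times as many channels, and the result is concatenated with the feature map of the later layer.

Summary

High resolution feature map: good for predicting small objects
Low resolution feature map: good for predicting large objects
Combining the two gives a model that is robust to object size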
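
The 26x26 to 13x13 rearrangement is a space-to-depth operation and can be sketched with nested lists. This is an illustration only: the feature map is indexed `[row][col][channel]`, the name `space_to_depth` is chosen for the example, and the channel ordering used by the actual Darknet layer may differ from the one here.

```python
def space_to_depth(fmap, block=2):
    """Move each block x block spatial patch into the channel dimension.

    fmap is a nested list indexed [row][col][channel]. The result has
    height/block rows, width/block columns, and channels * block * block channels.
    """
    height, width = len(fmap), len(fmap[0])
    if height % block or width % block:
        raise ValueError("spatial size must be divisible by the block size")
    out = []
    for i in range(0, height, block):
        row = []
        for j in range(0, width, block):
            stacked = []
            for di in range(block):
                for dj in range(block):
                    # Append all channels of one of the block*block neighbors.
                    stacked.extend(fmap[i + di][j + dj])
            row.append(stacked)
        out.append(row)
    return out


# A 26x26 map with 2 channels (each pixel stores its own row and column).
fmap = [[[r, c] for c in range(26)] for r in range(26)]
out = space_to_depth(fmap)
print(len(out), len(out[0]), len(out[0][0]))  # 13 13 8
```

No values are discarded: every fine-grained activation is still present, only relocated so that it lines up with the 13x13 grid of the later layer.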

7. Multi-Scale Training

To be robust to different image sizes, the input size is changed at random every 10 batches. (Input sizes: {320, 352, ..., 608}, in steps of 32)

→ Unlike YOLOv1, YOLOv2 is fully convolutional and uses no fc layer, so the input size can change
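
The schedule can be sketched as follows. Only the size set and the 10-batch interval come from the text; the function name `input_size_schedule`, the uniform sampling, and the seed are assumptions for the example.

```python
import random

# {320, 352, ..., 608}: multiples of 32 between 320 and 608.
INPUT_SIZES = list(range(320, 608 + 1, 32))


def input_size_schedule(num_batches, interval=10, seed=0):
    """Return the input size used for each batch, resampled every `interval` batches."""
    rng = random.Random(seed)
    sizes = []
    current = None
    for batch_idx in range(num_batches):
        if batch_idx % interval == 0:
            current = rng.choice(INPUT_SIZES)  # assumed: uniform choice
        sizes.append(current)
    return sizes


print(INPUT_SIZES)
print(input_size_schedule(30))
```

In a real training loop each batch would be resized to its scheduled size before the forward pass; since every size is a multiple of 32, the output grid is always a whole number of cells.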

Faster

DarkNet(YOLOv1)

A CNN architecture based on GoogLeNet

DarkNet-19(YOLOv2)

A new CNN architecture that takes the strengths of VGG-16 and GoogLeNet
Uses 3x3 filters as in VGG-16 to build a deeper network
Uses 1x1 filters as in GoogLeNet to compress the feature representation further
Uses global average pooling to reduce the number of parameters
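
To get a feel for how much global average pooling saves, the sketch below compares the parameter count of two classification heads. The numbers are assumptions for illustration: a 7x7x1024 final feature map and 1000 classes. The function names are made up for the example.

```python
def fc_head_params(height, width, channels, num_classes):
    """Flatten the feature map and feed one fully connected layer (weights + biases)."""
    return height * width * channels * num_classes + num_classes


def gap_head_params(channels, num_classes):
    """1x1 convolution to num_classes channels, then global average pooling.

    The pooling itself has no parameters, so only the 1x1 convolution counts.
    """
    return channels * num_classes + num_classes


print(fc_head_params(7, 7, 1024, 1000))  # 50177000
print(gap_head_params(1024, 1000))       # 1025000
```

Under these assumptions the pooled head needs about 49 times fewer parameters, and it no longer depends on the spatial size of the input.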

Stronger

1. Hierarchical Classification

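Existing classification datasets are built on WordNet, which is structured as a directed graph rather than a tree.
YOLO9000 builds a WordTree from it so that the model can be trained with hierarchical classification

→ That is, labels that share a common root in the WordNet structure are grouped together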

2. Dataset Combination with WordTree

ImageNet + COCO dataset + ImageNet Detection
9000 classes in total

3. Joint Classification and detection

The detection dataset and the classification dataset differ greatly in size, so oversampling is used to balance them
Detection is learned from the COCO data, while 9000 object classes can be told apart

Loss

Detection dataset: classification + bbox regression
Classification dataset: only classification loss
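
The routing rule above amounts to a switch on where the image came from. The sketch below is schematic: the function name `joint_loss`, the string labels, and the plain sum of the two terms are assumptions, and the individual loss values would come from the network in practice.

```python
def joint_loss(source, classification_loss, bbox_loss=0.0):
    """Pick the loss terms to backpropagate based on the image's source dataset."""
    if source == "detection":
        # Boxes are labeled, so both terms are used.
        return classification_loss + bbox_loss
    if source == "classification":
        # No box labels exist, so only the classification term is used.
        return classification_loss
    raise ValueError(f"unknown source: {source}")


print(joint_loss("detection", classification_loss=0.5, bbox_loss=1.5))       # 2.0
print(joint_loss("classification", classification_loss=0.5, bbox_loss=1.5))  # 0.5
```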

Experiment

1. Classification

Dataset: 1000-class ImageNet
Previously softmax was computed over all classes; in YOLO9000 a softmax is computed only among nodes that share the same parent node, giving conditional probabilities
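
A small sketch of the per-sibling softmax follows. The tree in `PARENT` is a made-up toy example, not the real WordTree, and the function names are chosen for illustration. Each softmax yields the probability of a node given its parent; multiplying these along the path to the root gives the absolute probability of a node.

```python
import math

# Hypothetical toy tree: child -> parent.
PARENT = {
    "animal": "root", "vehicle": "root",
    "dog": "animal", "cat": "animal",
    "terrier": "dog", "hound": "dog",
}


def sibling_softmax(logits, parent=PARENT):
    """Softmax within each group of siblings: returns P(node | parent)."""
    groups = {}
    for node in logits:
        groups.setdefault(parent[node], []).append(node)
    conditional = {}
    for siblings in groups.values():
        top = max(logits[n] for n in siblings)  # subtract the max for stability
        exps = {n: math.exp(logits[n] - top) for n in siblings}
        total = sum(exps.values())
        for n in siblings:
            conditional[n] = exps[n] / total
    return conditional


def absolute_probability(node, conditional, parent=PARENT):
    """Multiply conditional probabilities along the path from node to root."""
    prob = 1.0
    while node != "root":
        prob *= conditional[node]
        node = parent[node]
    return prob


logits = {name: 0.0 for name in PARENT}  # equal scores everywhere
cond = sibling_softmax(logits)
print(absolute_probability("terrier", cond))  # 0.5 * 0.5 * 0.5 = 0.125
```

With this scheme a prediction can stop at an inner node such as "dog" when the model is unsure which breed it is looking at.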

2. Object Detection

Dataset: ImageNet Detection + COCO
Build the WordTree with 9000 classes, then train YOLOv2
Performance is good on animal classes

Reference

youtube

https://www.youtube.com/watch?v=vLdrI8NCFMs

blog

https://herbwood.tistory.com/17?category=856250
https://adioshun.gitbooks.io/semantic-segmentation/content/2016yolo2-yolo9000.html
https://m.blog.naver.com/sogangori/221011203855

Jimin's history

CATEGORIES