
SSD: Single Shot MultiBox Detector

by Hotbingsoo 2021. 6. 27.

SSD is an object detection algorithm that followed YOLOv1 and improved on both its mAP and its speed. YOLO achieved a dramatic speed improvement over the two-stage algorithms of the RCNN series, but its accuracy was limited.

Structure of SSD and Yolo

  • Input: 300×300, 3-channel image
  • Output: bounding boxes and object classes
  • Backbone network: VGG-16
  • Multiscale feature maps

As you can see in the figure above, the biggest difference between YOLO and SSD is that SSD performs detection on feature maps of various sizes. This addresses YOLO's weakness at finding small objects.
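The effect of detecting on several feature maps can be seen in the numbers: SSD300 predicts from six maps of decreasing resolution, each contributing a few default boxes per location. A quick sanity check, using the feature-map sizes and boxes-per-location from the paper:

```python
# Feature-map sizes and default boxes per location for SSD300 (from the paper).
fmap_sizes = [38, 19, 10, 5, 3, 1]
boxes_per_loc = [4, 6, 6, 6, 4, 4]

total = sum(s * s * n for s, n in zip(fmap_sizes, boxes_per_loc))
print(total)  # 8732 default boxes in total
```

The large 38×38 map alone contributes 5776 small boxes, which is where the improved small-object detection comes from.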

 

Object detection

Object detection with bounding boxes

Briefly, object detection is an image-processing task that classifies which objects are in an image and localizes where they are. So we can think of object detection as the combination of a regression problem (finding the size and position of the bounding box) and a classification problem (finding out what the object is).

 

 

Implementation of SSD

The SSD consists of three convolutional stages.

  • Backbone network : VGG-16
  • Auxiliary convolutions : provides feature maps of various sizes
  • Prediction convolutions : identify and locate objects

Backbone Network

Modification of Backbone network

The backbone network is a pretrained network that has been proven to perform well on classification. By using a verified network, we get a stable feature extractor without having to train from scratch. The backbone is then slightly modified. So if you are using VGG-16, you can follow the paper directly, but if you want to swap in a different network, you have to modify it carefully and for the same reasons.

 

Since what we are doing is not classification, we replace the fully connected layers at the end with convolution layers. For details on converting an FC layer into a Conv layer, see the paper.
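The key observation behind this conversion is that an FC layer over a C×H×W input is the same computation as a convolution whose kernel covers the entire input: the weights just get reshaped. A minimal numpy sketch with toy sizes (all names and dimensions here are made up for illustration):

```python
import numpy as np

# A toy FC layer over a flattened 2x3x3 input, producing 4 outputs.
C, H, W, N = 2, 3, 3, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((C, H, W))
fc_weight = rng.standard_normal((N, C * H * W))

# FC forward pass on the flattened input.
fc_out = fc_weight @ x.reshape(-1)

# The same weights reshaped into N conv filters of shape (C, H, W).
# A convolution whose kernel covers the whole input reduces to one dot
# product per filter, so the outputs are identical.
conv_weight = fc_weight.reshape(N, C, H, W)
conv_out = np.array([(f * x).sum() for f in conv_weight])

assert np.allclose(fc_out, conv_out)
```

Because the converted layer is a convolution, it can then slide over larger feature maps, which is exactly what SSD needs.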

 

Auxiliary Convolution Layers

Auxiliary Convolution Layers

To obtain feature maps of various sizes, SSD adds auxiliary convolution layers after the backbone network. By alternating pointwise (1×1) convolutions and 3×3 convolutions, these layers adjust the number of channels and the size of the feature map.

 

Now the outputs of layers 4, 7, 8, 9, 10, and 11 (conv4_3, conv7, conv8_2, conv9_2, conv10_2, and conv11_2 in the paper) are used as inputs to the prediction convolution layers.

 

Prediction Convolution Layers

Now we have feature maps of various sizes from a stable, well-performing backbone network. From these features, we have to classify and localize the objects.

 

If you think of object detection as drawing a box on an image, what we need to find in localization is the position and size of that box.

 

Priors

priors

Searching for objects over the entire image makes the problem's complexity too high, so the search space should be narrowed. At each feature-map position, SSD uses predefined boxes called priors. The priors have different scales and aspect ratios depending on the feature map. By using this concept of priors, SSD dramatically reduces the complexity of the problem.
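The idea can be sketched in a few lines: tile box centers over the feature-map grid and attach one box per (scale, aspect ratio) pair at each center. This is a simplified version of the paper's scheme (the extra √(s_k·s_{k+1}) box for aspect ratio 1 is omitted, and the scale value here is just an example):

```python
import numpy as np

def make_priors(fmap_size, scale, aspect_ratios):
    """Generate (cx, cy, w, h) priors in fractional image coordinates
    for one square feature map."""
    priors = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            cx = (j + 0.5) / fmap_size  # box centers sit mid-cell
            cy = (i + 0.5) / fmap_size
            for ar in aspect_ratios:
                # same area scale**2 for each box; width/height set by ratio
                priors.append([cx, cy, scale * np.sqrt(ar), scale / np.sqrt(ar)])
    return np.array(priors)

# e.g. a coarse 3x3 map with scale 0.5 and three aspect ratios
p = make_priors(3, 0.5, [1.0, 2.0, 0.5])
print(p.shape)  # (27, 4): 3 x 3 locations x 3 priors each
```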

Prior and the Target box

However, these predefined boxes can't be the final result; they must be adjusted to get a more precise localization. So the prediction convolutions are trained to produce the offsets dx, dy from the prior's position and the ratios dw, dh to the prior's size.
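Decoding a prediction back into an absolute box then just applies those offsets and ratios to the prior. The sketch below uses the common SSD parameterization (the size ratios go through an exponential so any real-valued prediction yields a positive width and height); the variance scaling terms used in practice are omitted:

```python
import math

def decode(prior, offsets):
    """Turn predicted offsets into an absolute box.
    Boxes are (cx, cy, w, h); the parameterization is:
      cx = p_cx + dx * p_w,  cy = p_cy + dy * p_h
      w  = p_w * exp(dw),    h  = p_h * exp(dh)
    """
    p_cx, p_cy, p_w, p_h = prior
    dx, dy, dw, dh = offsets
    return (p_cx + dx * p_w, p_cy + dy * p_h,
            p_w * math.exp(dw), p_h * math.exp(dh))

# Zero offsets reproduce the prior itself.
box = decode((0.5, 0.5, 0.2, 0.3), (0.0, 0.0, 0.0, 0.0))
print(box)  # (0.5, 0.5, 0.2, 0.3)
```

Note that dx and dy are measured in units of the prior's width and height, which keeps the regression targets at a similar magnitude across prior sizes.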

The Final form of Prediction Convolution Layer's Result

 

For class prediction, just like the localization prediction, convolution layers are trained to compute class confidences for every feature-map location and every prior.
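In terms of tensor shapes, a prediction convolution over an f×f map with k priors per location emits 4k localization channels and (n_classes·k) class-score channels, which are then flattened to one row per prior. A shapes-only sketch (the concrete numbers here match SSD300's conv8_2 map, but the code is illustrative):

```python
import numpy as np

f, k, n_classes = 10, 6, 21  # e.g. conv8_2 in SSD300; 20 classes + background

# Prediction-conv outputs in (channels, H, W) layout.
loc = np.zeros((4 * k, f, f))           # 4 offsets per prior
cls = np.zeros((n_classes * k, f, f))   # one score per class per prior

# Move channels last, then flatten so each row corresponds to one prior.
loc_per_prior = loc.transpose(1, 2, 0).reshape(-1, 4)
cls_per_prior = cls.transpose(1, 2, 0).reshape(-1, n_classes)
print(loc_per_prior.shape, cls_per_prior.shape)  # (600, 4) (600, 21)
```

Concatenating these rows across all six feature maps yields the full 8732-prior prediction that non-maximum suppression is run on.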

 

This is the basic concept and structure of SSD. I referenced here for details and figures.

 
