
SSD: Single Shot MultiBox Detector

by Hotbingsoo 2021. 6. 27.

SSD is an object detection algorithm that followed YOLOv1 and improved on both its mAP and its speed. YOLO achieved a dramatic speed improvement over the two-stage algorithms of the RCNN series, but its accuracy was limited.

Structure of SSD and Yolo

  • Input: 300×300, 3-channel image
  • Output: bounding boxes and object classes
  • Backbone network: VGG-16
  • Multiscale feature maps

As you can see in the figure above, the biggest difference between YOLO and SSD is that SSD performs detection on feature maps of various sizes. This addresses YOLO's weakness at finding small objects.
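The effect of detecting on several feature maps can be seen in the numbers: SSD300 predicts from six maps of decreasing resolution, each contributing a few default boxes per location. A quick sanity check, using the feature-map sizes and boxes-per-location from the paper:

```python
# Feature-map sizes and default boxes per location for SSD300 (from the paper).
fmap_sizes = [38, 19, 10, 5, 3, 1]
boxes_per_loc = [4, 6, 6, 6, 4, 4]

total = sum(s * s * n for s, n in zip(fmap_sizes, boxes_per_loc))
print(total)  # 8732 default boxes in total
```

The large 38×38 map alone contributes 5776 small boxes, which is where the improved small-object detection comes from.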

 

Object detection

Object detection with bounding boxes

Briefly, object detection is an image-processing task that classifies which objects are in an image and localizes where they are. So we can think of object detection as the combination of a regression problem (finding the size and position of the bounding box) and a classification problem (finding out what the object is).

 

 

Implementation of SSD

The SSD consists of three convolutional stages.

  • Backbone network : VGG-16
  • Auxiliary convolutions : provides feature maps of various sizes
  • Prediction convolutions : identify and locate objects

Backbone Network

Modification of Backbone network

The backbone network is a pretrained network that has been proven to perform well on classification. By using a verified network, we get a stable feature extractor without having to train from scratch. The backbone is then slightly modified. So if you are using VGG-16, you can follow the paper directly, but if you want to swap in a different network, you have to modify it carefully and for the same reasons.

 

Since what we are doing is not classification, we replace the fully connected layers at the end with convolution layers. For details on converting an FC layer into a Conv layer, see the paper.
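The key observation behind this conversion is that an FC layer over a C×H×W input is the same computation as a convolution whose kernel covers the entire input: the weights just get reshaped. A minimal numpy sketch with toy sizes (all names and dimensions here are made up for illustration):

```python
import numpy as np

# A toy FC layer over a flattened 2x3x3 input, producing 4 outputs.
C, H, W, N = 2, 3, 3, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((C, H, W))
fc_weight = rng.standard_normal((N, C * H * W))

# FC forward pass on the flattened input.
fc_out = fc_weight @ x.reshape(-1)

# The same weights reshaped into N conv filters of shape (C, H, W).
# A convolution whose kernel covers the whole input reduces to one dot
# product per filter, so the outputs are identical.
conv_weight = fc_weight.reshape(N, C, H, W)
conv_out = np.array([(f * x).sum() for f in conv_weight])

assert np.allclose(fc_out, conv_out)
```

Because the converted layer is a convolution, it can then slide over larger feature maps, which is exactly what SSD needs.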

 

Auxiliary Convolution Layers

Auxiliary Convolution Layers

To obtain feature maps of various sizes, SSD adds auxiliary convolution layers after the backbone network. By alternating pointwise (1×1) convolutions and 3×3 convolutions, these layers adjust the number of channels and the size of the feature map.

 

Now the outputs of layers 4, 7, 8, 9, 10, and 11 (conv4_3, conv7, conv8_2, conv9_2, conv10_2, and conv11_2 in the paper) are used as inputs to the prediction convolution layers.

 

Prediction Convolution Layers

Now we have feature maps of various sizes from a stable, well-performing backbone network. From these features, we have to classify and localize the objects.

 

If you think of object detection as drawing a box on an image, what we need to find in localization is the position and size of that box.

 

Priors

priors

Searching for objects over the entire image makes the problem's complexity too high, so the search space should be narrowed. At each feature-map position, SSD uses predefined boxes called priors. The priors have different scales and aspect ratios depending on the feature map. By using this concept of priors, SSD dramatically reduces the complexity of the problem.
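The idea can be sketched in a few lines: tile box centers over the feature-map grid and attach one box per (scale, aspect ratio) pair at each center. This is a simplified version of the paper's scheme (the extra √(s_k·s_{k+1}) box for aspect ratio 1 is omitted, and the scale value here is just an example):

```python
import numpy as np

def make_priors(fmap_size, scale, aspect_ratios):
    """Generate (cx, cy, w, h) priors in fractional image coordinates
    for one square feature map."""
    priors = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            cx = (j + 0.5) / fmap_size  # box centers sit mid-cell
            cy = (i + 0.5) / fmap_size
            for ar in aspect_ratios:
                # same area scale**2 for each box; width/height set by ratio
                priors.append([cx, cy, scale * np.sqrt(ar), scale / np.sqrt(ar)])
    return np.array(priors)

# e.g. a coarse 3x3 map with scale 0.5 and three aspect ratios
p = make_priors(3, 0.5, [1.0, 2.0, 0.5])
print(p.shape)  # (27, 4): 3 x 3 locations x 3 priors each
```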

Prior and the Target box

However, these predefined boxes can't be the final result; they must be adjusted to get a more precise localization. So the prediction convolutions are trained to produce the offsets dx, dy from the prior's position and the ratios dw, dh to the prior's size.
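Decoding a prediction back into an absolute box then just applies those offsets and ratios to the prior. The sketch below uses the common SSD parameterization (the size ratios go through an exponential so any real-valued prediction yields a positive width and height); the variance scaling terms used in practice are omitted:

```python
import math

def decode(prior, offsets):
    """Turn predicted offsets into an absolute box.
    Boxes are (cx, cy, w, h); the parameterization is:
      cx = p_cx + dx * p_w,  cy = p_cy + dy * p_h
      w  = p_w * exp(dw),    h  = p_h * exp(dh)
    """
    p_cx, p_cy, p_w, p_h = prior
    dx, dy, dw, dh = offsets
    return (p_cx + dx * p_w, p_cy + dy * p_h,
            p_w * math.exp(dw), p_h * math.exp(dh))

# Zero offsets reproduce the prior itself.
box = decode((0.5, 0.5, 0.2, 0.3), (0.0, 0.0, 0.0, 0.0))
print(box)  # (0.5, 0.5, 0.2, 0.3)
```

Note that dx and dy are measured in units of the prior's width and height, which keeps the regression targets at a similar magnitude across prior sizes.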

The Final form of Prediction Convolution Layer's Result

 

For class prediction, just like the localization prediction, convolution layers are trained to compute class confidences for every feature-map location and every prior.
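In terms of tensor shapes, a prediction convolution over an f×f map with k priors per location emits 4k localization channels and (n_classes·k) class-score channels, which are then flattened to one row per prior. A shapes-only sketch (the concrete numbers here match SSD300's conv8_2 map, but the code is illustrative):

```python
import numpy as np

f, k, n_classes = 10, 6, 21  # e.g. conv8_2 in SSD300; 20 classes + background

# Prediction-conv outputs in (channels, H, W) layout.
loc = np.zeros((4 * k, f, f))           # 4 offsets per prior
cls = np.zeros((n_classes * k, f, f))   # one score per class per prior

# Move channels last, then flatten so each row corresponds to one prior.
loc_per_prior = loc.transpose(1, 2, 0).reshape(-1, 4)
cls_per_prior = cls.transpose(1, 2, 0).reshape(-1, n_classes)
print(loc_per_prior.shape, cls_per_prior.shape)  # (600, 4) (600, 21)
```

Concatenating these rows across all six feature maps yields the full 8732-prior prediction that non-maximum suppression is run on.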

 

This is the basic concept and structure of SSD. I referenced here for details and figures.

 
