Single Shot MultiBox Detector (SSD) implementation for the PASCAL VOC 2012 dataset.
While the general concepts of the vanilla SSD algorithm are maintained, several important differences from, and additions to, the reference implementation are introduced:
- data analysis
- maximum theoretical network recall analysis
- only blocks 2 to 5 of VGG backbone are used for predictions
- configurable network architecture
- use of corner boxes instead of center boxes
- simpler offsets predictions loss
- network operates on original image resolution
A short description of each change follows.
The original paper defined the sizes and prediction layers for default boxes without any reference to the sizes of the objects the network was trying to detect. This work provides a script that analyzes the sizes and aspect ratios of objects in a dataset, which can be used to guide network architecture design.
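As a rough illustration of this kind of analysis, the sketch below computes sizes and aspect ratios for a set of annotations and bins the sizes. It is not the code from data_analysis.py; the function names, the definition of size as the square root of the box area, and the 32-pixel bin width are all assumptions made for the example.

```python
import numpy as np


def summarize_annotations(boxes):
    """Return (sizes, aspect_ratios) for boxes in [min_x, min_y, max_x, max_y] format.

    Size is taken here as sqrt(width * height); this definition is an assumption.
    """
    boxes = np.asarray(boxes, dtype=float)
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    sizes = np.sqrt(widths * heights)
    aspect_ratios = widths / heights
    return sizes, aspect_ratios


def size_histogram(sizes, bin_width=32):
    """Count annotations per size bin, keyed by the lower edge of each bin."""
    bins = (np.asarray(sizes) // bin_width).astype(int)
    counts = np.bincount(bins)
    return {index * bin_width: int(count) for index, count in enumerate(counts) if count > 0}


if __name__ == "__main__":
    sample_boxes = [[0, 0, 40, 40], [10, 10, 110, 60]]
    sizes, aspect_ratios = summarize_annotations(sample_boxes)
    print(size_histogram(sizes), aspect_ratios)
```

Histograms like this one indicate which size ranges the default boxes need to cover and which aspect ratios are worth including.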
Given the sizes and arrangement of default boxes in SSD, it is possible to compute the maximum theoretical recall the network can achieve on a given dataset. Knowing that ceiling helps inform both training and architecture design decisions.
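One way to estimate such a ceiling is to count the annotations that overlap at least one default box strongly enough to be matched. The sketch below does this with NumPy; the function names and the 0.5 IoU matching threshold are assumptions, and the project's own analysis script may define a match differently.

```python
import numpy as np


def iou_matrix(annotations, default_boxes):
    """Pairwise IoU between annotations (n, 4) and default boxes (m, 4).

    Both inputs use [min_x, min_y, max_x, max_y] format and must have non-zero area.
    """
    a = np.asarray(annotations, dtype=float)[:, None, :]    # shape (n, 1, 4)
    d = np.asarray(default_boxes, dtype=float)[None, :, :]  # shape (1, m, 4)

    overlap_w = np.clip(np.minimum(a[..., 2], d[..., 2]) - np.maximum(a[..., 0], d[..., 0]), 0, None)
    overlap_h = np.clip(np.minimum(a[..., 3], d[..., 3]) - np.maximum(a[..., 1], d[..., 1]), 0, None)
    intersection = overlap_w * overlap_h

    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_d = (d[..., 2] - d[..., 0]) * (d[..., 3] - d[..., 1])
    return intersection / (area_a + area_d - intersection)


def theoretical_recall(annotations, default_boxes, iou_threshold=0.5):
    """Fraction of annotations matched by at least one default box."""
    ious = iou_matrix(annotations, default_boxes)
    matched = ious.max(axis=1) >= iou_threshold
    return float(matched.mean())


if __name__ == "__main__":
    defaults = [[0, 0, 32, 32], [32, 0, 64, 32]]
    annotations = [[0, 0, 30, 30], [100, 100, 200, 200]]
    print(theoretical_recall(annotations, defaults))
```

Annotations that no default box can match are never assigned a positive training target, so they bound the recall from above regardless of how well the network is trained.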
From the data analysis one can compute a set of optimal default boxes and their placement for a given dataset. It turns out that for the PASCAL VOC 2012 dataset, using layers above block 5 is not necessary: all objects in the dataset can be detected with boxes placed on earlier layers. The analysis also shows that over 70% of annotations can only be matched by default boxes placed on blocks 2 and 3.
The provided configuration file controls which of the major VGG block outputs are used to construct prediction heads, as well as the sizes and aspect ratios of the default boxes placed on them. No code changes are necessary to adjust the network to the configuration that is optimal for a given dataset.
The original SSD network defines default boxes in [center_x, center_y, width, height] format. This work uses the alternative [min_x, min_y, max_x, max_y] format. The two formats are interchangeable, but the latter is far more popular among computer vision frameworks and easier to work with.
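The interchangeability of the two formats can be shown with a pair of conversion helpers. This is a minimal sketch with hypothetical function names, not code taken from the project.

```python
import numpy as np


def center_to_corner(boxes):
    """Convert [center_x, center_y, width, height] boxes to [min_x, min_y, max_x, max_y]."""
    boxes = np.asarray(boxes, dtype=float)
    center_x, center_y, width, height = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    return np.stack(
        [center_x - width / 2, center_y - height / 2, center_x + width / 2, center_y + height / 2],
        axis=1,
    )


def corner_to_center(boxes):
    """Convert [min_x, min_y, max_x, max_y] boxes to [center_x, center_y, width, height]."""
    boxes = np.asarray(boxes, dtype=float)
    min_x, min_y, max_x, max_y = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    return np.stack(
        [(min_x + max_x) / 2, (min_y + max_y) / 2, max_x - min_x, max_y - min_y],
        axis=1,
    )


if __name__ == "__main__":
    center_boxes = [[50, 50, 20, 40]]
    corner_boxes = center_to_corner(center_boxes)
    print(corner_boxes, corner_to_center(corner_boxes))
```

With corner boxes, intersections and clipping to image bounds reduce to element-wise minimum and maximum operations, which is part of why the format is convenient.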
Offset losses are simply squared-error losses scaled by box sizes, computed in the same fashion for each box coordinate.
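A loss of this kind might look like the sketch below. The exact scaling used by the project is not stated here, so the choice to divide x offsets by the default box width and y offsets by its height is an assumption, as are the function names.

```python
import numpy as np


def encode_offsets(ground_truth_boxes, default_boxes):
    """Offsets from default boxes to matched ground truth boxes, scaled by default box size.

    Both inputs are (n, 4) arrays in [min_x, min_y, max_x, max_y] format.
    Assumption: x offsets are divided by box width, y offsets by box height.
    """
    ground_truth = np.asarray(ground_truth_boxes, dtype=float)
    defaults = np.asarray(default_boxes, dtype=float)
    widths = defaults[:, 2] - defaults[:, 0]
    heights = defaults[:, 3] - defaults[:, 1]
    scales = np.stack([widths, heights, widths, heights], axis=1)
    return (ground_truth - defaults) / scales


def offsets_loss(predicted_offsets, ground_truth_boxes, default_boxes):
    """Mean squared error between predicted and target offsets, same for all coordinates."""
    targets = encode_offsets(ground_truth_boxes, default_boxes)
    errors = np.asarray(predicted_offsets, dtype=float) - targets
    return float(np.mean(errors ** 2))


if __name__ == "__main__":
    defaults = [[0, 0, 10, 20]]
    ground_truth = [[1, 2, 11, 24]]
    print(encode_offsets(ground_truth, defaults))
    print(offsets_loss(np.zeros((1, 4)), ground_truth, defaults))
```

Because every coordinate is treated identically, there is no need for the separate log-space handling of width and height that the center box parameterization typically uses.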
The original SSD scales images to 300x300 or 500x500 resolution. This has several disadvantages, especially for the VOC dataset:
- object aspect ratios might be distorted, and the distortion factor varies across images
- data analysis becomes more difficult, which in turn makes finding an optimal network configuration more difficult
- for most VOC images this rescaling decreases image resolution, making small objects, which are already particularly hard to detect, even smaller
In this work we choose to train and predict on the original image resolution, only adjusting each image so that its dimensions are multiples of 32, which simplifies computation of default box coordinates. This still allows predictions to run at above 30 frames per second on a GeForce GTX 1080 Ti.
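The size adjustment can be sketched as follows. Rounding to the nearest multiple is an assumption (the project could equally pad or crop), and the function names and the stride of 32 in the usage example are hypothetical.

```python
def adjust_to_size_factor(height, width, size_factor=32):
    """Return (height, width) rounded to the nearest multiple of size_factor."""

    def nearest_multiple(value):
        # Keep at least one full size_factor so tiny images do not collapse to zero
        return max(size_factor, int(round(value / size_factor)) * size_factor)

    return nearest_multiple(height), nearest_multiple(width)


def feature_map_shape(height, width, stride):
    """Grid of default box positions for a layer that downsamples by the given stride."""
    return height // stride, width // stride


if __name__ == "__main__":
    target_height, target_width = adjust_to_size_factor(375, 500)
    print(target_height, target_width)
    print(feature_map_shape(target_height, target_width, stride=32))
```

Once both dimensions are multiples of 32, they divide evenly by every power-of-two stride up to 32, so each feature map cell corresponds to a whole number of input pixels and default box coordinates need no rounding.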
The following scripts are provided in the scripts directory:
- data_analysis.py - analyzes sizes and aspect ratios of annotations in a dataset
- model_analysis.py - analyzes performance of a trained model on a dataset
- networks_configuration_analysis.py - analyzes overlap between neighbouring default boxes defined by the network configuration
- networks_theoretical_bounds_analysis.py - analyzes the theoretical recall a network with a given configuration can achieve on a given dataset, and reports sizes and aspect ratios of annotations the network can't detect
- train.py - trains the network
- visualize.py - provides routines to visualize raw data, augmented data, predictions, etc.
Data and model paths, training hyperparameters and other inputs for all scripts are controlled through a configuration file parameter. config.yaml provides a sample configuration.
A Dockerfile for building a container in which the project can be run is provided at ./docker/app.Dockerfile
A helper command, invoke host.build-app-container, is provided for building the container.
You can use invoke host.run-app-container to start the container. It wraps the docker run command, setting useful options such as enabling GPU support and mounting the logs directory.
Once inside the docker container, you can use invoke to run the provided commands. Use invoke --list to see all available commands.
The most frequently used commands are:
- train.train-object-detection-model - for training a model
- analyze.analyze-objects-detections-predictions - for analyzing accuracy of predictions
- visualize.log-predictions - for visualizing predictions
The dataset is not included with this repository. Please download it from the official webpage. Once downloaded, adjust config.yaml so that its relevant section points to the path with the data.
A few sample predictions on VOC 2012 dataset made with a trained model are shown below
Dataset used: PASCAL VOC 2012
Confidence threshold used: 0.5
Recall: 0.458
Precision: 0.723
F1 score: 0.561
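The F1 score above is the harmonic mean of the reported precision and recall, which can be checked with a couple of lines. The function name is chosen for this example only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(round(f1_score(precision=0.723, recall=0.458), 3))  # 0.561
```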
This project can be readily reused with different object detection datasets. In most cases the only changes you would need to make are:
- implement a data loader - look at net.data.VOCSamplesDataLoader for reference
- adjust the configuration file to load data from the appropriate path
Of course, I would then advise using the tools this project provides to define an optimal network configuration for your dataset, going through the
data analysis -> network configuration adjustments -> theoretical network performance analysis -> training -> model performance analysis
loop.