**DriveNet with DIGITS and Caffe, to solve the last assignment of Udacity's Self-Driving Car Nanodegree**
The goals/steps of this project according to the rubric are the following:
- Perform a Histogram of Oriented Gradients (HOG) feature extraction on a labeled training set of images and train a Linear SVM classifier
- Optionally, you can also apply a color transform and append binned color features, as well as histograms of color, to your HOG feature vector.
- Implement a sliding-window technique and use your trained classifier to search for vehicles in images.
- Run your pipeline on a video stream (start with the test_video.mp4 and later implement on full project_video.mp4) and create a heat map of recurring detections frame by frame to reject outliers and follow detected vehicles.
- Estimate a bounding box for vehicles detected.
BUT (and here comes the interesting part), there are many ways to reach a goal. People from the Slack channel have discussed using a neural network to tackle this challenge, because the pipeline with HOG, a classifier and region proposals takes too long, especially for a real-time solution. What is proposed here takes about 0.03 seconds to detect the cars in an image (roughly 30 frames per second). Faster detection means more detections per second, and with some tuning this can easily be deployed on a Jetson TX1.
So the goals I set myself for this challenge are the following:
- Use DriveNet as an architecture to tackle the car detection problem.
- Use DIGITS 5 as the training framework, to facilitate production and to prove the potential of DIGITS for training DNNs; the visualizations it provides are also top notch.
- Instead of the dataset provided by Udacity, use the KITTI dataset because of its size and its potential scalability with Lidar data.
- Use Caffe. Since we've already used Keras and TF, it'll be nice to test another framework, also for learning purposes.
DIGITS is a framework by NVIDIA that facilitates the training of deep neural networks. Currently it works only with Caffe and Torch. Its main advantage is how easy it makes training multiple models, loading new models and visualizing results: you can test a network and see whether it would work in 10 minutes, and then improve on top of that. A glimpse of DIGITS can be seen in the next figure:
Its interface is simple but powerful: it allows dataset selection, model selection and also a pre-trained model store, so that past models can be reused in other assignments. It uses Caffe, and the results can be extended with TensorRT to deploy your models easily.
The dataset used for the assignment is taken from KITTI; having 12 GB of data seemed useful for our purposes, and in the future both Udacity's dataset and KITTI's will be merged to get even better results. The network used to test the Udacity video has never seen any of that data, so the results are interesting to look at.
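DIGITS's object-detection dataset tool expects labels in the KITTI format: one text file per image, one line per object, listing the object type, truncation, occlusion, observation angle, the 2D bounding box (left, top, right, bottom in pixels), the 3D dimensions, the 3D location and the rotation. A made-up example line for one car looks like this:

```
Car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59
```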
DetectNet is a derivation of GoogLeNet, originally used to detect street numbers in an image; but it is not a simple number classifier: it takes a whole image as input and tries to detect all the street numbers in it. Because it doesn't behave like past approaches and doesn't generate proposals (sections of the image to be classified as car or not), it doesn't need a sliding window to generate candidate frames where cars could be, and it is really robust to scaling and occlusion.
For those reasons this network was selected, and a more detailed view can be seen in the next figure:
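To make the "no sliding window" point concrete: DetectNet divides the input image into a coarse grid and, in a single forward pass, predicts for every grid cell both a coverage value (how car-like that cell is) and the coordinates of the box that cell belongs to. A rough sketch of what that raw output looks like in pycaffe; the blob names 'coverage' and 'bboxes' and the 16-pixel stride follow DetectNet's defaults, but treat them as assumptions here:

```python
# For a 1248x384 input and a grid stride of 16 the output grid is 78x24:
#   coverage: shape (1, 24, 78) -- one "car-ness" score per grid cell
#   bboxes:   shape (4, 24, 78) -- a bounding box predicted by each cell
net.forward()
coverage = net.blobs['coverage'].data[0]
bboxes = net.blobs['bboxes'].data[0]
```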
This network was trained for 30 epochs with the weights taken from the street number detection task. It was then fine-tuned for 60 more epochs. No augmentation was performed in either of those stages.
The next image shows the loss over the first 30 epochs; it can be seen that this one is not really accurate yet, but that's expected at the start.
The learning rate for training followed a polynomial decay, starting at 0.0001, as can be seen in the next figure:
After that first run, some fine-tuning was done with 60 more epochs; this used the weights from the previous training, so a lot of improvement was shown. The loss is shown in the next figure:
The learning rate for this one was also a polynomial decay, but with different parameters:
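For reference, Caffe's "poly" learning-rate policy computes the rate at every iteration from the base rate, the maximum number of iterations and a power term. A small sketch of that formula; the max_iter and power values below are only examples, the real parameters are the ones shown in the figures:

```python
def poly_lr(base_lr, iteration, max_iter, power):
    # Caffe's "poly" policy: lr = base_lr * (1 - iter / max_iter) ^ power
    return base_lr * (1.0 - float(iteration) / max_iter) ** power

# The first run started at 0.0001 and decays towards zero, e.g.:
#   poly_lr(0.0001, 0, 30000, 1.0)      -> 0.0001
#   poly_lr(0.0001, 15000, 30000, 1.0)  -> 0.00005
```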
After the training was done (which, due to the size of the images, took a lot of time), the results were great. For a test image:
The result with bounding boxes drawn was the following:
The advantage of DIGITS is that it allows visualization at different stages of the inference, so we can clearly see how the neurons are behaving at every stage, like this:
The last step of that inference generates a heatmap-like layer that looks like this:
These results were satisfactory: since the network had never seen that video before, it appears to generalize well to multiple uses.
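Outside of DIGITS, the same heatmap-like activation can be pulled straight out of pycaffe and overlaid on the frame. A minimal sketch, assuming the final coverage blob is named 'coverage' and that the frame is an 8-bit BGR image at the size the heatmap should be scaled to:

```python
import cv2
import numpy as np

net.forward()
cov = net.blobs['coverage'].data[0, 0]                    # (gridH, gridW) activations
heat = cv2.resize(cov, (frame.shape[1], frame.shape[0]))  # upsample to frame size
heat = (255 * heat / max(float(heat.max()), 1e-6)).astype(np.uint8)
overlay = cv2.addWeighted(frame, 0.6, cv2.applyColorMap(heat, cv2.COLORMAP_JET), 0.4, 0)
```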
Because this is a Caffe model, the following files are needed in order to generate the video:
- A Caffemodel file (.caffemodel)
- A deploy.prototxt file
- The input video
And the script must be called this way:
python CarDetection.py snapshot_iter_19778.caffemodel deploy.prototxt project_video.mp4
NOTE: Caffe must be installed on the machine for this to work. Instructions on how to build Caffe can be found HERE
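Internally, those three inputs have to be turned into a caffe.Net and a caffe.io.Transformer before any frame can be processed. A minimal sketch of that setup; the file names are the ones from the command above, the transpose step is the usual HxWxC to CxHxW conversion, and any further mean/scale preprocessing is omitted here:

```python
import caffe

caffe.set_mode_gpu()

# network description + trained weights -> a net ready for inference
net = caffe.Net('deploy.prototxt', 'snapshot_iter_19778.caffemodel', caffe.TEST)

# the transformer reshapes incoming HxWxC images into the CxHxW blobs Caffe expects
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))
```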
In CarDetection.py, the following functions are worth noting:
import time
import numpy as np

def forward_pass(image, net, transformer, batch_size=None):
    """
    Returns scores for each image as an np.ndarray (nImages x nClasses)

    Arguments:
    image -- an np.ndarray (a single image)
    net -- a caffe.Net
    transformer -- a caffe.io.Transformer

    Keyword arguments:
    batch_size -- how many images can be processed at once
    (a high value may result in out-of-memory errors)
    """
    if batch_size is None:
        batch_size = 1

    caffe_images = []
    if image.ndim == 2:
        # add a channel axis to grayscale images
        caffe_images.append(image[:, :, np.newaxis])
    else:
        caffe_images.append(image)

    dims = transformer.inputs['data'][1:]

    scores = None
    for chunk in [caffe_images[x:x + batch_size] for x in xrange(0, len(caffe_images), batch_size)]:
        # reshape the input blob to match the size of this batch
        new_shape = (len(chunk),) + tuple(dims)
        if net.blobs['data'].data.shape != new_shape:
            net.blobs['data'].reshape(*new_shape)
        for index, image in enumerate(chunk):
            image_data = transformer.preprocess('data', image)
            net.blobs['data'].data[index] = image_data
        start = time.time()
        output = net.forward()[net.outputs[-1]]
        end = time.time()
        if scores is None:
            scores = np.copy(output)
        else:
            scores = np.vstack((scores, output))
        print 'Processed %s/%s images in %f seconds ...' % (len(scores), len(caffe_images), (end - start))

    return scores
That performs the forward pass of an image through the network and measures the inference time.
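As a usage sketch (the file name is just an example), a single frame loaded with caffe.io.load_image, which returns a float HxWx3 array, goes through it like this:

```python
frame = caffe.io.load_image('test_frame.png')
scores = forward_pass(frame, net, transformer, batch_size=1)
```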
Once we get the results from the network we use this function to draw the bounding boxes:
import cv2

def draw_bboxes(image, locations):
    """
    Draws the bounding boxes into an image

    Arguments:
    image -- a single image already resized
    locations -- the location of the bounding boxes
    """
    for left, top, right, bottom, confidence in locations:
        # skip detections that the clustering step zeroed out
        if confidence == 0:
            continue
        cv2.rectangle(image, (left, top), (right, bottom), (255, 0, 0), 3)
    # cv2.imwrite('bbox.png', image)  # test on a single image
    return image
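Tying the two together: the clustered network output is assumed here to come back as rows of [left, top, right, bottom, confidence] per image, so annotating a frame looks roughly like this (a sketch, not the exact code in CarDetection.py):

```python
detections = forward_pass(frame, net, transformer)
# cv2.rectangle wants integer pixel coordinates; keep the confidence as-is
locations = [[int(l), int(t), int(r), int(b), c] for l, t, r, b, c in detections[0]]
annotated = draw_bboxes(frame, locations)
```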
A clustering is done on the outputs to vote on the bounding boxes to be drawn:
def forward(self, bottom, top):
    for i in xrange(self.num_classes):
        data0 = bottom[0].data[:, i:i+1, :, :]
        bbox = cluster(self, data0, bottom[1].data)
        top[i].data[...] = bbox
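The cluster call is the part that turns the coverage grid and the per-cell boxes into a handful of final detections. The underlying idea can be sketched as follows (a simplification, not the actual implementation used here): threshold the coverage map, collect the box predicted by every activated cell, and merge overlapping candidates so that isolated spurious cells get voted out.

```python
import numpy as np
import cv2

def cluster_sketch(coverage, bboxes, stride=16, threshold=0.6):
    # coverage: (gridH, gridW) car-ness scores; bboxes: (4, gridH, gridW) box coordinates
    rects = []
    for y, x in zip(*np.where(coverage > threshold)):
        left, top, right, bottom = bboxes[:, y, x]
        # box corners are predicted relative to the cell's position in the image
        left, right = left + x * stride, right + x * stride
        top, bottom = top + y * stride, bottom + y * stride
        rects.append([int(left), int(top), int(right - left), int(bottom - top)])
    # duplicating the list keeps cv2.groupRectangles from discarding lone candidates
    merged, _ = cv2.groupRectangles(rects * 2, groupThreshold=1, eps=0.5)
    return merged  # rows of [x, y, w, h] for the voted bounding boxes
```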
To test the model with the video, please call:
python CarDetection.py snapshot_iter_19778.caffemodel deploy.prototxt project_video.mp4
Videos of both detections can be seen here:
- For the project video, there is some struggle with the white color, but this is expected to be resolved in the future.
- For the harder challenge, it performs even better and detects cars from the other lane as well.
This project was exciting, especially because it drove me to use something different in order to get better results and faster times. The idea of using a complex network, yet having it evaluate each image in 0.03 seconds on average, is amazing. I am looking forward to improving this even further with the help of TensorRT, and also to augmenting the dataset and playing with a segmentation network to see how it performs, and whether there are improvements in both time and accuracy.
Thanks to the Udacity Slack channel for all the help and ideas. Also to NVIDIA, which provided me with the TX1 to use in the future deployment of this network.