Whiteboard content-based summarization and video indexing. Video demo link: https://youtu.be/Zaze4Jtx7ig
- Upload the video you want to summarize to Google Colab and generate annotations for it using the Colab notebook notebooks/objectdetection/whiteboard_content_detection.ipynb
- Place the annotations in backend/annotaions
- Set up the project in an IDE like Spyder or PyCharm
- Install the Python dependencies
- Run project
First, the video is split into frames by sampling at 1 fps.
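As a rough illustration of 1 fps sampling, the sketch below computes which frame indices to keep from a video with a given native frame rate. The function name `sample_frame_indices` and its parameters are hypothetical and not taken from the project code; reading the actual frames (for example with OpenCV) is left out.

```python
def sample_frame_indices(frame_count, native_fps, target_fps=1.0):
    """Return the indices of the frames kept when sampling at target_fps.

    Hypothetical helper: picks the frame closest to each sampling instant.
    """
    if native_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Number of native frames between two sampled frames (never below 1,
    # so that no frame index is emitted twice).
    step = max(1.0, native_fps / target_fps)
    indices = []
    k = 0
    while True:
        idx = int(round(k * step))
        if idx >= frame_count:
            break
        indices.append(idx)
        k += 1
    return indices


# A 3 second clip recorded at 30 fps keeps one frame per second.
print(sample_frame_indices(90, 30.0))
```

Rounding to the nearest frame keeps the samples close to whole seconds even for non-integer frame rates such as 29.97 fps.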
The frames are then annotated using an object detection model that I trained. Using Facebook's Detectron2 system, I trained a Faster R-CNN model on a dataset of annotated whiteboard content. The dataset was prepared by selecting 550 annotated images from a public repository, making sure that different lighting conditions were included. The model was trained for 1500 iterations, after which it reached an average precision of 0.83. This was enough for the project because the later stages compensate for detection errors. After annotation, each frame has an associated list of bounding boxes for the written content on the board.
The frames are then binarized. First, a bilateral filter and a median filter are applied to each frame to generate a background mask. This mask is then subtracted from the image, which gets rid of most of the background and the lecturer in the frame. The resulting image is then binarized using Otsu's binarization technique.
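To show what the final Otsu step does, here is a minimal pure-Python sketch that picks the threshold maximizing the between-class variance of an 8-bit grayscale image. The filtering and background subtraction steps are not covered, and the function names are hypothetical. In practice OpenCV offers the same thresholding through `cv2.threshold` with the `cv2.THRESH_OTSU` flag.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for an iterable of integers in 0..255."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with value <= t
    sum0 = 0    # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor.
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(image):
    """Map pixels above the Otsu threshold to 1 and all others to 0."""
    t = otsu_threshold(p for row in image for p in row)
    return [[1 if p > t else 0 for p in row] for row in image]


image = [
    [10, 10, 200],
    [10, 200, 10],
    [200, 10, 10],
]
print(binarize(image))
```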
Spatial groups are parts of the written content that occupy the same space on the board. They are determined by putting all bounding boxes with an IoU (intersection over union) greater than 0.5 into the same spatial group. Bounding boxes in the same spatial group are further split into temporal groups if there is a difference of 10 seconds between them.
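The grouping described above can be sketched as follows. This is an illustration under assumptions rather than the project's actual code: detections are `(time_in_seconds, (x1, y1, x2, y2))` tuples, each new box is compared against the first box of every existing spatial group, and a temporal group is split when the gap between consecutive detections exceeds 10 seconds. All function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def spatial_groups(detections, iou_threshold=0.5):
    """Group detections whose boxes overlap the group's first box."""
    groups = []
    for det in detections:
        for group in groups:
            if iou(group[0][1], det[1]) > iou_threshold:
                group.append(det)
                break
        else:
            groups.append([det])
    return groups


def temporal_groups(group, max_gap=10):
    """Split one spatial group wherever consecutive detections are
    more than max_gap seconds apart."""
    result = []
    for det in sorted(group, key=lambda d: d[0]):
        if result and det[0] - result[-1][-1][0] <= max_gap:
            result[-1].append(det)
        else:
            result.append([det])
    return result


detections = [
    (0, (0, 0, 10, 10)),
    (1, (0, 0, 10, 10)),
    (2, (100, 100, 120, 120)),
    (30, (1, 0, 10, 10)),
]
for group in spatial_groups(detections):
    print([[t for t, _ in tg] for tg in temporal_groups(group)])
```

Comparing against the first box of a group is the simplest choice; comparing against a running average box would be a reasonable alternative.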
Object detection is not error-free, and in some frames the lecturer obstructs the written content. To tackle this problem, the image crops from the bounding boxes in the same temporal group are averaged to give one reconstructed image for every temporal group.
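A minimal sketch of the averaging step, assuming every crop in a temporal group has already been resized to the same height and width and is stored as a 2D list of grayscale values. The name `average_crops` is hypothetical.

```python
def average_crops(crops):
    """Return the pixel-wise mean of equally sized grayscale crops."""
    if not crops:
        raise ValueError("at least one crop is required")
    height, width = len(crops[0]), len(crops[0][0])
    for crop in crops:
        if len(crop) != height or any(len(row) != width for row in crop):
            raise ValueError("all crops must have the same shape")
    n = len(crops)
    return [
        [sum(crop[y][x] for crop in crops) / n for x in range(width)]
        for y in range(height)
    ]


# Two clean crops and one fully occluded crop: the content still
# dominates the average.
crops = [
    [[255, 0], [0, 255]],
    [[255, 0], [0, 255]],
    [[0, 0], [0, 0]],
]
print(average_crops(crops))
```

Because an occluded or badly detected crop contributes only a fraction of each pixel value, its effect shrinks as more frames are added to the group.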
Another problem is that the same content can get split into different temporal groups if it is obstructed by the lecturer for over 10 seconds. This is rectified by comparing the reconstructed images of consecutive temporal groups using perceptual hashing. If the two images are similar, the groups are merged.
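The text does not say which perceptual hash or similarity threshold is used, so the sketch below is only one possible instance. It assumes an average hash (downscale to an 8x8 grid, then compare each cell with the mean) and a hypothetical `max_distance` of 5 differing bits. All function names are illustrative.

```python
def downscale(image, size=8):
    """Shrink a 2D grayscale image to size x size by block averaging."""
    h, w = len(image), len(image[0])
    out = []
    for i in range(size):
        y0 = i * h // size
        y1 = max((i + 1) * h // size, y0 + 1)
        row = []
        for j in range(size):
            x0 = j * w // size
            x1 = max((j + 1) * w // size, x0 + 1)
            block = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def average_hash(image, size=8):
    """Return a list of bits: 1 where a cell is brighter than the mean."""
    cells = [p for row in downscale(image, size) for p in row]
    mean = sum(cells) / len(cells)
    return [1 if p > mean else 0 for p in cells]


def hamming(a, b):
    """Number of positions at which two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def should_merge(img_a, img_b, max_distance=5):
    """Assumed rule: merge when the hashes differ in few bits."""
    return hamming(average_hash(img_a), average_hash(img_b)) <= max_distance


left_bright = [[255] * 8 + [0] * 8 for _ in range(16)]
right_bright = [[0] * 8 + [255] * 8 for _ in range(16)]
print(should_merge(left_bright, left_bright))
print(should_merge(left_bright, right_bright))
```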
Finally, to generate the summary I used the approach proposed by this paper. Any two temporal groups within the same spatial group are considered to be in conflict. A split interval is a time interval in which splitting the video resolves the conflict. All the split intervals are found, and the interval that resolves the most conflicts is chosen for splitting the video and a split index is recorded. This is repeated until there are no more conflicts. Then one summary image is generated for each interval marked by the split indices. A reconstructed image is added to the summary image if it has not already been added to any image or if it exists for over 50% of the duration of the interval.
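The greedy conflict resolution can be sketched as below. This is an interpretation under assumptions, not the paper's or the project's exact procedure: temporal groups are `(spatial_id, start, end)` tuples, the split interval for two groups in the same spatial group runs from the end of the earlier one to the start of the later one, and each round picks the time point that lies inside the most unresolved split intervals. Assembling the summary images is not shown, and all names are hypothetical.

```python
def split_intervals(groups):
    """Return one (lo, hi) split interval per pair of conflicting groups."""
    by_space = {}
    for spatial_id, start, end in groups:
        by_space.setdefault(spatial_id, []).append((start, end))
    intervals = []
    for spans in by_space.values():
        spans.sort()
        for i in range(len(spans)):
            for j in range(i + 1, len(spans)):
                lo, hi = spans[i][1], spans[j][0]
                if lo <= hi:  # skip pairs that overlap in time
                    intervals.append((lo, hi))
    return intervals


def choose_splits(intervals):
    """Greedily pick split times until every conflict is resolved."""
    remaining = [iv for iv in intervals if iv[0] <= iv[1]]
    splits = []
    while remaining:
        # The best point always coincides with some interval start.
        candidates = sorted({lo for lo, _ in remaining})
        best = max(
            candidates,
            key=lambda t: sum(lo <= t <= hi for lo, hi in remaining),
        )
        splits.append(best)
        remaining = [(lo, hi) for lo, hi in remaining if not lo <= best <= hi]
    return sorted(splits)


groups = [
    ("A", 0, 10), ("A", 20, 30), ("A", 40, 50),
    ("B", 5, 15), ("B", 25, 35),
]
print(choose_splits(split_intervals(groups)))
```

In this example a split at 15 seconds resolves three of the four conflicts at once, and a second split at 30 seconds resolves the remaining one.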