Abstract
Background: Technological advancement has improved people's quality of life, and it has also led people to produce large amounts of data in the form of text, images, and videos. Methods for analyzing and summarizing this data are therefore needed to cope with storage constraints. Video summaries can be generated either from keyframes or from skims/shots. Here, keyframe extraction is based on deep learning-based object detection techniques. Various object detection algorithms were reviewed for generating and selecting the best possible frames as keyframes. A set of frames is extracted from the original video sequence and, depending on the technique used, one or more frames of the set are selected as keyframes, which then become part of the summarized video. This paper discusses the selection of various keyframe extraction techniques in detail.
Methods: This paper focuses on summary generation for office surveillance videos, with the summaries built using various keyframe extraction techniques. For this purpose, models such as MobileNet, SSD, and YOLO are used for object detection. A comparative analysis of their efficiency showed that YOLO performs better than the other models. Keyframe selection techniques such as sufficient content change, maximum frame coverage, minimum correlation, curve simplification, and clustering based on human presence in the frame were implemented.
Results: Variable-length and fixed-length video summaries were generated and analyzed for each keyframe selection technique on office surveillance videos. The analysis shows that the output video obtained with the clustering and curve simplification approaches is compressed to half the size of the original video and requires considerably less storage space. The technique that selects keyframes based on the change in frame content between consecutive frames produces the best output for office surveillance videos.
Conclusion: In this paper, we discussed the process of generating a synopsis of a video that highlights the important portions and discards the trivial and redundant parts. First, we described object detection algorithms such as YOLO and SSD, used in conjunction with neural networks such as MobileNet, to obtain a probability score for an object being present in the video. For every frame of the input video, these algorithms generate the probability that a person appears in the image. The results of object detection are then passed to keyframe extraction algorithms to obtain the summarized video. Our comparative analysis of keyframe selection techniques for office videos can help determine which technique is preferable.
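As a rough illustration of how per-frame detection scores could feed a content-change style of keyframe selection, the following minimal Python sketch keeps a frame whenever its person-probability score differs from that of the last selected keyframe by more than a threshold. The function name select_keyframes, the threshold value 0.3, the sample scores, and the use of a single score per frame as the content descriptor are all assumptions made for illustration; they are not taken from the paper's implementation.

```python
from typing import List, Sequence


def select_keyframes(person_scores: Sequence[float], threshold: float = 0.3) -> List[int]:
    """Return indices of frames whose score differs from the last
    selected keyframe by more than `threshold`.

    `person_scores` is assumed to hold one person-detection probability
    per frame (e.g. produced by a detector such as YOLO); the threshold
    value is a hypothetical choice, not one reported in the paper.
    """
    if len(person_scores) == 0:
        return []
    keyframes = [0]  # the first frame always opens the summary
    last_score = person_scores[0]
    for index in range(1, len(person_scores)):
        # Keep the frame only if its content changed enough since the last keyframe.
        if abs(person_scores[index] - last_score) > threshold:
            keyframes.append(index)
            last_score = person_scores[index]
    return keyframes


# Illustrative scores: a person enters at frame 2 and leaves at frame 5.
scores = [0.05, 0.06, 0.90, 0.92, 0.88, 0.10, 0.12]
print(select_keyframes(scores))  # prints [0, 2, 5]
```

Comparing against the last selected keyframe rather than the immediately preceding frame is one possible design choice; it avoids missing gradual changes that never exceed the threshold between any two adjacent frames.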
Keywords: Video summarization, keyframe extraction, multi-view summarization, object detection, YOLO, SSD.
Graphical Abstract