Video summarization of person of interest (POI)
Main Author: | |
---|---|
Other Authors: | |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2022 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/156530 |
Institution: | Nanyang Technological University |
Summary: | With the growing volume of video content, there is a greater need to manage these digital media. Video summarization aims to create a succinct and comprehensive synopsis by selecting key details from video media [1]. Most video summarization models summarize the entire video without first trimming it to the key details, which may lead to excess information being provided. Conventional models also produce only one summary statement for the entire video, which often results in a very broad description of the activities in it.
The first contribution of this project is an enhanced video summarization model that provides additional information centered on a particular person of interest (POI). A new pipeline is developed for video summarization of a POI using deep-learning-based methods, providing further insights into the POI's face, actions and clothing. Face detection and recognition are first used to identify the POI within the video. Once the POI has been identified by face recognition, clothes descriptors are applied to the POI to determine what clothing they are wearing. Finally, the video is trimmed to include only the parts containing the POI, so that the summary more precisely captures the key activities the POI is involved in.
Multiple state-of-the-art face detection, mask classification and face recognition models have been explored and integrated into the new pipeline to achieve this goal. Convolutional Neural Networks (CNNs), such as ResNet 50, are used for classification and a Multi-Task Cascaded Convolutional Network (MTCNN) is used for face recognition, while the object detection model You Only Look Once (YOLO) is used for human extraction. K-means clustering is used to extract the colors of the POI's clothes.
The second contribution is an improvement in the accuracy with which the individual components extract and classify objects, which justifies the selections made for the pipeline. Face detection using DLIB alone achieved an accuracy of 88.2%, whereas the enhanced facial recognition model, which combines the Multi-Task Cascaded Convolutional Network (MTCNN) with face detection using DLIB, achieved an accuracy of 94.9%, an increase of about 6.7 percentage points in overall accuracy. The mask classification model trained using ResNet 50 achieved an accuracy of 98.11%. An overall evaluation of the model and its use cases concludes the report, along with possible further extensions such as real-time video detection and optimisation of the descriptors.
Keywords: Convolutional Neural Network, Face Detection, Face Recognition, Object Detection, Video Summarization, You Only Look Once |
---|
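
The summary mentions K-means clustering for extracting the colors of the POI's clothes. The sketch below illustrates the general idea only and is not the project's implementation: it assumes the clothing region has already been cropped into an RGB array, and the function name `dominant_colors` and its parameters are hypothetical. It uses a small NumPy-only K-means to stay self-contained; a library implementation could be substituted.

```python
import numpy as np


def dominant_colors(crop, k=3, iters=20, seed=0):
    """Cluster the pixels of an RGB crop and return (centers, shares).

    Hypothetical helper for illustration. `centers` holds up to k RGB
    cluster centers sorted from most to least common; `shares` is the
    fraction of pixels assigned to each center.
    """
    pixels = np.asarray(crop, dtype=np.float64).reshape(-1, 3)
    rng = np.random.default_rng(seed)

    # Start from k distinct pixel values so no two centers coincide.
    unique = np.unique(pixels, axis=0)
    k = min(k, len(unique))
    centers = unique[rng.choice(len(unique), size=k, replace=False)]

    def assign(current):
        # Index of the nearest center for every pixel (squared distance).
        dists = ((pixels[:, None, :] - current[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    for _ in range(iters):
        labels = assign(centers)
        # Move each center to the mean of the pixels assigned to it.
        new_centers = centers.copy()
        for j in range(k):
            members = pixels[labels == j]
            if len(members):
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    # Final assignment against the final centers, then rank by cluster size.
    labels = assign(centers)
    counts = np.bincount(labels, minlength=k)
    order = np.argsort(-counts)
    return centers[order], counts[order] / counts.sum()


if __name__ == "__main__":
    # Synthetic "clothing crop": 70% reddish pixels, 30% bluish pixels.
    crop = np.zeros((10, 10, 3), dtype=np.uint8)
    crop[:7] = (200, 30, 30)
    crop[7:] = (20, 40, 180)
    centers, shares = dominant_colors(crop, k=2)
    print(centers.round().astype(int), shares)
```

In a pipeline like the one described, the crop would presumably come from the detected person region, and the top-ranked center could be mapped to a color name for the clothing description; both steps are assumptions here and are not detailed in this record.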