Visual recognition using deep learning (video captioning using deep learning)

Bibliographic Details
Main Author: Thong, Jing Lin
Other Authors: Yap Kim Hui
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2021
Subjects:
Online Access:https://hdl.handle.net/10356/148774
Institution: Nanyang Technological University
Description
Summary: Video captioning refers to the process of conveying the content of video clips through automatically generated natural language sentences. The unprecedented success of deep learning approaches in Computer Vision and Natural Language Processing has spurred significant progress in video captioning research. Video captioning now has extensive applications in video surveillance, video subtitling and human-robot interaction. Most existing video captioning methods adopt a pure encoder-decoder framework, where the encoder extracts video features and the decoder generates captions. However, even though current state-of-the-art models achieve high scores on the evaluation metrics, a significant proportion of the generated captions still do not accurately describe the visual content of the videos. In this project, a comprehensive survey was conducted to identify and compare the performance of existing state-of-the-art models. A deep learning model was then developed that equips the basic encoder-decoder framework with enhanced visual reasoning capacity by incorporating additional spatio-temporal reasoning modules. In addition, because the encoder-decoder framework only leverages the progressive (forward) flow of information to generate sentences from extracted video features, an additional layer was developed to establish the reverse flow: video features are regenerated from the generated sentences and compared against the original video features. Thereafter, reinforcement learning techniques were used to further optimise the model. Extensive experiments on benchmark datasets demonstrate that the overall model outperforms existing state-of-the-art methods and improves the quality of generated captions. Moreover, a user-friendly web application was designed using the Django framework to deploy the developed deep learning model. This web application allows users to upload selected videos and generate captions for them. Furthermore, a robust text-based search function was developed to allow users to search for their videos by entering key search terms. The report contains the design of the model, experimental results, considerations in designing the web application, a systematic guide from the user's perspective, and details of the integration of the video captioning model into the web application. It concludes with a discussion of the final results and possible future extensions.
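
To illustrate the reverse-flow idea described in the summary, the following is a minimal, hypothetical PyTorch sketch of an encoder-decoder video captioner with an added reconstruction branch that maps the decoder's hidden states back into the video feature space, so a feature-reconstruction loss can be combined with the usual captioning loss. The module names, dimensions, pooling choice and loss weighting below are illustrative assumptions only, not the model actually developed in this project.

# Illustrative sketch (assumed architecture, not the author's code): an
# encoder-decoder captioner with a "reverse flow" reconstruction branch.
import torch
import torch.nn as nn

class CaptionerWithReconstruction(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, hidden_dim=512, embed_dim=300):
        super().__init__()
        # Encoder: models temporal structure of pre-extracted per-frame CNN features.
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # Decoder: generates the caption word by word from the encoded video state.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.word_head = nn.Linear(hidden_dim, vocab_size)
        # Reconstructor (reverse flow): maps decoder hidden states back to the
        # video feature space so they can be compared with the original features.
        self.reconstructor = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.feat_head = nn.Linear(hidden_dim, feat_dim)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T, feat_dim); captions: (B, L) token ids
        enc_out, enc_state = self.encoder(frame_feats)
        dec_out, _ = self.decoder(self.embed(captions), enc_state)
        word_logits = self.word_head(dec_out)          # caption prediction
        rec_out, _ = self.reconstructor(dec_out)
        rec_feats = self.feat_head(rec_out)            # reconstructed video features
        return word_logits, rec_feats

# Hypothetical training step: cross-entropy captioning loss plus a mean-pooled
# feature-reconstruction loss weighted by lambda_rec.
def training_step(model, frame_feats, captions, pad_idx=0, lambda_rec=0.2):
    word_logits, rec_feats = model(frame_feats, captions[:, :-1])
    cap_loss = nn.functional.cross_entropy(
        word_logits.reshape(-1, word_logits.size(-1)),
        captions[:, 1:].reshape(-1),
        ignore_index=pad_idx,
    )
    rec_loss = nn.functional.mse_loss(
        rec_feats.mean(dim=1), frame_feats.mean(dim=1)
    )
    return cap_loss + lambda_rec * rec_loss

In practice, such a model would first be trained with this combined loss and could then be fine-tuned with reinforcement learning (for example, sequence-level reward optimisation on a captioning metric), which is the kind of further optimisation the summary refers to.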