Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature

Bibliographic Details
Main Authors: Xu, Yuecong, Yang, Jianfei, Mao, Kezhi
Other Authors: School of Electrical and Electronic Engineering
Format: Article
Language:English
Published: 2021
Subjects:
Online Access:https://hdl.handle.net/10356/151341
Institution: Nanyang Technological University
Description
Summary:Automatic video description, or video captioning, is a challenging yet highly attractive task that aims to bridge video and natural language. Multiple methods based on neural networks have been proposed, utilizing Convolutional Neural Networks (CNN) to extract features and Recurrent Neural Networks (RNN) to encode and decode videos for description generation. Previously, a number of methods for the video captioning task were motivated by image captioning approaches. However, videos carry far more information than images, which increases the difficulty of the video captioning task. Current methods commonly lack the ability to utilize this additional information, especially the semantic and structural information of the videos. To address this shortcoming, we propose a Semantic-Filtered Soft-Split-Aware-Gated LSTM (SF-SSAG-LSTM) model, which improves video captioning quality by combining semantic concepts with audio-augmented features extracted from input videos, while understanding the underlying structure of the input videos. In our experiments, we quantitatively evaluate the performance of our model, which matches that of other prominent methods on three benchmark datasets. We also qualitatively examine the results of our model and show that our generated descriptions are more detailed and logical.
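
For orientation only, below is a minimal PyTorch sketch of the generic pipeline the summary describes: per-frame CNN features fused with an audio feature vector and decoded by an LSTM into caption tokens. All names, dimensions, and the fusion strategy are illustrative assumptions; this is not the authors' SF-SSAG-LSTM, which additionally applies semantic filtering and soft-split-aware gating.

    # Illustrative sketch only: generic CNN-feature + audio fusion + LSTM decoder.
    # Dimensions and fusion choices are assumptions, not the paper's design.
    import torch
    import torch.nn as nn

    class VideoCaptioner(nn.Module):
        def __init__(self, visual_dim=2048, audio_dim=128,
                     hidden_dim=512, vocab_size=10000):
            super().__init__()
            # Fuse visual and audio features into one video representation.
            self.fuse = nn.Linear(visual_dim + audio_dim, hidden_dim)
            self.embed = nn.Embedding(vocab_size, hidden_dim)
            self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, visual_feats, audio_feats, captions):
            # visual_feats: (batch, frames, visual_dim) from a pretrained CNN
            # audio_feats:  (batch, audio_dim), broadcast across frames
            frames = visual_feats.size(1)
            audio = audio_feats.unsqueeze(1).expand(-1, frames, -1)
            video = self.fuse(torch.cat([visual_feats, audio], dim=-1))
            # Mean-pool frames into the LSTM's initial hidden state.
            h0 = video.mean(dim=1).unsqueeze(0)
            c0 = torch.zeros_like(h0)
            # Teacher-forced decoding over ground-truth caption tokens.
            emb = self.embed(captions)
            hidden, _ = self.decoder(emb, (h0, c0))
            return self.out(hidden)  # (batch, seq_len, vocab_size) logits

    # Usage with random stand-in features:
    model = VideoCaptioner()
    visual = torch.randn(2, 20, 2048)        # 2 clips, 20 frames each
    audio = torch.randn(2, 128)              # one audio vector per clip
    caps = torch.randint(0, 10000, (2, 12))  # caption token ids
    print(model(visual, audio, caps).shape)  # torch.Size([2, 12, 10000])

In this sketch the audio vector is simply concatenated to every frame feature before a linear fusion layer; the paper's audio-augmented feature and gating mechanisms are more involved.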