Neural image and video captioning

Bibliographic Details
Main Author: Lam, Ting En
Other Authors: Hanwang Zhang
Format: Final Year Project
Language:English
Published: Nanyang Technological University 2024
Subjects:
Online Access:https://hdl.handle.net/10356/175286
Institution: Nanyang Technological University
Description
Summary: In today’s digital age, the proliferation of visual content has underscored the critical importance of multimedia comprehension and interpretation. Video conveys information through both images and sound. This project introduces a novel approach to video captioning, leveraging synergies between Machine Learning, Computer Vision, and Natural Language Processing to bridge the gap between human and computer understanding of visual content by generating descriptive captions. In this project, the effectiveness of various image captioning models is evaluated to identify optimal frameworks for textual description generation. Subsequently, a video captioning model capable of generating multimodal captions for video content is developed. The proposed image and video captioning models are evaluated using standard metrics, and a human evaluation study is conducted. Additionally, the models are deployed in a user-friendly application. Overall, this study seeks to improve video captioning performance and foster further advancements in the field.