From an image to a text description of the image
Information technology is changing rapidly. Multimedia video, with its rich information content, diverse presentation, and convenient transmission and storage, is quickly replacing traditional paper text, and the volume of video data is growing explosively. Faced with this vast sea of news video, quickly and accurately retrieving and storing video information has become a pressing problem. Video conveys information through both images and sound. To exploit the visual channel, a visual summary of a broadcast news video can first be recovered by extracting the video's important frames, yielding a collection of images that represents the video's visual content well. Image captioning is then used to assign relevant descriptions to the extracted keyframes. Meanwhile, the video's audio track is extracted and processed, since not only the speech content itself but also the background sound indicates the news content. This project implements a fully automated video captioning system designed specifically for broadcast news video. The proposed system uses shot-boundary detection to extract important frames, and a CLIP prefix + GPT-2 model to caption each frame. The system's accuracy is measured on the MS COCO dataset and compared to the current state of the art in image captioning. Also presented is a method for evaluating the generated video captions against a set of annotated keyframes.
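The pipeline's first step is shot-boundary detection to pick keyframes. The record contains no code, but a minimal sketch of one common approach (HSV-histogram correlation between consecutive frames, using OpenCV) might look like the following; the 0.6 threshold is an illustrative value, not a parameter taken from the project:

```python
# Minimal sketch: histogram-based shot-boundary keyframe extraction.
# The correlation threshold is illustrative, not the project's tuned value.
import cv2

def extract_keyframes(video_path, threshold=0.6):
    """Return one representative frame per detected shot."""
    cap = cv2.VideoCapture(video_path)
    keyframes, prev_hist = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        # A low correlation with the previous frame's histogram signals a
        # shot boundary; keep the first frame of the new shot as its keyframe.
        if prev_hist is None or cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
            keyframes.append(frame)
        prev_hist = hist
    cap.release()
    return keyframes
```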
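For captioning, the abstract names a CLIP prefix + GPT-2 model, i.e. the ClipCap idea: a CLIP image embedding is mapped to a short sequence of prefix embeddings that condition GPT-2's decoding. The sketch below uses a random, untrained linear layer as the mapping network purely to show the data flow; the actual system would load a mapper trained on MS COCO:

```python
# Sketch of CLIP-prefix captioning (ClipCap-style). The mapping network
# here is an untrained placeholder; a real system loads trained weights.
import torch
import torch.nn as nn
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()

PREFIX_LEN = 10                  # number of prefix token embeddings
mapper = nn.Linear(512, PREFIX_LEN * gpt2.config.n_embd)  # untrained stand-in

@torch.no_grad()
def caption_image(path, max_tokens=30):
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    clip_emb = clip_model.encode_image(image).float()       # (1, 512)
    # Project the CLIP embedding to a sequence of GPT-2 prefix embeddings.
    embeds = mapper(clip_emb).view(1, PREFIX_LEN, -1)
    ids = []
    for _ in range(max_tokens):                             # greedy decoding
        logits = gpt2(inputs_embeds=embeds).logits[:, -1, :]
        next_id = logits.argmax(dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
        ids.append(next_id.item())
        next_emb = gpt2.transformer.wte(next_id).unsqueeze(1)
        embeds = torch.cat([embeds, next_emb], dim=1)
    return tokenizer.decode(ids)
```

A design note on why this architecture is attractive: in ClipCap, CLIP stays frozen and only the mapper (and optionally GPT-2) is trained, which keeps training lightweight compared with end-to-end captioning models.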
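Accuracy on MS COCO is conventionally reported with metrics such as BLEU, METEOR, ROUGE-L, and CIDEr, each generated caption being scored against the dataset's several reference captions per image. As a hedged illustration (the project's actual evaluation script is not part of this record, and the captions below are made up), a sentence-level BLEU check with NLTK could look like:

```python
# Illustrative sketch: sentence-level BLEU against multiple reference
# captions, as in MS COCO evaluation. The captions are invented examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [                   # COCO supplies ~5 reference captions per image
    "a news anchor sits at a desk in a studio".split(),
    "a man reading the news behind a desk".split(),
]
hypothesis = "a news anchor sitting at a studio desk".split()

smooth = SmoothingFunction().method1
score = sentence_bleu(references, hypothesis, smoothing_function=smooth)
print(f"BLEU-4: {score:.3f}")
```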
Saved in:
Main Author: | Liu, Yanli |
---|---|
Other Authors: | Chng Eng Siong |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2022 |
Subjects: | Engineering::Computer science and engineering |
Online Access: | https://hdl.handle.net/10356/156521 |
Institution: | Nanyang Technological University |
id | sg-ntu-dr.10356-156521 |
---|---|
record_format | dspace |
school | School of Computer Science and Engineering |
supervisor | Chng Eng Siong (ASESChng@ntu.edu.sg) |
degree | Bachelor of Engineering (Computer Science) |
citation | Liu, Y. (2022). From an image to a text description of the image. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/156521 |
project_code | SCSE21-0061 |
deposited | 2022-04-19 |
file_format | application/pdf |
institution | Nanyang Technological University |
building | NTU Library |
continent | Asia |
country | Singapore |
content_provider | NTU Library |
collection | DR-NTU |
language | English |
topic | Engineering::Computer science and engineering |
author | Liu, Yanli |
author2 | Chng Eng Siong |
format | Final Year Project |
title | From an image to a text description of the image |
publisher | Nanyang Technological University |
publishDate | 2022 |
url | https://hdl.handle.net/10356/156521 |