Improved image captioning techniques with comparative study
Image captioning is the process of generating a sentence that describes a given image. This project improves the Neural Baby Talk model by identifying the source of its errors and resolving them by readjusting the weights of the problematic subnetwork. A comparative study with Image Captioning Transforme...
Saved in:
Main Author: He, Cari
Other Authors: Lee Bu Sung, Francis
Format: Final Year Project
Language: English
Published: Nanyang Technological University, 2021
Subjects: Engineering::Computer science and engineering
Online Access: https://hdl.handle.net/10356/153187
Institution: Nanyang Technological University
Language: English
id: sg-ntu-dr.10356-153187
record_format: dspace
spelling: sg-ntu-dr.10356-153187 2021-11-15T08:25:44Z Improved image captioning techniques with comparative study He, Cari Lee Bu Sung, Francis School of Computer Science and Engineering DSO National Laboratories Xiao Xuhong Mak Lee Onn EBSLEE@ntu.edu.sg Engineering::Computer science and engineering Image captioning is the process of generating a sentence that describes a given image. This project improves the Neural Baby Talk model by identifying the source of its errors and resolving them by readjusting the weights of the problematic subnetwork. A comparative study with Image Captioning Transformer and OSCAR is conducted, and the former is determined to be the most practical for industry applications. The Image Captioning Transformer is further improved by using features extracted by VinVL. YOLOv4 was also evaluated for its ability to extract features for image captioning, but it underperformed. Image captioning is typically done by first extracting features from an image using an object detector trained for image captioning; the features are then passed into a captioning model to generate the caption. Captioning models were investigated first, followed by the methods for feature extraction. The Neural Baby Talk, Image Captioning Transformer and OSCAR captioning models were explored and evaluated based on their performance and speed. The Neural Baby Talk model was analysed based on its output captions, and the problems were traced back to a subnetwork that refines the coarse labels provided by its object detector into fine-grained words. This subnetwork was first bypassed to verify that it was the cause of the issues, after which its weights were adjusted, improving performance across all metrics to 74.0 on BLEU-1, 32.6 on BLEU-4, 25.7 on METEOR, 101.6 on CIDEr and 19.1 on SPICE. While OSCAR produced the best performance, Image Captioning Transformer had the best speed-to-performance trade-off for industrial applications. Hence, Image Captioning Transformer is used for further experiments. To extract features from an image, an object detector is selected and then typically trained for the purpose of image captioning. The methods used to train the Faster R-CNN object detector include Bottom-Up Attention and VinVL. Originally, Image Captioning Transformer uses features from Bottom-Up Attention. To compare the two methods, image features from VinVL are instead fed into Image Captioning Transformer, which improved its performance on all metrics to 81.2 on BLEU-1, 40.4 on BLEU-4, 28.9 on METEOR, 131.4 on CIDEr and 22.7 on SPICE. The YOLOv4 object detector is also evaluated on its ability to produce robust features from an image for image captioning. To use YOLOv4 for image captioning, the Visual Genome dataset is thoroughly cleaned and used to retrain YOLOv4 for this purpose. However, the performance of the Image Captioning Transformer dropped when incorporating the features extracted by the YOLOv4 object detector. Hence, after all experiments, Image Captioning Transformer with VinVL features may be the best in terms of both speed and performance. Bachelor of Science in Data Science and Artificial Intelligence 2021-11-15T08:25:44Z 2021-11-15T08:25:44Z 2021 Final Year Project (FYP) He, C. (2021). Improved image captioning techniques with comparative study. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/153187 https://hdl.handle.net/10356/153187 en SCSE20-1120 application/pdf Nanyang Technological University
institution: Nanyang Technological University
building: NTU Library
continent: Asia
country: Singapore
content_provider: NTU Library
collection: DR-NTU
language: English
topic: Engineering::Computer science and engineering
spellingShingle: Engineering::Computer science and engineering He, Cari Improved image captioning techniques with comparative study
description:
Image captioning is the process of generating a sentence that describes a given image. This project improves the Neural Baby Talk model by identifying the source of its errors and resolving them by readjusting the weights of the problematic subnetwork. A comparative study with Image Captioning Transformer and OSCAR is conducted, and the former is determined to be the most practical for industry applications. The Image Captioning Transformer is further improved by using features extracted by VinVL. YOLOv4 was also evaluated for its ability to extract features for image captioning, but it underperformed.
Image captioning is typically done by first extracting features from an image using an object detector trained for image captioning; the features are then passed into a captioning model to generate the caption. Captioning models were investigated first, followed by the methods for feature extraction.
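For illustration, the sketch below mirrors this two-stage layout with hypothetical stand-in modules (a dummy region "detector" and a small transformer head). It is not the project's Faster R-CNN or Image Captioning Transformer, only the shape of the pipeline under those assumptions.

```python
# Two-stage captioning pipeline sketch: region features from a detector stand-in,
# then a captioning stand-in that attends over those features. All modules here
# are hypothetical placeholders, not the models used in the project.
import torch
import torch.nn as nn

class DummyRegionDetector(nn.Module):
    """Stands in for an object detector that returns N region features per image."""
    def __init__(self, num_regions=36, feat_dim=2048):
        super().__init__()
        self.num_regions = num_regions
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=7, stride=32, padding=3)

    def forward(self, images):                    # images: (B, 3, H, W)
        fmap = self.backbone(images)              # (B, D, h, w)
        feats = fmap.flatten(2).transpose(1, 2)   # (B, h*w, D): treat grid cells as "regions"
        return feats[:, : self.num_regions]

class DummyCaptioner(nn.Module):
    """Stands in for a transformer captioning model consuming region features."""
    def __init__(self, feat_dim=2048, d_model=512, vocab_size=10000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, region_feats):
        x = self.encoder(self.proj(region_feats))
        return self.head(x)                       # per-region vocabulary logits; a real
                                                  # captioner decodes tokens autoregressively

detector, captioner = DummyRegionDetector(), DummyCaptioner()
with torch.no_grad():
    logits = captioner(detector(torch.randn(2, 3, 224, 224)))
print(logits.shape)                               # torch.Size([2, 36, 10000])
```

The split is what lets the feature extractor (Bottom-Up Attention, VinVL, YOLOv4) be swapped later without changing the captioning model beyond its input projection.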
The Neural Baby Talk, Image Captioning Transformer and OSCAR captioning models were explored and evaluated based on their performance and speed. The Neural Baby Talk model was analysed based on its output captions, and the problems were traced back to a subnetwork that refines the coarse labels provided by its object detector into fine-grained words. This subnetwork was first bypassed to verify that it was the cause of the issues, after which its weights were adjusted, improving performance across all metrics to 74.0 on BLEU-1, 32.6 on BLEU-4, 25.7 on METEOR, 101.6 on CIDEr and 19.1 on SPICE. While OSCAR produced the best performance, Image Captioning Transformer had the best speed-to-performance trade-off for industrial applications. Hence, Image Captioning Transformer is used for further experiments.
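The bypass-then-retune procedure can be illustrated on a toy model as below. The `refine` submodule is a hypothetical stand-in for Neural Baby Talk's fine-grained refinement subnetwork, and the single training step is schematic, not the project's actual training setup.

```python
# Hedged sketch of the diagnosis-and-retuning procedure on a toy captioner.
import torch
import torch.nn as nn

class ToyCaptioner(nn.Module):
    def __init__(self, d=256, vocab=1000):
        super().__init__()
        self.encode = nn.Linear(d, d)      # rest of the network (frozen later)
        self.refine = nn.Linear(d, d)      # suspected problematic subnetwork (hypothetical name)
        self.head = nn.Linear(d, vocab)

    def forward(self, region_feats):
        return self.head(self.refine(torch.relu(self.encode(region_feats))))

model = ToyCaptioner()

# Step 1: bypass the suspected subnetwork to confirm it is the source of the errors.
model.refine = nn.Identity()
# ... regenerate captions here and check whether the symptoms disappear ...

# Step 2: restore the subnetwork and readjust only its weights, freezing the rest.
model.refine = nn.Linear(256, 256)
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("refine")

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
feats, targets = torch.randn(8, 256), torch.randint(0, 1000, (8,))
loss = nn.functional.cross_entropy(model(feats), targets)
loss.backward()
optimizer.step()
print("trainable:", [n for n, p in model.named_parameters() if p.requires_grad])
```

Freezing everything except the suspect module keeps the rest of the pretrained model intact while its weights are readjusted, which is the spirit of the fix described above.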
To extract features from an image, an object detector is selected and then typically trained for the purpose of image captioning. The methods used to train the Faster R-CNN object detector include Bottom-Up Attention and VinVL. Originally, Image Captioning Transformer uses features from Bottom-Up Attention. To compare the two methods, image features from VinVL are instead fed into Image Captioning Transformer, which improved its performance on all metrics to 81.2 on BLEU-1, 40.4 on BLEU-4, 28.9 on METEOR, 131.4 on CIDEr and 22.7 on SPICE.
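A minimal sketch of what such a feature swap involves is shown below, assuming region features have been pre-extracted per image. The feature widths (2048 for Bottom-Up Attention, 2054 for VinVL with box coordinates appended) and the stub captioner are assumptions for illustration; random tensors stand in for the real extracted features so the sketch is self-contained.

```python
# Feature-swap sketch: the captioning model only depends on the shape of the
# region-feature tensor, so extractors can be exchanged behind a common interface.
import torch
import torch.nn as nn

def fake_region_features(source, num_regions=36):
    """Stand-in for loading pre-extracted region features of one image.
    Only the feature width differs between the assumed extractors."""
    feat_dim = {"bottom_up": 2048, "vinvl": 2054}[source]   # assumed typical widths
    return torch.randn(num_regions, feat_dim)

class CaptionerStub(nn.Module):
    """Stands in for Image Captioning Transformer: only the input projection has
    to match the feature dimensionality of whichever extractor is used."""
    def __init__(self, feat_dim, d_model=512, vocab=10000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, feats):                                # (num_regions, feat_dim)
        return self.head(torch.relu(self.proj(feats)))

# Swapping extractors changes only which features are loaded and the projection size;
# the captioning model itself is otherwise unchanged.
for source in ("bottom_up", "vinvl"):
    feats = fake_region_features(source)
    model = CaptionerStub(feat_dim=feats.shape[-1])
    print(source, model(feats).shape)                        # (36, vocab)
```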
The YOLOv4 object detector is also evaluated on its ability to produce robust features from an image for image captioning. To use YOLOv4 for image captioning, the Visual Genome dataset is thoroughly cleaned and used to retrain YOLOv4 for this purpose. However, the performance of the Image Captioning Transformer dropped when incorporating the features extracted by the YOLOv4 object detector. Hence, after all experiments, Image Captioning Transformer with VinVL features may be the best in terms of both speed and performance.
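The sketch below shows the kind of Visual Genome clean-up and conversion such retraining requires, assuming the standard objects.json / image_data.json layout of the Visual Genome release. The kept class list, filtering rules, and file paths are illustrative assumptions, not the project's actual cleaning procedure.

```python
# Hedged sketch: filter Visual Genome object annotations to a cleaned class
# vocabulary and write Darknet/YOLO-format labels for detector retraining.
import json
from collections import Counter
from pathlib import Path

KEEP_CLASSES = ["person", "car", "dog", "tree", "building"]   # hypothetical cleaned vocabulary
CLASS_TO_ID = {name: i for i, name in enumerate(KEEP_CLASSES)}

def clean_and_convert(objects_json, image_data_json, out_dir):
    """Keep boxes whose name is in the cleaned vocabulary and write YOLO labels
    (class cx cy w h, normalised to [0, 1]), one .txt file per image."""
    with open(image_data_json) as f:
        sizes = {img["image_id"]: (img["width"], img["height"]) for img in json.load(f)}
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    kept = Counter()

    with open(objects_json) as f:
        annotations = json.load(f)
    for entry in annotations:
        width, height = sizes[entry["image_id"]]
        lines = []
        for obj in entry["objects"]:
            name = obj["names"][0].lower().strip() if obj.get("names") else ""
            if name not in CLASS_TO_ID:
                continue                                  # drop rare or noisy labels
            # Clip to the image, then convert corner (x, y, w, h) to normalised centre form.
            x, y = max(0, obj["x"]), max(0, obj["y"])
            w, h = min(obj["w"], width - x), min(obj["h"], height - y)
            if w <= 0 or h <= 0:
                continue                                  # drop degenerate boxes
            cx, cy = (x + w / 2) / width, (y + h / 2) / height
            lines.append(f"{CLASS_TO_ID[name]} {cx:.6f} {cy:.6f} {w / width:.6f} {h / height:.6f}")
            kept[name] += 1
        if lines:
            (out / f"{entry['image_id']}.txt").write_text("\n".join(lines) + "\n")
    return kept

# Usage (paths are placeholders): clean_and_convert("objects.json", "image_data.json", "labels/")
```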
author2: Lee Bu Sung, Francis
author_facet: Lee Bu Sung, Francis He, Cari
format: Final Year Project
author: He, Cari
author_sort: He, Cari
title: Improved image captioning techniques with comparative study
title_short: Improved image captioning techniques with comparative study
title_full: Improved image captioning techniques with comparative study
title_fullStr: Improved image captioning techniques with comparative study
title_full_unstemmed: Improved image captioning techniques with comparative study
title_sort: improved image captioning techniques with comparative study
publisher: Nanyang Technological University
publishDate: 2021
url: https://hdl.handle.net/10356/153187
_version_: 1718368049998856192