Improved image captioning techniques with comparative study
Image captioning is the process of generating a sentence that describes a given image. This project improves the Neural Baby Talk model by identifying the source of its errors and resolving them by readjusting the weights of the problematic subnetwork. A comparative study with Image Captioning Transforme...
Saved in:
Main Author: He, Cari
Other Authors: Lee Bu Sung, Francis
Format: Final Year Project
Language: English
Published: Nanyang Technological University, 2021
Subjects: Engineering::Computer science and engineering
Online Access: https://hdl.handle.net/10356/153187
Institution: Nanyang Technological University
Language: English
id: sg-ntu-dr.10356-153187
record_format: dspace
spelling: sg-ntu-dr.10356-153187 2021-11-15T08:25:44Z Improved image captioning techniques with comparative study He, Cari Lee Bu Sung, Francis School of Computer Science and Engineering DSO National Laboratories Xiao Xuhong Mak Lee Onn EBSLEE@ntu.edu.sg Engineering::Computer science and engineering Image captioning is the process of generating a sentence that describes a given image. This project improves the Neural Baby Talk model by identifying the source of its errors and resolving them by readjusting the weights of the problematic subnetwork. A comparative study with Image Captioning Transformer and OSCAR is conducted, and the former is determined to be the most practical for industry applications. The Image Captioning Transformer is further improved by using features extracted by VinVL. YOLOv4 was also evaluated for its ability to extract features for image captioning, but it underperformed. Image captioning is typically done by first extracting features from an image using an object detector trained for image captioning; the features are then passed into a captioning model to generate the caption. Captioning models were investigated first, followed by the methods for feature extraction. The Neural Baby Talk, Image Captioning Transformer and OSCAR captioning models were explored and evaluated based on their performance and speed. The Neural Baby Talk model was analysed based on its output captions, and the problems were traced back to a subnetwork that refines the coarse labels provided by its object detector into fine-grained words. This subnetwork was first bypassed to verify that it was the cause of the issues, after which its weights were adjusted, improving performance across all metrics to 74.0 on BLEU-1, 32.6 on BLEU-4, 25.7 on METEOR, 101.6 on CIDEr and 19.1 on SPICE. While OSCAR produced the best performance, Image Captioning Transformer had the best speed-to-performance trade-off for industrial applications. Hence, Image Captioning Transformer is used for further experiments. To extract features from an image, an object detector is selected and then typically trained for the purpose of image captioning. The methods used to train the Faster R-CNN object detector include Bottom-Up Attention and VinVL. Originally, Image Captioning Transformer uses features from Bottom-Up Attention. To compare the two methods, image features from VinVL are instead fed into Image Captioning Transformer, which improved its performance on all metrics to 81.2 on BLEU-1, 40.4 on BLEU-4, 28.9 on METEOR, 131.4 on CIDEr and 22.7 on SPICE. The YOLOv4 object detector is also evaluated on its ability to produce robust features from an image for image captioning. To use YOLOv4 for image captioning, the Visual Genome dataset is thoroughly cleaned and used to retrain YOLOv4 for this purpose. However, the performance of the Image Captioning Transformer dropped when incorporating the features extracted by the YOLOv4 object detector. Hence, after all experiments, Image Captioning Transformer with VinVL features may be the best in terms of both speed and performance. Bachelor of Science in Data Science and Artificial Intelligence 2021-11-15T08:25:44Z 2021-11-15T08:25:44Z 2021 Final Year Project (FYP) He, C. (2021). Improved image captioning techniques with comparative study. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/153187 https://hdl.handle.net/10356/153187 en SCSE20-1120 application/pdf Nanyang Technological University
institution: Nanyang Technological University
building: NTU Library
continent: Asia
country: Singapore
content_provider: NTU Library
collection: DR-NTU
language: English
topic: Engineering::Computer science and engineering
spellingShingle: Engineering::Computer science and engineering He, Cari Improved image captioning techniques with comparative study
description:
Image captioning is the process of generating a sentence that describes a given image. This project improves the Neural Baby Talk model by identifying the source of its errors and resolving them by readjusting the weights of the problematic subnetwork. A comparative study with Image Captioning Transformer and OSCAR is conducted, and the former is determined to be the most practical for industry applications. The Image Captioning Transformer is further improved by using features extracted by VinVL. YOLOv4 was also evaluated for its ability to extract features for image captioning, but it underperformed.
Image captioning is typically done by first extracting features from an image using an object detector trained for image captioning; the features are then passed into a captioning model to generate the caption. Captioning models were investigated first, followed by the methods for feature extraction.
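For illustration, the sketch below mirrors this two-stage layout with hypothetical stand-in modules (a dummy region "detector" and a small transformer head). It is not the project's Faster R-CNN or Image Captioning Transformer, only the shape of the pipeline under those assumptions.

```python
# Two-stage captioning pipeline sketch: region features from a detector stand-in,
# then a captioning stand-in that attends over those features. All modules here
# are hypothetical placeholders, not the models used in the project.
import torch
import torch.nn as nn

class DummyRegionDetector(nn.Module):
    """Stands in for an object detector that returns N region features per image."""
    def __init__(self, num_regions=36, feat_dim=2048):
        super().__init__()
        self.num_regions = num_regions
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=7, stride=32, padding=3)

    def forward(self, images):                    # images: (B, 3, H, W)
        fmap = self.backbone(images)              # (B, D, h, w)
        feats = fmap.flatten(2).transpose(1, 2)   # (B, h*w, D): treat grid cells as "regions"
        return feats[:, : self.num_regions]

class DummyCaptioner(nn.Module):
    """Stands in for a transformer captioning model consuming region features."""
    def __init__(self, feat_dim=2048, d_model=512, vocab_size=10000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, region_feats):
        x = self.encoder(self.proj(region_feats))
        return self.head(x)                       # per-region vocabulary logits; a real
                                                  # captioner decodes tokens autoregressively

detector, captioner = DummyRegionDetector(), DummyCaptioner()
with torch.no_grad():
    logits = captioner(detector(torch.randn(2, 3, 224, 224)))
print(logits.shape)                               # torch.Size([2, 36, 10000])
```

The split is what lets the feature extractor (Bottom-Up Attention, VinVL, YOLOv4) be swapped later without changing the captioning model beyond its input projection.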
The Neural Baby Talk, Image Captioning Transformer and OSCAR captioning models were explored and evaluated based on their performance and speed. The Neural Baby Talk model was analysed based on its output captions, and the problems were traced back to a subnetwork that refines the coarse labels provided by its object detector into fine-grained words. This subnetwork was first bypassed to verify that it was the cause of the issues, after which its weights were adjusted, improving performance across all metrics to 74.0 on BLEU-1, 32.6 on BLEU-4, 25.7 on METEOR, 101.6 on CIDEr and 19.1 on SPICE. While OSCAR produced the best performance, Image Captioning Transformer had the best speed-to-performance trade-off for industrial applications. Hence, Image Captioning Transformer is used for further experiments.
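The bypass-then-retune procedure can be illustrated on a toy model as below. The `refine` submodule is a hypothetical stand-in for Neural Baby Talk's fine-grained refinement subnetwork, and the single training step is schematic, not the project's actual training setup.

```python
# Hedged sketch of the diagnosis-and-retuning procedure on a toy captioner.
import torch
import torch.nn as nn

class ToyCaptioner(nn.Module):
    def __init__(self, d=256, vocab=1000):
        super().__init__()
        self.encode = nn.Linear(d, d)      # rest of the network (frozen later)
        self.refine = nn.Linear(d, d)      # suspected problematic subnetwork (hypothetical name)
        self.head = nn.Linear(d, vocab)

    def forward(self, region_feats):
        return self.head(self.refine(torch.relu(self.encode(region_feats))))

model = ToyCaptioner()

# Step 1: bypass the suspected subnetwork to confirm it is the source of the errors.
model.refine = nn.Identity()
# ... regenerate captions here and check whether the symptoms disappear ...

# Step 2: restore the subnetwork and readjust only its weights, freezing the rest.
model.refine = nn.Linear(256, 256)
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("refine")

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
feats, targets = torch.randn(8, 256), torch.randint(0, 1000, (8,))
loss = nn.functional.cross_entropy(model(feats), targets)
loss.backward()
optimizer.step()
print("trainable:", [n for n, p in model.named_parameters() if p.requires_grad])
```

Freezing everything except the suspect module keeps the rest of the pretrained model intact while its weights are readjusted, which is the spirit of the fix described above.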
To extract features from an image, an object detector is selected and then typically trained for the purpose of image captioning. The methods used to train the Faster R-CNN object detector include Bottom-Up Attention and VinVL. Originally, Image Captioning Transformer uses features from Bottom-Up Attention. To compare the two methods, image features from VinVL are instead fed into Image Captioning Transformer, which improved its performance on all metrics to 81.2 on BLEU-1, 40.4 on BLEU-4, 28.9 on METEOR, 131.4 on CIDEr and 22.7 on SPICE.
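A minimal sketch of what such a feature swap involves is shown below, assuming region features have been pre-extracted per image. The feature widths (2048 for Bottom-Up Attention, 2054 for VinVL with box coordinates appended) and the stub captioner are assumptions for illustration; random tensors stand in for the real extracted features so the sketch is self-contained.

```python
# Feature-swap sketch: the captioning model only depends on the shape of the
# region-feature tensor, so extractors can be exchanged behind a common interface.
import torch
import torch.nn as nn

def fake_region_features(source, num_regions=36):
    """Stand-in for loading pre-extracted region features of one image.
    Only the feature width differs between the assumed extractors."""
    feat_dim = {"bottom_up": 2048, "vinvl": 2054}[source]   # assumed typical widths
    return torch.randn(num_regions, feat_dim)

class CaptionerStub(nn.Module):
    """Stands in for Image Captioning Transformer: only the input projection has
    to match the feature dimensionality of whichever extractor is used."""
    def __init__(self, feat_dim, d_model=512, vocab=10000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, feats):                                # (num_regions, feat_dim)
        return self.head(torch.relu(self.proj(feats)))

# Swapping extractors changes only which features are loaded and the projection size;
# the captioning model itself is otherwise unchanged.
for source in ("bottom_up", "vinvl"):
    feats = fake_region_features(source)
    model = CaptionerStub(feat_dim=feats.shape[-1])
    print(source, model(feats).shape)                        # (36, vocab)
```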
The YOLOv4 object detector is also evaluated on its ability to produce robust features from an image for image captioning. To use YOLOv4 for image captioning, the Visual Genome dataset is thoroughly cleaned and used to retrain YOLOv4 for this purpose. However, the performance of the Image Captioning Transformer dropped when incorporating the features extracted by the YOLOv4 object detector. Hence, after all experiments, Image Captioning Transformer with VinVL features may be the best in terms of both speed and performance.
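The sketch below shows the kind of Visual Genome clean-up and conversion such retraining requires, assuming the standard objects.json / image_data.json layout of the Visual Genome release. The kept class list, filtering rules, and file paths are illustrative assumptions, not the project's actual cleaning procedure.

```python
# Hedged sketch: filter Visual Genome object annotations to a cleaned class
# vocabulary and write Darknet/YOLO-format labels for detector retraining.
import json
from collections import Counter
from pathlib import Path

KEEP_CLASSES = ["person", "car", "dog", "tree", "building"]   # hypothetical cleaned vocabulary
CLASS_TO_ID = {name: i for i, name in enumerate(KEEP_CLASSES)}

def clean_and_convert(objects_json, image_data_json, out_dir):
    """Keep boxes whose name is in the cleaned vocabulary and write YOLO labels
    (class cx cy w h, normalised to [0, 1]), one .txt file per image."""
    with open(image_data_json) as f:
        sizes = {img["image_id"]: (img["width"], img["height"]) for img in json.load(f)}
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    kept = Counter()

    with open(objects_json) as f:
        annotations = json.load(f)
    for entry in annotations:
        width, height = sizes[entry["image_id"]]
        lines = []
        for obj in entry["objects"]:
            name = obj["names"][0].lower().strip() if obj.get("names") else ""
            if name not in CLASS_TO_ID:
                continue                                  # drop rare or noisy labels
            # Clip to the image, then convert corner (x, y, w, h) to normalised centre form.
            x, y = max(0, obj["x"]), max(0, obj["y"])
            w, h = min(obj["w"], width - x), min(obj["h"], height - y)
            if w <= 0 or h <= 0:
                continue                                  # drop degenerate boxes
            cx, cy = (x + w / 2) / width, (y + h / 2) / height
            lines.append(f"{CLASS_TO_ID[name]} {cx:.6f} {cy:.6f} {w / width:.6f} {h / height:.6f}")
            kept[name] += 1
        if lines:
            (out / f"{entry['image_id']}.txt").write_text("\n".join(lines) + "\n")
    return kept

# Usage (paths are placeholders): clean_and_convert("objects.json", "image_data.json", "labels/")
```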
author2: Lee Bu Sung, Francis
author_facet: Lee Bu Sung, Francis He, Cari
format: Final Year Project
author: He, Cari
author_sort: He, Cari
title: Improved image captioning techniques with comparative study
title_short: Improved image captioning techniques with comparative study
title_full: Improved image captioning techniques with comparative study
title_fullStr: Improved image captioning techniques with comparative study
title_full_unstemmed: Improved image captioning techniques with comparative study
title_sort: improved image captioning techniques with comparative study
publisher: Nanyang Technological University
publishDate: 2021
url: https://hdl.handle.net/10356/153187
_version_: 1718368049998856192