Improved image captioning techniques with comparative study

Bibliographic Details
Main Author: He, Cari
Other Authors: Lee Bu Sung, Francis
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2021
Subjects:
Online Access: https://hdl.handle.net/10356/153187
Institution: Nanyang Technological University
Description
Summary: Image captioning is the process of generating a sentence that describes a given image. This project improves the Neural Baby Talk model by identifying the source of its errors and resolving them by readjusting the weights of the problematic subnetwork. A comparative study with the Image Captioning Transformer and OSCAR is conducted, and the Image Captioning Transformer is determined to be the most practical for industry applications. It is further improved by using features extracted with VinVL. YOLOv4 was also tested for its ability to extract features for image captioning, but it underperformed.

Image captioning is typically done in two stages: features are first extracted from an image using an object detector trained for image captioning, and these features are then passed into a captioning model to generate the caption. Captioning models were investigated before the feature extraction methods. The Neural Baby Talk, Image Captioning Transformer and OSCAR captioning models were explored and evaluated on their performance and speed. Analysis of Neural Baby Talk's output captions traced its problems to a subnetwork that refines the coarse labels provided by its object detector into fine-grained words. This subnetwork was first bypassed to verify that it was the cause of the issues, and its weights were then adjusted, improving performance across all metrics to 74.0 BLEU-1, 32.6 BLEU-4, 25.7 METEOR, 101.6 CIDEr and 19.1 SPICE. While OSCAR produced the best performance, the Image Captioning Transformer had the best speed-to-performance trade-off for industrial applications, so it is used for the further experiments.

To extract features from an image, an object detector is selected and typically trained for the purpose of image captioning. The methods used to train the Faster R-CNN object detector include Bottom-Up Attention and VinVL. The Image Captioning Transformer originally uses features from Bottom-Up Attention. To compare the two methods, image features from VinVL are fed into the Image Captioning Transformer, which improves its performance on all metrics to 81.2 BLEU-1, 40.4 BLEU-4, 28.9 METEOR, 131.4 CIDEr and 22.7 SPICE. The YOLOv4 object detector is also tested for its ability to produce robust features for image captioning: the Visual Genome dataset is thoroughly cleaned and used to retrain YOLOv4 for this purpose. However, the performance of the Image Captioning Transformer drops when it incorporates the features extracted by YOLOv4. Hence, after all experiments, it is found that the Image Captioning Transformer with VinVL features may be the best in terms of both speed and performance.
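
As a concrete illustration of the two-stage pipeline the summary describes (detector region features feeding a Transformer-based captioner), below is a minimal PyTorch sketch. The class name, dimensions and toy vocabulary are illustrative assumptions and not the project's actual code.

```python
import torch
import torch.nn as nn


class RegionFeatureCaptioner(nn.Module):
    """Hypothetical captioner: detector region features -> Transformer decoder -> word logits."""

    def __init__(self, feat_dim=2048, d_model=512, vocab_size=1000, num_layers=3):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)        # project detector features into model space
        self.embed = nn.Embedding(vocab_size, d_model)  # word embeddings for the decoder input
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.out = nn.Linear(d_model, vocab_size)       # predict the next word at each position

    def forward(self, region_feats, caption_tokens):
        # region_feats:   [B, N, feat_dim] region features from an object detector
        #                 (e.g. Faster R-CNN trained with Bottom-Up Attention or VinVL)
        # caption_tokens: [B, T] word ids of the caption generated so far
        memory = self.proj(region_feats)
        tgt = self.embed(caption_tokens)
        T = caption_tokens.size(1)
        causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.out(hidden)  # [B, T, vocab_size] logits over the vocabulary


# Toy forward pass with 36 regions of 2048-d features, the common Bottom-Up Attention setup.
model = RegionFeatureCaptioner()
feats = torch.randn(2, 36, 2048)
tokens = torch.randint(0, 1000, (2, 12))
print(model(feats, tokens).shape)  # torch.Size([2, 12, 1000])
```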
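The metrics quoted in the summary (BLEU, METEOR, CIDEr, SPICE) are the standard COCO caption metrics. The sketch below shows how generated captions are typically scored with the pycocoevalcap package, using BLEU and CIDEr (METEOR and SPICE expose the same compute_score interface but require Java); the example captions here are made up.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# Reference and generated captions keyed by image id (already tokenised and lower-cased).
gts = {"img1": ["a dog runs across a grassy field", "a brown dog running on grass"],
       "img2": ["two people ride bicycles down a street"]}
res = {"img1": ["a dog running on the grass"],
       "img2": ["a person riding a bicycle on a street"]}

bleu, _ = Bleu(4).compute_score(gts, res)    # list of BLEU-1 .. BLEU-4
cider, _ = Cider().compute_score(gts, res)   # corpus-level CIDEr

print("BLEU-1: %.3f  BLEU-4: %.3f  CIDEr: %.3f" % (bleu[0], bleu[3], cider))
```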