Improved image captioning techniques with comparative study
| Main Author: | |
| --- | --- |
| Other Authors: | |
| Format: | Final Year Project |
| Language: | English |
| Published: | Nanyang Technological University, 2021 |
| Subjects: | |
| Online Access: | https://hdl.handle.net/10356/153187 |
| Institution: | Nanyang Technological University |
Summary: Image captioning is the process of generating a sentence that describes a given image. This project improves the Neural Baby Talk model by identifying the source of its errors and resolving them by readjusting the weights of the problematic subnetwork. A comparative study with the Image Captioning Transformer and OSCAR is conducted, and the former is determined to be the most practical for industry applications. The Image Captioning Transformer is further improved by using features extracted with VinVL. YOLOv4 was also evaluated for its ability to extract features for image captioning, but it underperformed.

Image captioning is typically done in two stages: features are first extracted from an image with an object detector trained for image captioning, then the features are passed into a captioning model to generate the caption. Captioning models were investigated first, followed by the methods for feature extraction.
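
The two-stage structure can be sketched as follows. The modules below are randomly initialised stand-ins rather than the actual Faster R-CNN detector or captioning models used in this project; the region count, feature dimension and greedy decoding are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class RegionFeatureExtractor(nn.Module):
    """Stand-in for an object detector that returns one feature vector per region."""
    def __init__(self, num_regions=36, feat_dim=2048):
        super().__init__()
        self.num_regions = num_regions
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=7, stride=32)

    def forward(self, image):
        fmap = self.backbone(image)              # (B, C, H', W')
        feats = fmap.flatten(2).transpose(1, 2)  # (B, H'*W', C)
        return feats[:, : self.num_regions]      # keep the first k "regions"

class CaptionDecoder(nn.Module):
    """Stand-in for a captioning model that consumes region features."""
    def __init__(self, feat_dim=2048, vocab_size=10000, hidden=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, region_feats, max_len=20):
        context = self.proj(region_feats).mean(dim=1, keepdim=True)  # pooled image context
        out, _ = self.decoder(context.repeat(1, max_len, 1))
        return self.head(out).argmax(-1)                             # greedy token ids

image = torch.randn(1, 3, 448, 448)
features = RegionFeatureExtractor()(image)   # stage 1: feature extraction
token_ids = CaptionDecoder()(features)       # stage 2: caption generation
print(token_ids.shape)                       # torch.Size([1, 20])
```
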
The Neural Baby Talk, Image Captioning Transformer and OSCAR captioning models were explored and evaluated based on their performance and speed. The Neural Baby Talk model was analysed based on its output captions, and the problems were traced back to a subnetwork that refines the coarse labels provided by its object detector into fine-grained words. This subnetwork was first bypassed to verify that it was the cause of the issues, then its weights were adjusted, improving performance across all metrics to 74.0 on BLEU 1, 32.6 on BLEU 4, 25.7 on METEOR, 101.6 on CIDEr and 19.1 on SPICE. While OSCAR produced the best performance, the Image Captioning Transformer had the best speed-to-performance trade-off for industrial applications. Hence, the Image Captioning Transformer is used for further experiments.
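
A minimal sketch of that debugging strategy is given below: bypass the suspect subnetwork to confirm it is the error source, then rescale its weights rather than remove it. The tiny model, the module name `fine_grained_refiner` and the scaling factor are hypothetical illustrations, not the actual Neural Baby Talk components or values.

```python
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    def __init__(self, dim=16, num_fine_words=100):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)
        # Subnetwork that refines a coarse detector label into a fine-grained word.
        self.fine_grained_refiner = nn.Linear(dim, num_fine_words)
        self.coarse_head = nn.Linear(dim, 10)   # coarse detector-label logits

    def forward(self, x, bypass_refiner=False):
        h = torch.relu(self.encoder(x))
        if bypass_refiner:
            # Diagnostic mode: skip the suspect subnetwork and emit the coarse label directly.
            return self.coarse_head(h)
        return self.fine_grained_refiner(h)

model = TinyCaptioner()

# Step 1: run with the refiner bypassed and compare caption quality against the full model.
coarse_logits = model(torch.randn(2, 16), bypass_refiner=True)

# Step 2: once the refiner is confirmed as the culprit, readjust its weights
# (here an illustrative rescaling) instead of removing it, then re-evaluate.
with torch.no_grad():
    model.fine_grained_refiner.weight.mul_(0.5)
refined_logits = model(torch.randn(2, 16))
```
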
To extract features from an image, an object detector is selected and usually trained for the purpose of image captioning. The methods used to train the Faster R-CNN object detector include Bottom-Up Attention and VinVL. Originally, the Image Captioning Transformer uses features from Bottom-Up Attention. To compare the two methods, image features from VinVL are instead fed into the Image Captioning Transformer, which improved its performance on all metrics to 81.2 on BLEU 1, 40.4 on BLEU 4, 28.9 on METEOR, 131.4 on CIDEr and 22.7 on SPICE.
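
The feature-swap experiment amounts to feeding the same captioning model region features from a different extractor. The sketch below uses random arrays as stand-ins for the precomputed Bottom-Up Attention and VinVL features; the region count and feature widths (2048 vs. 2054, the latter assuming box geometry is appended) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for precomputed per-region features of a single image.
bottom_up_feats = rng.standard_normal((36, 2048))  # Bottom-Up Attention style
vinvl_feats = rng.standard_normal((36, 2054))      # VinVL style (assumed box geometry appended)

def caption(region_feats, proj_dim=512):
    """Placeholder captioner: only the input projection depends on the feature width,
    so swapping the feature source leaves the rest of the model unchanged."""
    w = rng.standard_normal((region_feats.shape[1], proj_dim))
    pooled = (region_feats @ w).mean(axis=0)        # stand-in for attention + decoding
    return pooled.shape

print(caption(bottom_up_feats))   # same downstream model,
print(caption(vinvl_feats))       # different feature source
```
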
The YOLOv4 object detector is also evaluated for its ability to produce robust features from an image for image captioning. To use YOLOv4 for this purpose, the Visual Genome dataset is thoroughly cleaned and used to retrain YOLOv4. However, the performance of the Image Captioning Transformer dropped when incorporating the features extracted by the YOLOv4 object detector. Hence, after all experiments, it is found that the Image Captioning Transformer with VinVL features may be the best in terms of both speed and performance.
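
The clean-up step can be illustrated as a conversion from Visual Genome-style object annotations to YOLO-format label lines, dropping empty names and degenerate boxes along the way. The record layout assumed below follows Visual Genome's objects.json, and the class vocabulary is a placeholder; this is a sketch of the idea, not the project's actual preprocessing script.

```python
# Convert one Visual Genome-style record into YOLO label lines:
# "class_id x_center y_center width height", all normalised to [0, 1].
CLASS_TO_ID = {"man": 0, "dog": 1, "bench": 2}   # placeholder vocabulary

def vg_record_to_yolo_lines(record, img_w, img_h):
    lines = []
    for obj in record.get("objects", []):
        names = [n.strip().lower() for n in obj.get("names", []) if n.strip()]
        if not names or names[0] not in CLASS_TO_ID:
            continue                                  # drop unknown or empty labels
        x, y, w, h = obj["x"], obj["y"], obj["w"], obj["h"]
        if w <= 1 or h <= 1:
            continue                                  # drop degenerate boxes
        # Clip to the image, then convert to normalised centre coordinates.
        x, y = max(0, x), max(0, y)
        w, h = min(w, img_w - x), min(h, img_h - y)
        cx, cy = (x + w / 2) / img_w, (y + h / 2) / img_h
        lines.append(f"{CLASS_TO_ID[names[0]]} {cx:.6f} {cy:.6f} {w / img_w:.6f} {h / img_h:.6f}")
    return lines

sample = {"objects": [{"names": ["man"], "x": 10, "y": 20, "w": 50, "h": 120},
                      {"names": [""], "x": 0, "y": 0, "w": 0, "h": 0}]}
print(vg_record_to_yolo_lines(sample, img_w=640, img_h=480))
```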