IMAGE CAPTIONING ON GEOLOGICAL ROCKS WITH TRANSFORMER ARCHITECTURE AND DATA AUGMENTATION
Main Author: | |
---|---|
Format: | Final Project |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/79487 |
Institution: | Institut Teknologi Bandung |
Summary:

Image captioning is the task of generating a natural-language description of an image to explain its content. Previous research on image captioning of geological rocks by Nursikuwagus et al. (2022) used a CNN-LSTM architecture, and their latest work added VaT and SeTrans to the experiments. With the development of the Transformer architecture, there is an opportunity to improve on the existing model. In addition, the dataset is relatively small compared to other image captioning datasets such as Flickr8k and MSCOCO.
In this final project, a Vision Transformer (ViT) or a Swin Transformer was used as the image encoder and a Transformer decoder as the text decoder to improve the performance of the image captioning model on geological rocks. Data augmentation was performed to increase the dataset size: random crop and horizontal flip on the image data, and backtranslation on the text data.
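Both augmentation steps can be sketched as below, assuming PyTorch/torchvision for the image side and, hypothetically, the Hugging Face MarianMT checkpoints Helsinki-NLP/opus-mt-id-en and opus-mt-en-id as a backtranslation pivot; the abstract does not specify the crop size, flip probability, or translation system actually used.

```python
import torchvision.transforms as T
from transformers import MarianMTModel, MarianTokenizer

# Image-side augmentation described in the abstract: random crop and
# horizontal flip. The 224px crop size and padding are assumptions.
image_aug = T.Compose([
    T.RandomCrop(224, padding=16),
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
])

# Text-side augmentation by backtranslation (Indonesian -> English -> Indonesian).
# The pivot language and these Marian checkpoints are assumptions.
def _translate(texts, checkpoint):
    tok = MarianTokenizer.from_pretrained(checkpoint)
    model = MarianMTModel.from_pretrained(checkpoint)
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    out = model.generate(**batch)
    return [tok.decode(t, skip_special_tokens=True) for t in out]

def backtranslate(captions):
    return _translate(_translate(captions, "Helsinki-NLP/opus-mt-id-en"),
                      "Helsinki-NLP/opus-mt-en-id")
```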
The first experiment changed the model, starting from CNN-LSTM and moving to a full-Transformer model. The second experiment varied the learning rate and trained on the augmented data to improve on the performance from the first experiment.
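As an illustration of the full-Transformer configuration, the sketch below pairs a pretrained ViT encoder (via timm) with a standard PyTorch Transformer decoder through cross-attention; the backbone variant, embedding size, layer counts, and vocabulary size are illustrative assumptions rather than the thesis's actual hyperparameters.

```python
import torch
import torch.nn as nn
import timm  # any ViT/Swin backbone exposing forward_features would do

class TransformerCaptioner(nn.Module):
    def __init__(self, vocab_size, d_model=768, nhead=8, num_layers=4, max_len=64):
        super().__init__()
        # ViT-B/16 patch encoder; forward_features returns (B, 197, 768) tokens.
        self.backbone = timm.create_model("vit_base_patch16_224", pretrained=True)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        memory = self.backbone.forward_features(images)  # image tokens
        pos = torch.arange(captions.size(1), device=captions.device)
        tgt = self.token_emb(captions) + self.pos_emb(pos)
        # Causal mask: each caption position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(
            captions.size(1)).to(captions.device)
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(out)  # (B, seq_len, vocab_size) logits
```

During training the decoder sees the caption shifted right (teacher forcing); at inference, tokens are generated one at a time from a start token.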
The Swin Transformer model produced the best performance, with BLEU values of 45.01, 28.52, 19.34, and 9.57. Decreasing the learning rate only slightly improved model performance, and image data augmentation did not improve it. Training with text data augmentation produced a lower BLEU-1 but a higher BLEU-4, at 40.62 and 16.79 respectively. The results of this study show that ViT and Swin Transformer encoders can improve model performance compared to a CNN; however, the LSTM is still superior for generating longer captions on this dataset.
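If the reported figures are corpus-level BLEU-1 through BLEU-4 scaled by 100, they can be computed with NLTK as sketched below; whitespace tokenization and method-1 smoothing are assumptions, and the thesis may differ.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_1_to_4(references, hypotheses):
    # references: one list of reference token lists per image;
    # hypotheses: one generated token list per image.
    smooth = SmoothingFunction().method1
    weights = [(1, 0, 0, 0), (0.5, 0.5, 0, 0),
               (1/3, 1/3, 1/3, 0), (0.25, 0.25, 0.25, 0.25)]
    return [100 * corpus_bleu(references, hypotheses,
                              weights=w, smoothing_function=smooth)
            for w in weights]
```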