IMAGE CAPTIONING ON GEOLOGICAL ROCKS WITH TRANSFORMER ARCHITECTURE AND DATA AUGMENTATION

Bibliographic Details
Main Author: Iqbal Sigid, Muhammad
Format: Final Project
Language: Indonesian
Online Access:https://digilib.itb.ac.id/gdl/view/79487
Institution: Institut Teknologi Bandung
Description
Summary: Image captioning is the task of producing a natural language description of an image to explain its content. Previous research on image captioning for geological rocks by Nursikuwagus et al. (2022) used a CNN-LSTM architecture, and their latest work added VaT and SeTrans to the experiments. With the development of the Transformer architecture, there is an opportunity to improve on the existing model. In addition, the dataset is relatively small compared to other image captioning datasets such as Flickr8k and MSCOCO. In this final project, a Vision Transformer (ViT) or a Swin Transformer was used as the image encoder and a Transformer decoder as the text decoder to improve the performance of the image captioning model on geological rocks. Data augmentation was performed to increase the dataset size: random cropping and horizontal flipping on the image data, and back-translation on the text data.

The first experiment changed the model, starting from CNN-LSTM and moving to a full-Transformer model. The second experiment varied the learning rate and trained on the augmented data to improve on the results of the first experiment. The Swin Transformer model produced the best performance, with BLEU-1 through BLEU-4 scores of 45.01, 28.52, 19.34, and 9.57. Lowering the learning rate only slightly improved performance, and image data augmentation did not improve it. Training with text data augmentation produced a lower BLEU-1 of 40.62 but a higher BLEU-4 of 16.79. The results show that ViT and Swin Transformer encoders improve performance compared to a CNN encoder; however, the LSTM remains superior for generating longer captions on this dataset.
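The encoder-decoder setup the abstract describes, with a ViT image encoder feeding a Transformer text decoder, can be sketched as follows in PyTorch. This is a minimal illustration only: the backbone choice, model sizes, vocabulary size, and maximum caption length are assumptions, not the thesis configuration.

    import torch
    import torch.nn as nn
    from torchvision.models import vit_b_16, ViT_B_16_Weights

    class CaptioningModel(nn.Module):
        """ViT image encoder + Transformer text decoder (illustrative sizes)."""
        def __init__(self, vocab_size=8000, d_model=768, num_layers=4, nhead=8, max_len=40):
            super().__init__()
            # Pretrained ViT-B/16 backbone as the image encoder; the
            # classification head is dropped so it yields a 768-dim feature.
            self.encoder = vit_b_16(weights=ViT_B_16_Weights.DEFAULT)
            self.encoder.heads = nn.Identity()
            self.embed = nn.Embedding(vocab_size, d_model)
            self.pos = nn.Embedding(max_len, d_model)
            layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, images, captions):
            memory = self.encoder(images).unsqueeze(1)        # (B, 1, d_model)
            positions = torch.arange(captions.size(1), device=captions.device)
            tgt = self.embed(captions) + self.pos(positions)  # (B, T, d_model)
            # Causal mask: each position attends only to earlier caption tokens.
            mask = nn.Transformer.generate_square_subsequent_mask(
                captions.size(1)).to(captions.device)
            hidden = self.decoder(tgt, memory, tgt_mask=mask)
            return self.out(hidden)                           # (B, T, vocab_size)

    model = CaptioningModel()
    images = torch.randn(2, 3, 224, 224)       # ViT-B/16 expects 224x224 input
    captions = torch.randint(0, 8000, (2, 12))
    logits = model(images, captions)            # (2, 12, 8000)

Swapping the encoder for a Swin Transformer (e.g., torchvision's swin_t) follows the same pattern; only the backbone and its feature dimension change.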
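The two augmentation strategies the abstract mentions can likewise be sketched briefly. Random crop and horizontal flip are standard torchvision transforms; back-translation paraphrases a caption by translating it to a pivot language and back. The translate(text, src, tgt) callable below is a hypothetical placeholder for whatever machine-translation model is used, not a real library call.

    from torchvision import transforms

    # Image-side augmentation: random crop and horizontal flip.
    image_augment = transforms.Compose([
        transforms.RandomCrop(224, pad_if_needed=True),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.ToTensor(),
    ])

    def back_translate(caption, translate, pivot="en"):
        # Text-side augmentation: round-trip the caption through a pivot
        # language. `translate` is a hypothetical MT callable, e.g. a
        # wrapper around a translation model.
        pivoted = translate(caption, src="id", tgt=pivot)
        return translate(pivoted, src=pivot, tgt="id")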
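The BLEU-1 through BLEU-4 scores quoted above weight n-gram precision up to the given order. A quick way to reproduce the metric, on made-up sentences rather than the thesis data, is with NLTK:

    from nltk.translate.bleu_score import sentence_bleu

    reference = [["a", "gray", "igneous", "rock", "with", "coarse", "grains"]]
    candidate = ["a", "gray", "igneous", "rock"]

    for n in range(1, 5):
        # Uniform weights over 1..n-grams give BLEU-n.
        weights = tuple(1.0 / n for _ in range(n))
        print(f"BLEU-{n}: {sentence_bleu(reference, candidate, weights=weights):.4f}")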