INDONESIAN IMAGE CAPTIONING USING VISION-LANGUAGE MODEL

Bibliographic Details
Main Author: Astrada Fathurrahman, Raihan
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/78303
Institution: Institut Teknologi Bandung
Description
Summary: The success of pre-training and fine-tuning schemes in computer vision and natural language processing has led to increasing research on Vision-Language Models, commonly known as VL models. Previous research on Indonesian image captioning has generally relied on data that was limited in both quality and quantity. In addition, these studies did not leverage VL models, despite their ability to achieve state-of-the-art image-captioning performance thanks to the strong generalization gained from pre-training on large-scale data. To address these shortcomings, this final project constructed a dataset of 60,000 image captions by refining sentences from MSCOCO data that had been automatically translated into Indonesian. This dataset was then used to train VL models that achieve state-of-the-art performance on English-language data, such as BLIP, GIT, and OFA, to handle image captioning in Indonesian. The models were trained through a transfer-learning scheme on Indonesian image-captioning datasets of varying quality and quantity, using machine-translated data, human-translated data, and their combination. Experimental results indicate that the BLIP model, fine-tuned on a combination of machine-translated and human-translated data, exhibited the best language-adaptation ability for Indonesian image captioning. This model attained BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores of 57.9, 43.3, 31.5, and 23.2, respectively, and a CIDEr score of 143.5. The average BLEU and CIDEr scores increased by 78% and 52%, respectively, compared with a baseline that did not use VL models. Furthermore, manual evaluation showed that combining human-translated and machine-translated data produced more accurate and natural captions from the VL model used.
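
The record does not include implementation details, but the transfer-learning setup described in the abstract maps naturally onto the BLIP captioning checkpoint distributed through Hugging Face transformers. The following is a minimal sketch of what such fine-tuning could look like, assuming that library; the checkpoint name, file paths, example Indonesian captions, batch size, learning rate, and epoch count are illustrative assumptions rather than values taken from the thesis.

# A minimal sketch of fine-tuning BLIP for Indonesian image captioning with
# Hugging Face transformers. The checkpoint, paths, captions, and
# hyperparameters below are illustrative assumptions, not thesis values.
import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from transformers import BlipProcessor, BlipForConditionalGeneration


class IndoCaptionDataset(Dataset):
    """(image path, Indonesian caption) pairs, e.g. refined MSCOCO translations."""

    def __init__(self, samples, processor, max_length=40):
        self.samples = samples            # list of (image_path, caption) tuples
        self.processor = processor
        self.max_length = max_length

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image_path, caption = self.samples[idx]
        image = Image.open(image_path).convert("RGB")
        enc = self.processor(images=image, text=caption,
                             padding="max_length", truncation=True,
                             max_length=self.max_length, return_tensors="pt")
        return {k: v.squeeze(0) for k, v in enc.items()}


processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Placeholder training pair; the real dataset would hold ~60,000 refined captions.
samples = [("images/000001.jpg", "seorang pria mengendarai sepeda di jalan raya")]
loader = DataLoader(IndoCaptionDataset(samples, processor), batch_size=8, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):                            # epoch count is an assumption
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        # BLIP returns the language-modelling loss when labels are supplied.
        outputs = model(pixel_values=batch["pixel_values"],
                        input_ids=batch["input_ids"],
                        attention_mask=batch["attention_mask"],
                        labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Inference: generate an Indonesian caption for an unseen image.
model.eval()
image = Image.open("images/000002.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)
generated_ids = model.generate(pixel_values=pixel_values, max_length=40)
print(processor.decode(generated_ids[0], skip_special_tokens=True))

The other models mentioned in the abstract (GIT and OFA) could be fine-tuned along the same lines with their respective processors and model classes. The reported metrics can likewise be computed with standard tooling; the short sketch below uses NLTK for corpus-level BLEU-1 through BLEU-4 and the pycocoevalcap package for CIDEr, with placeholder captions standing in for the test split of the constructed dataset.

# A minimal sketch of the reported metrics: corpus-level BLEU-1..4 with NLTK
# and CIDEr with pycocoevalcap. The captions here are placeholders.
from nltk.translate.bleu_score import corpus_bleu
from pycocoevalcap.cider.cider import Cider

# One entry per test image: several reference captions, one generated caption.
references = {"img1": ["seekor kucing duduk di atas meja",
                       "kucing berbaring di meja kayu"]}
hypotheses = {"img1": ["seekor kucing duduk di meja"]}

image_ids = sorted(references)
refs_tok = [[r.split() for r in references[i]] for i in image_ids]
hyps_tok = [hypotheses[i][0].split() for i in image_ids]

for n in range(1, 5):
    score = corpus_bleu(refs_tok, hyps_tok, weights=tuple([1.0 / n] * n))
    print(f"BLEU-{n}: {score:.3f}")

cider_score, _ = Cider().compute_score(references, hypotheses)
print(f"CIDEr: {cider_score:.3f}")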