INDONESIAN IMAGE CAPTIONING USING VISION-LANGUAGE MODEL
Main Author:
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/78303
Institution: Institut Teknologi Bandung
Summary: The success of pre-training and fine-tuning schemes in computer vision and natural language processing has led to an increase in research exploring Vision-Language Models, commonly known as VL Models. Previous research on Indonesian-language image captioning has generally relied on data that is limited in both quality and quantity. Additionally, these studies did not leverage VL models, despite their capability to achieve state-of-the-art performance in image captioning thanks to the strong generalization gained from pre-training on large-scale data.
To address these shortcomings, this final project constructed a dataset of 60,000 image captions by refining sentences from MSCOCO data that had been automatically translated into Indonesian. This dataset was then used to train VL models that achieve state-of-the-art performance on English-language data, such as BLIP, GIT, and OFA, to handle image captioning in Indonesian. These models were trained through a transfer learning scheme on Indonesian image captioning datasets of varying quality and quantity, using machine-translated data, human-translated data, and their combination.
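As a rough illustration of the transfer-learning step described above, the sketch below fine-tunes a BLIP checkpoint on (image, Indonesian caption) pairs with Hugging Face Transformers. The checkpoint name, the `train_pairs` dataset, and the hyperparameters are illustrative assumptions, not the setup used in the thesis.

```python
# Minimal sketch: fine-tuning BLIP for Indonesian image captioning.
# Checkpoint, dataset format, and hyperparameters are assumptions.
import torch
from torch.utils.data import DataLoader
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def collate(batch):
    # Each item is assumed to be a (PIL image, Indonesian caption) pair.
    images, captions = zip(*batch)
    return processor(images=list(images), text=list(captions),
                     padding=True, return_tensors="pt")

# `train_pairs` is a hypothetical list of (image, caption) tuples.
loader = DataLoader(train_pairs, batch_size=8, shuffle=True, collate_fn=collate)

model.train()
for epoch in range(3):
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        # BLIP computes the captioning loss when labels mirror input_ids.
        outputs = model(**batch, labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The same loop shape would apply to the machine-translated, human-translated, and combined datasets; only the caption source changes.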
Experimental results indicate that the BLIP model, fine-tuned on a combination of machine-translated and human-translated data, exhibited the best language adaptation ability for Indonesian image captioning. This model attained BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores of 57.9, 43.3, 31.5, and 23.2 respectively, and a CIDEr score of 143.5. The average BLEU and CIDEr scores increased by 78% and 52% respectively compared to a baseline that did not use VL models. Furthermore, manual evaluation showed that combining human-translated and machine-translated data produced more accurate and natural captions from the VL model used.
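For the automatic metrics reported above, a minimal scoring sketch using the pycocoevalcap implementations of BLEU and CIDEr could look like the following; the Indonesian reference and hypothesis captions are made-up placeholders, not examples from the dataset.

```python
# Minimal sketch: scoring generated captions with BLEU-1..4 and CIDEr
# via pycocoevalcap. Captions below are placeholder examples.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# Keys are image ids; references may hold several captions, hypotheses one.
references = {"img1": ["seekor kucing duduk di atas meja",
                       "kucing sedang duduk di meja kayu"]}
hypotheses = {"img1": ["seekor kucing duduk di meja"]}

bleu_scores, _ = Bleu(4).compute_score(references, hypotheses)
cider_score, _ = Cider().compute_score(references, hypotheses)

for n, score in enumerate(bleu_scores, start=1):
    print(f"BLEU-{n}: {score:.3f}")
print(f"CIDEr: {cider_score:.3f}")
```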