INDONESIAN IMAGE CAPTIONING USING VISION-LANGUAGE MODEL
Main Author: | Astrada Fathurrahman, Raihan |
---|---|
Format: | Final Project |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/78303 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:78303 |
---|---|
spelling |
id-itb.:78303 2023-09-18T22:54:42Z INDONESIAN IMAGE CAPTIONING USING VISION-LANGUAGE MODEL Astrada Fathurrahman, Raihan Indonesia Final Project image captioning, human translated data, machine translated data, vision-language model. INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/78303 text |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
The success of the pre-train and fine-tune paradigm in computer vision and natural language processing has led to a growing body of research exploring Vision-Language Models, commonly known as VL models. Previous research on Indonesian image captioning generally relied on data that was limited in both quality and quantity. Moreover, these studies did not leverage VL models, despite their ability to reach state-of-the-art image-captioning performance thanks to the strong generalization gained from pre-training on large-scale data.

To address these shortcomings, this final project constructed a dataset of 60,000 image captions by refining sentences from MSCOCO data that had been automatically translated into Indonesian. This dataset was then used to adapt VL models that achieve state-of-the-art performance on English-language data, namely BLIP, GIT, and OFA, to image captioning in Indonesian. The models were trained through a transfer-learning scheme on Indonesian image-captioning datasets of varying quality and quantity, using machine-translated data, human-translated data, and their combination.

Experimental results indicate that the BLIP model fine-tuned on the combination of machine-translated and human-translated data exhibited the best language-adaptation ability for Indonesian image captioning. This model attained BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores of 57.9, 43.3, 31.5, and 23.2, respectively, and a CIDEr score of 143.5. Its average BLEU and CIDEr scores are 78% and 52% higher, respectively, than those of a baseline that does not use VL models. Furthermore, manual evaluation showed that using both human-translated and machine-translated data produced more accurate and natural captions from the VL model used. |
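The dataset described above starts from MSCOCO captions that were machine-translated into Indonesian and then refined. The abstract does not name the translation system, so the following is only a minimal sketch of that first step, assuming the publicly available Helsinki-NLP/opus-mt-en-id MarianMT checkpoint from Hugging Face; the example captions are invented.

```python
# Illustrative sketch only: machine-translating English MSCOCO-style captions
# into Indonesian with a public MarianMT checkpoint (an assumption, not the
# system used in the thesis). The refined dataset would then be produced by
# manually correcting such outputs.
from transformers import MarianMTModel, MarianTokenizer

MODEL_NAME = "Helsinki-NLP/opus-mt-en-id"  # assumed en->id model
tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
model = MarianMTModel.from_pretrained(MODEL_NAME)

def translate(captions):
    """Translate a batch of English captions into Indonesian."""
    batch = tokenizer(captions, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch, max_length=64, num_beams=4)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(translate([
    "A man riding a bicycle down a city street.",
    "Two dogs playing with a frisbee in the park.",
]))
```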
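The transfer-learning setup reported above fine-tunes a pre-trained VL model on Indonesian caption data. Below is a minimal sketch of such a loop for BLIP using the Hugging Face transformers implementation; the checkpoint name, hyperparameters, and the `pairs` list of (image path, Indonesian caption) tuples are illustrative assumptions rather than details taken from the thesis.

```python
# Minimal sketch of fine-tuning BLIP on (image, Indonesian caption) pairs.
# Checkpoint, hyperparameters, and the `pairs` input are placeholders.
import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from transformers import BlipForConditionalGeneration, BlipProcessor

CHECKPOINT = "Salesforce/blip-image-captioning-base"  # assumed starting checkpoint
processor = BlipProcessor.from_pretrained(CHECKPOINT)
model = BlipForConditionalGeneration.from_pretrained(CHECKPOINT)

class CaptionDataset(Dataset):
    """Wraps (image_path, caption) pairs, e.g. machine- or human-translated MSCOCO."""
    def __init__(self, pairs):
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        path, caption = self.pairs[idx]
        enc = processor(images=Image.open(path).convert("RGB"), text=caption,
                        padding="max_length", truncation=True, max_length=40,
                        return_tensors="pt")
        return {k: v.squeeze(0) for k, v in enc.items()}

def finetune(pairs, epochs=3, lr=5e-5, batch_size=16, device="cuda"):
    """Plain fine-tuning loop; all hyperparameters here are placeholders."""
    loader = DataLoader(CaptionDataset(pairs), batch_size=batch_size, shuffle=True)
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            # BLIP returns a language-modelling loss when labels are supplied.
            outputs = model(pixel_values=batch["pixel_values"],
                            input_ids=batch["input_ids"],
                            attention_mask=batch["attention_mask"],
                            labels=batch["input_ids"])
            outputs.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```

After fine-tuning, a caption is produced by encoding an image with the processor and calling `model.generate(pixel_values=...)`, then decoding with `processor.decode(..., skip_special_tokens=True)`.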
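The BLEU-1 to BLEU-4 and CIDEr figures quoted above follow the standard COCO caption-evaluation protocol. A small sketch of computing such scores with the pycocoevalcap package is given below; the reference and hypothesis captions are invented, and a real evaluation would use the full test split (and usually the PTB tokenizer) rather than toy examples.

```python
# Hedged sketch: computing BLEU-1..4 and CIDEr with pycocoevalcap.
# Both scorers take dicts mapping an image id to a list of caption strings;
# the hypotheses dict must contain exactly one caption per image.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

references = {
    "img1": ["seorang pria mengendarai sepeda di jalan",
             "pria bersepeda menyusuri jalan kota"],
    "img2": ["dua anjing bermain frisbee di taman"],
}
hypotheses = {
    "img1": ["seorang pria sedang mengendarai sepeda di jalan"],
    "img2": ["dua ekor anjing bermain di taman"],
}

bleu, _ = Bleu(4).compute_score(references, hypotheses)   # [BLEU-1, ..., BLEU-4]
cider, _ = Cider().compute_score(references, hypotheses)

# Scores are conventionally reported x100 (e.g. 0.579 -> 57.9, as in the abstract).
print([round(100 * b, 1) for b in bleu], round(100 * cider, 1))
```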
format |
Final Project |
author |
Astrada Fathurrahman, Raihan |
title |
INDONESIAN IMAGE CAPTIONING USING VISION-LANGUAGE MODEL |
url |
https://digilib.itb.ac.id/gdl/view/78303 |