INDONESIAN IMAGE CAPTIONING USING VISION-LANGUAGE MODEL

Bibliographic Details
Main Author: Astrada Fathurrahman, Raihan
Format: Final Project
Language: Indonesian
Online Access:https://digilib.itb.ac.id/gdl/view/78303
Institution: Institut Teknologi Bandung
Language: Indonesian
id id-itb.:78303
spelling id-itb.:78303 2023-09-18T22:54:42Z
topic image captioning, human translated data, machine translated data, vision-language model
publisher INSTITUT TEKNOLOGI BANDUNG
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesian
description The success of the pre-training and fine-tuning scheme in computer vision and natural language processing has led to an increase in research exploring Vision-Language Models, commonly known as VL models. Previous research on Indonesian-language image captioning generally relied on data that was limited in both quality and quantity. Additionally, these studies did not leverage VL models, despite their capability to achieve state-of-the-art performance in image captioning thanks to the strong generalization gained from pre-training on large-scale data. To address these shortcomings, this final project constructed a dataset of 60,000 image captions by refining sentences from MSCOCO data that had been automatically translated into Indonesian. This dataset was then used to train VL models that achieve state-of-the-art performance on English-language data, such as BLIP, GIT, and OFA, to handle image captioning in Indonesian. These models were trained through a transfer-learning scheme on Indonesian image-captioning datasets of varying quality and quantity, using machine-translated data, human-translated data, and their combination. Experimental results indicate that the BLIP model, fine-tuned on the combination of machine-translated and human-translated data, exhibited the best language-adaptation ability for Indonesian image captioning. This model attained BLEU-1/2/3/4 scores of 57.9, 43.3, 31.5, and 23.2, respectively, and a CIDEr score of 143.5. Its average BLEU and CIDEr scores were 78% and 52% higher, respectively, than those of a baseline that did not use VL models. Furthermore, manual evaluation showed that using both human-translated and machine-translated data produced more accurate and natural captions from the VL model.
format Final Project
author Astrada Fathurrahman, Raihan
spellingShingle Astrada Fathurrahman, Raihan
INDONESIAN IMAGE CAPTIONING USING VISION-LANGUAGE MODEL
author_facet Astrada Fathurrahman, Raihan
author_sort Astrada Fathurrahman, Raihan
title INDONESIAN IMAGE CAPTIONING USING VISION-LANGUAGE MODEL
title_short INDONESIAN IMAGE CAPTIONING USING VISION-LANGUAGE MODEL
title_full INDONESIAN IMAGE CAPTIONING USING VISION-LANGUAGE MODEL
title_fullStr INDONESIAN IMAGE CAPTIONING USING VISION-LANGUAGE MODEL
title_full_unstemmed INDONESIAN IMAGE CAPTIONING USING VISION-LANGUAGE MODEL
title_sort indonesian image captioning using vision-language model
url https://digilib.itb.ac.id/gdl/view/78303
_version_ 1822995698457509888
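
For readers who want a concrete starting point, below is a minimal sketch of the kind of transfer-learning setup the abstract describes: fine-tuning BLIP on Indonesian (image, caption) pairs with the Hugging Face Transformers library. This is not the project's actual code; the checkpoint name, the sample pair, and the single-example training loop are illustrative assumptions standing in for the 60,000-caption MSCOCO-derived dataset used in the thesis.

    # Illustrative sketch only -- not the thesis implementation.
    # Fine-tunes BLIP for Indonesian image captioning with Hugging Face Transformers.
    import torch
    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    checkpoint = "Salesforce/blip-image-captioning-base"  # assumed English-pretrained checkpoint
    processor = BlipProcessor.from_pretrained(checkpoint)
    model = BlipForConditionalGeneration.from_pretrained(checkpoint)
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    # Hypothetical stand-in for the Indonesian MSCOCO-derived (image, caption) data.
    pairs = [("example.jpg", "seekor kucing duduk di atas meja")]  # "a cat sits on a table"

    model.train()
    for image_path, caption in pairs:
        image = Image.open(image_path).convert("RGB")
        inputs = processor(images=image, text=caption, return_tensors="pt")
        # The tokenized caption doubles as the labels for the language-modeling loss.
        outputs = model(pixel_values=inputs.pixel_values,
                        input_ids=inputs.input_ids,
                        labels=inputs.input_ids)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # Inference: generate an Indonesian caption for an image.
    model.eval()
    with torch.no_grad():
        generated = model.generate(pixel_values=inputs.pixel_values, max_length=30)
    print(processor.decode(generated[0], skip_special_tokens=True))

For the reported metrics, BLEU-1 through BLEU-4 can be computed with standard toolkits such as nltk or sacrebleu, and CIDEr with the pycocoevalcap package; the abstract does not state which implementations the thesis used, so treat these as common defaults rather than its exact evaluation setup.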