INDONESIAN IMAGE CAPTIONING USING VISION-LANGUAGE MODEL

Bibliographic Details
Main Author: Astrada Fathurrahman, Raihan
Format: Final Project
Language: Indonesian
Online Access:https://digilib.itb.ac.id/gdl/view/78303
Institution: Institut Teknologi Bandung
Language: Indonesian
id id-itb.:78303
spelling id-itb.:78303 2023-09-18T22:54:42Z
topic image captioning, human translated data, machine translated data, vision-language model
publisher INSTITUT TEKNOLOGI BANDUNG
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesian
description The success of the pre-training and fine-tuning scheme in computer vision and natural language processing has led to an increase in research exploring Vision-Language Models, commonly known as VL models. Previous research on Indonesian-language image captioning generally relied on data that was limited in both quality and quantity. Additionally, these studies did not leverage VL models, despite their capability to achieve state-of-the-art performance in image captioning thanks to the strong generalization gained from pre-training on large-scale data. To address these shortcomings, this final project constructed a dataset of 60,000 image captions by refining sentences from MSCOCO data that had been automatically translated into Indonesian. This dataset was then used to train VL models that achieve state-of-the-art performance on English-language data, such as BLIP, GIT, and OFA, to handle image captioning in Indonesian. These models were trained through a transfer-learning scheme on Indonesian image-captioning datasets of varying quality and quantity, using machine-translated data, human-translated data, and their combination. Experimental results indicate that the BLIP model, fine-tuned on the combination of machine-translated and human-translated data, exhibited the best language-adaptation ability for Indonesian image captioning. This model attained BLEU-1/2/3/4 scores of 57.9, 43.3, 31.5, and 23.2, respectively, and a CIDEr score of 143.5. Its average BLEU and CIDEr scores were 78% and 52% higher, respectively, than those of a baseline that did not use VL models. Furthermore, manual evaluation showed that using both human-translated and machine-translated data produced more accurate and natural captions from the VL model.
format Final Project
author Astrada Fathurrahman, Raihan
spellingShingle Astrada Fathurrahman, Raihan
INDONESIAN IMAGE CAPTIONING USING VISION-LANGUAGE MODEL
author_facet Astrada Fathurrahman, Raihan
author_sort Astrada Fathurrahman, Raihan
title INDONESIAN IMAGE CAPTIONING USING VISION-LANGUAGE MODEL
title_short INDONESIAN IMAGE CAPTIONING USING VISION-LANGUAGE MODEL
title_full INDONESIAN IMAGE CAPTIONING USING VISION-LANGUAGE MODEL
title_fullStr INDONESIAN IMAGE CAPTIONING USING VISION-LANGUAGE MODEL
title_full_unstemmed INDONESIAN IMAGE CAPTIONING USING VISION-LANGUAGE MODEL
title_sort indonesian image captioning using vision-language model
url https://digilib.itb.ac.id/gdl/view/78303
_version_ 1822995698457509888
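
For readers who want a concrete starting point, below is a minimal sketch of the kind of transfer-learning setup the abstract describes: fine-tuning BLIP on Indonesian (image, caption) pairs with the Hugging Face Transformers library. This is not the project's actual code; the checkpoint name, the sample pair, and the single-example training loop are illustrative assumptions standing in for the 60,000-caption MSCOCO-derived dataset used in the thesis.

    # Illustrative sketch only -- not the thesis implementation.
    # Fine-tunes BLIP for Indonesian image captioning with Hugging Face Transformers.
    import torch
    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    checkpoint = "Salesforce/blip-image-captioning-base"  # assumed English-pretrained checkpoint
    processor = BlipProcessor.from_pretrained(checkpoint)
    model = BlipForConditionalGeneration.from_pretrained(checkpoint)
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    # Hypothetical stand-in for the Indonesian MSCOCO-derived (image, caption) data.
    pairs = [("example.jpg", "seekor kucing duduk di atas meja")]  # "a cat sits on a table"

    model.train()
    for image_path, caption in pairs:
        image = Image.open(image_path).convert("RGB")
        inputs = processor(images=image, text=caption, return_tensors="pt")
        # The tokenized caption doubles as the labels for the language-modeling loss.
        outputs = model(pixel_values=inputs.pixel_values,
                        input_ids=inputs.input_ids,
                        labels=inputs.input_ids)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # Inference: generate an Indonesian caption for an image.
    model.eval()
    with torch.no_grad():
        generated = model.generate(pixel_values=inputs.pixel_values, max_length=30)
    print(processor.decode(generated[0], skip_special_tokens=True))

For the reported metrics, BLEU-1 through BLEU-4 can be computed with standard toolkits such as nltk or sacrebleu, and CIDEr with the pycocoevalcap package; the abstract does not state which implementations the thesis used, so treat these as common defaults rather than its exact evaluation setup.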