IMAGE CAPTIONING WITH TEXT AUGMENTATION AND TRANSFORMER CASE STUDY: TOURISM DATA

Bibliographic Details
Main Author: Thoriq Ahmada, Marsa
Format: Thesis
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/65749
Institution: Institut Teknologi Bandung
Keywords: image captioning, transformers, attention, text augmentation, BERT, Word2Vec

Description

Indonesian tourism has great potential: its natural landscapes and highly diverse culture can be developed into tourism destinations. Automatically generated descriptions of such places could be used in an application that suggests places to visit. Image captioning is the task of generating image descriptions automatically. Its development has continued with attention-based models: Xu et al. (2015) extended the CNN-LSTM model with an attention mechanism, and Fudholi et al. (2021) applied a CNN-LSTM architecture with attention, following Xu et al. (2015), to the Indonesian tourism domain, replacing the VGG16 feature extractor with the newer EfficientNet.

In these previous studies, a sequential model was still used for the language-model component. Sequential models struggle with long-range context dependencies: when a sentence is long, the model has difficulty capturing the relationship between words at its beginning and end. They also decode one word at a time, so each step must wait for the previous word. A further problem for image captioning with Indonesian captions is low resource availability, which leads to limited diversity in the generated captions. Caption diversity matters because captions that keep reusing the same words quickly become monotonous.

This thesis investigates a transformer-based image captioning model to address the weaknesses of the sequential model, in particular long-range context dependency. The multi-head attention of the transformer can capture relationships between words even when they are far apart, which resolves this problem. The low availability of Indonesian caption resources is addressed with text augmentation: replacing a few words of a sentence adds variation and new vocabulary, while the augmented sentence is expected to keep the same meaning as the original (a minimal sketch of this substitution idea is given below). Two text augmentation techniques are used in this study, Word2Vec and BERT.

The experiments show that the transformer-based captioning model outperforms the attention-based model used in previous studies, both in accuracy and in the variety of the generated text. Compared to the attention model, the transformer model improves the CIDEr score by 0.741 and the BLEU-4 score by 0.079. On the diversity metrics it produces 19% more vocabulary, and the Div-1 and Div-2 scores increase by 0.09 and 0.134, respectively. This is attributed to multi-head attention, which learns the relationships between words; the attention baseline relies on a sequential GRU, whose long-range context dependency also causes repeated words due to loss of information.
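
The abstract does not include code; the following is a minimal sketch of the Word2Vec-based word substitution described above, using gensim word vectors to propose near-synonym replacements. The vector file name, the replacement probability, and the helper function are illustrative assumptions, not the exact pipeline used in the thesis; the BERT variant would instead mask a word and let a masked language model propose an in-context replacement.

```python
# Minimal sketch of Word2Vec-based text augmentation (illustrative only).
# "id_word2vec.kv" is a hypothetical path to pretrained Indonesian word vectors.
import random
from gensim.models import KeyedVectors

vectors = KeyedVectors.load("id_word2vec.kv")

def augment_caption(caption: str, p_replace: float = 0.15) -> str:
    """Replace a fraction of words with their nearest Word2Vec neighbour."""
    augmented = []
    for word in caption.split():
        if word in vectors and random.random() < p_replace:
            substitute, _score = vectors.most_similar(word, topn=1)[0]
            augmented.append(substitute)  # near-synonym keeps the sentence meaning
        else:
            augmented.append(word)
    return " ".join(augmented)

# Each call yields a caption variant with a few substituted words.
print(augment_caption("pantai dengan pasir putih dan air laut yang jernih"))
```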

The experiments also show that text augmentation reduces accuracy. For the attention model the decrease is 0.026 on the CIDEr metric and 0.002 on the BLEU-4 metric; for the transformer model, CIDEr drops by 0.335 and BLEU-4 by 0.054. This indicates that text augmentation has not yet enabled the models to predict captions more accurately. However, text augmentation does improve text diversity: the augmented attention model produces 39% more vocabulary and raises the Div-2 score by 0.015, while the augmented transformer model produces 35% more vocabulary and raises the Div-2 score by 0.008. Text augmentation is therefore useful for image captioning tasks in which caption diversity is important.
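
The Div-1 and Div-2 scores and the vocabulary counts quoted above are diversity measures. The abstract does not spell out their exact formulas, so the sketch below assumes the common "distinct-n" formulation: the number of distinct unigrams (Div-1) or bigrams (Div-2) divided by the total number of generated n-grams, plus a simple vocabulary count.

```python
# Sketch of Div-1 / Div-2 style diversity metrics under the assumed
# "distinct-n" definition; the thesis may normalise slightly differently.
from typing import List

def ngrams(tokens: List[str], n: int):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def distinct_n(captions: List[str], n: int) -> float:
    """Ratio of unique n-grams to total n-grams over all generated captions."""
    all_ngrams = []
    for caption in captions:
        all_ngrams.extend(ngrams(caption.split(), n))
    return len(set(all_ngrams)) / len(all_ngrams) if all_ngrams else 0.0

def vocabulary_size(captions: List[str]) -> int:
    """Number of distinct word types across all generated captions."""
    return len({word for caption in captions for word in caption.split()})

generated = ["pantai dengan pasir putih", "pantai dengan ombak besar"]
print("Div-1:", distinct_n(generated, 1))   # distinct unigram ratio
print("Div-2:", distinct_n(generated, 2))   # distinct bigram ratio
print("Vocabulary:", vocabulary_size(generated))
```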