IMAGE CAPTIONING WITH TEXT AUGMENTATION AND TRANSFORMER CASE STUDY: TOURISM DATA
Main Author: | Thoriq Ahmada, Marsa |
---|---|
Format: | Theses |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/65749 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:65749 |
---|---|
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
Indonesian tourism has great potential: its nature and highly diverse culture can be developed into tourism destinations. Automatically generated descriptions could be used in an application that suggests places to visit. Image captioning is the task of generating such image descriptions automatically. Its development has continued with attention-based models: Xu et al. (2015) extended the CNN-LSTM model by adding an attention mechanism. Research on image captioning in the Indonesian tourism domain was conducted by Fudholi et al. (2021), who also used the CNN-LSTM architecture with attention, following Xu et al. (2015), but replaced the feature-extraction backbone, VGG16, with a newer one, EfficientNet.
Previous studies still used sequential models for the language-model component. Sequential models struggle with long-range context dependencies: when a sentence is long, they have difficulty capturing a relationship between its first and last words. Another drawback is that decoding proceeds word by word, so each step must wait for the previous result. A further problem in image captioning with Indonesian captions is low resource availability, which leads to a lack of diversity in the generated captions. Caption diversity matters because captions that keep reusing the same words become monotonous.
This thesis investigates a transformer-based image-captioning model to address the sequential model's long-range context dependency. Multi-head attention in the transformer can capture relationships between words well even when they are far apart, so the long-range dependency problem can be resolved. The low availability of Indonesian caption resources is addressed with text augmentation: replacing a few words of a sentence adds variation and may introduce new vocabulary, while the augmented sentences are expected to keep the same meaning as the originals. Two text augmentation techniques are used in this study: Word2Vec and BERT.
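As an illustration only (not the thesis's code), Word2Vec-style augmentation can be sketched as replacing a word with its nearest neighbour in an embedding space. The tiny embedding table below is entirely hypothetical; real augmentation would use vectors trained on an Indonesian corpus:

```python
import math
import random

# Toy hand-made "embeddings" standing in for real Word2Vec vectors
# (hypothetical values, for illustration only).
EMB = {
    "pantai": [0.9, 0.1, 0.0],   # beach
    "laut":   [0.8, 0.2, 0.1],   # sea
    "gunung": [0.1, 0.9, 0.2],   # mountain
    "bukit":  [0.2, 0.8, 0.3],   # hill
    "indah":  [0.1, 0.2, 0.9],   # beautiful
    "cantik": [0.0, 0.3, 0.8],   # pretty
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest(word):
    """Most similar other word in the toy embedding table."""
    return max((w for w in EMB if w != word),
               key=lambda w: cosine(EMB[word], EMB[w]))

def augment(sentence, p=0.5, rng=None):
    """Replace each known word with its nearest neighbour with probability p."""
    rng = rng or random.Random(0)
    out = []
    for tok in sentence.split():
        if tok in EMB and rng.random() < p:
            out.append(nearest(tok))
        else:
            out.append(tok)
    return " ".join(out)

print(augment("pantai yang indah", p=1.0))  # → "laut yang cantik"
```

BERT-based augmentation works differently (masking a word and letting the model fill it in from context), but the effect on the training data is similar: paraphrased captions with new surface vocabulary.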
This study found that a transformer-based image-captioning model improves both the accuracy and the diversity of the generated captions compared with the attention-based model used in previous studies. Relative to the attention model, the transformer model increases the CIDEr score by 0.741 and the BLEU-4 score by 0.079. On the diversity metrics, it produces 19% more vocabulary, and its Div-1 and Div-2 scores rise by 0.09 and 0.134, respectively. This is because multi-head attention in the transformer can learn the relationships between words, whereas the attention model's sequential component, a GRU, suffers from long-range context dependencies and loses information, which also causes repeated words.
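The multi-head attention referred to above can be sketched in a few lines of NumPy. This is a generic scaled dot-product formulation, not the thesis's exact implementation, and the random weights and dimensions are illustrative stand-ins for learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, rng):
    """Self-attention: every position attends to every other position,
    so even distant words are related in a single step."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    # Random projections stand in for learned weight matrices.
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1
                      for _ in range(4))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)  # (seq, seq)
        weights = softmax(scores, axis=-1)              # rows sum to 1
        heads.append(weights @ V[:, s])                 # (seq, d_head)
    return np.concatenate(heads, axis=-1) @ Wo          # (seq, d_model)

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 16))  # 6 tokens, model dimension 16
out = multi_head_attention(X, num_heads=4, rng=rng)
```

Because the score matrix relates every token pair directly, the distance between two words no longer matters, which is exactly what a GRU lacks.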
The experiments also show that text augmentation reduces accuracy. For the attention model, CIDEr drops by 0.026 and BLEU-4 by 0.002; for the transformer model, CIDEr drops by 0.335 and BLEU-4 by 0.054. This indicates that text augmentation has not yet made the models predict captions more accurately. However, text augmentation improves caption diversity: with augmentation, the attention model's vocabulary grows by 39% and its Div-2 score by 0.015, while the transformer model's vocabulary grows by 35% and its Div-2 score by 0.008. Text augmentation is therefore useful for image captioning tasks in which caption diversity is important. |
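The Div-1 and Div-2 scores cited above are commonly computed as the ratio of distinct n-grams to total n-grams across the generated captions (a standard formulation; the thesis may normalise differently). The example captions below are invented for illustration:

```python
def div_n(captions, n):
    """Ratio of distinct n-grams to total n-grams across all captions."""
    ngrams = []
    for cap in captions:
        toks = cap.split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

caps = ["pantai yang indah", "pantai yang bersih", "gunung yang indah"]
d1 = div_n(caps, 1)  # 5 distinct unigrams out of 9 total
d2 = div_n(caps, 2)  # 4 distinct bigrams out of 6 total
```

A model that repeats the same phrasing for every image scores low on Div-n even if its BLEU or CIDEr scores are acceptable, which is why both kinds of metric are reported.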
format |
Theses |
author |
Thoriq Ahmada, Marsa |
title |
IMAGE CAPTIONING WITH TEXT AUGMENTATION AND TRANSFORMER CASE STUDY: TOURISM DATA |
url |
https://digilib.itb.ac.id/gdl/view/65749 |
_version_ |
1822004942793080832 |
spelling |
id-itb.:65749 2022-06-24T14:53:54Z IMAGE CAPTIONING WITH TEXT AUGMENTATION AND TRANSFORMER CASE STUDY: TOURISM DATA Thoriq Ahmada, Marsa Indonesia Theses image captioning, transformers, attention, text augmentation, BERT, Word2Vec. INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/65749 text |