AUTOMATIC TEXT GENERATION OF GEOLOGICAL ROCK IMAGES WITH DETECTION OF SEMANTIC ATTENTION (SEMATT) RELATIONS BETWEEN OBJECTS
Main Author: | Nursikuwagus, Agus
---|---
Format: | Dissertations
Language: | Indonesia
Online Access: | https://digilib.itb.ac.id/gdl/view/76854
Institution: | Institut Teknologi Bandung
id
id-itb.:76854
institution
Institut Teknologi Bandung
building
Institut Teknologi Bandung Library
continent
Asia
country
Indonesia
content_provider
Institut Teknologi Bandung
collection
Digital ITB
language
Indonesia
description
Image captioning combines computer vision and natural language processing (NLP) tasks: it translates an image into text, i.e., a caption. The computer vision task performs image recognition and feature extraction, while the NLP task generates text using the recurrent neural network (RNN) method, which semantically predicts words from image features. Image captioning has been studied by various researchers using the MSCOCO and Flickr datasets. This study differs in its objects, namely geological rock objects, which are not represented in the MSCOCO and Flickr datasets. The domain is limited to the geology of rocks, and the image descriptions are validated directly by geologists. An image captioning model combines a convolutional neural network (CNN) with a language model such as long short-term memory (LSTM), Transformers, or an attention model. The language model was redeveloped by adding attention methods and, subsequently, Transformers. Image recognition is carried out using both regular CNN and separable CNN methods. The identified objects, in the form of image features, are interpreted by the language model to predict words corresponding to those features. This interpretation is grounded in the images and in reference descriptions of geological rock images, so that the resulting semantics agree with the geologist's interpretation.
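As a concrete illustration of this CNN-plus-language-model combination, the following is a minimal sketch of a merge-style encoder-decoder captioning model in Keras. The layer sizes, vocabulary size, and caption length are illustrative assumptions, not the dissertation's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 5000   # assumed vocabulary size
MAX_LEN = 30        # assumed maximum caption length
EMBED_DIM = 256     # assumed embedding/feature width

# Encoder: a pretrained CNN (Xception, one of the backbones studied),
# pooled to a single feature vector per image.
cnn = tf.keras.applications.Xception(include_top=False, pooling="avg")
image_in = layers.Input(shape=(299, 299, 3))  # standard Xception input size
img_feat = layers.Dense(EMBED_DIM, activation="relu")(cnn(image_in))

# Decoder: word embedding + LSTM over the partial caption.
caption_in = layers.Input(shape=(MAX_LEN,))
emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(caption_in)
seq_feat = layers.LSTM(256)(emb)

# Merge image and text features, then predict the next word with softmax.
merged = layers.add([img_feat, seq_feat])
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(merged)

model = Model(inputs=[image_in, caption_in], outputs=next_word)
```

Training such a model pairs each image with every prefix of its reference caption and asks the softmax layer to predict the next word of that prefix.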
Machine learning for image captioning consists of two parts, namely the encoder and the decoder. The encoder performs recognition on images, while the decoder employs the language model as a word generator, receiving as input the combined image feature extraction and word embedding. This study examines both parts, exploring the CNN method as the encoder and the LSTM method as the decoder. The study of the word generation model began with the LSTM method augmented with an attention method, then continued by exploring the Transformers method. The semantic attention approach emphasizes the decoder, with the intention of obtaining words that correspond semantically to the image features. Models developed on MSCOCO and Flickr, such as the Szegedy and Karpathy models, have not been able to predict rock geology captions that are close to the reference.
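The attention mechanism added to the LSTM decoder can be sketched as Bahdanau-style additive attention over spatial CNN features. The sketch below is a generic formulation with assumed dimensions, not the dissertation's exact SemATT layer.

```python
import tensorflow as tf
from tensorflow.keras import layers

class BahdanauAttention(layers.Layer):
    """Additive attention: scores each spatial location of the CNN
    feature map against the decoder's current hidden state."""
    def __init__(self, units):
        super().__init__()
        self.w_feat = layers.Dense(units)    # projects CNN features
        self.w_hidden = layers.Dense(units)  # projects decoder state
        self.v = layers.Dense(1)             # scalar score per location

    def call(self, features, hidden):
        # features: (batch, locations, feat_dim); hidden: (batch, hid_dim)
        hidden = tf.expand_dims(hidden, 1)
        scores = self.v(tf.nn.tanh(self.w_feat(features) + self.w_hidden(hidden)))
        weights = tf.nn.softmax(scores, axis=1)        # attention over locations
        context = tf.reduce_sum(weights * features, axis=1)
        return context, weights   # context vector feeds the next LSTM step
```

At each decoding step the context vector re-weights the image regions, which is what lets the generated word attend to the semantically relevant part of the rock image.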
The contribution of this research is to combine state-of-the-art image captioning architectures, pairing the CNN approach with the LSTM method or Transformers, to generate captions for geological rock images. For identifying background image features, the research contributes the proposed semantic attention (SemATT) architecture. Ensemble architectures combining CNN methods with language models were the target of the experiments. On the language model side, the LSTM method, attention methods, and Transformers served as the experimental text generation models. The proposed word embedding method is also examined as an encoding step to improve the performance of feature encoding. The use of input image sizes of 224x224 and 229x229 pixels is also evaluated.
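One way to realize the proposed word2vec encoding is to train word vectors on the caption corpus and load them into the decoder's embedding layer. The following sketch uses gensim with a toy corpus and illustrative dimensions; the real corpus would be the geologist-validated captions.

```python
import numpy as np
import tensorflow as tf
from gensim.models import Word2Vec
from tensorflow.keras import layers

# Toy tokenized caption corpus (illustrative only).
sentences = [["igneous", "rock", "with", "coarse", "grained", "texture"],
             ["sedimentary", "rock", "with", "visible", "layering"]]
w2v = Word2Vec(sentences, vector_size=256, min_count=1)

# Build a word index (0 reserved for padding) and the embedding matrix.
vocab = {w: i + 1 for i, w in enumerate(w2v.wv.index_to_key)}
matrix = np.zeros((len(vocab) + 1, 256))
for word, idx in vocab.items():
    matrix[idx] = w2v.wv[word]

# Frozen embedding layer initialized from the word2vec vectors.
embedding = layers.Embedding(
    input_dim=len(vocab) + 1, output_dim=256,
    embeddings_initializer=tf.keras.initializers.Constant(matrix),
    trainable=False)
```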
The result of this study is a geological rock image captioning model: a deep learning ensemble consisting of an image extraction model, a text extraction model, and a word generation model. The image extraction model is an engineered CNN architecture, in both regular and separable variants. The text extraction model applies word embedding, using word2vec as the encoding. The word generation model employs either the LSTM model or the Transformers model. The output of the deep learning ensemble is a fully connected (FC) layer, whose units are processed with the softmax function to obtain the word with the highest probability. The sequence is then decoded with a greedy search or beam search algorithm to obtain sentence semantics that relate to the geological rock image.
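A minimal greedy-search decoding loop of the kind described, where the FC-plus-softmax output is read word by word until an end token, might look as follows. Here `model`, `tokenizer`, and the start/end token names are assumptions for illustration; beam search would keep the top-k partial sequences at each step instead of the single argmax.

```python
import numpy as np

def greedy_decode(model, photo_feat, tokenizer, max_len=30):
    """Generate a caption one word at a time, always taking the
    softmax argmax; beam search would track the top-k sequences."""
    words = ["startseq"]                      # assumed start-of-caption token
    for _ in range(max_len):
        seq = tokenizer.texts_to_sequences([" ".join(words)])[0]
        seq = np.pad(seq, (max_len - len(seq), 0))[None, :]   # pre-pad to MAX_LEN
        probs = model.predict([photo_feat, seq], verbose=0)[0]
        word = tokenizer.index_word[int(np.argmax(probs))]    # softmax argmax
        if word == "endseq":                  # assumed end-of-caption token
            break
        words.append(word)
    return " ".join(words[1:])
```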
The experiments produced a range of captioning results across the proposed geological rock image captioning architectures. The performance of the image captioning models was evaluated using BLEU and ROUGE-L scores, showing that the predicted captions are accurate and that the predicted words appear in the references. Visual Attention (VaT) combining separable CNN and Semantic Transformers (SeTrans) achieved BLEU-1 = 0.908, BLEU-2 = 0.877, BLEU-3 = 0.750, and BLEU-4 = 0.510. VaT combining separable CNN and LSTM (SemATT) achieved BLEU-1 = 0.933, BLEU-2 = 0.843, BLEU-3 = 0.743, and BLEU-4 = 0.542. The SemATT model, a combination of separable CNN and LSTM, was confirmed to outperform the VaT-SeTrans model and previously engineered models. In addition, the SemATT model also exceeds the VGG16-LSTM-word2vec model, the VGG16-LSTM-Att (Bahdanau) model, and the VGG16-LSTM-Att (Luong) model.
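BLEU-1 through BLEU-4 as reported above can be computed with NLTK's corpus_bleu by adjusting the n-gram weights; the reference and hypothesis below are illustrative only. ROUGE-L can be computed analogously with a package such as rouge-score.

```python
from nltk.translate.bleu_score import corpus_bleu

# One reference caption per image (illustrative tokens, not the dataset).
references = [[["porphyritic", "andesite", "with", "plagioclase", "phenocrysts"]]]
hypotheses = [["porphyritic", "andesite", "with", "plagioclase"]]

for n in range(1, 5):
    # BLEU-n weights only the first n n-gram orders equally.
    weights = tuple(1.0 / n for _ in range(n)) + (0.0,) * (4 - n)
    print(f"BLEU-{n}: {corpus_bleu(references, hypotheses, weights=weights):.3f}")
```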
After various experiments combining the VGG16, ResNet50, InceptionV3, and VaT (Xception) models with LSTM, the VaT (Xception) and LSTM model was confirmed to exceed the VGG16, ResNet50, and InceptionV3 models. Linearly stacked convolution layers tend to reduce the extracted feature information, so important features can be lost as the network deepens. The Xception model, with 36 layers, is able to retain the features needed for the captioning process. On the language generation side, LSTM is able to produce words that relate to the words generated before them. Transformers have an advantage when the number of sentences or words is very large; in the problem studied, there were only 397 unique words across 4,215 sentences. There thus remain opportunities to continue this research, in both image recognition and language generation.
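The parameter economy of the depthwise separable convolutions used by Xception, compared with regular convolutions, can be illustrated with a quick Keras comparison; the tensor shapes below are arbitrary.

```python
from tensorflow.keras import Input, Model, layers

x = Input(shape=(64, 64, 128))
regular = Model(x, layers.Conv2D(256, 3, padding="same")(x))
separable = Model(x, layers.SeparableConv2D(256, 3, padding="same")(x))

print("regular:  ", regular.count_params())    # ~295k parameters
print("separable:", separable.count_params())  # ~34k parameters
```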
format
Dissertations
author
Nursikuwagus, Agus