AUTOMATIC TEXT GENERATION OF GEOLOGICAL ROCK IMAGES WITH DETECTION OF SEMANTIC ATTENTION (SEMATT) RELATIONS BETWEEN OBJECTS
Main Author: | Nursikuwagus, Agus
---|---
Format: | Dissertations
Language: | Indonesia
Online Access: | https://digilib.itb.ac.id/gdl/view/76854
Institution: | Institut Teknologi Bandung
id
id-itb.:76854
institution
Institut Teknologi Bandung
building
Institut Teknologi Bandung Library
continent
Asia
country
Indonesia
content_provider
Institut Teknologi Bandung
collection
Digital ITB
language
Indonesia
description
Image captioning combines computer vision and natural language processing (NLP) tasks: it translates an image into text, i.e., a caption. The computer vision task performs image recognition and feature extraction, while the NLP task generates text using the recurrent neural network (RNN) method, which semantically predicts words from image features. Image captioning has been studied by various researchers using the MSCOCO and Flickr datasets. This study differs in its objects, namely geological rock objects, which are not represented in the MSCOCO and Flickr datasets. The domain is limited to the geology of rocks, and the image descriptions are validated directly by geologists. An image captioning model combines a convolutional neural network (CNN) with a language model such as long short-term memory (LSTM), Transformers, or an attention model. The language model was redeveloped by adding attention methods and, subsequently, Transformers. Image recognition is carried out using both regular CNN and separable CNN methods. The identified objects, in the form of image features, are interpreted by the language model to predict words corresponding to those features. This interpretation is grounded in the images and in reference descriptions of geological rock images, so that the resulting semantics agree with the geologist's interpretation.
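As a concrete illustration of this CNN-plus-language-model combination, the following is a minimal sketch of a merge-style encoder-decoder captioning model in Keras. The layer sizes, vocabulary size, and caption length are illustrative assumptions, not the dissertation's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 5000   # assumed vocabulary size
MAX_LEN = 30        # assumed maximum caption length
EMBED_DIM = 256     # assumed embedding/feature width

# Encoder: a pretrained CNN (Xception, one of the backbones studied),
# pooled to a single feature vector per image.
cnn = tf.keras.applications.Xception(include_top=False, pooling="avg")
image_in = layers.Input(shape=(299, 299, 3))  # standard Xception input size
img_feat = layers.Dense(EMBED_DIM, activation="relu")(cnn(image_in))

# Decoder: word embedding + LSTM over the partial caption.
caption_in = layers.Input(shape=(MAX_LEN,))
emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(caption_in)
seq_feat = layers.LSTM(256)(emb)

# Merge image and text features, then predict the next word with softmax.
merged = layers.add([img_feat, seq_feat])
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(merged)

model = Model(inputs=[image_in, caption_in], outputs=next_word)
```

Training such a model pairs each image with every prefix of its reference caption and asks the softmax layer to predict the next word of that prefix.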
Machine learning for image captioning consists of two parts, namely the encoder and the decoder. The encoder performs recognition on images, while the decoder employs the language model as a word generator, receiving as input the combined image feature extraction and word embedding. This study examines both parts, exploring the CNN method as the encoder and the LSTM method as the decoder. The study of the word generation model began with the LSTM method augmented with an attention method, then continued by exploring the Transformers method. The semantic attention approach emphasizes the decoder, with the intention of obtaining words that correspond semantically to the image features. Models developed on MSCOCO and Flickr, such as the Szegedy and Karpathy models, have not been able to predict rock geology captions that are close to the reference.
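The attention mechanism added to the LSTM decoder can be sketched as Bahdanau-style additive attention over spatial CNN features. The sketch below is a generic formulation with assumed dimensions, not the dissertation's exact SemATT layer.

```python
import tensorflow as tf
from tensorflow.keras import layers

class BahdanauAttention(layers.Layer):
    """Additive attention: scores each spatial location of the CNN
    feature map against the decoder's current hidden state."""
    def __init__(self, units):
        super().__init__()
        self.w_feat = layers.Dense(units)    # projects CNN features
        self.w_hidden = layers.Dense(units)  # projects decoder state
        self.v = layers.Dense(1)             # scalar score per location

    def call(self, features, hidden):
        # features: (batch, locations, feat_dim); hidden: (batch, hid_dim)
        hidden = tf.expand_dims(hidden, 1)
        scores = self.v(tf.nn.tanh(self.w_feat(features) + self.w_hidden(hidden)))
        weights = tf.nn.softmax(scores, axis=1)        # attention over locations
        context = tf.reduce_sum(weights * features, axis=1)
        return context, weights   # context vector feeds the next LSTM step
```

At each decoding step the context vector re-weights the image regions, which is what lets the generated word attend to the semantically relevant part of the rock image.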
The contribution of this research is to combine state-of-the-art image captioning architectures, pairing the CNN approach with the LSTM method or Transformers, to generate captions for geological rock images. For identifying background image features, the research contributes the proposed semantic attention (SemATT) architecture. Ensemble architectures combining CNN methods with language models were the target of the experiments. On the language model side, the LSTM method, attention methods, and Transformers served as the experimental text generation models. The proposed word embedding method is also examined as an encoding step to improve the performance of feature encoding. The use of input image sizes of 224x224 and 229x229 pixels is also evaluated.
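One way to realize the proposed word2vec encoding is to train word vectors on the caption corpus and load them into the decoder's embedding layer. The following sketch uses gensim with a toy corpus and illustrative dimensions; the real corpus would be the geologist-validated captions.

```python
import numpy as np
import tensorflow as tf
from gensim.models import Word2Vec
from tensorflow.keras import layers

# Toy tokenized caption corpus (illustrative only).
sentences = [["igneous", "rock", "with", "coarse", "grained", "texture"],
             ["sedimentary", "rock", "with", "visible", "layering"]]
w2v = Word2Vec(sentences, vector_size=256, min_count=1)

# Build a word index (0 reserved for padding) and the embedding matrix.
vocab = {w: i + 1 for i, w in enumerate(w2v.wv.index_to_key)}
matrix = np.zeros((len(vocab) + 1, 256))
for word, idx in vocab.items():
    matrix[idx] = w2v.wv[word]

# Frozen embedding layer initialized from the word2vec vectors.
embedding = layers.Embedding(
    input_dim=len(vocab) + 1, output_dim=256,
    embeddings_initializer=tf.keras.initializers.Constant(matrix),
    trainable=False)
```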
The result of this study is a geological rock image captioning model: a deep learning ensemble consisting of an image extraction model, a text extraction model, and a word generation model. The image extraction model is an engineered CNN architecture, in both regular and separable variants. The text extraction model applies word embedding, using word2vec as the encoding. The word generation model employs either the LSTM model or the Transformers model. The output of the deep learning ensemble is a fully connected (FC) layer, whose units are processed with the softmax function to obtain the word with the highest probability. The sequence is then decoded with a greedy search or beam search algorithm to obtain sentence semantics that relate to the geological rock image.
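A minimal greedy-search decoding loop of the kind described, where the FC-plus-softmax output is read word by word until an end token, might look as follows. Here `model`, `tokenizer`, and the start/end token names are assumptions for illustration; beam search would keep the top-k partial sequences at each step instead of the single argmax.

```python
import numpy as np

def greedy_decode(model, photo_feat, tokenizer, max_len=30):
    """Generate a caption one word at a time, always taking the
    softmax argmax; beam search would track the top-k sequences."""
    words = ["startseq"]                      # assumed start-of-caption token
    for _ in range(max_len):
        seq = tokenizer.texts_to_sequences([" ".join(words)])[0]
        seq = np.pad(seq, (max_len - len(seq), 0))[None, :]   # pre-pad to MAX_LEN
        probs = model.predict([photo_feat, seq], verbose=0)[0]
        word = tokenizer.index_word[int(np.argmax(probs))]    # softmax argmax
        if word == "endseq":                  # assumed end-of-caption token
            break
        words.append(word)
    return " ".join(words[1:])
```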
The experiments produced a range of captioning results across the proposed geological rock image captioning architectures. The performance of the image captioning models was evaluated using BLEU and ROUGE-L scores, showing that the predicted captions are accurate and that the predicted words appear in the references. Visual Attention (VaT) combining separable CNN and Semantic Transformers (SeTrans) achieved BLEU-1 = 0.908, BLEU-2 = 0.877, BLEU-3 = 0.750, and BLEU-4 = 0.510. VaT combining separable CNN and LSTM (SemATT) achieved BLEU-1 = 0.933, BLEU-2 = 0.843, BLEU-3 = 0.743, and BLEU-4 = 0.542. The SemATT model, a combination of separable CNN and LSTM, was confirmed to outperform the VaT-SeTrans model and previously engineered models. In addition, the SemATT model also exceeds the VGG16-LSTM-word2vec model, the VGG16-LSTM-Att (Bahdanau) model, and the VGG16-LSTM-Att (Luong) model.
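BLEU-1 through BLEU-4 as reported above can be computed with NLTK's corpus_bleu by adjusting the n-gram weights; the reference and hypothesis below are illustrative only. ROUGE-L can be computed analogously with a package such as rouge-score.

```python
from nltk.translate.bleu_score import corpus_bleu

# One reference caption per image (illustrative tokens, not the dataset).
references = [[["porphyritic", "andesite", "with", "plagioclase", "phenocrysts"]]]
hypotheses = [["porphyritic", "andesite", "with", "plagioclase"]]

for n in range(1, 5):
    # BLEU-n weights only the first n n-gram orders equally.
    weights = tuple(1.0 / n for _ in range(n)) + (0.0,) * (4 - n)
    print(f"BLEU-{n}: {corpus_bleu(references, hypotheses, weights=weights):.3f}")
```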
After various experiments combining the VGG16, ResNet50, InceptionV3, and VaT (Xception) models with LSTM, the VaT (Xception) and LSTM model was confirmed to exceed the VGG16, ResNet50, and InceptionV3 models. Linearly stacked convolution layers tend to reduce the extracted feature information, so important features can be lost as the network deepens. The Xception model, with 36 layers, is able to retain the features needed for the captioning process. On the language generation side, LSTM is able to produce words that relate to the words generated before them. Transformers have an advantage when the number of sentences or words is very large; in the problem studied, there were only 397 unique words across 4,215 sentences. There thus remain opportunities to continue this research, in both image recognition and language generation.
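The parameter economy of the depthwise separable convolutions used by Xception, compared with regular convolutions, can be illustrated with a quick Keras comparison; the tensor shapes below are arbitrary.

```python
from tensorflow.keras import Input, Model, layers

x = Input(shape=(64, 64, 128))
regular = Model(x, layers.Conv2D(256, 3, padding="same")(x))
separable = Model(x, layers.SeparableConv2D(256, 3, padding="same")(x))

print("regular:  ", regular.count_params())    # ~295k parameters
print("separable:", separable.count_params())  # ~34k parameters
```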
format
Dissertations
author
Nursikuwagus, Agus