EXTRACTIVE SUMMARIZATION WITH SENTENCE-BERT TEXT ENCODER AND REINFORCEMENT LEARNING FOR INDONESIAN LANGUAGE TEXT
Main Author: Denaya Rahadika Diana, Kadek
Format: Theses
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/78021
Institution: Institut Teknologi Bandung
id
id-itb.:78021
spelling
id-itb.:78021 2023-09-15T22:01:48Z EXTRACTIVE SUMMARIZATION WITH SENTENCE-BERT TEXT ENCODER AND REINFORCEMENT LEARNING FOR INDONESIAN LANGUAGE TEXT Denaya Rahadika Diana, Kadek Indonesia Theses IndoSBERT, extractive text summarization, REFRESH, reinforcement learning, sentence embeddings INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/78021 text
institution
Institut Teknologi Bandung
building
Institut Teknologi Bandung Library
continent
Asia
country
Indonesia
content_provider
Institut Teknologi Bandung
collection
Digital ITB
language
Indonesia
description
In an era of rapid technological advancement, the abundance of information on the internet can lead to information overload. To address this problem, the research field of text summarization has developed with the goal of extracting the essence of a document. Neural approaches to extractive text summarization based on deep learning currently outperform other methods. However, there is a discrepancy between the model's training objective, cross-entropy loss, and the evaluation metric, ROUGE. To bridge this gap, some studies have employed reinforcement learning; one such approach, REFRESH, incorporates ROUGE directly as a component of the objective function during training. However, the REFRESH implementation uses a CNN as its sentence encoder, which suffers from a long-distance dependency problem. This research therefore proposes Sentence-BERT (SBERT) as an alternative sentence encoder to replace the CNN in REFRESH. SBERT is a transformer-based model that produces semantically meaningful vector representations of sentences and can address long-distance dependency issues. However, no SBERT model currently exists for the Indonesian language.
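The objective mismatch described above can be made concrete with a small sketch: a supervised extractor minimizes cross-entropy against per-sentence labels, while a REFRESH-style objective samples an extract, scores it with ROUGE, and uses that score as a REINFORCE reward. This is a minimal illustration, not the thesis code; the tensor shapes and the externally computed `rouge_reward` value are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_entropy_loss(sentence_logits, gold_labels):
    # Supervised objective: per-sentence binary labels
    # (1 = sentence belongs in the summary). This is the objective
    # that ROUGE evaluation does NOT directly optimize.
    return F.binary_cross_entropy_with_logits(sentence_logits, gold_labels)

def reinforce_loss(sentence_logits, rouge_reward):
    # REINFORCE-style objective in the spirit of REFRESH:
    # sample an extract from the model's distribution, score it with
    # ROUGE against the reference summary (done outside this function),
    # and scale the sample's log-likelihood by that reward.
    probs = torch.sigmoid(sentence_logits)
    sample = torch.bernoulli(probs)  # 0/1 inclusion decision per sentence
    log_prob = (sample * torch.log(probs + 1e-8)
                + (1 - sample) * torch.log(1 - probs + 1e-8)).sum()
    # rouge_reward is assumed to be e.g. the ROUGE-1 F1 of the sampled
    # extract versus the gold summary, computed by the caller.
    return -rouge_reward * log_prob
```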
This research makes two main contributions. First, a dedicated sentence embedding model for Indonesian, called IndoSBERT, was developed. It was trained with a Siamese network architecture on the Semantic Textual Similarity (STS) task and produces semantically meaningful representations of Indonesian sentences. This model replaces the CNN in REFRESH's sentence encoder to avoid the long-distance dependency problem. Second, REFRESH was developed for the Indonesian language, leveraging the newly created IndoSBERT.
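As an illustration of the Siamese STS training setup described above, the following sketch uses the sentence-transformers library. The base checkpoint (`indobenchmark/indobert-base-p1`), the toy sentence pairs, and all hyperparameters are assumptions; the thesis's actual training corpus and configuration are not given in this record.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Hypothetical Indonesian base checkpoint; sentence-transformers wraps
# a plain transformer with mean pooling when loading it this way.
model = SentenceTransformer("indobenchmark/indobert-base-p1")

# Toy STS pairs with similarity labels normalized to [0, 1].
train_examples = [
    InputExample(texts=["Kucing itu tidur di sofa.",
                        "Seekor kucing sedang tidur di sofa."], label=0.9),
    InputExample(texts=["Dia pergi ke pasar.",
                        "Harga saham turun hari ini."], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# CosineSimilarityLoss runs both sentences through the same (Siamese)
# encoder and regresses their cosine similarity onto the STS label.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```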
The IndoSBERT model outperforms the IndoBERT model and several multilingual models on the Semantic Textual Similarity task, achieving a Spearman rank correlation score of 0.856. Additionally, in the evaluation of the REFRESH model, using IndoSBERT as the sentence encoder yields higher ROUGE scores than using a CNN: IndoSBERT-REFRESH achieves a ROUGE-1 score of 0.324, compared with 0.273 for CNN-REFRESH.
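The two reported evaluations, Spearman rank correlation for STS and ROUGE-1 for summarization, can be outlined as follows. This is a hedged sketch: the checkpoint path, sentence pairs, gold scores, and summary strings are placeholders, and the exact ROUGE implementation the thesis used for Indonesian text is not specified in this record (the rouge_score tokenizer shown here is English-oriented).

```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util
from rouge_score import rouge_scorer

model = SentenceTransformer("path/to/indosbert")  # hypothetical checkpoint

# STS evaluation: Spearman correlation between predicted cosine
# similarities and human similarity judgments (placeholder data).
pairs = [("kalimat a1", "kalimat b1"), ("kalimat a2", "kalimat b2")]
gold = [4.2, 1.0]
pred = [util.cos_sim(model.encode(a), model.encode(b)).item() for a, b in pairs]
corr, _ = spearmanr(gold, pred)
print(f"Spearman rank correlation: {corr:.3f}")

# Summarization evaluation: ROUGE-1 F1 of a system extract against
# the reference summary.
scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=False)
score = scorer.score("reference summary text", "system summary text")
print(f"ROUGE-1 F1: {score['rouge1'].fmeasure:.3f}")
```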
format
Theses
author
Denaya Rahadika Diana, Kadek |
title
EXTRACTIVE SUMMARIZATION WITH SENTENCE-BERT TEXT ENCODER AND REINFORCEMENT LEARNING FOR INDONESIAN LANGUAGE TEXT
url
https://digilib.itb.ac.id/gdl/view/78021 |
_version_
1822008450528313344 |