EXTRACTIVE SUMMARIZATION WITH SENTENCE-BERT TEXT ENCODER AND REINFORCEMENT LEARNING FOR INDONESIAN LANGUAGE TEXT

In an era of rapid technological advancement, the abundance of information on the internet can lead to information overload. To address this issue, the research field of text summarization has developed with the goal of extracting the essence of a document. Neural approaches to extractive summarization based on deep learning currently outperform other methods, but there is a mismatch between the model's training objective, cross-entropy loss, and the evaluation metric, ROUGE. To bridge this gap, some studies employ reinforcement learning; REFRESH, for example, incorporates ROUGE directly as a component of the objective function during training. The original REFRESH implementation, however, suffers from a long-distance dependency problem caused by its use of a CNN as the sentence encoder. This research therefore proposes SentenceBERT (SBERT) as an alternative sentence encoder to replace the CNN in REFRESH. SBERT is a transformer-based model that produces semantically meaningful vector representations of sentences and can handle long-distance dependencies. At present, however, no SBERT model exists for the Indonesian language.

This research makes two main contributions. First, it develops a dedicated sentence embedding model for Indonesian, called IndoSBERT, trained with a siamese network architecture on the Semantic Textual Similarity (STS) task so that it produces semantically meaningful representations of Indonesian sentences. This model replaces the CNN sentence encoder in REFRESH, avoiding the long-distance dependency problem. Second, it adapts REFRESH to Indonesian, leveraging the newly created IndoSBERT.

IndoSBERT outperforms IndoBERT and several multilingual models on the STS task, with a Spearman rank correlation of 0.856. In the evaluation of the REFRESH model, using IndoSBERT as the sentence encoder also yields higher ROUGE scores than using a CNN: IndoSBERT-REFRESH achieves a ROUGE-1 score of 0.324, compared with 0.273 for CNN-REFRESH.
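The record contains no code, but the siamese (bi-encoder) STS training setup the abstract describes can be sketched with the sentence-transformers library. The base checkpoint, the training pairs, and the hyperparameters below are illustrative assumptions, not the author's actual configuration:

```python
# Minimal sketch of SBERT-style siamese fine-tuning for STS.
# Checkpoint name, example pairs, and hyperparameters are assumptions.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Bi-encoder: a BERT backbone plus mean pooling over token embeddings.
word_emb = models.Transformer("indobenchmark/indobert-base-p1")  # assumed backbone
pooling = models.Pooling(word_emb.get_word_embedding_dimension(),
                         pooling_mode="mean")
model = SentenceTransformer(modules=[word_emb, pooling])

# STS sentence pairs with similarity labels normalized to [0, 1].
train_examples = [
    InputExample(texts=["Kucing itu tidur di sofa.",          # "The cat sleeps on the sofa."
                        "Seekor kucing sedang tidur di sofa."],  # "A cat is sleeping on the sofa."
                 label=0.9),
    InputExample(texts=["Dia pergi ke pasar.",                 # "She went to the market."
                        "Harga saham naik hari ini."],           # "Stock prices rose today."
                 label=0.1),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Cosine-similarity regression loss: both sentences pass through the SAME
# encoder (shared siamese weights), and cosine(u, v) is regressed to the label.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_loader, train_loss)],
          epochs=1, warmup_steps=100)
```

The shared weights are what make the architecture siamese: one encoder embeds both sentences, so similar sentences are pulled toward nearby vectors regardless of which side of the pair they appear on.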

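REFRESH trains the extractor with a REINFORCE-style objective in which the ROUGE score of a sampled extract serves as the reward, which is how ROUGE enters the training objective directly. A simplified sketch of that policy-gradient step follows; the per-sentence scorer output and the rouge_reward helper are hypothetical stand-ins, and independent per-sentence inclusion is a simplification of the paper's candidate-set sampling:

```python
import torch

def reinforce_step(sent_scores, sentences, reference, rouge_reward):
    """One policy-gradient term in the spirit of REFRESH (simplified).

    sent_scores  -- (n_sents,) unnormalized extractor scores for one document
    sentences    -- list of the document's sentences, aligned with sent_scores
    reference    -- gold summary text
    rouge_reward -- hypothetical callable(candidate, reference) -> float in [0, 1],
                    e.g. a mean of ROUGE-1/2/L F-scores
    """
    probs = torch.sigmoid(sent_scores)            # P(sentence i is extracted)
    dist = torch.distributions.Bernoulli(probs)   # independent inclusion (simplified)
    sample = dist.sample()                        # sampled 0/1 extraction mask

    candidate = " ".join(s for s, keep in zip(sentences, sample.tolist()) if keep)
    reward = rouge_reward(candidate, reference)   # ROUGE enters the objective here

    # REINFORCE: minimize -E[reward]; the gradient estimate is
    # -reward * grad log P(sampled extract).
    return -(reward * dist.log_prob(sample).sum())
```

Because the reward is the (non-differentiable) ROUGE score itself, the gradient flows only through the log-probability of the sampled extract, which is exactly the mismatch-bridging trick the abstract attributes to REFRESH.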

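The reported STS figure (0.856) is a Spearman rank correlation between predicted cosine similarities and gold similarity labels. A short sketch of that evaluation, assuming a trained SentenceTransformer and parallel lists of sentence pairs and gold scores (illustrative names):

```python
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

def sts_spearman(model: SentenceTransformer, pairs, gold_scores):
    """Spearman rank correlation between cosine similarities and gold STS scores."""
    left = model.encode([a for a, _ in pairs])
    right = model.encode([b for _, b in pairs])
    # Row-wise cosine similarity between paired embeddings.
    cos = np.sum(left * right, axis=1) / (
        np.linalg.norm(left, axis=1) * np.linalg.norm(right, axis=1))
    return spearmanr(cos, gold_scores).correlation
```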
Bibliographic Details
Main Author: Denaya Rahadika Diana, Kadek
Format: Theses
Language: Indonesian
Online Access:https://digilib.itb.ac.id/gdl/view/78021
Institution: Institut Teknologi Bandung
Subjects: IndoSBERT, extractive text summarization, REFRESH, reinforcement learning, sentence embeddings