EXTRACTIVE SUMMARIZATION WITH SENTENCE-BERT TEXT ENCODER AND REINFORCEMENT LEARNING FOR INDONESIAN LANGUAGE TEXT
Main Author: Denaya Rahadika Diana, Kadek
Format: Theses
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/78021
Institution: Institut Teknologi Bandung
id
id-itb.:78021
spelling
id-itb.:78021 2023-09-15T22:01:48Z EXTRACTIVE SUMMARIZATION WITH SENTENCE-BERT TEXT ENCODER AND REINFORCEMENT LEARNING FOR INDONESIAN LANGUAGE TEXT Denaya Rahadika Diana, Kadek Indonesia Theses IndoSBERT, extractive text summarization, REFRESH, reinforcement learning, sentence embeddings INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/78021 text
institution
Institut Teknologi Bandung
building
Institut Teknologi Bandung Library
continent
Asia
country
Indonesia
content_provider
Institut Teknologi Bandung
collection
Digital ITB
language
Indonesia
description
In an era of rapid technological advancement, the abundance of information on the internet can lead to information overload. To address this problem, the research field of text summarization has developed with the goal of extracting the essence of a document. Neural approaches to extractive text summarization based on deep learning currently outperform other methods. However, there is a discrepancy between the model's training objective, cross-entropy loss, and the evaluation metric, ROUGE. To bridge this gap, some studies have employed reinforcement learning; one such approach, REFRESH, incorporates ROUGE directly as a component of the objective function during training. However, the REFRESH implementation uses a CNN as its sentence encoder, which suffers from a long-distance dependency problem. This research therefore proposes Sentence-BERT (SBERT) as an alternative sentence encoder to replace the CNN in REFRESH. SBERT is a transformer-based model that produces semantically meaningful vector representations of sentences and can address long-distance dependency issues. However, no SBERT model currently exists for the Indonesian language.
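The objective mismatch described above can be made concrete with a small sketch: a supervised extractor minimizes cross-entropy against per-sentence labels, while a REFRESH-style objective samples an extract, scores it with ROUGE, and uses that score as a REINFORCE reward. This is a minimal illustration, not the thesis code; the tensor shapes and the externally computed `rouge_reward` value are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_entropy_loss(sentence_logits, gold_labels):
    # Supervised objective: per-sentence binary labels
    # (1 = sentence belongs in the summary). This is the objective
    # that ROUGE evaluation does NOT directly optimize.
    return F.binary_cross_entropy_with_logits(sentence_logits, gold_labels)

def reinforce_loss(sentence_logits, rouge_reward):
    # REINFORCE-style objective in the spirit of REFRESH:
    # sample an extract from the model's distribution, score it with
    # ROUGE against the reference summary (done outside this function),
    # and scale the sample's log-likelihood by that reward.
    probs = torch.sigmoid(sentence_logits)
    sample = torch.bernoulli(probs)  # 0/1 inclusion decision per sentence
    log_prob = (sample * torch.log(probs + 1e-8)
                + (1 - sample) * torch.log(1 - probs + 1e-8)).sum()
    # rouge_reward is assumed to be e.g. the ROUGE-1 F1 of the sampled
    # extract versus the gold summary, computed by the caller.
    return -rouge_reward * log_prob
```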
This research makes two main contributions. First, a dedicated sentence embedding model for Indonesian, called IndoSBERT, was developed. It was trained with a Siamese network architecture on the Semantic Textual Similarity (STS) task and produces semantically meaningful representations of Indonesian sentences. This model replaces the CNN in REFRESH's sentence encoder to avoid the long-distance dependency problem. Second, REFRESH was developed for the Indonesian language, leveraging the newly created IndoSBERT.
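As an illustration of the Siamese STS training setup described above, the following sketch uses the sentence-transformers library. The base checkpoint (`indobenchmark/indobert-base-p1`), the toy sentence pairs, and all hyperparameters are assumptions; the thesis's actual training corpus and configuration are not given in this record.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Hypothetical Indonesian base checkpoint; sentence-transformers wraps
# a plain transformer with mean pooling when loading it this way.
model = SentenceTransformer("indobenchmark/indobert-base-p1")

# Toy STS pairs with similarity labels normalized to [0, 1].
train_examples = [
    InputExample(texts=["Kucing itu tidur di sofa.",
                        "Seekor kucing sedang tidur di sofa."], label=0.9),
    InputExample(texts=["Dia pergi ke pasar.",
                        "Harga saham turun hari ini."], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# CosineSimilarityLoss runs both sentences through the same (Siamese)
# encoder and regresses their cosine similarity onto the STS label.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```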
The IndoSBERT model outperforms the IndoBERT model and several multilingual models on the Semantic Textual Similarity task, achieving a Spearman rank correlation score of 0.856. Additionally, in the evaluation of the REFRESH model, using IndoSBERT as the sentence encoder yields higher ROUGE scores than using a CNN: IndoSBERT-REFRESH achieves a ROUGE-1 score of 0.324, compared with 0.273 for CNN-REFRESH.
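The two reported evaluations, Spearman rank correlation for STS and ROUGE-1 for summarization, can be outlined as follows. This is a hedged sketch: the checkpoint path, sentence pairs, gold scores, and summary strings are placeholders, and the exact ROUGE implementation the thesis used for Indonesian text is not specified in this record (the rouge_score tokenizer shown here is English-oriented).

```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util
from rouge_score import rouge_scorer

model = SentenceTransformer("path/to/indosbert")  # hypothetical checkpoint

# STS evaluation: Spearman correlation between predicted cosine
# similarities and human similarity judgments (placeholder data).
pairs = [("kalimat a1", "kalimat b1"), ("kalimat a2", "kalimat b2")]
gold = [4.2, 1.0]
pred = [util.cos_sim(model.encode(a), model.encode(b)).item() for a, b in pairs]
corr, _ = spearmanr(gold, pred)
print(f"Spearman rank correlation: {corr:.3f}")

# Summarization evaluation: ROUGE-1 F1 of a system extract against
# the reference summary.
scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=False)
score = scorer.score("reference summary text", "system summary text")
print(f"ROUGE-1 F1: {score['rouge1'].fmeasure:.3f}")
```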
format
Theses
author
Denaya Rahadika Diana, Kadek |
title
EXTRACTIVE SUMMARIZATION WITH SENTENCE-BERT TEXT ENCODER AND REINFORCEMENT LEARNING FOR INDONESIAN LANGUAGE TEXT
url
https://digilib.itb.ac.id/gdl/view/78021 |
_version_
1822008450528313344 |