EXTRACTIVE SUMMARIZATION WITH SENTENCE-BERT TEXT ENCODER AND REINFORCEMENT LEARNING FOR INDONESIAN LANGUAGE TEXT
Main Author: | |
---|---|
Format: | Theses |
Language: | Indonesian
Online Access: | https://digilib.itb.ac.id/gdl/view/78021 |
Institution: | Institut Teknologi Bandung |
Summary: | In the era of rapid technological advancement, the abundance of information on the internet can lead to information overload. To address this issue, the research domain of text summarization has developed with the goal of extracting the essence of a document. Currently, neural network approaches to extractive text summarization using deep learning have surpassed other methods. However, there is a discrepancy between the model's training objective, i.e., cross-entropy loss, and the evaluation metric, i.e., ROUGE. To bridge this gap, some studies have employed reinforcement learning, such as REFRESH, which directly incorporates ROUGE as a component of the objective function during model training. However, the implementation of REFRESH suffers from a long-distance dependency problem caused by its use of a CNN as the sentence encoder. This research therefore proposes SentenceBERT (SBERT) as an alternative sentence encoder to replace the CNN in REFRESH. SBERT is a transformer-based model that produces semantically meaningful vector representations of sentences and can address long-distance dependency issues. However, no SBERT model currently exists for the Indonesian language.
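The core idea behind REFRESH is to optimize the evaluation metric directly with a policy-gradient (REINFORCE-style) objective: the extractor samples a set of sentences, the resulting summary is scored with ROUGE against the reference, and that score weights the log-likelihood of the sampled extraction. The sketch below illustrates this in PyTorch with a simplified unigram-overlap ROUGE-1 reward; the function names, Bernoulli sampling scheme, and tensor shapes are illustrative assumptions, not the thesis's actual code.

```python
import torch

def rouge1_f(candidate_tokens, reference_tokens):
    """Simplified unigram-overlap ROUGE-1 F1, for illustration only."""
    cand, ref = set(candidate_tokens), set(reference_tokens)
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r)

def reinforce_loss(sent_logits, sentences, reference):
    """sent_logits: (num_sentences,) unnormalized extraction scores."""
    probs = torch.sigmoid(sent_logits)       # P(include sentence i)
    sample = torch.bernoulli(probs)          # sampled 0/1 extraction labels
    # Log-probability of the sampled labels under the current policy.
    log_prob = (sample * torch.log(probs + 1e-8)
                + (1 - sample) * torch.log(1 - probs + 1e-8)).sum()
    picked = [s for s, z in zip(sentences, sample.tolist()) if z > 0.5]
    # The ROUGE reward of the sampled summary scales the policy gradient.
    reward = rouge1_f(" ".join(picked).split(), reference.split())
    return -reward * log_prob                # minimize = ascend on reward
```

Because the reward is a non-differentiable function of the sampled summary, it enters the loss only as a scalar weight on the log-probability, which is what lets ROUGE drive training despite not being differentiable itself.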
This research presents two main contributions. First, a specialized sentence embedding model for the Indonesian language, called IndoSBERT, was developed. It was trained with a siamese network architecture on the Semantic Textual Similarity (STS) task and produces semantically meaningful representations of Indonesian sentences. This model replaces the CNN sentence encoder in REFRESH to avoid long-distance dependency problems. Second, REFRESH was developed for the Indonesian language, leveraging the newly created IndoSBERT.
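As a sketch of the siamese fine-tuning described above: the sentence-transformers library trains a single weight-shared encoder on labelled sentence pairs with a cosine-similarity loss, which is the standard SBERT recipe for STS. The base checkpoint and the toy STS pair below are assumptions for illustration; this abstract does not name the exact base model or training data.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumed base checkpoint; the abstract does not state which Indonesian
# BERT IndoSBERT was fine-tuned from.
model = SentenceTransformer("indobenchmark/indobert-base-p1")

# Toy STS pair: the label is a similarity score normalized to [0, 1].
train_examples = [
    InputExample(texts=["Kucing itu tidur di sofa.",
                        "Seekor kucing sedang tidur di sofa."], label=0.9),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Siamese setup: the same encoder embeds both sentences, and the loss
# pushes their cosine similarity toward the gold label.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(loader, train_loss)], epochs=1, warmup_steps=10)
```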
The IndoSBERT model shows improved performance on the Semantic Textual Similarity task compared to the IndoBERT model and several multilingual models, with a Spearman rank correlation of 0.856. Additionally, in the evaluation of the REFRESH model, using IndoSBERT as the sentence encoder yields higher ROUGE scores than using a CNN: IndoSBERT-REFRESH achieves a ROUGE-1 score of 0.324, compared to 0.273 for CNN-REFRESH.
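For context on the 0.856 figure, Spearman rank correlation on STS is typically computed by embedding each test pair, scoring it with cosine similarity, and rank-correlating those scores with the gold labels, roughly as in the sketch below. The checkpoint path and the toy pairs are hypothetical stand-ins, not the thesis's test set.

```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

# Hypothetical fine-tuned checkpoint path and toy gold similarity labels.
model = SentenceTransformer("path/to/indosbert")
pairs = [
    ("Dia membaca buku.", "Dia sedang membaca sebuah buku."),
    ("Hujan turun deras.", "Harga saham naik tajam."),
    ("Anak itu bermain bola.", "Seorang anak bermain sepak bola."),
]
gold = [0.95, 0.05, 0.90]

# Embed both sides of each pair and score them with cosine similarity.
emb_a = model.encode([a for a, _ in pairs])
emb_b = model.encode([b for _, b in pairs])
preds = [float(util.cos_sim(a, b)) for a, b in zip(emb_a, emb_b)]

rho, _ = spearmanr(gold, preds)
print(f"Spearman rank correlation: {rho:.3f}")
```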