APPLICATION OF BERTOPIC ON BERTSUM MODEL FOR INDONESIAN ABSTRACTIVE TEXT SUMMARIZATION
Main Author:
Format: Theses
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/86170
Institution: Institut Teknologi Bandung
Summary: Text summarization is a field in natural language processing that aims to generate
summaries by removing redundant information, creating a shorter version while
retaining essential content. Text summarization can be divided into two types:
extractive and abstractive. This research focuses on abstractive summarization,
which generates summaries by rephrasing the main information in a style different
from the original text.
In recent years, neural-based models, particularly those using transformer
architectures, have gained popularity in text summarization. One of the leading
models is BERTSum, which uses BERT as an encoder to obtain feature
representations from the document. BERTSum has been shown to outperform other
abstractive text summarization models.
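As a rough illustration of this architecture, the sketch below pairs a pretrained BERT encoder with a transformer decoder via the HuggingFace EncoderDecoderModel API. The IndoBERT checkpoint and the decoder initialization are assumptions for illustration only, not the exact configuration used in this thesis (the original BERTSum trains its decoder from scratch).

```python
# Minimal sketch of a BERTSum-style setup: a pretrained BERT encoder feeding
# a transformer decoder. Assumes the HuggingFace `transformers` library; the
# IndoBERT checkpoint is illustrative, not the thesis's exact model.
from transformers import BertTokenizer, EncoderDecoderModel

checkpoint = "indobenchmark/indobert-base-p1"  # assumed Indonesian BERT
tokenizer = BertTokenizer.from_pretrained(checkpoint)

# Encoder is pretrained BERT; here the decoder is initialized from the same
# weights for simplicity (BERTSum proper uses a freshly initialized decoder).
model = EncoderDecoderModel.from_encoder_decoder_pretrained(checkpoint, checkpoint)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("Contoh dokumen berita berbahasa Indonesia.", return_tensors="pt")
summary_ids = model.generate(**inputs, max_length=48)
# Output is meaningless before fine-tuning; this only shows the wiring.
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```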
Evaluating abstractive text summarization requires a specially designed dataset
with a high proportion of novel n-grams in the summary labels, indicating that the
summary content is unique and original. Therefore, this study uses the XL-Sum
Indonesia dataset, which has the highest percentage of novel n-grams among
Indonesian summarization datasets, as the basis for developing an abstractive
text summarization model.
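The "novel n-gram" criterion can be made concrete. The sketch below measures the fraction of summary n-grams that never appear in the source article; the helper names are hypothetical, and the HuggingFace dataset ID "csebuetnlp/xlsum" is an assumption about where the XL-Sum Indonesian split is hosted.

```python
# Illustrative abstractiveness measure: the share of summary n-grams absent
# from the source document. Helper names are hypothetical; the dataset ID
# "csebuetnlp/xlsum" is assumed for the XL-Sum Indonesian split.
from datasets import load_dataset

def ngram_set(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_ratio(document, summary, n=2):
    doc = ngram_set(document.lower().split(), n)
    summ = ngram_set(summary.lower().split(), n)
    return len(summ - doc) / len(summ) if summ else 0.0

ds = load_dataset("csebuetnlp/xlsum", "indonesian", split="test")
ratios = [novel_ngram_ratio(ex["text"], ex["summary"]) for ex in ds]
print(f"mean novel bigram ratio: {sum(ratios) / len(ratios):.2%}")
```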
Research on BERT for Indonesian text summarization remains limited. Previous
studies comparing Indonesian BERT and English BERT concluded that the
performance of English BERT is superior to that of Indonesian BERT. In theory,
Indonesian BERT should perform better, since it is pretrained on an Indonesian
corpus. This study investigates the reasons behind this phenomenon.
Transformer-based models like BERTSum have clear advantages, but they are
limited in the number of tokens they can accept as input. BERT has a limit of
512 tokens, which often results in document truncation and the loss of important
information, especially in long documents. Topic modeling, which identifies hidden
topics in documents, can help address this issue by capturing the document's global
semantics. The combination of topic models and transformers can improve the
model's understanding of the entire document.
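The truncation problem is easy to observe directly with a standard BERT tokenizer; the multilingual checkpoint below is a generic choice for illustration, not the thesis's tokenizer.

```python
# Demonstrates BERT's 512-token input cap: tokens beyond the limit are
# silently dropped. Assumes the HuggingFace `transformers` tokenizer API.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
long_document = " ".join(["kata"] * 2000)  # far beyond the 512-token limit

encoding = tokenizer(long_document, truncation=True, max_length=512)
print(len(encoding["input_ids"]))  # 512 -> everything after this is lost
```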
The TEMA method (Topic Embedding with Masked Attention) combines topic
embeddings derived from topic distributions with a masked attention mechanism to
generate topic-guided summaries. TEMA has been shown to enhance model
performance and can be further improved with higher-quality topics. BERTopic is
a topic modeling method that uses BERT-based semantic representations to
generate high-quality and interpretable topics. BERTopic has advantages over
traditional topic models as it leverages embeddings from transformer models,
producing richer and more contextual document representations.
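A sketch of how topic embeddings might be obtained from BERTopic for such a combination is shown below. The multilingual sentence encoder and the per-document topic-vector lookup are plausible assumptions for illustration, not necessarily the pipeline used in this thesis.

```python
# Hedged sketch: fit BERTopic and read off topic embeddings that a
# TEMA-style mechanism could consume. The sentence-encoder checkpoint and
# the indexing convention are assumptions, not the thesis's exact pipeline.
from bertopic import BERTopic
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

docs = load_dataset("csebuetnlp/xlsum", "indonesian", split="train[:2000]")["text"]

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
topic_model = BERTopic(embedding_model=embedder)
topics, probs = topic_model.fit_transform(docs)

# One vector per discovered topic; a document's topic embedding can be taken
# as the vector of its assigned topic (row 0 holds the -1 outlier topic,
# hence the +1 offset -- an assumption about BERTopic's layout).
topic_vectors = topic_model.topic_embeddings_
doc0_topic_vec = topic_vectors[topics[0] + 1]
```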
This study aims to adapt the BERTSum model for abstractive summarization of
Indonesian text using the XL-Sum Indonesia dataset.
Additionally, this research evaluates the impact of the TEMA method on BERTSum's
performance by utilizing topic embeddings from the BERTopic model to overcome
the input token limitation.
The findings of this study show that the BERTSum model using Indonesian BERT
achieves better performance than English BERT after optimization. BERTSum's
performance also improves with the application of the TEMA method, especially on
shorter texts, although its performance fluctuates on texts longer than 1000 words.
The use of topic embeddings from BERTopic provides better results than
conventional topic models. The proposed model achieves a ROUGE-1 score of
25.39%, ROUGE-2 of 9.16%, and ROUGE-L of 20.61% on the XL-Sum Indonesia
dataset, with an average improvement of 4.71% over the baseline model.
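For reference, ROUGE F1 scores of this kind can be computed with the rouge-score package, one common implementation; the thesis may use a different one, and the example strings below are placeholders rather than outputs from the proposed model.

```python
# Illustrative ROUGE-1/2/L computation with the `rouge-score` package; the
# example strings are placeholders, not outputs of the thesis's model.
from rouge_score import rouge_scorer

reference = "presiden meresmikan jalan tol baru di jawa barat"
generated = "jalan tol baru di jawa barat diresmikan presiden"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
scores = scorer.score(reference, generated)
for name, s in scores.items():
    print(f"{name}: {s.fmeasure:.2%}")
```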