APPLICATION OF BERTOPIC ON BERTSUM MODEL FOR INDONESIAN ABSTRACTIVE TEXT SUMMARIZATION

Bibliographic Details
Main Author: Satya Putra Mahendra, Farrel
Format: Theses
Language: Indonesian
Online Access:https://digilib.itb.ac.id/gdl/view/86170
Institution: Institut Teknologi Bandung
Description
Summary: Text summarization is a field of natural language processing that aims to generate a shorter version of a document by removing redundant information while retaining the essential content. It can be divided into two types: extractive and abstractive. This research focuses on abstractive summarization, which generates summaries by rephrasing the main information in a style different from the original text.

In recent years, neural models, particularly those built on transformer architectures, have become popular for text summarization. One of the leading models is BERTSum, which uses BERT as an encoder to obtain feature representations of the document, and which has been shown to outperform other abstractive summarization models. Evaluating abstractive summarization requires a dataset whose summary labels contain a high proportion of novel n-grams, indicating that the summaries are genuinely rephrased rather than copied from the source. This study therefore uses the XL-Sum Indonesia dataset, which has the highest percentage of novel n-grams among Indonesian datasets, as the basis for developing an abstractive summarization model.

Research on BERT for Indonesian text summarization remains limited. Previous studies comparing Indonesian BERT with English BERT concluded that English BERT performs better, even though Indonesian BERT should in principle be superior, since it is pretrained on an Indonesian corpus. This study investigates the reasons behind this phenomenon.

Transformer-based models such as BERTSum also have a limitation on input length: BERT accepts at most 512 tokens, so long documents are often truncated and important information is lost. Topic modeling, which identifies hidden topics in documents, can help address this by capturing a document's global semantics, and combining topic models with transformers can improve the model's understanding of the entire document. The TEMA method (Topic Embedding with Masked Attention) combines topic embeddings derived from topic distributions with a masked attention mechanism to generate topic-aware summaries; it has been shown to improve model performance and can be improved further with higher-quality topics. BERTopic is a topic modeling method that uses BERT-based semantic representations to produce high-quality, interpretable topics. Unlike traditional topic models, it leverages embeddings from transformer models, yielding richer and more contextual document representations.

This study adapts BERTSum for abstractive summarization of Indonesian text using the XL-Sum Indonesia dataset, and evaluates the impact of the TEMA method on BERTSum's performance by using topic embeddings from BERTopic to mitigate the input token limitation. The findings show that BERTSum with Indonesian BERT achieves better performance than English BERT after optimization. BERTSum's performance also improves with the TEMA method, especially on shorter texts, although it fluctuates on texts longer than 1,000 words. Topic embeddings from BERTopic give better results than those from conventional topic models.
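To make the topic-embedding idea above concrete, the sketch below fits BERTopic on XL-Sum Indonesia articles and turns each document's soft topic distribution into a fixed-size vector by probability-weighted averaging of the learned topic vectors. This is a minimal illustration under stated assumptions, not the thesis's implementation: the Hugging Face dataset id, the multilingual embedding model, and the weighted-average construction are all assumptions, and the TEMA-style masked-attention integration with BERTSum is not shown.

```python
# Minimal sketch: per-document topic embeddings from BERTopic.
# Assumptions (not from the thesis): the dataset id, the embedding model,
# and the weighted-average construction below.
# pip install bertopic sentence-transformers datasets
import numpy as np
from bertopic import BERTopic
from datasets import load_dataset

# XL-Sum's Indonesian split on the Hugging Face Hub; a subset keeps the
# sketch fast to run.
docs = load_dataset("csebuetnlp/xlsum", "indonesian", split="train")["text"][:5000]

# calculate_probabilities=True yields a soft topic distribution per document.
topic_model = BERTopic(
    embedding_model="paraphrase-multilingual-MiniLM-L12-v2",  # assumed choice
    calculate_probabilities=True,
)
topics, probs = topic_model.fit_transform(docs)

# topic_embeddings_ stores one vector per topic; when the outlier topic (-1)
# is present it occupies the first row, while probs covers only real topics.
# Adjust the slice if no outlier topic was found.
topic_vecs = np.asarray(topic_model.topic_embeddings_)[1:]

# Assumed step: a document-level topic embedding as the probability-weighted
# mean of topic vectors, suitable for injection into a summarizer (cf. TEMA).
doc_topic_emb = probs @ topic_vecs
print(doc_topic_emb.shape)  # (n_docs, embedding_dim)
```

A weighted average is only one way to collapse a topic distribution into a single vector; TEMA as described above additionally applies a masked attention mechanism over topic embeddings inside the summarizer, which this sketch does not attempt to reproduce.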
The proposed model achieves a ROUGE-1 score of 25.39%, ROUGE-2 of 9.16%, and ROUGE-L of 20.61% on the XL-Sum Indonesia dataset, with an average improvement of 4.71% over the baseline model.
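For reference, per-pair ROUGE scores of the kind aggregated above are commonly computed with Google's rouge-score package, as in the sketch below. The two Indonesian strings are placeholder examples, not data from the thesis; the reported dataset-level numbers would come from averaging F1 scores over the XL-Sum Indonesia test split.

```python
# Minimal sketch of ROUGE-1/2/L scoring with the rouge-score package
# (pip install rouge-score); the example strings are placeholders.
from rouge_score import rouge_scorer

# Stemming is disabled: the package's Porter stemmer targets English.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)

reference = "pemerintah mengumumkan kebijakan ekonomi baru"   # gold summary
candidate = "pemerintah merilis kebijakan ekonomi terbaru"    # model output

scores = scorer.score(reference, candidate)
for name, s in scores.items():
    # Dataset-level results are typically the mean F1 over all test pairs.
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")
```

ROUGE's default tokenization splits on non-alphanumeric characters, which works reasonably for Indonesian's Latin script without any language-specific preprocessing.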