MULTI-DOCUMENT SUMMARIZATION USING SEMANTIC ROLE LABELING AND LINEAR REGRESSION FOR INDONESIAN NEWS ARTICLE

Bibliographic Details
Main Author: Yumna Khairunnisa, Nisrina
Format: Final Project
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/56248
Institution: Institut Teknologi Bandung
Description
Summary: Automatic summarization for Indonesian news articles needs further development as the volume of online news grows. An extractive summarization system for Indonesian news articles was previously developed using semantic role labeling (SRL) to produce predicate argument structures (PAS) and a decision tree model to predict each sentence's salience score. However, sentence label inconsistencies were found in the decision tree's training dataset. As an alternative, a linear regression model can be trained with each sentence's ROUGE score against the reference summary as the target, which allows the training dataset to be annotated automatically. In addition, a sentence fusion based summarization system for Indonesian news articles was developed to produce semi-abstractive summaries. This thesis investigates the impact of PAS-to-document and PAS-to-document-set features, and of linear regression trained with automatically annotated data, on the SRL and semantic graph based summarization system. It also examines the effect of sentence fusion on summary quality. The summarization system in this final project uses either a decision tree or linear regression to predict sentence salience scores, together with sentence fusion. The decision tree is trained with additional manually annotated data. The linear regression model is trained with data annotated automatically from ROUGE scores. Both models use 13 features derived from PAS-to-document and PAS-to-document-set relationships. Sentence fusion generates new sentences from groups of similar sentences identified by clustering.
The experiments aim to investigate the impact of increasing the training data size for the decision tree, determine the best model for predicting sentence salience scores, determine the best linkage configuration for the PAS-to-document and PAS-to-document-set similarity features, determine the optimal feature set, determine the effect of the title feature, and determine the clustering parameter. The best model achieves an average ROUGE-2 recall of 0.2471 and 0.3026 for summaries of 100 and 200 words, respectively.
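
The summary reports ROUGE-2 recall and describes labeling training sentences automatically by their ROUGE score against a reference summary. The following is a minimal sketch of that idea, not the thesis implementation: the function names, the whitespace tokenization, and the choice of ROUGE-2 recall as the regression target are all assumptions made for illustration (the record does not state which ROUGE variant was used as the target).

```python
from collections import Counter


def bigrams(tokens):
    """Count the adjacent word pairs in a token list."""
    return Counter(zip(tokens, tokens[1:]))


def rouge2_recall(candidate, reference):
    """Fraction of reference bigrams that also occur in the candidate.

    Assumption: lowercased whitespace tokenization; a real system would
    use a proper Indonesian tokenizer.
    """
    cand = bigrams(candidate.lower().split())
    ref = bigrams(reference.lower().split())
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clipped overlap: a reference bigram is matched at most as many
    # times as it appears in the candidate.
    overlap = sum(min(count, cand[bg]) for bg, count in ref.items())
    return overlap / total


def annotate(sentences, reference):
    """Pair each sentence with its score, usable as a regression target."""
    return [(s, rouge2_recall(s, reference)) for s in sentences]
```

Scores produced this way could serve as target values when fitting a linear regression over the 13 sentence features, which is what removes the need for manual labels.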