AUTOMATIC GUIDED SUMMARIZATION BASED ON RHETORICAL CATEGORY FOR SCIENTIFIC PAPERS

Bibliographic Details
Main Author: Haitan Rachman, Ghoziyah
Format: Dissertation
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/50036
Institution: Institut Teknologi Bandung
Description
Summary: Guided summarization extracts important information from a document while taking into account the reader's knowledge of previously read documents. In this study, a guided summarizer was constructed for the scientific-paper domain using rhetorical categorization. It produces two main components: an initial summary, built from the collection of papers the reader has already read (Set A) that are linked to the papers to be read (Set B), and an update summary, built from the Set B papers themselves. The summarizer was developed because no existing summarization of scientific papers accounts for the reader's knowledge of a collection of previously read papers, so existing summaries do not distinguish information in papers that have been read from topically related information in papers yet to be read. This research makes two contributions: 1) identification of topics shared between papers that have been read (Set A) and papers to be read (Set B) through categorization of citation sentences, and 2) guided summarization of scientific papers using a collection of rhetorical-category building plans tailored to the topical relevance of citation sentences between papers.

Topic relevance between papers is identified by categorizing citation sentences into five classes: 'Problem' (sentences describing problems or weaknesses/gaps in other studies), 'UseModel' (use of models, techniques, or methods from other studies), 'UseTool' (use of tools, algorithms, or software from other studies), 'UseData' (use of data from other studies), and 'Other' (sentences that fit none of the other categories). The highest F-measure on the training data is obtained with a Support Vector Machine combined with the SMOTE technique for handling the imbalanced dataset. On the test data, this method correctly classifies 905 of 1,153 citation sentences (78.5%), indicating that roughly 78 out of 100 papers to be read (Set B) can be linked to previously read papers (Set A) through their citation categories.

The guided summary is then constructed using a collection of rhetorical-category building plans as its aspect structure. The rhetorical categories used in the summary are 'AIM_NOV' (aim and novelty), 'OWN_CONC_RES_FAIL' (conclusions about results or failures), 'MTHD_USE' (methods), and 'DATA' (data). Sentences are selected using Maximal Marginal Relevance and then passed through a surface-repair step. The first evaluation uses ROUGE, comparing system summaries against manual summaries: after surface repair, the F-measure of the initial summary (Set A) increases from 0.419 to 0.464, and for the update summary (Set B) more than 50% of the topic terms, nouns (NN) and adjectives (JJ), overlap with the manual summary. The second evaluation is a subjective reader assessment via questionnaire. The results show that most readers can separate information from papers that have been read (Set A) and papers that will be read (Set B); however, some readers find the topical relevance between the selected Set A and Set B papers unclear because they fail to capture the topic relatedness in the update summary.
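
To illustrate the classification step, the sketch below shows how citation sentences might be categorized with an SVM while using SMOTE to oversample minority categories. This is a minimal sketch under assumed settings (TF-IDF features, a linear kernel, an 80/20 stratified split), not the thesis's exact pipeline; the function names and parameters are illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.svm import LinearSVC
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

CATEGORIES = ["Problem", "UseModel", "UseTool", "UseData", "Other"]

def train_citation_classifier(sentences, labels):
    # sentences: list of citation-sentence strings; labels: one of CATEGORIES each
    X_train, X_test, y_train, y_test = train_test_split(
        sentences, labels, test_size=0.2, stratify=labels, random_state=42)
    model = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # word/bigram features (assumed)
        ("smote", SMOTE(random_state=42)),               # oversample rare citation categories (fit-time only)
        ("svm", LinearSVC()),                            # linear-kernel SVM (assumed)
    ])
    model.fit(X_train, y_train)
    print("macro F1:", f1_score(y_test, model.predict(X_test), average="macro"))
    return model

Note that the imbalanced-learn Pipeline applies SMOTE only during fitting, so prediction on test data is unaffected by the synthetic samples.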
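Maximal Marginal Relevance selects each next sentence by balancing relevance to a query (here, the rhetorical aspect being filled) against redundancy with sentences already selected. The sketch below is a generic MMR implementation under assumed choices (cosine similarity over TF-IDF vectors, a trade-off weight of 0.7); the thesis's actual similarity measure and parameter settings may differ.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mmr_select(sentences, query, k=5, lambda_=0.7):
    # Vectorize candidate sentences and the query together
    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(sentences + [query])
    sent_vecs, query_vec = tfidf[:-1], tfidf[-1]
    sim_to_query = cosine_similarity(sent_vecs, query_vec).ravel()
    sim_between = cosine_similarity(sent_vecs)
    selected, candidates = [], list(range(len(sentences)))
    while candidates and len(selected) < k:
        # MMR score: lambda * relevance - (1 - lambda) * redundancy
        def mmr(i):
            redundancy = max((sim_between[i][j] for j in selected), default=0.0)
            return lambda_ * sim_to_query[i] - (1 - lambda_) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]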
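The reported evaluation scores come from ROUGE, which measures n-gram overlap between a system summary and a manual reference. A rough ROUGE-1-style F-measure can be computed as below; the naive whitespace tokenization is an assumption, and standard ROUGE toolkits apply more careful preprocessing such as stemming and stopword handling.

from collections import Counter

def rouge1_f(system_summary, reference_summary):
    # Unigram counts for system and reference summaries
    sys_counts = Counter(system_summary.lower().split())
    ref_counts = Counter(reference_summary.lower().split())
    overlap = sum((sys_counts & ref_counts).values())  # clipped matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(sys_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)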