AUTOMATIC GUIDED SUMMARIZATION BASED ON RHETORICAL CATEGORY FOR SCIENTIFIC PAPERS
Main Author: | |
---|---|
Format: | Dissertations |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/50036 |
Institution: | Institut Teknologi Bandung |
Summary: | Guided summarization extracts the important information from a document while
taking into account what the reader already knows from previously read documents.
In this study, guided summarization is built for the scientific paper domain using
rhetorical categorization. The system produces two components: an initial summary
and an update summary. The initial summary is constructed from the Set A papers
(already read) that are linked to the Set B papers. The update summary is built
from the Set B papers (to be read). The work is motivated by the lack of
scientific-paper summarization that takes into account the reader's knowledge of
previously read papers; as a result, summaries do not distinguish information in
topically related papers that have been read from information in those that will be read.
This research makes two contributions: 1) identification of the topics that relate
papers already read (Set A) to papers to be read (Set B), using categorization of
citation sentences; and 2) guided summarization of scientific papers using a
collection of rhetorical-category building plans tailored to the topic relevance
of citation sentences between papers.
Topic relevance between papers is identified by categorizing citation sentences.
The categories are 'Problem' (citation sentences describing problems, weaknesses,
or gaps in other studies), 'UseModel' (citation sentences that use models /
techniques / methods from other studies), 'UseTool' (citation sentences that use
tools / algorithms / software from other studies), 'UseData' (citation sentences
that use data from other studies), and 'Other' (citation sentences that do not fit
any of the other categories). The highest F-measure on the training data is
obtained with a Support Vector Machine combined with SMOTE to handle the
imbalanced dataset. On the test data, this method correctly classifies 905 of
1,153 citation sentences (78.5%). This suggests that about 78 out of 100 papers
to be read (Set B) can be identified as related to papers already read (Set A)
through the citation category.
The guided summarization of scientific papers is then constructed using a
collection of rhetorical-category building plans as the aspect structure. The
rhetorical categories used in the summary are ‘AIM_NOV’ (purpose and novelty),
‘OWN_CONC_RES_FAIL’ (conclusions about results or failures),
‘MTHD_USE’ (method), and ‘DATA’ (data). Sentences are selected for the summary
with Maximal Marginal Relevance and then passed through a surface repair process.
The first evaluation uses ROUGE, comparing the system summaries with manual
summaries. After surface repair, the F-measure of the initial summary (Set A)
increased from 0.419 to 0.464. In addition, the ROUGE results for the update
summary (Set B) show that more than 50% of the topic information, in the form of
NN (noun) and JJ (adjective) terms, overlaps with the manual summary. The second
evaluation is a subjective reader assessment by questionnaire. The results show
that most readers can separate information from papers that have been read (Set A)
from information in papers that will be read (Set B). However, some readers find
the topic relevance between the selected Set A and Set B papers unclear because
they fail to capture the topic relatedness in the update summary.
|
---|
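
The summary names Maximal Marginal Relevance (MMR) as the sentence selection step but gives no implementation details. The following is a minimal sketch of generic greedy MMR, not the author's implementation. The bag-of-words cosine similarity, the trade-off weight `lam`, the summary length `k`, and the function names `cosine` and `mmr_select` are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(sentences, query, k=2, lam=0.7):
    """Greedily pick up to k sentences, trading relevance against redundancy.

    lam (hypothetical default 0.7) weights relevance to the query;
    (1 - lam) weights the penalty for similarity to sentences already chosen.
    """
    vectors = [Counter(s.lower().split()) for s in sentences]
    query_vec = Counter(query.lower().split())
    selected = []                       # indices of chosen sentences, in pick order
    candidates = list(range(len(sentences)))

    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(vectors[i], query_vec)
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)

    return [sentences[i] for i in selected]
```

In this sketch the query could stand in for the topic of a rhetorical category or citation category, so that a near-duplicate of an already selected sentence is penalized even when it is highly relevant.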