GENERATING PARAPHRASE USING SIMULATED ANNEALING FOR SCIENTIFIC PAPER CITATION SENTENCES

Bibliographic Details
Main Author: Ilyas, Ridwan
Format: Dissertations
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/76660
Institution: Institut Teknologi Bandung
Description
Summary: Researchers typically use citation sentences to evaluate other studies through comparison, similarity, or the development of previous results, and as scientific justification for the research being conducted. Writing citation sentences requires alternative wordings that differ in form but keep the same meaning, so a paraphrasing approach is needed.
Paraphrasing involves three major tasks. The first is paraphrase extraction, the process of extracting paraphrased sentence pairs from a set of texts. The second is paraphrase detection, which determines whether two text units are paraphrases of each other. The third is paraphrase generation, the process of creating a new sentence from an input sentence. This research addresses all three tasks in the domain of scientific articles, specifically citation sentences, with paraphrase generation as the primary focus.
Paraphrase extraction is done by extracting citation sentences from papers. A clustering approach groups citation sentences that share the same citation target. By pairing the sentences within it, each cluster becomes a source of candidate pairs for the citation corpus. An expert labels the candidate pairs one by one, producing a corpus of paraphrased and non-paraphrased sentence pairs. The corpus contains 4675 labeled sentence pairs: 2386 paraphrased and 2289 non-paraphrased.
Paraphrase detection uses a formula that assesses the degree of paraphrase between two sentences. The formula is built in two steps. The first step is to select the components, computed over the two sentences, that quantify semantic similarity and lexical difference.
The candidate components in this study were METEOR, Meta Discourse, and PINC score. The second step is to select a formula template to be filled with these components. The candidate templates were Weighted Linear, Harmonic Means 2, and Harmonic Means 3. The best formula found was the Weighted Linear template with METEOR and PINC score as its components: PScore = 0.9 × Meteor + (1 − 0.9) × Pinc.
Paraphrase generation is performed with a simulated annealing algorithm. Simulated annealing was chosen because its search is stochastic and it can escape local minima, allowing the results to converge on the optimum of the objective function. Applying the algorithm to text generation requires an objective function, language resources for the text modification operations, and the operations themselves. The objective function assesses whether each state produced by a sentence operation is better than the previous one. The paraphrase detection part of this research identified the optimal objective function for the data set at hand, so the paraphrase scoring formula is used as the objective function.
The operations that produce new states in the simulated annealing algorithm are carried out at the lexical level: substitution, addition, and deletion. Substitution replaces a text unit in a sentence with another text unit. Addition inserts a new text unit into a sentence. Deletion removes a text unit from a sentence. Substitution and addition require language resources; this study used a word2vec model built from a collection of sentences in scientific papers.
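The objective function and the annealing step described above can be sketched in Python as follows. This is a minimal illustration, not the dissertation's implementation: the function names are hypothetical, PINC follows its usual definition (the average share of candidate n-grams absent from the source), the METEOR value is assumed to be computed elsewhere and passed in, and the Metropolis-style acceptance rule is an assumption about how a worse state might occasionally be accepted.

```python
import math
import random


def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def pinc(source, candidate, max_n=4):
    """PINC: average fraction of candidate n-grams that do not occur in the source.

    Assumption: n-gram orders longer than the candidate are skipped.
    """
    scores = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        if not cand:
            continue  # candidate has fewer than n tokens
        overlap = len(cand & ngrams(source, n))
        scores.append(1.0 - overlap / len(cand))
    return sum(scores) / len(scores) if scores else 0.0


def pscore(meteor, pinc_value, weight=0.9):
    """Weighted linear combination: weight * Meteor + (1 - weight) * Pinc."""
    return weight * meteor + (1.0 - weight) * pinc_value


def accept(current, proposed, temperature, rng=random):
    """Metropolis-style acceptance for maximisation (an assumed rule).

    Better states are always accepted; worse states are accepted with a
    probability that shrinks as the temperature falls.
    """
    if proposed >= current:
        return True
    return rng.random() < math.exp((proposed - current) / temperature)
```

With the weight at 0.9, semantic similarity dominates the score, and lexical novelty contributes only a tenth of it. A candidate that merely copies the source therefore keeps a high score from its METEOR term, while one that changes the meaning is penalised heavily.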
The substitution and addition operations can yield more than one candidate sentence. The best candidate is selected by calculating the probability of its word sequence under an N-gram language model. We propose StoPGEN, which stands for Stochastic Generator, as a method for generating paraphrased citation sentences.
The method was evaluated on two corpus groups: standard corpora (Twitter and Quora) and the citation sentence corpus. It was also compared with other methods: Variational Autoencoder, Lagging Variational Autoencoder, Metropolis-Hastings, Unsupervised Paraphrasing by Simulated Annealing (UPSA), LSTM encoder-decoder, bidirectional LSTM, and Transformer. On the standard corpora, StoPGEN achieved BLEU 6.26, ROUGE-1 28.60, and ROUGE-2 8.75 on the Twitter data set, and BLEU 22.37, ROUGE-1 61.09, and ROUGE-2 40.79 on the Quora data set. All of these scores outperform the other methods. On the citation corpus, StoPGEN achieved BLEU 55.37, ROUGE-1 71.28, ROUGE-2 47.46, and ROUGE-L 66.32.
In addition to the quantitative evaluation, a qualitative evaluation was conducted through surveys on the acceptance of generated sentences. The first survey measured the acceptance of the output of three StoPGEN variants; StoPGEN III obtained the highest acceptance value, 50.96. The second survey compared the acceptability of StoPGEN output with that of UPSA and of UPSA with modified language resources; StoPGEN obtained the highest sentence acceptance rate, 50.80.
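The N-gram candidate selection mentioned in the summary can be illustrated with a small bigram model. This is a sketch under stated assumptions: the summary does not give the N-gram order or the smoothing method, so the bigram order, the add-one smoothing, and all function names here are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over tokenised sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def log_prob(tokens, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a token sequence."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for prev, word in zip(padded, padded[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total


def best_candidate(candidates, unigrams, bigrams):
    """Return the candidate sentence with the highest log-probability."""
    return max(candidates, key=lambda c: log_prob(c, unigrams, bigrams))


# Toy corpus standing in for sentences from scientific papers.
corpus = [
    ["we", "use", "this", "method"],
    ["we", "apply", "this", "method"],
]
unigrams, bigrams = train_bigram(corpus)
candidates = [
    ["we", "use", "this", "method"],
    ["method", "this", "use", "we"],
]
chosen = best_candidate(candidates, unigrams, bigrams)
```

Because raw sequence probability favours shorter sentences, a comparison between candidates of different lengths would likely need length normalisation, for example dividing the log-probability by the number of tokens.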