GENERATING PARAPHRASE USING SIMULATED ANNEALING FOR SCIENTIFIC PAPER CITATION SENTENCES
Main Author: | Ilyas, Ridwan |
---|---|
Format: | Dissertations |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/76660 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:76660 |
---|---|
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
Citation sentences are typically used by researchers to evaluate other studies in terms of
comparisons, similarities, or development of previous results. Citation sentences are employed
as scientific justifications to support the research being conducted. Writing citation
sentences calls for alternative wordings that vary in form while preserving the
same meaning. As a result, paraphrasing is required when writing citation
sentences.
There are three major tasks in paraphrasing. The first is paraphrase extraction, which is the
process of extracting paraphrased sentence pairs from a set of texts. The second is
paraphrase detection, which involves determining whether two text units are
paraphrases of each other. The third is paraphrase generation, which is the process of
creating a new sentence from an input sentence. This research addresses all three tasks in the
domain of scientific articles, particularly citation sentences.
Paraphrase generation is the primary focus of the research.
Paraphrase extraction is done by extracting citation sentences from papers. A clustering
approach is used to group citation sentences that have the same citation target. Pairing
each sentence with the other sentences in its cluster yields the candidate pairs for a citation corpus.
The expert labels the candidate pairs one by one, resulting in a corpus of
paraphrased and non-paraphrased sentence pairs. The corpus contains 4675 labeled sentence
pairs, comprising 2386 paraphrase pairs and 2289 non-paraphrase
pairs.
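The pairing step can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: it assumes each citation sentence already carries an identifier for its citation target, so grouping by that identifier stands in for the clustering step. The function name `candidate_pairs` and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(citations):
    """Build candidate sentence pairs from (target_id, sentence) tuples.

    Assumption: sentences citing the same target share a target_id, so a
    simple group-by replaces the clustering described in the abstract.
    """
    clusters = defaultdict(list)
    for target_id, sentence in citations:
        clusters[target_id].append(sentence)

    pairs = []
    for target_id, sentences in clusters.items():
        # Every unordered pair inside one cluster is a labeling candidate.
        for first, second in combinations(sentences, 2):
            pairs.append((target_id, first, second))
    return pairs


sample = [
    ("paper-A", "Smith et al. proposed a graph-based parser."),
    ("paper-A", "A graph-based parser was introduced by Smith et al."),
    ("paper-A", "We compare against the parser of Smith et al."),
    ("paper-B", "Lee reported similar results."),
]
pairs = candidate_pairs(sample)
```

A cluster of n sentences yields n(n-1)/2 candidates, so the three sentences citing `paper-A` give three pairs and the single sentence citing `paper-B` gives none. Each candidate would then go to the expert for labeling.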
Detection is accomplished by using a formula to assess the degree of paraphrase between
two sentences. The formula is constructed in two steps. The first step is to
select the component measures that quantify the semantic similarity and lexical
difference between two sentences. Meteor, Meta Discourse, and Pinc Score were the candidate components
in this study. The second step is to select a formula template to be filled with
the chosen components. Weighted Linear, Harmonic Means 2 and Harmonic Means 3 are
among the candidate templates considered. According to the study's findings,
the best formula was Weighted Linear, with Meteor
and Pinc Score as its components. The resulting formula is PScore = 0.9 × Meteor +
(1 - 0.9) × Pinc.
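To make the weighted linear formula concrete, the sketch below computes PScore from a Meteor value and a Pinc value. The weight 0.9 comes from the abstract; everything else is an assumption. In particular, `pinc` follows the commonly cited definition of PINC (one minus the fraction of candidate n-grams that also occur in the source, averaged over n-gram orders), which may differ in detail from the variant used in the dissertation, and the Meteor value is taken as an input rather than computed.

```python
def ngram_set(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def pinc(source, candidate, max_n=4):
    """Lexical dissimilarity: 0.0 for a copy, 1.0 for no shared n-grams."""
    src, cand = source.split(), candidate.split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngram_set(cand, n)
        if not cand_ngrams:
            continue  # candidate is shorter than n tokens
        shared = len(cand_ngrams & ngram_set(src, n))
        scores.append(1.0 - shared / len(cand_ngrams))
    return sum(scores) / len(scores) if scores else 0.0


def p_score(meteor, pinc_value, weight=0.9):
    """PScore = weight * Meteor + (1 - weight) * Pinc."""
    return weight * meteor + (1.0 - weight) * pinc_value
```

With this weighting, semantic similarity (Meteor) dominates the score, and lexical difference (Pinc) contributes a small reward for changing the wording.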
Paraphrases are generated with a simulated annealing algorithm. Simulated
annealing was chosen because it searches stochastically and can escape local
optima, allowing the results to converge toward the optimum of the objective function. Several things are required
when employing the algorithm for text generation, including an objective function, language
resources for text modification operations, and methods for applying the operations.
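The general shape of a simulated annealing search is sketched below on a toy numeric problem. This is a generic textbook loop, not the StoPGEN procedure: the function names, the geometric cooling schedule, and the default parameters are all assumptions chosen for illustration.

```python
import math
import random


def simulated_annealing(initial, propose, objective,
                        t0=1.0, cooling=0.95, steps=200, seed=0):
    """Maximize `objective`, returning the best state seen and its score."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    state, score = initial, objective(initial)
    best_state, best_score = state, score
    temperature = t0
    for _ in range(steps):
        candidate = propose(state, rng)
        cand_score = objective(candidate)
        delta = cand_score - score
        # Improvements are always accepted. A worse state is accepted with
        # probability exp(delta / T), which shrinks as the temperature cools;
        # this is what lets the search leave a local optimum early on.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            state, score = candidate, cand_score
            if score > best_score:
                best_state, best_score = state, score
        temperature *= cooling  # geometric cooling (an assumed schedule)
    return best_state, best_score


def toy_objective(x):
    """Toy objective with its maximum at x = 3."""
    return -(x - 3) ** 2


def toy_propose(x, rng):
    """Toy neighbor function: step one unit left or right."""
    return x + rng.choice([-1, 1])


best, best_value = simulated_annealing(0, toy_propose, toy_objective)
```

For paraphrase generation, the state would be a sentence, `propose` would apply a lexical edit, and `objective` would be the paraphrase measurement formula.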
The objective function is needed to assess whether each state resulting from a sentence
operation is better than the previous state. The paraphrase detection
stage already identified the best-performing paraphrase measurement formula for the data set.
Therefore, that formula is used as the objective function.
The operations that produce new states in the simulated annealing algorithm are
carried out at the lexical level. Possible operations include substitution, addition and deletion.
The substitution operation replaces a text unit in a sentence with another text unit. The
addition operation inserts a new text unit into a sentence. The deletion operation removes
a text unit from a sentence. The substitution and addition operations require language
resources. In this study, a word2vec model built from a
collection of sentences in scientific papers serves as the supporting language resource. The substitution and addition
operations can yield more than one candidate sentence. The best
candidate is selected by calculating the probability of its word sequence
under an N-gram language model.
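The three lexical operations and the language-model selection step can be sketched as follows. This is a simplified illustration under stated assumptions: a hand-written `NEIGHBORS` dictionary stands in for word2vec nearest neighbors, and a bigram model with add-one smoothing stands in for the N-gram language model. All names and the tiny training corpus are hypothetical.

```python
import math
from collections import Counter


def substitute(tokens, i, word):
    """Replace the token at position i with `word`."""
    return tokens[:i] + [word] + tokens[i + 1:]


def insert(tokens, i, word):
    """Insert `word` before position i."""
    return tokens[:i] + [word] + tokens[i:]


def delete(tokens, i):
    """Remove the token at position i."""
    return tokens[:i] + tokens[i + 1:]


class BigramLM:
    """Bigram language model with add-one smoothing (an assumed choice)."""

    def __init__(self, corpus):
        self.unigrams, self.bigrams = Counter(), Counter()
        for sentence in corpus:
            toks = ["<s>"] + sentence.split() + ["</s>"]
            self.unigrams.update(toks)
            self.bigrams.update(zip(toks, toks[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, tokens):
        toks = ["<s>"] + tokens + ["</s>"]
        return sum(
            math.log((self.bigrams[(a, b)] + 1)
                     / (self.unigrams[a] + self.vocab_size))
            for a, b in zip(toks, toks[1:])
        )


# Hypothetical stand-in for word2vec nearest neighbors.
NEIGHBORS = {"method": ["approach", "banana"]}

lm = BigramLM([
    "we propose a new approach",
    "this approach works well",
    "the method works",
])
tokens = "we propose a new method".split()
candidates = [substitute(tokens, 4, w) for w in NEIGHBORS["method"]]
best = max(candidates, key=lm.log_prob)  # most probable word sequence
```

Here the language model prefers the candidate ending in "approach" because that word sequence was seen in the training sentences, while "banana" never was.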
We propose the StoPGEN method, which stands for Stochastic Generator, as a method of
generating paraphrases of citation sentences. The method was evaluated
on two corpus groups: standard corpora (Twitter and Quora) and the
citation sentence corpus. It was also compared with other methods,
namely Variational Autoencoder, Lagging Variational Autoencoder, Metropolis-Hastings,
Unsupervised Paraphrasing by Simulated Annealing (UPSA), LSTM encoder-decoder, bidirectional LSTM and
Transformer. On the standard corpora, StoPGEN achieved a BLEU value of
6.26, ROUGE-1 28.60 and ROUGE-2 8.75 on the Twitter data set, and a BLEU
value of 22.37, ROUGE-1 61.09 and ROUGE-2 40.79 on the Quora data set. All of these scores
outperform the other methods. On the citation corpus, StoPGEN achieved BLEU
55.37, ROUGE-1 71.28, ROUGE-2 47.46 and ROUGE-L 66.32.
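For readers unfamiliar with the metrics above, the sketch below shows how a ROUGE-N style score can be computed from n-gram overlap. It is a bare-bones illustration that assumes whitespace tokenization and no stemming, so it will not reproduce the reported numbers, and it is not the evaluation script used in the dissertation. The function names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Return (precision, recall, f1) based on clipped n-gram overlap."""
    ref = ngram_counts(reference.split(), n)
    cand = ngram_counts(candidate.split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Reported scores such as ROUGE-1 28.60 are usually these values multiplied by 100 and averaged over a test set.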
In addition to the quantitative evaluation, this research also conducted a qualitative evaluation
through surveys on the acceptability of the generated
sentences. The first survey measured the acceptance of the output
of three variants of the StoPGEN method. StoPGEN III obtained the highest
acceptance value, 50.96. The second survey compared the
acceptability of the output of StoPGEN with that of UPSA and of UPSA with
modified language resources. StoPGEN obtained the highest sentence
acceptance rate, 50.80. |
format |
Dissertations |
author |
Ilyas, Ridwan |
title |
GENERATING PARAPHRASE USING SIMULATED ANNEALING FOR SCIENTIFIC PAPER CITATION SENTENCES |
keywords |
generation of paraphrases for citation sentences, simulated annealing algorithm, paraphrase measurement formula, paraphrase corpus of citation sentences |
timestamp |
2023-08-17T07:49:55Z |