PARAPHRASE DETECTION FOR INDONESIAN SENTENCE PAIRS USING ABSTRACT MEANING REPRESENTATION AND LATENT SEMANTIC ANALYSIS
Main Author: | |
---|---|
Format: | Final Project |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/85008 |
Institution: | Institut Teknologi Bandung |
Summary: | Paraphrase detection is a classification task in natural language processing with
the goal of determining whether two sentences are paraphrases or not. Abstract
Meaning Representation (AMR) is a semantic representation of sentences that can
represent sentences with different syntax but the same meaning as isomorphic
AMR graphs. These characteristics make AMR a suitable
choice for paraphrase detection. The Indonesian paraphrase detection
model in the previous study showed unsatisfactory results, with an F1 score of 0.682
on the validation data, compared with an F1 score of 0.900 for English.
This shortcoming is attributed to the Indonesian paraphrase detection model using
only features based on similarity scores, whereas related research on English uses
Latent Semantic Analysis (LSA)-based features combined with AMR. This Final
Project aims to compare models trained with similarity score-based features and
LSA-based features. The models used in this research are Support Vector Machine,
XGBoost, Random Forest, and LightGBM. To address the shortcomings of previous
research, the approach of Issa et al. (2018) was reimplemented and applied to Indonesian
sentences. The first experiment tested the effect of adding the Paraphrase
Adversaries from Word Scrambling (PAWS) dataset on the validation F1 score, and showed
that the additional dataset added noise and made it difficult for the model to learn
patterns. The second experiment tested similarity score-based features, while the
third experiment tested LSA-based features. The XGBoost model with
the Jaccard score feature computed on the TF representation achieved the best validation
F1 score of 0.685, with F1 scores of 0.683 on the automatically translated test data and
0.670 on the manually translated test data. The combination of LSA-based features
with AMR yielded a lower validation F1 score than similarity score-based
features in Indonesian paraphrase detection. |
---|
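
The summary names a Jaccard score over a TF representation as the best-performing feature but does not say how it is computed. The sketch below shows one plausible reading, a weighted Jaccard score over raw term counts of two sentences. The whitespace tokenizer, the function names `term_frequencies` and `weighted_jaccard`, and the example sentences are assumptions made for illustration; the thesis may compute the score over AMR-derived terms rather than surface tokens.

```python
from collections import Counter


def term_frequencies(sentence: str) -> Counter:
    """Lowercase, split on whitespace, and count terms (a simplistic TF representation)."""
    return Counter(sentence.lower().split())


def weighted_jaccard(tf_a: Counter, tf_b: Counter) -> float:
    """Weighted Jaccard: sum of per-term minima divided by sum of per-term maxima."""
    terms = set(tf_a) | set(tf_b)
    if not terms:
        # Two empty sentences share no evidence; define the score as 0.0 (an assumption).
        return 0.0
    overlap = sum(min(tf_a[t], tf_b[t]) for t in terms)
    total = sum(max(tf_a[t], tf_b[t]) for t in terms)
    return overlap / total


# Hypothetical Indonesian sentence pair, used only to exercise the function.
tf_a = term_frequencies("saya makan nasi goreng")
tf_b = term_frequencies("saya makan mie goreng")
score = weighted_jaccard(tf_a, tf_b)  # 3 shared terms out of 5 distinct terms
```

In a pipeline like the one summarized above, a score of this kind would be one input column for a classifier such as XGBoost, with the paraphrase label as the target.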