PARAPHRASE DETECTION FOR INDONESIAN SENTENCE PAIRS USING ABSTRACT MEANING REPRESENTATION AND LATENT SEMANTIC ANALYSIS

Bibliographic Details
Main Author: Muhammad Muflich, Faiz
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/85008
Institution: Institut Teknologi Bandung
Description
Summary: Paraphrase detection is a classification task in natural language processing whose goal is to determine whether two sentences are paraphrases of each other. Abstract Meaning Representation (AMR) is a sentence-level semantic representation in which sentences with different syntax but the same meaning can be represented by isomorphic AMR graphs. This property makes AMR a suitable choice for paraphrase detection. The Indonesian paraphrase detection model in the previous study produced unsatisfactory results, with an F1 score of 0.682 on the validation data, compared with an F1 score of 0.900 for English. The shortfall is attributed to the Indonesian model using only features based on similarity scores, whereas related research on English combines Latent Semantic Analysis (LSA) based features with AMR. This Final Project compares models trained on similarity score-based features with models trained on LSA-based features. The models used are Support Vector Machine, XGBoost, Random Forest, and LightGBM. To address the shortcomings of the previous research, the approach of Issa et al. (2018) was reimplemented and applied to Indonesian sentences. The first experiment tested the effect of adding the Paraphrase Adversaries from Word Scrambling (PAWS) dataset on the validation F1 score; it showed that the additional dataset added noise and made it harder for the models to learn patterns. The second experiment tested similarity score-based features, and the third tested LSA-based features. The XGBoost model with the Jaccard score feature computed on the TF representation achieved the best validation F1 score of 0.685, with F1 scores of 0.683 on the automatically translated test data and 0.670 on the manually translated test data. In Indonesian paraphrase detection, the combination of LSA-based features with AMR yielded a lower validation F1 score than the similarity score-based features.
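
For readers unfamiliar with the two quantities named in the summary, the following is a minimal sketch of a Jaccard score over term-frequency (TF) counts and of the F1 score. It is an illustration under stated assumptions, not the thesis implementation: the whitespace tokeniser, the weighted (min/max) form of the Jaccard score, and the function names `term_frequencies`, `tf_jaccard`, and `f1_score` are all hypothetical choices, and the record does not say whether the TF representation is built from raw tokens or from AMR concepts.

```python
from collections import Counter


def term_frequencies(sentence: str) -> Counter:
    """Count lowercase whitespace-separated tokens.

    Assumption: a plain whitespace tokeniser stands in for whatever
    preprocessing (or AMR concept extraction) the thesis actually used.
    """
    return Counter(sentence.lower().split())


def tf_jaccard(sentence_a: str, sentence_b: str) -> float:
    """Weighted Jaccard score: sum of minimum counts / sum of maximum counts."""
    tf_a = term_frequencies(sentence_a)
    tf_b = term_frequencies(sentence_b)
    vocabulary = set(tf_a) | set(tf_b)
    if not vocabulary:
        return 0.0  # two empty sentences: define the score as 0
    overlap = sum(min(tf_a[t], tf_b[t]) for t in vocabulary)
    total = sum(max(tf_a[t], tf_b[t]) for t in vocabulary)
    return overlap / total


def f1_score(y_true, y_pred) -> float:
    """F1 for the positive class (label 1 = paraphrase)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Two short Indonesian sentences sharing two of four distinct tokens.
    print(tf_jaccard("saya makan nasi", "saya makan roti"))
    print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

In a setup like the one described, a score such as `tf_jaccard` would be computed per sentence pair and passed as a feature to a classifier (for example XGBoost), whose predictions are then evaluated with the F1 score.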