PARAPHRASE DETECTION FOR INDONESIAN SENTENCE PAIRS USING ABSTRACT MEANING REPRESENTATION AND LATENT SEMANTIC ANALYSIS

Bibliographic Details
Main Author:	Muhammad Muflich, Faiz
Format:	Final Project
Language:	Indonesian
Online Access:	https://digilib.itb.ac.id/gdl/view/85008
Institution:	Institut Teknologi Bandung

id	id-itb.:85008
spelling	id-itb.:85008; 2024-08-19T12:57:10Z; Muhammad Muflich, Faiz; Indonesia; Final Project; paraphrase detection, Latent Semantic Analysis, Abstract Meaning Representation, XGBoost; INSTITUT TEKNOLOGI BANDUNG; https://digilib.itb.ac.id/gdl/view/85008; text
institution	Institut Teknologi Bandung
building	Institut Teknologi Bandung Library
continent	Asia
country	Indonesia
content_provider	Institut Teknologi Bandung
collection	Digital ITB
language	Indonesian
description	Paraphrase detection is a classification task in natural language processing that determines whether two sentences are paraphrases of each other. Abstract Meaning Representation (AMR) is a sentence-level semantic representation in which sentences with different syntax but the same meaning map to isomorphic AMR graphs, which makes AMR well suited to paraphrase detection. In a previous study, the Indonesian paraphrase detection model reached an F1 score of only 0.682 on the validation data, compared with 0.900 for English. This shortcoming is attributed to the Indonesian model using only similarity-score features, whereas the related English research combines Latent Semantic Analysis (LSA) based features with AMR. This Final Project compares models trained on similarity-score features with models trained on LSA-based features. The models used are Support Vector Machine, XGBoost, Random Forest, and LightGBM. To address the shortcomings of the previous research, the approach of Issa et al. (2018) was reimplemented and applied to Indonesian sentences. The first experiment tested the effect of adding the Paraphrase Adversaries from Word Scrambling (PAWS) dataset on the validation F1 score and showed that the additional data introduced noise and made it harder for the model to learn patterns. The second experiment tested similarity-score features, and the third tested LSA-based features. The best result came from the XGBoost model with the Jaccard score feature on the TF representation, which achieved a validation F1 score of 0.685, an F1 score of 0.683 on the automatically translated test data, and an F1 score of 0.670 on the manually translated test data. In Indonesian paraphrase detection, combining LSA-based features with AMR yielded a lower validation F1 score than the similarity-score features.
format	Final Project
author	Muhammad Muflich, Faiz
title	PARAPHRASE DETECTION FOR INDONESIAN SENTENCE PAIRS USING ABSTRACT MEANING REPRESENTATION AND LATENT SEMANTIC ANALYSIS
url	https://digilib.itb.ac.id/gdl/view/85008
_version_	1822998874940243968
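
The description names a Jaccard score computed on a TF representation as the best-performing feature. The record does not say how that score is defined, so the following is only a minimal sketch under stated assumptions: TF is taken to mean raw term counts over whitespace tokens, and the score is the generalized (weighted) Jaccard similarity. The function names and the sentence pair are hypothetical and are not taken from the Final Project.

```python
from collections import Counter


def term_frequencies(sentence: str) -> Counter:
    """Lowercase the sentence, split on whitespace, and count each token.

    Assumption: the project's actual tokenization and TF weighting may differ.
    """
    return Counter(sentence.lower().split())


def weighted_jaccard(a: Counter, b: Counter) -> float:
    """Generalized Jaccard: sum of per-term minima over sum of per-term maxima."""
    terms = set(a) | set(b)
    if not terms:
        return 0.0  # two empty sentences: define the similarity as 0
    overlap = sum(min(a[t], b[t]) for t in terms)  # missing terms count as 0
    total = sum(max(a[t], b[t]) for t in terms)
    return overlap / total


# Hypothetical Indonesian sentence pair, for illustration only.
s1 = "dia membeli buku baru di toko"
s2 = "dia membeli buku di toko itu"
score = weighted_jaccard(term_frequencies(s1), term_frequencies(s2))
print(f"Jaccard score on TF vectors: {score:.3f}")
```

In a pipeline like the one described, a score of this kind would be one column in the feature matrix passed to a classifier such as XGBoost. With raw counts, the score equals the ordinary set-based Jaccard index whenever no token repeats within a sentence.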

Similar Items