PARAPHRASE DETECTION FOR INDONESIAN SENTENCE PAIRS USING ABSTRACT MEANING REPRESENTATION AND LATENT SEMANTIC ANALYSIS


Bibliographic Details
Main Author: Muhammad Muflich, Faiz
Format: Final Project
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/85008
Institution: Institut Teknologi Bandung
id id-itb.:85008
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description Paraphrase detection is a classification task in natural language processing that determines whether two sentences are paraphrases of each other. Abstract Meaning Representation (AMR) is a sentence-level semantic representation in which sentences with different syntax but the same meaning map to isomorphic AMR graphs, which makes AMR well suited to paraphrase detection. In a previous study, the Indonesian paraphrase detection model achieved an unsatisfactory F1 score of 0.682 on the validation data, compared with 0.900 for English. The gap is attributed to the Indonesian model using only similarity score-based features, whereas the related English research combines Latent Semantic Analysis (LSA) based features with AMR. This Final Project compares models trained with similarity score-based features against models trained with LSA-based features. The models used are Support Vector Machine, XGBoost, Random Forest, and LightGBM. To address the shortcomings of the previous research, the method of Issa et al. (2018) was reimplemented and applied to Indonesian sentences. The first experiment tested the effect of adding the Paraphrase Adversaries from Word Scrambling (PAWS) dataset on the validation F1 score and showed that the additional data introduced noise and made it harder for the models to learn patterns. The second experiment tested similarity score-based features, and the third tested LSA-based features. The XGBoost model using the Jaccard score of the TF representation as its feature achieved the best validation F1 score of 0.685, with F1 scores of 0.683 on the automatically translated test data and 0.670 on the manually translated test data. In Indonesian paraphrase detection, the combination of LSA-based features with AMR yielded a lower validation F1 score than the similarity score-based features.
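As an illustration of the best-performing feature named in the description, the following is a minimal sketch of a Jaccard score computed over term-frequency (TF) vectors. The record does not define the exact formulation used in the project, so this assumes the weighted (multiset) Jaccard form, sum of minima over sum of maxima, and assumes plain whitespace tokens as terms; the project may instead count AMR concepts or use a different tokenizer. The function names `tf_vector` and `jaccard_tf` are hypothetical.

```python
from collections import Counter


def tf_vector(sentence):
    """Return raw term counts for a sentence (assumption: whitespace tokens)."""
    return Counter(sentence.lower().split())


def jaccard_tf(sentence_a, sentence_b):
    """Weighted Jaccard score between the TF vectors of two sentences.

    Assumed definition: sum of per-term minimum counts divided by the
    sum of per-term maximum counts. Returns 0.0 when both are empty.
    """
    tf_a, tf_b = tf_vector(sentence_a), tf_vector(sentence_b)
    terms = set(tf_a) | set(tf_b)
    if not terms:
        return 0.0
    # Counter returns 0 for missing terms, so no key checks are needed.
    overlap = sum(min(tf_a[t], tf_b[t]) for t in terms)
    total = sum(max(tf_a[t], tf_b[t]) for t in terms)
    return overlap / total


if __name__ == "__main__":
    # Three shared terms out of four distinct terms.
    print(jaccard_tf("saya makan nasi goreng", "saya makan nasi"))
```

A score like this would be one column in the feature matrix passed to a classifier such as XGBoost; a single scalar feature per sentence pair keeps the model simple but discards word order.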
format Final Project
author Muhammad Muflich, Faiz