PARAPHRASE DETECTION FOR INDONESIAN SENTENCE PAIRS USING ABSTRACT MEANING REPRESENTATION AND LATENT SEMANTIC ANALYSIS
Paraphrase detection is a classification task in natural language processing whose goal is to determine whether two sentences are paraphrases of each other. Abstract Meaning Representation (AMR) is a sentence semantic representation that can represent various sentences with different syntax but the sa...
Main Author: | Muhammad Muflich, Faiz |
---|---|
Format: | Final Project |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/85008 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:85008 |
---|---|
spelling |
id-itb.:85008; 2024-08-19T12:57:10Z; Muhammad Muflich, Faiz; Indonesia; Final Project; paraphrase detection, Latent Semantic Analysis, Abstract Meaning Representation, XGBoost; INSTITUT TEKNOLOGI BANDUNG; https://digilib.itb.ac.id/gdl/view/85008; text |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
Paraphrase detection is a classification task in natural language processing whose
goal is to determine whether two sentences are paraphrases of each other. Abstract
Meaning Representation (AMR) is a semantic representation of sentences: sentences
with different syntax but the same meaning map to isomorphic AMR graphs, which makes
AMR well suited to paraphrase detection. The Indonesian paraphrase detection model
from a previous study performed poorly, with an F1 score of 0.682 on the validation
data, compared with 0.900 for English. The gap is attributed to the Indonesian model
using only similarity score-based features, whereas the related English work combines
Latent Semantic Analysis (LSA) based features with AMR. This Final Project compares
models trained on similarity score-based features with models trained on LSA-based
features. The models used are Support Vector Machine, XGBoost, Random Forest, and
LightGBM. To address the shortcomings of the previous research, the approach of
Issa et al. (2018) was reimplemented and applied to Indonesian sentences. The first
experiment tested the effect of adding the Paraphrase Adversaries from Word
Scrambling (PAWS) dataset on the validation F1 score; it showed that the additional
data introduced noise and made it harder for the models to learn patterns. The second
experiment tested similarity score-based features, and the third tested LSA-based
features. The best result came from the XGBoost model using the Jaccard score
feature on the TF representation, with an F1 score of 0.685 on the validation data,
0.683 on the automatically translated test data, and 0.670 on the manually
translated test data. Combining LSA-based features with AMR gave a lower validation
F1 score than similarity score-based features for Indonesian paraphrase detection. |
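
The abstract names a Jaccard score over a TF representation as the best-performing feature and reports results as F1 scores, but it does not give formulas. The sketch below shows one common reading, under stated assumptions: TF is taken to mean raw term-frequency counts over whitespace tokens, the Jaccard score is the weighted (multiset) variant, and F1 is computed for the positive "paraphrase" class. The function names `term_frequencies`, `tf_jaccard`, and `f1_score` are hypothetical and are not taken from the thesis, which may compute these features over AMR graph elements rather than surface tokens.

```python
from collections import Counter


def term_frequencies(sentence):
    """Assumed TF representation: raw counts of lowercased whitespace tokens."""
    return Counter(sentence.lower().split())


def tf_jaccard(sentence_a, sentence_b):
    """Weighted Jaccard score: sum of per-term minima over sum of per-term maxima."""
    tf_a = term_frequencies(sentence_a)
    tf_b = term_frequencies(sentence_b)
    terms = set(tf_a) | set(tf_b)
    if not terms:
        # Two empty sentences are treated as identical (an assumption).
        return 1.0
    intersection = sum(min(tf_a[t], tf_b[t]) for t in terms)
    union = sum(max(tf_a[t], tf_b[t]) for t in terms)
    return intersection / union


def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1), the harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative Indonesian sentence pair, not drawn from the thesis data.
    print(tf_jaccard("saya makan nasi", "saya makan roti"))
    print(f1_score([1, 0, 1, 1], [1, 1, 0, 1]))
```

In a setup like the one the abstract describes, a score such as `tf_jaccard` would be one input feature per sentence pair for a classifier such as XGBoost, and `f1_score` would be applied to that classifier's predictions on the validation or test pairs.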
format |
Final Project |
author |
Muhammad Muflich, Faiz |
title |
PARAPHRASE DETECTION FOR INDONESIAN SENTENCE PAIRS USING ABSTRACT MEANING REPRESENTATION AND LATENT SEMANTIC ANALYSIS |
url |
https://digilib.itb.ac.id/gdl/view/85008 |