SENTIMENT ANALYSIS IN SUNDANESE USING PRETRAINED MULTILINGUAL LANGUAGE MODEL

Bibliographic Details
Main Author: Permana, Hadi
Format: Theses
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/68354
Institution: Institut Teknologi Bandung
id id-itb.:68354
timestamp 2022-09-14T09:24:20Z
keywords Sentiment Analysis, Sundanese, Multilingual Model, Low-Resource Language, XLM-R, XLM-Ttweet
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description The large number of social media users has produced a large volume of digitally available text data. Social media text is written in Indonesian and in regional languages, such as Sundanese in West Java. However, Sundanese data is still rarely used in natural language processing (NLP), so this study aims to use it for the sentiment analysis task. In practice, Sundanese text is often mixed with other languages. Data retrieved from social media also contains many typos and non-standard or out-of-vocabulary (OOV) words. The author uses a pretrained multilingual language model to address these problems. The experiments compare four models on Sundanese sentiment analysis: Naïve Bayes, XLM-R, XLM-Tw, and mBERT. The data consists of 7,771 tweets collected from Twitter with the query "Persib" and annotated with neutral, negative, and positive sentiment. The data is split into 60% training, 20% validation, and 20% test data. The best performance, an accuracy of 87%, was obtained with the fine-tuned XLM-Tw model, an improvement over the accuracy of the Naïve Bayes and XLM-R models. An additional experiment with the XLM-Tw model on the Sundanese NusaX dataset obtained an accuracy of 82%.
format Theses
author Permana, Hadi
title SENTIMENT ANALYSIS IN SUNDANESE USING PRETRAINED MULTILINGUAL LANGUAGE MODEL
url https://digilib.itb.ac.id/gdl/view/68354
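
The abstract reports that the 7,771 annotated tweets were divided into 60% training, 20% validation, and 20% test data. The record does not say how the split was performed, so the following is only a minimal sketch of one way to produce such a split. The function name `split_60_20_20`, the fixed seed, and the use of a plain random shuffle (rather than, for example, a split stratified by sentiment label) are assumptions for illustration, not the author's procedure.

```python
import random


def split_60_20_20(items, seed=0):
    """Shuffle items and split them into 60% train, 20% validation, 20% test.

    Hypothetical helper: the thesis record does not describe its split
    procedure, so a seeded random shuffle is assumed here.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(shuffled) * 60 // 100
    n_val = len(shuffled) * 20 // 100

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Stand-in for the 7,771 annotated tweets mentioned in the abstract.
tweet_ids = range(7771)
train, val, test = split_60_20_20(tweet_ids)
print(len(train), len(val), len(test))  # prints: 4662 1554 1555
```

Because 7,771 is not divisible by five, the three parts cannot be exactly 60/20/20; in this sketch the leftover example falls into the test set.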