SENTIMENT ANALYSIS IN SUNDANESE USING PRETRAINED MULTILINGUAL LANGUAGE MODEL

The large number of social media users has produced a large volume of digitally available text. Social media text in Indonesia is written in Indonesian and in regional languages, such as Sundanese in West Java. Unfortunately, Sundanese data is still rarely used in natural language process...

Bibliographic Details
Main Author:	Permana, Hadi
Format:	Theses
Language:	Indonesia
Online Access:	https://digilib.itb.ac.id/gdl/view/68354
Institution:	Institut Teknologi Bandung

id	id-itb.:68354
spelling	id-itb.:68354 2022-09-14T09:24:20Z SENTIMENT ANALYSIS IN SUNDANESE USING PRETRAINED MULTILINGUAL LANGUAGE MODEL Permana, Hadi Indonesia Theses Sentiment Analysis, Sundanese, Multilingual Model, Low-Resource Language, XLM-R, XLM-Ttweet INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/68354
institution	Institut Teknologi Bandung
building	Institut Teknologi Bandung Library
continent	Asia
country	Indonesia
content_provider	Institut Teknologi Bandung
collection	Digital ITB
language	Indonesia
description	The large number of social media users has produced a large volume of digitally available text. Social media text in Indonesia is written in Indonesian and in regional languages, such as Sundanese in West Java. Unfortunately, Sundanese data is still rarely used in natural language processing (NLP), so this study aims to use such data for a sentiment analysis task. In practice, Sundanese text is often mixed with other languages. Data retrieved from social media poses a further problem: many words are misspelled, non-standard, or out-of-vocabulary (OOV). To address this, the author uses multilingual pre-trained language models. The experiments compare four models on the Sundanese sentiment analysis task: Naive Bayes, XLM-R, XLM-Tw, and mBERT. The data consists of 7,771 tweets collected from Twitter with the query "Persib" and annotated with neutral, negative, or positive sentiment. The data is divided into 60% training data, 20% validation data, and 20% test data. The highest performance, an accuracy of 87%, was obtained by the fine-tuned XLM-Tw model. This is an improvement over the accuracy of the Naive Bayes and XLM-R models. An experiment was also conducted on the Sundanese-language NusaX dataset with the XLM-Tw model, which obtained an accuracy of 82%.
format	Theses
author	Permana, Hadi
title	SENTIMENT ANALYSIS IN SUNDANESE USING PRETRAINED MULTILINGUAL LANGUAGE MODEL
url	https://digilib.itb.ac.id/gdl/view/68354
_version_	1822005720103518208
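
The 60% / 20% / 20% division mentioned in the description can be illustrated with a minimal sketch. The record does not say how the thesis performed the split, so the shuffling, the seed, the rounding rule, and the function name `split_60_20_20` are all assumptions made only for illustration; the tweet IDs are placeholders, not real data.

```python
import random


def split_60_20_20(items, seed=0):
    """Shuffle a copy of `items` and cut it into train/validation/test parts.

    Hypothetical helper: the rounding rule (floor for train and validation,
    remainder to test) is an assumption, not taken from the thesis.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    n_train = n * 60 // 100  # integer arithmetic avoids float rounding
    n_val = n * 20 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, val, test


# Placeholder IDs standing in for the 7,771 annotated tweets.
tweet_ids = range(7771)
train, val, test = split_60_20_20(tweet_ids)
print(len(train), len(val), len(test))  # 4662 1554 1555
```

Under these assumptions, 7,771 tweets yield 4,662 training, 1,554 validation, and 1,555 test examples; the actual counts used in the thesis may differ slightly depending on rounding or stratification by sentiment label.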

Similar Items