SENTIMENT ANALYSIS IN SUNDANESE USING PRETRAINED MULTILINGUAL LANGUAGE MODEL

Bibliographic Details
Main Author: Permana, Hadi
Format: Theses
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/68354
Institution: Institut Teknologi Bandung
Description
Summary: The large number of social media users has produced a large volume of digitally available text. This text is written in Indonesian and in regional languages, such as Sundanese in the West Java area. Sundanese data is still rarely used in natural language processing (NLP), so this study applies it to the sentiment analysis task. Two problems arise with such data: Sundanese text is often mixed with other languages, and text retrieved from social media contains many typos and non-standard words, which are out-of-vocabulary (OOV). To overcome these problems, the author uses multilingual pre-trained language models. The experiments compare four models to determine which performs best on sentiment analysis in Sundanese: Naive Bayes, XLM-R, XLM-Tw and mBERT. The data consists of 7,771 tweets collected from Twitter with the query "Persib" and annotated with neutral, negative or positive sentiment. The data is divided into 60% training data, 20% validation data and 20% test data. The highest performance, an accuracy of 87%, was obtained by fine-tuning the XLM-Tw model. This exceeds the accuracy of the Naive Bayes and XLM-R models. A further experiment with the XLM-Tw model on the Sundanese-language NusaX dataset obtained an accuracy of 82%.
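
The 60/20/20 division described in the summary can be illustrated with a minimal sketch. The summary does not say how the split was performed, so everything here is an assumption made for illustration: the function name `split_dataset`, the random shuffle, the fixed seed, and the rule that rounding remainders go to the test set. The thesis may have used a different procedure, such as a stratified split by sentiment label.

```python
import random


def split_dataset(items, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle items and split them into train, validation and test lists.

    Hypothetical helper: the fractions follow the summary, but the shuffle,
    the seed and the rounding rule are assumptions, not the thesis's method.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Stand-in for the 7,771 annotated tweets (integer IDs instead of real text).
tweet_ids = range(7771)
train, val, test = split_dataset(tweet_ids)
print(len(train), len(val), len(test))  # prints "4662 1554 1555"
```

Under these assumptions, 7,771 items yield 4,662 training, 1,554 validation and 1,555 test examples. The exact counts used in the thesis are not given in the summary and may differ by a few items depending on how rounding was handled.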