SENTIMENT ANALYSIS IN SUNDANESE USING PRETRAINED MULTILINGUAL LANGUAGE MODEL
Main Author: | |
---|---|
Format: | Theses |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/68354 |
Institution: | Institut Teknologi Bandung |
Summary: | The large number of social media users has produced a large amount of
digitally available text data. This data is written in Indonesian and in regional
languages, such as Sundanese in the West Java area. However, Sundanese-language
data is still rarely used in natural language processing (NLP), so this study aims
to use it for the sentiment analysis task. In practice, Sundanese text is often mixed
with other languages. Data retrieved from social media has a further problem: it
contains many typos and non-standard words, that is, out-of-vocabulary (OOV)
words. The author uses a multilingual pre-trained language model to overcome
this problem.
The experiments compare four models to determine which performs best on the
sentiment analysis task in Sundanese: Naïve Bayes, XLM-R, XLM-Tw, and mBERT.
The data consists of 7,771 tweets collected from Twitter with the query "Persib"
and annotated with neutral, negative, or positive sentiment. The data is divided
into 60% training data, 20% validation data, and 20% test data. The highest
performance, an accuracy of 87%, was obtained with the fine-tuned XLM-Tw
model. This is higher than the accuracy of the Naïve Bayes and XLM-R models.
An experiment was also conducted on the Sundanese-language dataset NusaX with
the XLM-Tw model, which obtained an accuracy of 82%. |
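
The 60% / 20% / 20% division described in the summary can be illustrated with a minimal sketch. The record does not say whether the split was random, stratified, or chronological, so the shuffled split, the fixed seed, the function name `split_60_20_20`, and the placeholder records below are all assumptions made for illustration, not the author's procedure.

```python
import random


def split_60_20_20(items, seed=0):
    """Shuffle items and split them into train/validation/test lists.

    Assumption: a seeded random shuffle. The catalog record does not
    state how the thesis actually assigned tweets to each subset.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(0.6 * len(shuffled))
    n_val = int(0.2 * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Hypothetical placeholder records standing in for the 7,771 annotated tweets.
tweets = [(f"tweet {i}", "neutral") for i in range(7771)]
train, val, test = split_60_20_20(tweets)
print(len(train), len(val), len(test))
```

Because 7,771 is not divisible by five, the subset sizes depend on the rounding rule. This sketch truncates the training and validation sizes and assigns the remainder to the test set; the exact counts used in the thesis may differ.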