IDENTIFYING AND CATEGORIZING OFFENSIVE LANGUAGE IN SOCIAL MEDIA

Bibliographic Details
Main Author: Muslim, Fajar
Format: Final Project
Language: Indonesian
Online Access:https://digilib.itb.ac.id/gdl/view/56244
Institution: Institut Teknologi Bandung
Description
Summary: Freedom of expression on social media is often misused by some people to carry out offensive actions, so a mechanism is needed to filter posts and keep social media conducive. This final project aims to identify and categorize offensive language on social media through three subtasks: identification of offensive language (subtask A), categorization of offensive language targets (subtask B), and identification of the offense target (subtask C). This final project uses the OLID dataset, which is relatively small and imbalanced. In a previous study (SemEval-2019 Task 6), BERT achieved the best performance on subtask A (Liu et al., 2019) and subtask C (Radivchev & Nikolov, 2019). On the other hand, fine-tuning BERT results in high variance (Risch et al., 2020). This final project focuses on experiments to find the best model architecture, using cost-sensitive learning to handle the imbalanced dataset and ensembling to improve BERT's performance. Based on the experimental results on the validation data, cost-sensitive learning and ensembling improve model performance on all three subtasks. After testing, however, cost-sensitive learning improved performance only on subtask B, by 3.16% over the baseline model (Liu et al., 2019), and on subtask C, by 6.85% over the baseline model (Zhou et al., 2019); on subtask A there was no improvement. The best ensemble technique for the three subtasks is hard majority voting, which improves performance on subtask A by 0.78% and on subtask B by 1.72% compared to the models with cost-sensitive learning, while on subtask C it does not improve model performance. The results of this final project rank first in the state of the art on the OLID dataset for subtask B and second for subtask A and subtask C.
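
As context for the cost-sensitive learning technique mentioned in the summary, the sketch below shows one common way to apply it when fine-tuning BERT: weighting the cross-entropy loss by inverse class frequency so that errors on the minority (offensive) class cost more. The model name, class counts, and weighting scheme are illustrative assumptions, not the exact configuration used in this final project.

import torch
from torch import nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Inverse-frequency class weights for an imbalanced binary subtask
# (counts are illustrative, e.g. NOT vs OFF labels in subtask A).
class_counts = torch.tensor([8840.0, 4400.0])
class_weights = class_counts.sum() / (2.0 * class_counts)
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

batch = tokenizer(["have a nice day", "you are an idiot"],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([0, 1])   # 0 = NOT offensive, 1 = OFF

logits = model(**batch).logits
loss = loss_fn(logits, labels)  # minority-class mistakes are penalized more
loss.backward()                 # an optimizer step would follow during fine-tuning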
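
Similarly, the hard majority-voting ensemble described in the summary can be sketched as follows: each fine-tuned model casts one label vote per example and the most frequent label wins. The predictions below are illustrative placeholders, not outputs of the project's actual models.

from collections import Counter
from typing import List, Sequence

def hard_majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """predictions[i][j] is model i's predicted label for example j."""
    n_examples = len(predictions[0])
    voted = []
    for j in range(n_examples):
        votes = [preds[j] for preds in predictions]
        # most_common breaks ties toward the first-seen label
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted

# Three hypothetical fine-tuned BERT models voting on four examples.
model_preds = [
    [1, 0, 1, 0],  # model 1
    [1, 1, 1, 0],  # model 2
    [0, 0, 1, 0],  # model 3
]
print(hard_majority_vote(model_preds))  # -> [1, 0, 1, 0]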