IDENTIFYING AND CATEGORIZING OFFENSIVE LANGUAGES IN SOCIAL MEDIA
Main Author: | |
---|---|
Format: | Final Project |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/56244 |
Institution: | Institut Teknologi Bandung |
Summary: | Freedom of expression on social media is often misused by some people to carry out offensive
actions, so a mechanism is needed to filter posts and keep social media a civil environment. This final
project aims to identify and categorize offensive language on social media through three
subtasks, namely identification of offensive language (subtask A), categorization of offensive
language targets (subtask B), and identification of offensive targets (subtask C).
This final project uses the OLID dataset, which is relatively small and imbalanced. In a previous study
(SemEval-2019 Task 6), BERT achieved the best performance on subtask A (Liu et al., 2019) and subtask
C (Radivchev & Nikolov, 2019). On the other hand, fine-tuning BERT results in high
variance (Risch et al., 2020). This final project therefore focuses on experiments to find the best
model architecture, using cost-sensitive learning to handle the imbalanced dataset
and ensembling to improve BERT's performance.
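As a rough illustration of the cost-sensitive learning idea (a generic sketch, not the exact setup of this final project), the snippet below weights the cross-entropy loss by inverse class frequency while fine-tuning a BERT classifier. It assumes PyTorch and the Hugging Face transformers library; the model name, class counts, and example texts are placeholders.

```python
# Minimal sketch: class-weighted cross-entropy with a BERT classifier.
# Class counts, labels, and texts below are illustrative placeholders.
import torch
from torch import nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # e.g. subtask A: NOT vs. OFF
)

# Inverse-frequency class weights make errors on the minority class costlier.
class_counts = torch.tensor([8840.0, 4400.0])            # hypothetical NOT/OFF counts
class_weights = class_counts.sum() / (2 * class_counts)
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

batch = tokenizer(["you are wonderful", "you are an idiot"],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([0, 1])                             # 0 = NOT, 1 = OFF

logits = model(**batch).logits
loss = loss_fn(logits, labels)                            # weighted loss for backprop
loss.backward()
```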
Based on the experimental results on the validation data, cost-sensitive learning and
ensembling improve model performance on all three subtasks. On the test data, however,
cost-sensitive learning improves performance only on subtask B, by 3.16% over the baseline
model (Liu et al., 2019), and on subtask C, by 6.85% over the baseline
model (Zhou et al., 2019), while subtask A shows no improvement. The best ensemble
technique for all three subtasks is hard majority voting, which improves performance
on subtask A by 0.78% and on subtask B by 1.72% compared to the models with
cost-sensitive learning, while it does not improve performance on subtask C.
The results of this final project rank first in the state of the art on the OLID
dataset for subtask B and second for subtask A and subtask C.
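For illustration, hard majority voting can be sketched as follows: each fine-tuned model casts its predicted label as a vote, and the most frequent label per example is kept. This is a generic sketch rather than the project's code; the vote matrix below is a made-up placeholder.

```python
# Minimal sketch: hard majority voting over the label predictions of several models.
import numpy as np

def hard_majority_vote(votes: np.ndarray) -> np.ndarray:
    """Most frequent label per column; ties go to the smaller label index."""
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Hypothetical predicted labels from five independently fine-tuned BERT models
# (rows = models, columns = examples); 0 = NOT offensive, 1 = OFF for subtask A.
votes = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 1, 1],
])

print(hard_majority_vote(votes))  # -> [1 0 1 0]
```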