IDENTIFYING AND CATEGORIZING OFFENSIVE LANGUAGES IN SOCIAL MEDIA

Bibliographic Details
Main Author: Muslim, Fajar
Format: Final Project
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/56244
Institution: Institut Teknologi Bandung
id id-itb.:56244
keywords offensive, cost-sensitive learning, ensemble, hard majority voting
timestamp 2021-06-21T16:31:05Z
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description Freedom of expression on social media is often misused to attack or offend others, so a mechanism is needed to filter posts and keep social media a constructive space. This final project aims to identify and categorize offensive language on social media through three subtasks: identification of offensive language (subtask A), categorization of offensive language targets (subtask B), and identification of offensive targets (subtask C). The project uses the OLID dataset, which is relatively small and imbalanced. In earlier work on SemEval-2019 Task 6, BERT achieved the best performance on subtask A (Liu et al., 2019) and subtask C (Radivchev & Nikolov, 2019). However, fine-tuning BERT produces results with high variance (Risch et al., 2020). The project therefore runs experiments to find the best model architecture, using cost-sensitive learning to handle the imbalanced dataset and ensembles to improve BERT's performance. On the validation data, cost-sensitive learning and ensembling both improve model performance on all three subtasks. In testing, however, cost-sensitive learning improves performance only on subtask B, by 3.16% over the baseline model (Liu et al., 2019), and on subtask C, by 6.85% over the baseline model (Zhou et al., 2019); subtask A shows no improvement. The best ensemble technique for all three subtasks is hard majority voting, which improves performance on subtask A by 0.78% and on subtask B by 1.72% relative to the models that use cost-sensitive learning alone, but gives no improvement on subtask C. These results rank first among state-of-the-art results on the OLID dataset for subtask B and second for subtasks A and C.
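The two techniques named in the description can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the tie-breaking rule, and the inverse-frequency weighting formula are assumptions made for illustration, and the record does not say which weighting scheme or how many ensemble members the author used. The labels OFF and NOT follow the OLID convention for subtask A.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-class weights that grow as a class gets rarer.

    Assumed formula (a common heuristic, not taken from the thesis):
    weight(c) = n_samples / (n_classes * count(c)).
    Such weights are typically passed to a weighted loss function.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def hard_majority_vote(predictions):
    """Combine label predictions from several models by majority vote.

    predictions: one list of predicted labels per model, all equal length.
    Ties go to the label predicted by the earliest model in the list
    (an assumed tie-breaking rule).
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_items = len(predictions[0])
    if any(len(p) != n_items for p in predictions):
        raise ValueError("all models must predict the same number of items")

    voted = []
    for item_labels in zip(*predictions):
        counts = Counter(item_labels)
        best = max(counts.values())
        # First label, in model order, that reaches the top vote count.
        voted.append(next(l for l in item_labels if counts[l] == best))
    return voted


if __name__ == "__main__":
    # Hypothetical toy data: an imbalanced training set and three models.
    train_labels = ["NOT", "NOT", "NOT", "OFF"]
    print(inverse_frequency_weights(train_labels))

    model_outputs = [
        ["OFF", "NOT", "NOT"],
        ["OFF", "OFF", "NOT"],
        ["NOT", "OFF", "NOT"],
    ]
    print(hard_majority_vote(model_outputs))
```

Hard voting uses only each model's final label, not its predicted probabilities, so a single overconfident run cannot dominate the result. That is one plausible reason it suits fine-tuned BERT models whose results vary from run to run.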
format Final Project
author Muslim, Fajar
title IDENTIFYING AND CATEGORIZING OFFENSIVE LANGUAGES IN SOCIAL MEDIA
url https://digilib.itb.ac.id/gdl/view/56244