IDENTIFYING AND CATEGORIZING OFFENSIVE LANGUAGES IN SOCIAL MEDIA

Freedom of expression on social media is often misused to post offensive content, so a mechanism is needed to filter uploads and keep social media a constructive environment. This final project aims to identify and categorize offensive language on social media through three subtasks, na...

Bibliographic Details
Main Author:	Muslim, Fajar
Format:	Final Project
Language:	Indonesian
Online Access:	https://digilib.itb.ac.id/gdl/view/56244
Institution:	Institut Teknologi Bandung

id	id-itb.:56244
spelling	id-itb.:56244 2021-06-21T16:31:05Z IDENTIFYING AND CATEGORIZING OFFENSIVE LANGUAGES IN SOCIAL MEDIA Muslim, Fajar Indonesia Final Project offensive, cost-sensitive learning, ensemble, hard majority voting. INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/56244 text
institution	Institut Teknologi Bandung
building	Institut Teknologi Bandung Library
continent	Asia
country	Indonesia
content_provider	Institut Teknologi Bandung
collection	Digital ITB
language	Indonesia
description	Freedom of expression on social media is often misused to post offensive content, so a mechanism is needed to filter uploads and keep social media a constructive environment. This final project aims to identify and categorize offensive language on social media through three subtasks: identification of offensive language (subtask A), categorization of offensive language targets (subtask B), and identification of offensive targets (subtask C). The project uses the OLID dataset, which is relatively small and imbalanced. In the earlier SemEval-2019 Task 6 competition, BERT achieved the best performance on subtask A (Liu et al., 2019) and subtask C (Radivchev & Nikolov, 2019). However, fine-tuning BERT yields results with high variance (Risch et al., 2020). The project therefore runs experiments to find the best model architecture, using cost-sensitive learning to counter the class imbalance and ensembles to improve BERT's performance. On the validation data, both cost-sensitive learning and ensembling improve model performance on all three subtasks. On the test data, however, cost-sensitive learning improves performance only on subtask B, by 3.16% over the baseline model (Liu et al., 2019), and on subtask C, by 6.85% over the baseline model (Zhou et al., 2019); subtask A shows no improvement. The best ensemble technique for the three subtasks is hard majority voting, which improves performance on subtask A by 0.78% and on subtask B by 1.72% relative to the models with cost-sensitive learning. On subtask C, the ensemble does not improve model performance. The results of this final project rank first among state-of-the-art results on the OLID dataset for subtask B and second for subtasks A and C.
format	Final Project
author	Muslim, Fajar
spellingShingle	Muslim, Fajar IDENTIFYING AND CATEGORIZING OFFENSIVE LANGUAGES IN SOCIAL MEDIA
author_facet	Muslim, Fajar
author_sort	Muslim, Fajar
title	IDENTIFYING AND CATEGORIZING OFFENSIVE LANGUAGES IN SOCIAL MEDIA
title_short	IDENTIFYING AND CATEGORIZING OFFENSIVE LANGUAGES IN SOCIAL MEDIA
title_full	IDENTIFYING AND CATEGORIZING OFFENSIVE LANGUAGES IN SOCIAL MEDIA
title_fullStr	IDENTIFYING AND CATEGORIZING OFFENSIVE LANGUAGES IN SOCIAL MEDIA
title_full_unstemmed	IDENTIFYING AND CATEGORIZING OFFENSIVE LANGUAGES IN SOCIAL MEDIA
title_sort	identifying and categorizing offensive languages in social media
url	https://digilib.itb.ac.id/gdl/view/56244
_version_	1822930140239233024
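
The two techniques named in the description, cost-sensitive learning and hard majority voting, can be illustrated with a minimal Python sketch. This is not the author's implementation: the function names `inverse_frequency_weights` and `hard_majority_vote` are hypothetical, inverse class-frequency weighting is only one common way to set misclassification costs (the record does not say which scheme the project used), and the labels below are toy data rather than OLID samples.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return a weight per class: n_samples / (n_classes * class_count).

    Rare classes get larger weights, so errors on them cost more when the
    weights are passed to a weighted loss function (an assumed usage).
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


def hard_majority_vote(predictions):
    """Combine label predictions from several models by hard voting.

    `predictions` is a list with one entry per model; each entry is a list
    of predicted labels, one per example. For every example the most
    frequent label wins. Ties go to the label predicted by the earliest
    model in the list (an arbitrary choice made for this sketch).
    """
    voted = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # Pick the first label, in model order, that reaches the top count.
        voted.append(next(v for v in votes if counts[v] == top))
    return voted


# Toy, imbalanced training labels (three NOT for every OFF).
train_labels = ["NOT", "NOT", "NOT", "OFF"]
weights = inverse_frequency_weights(train_labels)

# Toy predictions from three hypothetical fine-tuned models on three examples.
model_predictions = [
    ["OFF", "NOT", "OFF"],
    ["NOT", "NOT", "OFF"],
    ["OFF", "OFF", "NOT"],
]
ensemble = hard_majority_vote(model_predictions)
```

Hard voting uses only each model's final label, not its predicted probabilities, which is why it can smooth out the run-to-run variance of individually fine-tuned models without requiring calibrated scores.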
