BUILDING SEXISM DETECTION AND CLASSIFICATION MODEL FOR SOCIAL MEDIA TEXT USING ROBERTA AND DATA AUGMENTATION

Bibliographic Details
Main Author: Tri Rahutami, Gayuh
Format: Final Project
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/74111
Institution: Institut Teknologi Bandung
Description
Summary: Sexism refers to actions based on the belief that the members of one sex are less intelligent, able, or skillful than the members of the other sex, especially the belief that women are less able than men. Today, sexism is common on social media because users rarely face consequences for committing a sexist act. To counter this trend, an organization called Rewire held a competition at SemEval 2023 titled Toward Explainable Detection of Online Sexism (EDOS). Its goal is to build a model that detects sexism in social media text and also classifies the text into four general categories and eleven specific categories. In this final year project, three artificial neural network models, one for each of these tasks (sexism detection, category classification, and sub-category classification), are built on a transformer-based model, RoBERTa. The provided dataset is imbalanced, which leaves the models unable to predict some of the categories that have fewer examples than the others. To address this, data augmentation experiments are performed to improve the models' performance. Four settings are compared: no data augmentation, random oversampling, easy data augmentation, and back-translation. The experiments show that data augmentation improves both category classification and sub-category classification. In the category classification task, data augmentation raised the F1 score from 0.29 to 0.66. In the sub-category classification task, it raised the F1 score from 0.18 to 0.51. Further analysis found that the sexist texts the models predicted correctly tended to be those containing many derogatory terms.
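The record does not include the project's code, so the following is only a minimal sketch of two of the augmentation settings named in the summary: random oversampling, and two of the word-level operations commonly grouped under easy data augmentation (random swap and random deletion). All function names, parameters, and the toy labels are assumptions made for illustration, not the author's implementation. Synonym replacement, random insertion, and back-translation are left out because they need a thesaurus or a translation model.

```python
import random
from collections import Counter, defaultdict


def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class examples at random until every class
    has as many examples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Draw the missing examples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


def random_swap(words, rng):
    """EDA-style random swap: exchange two randomly chosen positions."""
    if len(words) < 2:
        return list(words)
    i, j = rng.sample(range(len(words)), 2)
    swapped = list(words)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped


def random_deletion(words, rng, p=0.1):
    """EDA-style random deletion: drop each word with probability p,
    always keeping at least one word of a non-empty input."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


# Toy data with placeholder labels (assumed, not taken from the EDOS dataset).
texts = ["text one here", "text two here", "text three here", "rare class text"]
labels = ["majority", "majority", "majority", "minority"]

balanced_texts, balanced_labels = random_oversample(texts, labels)
label_counts = Counter(balanced_labels)  # each label now appears 3 times

rng = random.Random(42)
augmented = random_swap("rare class text".split(), rng)
```

Random oversampling only repeats existing minority examples, whereas the word-level operations produce slightly altered copies, which is one plausible reason to compare the settings separately as the summary describes.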