BUILDING SEXISM DETECTION AND CLASSIFICATION MODEL FOR SOCIAL MEDIA TEXT USING ROBERTA AND DATA AUGMENTATION
Sexism refers to actions based on the belief that members of one sex are less intelligent, able, or skillful than members of the other sex, especially the belief that women are less able than men. Today, sexism is common on social media because of the lack of consequences given when a...
Main Author: | Tri Rahutami, Gayuh |
---|---|
Format: | Final Project |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/74111 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:74111 |
---|---|
spelling |
id-itb.:74111 2023-06-26T13:05:14Z BUILDING SEXISM DETECTION AND CLASSIFICATION MODEL FOR SOCIAL MEDIA TEXT USING ROBERTA AND DATA AUGMENTATION Tri Rahutami, Gayuh Indonesia Final Project sexism, text classification, social media, RoBERTa INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/74111 text |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
Sexism refers to actions based on the belief that members of one sex are less intelligent,
able, or skillful than members of the other sex, especially the belief that women are less
able than men. Today, sexism is common on social media because users face few
consequences for sexist behavior. To counter this trend, an organization called Rewire
held a competition at SemEval 2023 titled Toward Explainable Detection of Online Sexism
(EDOS). Its goal is a model that detects sexism in social media text and also classifies
the text into four general categories and eleven specific categories.
In this final year project, three artificial neural network models, one for each of the
tasks above, are built on RoBERTa, a transformer-based model. The provided dataset is
imbalanced, which leaves the models unable to predict some of the categories that have
fewer examples than the others. To address this, data augmentation experiments are
carried out to improve the models' performance. Four settings are compared: no data
augmentation, random oversampling, easy data augmentation, and back-translation.
The experiments show that data augmentation improves both category classification and
sub-category classification. In the category classification task, data augmentation
raised the F1 score from 0.29 to 0.66. In the sub-category classification task, it
raised the F1 score from 0.18 to 0.51. Further analysis shows that the sexist texts the
models predicted correctly tend to be the ones that contain many derogatory terms. |
format |
Final Project |
author |
Tri Rahutami, Gayuh |
spellingShingle |
Tri Rahutami, Gayuh BUILDING SEXISM DETECTION AND CLASSIFICATION MODEL FOR SOCIAL MEDIA TEXT USING ROBERTA AND DATA AUGMENTATION |
author_facet |
Tri Rahutami, Gayuh |
author_sort |
Tri Rahutami, Gayuh |
title |
BUILDING SEXISM DETECTION AND CLASSIFICATION MODEL FOR SOCIAL MEDIA TEXT USING ROBERTA AND DATA AUGMENTATION |
title_short |
BUILDING SEXISM DETECTION AND CLASSIFICATION MODEL FOR SOCIAL MEDIA TEXT USING ROBERTA AND DATA AUGMENTATION |
title_full |
BUILDING SEXISM DETECTION AND CLASSIFICATION MODEL FOR SOCIAL MEDIA TEXT USING ROBERTA AND DATA AUGMENTATION |
title_fullStr |
BUILDING SEXISM DETECTION AND CLASSIFICATION MODEL FOR SOCIAL MEDIA TEXT USING ROBERTA AND DATA AUGMENTATION |
title_full_unstemmed |
BUILDING SEXISM DETECTION AND CLASSIFICATION MODEL FOR SOCIAL MEDIA TEXT USING ROBERTA AND DATA AUGMENTATION |
title_sort |
building sexism detection and classification model for social media text using roberta and data augmentation |
url |
https://digilib.itb.ac.id/gdl/view/74111 |
_version_ |
1822007305327083520 |
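
The description field names random oversampling as one of the augmentation settings and reports results as F1 scores. The block below is a minimal sketch of both ideas in plain Python, not the project's actual code. The function names `random_oversample` and `macro_f1`, the toy texts, and the label strings are hypothetical. The record does not say how F1 is averaged, so macro-averaging over classes is an assumption here.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen examples of each smaller class until
    every class has as many examples as the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    target = max(len(items) for items in by_label.values())

    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Sample with replacement to fill the gap up to the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, made-up data: class "A" has three examples, class "B" only one.
texts = ["t1", "t2", "t3", "t4"]
labels = ["A", "A", "A", "B"]
new_texts, new_labels = random_oversample(texts, labels)
print(Counter(new_labels))

# A model that ignores the rare class scores poorly under macro F1.
print(macro_f1(["A", "A", "B"], ["A", "A", "A"]))
```

Under macro-averaging, a class the model never predicts pulls the mean down sharply, which is one plausible reading of why balancing the training data could move the reported scores so much. Oversampling should be applied to the training split only, so that evaluation data keeps its original class distribution.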