Deep learning for hate speech detection on X (Twitter) with different word embedding techniques
Main Author:
Format: Final Year Project / Dissertation / Thesis
Published: 2024
Subjects:
Online Access: http://eprints.utar.edu.my/6684/1/fyp_CS_2024_TWX.pdf
http://eprints.utar.edu.my/6684/
Institution: Universiti Tunku Abdul Rahman
Summary: This project was conducted to develop hate speech detection models using several deep learning techniques with different word embedding techniques to detect English hate speech tweets on X (Twitter), with the goal of enhancing the online communication environment and reducing the suicide rate attributable to cyberbullying. Several deep learning techniques were utilised in this project: CNN, BiLSTM, a pretrained DistilBERT model named 'distilbert/distilbert-base-uncased', and a pretrained RoBERTa model named 'facebook/roberta-hate-speech-dynabench-r4-target'. The word embedding techniques utilised can be classified into two groups: those using a single word embedding technique, such as GloVe (Global Vectors for Word Representation), Word2Vec, or the embedding vectors provided by DistilBERT and RoBERTa themselves, and those combining two different word embedding techniques by stacking them, averaging them, or taking their root mean square. In contrast to earlier models that used word-based tokenisation during data preprocessing, subword tokenisation is used in this project to tokenise the tweets in the dataset.
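The three embedding-combination strategies mentioned above (stacking, averaging, and root mean square) can be sketched as element-wise operations on per-token vectors. The sketch below is illustrative only and assumes both embeddings have already been projected to the same dimension; the vectors themselves are made-up placeholders, not values from the project's models.

```python
import numpy as np

# Hypothetical per-token vectors from two embedding techniques
# (e.g. GloVe and Word2Vec), assumed to share the same dimension.
glove_vec = np.array([0.2, -0.4, 0.6, 0.1])
w2v_vec = np.array([0.5, 0.3, -0.2, 0.4])

# Stacking: concatenate the two vectors (the dimension doubles).
stacked = np.concatenate([glove_vec, w2v_vec])

# Averaging: element-wise mean (the dimension is preserved).
averaged = (glove_vec + w2v_vec) / 2.0

# Root mean square: element-wise RMS of the two vectors.
rms = np.sqrt((glove_vec ** 2 + w2v_vec ** 2) / 2.0)

print(stacked.shape, averaged.shape, rms.shape)  # (8,) (4,) (4,)
```

Note that stacking doubles the input dimension seen by the downstream CNN or BiLSTM layer, while averaging and RMS keep it unchanged, which affects the size of the first trainable layer.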
Several papers on cyberbullying and hate speech detection models using deep learning were reviewed, and the strengths and weaknesses of the models developed by various authors were outlined. In addition to detailing the architectures of the models used in this project, the paper explains the model development process and the techniques employed to address class imbalance and to tune hyperparameters, with visualisations and explanations intended to give newcomers to text classification a comprehensive understanding of how the models were developed. The most significant focus was on the performance evaluation and analysis of the DistilBERT and RoBERTa transformer models, as well as the CNN and BiLSTM models using single word embedding techniques and combinations of different word embedding techniques.