HATE SPEECH DETECTION WITH LOGISTIC REGRESSION AND NAÏVE BAYES CLASSIFIER

Bibliographic Details
Main Author: Lorenzo, Feraldo
Format: Final Project
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/54809
Institution: Institut Teknologi Bandung
Description
Summary: Hate speech has several key characteristics: specific words that carry a hateful sentiment, the order of words in a sentence, which builds a certain context, and the frequency with which a word occurs. This research aims to build a model that captures these characteristics and then classifies a given sentence as hate speech or not. The training data consists of sentences from the Twitter social media platform that have already been labeled with one of two classes, Hate Speech and Not Hate Speech. Before training, the data is pre-processed: text is converted to lowercase, excess spaces are removed, words that carry little information (subjects, prepositions, etc.) are removed, and emojis are converted into words related to them. Two simple statistical models are used, Logistic Regression and the Naïve Bayes Classifier. The study compares the performance of the two models when additional features are added to help them learn the characteristics of hate speech. The added variables include the frequency of word occurrences in the data and variables that record the position of a word in the sentence. Performance is measured not only with validation metrics such as the AUC score and F1 score, but also by how each model classifies several new sample sentences. The experiment found no significant difference in performance between the Logistic Regression model and the Naïve Bayes Classifier. However, the Logistic Regression model makes it possible to identify the variables with the most significant influence on the model's performance through the values of the coefficients within the logit.
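
The pre-processing steps and the two added kinds of variables described in the summary could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the stop-word list, the emoji mapping, the function names `preprocess` and `extract_features`, and the choice of "first position of each word" as the position variable are all hypothetical and are not taken from the thesis.

```python
import re
from collections import Counter

# Hypothetical lookup tables: the summary does not list the actual
# stop words or emoji-to-word mappings used in the study.
STOP_WORDS = {"i", "you", "the", "a", "to", "in", "of"}
EMOJI_WORDS = {"😡": "angry", "😂": "laughing"}


def preprocess(text: str) -> list[str]:
    """Lowercase, map emojis to words, collapse spaces, drop stop words."""
    text = text.lower()
    for emoji, word in EMOJI_WORDS.items():
        # Pad with spaces so the emoji word becomes its own token.
        text = text.replace(emoji, f" {word} ")
    # Collapse runs of whitespace into single spaces, then tokenize.
    tokens = re.sub(r"\s+", " ", text).strip().split(" ")
    return [t for t in tokens if t and t not in STOP_WORDS]


def extract_features(tokens: list[str]) -> dict[str, dict[str, int]]:
    """Return word frequencies and each word's first position (0-based).

    Using the first position is an assumption; the summary only says
    that the position of the word in the sentence is recorded.
    """
    first_position: dict[str, int] = {}
    for index, token in enumerate(tokens):
        first_position.setdefault(token, index)
    return {"count": dict(Counter(tokens)), "first_position": first_position}


tokens = preprocess("I   HATE the rain 😡 hate")
features = extract_features(tokens)
print(tokens)
print(features)
```

Feature dictionaries of this kind would then be turned into numeric vectors and passed to the Logistic Regression and Naïve Bayes models; that step is omitted here because the summary does not describe how the study encoded them.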