Fake news detection using social media data

Bibliographic Details
Main Author: Widjaja, Elbert
Other Authors: Ke Yiping, Kelly
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2021
Online Access: https://hdl.handle.net/10356/147544
Institution: Nanyang Technological University
Description
Summary: Along with the large-scale shift to online social media, a growing number of "fake news" articles, i.e., articles that purposefully contain false information, are being spread across the network [1]. Fake news can be produced for many purposes, such as financial or political gain, and can have a negative impact on society. To mitigate this impact, it is crucial to develop methods for detecting fake news on social media. This project involves discovering the best state-of-the-art machine learning model for detecting fake news on social media. By researching and analyzing several data sources, experimenting with previously used models, and exploring new Transformer-based models, the project aims to determine which models classify news articles into their respective classes most accurately.

In this report, the author reviews multiple data sources and applies exploratory data analysis to filter out biased datasets. Three criteria were created to inspect each dataset: amount of data, credibility, and bias. By applying these techniques and criteria, the author identified data sources that are unbiased and suitable for training.

The report also explores the pre-processing steps applied to news articles. The author found that the level of text pre-processing required depends on the data domain and the amount of data. By implementing multiple versions of data pre-processing, the author characterized the dataset domain and selected the most suitable pre-processing method. Based on these experiments, the author also identified a pattern in which pairings of machine learning algorithm and pre-processing technique yield the highest accuracy.

The experiments cover multiple machine learning algorithms, including Naïve Bayes, a word-embedding LSTM, and the newer transformer models. To evaluate model performance, the data is split into three sets (train, validation, and test) to mitigate overfitting and reduce bias. Accuracy is the main metric, supported by F1-score, precision, recall, and the Matthews correlation coefficient (MCC); together these metrics support the choice of the best model without concern about overfitting.

The experimental results show that the newest models, the transformers, perform best among all models tested, consistently surpassing the earlier models by 5 to 15%. The transformer models consistently reached the highest accuracy, around 87-88%, without overfitting and while using standard base parameters. The results indicate that transformer models (particularly ELECTRA and BERT) are the best state-of-the-art machine learning models for the fake news classification problem. The experiments also suggest that further research with larger parameter counts, combined with generative upscaling and sentiment analysis, could achieve even higher performance.
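To make the pre-processing discussion concrete, the following is a minimal sketch of two pre-processing levels, a light and a heavier variant, written in Python with NLTK. The specific operations shown (lowercasing, stop-word removal, lemmatization) are illustrative assumptions; the record does not specify which steps the project's pipelines actually used.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess_light(text):
    # Minimal cleaning: lowercase and strip non-alphabetic characters.
    return re.sub(r"[^a-z\s]", " ", text.lower())

def preprocess_heavy(text):
    # Heavier cleaning: additionally drop stop words and lemmatize each token.
    tokens = preprocess_light(text).split()
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)

A lighter pipeline tends to suit large datasets and context-sensitive models such as transformers, while heavier cleaning can help smaller, bag-of-words-style setups; which level wins is exactly the domain- and size-dependent question the report investigates.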
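The evaluation protocol described in the summary (a train/validation/test split with accuracy, F1-score, precision, recall, and MCC) could be sketched with scikit-learn as below. The 80/10/10 split ratio and the TF-IDF plus Multinomial Naïve Bayes baseline are assumptions for illustration, not details taken from the report.

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, matthews_corrcoef)

def evaluate_baseline(texts, labels):
    # Hold out 10% for testing, then 1/9 of the remainder for validation,
    # giving an assumed 80/10/10 split (the report's ratio is not stated).
    X_rest, X_test, y_rest, y_test = train_test_split(
        texts, labels, test_size=0.1, stratify=labels, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=1 / 9, stratify=y_rest, random_state=42)

    # TF-IDF features with a Multinomial Naive Bayes classifier as a baseline.
    vectorizer = TfidfVectorizer(stop_words="english", max_features=50_000)
    model = MultinomialNB()
    model.fit(vectorizer.fit_transform(X_train), y_train)

    # The validation set (X_val, y_val) would drive model selection;
    # final numbers are reported on the untouched test set.
    y_pred = model.predict(vectorizer.transform(X_test))
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "mcc": matthews_corrcoef(y_test, y_pred),  # robust to class imbalance
    }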
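Likewise, a minimal fine-tuning sketch for the base-size transformer checkpoints named in the summary (ELECTRA and BERT) is given below, using the Hugging Face transformers library. The checkpoint name, hyperparameters, and toy data are all assumptions; the report's exact configuration is not given in this record.

import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumed checkpoint; "bert-base-uncased" would be the analogous BERT choice.
CHECKPOINT = "google/electra-base-discriminator"
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

class NewsDataset(torch.utils.data.Dataset):
    """Wraps tokenized articles with 0/1 labels (0 = real, 1 = fake, assumed)."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=512)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Toy placeholder data; a real run would use the project's news datasets.
train_ds = NewsDataset(["NASA lands a rover on Mars.",
                        "Celebrity cures disease with secret fruit."], [0, 1])
val_ds = NewsDataset(["Parliament passes the new budget.",
                      "Moon confirmed to be hollow."], [0, 1])

# Standard base-model hyperparameters, in line with the summary's remark
# that the models ran with standard base parameters.
args = TrainingArguments(output_dir="fake-news-model", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds).train()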