Fake news detection using social media data
Format: Final Year Project
Language: English
Published: Nanyang Technological University, 2021
Online Access: https://hdl.handle.net/10356/147544
Institution: Nanyang Technological University
Summary:
With the large-scale shift toward online social media, a large volume of "fake news", i.e., articles that purposefully contain false information, is being spread across the network [1]. Fake news can be produced for many purposes, such as financial or political gain, and can have a negative impact on society. To mitigate this impact, it is therefore crucial to develop methods to detect fake news on social media.
This project involves discovering the best state-of-the-art machine learning model for detecting fake news on social media. By researching and analyzing several data sources, experimenting with previously used models, and exploring new Transformer-based models, this project aims to determine which models classify news articles into their respective classes most accurately.
In this report, the author reviews multiple data sources and applies exploratory data analysis to filter out biased datasets. Three crucial metrics were defined to inspect each dataset: amount of data, credibility, and bias. By applying these techniques and metrics, the author was able to identify data sources that are unbiased and fit for training.
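As an illustration of such dataset checks, the sketch below computes simple proxies for the amount-of-data and bias metrics. The file name, column names, and balance heuristic are assumptions; the report does not show the author's actual code.

```python
import pandas as pd

# Hypothetical dataset inspection; "news_dataset.csv" and its columns
# (text, label) are assumed for illustration only.
df = pd.read_csv("news_dataset.csv")

# Amount of data: total articles and articles per class.
print("total articles:", len(df))
print(df["label"].value_counts())

# Bias proxy: class imbalance ratio (closer to 1.0 = more balanced).
counts = df["label"].value_counts()
imbalance = counts.min() / counts.max()
print(f"class balance ratio: {imbalance:.2f}")
```

Credibility is harder to automate and would typically involve checking the provenance of each source against fact-checking outlets, which is why it is listed here as a manual metric rather than coded.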
This report also explores the pre-processing steps applied to the news articles. The author found that the level of text pre-processing needed is determined by the data domain and the amount of data. By implementing multiple versions of data pre-processing, the author was able to characterize the dataset domain and select the most suitable pre-processing method. Furthermore, based on this experiment, the author identified a pattern in which pairings of machine learning algorithm and pre-processing technique yield the highest accuracy.
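The report does not list the exact cleaning steps, so the two variants below are plausible examples of "light" and "heavy" pre-processing rather than the author's implementation; the stopword list is deliberately abbreviated.

```python
import re
import string

# Abbreviated stopword list for illustration; a real pipeline would use
# a full list (e.g. from NLTK).
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "on"}

def light_clean(text: str) -> str:
    """Minimal cleaning: lowercase and collapse whitespace.
    Often sufficient for transformer models with subword tokenizers."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def heavy_clean(text: str) -> str:
    """Aggressive cleaning: strip punctuation and stopwords.
    Typically pairs better with bag-of-words models like Naive Bayes."""
    text = light_clean(text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in STOPWORDS)

print(heavy_clean("Breaking: The markets are CRASHING!!!"))
# -> "breaking markets crashing"
```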
For this experiment, multiple machine learning algorithms are introduced, such as Naïve Bayes, a word-embedding LSTM, and the newer Transformer models. To evaluate each model's performance, the author splits the data into three sets (train, validation, and test) to mitigate overfitting and reduce bias. With accuracy as the main metric, the author also uses multiple supporting metrics, such as F1-score, precision, recall, and MCC (Matthews correlation coefficient). These metrics further support the author's decision in determining the best model without concern about overfitting.
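A minimal sketch of this evaluation setup is shown below, assuming scikit-learn; the split ratios and the `load_articles` helper are hypothetical, as the report does not state the exact proportions used.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, matthews_corrcoef)

X, y = load_articles()  # hypothetical loader: article texts and 0/1 labels

# Three-way split: 70% train, 15% validation, 15% test (ratios assumed).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.18, stratify=y_trainval,
    random_state=42)

def report(y_true, y_pred):
    """Score test predictions on every metric named in the abstract."""
    print("accuracy :", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall   :", recall_score(y_true, y_pred))
    print("F1-score :", f1_score(y_true, y_pred))
    print("MCC      :", matthews_corrcoef(y_true, y_pred))
```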
The experimental results show that the newest models, Transformers, perform the best among all models. They consistently achieve the highest benchmark, surpassing the previously developed models by 5 to 15%. The Transformer models consistently reached the highest accuracy, around 87-88%, without overfitting and while using standard base parameters. The results indicate that Transformer models (particularly ELECTRA and BERT) are the best state-of-the-art machine learning models for fake news classification problems. The experiments also imply that further research and experiments can be done with larger parameters, combined with generative upscaling and sentiment analysis, to obtain even higher performance.
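For reference, fine-tuning a base-size transformer with default-style hyperparameters, as the abstract describes, could look like the sketch below using the Hugging Face transformers library; the dataset objects, output directory, and hyperparameter values are assumptions, not the author's configuration.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Base-size checkpoints as named in the abstract; ELECTRA would use
# "google/electra-base-discriminator" instead.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2)  # two classes: real vs. fake

args = TrainingArguments(output_dir="fake-news-model",      # assumed path
                         num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds,   # assumed: tokenized train split
                  eval_dataset=val_ds)      # assumed: tokenized val split
trainer.train()
```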