Towards better prediction and content detection through online social media mining

Bibliographic Details
Main Author: Chen, Weiling
Other Authors: Lau Chiew Tong
Format: Theses and Dissertations
Language: English
Published: 2018
Online Access: http://hdl.handle.net/10356/75925
Institution: Nanyang Technological University
Description
Summary: With the astronomical growth of Online Social Networks (OSN), these platforms have become a new target for cyber criminals such as spammers and phishers, as well as for many advertisers, which has led to worrying issues ranging from low-quality content to phishing and fraud. Rumor diffusion is another problem with serious social consequences: because information propagates faster on OSN than ever before, the negative impact of rumors is correspondingly worse. Nevertheless, people will not stop using OSN to interact with friends and acquaintances, share news and information, and take part in other online activities merely because of these issues. Indeed, with content collected from OSN, data analysts can predict box-office results, terrorist activity, stock prices, and many other topics of interest. OSN is thus a double-edged sword, and it is necessary to reduce its negative effects so that as many individuals and organizations as possible can benefit. In this thesis, the author carries out research on making detection and prediction tasks more accurate by mining different aspects of the content collected from OSN.
Techniques for detecting malicious content such as spam and phishing on OSN are common, whereas little attention has been paid to other low-quality content, which actually affects the user browsing experience most. The author proposes a framework to detect low-quality content from the users' perspective in real time. Based on preliminary studies, a survey is designed to gather users' opinions on different categories of low-quality content. Both direct and indirect features, including newly proposed ones, are identified to characterize the different types of low-quality content. The author then combines word-level analysis with the identified features and builds a keyword blacklist dictionary to improve detection performance. The author labels an extensive Twitter dataset of 100,000 tweets and performs low-quality content detection in real time based on the significant features identified and on word-level analysis.
Because information can spread more rapidly and widely on OSN than ever before, these platforms have become new hotbeds of misinformation. Owing to the potential harm that false information may bring to the public, rumor detection has become a significant but challenging research topic. To detect the few but potentially harmful rumors before they cause public harm, the author proposes an unsupervised learning model that combines Recurrent Neural Networks (RNN) and Autoencoders (AE) to distinguish rumors, treated as anomalies, from other credible microblogs based on users' behaviors. In addition, new features based on comments posted by other users are proposed and analyzed over their posting time, so as to exploit crowd wisdom and improve detection performance.
People sometimes encounter rumors on OSN because OSN now play a significant role as a platform for information sharing, especially news updates. News from traditional media has long been used to help predict stock movement. This inspires the author to exploit news content collected from OSN to predict stock index movement. In this work, the author selects official accounts from China's largest OSN, Sina Weibo, and analyzes the news content crawled from these accounts by extracting sentiment features and Latent Dirichlet Allocation (LDA) features. The author then feeds these features, together with technical indicators, into a novel model called RNN-boost to predict stock volatility in the Chinese stock market.
The work presented in this thesis demonstrates the boon and bane of OSN and provides methodologies and applications that exploit the beneficial aspects while minimizing OSN's potential negative impact. The author hopes that this thesis offers some insights for future work in related research.
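As a rough illustration of the idea in the summary of combining a keyword blacklist with simple content features to flag low-quality posts, the following Python sketch may help. It is not the thesis's method: the blacklist words, the chosen features (`blacklist_hits`, `url_count`, `hashtag_count`), the scoring weights, and the `threshold` value are all hypothetical, since this record does not list the actual features or dictionary.

```python
import re
from dataclasses import dataclass

# Hypothetical blacklist; the thesis's actual dictionary is not given in this record.
KEYWORD_BLACKLIST = {"free", "win", "click", "giveaway"}

URL_PATTERN = re.compile(r"https?://\S+")
TOKEN_PATTERN = re.compile(r"[a-z']+")


@dataclass
class TweetFeatures:
    """Illustrative features only; not the feature set used in the thesis."""
    blacklist_hits: int
    url_count: int
    hashtag_count: int


def extract_features(text: str) -> TweetFeatures:
    lowered = text.lower()
    # Remove URLs before tokenizing so URL fragments are not counted as words.
    without_urls = URL_PATTERN.sub(" ", lowered)
    tokens = TOKEN_PATTERN.findall(without_urls)
    return TweetFeatures(
        blacklist_hits=sum(1 for token in tokens if token in KEYWORD_BLACKLIST),
        url_count=len(URL_PATTERN.findall(lowered)),
        hashtag_count=lowered.count("#"),
    )


def is_low_quality(text: str, threshold: int = 3) -> bool:
    """Flag a post when a hand-weighted score reaches an assumed threshold."""
    features = extract_features(text)
    score = (
        2 * features.blacklist_hits
        + features.url_count
        + max(0, features.hashtag_count - 2)  # penalize only heavy hashtag use
    )
    return score >= threshold


if __name__ == "__main__":
    for post in (
        "Click here to WIN a FREE phone http://spam.example #win #free",
        "Had lunch with friends today",
    ):
        print(is_low_quality(post), "|", post)
```

A hand-weighted score is used here only to keep the sketch short; a trained classifier over labeled tweets, as the summary describes, would replace the fixed weights and threshold.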