Identifying behavioral anomalies in Twitter users

Bibliographic Details
Main Author: Teo, Yue Qi
Other Authors: Yeo Chai Kiat
Format: Final Year Project
Language: English
Published: 2016
Online Access: http://hdl.handle.net/10356/66710
Institution: Nanyang Technological University
Description
Summary: Social media are now widely used for communication, business-to-consumer (B2C) engagement and more. Twitter in particular has become a popular medium for communication, and its popularity has drawn increasing attention from spammers and cyber attackers who seek to disrupt the Twitter experience through spam, hacking and similar activity. The objective of this project is therefore to identify behavioral anomalies in Twitter users, detecting suspicious accounts by the features in which they differ from normal users, so that early detection can minimize the damage they inflict. To this end, the Twitter REST API was first used to obtain the latest 50 tweets of 10,000 users, and two data mining algorithms, Random Forest and Time Series Anomaly Detection, were then applied. Approximately 15,000 tweets were labelled manually; the remaining tweets were labelled automatically using the predictions of the Random Forest algorithm, implemented with the ‘caret’ R package in RStudio. The tweet dataset was then processed into per-user features, which form the final dataset used for identifying behavioral anomalies; Spyder, from the Anaconda distribution, was used as the processing tool. The Time Series Anomaly Detection algorithm, implemented with Twitter’s AnomalyDetection R package in RStudio, was used to identify abnormal tweeting frequencies over a period of time. Finally, the values of the users’ features were plotted to show the differences between anomalous and normal users. The results show that behavioral anomalies can be identified in features such as retweet count, followers count, friends count and followers-to-friends ratio, among others.
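The step of turning tweets into per-user features could look roughly like the following Python sketch. It is illustrative only and is not taken from the project: the flattened record layout, the `user_id` key and the function name `build_user_features` are assumptions, and the project's actual feature set is only partly listed in the summary.

```python
from collections import defaultdict
from statistics import mean


def build_user_features(tweets):
    """Aggregate tweet-level records into one feature row per user.

    Each tweet is assumed (hypothetically) to be a flat dict carrying
    'user_id', 'retweet_count', 'followers_count' and 'friends_count'.
    """
    by_user = defaultdict(list)
    for tweet in tweets:
        by_user[tweet["user_id"]].append(tweet)

    features = {}
    for user_id, user_tweets in by_user.items():
        # Profile counts are read from the last record seen for the user.
        latest = user_tweets[-1]
        followers = latest["followers_count"]
        friends = latest["friends_count"]
        features[user_id] = {
            "tweet_count": len(user_tweets),
            "mean_retweet_count": mean(t["retweet_count"] for t in user_tweets),
            "followers_count": followers,
            "friends_count": friends,
            # Divide by at least 1 so accounts following nobody do not fail.
            "followers_to_friends": followers / max(friends, 1),
        }
    return features
```

Each resulting row can then be compared across users labelled anomalous and normal, which is the kind of comparison the summary reports.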
In addition, anomalous users tend to show greater variation in tweeting frequency at unexpected occasions and time frames, whereas normal users usually become much more active around special events, which in this case were the Christmas and New Year festive seasons. The results also show that anomalous users’ behaviors are versatile and more prone to change than those of normal users. Overall, the objective of the project was achieved, but some areas were left unstudied owing to time constraints and limited manpower; these can be researched further in future to improve the identification of behavioral anomalies.
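To illustrate what flagging abnormal tweeting frequency means, here is a minimal Python sketch. It deliberately substitutes a much simpler technique than the one the project used: a robust z-score based on the median and the median absolute deviation (MAD), rather than the algorithm in Twitter's AnomalyDetection R package. It ignores seasonality, so it would not by itself separate festive-season peaks from genuinely unexpected bursts. The function name, the daily granularity and the threshold of 3.5 are assumptions.

```python
from statistics import median


def flag_anomalous_days(daily_counts, threshold=3.5):
    """Return indices of days whose tweet count is far from the median.

    Uses a robust z-score: 0.6745 * |x - median| / MAD.
    """
    center = median(daily_counts)
    mad = median(abs(c - center) for c in daily_counts)
    if mad == 0:
        # No spread around the median: flag every day that differs from it.
        return [i for i, c in enumerate(daily_counts) if c != center]
    # 0.6745 rescales the MAD so scores are comparable to standard deviations.
    return [
        i
        for i, c in enumerate(daily_counts)
        if 0.6745 * abs(c - center) / mad > threshold
    ]


# Hypothetical daily tweet counts for one user; day 7 is a sudden burst.
counts = [3, 4, 2, 3, 5, 4, 3, 40, 3, 4]
print(flag_anomalous_days(counts))
```

The median and MAD are used instead of the mean and standard deviation because a single large burst would otherwise inflate the baseline it is being compared against.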