Event detection from social media on COVID-19

Bibliographic Details
Main Author: Ho, Yin Wee
Other Authors: Sun Aixin
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2022
Online Access: https://hdl.handle.net/10356/156483
Institution: Nanyang Technological University
Description
Summary: Event detection has been one of the most important research topics in social media analysis this decade due to the widespread availability of rich data generated by social media platforms. These platforms have become a major source of information describing real-world and trending events. However, detecting events is challenging because of the dynamic nature and high volume of data produced in social media streams. Most previous work applies only to breaking news or localised events and overlooks other significant events. Furthermore, this work focused on Twitter data, and the same techniques cannot be directly applied to Facebook data. In this project, we implemented an event detection system based on word embeddings, adapted for detecting events in our Facebook dataset. The system comprises four components: 1) Stream Splitter, 2) Word Embedder and Document Clustering (within individual time windows), 3) Document Clustering (across all time windows) and 4) Event Summarisation. In 1), we preprocessed the data with natural language processing techniques before splitting it into separate time windows. In 2), we embedded our documents with three different models (Skip-gram, TF-IDF and GloVe) and clustered the documents within their individual time windows using a modified version of the Jarvis-Patrick clustering algorithm. Document similarity was measured by the cosine similarity of each pair of documents; a pair was placed in the same event cluster if its score exceeded a certain threshold. In 3), we applied the same techniques as in the previous component, this time clustering the event clusters across the entire time frame. Finally, in 4), we extracted a representative post and the five most frequently occurring words to describe each event cluster.
After tuning the hyperparameters to obtain the best possible set of results for each model, we found that TF-IDF produced the highest-quality events but detected only a moderate number of them. In contrast, Skip-gram and GloVe detected more events of slightly lower quality, and more work is needed to filter out the less significant ones. Finally, we also tracked the development of some sample topics over time and the public’s reactions to them. These insights can help to qualify the public’s perception of certain topics, which can aid in shaping the authorities’ approach when introducing them to the public.
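
The pipeline described in the summary can be illustrated with a minimal Python sketch. It is not the project's implementation. It assumes whitespace tokenisation, a smoothed TF-IDF weighting, and plain threshold linkage (any pair of documents scoring above the threshold joins the same cluster) in place of the modified Jarvis-Patrick algorithm. All function names, the threshold of 0.5, the one-day window and the sample posts are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def split_into_windows(posts, window=timedelta(days=1)):
    """Group (timestamp, text) posts into fixed-length time windows."""
    if not posts:
        return []
    start = min(ts for ts, _ in posts)
    buckets = defaultdict(list)
    for ts, text in posts:
        # Integer index of the window this post falls into.
        buckets[(ts - start) // window].append(text)
    return [buckets[k] for k in sorted(buckets)]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (word -> weight) per document."""
    tokenised = [doc.lower().split() for doc in docs]
    df = Counter(word for tokens in tokenised for word in set(tokens))
    n = len(docs)
    return [
        {w: tf * (math.log((1 + n) / (1 + df[w])) + 1)
         for w, tf in Counter(tokens).items()}
        for tokens in tokenised
    ]


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster_by_threshold(docs, threshold=0.5):
    """Merge documents whose pairwise cosine similarity exceeds threshold."""
    vectors = tfidf_vectors(docs)
    parent = list(range(len(docs)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vectors[i], vectors[j]) > threshold:
                parent[find(j)] = find(i)

    clusters = defaultdict(list)
    for i, doc in enumerate(docs):
        clusters[find(i)].append(doc)
    return list(clusters.values())


def top_words(cluster, k=5):
    """Return the k most frequent words across the posts in a cluster."""
    counts = Counter(word for doc in cluster for word in doc.lower().split())
    return [word for word, _ in counts.most_common(k)]


# Made-up sample posts, for illustration only.
posts = [
    (datetime(2021, 5, 1, 9), "vaccine booking opens for seniors"),
    (datetime(2021, 5, 1, 10), "vaccine booking opens for seniors today"),
    (datetime(2021, 5, 1, 11), "hawker centre closed for cleaning"),
    (datetime(2021, 5, 2, 9), "new travel rules announced"),
]
for window_docs in split_into_windows(posts):
    for cluster in cluster_by_threshold(window_docs):
        # First post stands in for the "representative post".
        print(top_words(cluster), "|", cluster[0])
```

The same `cluster_by_threshold` idea could be reused for the cross-window step by treating each per-window cluster as a single document, for example by concatenating its posts. A real system would likely also remove stop words before counting the top words, since whitespace tokenisation lets common words such as "for" rank highly.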