Using unsupervised techniques and manual analysis: A framework for discovering themes from social media posts
Main Author: | |
---|---|
Format: | text |
Language: | English |
Published: | Animo Repository, 2014 |
Online Access: | https://animorepository.dlsu.edu.ph/etd_masteral/4623 |
Institution: | De La Salle University |
Summary: | Given the role of social media in modern society, it is imperative that data from these sources be organized so that it can be properly utilized. Current technologies rely on supervised learning approaches that require the development of training data. However, for this training data to be useful, accurate or expert knowledge is often required. As an alternative to manual approaches, which are impractical and uneconomical, social scientists use Natural Language Processing (NLP) as a guide for deriving themes from a dataset. However, these automatic approaches are either biased toward frequently occurring terms or do not provide enough information to aid experts. Given these constraints, a framework that combines unsupervised methods with manual topic extraction is presented.
For this research, data gathered from related studies (Meier, 2012a; Meier, 2012b; Pablo, Oco, Cheng, Roldan, & Roxas, 2014) are first preprocessed and represented using a bag-of-words representation with the TF-IDF weighting scheme. The entire dataset then undergoes feature reduction to reduce the dimensionality of the vector space. Next, k-means clustering (k = 3, 5, and 8) is used to organize the data into categories. The silhouette coefficients of the clusters indicate that the clustering suffers from the high dimensionality of the features. Furthermore, because the output of unsupervised methods is unlabeled, content analysis using open coding is performed. Evaluation of the assigned labels yielded an agreement rate of 41.5%, while analysis of the results shows different types of cluster behavior: (1) multi-clustered themes, (2) consistent clusters, (3) multi-topic clusters, (4) language clusters, and (5) dispersing clusters. As future work, an improved preprocessing technique could be used for the clustering, and other values of k could be explored. |
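The pipeline described in the summary (bag-of-words with TF-IDF weighting, feature reduction, k-means, silhouette coefficient) can be illustrated with a minimal standard-library sketch. This is not the thesis implementation. The sample `posts`, the smoothed IDF variant, the `min_df` document-frequency cut-off used as a stand-in for feature reduction, Euclidean distance, and the seeded random initialization are all assumptions made for illustration. The record does not state which variants were actually used.

```python
import math
import random
from collections import Counter

# Hypothetical posts; the thesis used data from prior studies instead.
posts = [
    "flood water rising near the bridge",
    "bridge closed because of flood water",
    "need food and water at the shelter",
    "shelter needs food donations today",
    "power outage in the north district",
    "north district still without power",
]

def tfidf_vectors(docs, min_df=1):
    """Bag-of-words counts weighted by TF-IDF and L2-normalised.
    min_df drops rare terms (an assumed, very simple feature reduction)."""
    tokenised = [doc.lower().split() for doc in docs]
    df = Counter(t for toks in tokenised for t in set(toks))
    vocab = sorted(t for t, count in df.items() if count >= min_df)
    n = len(docs)
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1 (assumed variant).
        vec = [tf[t] * (math.log((1 + n) / (1 + df[t])) + 1) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; initial centroids are k sampled points."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: dist(v, centroids[c])) for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def silhouette(vectors, labels):
    """Mean silhouette coefficient; singleton clusters score 0."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two non-empty clusters")
    scores = []
    for i, v in enumerate(vectors):
        same = [dist(v, w) for j, w in enumerate(vectors)
                if labels[j] == labels[i] and j != i]
        if not same:
            scores.append(0.0)
            continue
        a = sum(same) / len(same)  # mean distance within own cluster
        b = min(                   # mean distance to nearest other cluster
            sum(dist(v, w) for j, w in enumerate(vectors) if labels[j] == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

vocab, vectors = tfidf_vectors(posts)
for k in (2, 3):  # the thesis reports k = 3, 5 and 8 on its own data
    labels = kmeans(vectors, k)
    print(k, labels, round(silhouette(vectors, labels), 3))
```

A silhouette value near 1 indicates compact, well-separated clusters, while values near 0 indicate overlapping ones. In a sparse, high-dimensional term space, distances between documents tend to become similar, which is one plausible reading of the low silhouette coefficients the summary reports.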