Hadoop on data analytics

Bibliographic Details
Main Author: Gee, Denny Jee King.
Other Authors: Lee Bu Sung
Format: Final Year Project
Language: English
Published: 2012
Subjects:
Online Access: http://hdl.handle.net/10356/48540
Institution: Nanyang Technological University
Description
Summary: Twitter is a micro-blogging application that offers strong capabilities for sharing and conveying ideas through social connectivity. The data sets Twitter generates can easily exceed millions of tweets per day, and mining information from such a massive amount of data can be computationally infeasible. Hence, we adopted the MapReduce framework provided by Apache Hadoop. Our model first pre-processed the tweets by tokenizing them into individual words and filtering away stop words and punctuation. Next, we grouped the remaining words into their respective time intervals, together with their tweet frequency distributions, and constructed time-series signals based on their Document Frequency–Inverse Document Frequency (DF-IDF) vectors by chaining a sequence of MapReduce jobs on the Hadoop framework. After this data transformation, we computed the autocorrelation of each word signal and filtered out trivial words of low importance using a threshold of 0.1. We further calculated the entropy of the word signals to determine their level of randomness, so that words with low IDF values and narrow entropy (H < 0.25) were also removed, leaving only the words whose time series contain burst features. The remaining words were then sorted by their autocorrelation coefficients using the Hadoop partition-sorting mechanism, and we introduced a percentile selection on the number of words retained for event detection. The selected words were then mapped onto a cross-correlation matrix based on bi-word combinations. The words were then represented as an adjacency graph, which we partitioned by modularity, clustering words of similar relevance and features to reconstruct events. The events were then evaluated on their relevance to corresponding real-life events. The computation on our Hadoop cluster gave remarkable results in terms of efficiency and data compression: the cluster achieved a 75% reduction in computation time, while the MapReduce architecture of Hadoop also reduced the data size by close to 99% through term indexing.
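
The pre-processing stage described above is essentially a word-count-style MapReduce job over the raw tweets. The following is a minimal sketch of such a job, assuming each input line carries a tweet as "<timestamp>\t<text>", that intervals are fixed hourly buckets, and that the stop-word list is abbreviated; the input format, interval length, and class names here are illustrative assumptions, not details taken from the report.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TweetTokenizerJob {

  public static class TokenizeMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Abbreviated stop-word list for illustration only.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "a", "an", "and", "to", "of", "in", "rt"));
    private static final long INTERVAL_MS = 60L * 60L * 1000L; // assumed 1-hour buckets
    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t", 2);
      if (fields.length < 2) return;                          // skip malformed lines
      long interval;
      try {
        interval = Long.parseLong(fields[0].trim()) / INTERVAL_MS;
      } catch (NumberFormatException e) {
        return;                                               // skip lines with a bad timestamp
      }
      // Lowercase, strip punctuation, split on whitespace.
      for (String token : fields[1].toLowerCase().replaceAll("[^a-z0-9#@ ]", " ").split("\\s+")) {
        if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
        outKey.set(token + "\t" + interval);                  // (word, interval) composite key
        context.write(outKey, ONE);
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));               // per-interval word frequency
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "tweet tokenizer");
    job.setJarByClass(TweetTokenizerJob.class);
    job.setMapperClass(TokenizeMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The output of this first job (per-word, per-interval counts) is what the later jobs in the chain would consume when building the DF-IDF time-series signals.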
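The autocorrelation and entropy filters can be sketched as plain helpers over a word's DF-IDF time series. The lag-1 autocorrelation and the normalized Shannon entropy used below are assumed formulations; the abstract only states the 0.1 and 0.25 thresholds, so this is one illustrative reading rather than the project's exact computation.

```java
public final class WordSignalFilter {

  static final double AUTOCORR_THRESHOLD = 0.1;   // trivial-word cutoff stated in the abstract
  static final double ENTROPY_THRESHOLD = 0.25;   // "narrow entropy" cutoff stated in the abstract

  /** Lag-1 sample autocorrelation of the signal (assumed lag). */
  static double autocorrelation(double[] s) {
    double mean = 0.0;
    for (double v : s) mean += v;
    mean /= s.length;
    double num = 0.0, den = 0.0;
    for (int t = 0; t < s.length; t++) {
      den += (s[t] - mean) * (s[t] - mean);
      if (t + 1 < s.length) num += (s[t] - mean) * (s[t + 1] - mean);
    }
    return den == 0.0 ? 0.0 : num / den;
  }

  /** Shannon entropy of the signal treated as a distribution, scaled into [0, 1]. */
  static double entropy(double[] s) {
    double sum = 0.0;
    for (double v : s) sum += v;
    if (s.length < 2 || sum == 0.0) return 0.0;
    double h = 0.0;
    for (double v : s) {
      if (v > 0.0) {
        double p = v / sum;
        h -= p * Math.log(p);
      }
    }
    return h / Math.log(s.length);
  }

  /** A word survives if it clears both cutoffs; words below either are treated as trivial/flat. */
  static boolean keep(double[] dfIdfSignal) {
    return autocorrelation(dfIdfSignal) > AUTOCORR_THRESHOLD
        && entropy(dfIdfSignal) >= ENTROPY_THRESHOLD;
  }
}
```

Words that pass this filter are the "bursty" candidates that go on to the partition sort and percentile selection described above.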
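For the cross-correlation matrix over bi-word combinations, a zero-lag normalized cross-correlation (i.e. the Pearson correlation of two word signals) is one plausible reading; the report may use a different lag or normalization, so treat this as a sketch under that assumption.

```java
import java.util.List;

public final class CrossCorrelationMatrix {

  /** Zero-lag normalized cross-correlation of two equal-length word signals. */
  static double crossCorrelation(double[] a, double[] b) {
    double meanA = 0.0, meanB = 0.0;
    for (int t = 0; t < a.length; t++) { meanA += a[t]; meanB += b[t]; }
    meanA /= a.length;
    meanB /= b.length;
    double num = 0.0, varA = 0.0, varB = 0.0;
    for (int t = 0; t < a.length; t++) {
      num  += (a[t] - meanA) * (b[t] - meanB);
      varA += (a[t] - meanA) * (a[t] - meanA);
      varB += (b[t] - meanB) * (b[t] - meanB);
    }
    double den = Math.sqrt(varA * varB);
    return den == 0.0 ? 0.0 : num / den;
  }

  /** Symmetric matrix of pairwise correlations over all bi-word combinations. */
  static double[][] build(List<double[]> signals) {
    int n = signals.size();
    double[][] m = new double[n][n];
    for (int i = 0; i < n; i++) {
      for (int j = i; j < n; j++) {
        m[i][j] = m[j][i] = crossCorrelation(signals.get(i), signals.get(j));
      }
    }
    return m;
  }
}
```

The resulting matrix would then be thresholded into an adjacency graph and partitioned by modularity (for example, with a standard community-detection algorithm) to cluster related words into reconstructed events, as the summary describes.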