Hadoop on data analytics
Main Author: | Gee, Denny Jee King.
---|---
Other Authors: | Lee Bu Sung (School of Computer Engineering)
Format: | Final Year Project (FYP), Bachelor of Engineering (Computer Science), 54 p.
Language: | English
Published: | 2012
Subjects: | DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing
Online Access: | http://hdl.handle.net/10356/48540
Institution: | Nanyang Technological University
Description:
Twitter is a micro-blogging application that lets users share and convey ideas effectively through social connections. The data sets Twitter generates can easily exceed millions of tweets per day, and mining information from such a massive amount of data can be computationally infeasible on a single machine. Hence, we adopted the MapReduce framework provided by Apache Hadoop.

Our model first pre-processed the tweets by tokenizing them into individual words and filtering out stop words and punctuation.
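To make this stage concrete, here is a minimal sketch of a Hadoop mapper for this kind of pre-processing. It is our illustration rather than the project's code: the stop-word list and the tokenization rule are assumptions, and the real pipeline would also key each word by the time interval of its tweet.

```java
import java.io.IOException;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative pre-processing mapper: tokenize a tweet, drop punctuation and
// stop words, and emit (word, 1) pairs; a summing reducer then yields the
// per-word tweet frequencies used by the later DF-IDF jobs.
public class TokenizeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final Set<String> STOP_WORDS =
        Set.of("a", "an", "the", "is", "to", "and", "of", "in"); // assumed list
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text tweet, Context context)
            throws IOException, InterruptedException {
        // Lower-case the tweet, replace punctuation with spaces, split on whitespace.
        for (String token : tweet.toString().toLowerCase()
                                 .replaceAll("[^a-z0-9\\s]", " ")
                                 .split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```

Chained with a summing reducer, jobs of this shape produce the interval-level frequency counts that the next step turns into time-series signals.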
Next, we grouped the remaining words into their respective time intervals, together with their tweet frequency distributions, and constructed a time-series signal for each word from its Document Frequency – Inverse Document Frequency (DF-IDF) vector by chaining a sequence of MapReduce jobs on the Hadoop framework. After this transformation, we computed the autocorrelation of each word signal and filtered out trivial words of little importance using a threshold of 0.1. We further calculated the entropy of each word signal to measure its randomness, so that words with low IDF values and low entropy (H < 0.25) were likewise removed, leaving only words whose time series contain burst features.

The remaining words were then sorted by their autocorrelation coefficients using Hadoop's partitioned sorting mechanism, and a percentile cut-off selected how many words to pass on to event detection. The selected words were mapped onto a cross-correlation matrix over bi-word combinations and represented as an adjacency graph, which we partitioned by modularity to cluster words of similar relevance and features into reconstructed events. The reconstructed events were then evaluated on their relevance to corresponding real-life events.
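The two filtering tests can be sketched directly. The snippet below is our illustration, not the report's code: it assumes a lag-1 autocorrelation coefficient and a Shannon entropy normalised to [0, 1], and it omits the accompanying low-IDF test, so the report's exact definitions may differ.

```java
// Illustrative filter for one word's DF-IDF time series: keep the word only
// if its lag-1 autocorrelation exceeds 0.1 and its normalised entropy is at
// least 0.25 (the method removes low-IDF, low-entropy words as non-bursty).
public final class SignalFilter {

    // Lag-1 autocorrelation coefficient of the signal (assumed definition).
    static double autocorrelation(double[] s) {
        double mean = 0;
        for (double v : s) mean += v;
        mean /= s.length;
        double num = 0, den = 0;
        for (int t = 0; t < s.length; t++) {
            if (t + 1 < s.length) num += (s[t] - mean) * (s[t + 1] - mean);
            den += (s[t] - mean) * (s[t] - mean);
        }
        return den == 0 ? 0 : num / den;
    }

    // Shannon entropy of the signal treated as a probability distribution,
    // divided by log2(n) so the result lies in [0, 1] (assumed normalisation).
    static double entropy(double[] s) {
        if (s.length < 2) return 0;
        double sum = 0;
        for (double v : s) sum += v;
        if (sum == 0) return 0;
        double h = 0;
        for (double v : s) {
            double p = v / sum;
            if (p > 0) h -= p * Math.log(p) / Math.log(2);
        }
        return h / (Math.log(s.length) / Math.log(2));
    }

    static boolean keep(double[] signal) {
        return autocorrelation(signal) > 0.1 && entropy(signal) >= 0.25;
    }
}
```

In the actual pipeline these tests run inside MapReduce jobs over every word signal, and the survivors become the candidates for event detection.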
The computation on our Hadoop cluster gave remarkable results in terms of efficiency and data size. The cluster achieved a 75% reduction in computation time, and at the same time the MapReduce architecture of Hadoop reduced the data size by close to 99% through term indexing.
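A size reduction of this order is plausible because term indexing replaces repeated word strings with compact integer IDs, so each distinct term is stored only once. A minimal sketch of such a dictionary, as our illustration of the idea rather than the report's actual scheme:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative term dictionary: each distinct word gets a compact integer ID,
// so downstream jobs shuffle small integers instead of raw strings.
public final class TermIndex {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> terms = new ArrayList<>();

    // Return the existing ID for a term, or assign the next unused one.
    public int idOf(String term) {
        return ids.computeIfAbsent(term, t -> {
            terms.add(t);
            return terms.size() - 1;
        });
    }

    // Recover the original term from its ID, e.g. when reporting events.
    public String termOf(int id) {
        return terms.get(id);
    }
}
```

Since a tweet corpus repeats a relatively small vocabulary millions of times, storing each term once and passing around IDs is consistent with the compression figure reported above.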