Big data processing model for authorship identification

The era of Big Data has arrived, and quintillions of bytes of data are produced daily on average. Data takes many forms, such as images, documents, and movies. Document files include both digitized and handwritten documents, which often raise issues of copyright or ownership. This is du...

Bibliographic Details
Main Authors:	Eng, T. C., Hasan, S., Shamsuddin, S. M., Wong, N. E., Jalil, I. A.
Format:	Article
Published:	International Center for Scientific Research and Studies 2017
Subjects:	QA75 Electronic computers. Computer science
Online Access:	http://eprints.utm.my/id/eprint/76314/ https://www.scopus.com/inward/record.uri?eid=2-s2.0-85033701418&partnerID=40&md5=48ba51a458663108253929e5316fcc55
Institution:	Universiti Teknologi Malaysia

id	my.utm.76314
record_format	eprints
spelling	my.utm.76314 2018-06-29T22:01:19Z http://eprints.utm.my/id/eprint/76314/ Eng, T. C. and Hasan, S. and Shamsuddin, S. M. and Wong, N. E. and Jalil, I. A. (2017) Big data processing model for authorship identification. International Journal of Advances in Soft Computing and its Applications, 9 (3). pp. 1-22. ISSN 2074-8523. Article, PeerReviewed. QA75 Electronic computers. Computer science. International Center for Scientific Research and Studies 2017. https://www.scopus.com/inward/record.uri?eid=2-s2.0-85033701418&partnerID=40&md5=48ba51a458663108253929e5316fcc55
institution	Universiti Teknologi Malaysia
building	UTM Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Teknologi Malaysia
content_source	UTM Institutional Repository
url_provider	http://eprints.utm.my/
topic	QA75 Electronic computers. Computer science
description	The era of Big Data has arrived, and quintillions of bytes of data are produced daily on average. Data takes many forms, such as images, documents, and movies. Document files include both digitized and handwritten documents, which often raise issues of copyright or ownership, because improper authentication can lead to illegitimate authorship claims over a handwritten document. Authorship identification is a sub-area of Document Image Analysis and Recognition (DIAR), which aims to analyze documents and identify their authors. For large collections of document text images, however, processing time becomes a critical factor in authorship identification. This study therefore proposes an alternative approach for handling massive numbers of document text images: integrating Hadoop MapReduce with Spark's MLlib so that authorship identification runs on parallelized data processing. MapReduce serves as the platform for pre-processing the document text images stored in the Hadoop Distributed File System (HDFS), followed by authorship identification using the Apache Spark machine learning library. The experiments show that the integration was implemented successfully for large volumes of document text images. However, the post-analytics of the reduced document text images needs further improvement to achieve better identification.
format	Article
author	Eng, T. C. Hasan, S. Shamsuddin, S. M. Wong, N. E. Jalil, I. A.
author_facet	Eng, T. C. Hasan, S. Shamsuddin, S. M. Wong, N. E. Jalil, I. A.
author_sort	Eng, T. C.
title	Big data processing model for authorship identification
title_sort	big data processing model for authorship identification
publisher	International Center for Scientific Research and Studies
publishDate	2017
url	http://eprints.utm.my/id/eprint/76314/ https://www.scopus.com/inward/record.uri?eid=2-s2.0-85033701418&partnerID=40&md5=48ba51a458663108253929e5316fcc55
_version_	1643657275527659520
