Automatic document clustering and indexing of multiple documents using KNMF for feature extraction through Hadoop and lucene on big data

Automatic indexing; Big data; Cluster analysis; Extraction; Factorization; Indexing (of information); Information retrieval; K-means clustering; Natural language processing systems; Open source software; Open systems; Pattern matching; Software quality; Software testing; Text mining; Hadoop; Key phr...

Bibliographic Details
Main Authors:	Laxmi Lydia E., Sharmili N., Nguyen P.T., Hashim W., Maseleno A.
Other Authors:	57196059278
Format:	Article
Published:	Mattingley Publishing 2023
Institution:	Universiti Tenaga Nasional

id	my.uniten.dspace-24853
record_format	dspace
spelling	my.uniten.dspace-24853 2023-05-29T15:27:55Z Automatic document clustering and indexing of multiple documents using KNMF for feature extraction through Hadoop and lucene on big data Laxmi Lydia E. Sharmili N. Nguyen P.T. Hashim W. Maseleno A. 57196059278 57191575400 57216386109 11440260100 55354910900 Automatic indexing; Big data; Cluster analysis; Extraction; Factorization; Indexing (of information); Information retrieval; K-means clustering; Natural language processing systems; Open source software; Open systems; Pattern matching; Software quality; Software testing; Text mining; Hadoop; Key phrase extractions; Map-reduce; Pattern-matching technique; Porters; Pre-processing algorithms; Software environments; Unlabeled; Matrix algebra The volume of unlabeled text data in documents has grown, and mining such datasets is a challenging task. The objective of Big Data is to store, retrieve and analyse multiple text documents. Problem Statement: The retrieval of identical data over large databases is of major concern. Existing Solution: The existing problem is addressed by Full-Text Search (FTS), a pattern-matching technique that allows searching for multiple keywords at the same time. Proposed Solution: In this paper, we take multiple text documents as input and process them using text mining pre-processing algorithms such as key phrase extraction, Porter's stemming for tokenizing, and TF-IDF to obtain all non-negative values. These values are further processed to get matrix data through nonnegative matrix factorization (NMF). After performing NMF, the K-means algorithm is upgraded with NMF to obtain quality clusters of the data sets. The performance of the algorithms is tested using Newsgroup20 data in the open-source Hadoop software environment, which also analyses the performance of the MapReduce framework. The final outcome is to generate clusters and index them for the Newsgroup20 dataset. Later on, Apache Lucene is presented for automatic document clustering with a GUI interface developed for indexing. Thus, the proposed algorithm improves the performance of document clustering through the MapReduce framework in Hadoop. © 2019 Mattingley Publishing. All rights reserved. Final 2023-05-29T07:27:55Z 2023-05-29T07:27:55Z 2019 Article 2-s2.0-85079574447 https://www.scopus.com/inward/record.uri?eid=2-s2.0-85079574447&partnerID=40&md5=1ed7ff4baa70eeccef9e5755fa21fcec https://irepository.uniten.edu.my/handle/123456789/24853 81 11-Dec 1107 1130 Mattingley Publishing Scopus
institution	Universiti Tenaga Nasional
building	UNITEN Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Tenaga Nasional
content_source	UNITEN Institutional Repository
url_provider	http://dspace.uniten.edu.my/
description	Automatic indexing; Big data; Cluster analysis; Extraction; Factorization; Indexing (of information); Information retrieval; K-means clustering; Natural language processing systems; Open source software; Open systems; Pattern matching; Software quality; Software testing; Text mining; Hadoop; Key phrase extractions; Map-reduce; Pattern-matching technique; Porters; Pre-processing algorithms; Software environments; Unlabeled; Matrix algebra
author2	57196059278
author_facet	57196059278 Laxmi Lydia E. Sharmili N. Nguyen P.T. Hashim W. Maseleno A.
format	Article
author	Laxmi Lydia E. Sharmili N. Nguyen P.T. Hashim W. Maseleno A.
spellingShingle	Laxmi Lydia E. Sharmili N. Nguyen P.T. Hashim W. Maseleno A. Automatic document clustering and indexing of multiple documents using KNMF for feature extraction through Hadoop and lucene on big data
author_sort	Laxmi Lydia E.
title	Automatic document clustering and indexing of multiple documents using KNMF for feature extraction through Hadoop and lucene on big data
title_short	Automatic document clustering and indexing of multiple documents using KNMF for feature extraction through Hadoop and lucene on big data
title_full	Automatic document clustering and indexing of multiple documents using KNMF for feature extraction through Hadoop and lucene on big data
title_fullStr	Automatic document clustering and indexing of multiple documents using KNMF for feature extraction through Hadoop and lucene on big data
title_full_unstemmed	Automatic document clustering and indexing of multiple documents using KNMF for feature extraction through Hadoop and lucene on big data
title_sort	automatic document clustering and indexing of multiple documents using knmf for feature extraction through hadoop and lucene on big data
publisher	Mattingley Publishing
publishDate	2023
_version_	1806428295791640576

Similar Items