Classification with large datasets

Big data classification is a relevant modern-day problem, and various techniques are being researched to find optimal solutions to it. The sheer size of data nowadays renders many state-of-the-art classification algorithms redundant as training the classifiers on very large datasets degr...

Bibliographic Details
Main Author: Souryadeep Sen
Other Authors: Ponnuthurai N. Suganthan
Format: Theses and Dissertations
Language: English
Published: 2018
Subjects:
Online Access: http://hdl.handle.net/10356/76017
Institution: Nanyang Technological University
id sg-ntu-dr.10356-76017
record_format dspace
spelling sg-ntu-dr.10356-76017 2023-07-04T15:56:10Z Classification with large datasets Souryadeep Sen Ponnuthurai N. Suganthan School of Electrical and Electronic Engineering DRNTU::Engineering::Electrical and electronic engineering Master of Science (Electronics) 2018-09-18T05:28:02Z 2018-09-18T05:28:02Z 2018 Thesis http://hdl.handle.net/10356/76017 en 69 p. application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Electrical and electronic engineering
description Big data classification is a relevant modern-day problem, and various techniques are being researched to find optimal solutions to it. The sheer size of data nowadays renders many state-of-the-art classification algorithms redundant, as training the classifiers on very large datasets degrades time complexity and may also affect the accuracy achieved. The Random Vector Functional Link (RVFL) network is one such base classifier that does not perform well on very large datasets. Various techniques, such as dimensionality reduction and divide and conquer, can be used to overcome this problem of compromised performance. In this dissertation we experiment extensively with dimension reduction techniques such as Random Projection and Principal Component Analysis before applying the classification algorithm to the dataset. Intuitively, applying dimension reduction techniques to the datasets would lead to a loss in classification accuracy compared to when the entire dataset is used, as information is lost. However, this is traded off against improved time complexity. We attempt to improve the accuracy of the classifiers on reduced datasets by using ensemble variants of base classifiers. The accuracies are dataset dependent, but the accuracies achieved using these ensemble variants on reduced datasets are highly competitive with state-of-the-art classification results on these datasets (these state-of-the-art results do not use reduction techniques). The time complexity relative to the state-of-the-art classification methods is considerably improved in our case, as dimensionality reduction techniques have been employed. The results are highly promising, as time complexity is improved and competitive accuracies are achieved even with reduced dimensions.
author2 Ponnuthurai N. Suganthan
author_facet Ponnuthurai N. Suganthan
Souryadeep Sen
format Theses and Dissertations
author Souryadeep Sen
author_sort Souryadeep Sen
title Classification with large datasets
publishDate 2018
url http://hdl.handle.net/10356/76017
_version_ 1772828405778087936
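
The abstract describes reducing the dimensionality of a dataset, for example by Random Projection, before a classifier is trained on it. The following is a minimal sketch of that reduction step only, not the thesis's implementation: the function names, the toy data, the sizes (500 features reduced to 50), and the seeds are all assumptions chosen for illustration, and the code uses only the Python standard library where a numerical library would normally be used.

```python
import math
import random


def gaussian_random_projection(rows, k, seed=0):
    """Project each d-dimensional row onto k random Gaussian directions.

    Hypothetical helper for illustration; not taken from the thesis.
    """
    rng = random.Random(seed)
    d = len(rows[0])
    # Entries are drawn from N(0, 1/k), so squared lengths are
    # preserved in expectation after projection.
    scale = 1.0 / math.sqrt(k)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(d)]
    return [
        [sum(row[i] * matrix[i][j] for i in range(d)) for j in range(k)]
        for row in rows
    ]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Assumed toy data: 20 samples with 500 features each.
data_rng = random.Random(42)
samples = [[data_rng.uniform(-1.0, 1.0) for _ in range(500)] for _ in range(20)]

# Reduce 500 features to 50 before any classifier would be trained.
reduced = gaussian_random_projection(samples, k=50, seed=1)

original_distance = euclidean(samples[0], samples[1])
reduced_distance = euclidean(reduced[0], reduced[1])
print(len(reduced), len(reduced[0]))
print(round(original_distance, 2), round(reduced_distance, 2))
```

The two printed distances should be of similar magnitude, which is the property that makes a classifier trained on the reduced data plausible. Projecting the same data with several different seeds and training one classifier per projection is one way an ensemble over reduced datasets could be formed; whether the thesis builds its ensembles this way is not stated in the record.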