Classification with large datasets
Big data classification is a relevant modern-day problem, and various techniques are being researched to find optimal solutions to it. The sheer size of data nowadays renders many state-of-the-art classification algorithms redundant as training the classifiers on very large datasets degr...
Main Author: | Souryadeep Sen |
---|---|
Other Authors: | Ponnuthurai N. Suganthan |
Format: | Theses and Dissertations |
Language: | English |
Published: | 2018 |
Subjects: | DRNTU::Engineering::Electrical and electronic engineering |
Online Access: | http://hdl.handle.net/10356/76017 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-76017 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-76017 2023-07-04T15:56:10Z Classification with large datasets Souryadeep Sen Ponnuthurai N. Suganthan School of Electrical and Electronic Engineering DRNTU::Engineering::Electrical and electronic engineering Master of Science (Electronics) 2018-09-18T05:28:02Z 2018-09-18T05:28:02Z 2018 Thesis http://hdl.handle.net/10356/76017 en 69 p. application/pdf |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
DRNTU::Engineering::Electrical and electronic engineering |
spellingShingle |
DRNTU::Engineering::Electrical and electronic engineering Souryadeep Sen Classification with large datasets |
description |
Big data classification is a relevant modern-day problem, and various techniques are being researched to find optimal solutions to it. The sheer size of data nowadays renders many state-of-the-art classification algorithms redundant, as training the classifiers on very large datasets increases training time and may also affect the accuracy achieved. The Random Vector Functional Link (RVFL) network is one such base classifier that does not perform well on very large datasets. Various techniques, such as dimensionality reduction and divide and conquer, can be used to overcome this problem of compromised performance.
In this dissertation we experiment extensively with dimension reduction techniques such as Random Projection and Principal Component Analysis, applied before the classification algorithm is run on the dataset. Intuitively, applying dimension reduction to a dataset leads to a loss in classification accuracy compared to using the entire dataset, as information is lost. However, this is traded off against improved time complexity. We attempt to improve the accuracy of the classifiers on reduced datasets by using ensemble variants of the base classifiers. The accuracies are dataset dependent, but those achieved using these ensemble variants on reduced datasets are highly competitive with state-of-the-art classification results on the same datasets (which do not use reduction techniques). Time complexity is considerably improved relative to the state-of-the-art classification methods because dimensionality reduction is employed. The results are highly promising: time complexity is improved and competitive accuracies are achieved even with reduced dimensions. |
author2 |
Ponnuthurai N. Suganthan |
author_facet |
Ponnuthurai N. Suganthan Souryadeep Sen |
format |
Theses and Dissertations |
author |
Souryadeep Sen |
author_sort |
Souryadeep Sen |
title |
Classification with large datasets |
title_short |
Classification with large datasets |
title_full |
Classification with large datasets |
title_fullStr |
Classification with large datasets |
title_full_unstemmed |
Classification with large datasets |
title_sort |
classification with large datasets |
publishDate |
2018 |
url |
http://hdl.handle.net/10356/76017 |
_version_ |
1772828405778087936 |
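
The description above outlines a pipeline of dimension reduction followed by an ensemble of RVFL classifiers, but this record gives no implementation details. The following is only a minimal sketch of that general idea, not the thesis's actual method. It assumes a Gaussian random projection, a tanh hidden layer, and ridge-regression output weights; all function names, parameter values, and the synthetic two-class data are hypothetical and chosen purely for illustration.

```python
import numpy as np


def rvfl_features(X, W, b):
    """Concatenate direct links (X), random hidden features, and a bias column."""
    H = np.tanh(X @ W + b)
    return np.hstack([X, H, np.ones((X.shape[0], 1))])


def train_rvfl(X, y, n_classes, n_hidden, lam, rng):
    """Fit one RVFL: hidden weights stay random, output weights come from ridge regression."""
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    D = rvfl_features(X, W, b)
    Y = np.eye(n_classes)[y]  # one-hot targets
    beta = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ Y)
    return W, b, beta


def fit_ensemble(X, y, n_classes, k, n_members=5, n_hidden=50, lam=1e-2, seed=0):
    """Each member gets its own random projection to k dimensions (an assumption)."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        R = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
        members.append((R, train_rvfl(X @ R, y, n_classes, n_hidden, lam, rng)))
    return members


def predict_ensemble(members, X):
    """Sum the class scores of all members and pick the highest-scoring class."""
    scores = sum(rvfl_features(X @ R, W, b) @ beta for R, (W, b, beta) in members)
    return scores.argmax(axis=1)


def demo(seed=0):
    """Synthetic data: two Gaussian classes in 100 dimensions, reduced to 10."""
    rng = np.random.default_rng(seed)
    n, d = 600, 100
    y = rng.integers(0, 2, size=n)
    X = rng.standard_normal((n, d)) + 2.0 * y[:, None]
    members = fit_ensemble(X[:400], y[:400], n_classes=2, k=10)
    pred = predict_ensemble(members, X[400:])
    return float((pred == y[400:]).mean())


if __name__ == "__main__":
    print("held-out accuracy:", demo())
```

Principal Component Analysis could be swapped in for the random matrix `R` by projecting onto the leading right singular vectors of the centred training data. Giving each ensemble member its own projection is one way to add diversity among members; whether the thesis does this is not stated in the record.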