Classification with large datasets

Bibliographic Details
Main Author: Souryadeep Sen
Other Authors: Ponnuthurai N. Suganthan
Format: Theses and Dissertations
Language: English
Published: 2018
Subjects:
Online Access: http://hdl.handle.net/10356/76017
Institution: Nanyang Technological University
Description
Summary: Big data classification is a pressing modern problem, and a range of techniques are being researched to solve it. The sheer size of today's datasets makes many state-of-the-art classification algorithms impractical, because training classifiers on very large datasets worsens time complexity and may also reduce the accuracy achieved. The Random Vector Functional Link (RVFL) network is one such base classifier that does not perform well on very large datasets. Techniques such as dimensionality reduction and divide-and-conquer can be used to overcome this loss of performance. In this dissertation we experiment extensively with dimensionality reduction techniques such as Random Projection and Principal Component Analysis, applied to the dataset before the classification algorithm. Intuitively, reducing the dimensionality of a dataset discards information and should therefore lower classification accuracy relative to using the full dataset. However, this loss is traded off against improved time complexity. We attempt to recover accuracy on the reduced datasets by using ensemble variants of the base classifiers. The resulting accuracies are dataset dependent, but those achieved by the ensemble variants on reduced datasets are highly competitive with state-of-the-art results on the same datasets, which were obtained without reduction techniques. Because dimensionality reduction is employed, time complexity is considerably improved relative to the state-of-the-art classification methods. The results are highly promising: time complexity is improved and competitive accuracies are achieved even with reduced dimensions.
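
The pipeline outlined in the summary (reduce dimensionality, then train an ensemble of RVFL classifiers on the reduced data) can be illustrated with a minimal sketch. This is not code from the dissertation: the synthetic two-class data, the Gaussian projection matrix, the hidden-layer size, the ridge parameter, and the score-averaging ensemble are all assumptions chosen for illustration, and every function and class name below is hypothetical. Only Random Projection is shown; Principal Component Analysis could be substituted at the same step.

```python
import numpy as np


def gaussian_projection_matrix(d, k, rng):
    """Return a d x k Gaussian random projection matrix (entries ~ N(0, 1/k))."""
    return rng.normal(0.0, 1.0 / np.sqrt(k), size=(d, k))


class RVFL:
    """Minimal RVFL-style classifier (illustrative assumption, not the thesis code):
    a fixed random hidden layer plus direct input-to-output links, with the
    output weights solved in closed form by ridge regression."""

    def __init__(self, n_hidden=50, reg=1e-2, seed=0):
        self.n_hidden, self.reg, self.seed = n_hidden, reg, seed

    def _features(self, X):
        H = np.tanh(X @ self.W + self.b)        # random, untrained hidden features
        ones = np.ones((X.shape[0], 1))         # bias column
        return np.hstack([X, H, ones])          # direct links + hidden + bias

    def fit(self, X, y):
        rng = np.random.default_rng(self.seed)
        self.W = rng.uniform(-1.0, 1.0, size=(X.shape[1], self.n_hidden))
        self.b = rng.uniform(-1.0, 1.0, size=self.n_hidden)
        self.classes_ = np.unique(y)
        T = (y[:, None] == self.classes_[None, :]).astype(float)  # one-hot targets
        D = self._features(X)
        A = D.T @ D + self.reg * np.eye(D.shape[1])
        self.beta = np.linalg.solve(A, D.T @ T)  # ridge solution for output weights
        return self

    def scores(self, X):
        return self._features(X) @ self.beta


def ensemble_predict(models, X):
    """Average the output scores of all members, then pick the arg-max class."""
    mean_scores = np.mean([m.scores(X) for m in models], axis=0)
    return models[0].classes_[np.argmax(mean_scores, axis=1)]


def make_blobs(n, d, shift, rng):
    """Synthetic two-class data: unit-variance Gaussians whose means differ by `shift`."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d)) + shift * y[:, None]
    return X, y


def run_demo(d=100, k=20, n_members=5, seed=0):
    rng = np.random.default_rng(seed)
    X_train, y_train = make_blobs(400, d, 2.0, rng)
    X_test, y_test = make_blobs(200, d, 2.0, rng)
    R = gaussian_projection_matrix(d, k, rng)   # same matrix for train and test
    Z_train, Z_test = X_train @ R, X_test @ R   # d -> k dimensions
    # Ensemble members differ only in their random hidden weights (via the seed).
    models = [RVFL(seed=s).fit(Z_train, y_train) for s in range(n_members)]
    accuracy = np.mean(ensemble_predict(models, Z_test) == y_test)
    return Z_train.shape, accuracy


if __name__ == "__main__":
    shape, accuracy = run_demo()
    print("reduced training shape:", shape, "test accuracy:", accuracy)
```

Calling `run_demo()` returns the shape of the reduced training set and the ensemble's test accuracy on the synthetic data. The figures it produces say nothing about the datasets or results reported in the dissertation.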