Grouping features in big dimensionality


Bibliographic Details
Main Author: Zhai, Yiteng
Other Authors: Ong, Yew Soon
Format: Theses and Dissertations
Language:English
Published: 2016
Online Access:https://hdl.handle.net/10356/66541
Institution: Nanyang Technological University
Description
Summary: To date, the world continues to generate quintillions of bytes of data daily, creating a pressing need for new efforts to deal with the grand challenges brought by Big Data. Among the computational intelligence communities there is a growing consensus that data volume presents an immediate challenge pertaining to scalability. Yet research addressing volume in Big Data analytics has been largely one-sided, focusing on the "Big Instance Size" factor of the data; the flip side of volume, the "Big Dimensionality" of big data, has received much less attention. A motivating example comes from the cell phone industry: today one can easily take pictures at an extremely high resolution of 41 megapixels, roughly 400 times the 0.11 megapixels of almost a decade ago. Under a pixel-based feature representation, this translates directly to 41 million features. Taking this cue, the first work of this dissertation attempts to fill this gap, placing special focus on the relatively under-explored topic of big dimensionality, wherein the explosion of features brings about new challenges to computational intelligence. An analysis of three popular data repositories has uncovered an exponential increase in the dimensionality of many datasets produced since the early 2000s, and there is much evidence reinforcing the contention that this upward trend of Big Dimensionality will continue, driven by rapid advancements in computing and information technologies and the emerging myriad of feature descriptors. The blessings of Big Dimensionality are also discussed in terms of feature correlation, which serves as a cue to handling this challenge successfully.

Given this growing trend of big dimensionality in modern databases, existing approaches that require pairwise feature correlations in their algorithmic designs fare poorly, since computing the full correlation/covariance matrix (quadratic in the dimensionality, so that a million features translate to a trillion correlation computations) becomes computationally impractical. This poses a notable challenge that has received little attention in machine learning and data mining research. An efficient feature grouping and selection method is therefore proposed to fill this gap, constituting the second work presented in this thesis. Specifically, interesting findings on several established databases with big dimensionality indicate that an extremely small portion of the feature pairs contributes significantly to the underlying interactions, and that there exist highly correlated feature groups, a phenomenon termed "sparse correlation" in this thesis. Inspired by these observations, a novel learning approach, the Group Discovery Machine (GDM), is introduced that exploits the presence of sparse correlations for the efficient identification of informative and correlated feature groups from big dimensional data, reducing the complexity from O(m^2 n) to O(m log m + K_a mn), where K_a << min(m, n) generally holds. In particular, the proposed approach considers an explicit incorporation of linear, nonlinear, or other specific correlation measures as constraints in the learning model.
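To convey the flavor of this complexity reduction, the following is a minimal Python sketch (the function name and the k_a and threshold parameters are hypothetical choices for illustration, not the thesis's actual GDM formulation): rather than forming the full m x m correlation matrix, it ranks features by relevance to the target, picks K_a anchors, and correlates only those anchors against all features.

    import numpy as np

    def sparse_correlation_groups(X, y, k_a=10, threshold=0.8):
        """Illustrative sketch only (hypothetical helper, not the thesis's GDM):
        find correlated feature groups without the full m x m matrix.
        Cost: O(mn) to rank features, O(m log m) to sort, and O(k_a*m*n) to
        correlate k_a anchors against all m features -- mirroring the
        O(m log m + K_a mn) complexity quoted above."""
        n, m = X.shape
        Xc = X - X.mean(axis=0)                     # center each feature once
        norms = np.linalg.norm(Xc, axis=0) + 1e-12  # per-feature norms
        yc = y - y.mean()
        # Rank features by |correlation with the target| and pick k_a anchors.
        relevance = np.abs(Xc.T @ yc) / (norms * (np.linalg.norm(yc) + 1e-12))
        anchors = np.argsort(relevance)[::-1][:k_a]
        groups = []
        for a in anchors:
            # One anchor against all m features: O(mn), never the full m x m.
            corr = (Xc.T @ Xc[:, a]) / (norms * norms[a])
            groups.append(np.flatnonzero(np.abs(corr) >= threshold))
        return anchors, groups

For m = 10^6 features, the full matrix would entail on the order of 10^12 pairwise correlations, whereas a sketch of this kind computes only about (k_a + 1) x m of them.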
An efficient embedded feature selection strategy, designed to filter out the large number of non-contributing correlations that could otherwise confuse the classifier while identifying the correlated and informative feature groups, forms one of the highlights of this approach. Extensive empirical studies on both synthetic and several real-world datasets comprising up to 30 million dimensions are then conducted to assess and showcase the efficacy of the proposed framework. To better characterize its properties, a sensitivity analysis of the key parameters of GDM is also presented, demonstrating the robustness and stability of the framework. The proposed framework is further discussed under different machine learning settings, such as one-class learning, where a notable speedup is observed when solving one-class problems on big dimensional data. Furthermore, to identify robust informative features with minimal sampling bias, the embedding of V-fold cross-validation in the learning model is considered, so as to seek features that exhibit stable and consistent accuracy across multiple data folds. Finally, to illustrate the usefulness of the informative feature groups, the potential benefits of the affiliated features are presented on various real-world datasets.
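As a rough illustration of the V-fold embedding mentioned above (a hypothetical stand-in with an assumed helper name and parameters, not the formulation used in the thesis), one can retain only those features whose relevance ranking is stable across every fold:

    import numpy as np

    def stable_features(X, y, v=5, top=100, seed=0):
        """Sketch: keep features ranked in the top `top` by |correlation
        with y| on all v training splits, screening out features whose
        apparent relevance is an artifact of one particular sample."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(y))
        folds = np.array_split(idx, v)
        keep = None
        for fold in folds:
            train = np.setdiff1d(idx, fold)  # train on all other folds
            Xt = X[train] - X[train].mean(axis=0)
            yt = y[train] - y[train].mean()
            rel = np.abs(Xt.T @ yt) / (np.linalg.norm(Xt, axis=0)
                                       * np.linalg.norm(yt) + 1e-12)
            top_f = set(np.argsort(rel)[::-1][:top])
            keep = top_f if keep is None else keep & top_f
        return sorted(keep)

Features surviving the intersection across all folds are, in this simplified sense, the ones exhibiting the stable, consistent behavior across data folds that the thesis seeks.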