Dissimilarity-based semi-supervised subset selection



Bibliographic Details
Main Author: Lei, Yiran
Other Authors: Tan Yap Peng
Format: Thesis-Master by Coursework
Language: English
Published: Nanyang Technological University 2020
Online Access: https://hdl.handle.net/10356/140899
Institution: Nanyang Technological University
Description
Abstract: Extracting useful information from large-scale data is a major challenge in the era of big data. As an effective means of information filtering and data summarization, subset selection picks the most informative subset of a large data set to represent the whole, reducing the amount of data that needs to be processed. This thesis proposes a dissimilarity-based semi-supervised subset selection method. First, the subset selection problem is formulated as a convex optimization problem with regularization. The desired subset is modeled as an unknown sparse matrix whose non-zero rows indicate the source-set elements that represent the target set. An alternating optimization method is then used to solve the Lagrangian form of the objective function. To exploit the information contained in the sample labels, a semi-supervised algorithm is proposed that combines unsupervised clustering with supervised judgement of representatives. The iterative process then updates the distribution of representatives based on the overall correlation coefficients of each category of the target set. Finally, the optimal matrix and the representatives are output.
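
The unsupervised core described in the abstract (a regularized convex program over a row-sparse matrix, solved by alternating updates on a Lagrangian) can be illustrated with a minimal sketch. It is not the thesis's implementation. The objective (dissimilarity cost plus a row-wise L2 penalty, with non-negative columns summing to 1), the ADMM splitting, and every name below (`select_representatives`, `shrink_rows`, `project_columns_to_simplex`, `lam`, `mu`, `tol`) are assumptions borrowed from a common formulation of dissimilarity-based sparse subset selection. The semi-supervised label step is not covered.

```python
import numpy as np


def project_columns_to_simplex(V):
    """Euclidean projection of every column of V onto the probability simplex."""
    n, m = V.shape
    U = -np.sort(-V, axis=0)                    # each column sorted in descending order
    css = np.cumsum(U, axis=0) - 1.0
    k = np.arange(1, n + 1)[:, None]
    rho = (U - css / k > 0).sum(axis=0)         # support size of each projected column
    theta = css[rho - 1, np.arange(m)] / rho    # per-column shift
    return np.maximum(V - theta, 0.0)


def shrink_rows(A, t):
    """Row-wise group soft-thresholding: prox of t * sum_i ||A[i, :]||_2."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    return np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12)) * A


def select_representatives(D, lam, mu=1.0, n_iter=500, tol=0.1):
    """ADMM sketch for an ASSUMED objective:
    min <D, Z> + lam * sum_i ||Z[i, :]||_2, with Z >= 0 and columns summing to 1."""
    C = np.full(D.shape, 1.0 / D.shape[0])      # feasible start: uniform columns
    U = np.zeros_like(D)                        # scaled dual variable
    for _ in range(n_iter):
        Z = shrink_rows(C - U, lam / mu)                  # row-sparsity step
        C = project_columns_to_simplex(Z + U - D / mu)    # cost and constraint step
        U += Z - C                                        # dual update
    row_strength = np.linalg.norm(C, axis=1)
    # Rows that carry a noticeable share of the assignment are the representatives.
    reps = np.flatnonzero(row_strength > tol * row_strength.max())
    return C, reps


# Hypothetical toy data: two separated 2-D clusters, source set equal to target set.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (5, 2)), rng.normal(3.0, 0.1, (5, 2))])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise dissimilarities
C, reps = select_representatives(D, lam=1.0)
print("representative indices:", reps)
```

In this sketch, `D[i, j]` is the dissimilarity between source item `i` and target item `j`, and a larger `lam` pushes more rows toward zero, which should leave fewer representatives. The relative threshold `tol` is a heuristic choice for reading representatives off the matrix, not something stated in the abstract.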