Evaluation of semi-supervised classification algorithms with deep contextualizes document representations

Automatic text classification is one of the major research topics in the field of text mining and has a variety of applications, including sentiment analysis, spam filtering, and web page categorization. Automatic text classification tasks face two major problems: labelled training data are in...

Bibliographic Details
Main Author:	Yong, Hao
Other Authors:	Joty Shafiq Rayhan
Format:	Final Year Project
Language:	English
Published:	Nanyang Technological University 2021
Subjects:	Engineering::Computer science and engineering::Information systems::Information storage and retrieval
Online Access:	https://hdl.handle.net/10356/147954

id	sg-ntu-dr.10356-147954
record_format	dspace
spelling	sg-ntu-dr.10356-147954 2021-04-20T07:39:08Z Evaluation of semi-supervised classification algorithms with deep contextualizes document representations Yong, Hao Joty Shafiq Rayhan Sun Aixin School of Computer Science and Engineering AXSun@ntu.edu.sg, srjoty@ntu.edu.sg Engineering::Computer science and engineering::Information systems::Information storage and retrieval Bachelor of Engineering (Computer Science) 2021-04-20T07:39:08Z 2021-04-20T07:39:08Z 2021 Final Year Project (FYP) Yong, H. (2021). Evaluation of semi-supervised classification algorithms with deep contextualizes document representations. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/147954 https://hdl.handle.net/10356/147954 en SCSE20-0249 application/pdf Nanyang Technological University
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Engineering::Computer science and engineering::Information systems::Information storage and retrieval
spellingShingle	Engineering::Computer science and engineering::Information systems::Information storage and retrieval Yong, Hao Evaluation of semi-supervised classification algorithms with deep contextualizes document representations
description	Automatic text classification is one of the major research topics in text mining and has a variety of applications, including sentiment analysis, spam filtering, and web page categorization. Automatic text classification tasks face two major problems. First, labelled training data are scarce and hard to acquire, while unlabelled data are abundant. Second, unstructured source texts in diverse formats must be embedded into structured, fixed-length vector representations while preserving semantic relations and high-level concepts between words. Co-training is a prominent solution to the former problem: the labelled training data are replenished with the most confident predictions. However, co-training requires two sufficient and redundant views of the same training data, which might not be available in real-life cases. In 2005, Zhou et al. proposed tri-training, a semi-supervised learning algorithm that extends co-training, which inspired us to investigate further. In this project, we therefore conduct a systematic evaluation of tri-training, a semi-supervised text classification algorithm that automatically labels unlabelled data in each training iteration to refine its classifiers and does not assume multiple sufficient and redundant views, along with traditional and recent distributed document representations (TF-IDF, doc2vec, BERT, ELMo, Universal Sentence Encoder, SkipThoughts, InferSent, GenSen). In the designed experiments, we compare the performance of tri-training with that of its semi-supervised learning counterparts, self-training and co-training. Then, using these results as the new baseline, we evaluate the performance gain from expanding the redundancy of the training data by providing each classifier in tri-training with different representations. In addition, various conventional classifiers were adopted and evaluated, including Naïve Bayes, Support Vector Machine, Random Forest, Multi-layer Perceptron, and XGBoost.
author2	Joty Shafiq Rayhan
author_facet	Joty Shafiq Rayhan Yong, Hao
format	Final Year Project
author	Yong, Hao
author_sort	Yong, Hao
title	Evaluation of semi-supervised classification algorithms with deep contextualizes document representations
title_short	Evaluation of semi-supervised classification algorithms with deep contextualizes document representations
title_full	Evaluation of semi-supervised classification algorithms with deep contextualizes document representations
title_fullStr	Evaluation of semi-supervised classification algorithms with deep contextualizes document representations
title_full_unstemmed	Evaluation of semi-supervised classification algorithms with deep contextualizes document representations
title_sort	evaluation of semi-supervised classification algorithms with deep contextualizes document representations
publisher	Nanyang Technological University
publishDate	2021
url	https://hdl.handle.net/10356/147954
_version_	1698713686828384256
