Evaluation of semi-supervised classification algorithms with deep contextualizes document representations

Automatic text classification is one of the major research topics in text mining and has a variety of applications, including sentiment analysis, spam filtering, and web page categorization. Automatic text classification tasks face two major problems: labelled training data are in...

Bibliographic Details
Main Author: Yong, Hao
Other Authors: Joty Shafiq Rayhan
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2021
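Subjects: Engineering::Computer science and engineering::Information systems::Information storage and retrieval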
Online Access: https://hdl.handle.net/10356/147954
Institution: Nanyang Technological University
id sg-ntu-dr.10356-147954
record_format dspace
spelling sg-ntu-dr.10356-147954 2021-04-20T07:39:08Z Evaluation of semi-supervised classification algorithms with deep contextualizes document representations Yong, Hao Joty Shafiq Rayhan Sun Aixin School of Computer Science and Engineering AXSun@ntu.edu.sg, srjoty@ntu.edu.sg Engineering::Computer science and engineering::Information systems::Information storage and retrieval Bachelor of Engineering (Computer Science) 2021-04-20T07:39:08Z 2021-04-20T07:39:08Z 2021 Final Year Project (FYP) Yong, H. (2021). Evaluation of semi-supervised classification algorithms with deep contextualizes document representations. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/147954 en SCSE20-0249 application/pdf Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering::Information systems::Information storage and retrieval
spellingShingle Engineering::Computer science and engineering::Information systems::Information storage and retrieval
Yong, Hao
Evaluation of semi-supervised classification algorithms with deep contextualizes document representations
description Automatic text classification is one of the major research topics in text mining and has a variety of applications, including sentiment analysis, spam filtering, and web page categorization. Automatic text classification tasks face two major problems. First, labelled training data are insufficient and hard to acquire, while unlabelled data are available in abundance. Second, unstructured source texts in diverse formats must be embedded into structured fixed-length vector representations while preserving semantic relations and high-level concepts between words. Co-training is a prominent solution to the former problem, as the labelled training data are replenished with the most confident predictions. However, co-training requires two sufficient and redundant views of the same training data, which might not be available in real-life cases. In 2005, Zhou et al. proposed tri-training, a semi-supervised learning algorithm that extends co-training, and it inspired this investigation. Thus, in this project, we conduct a systematic evaluation of tri-training, a semi-supervised text classification algorithm that automatically labels unlabelled data in each training iteration to refine its classifiers and does not assume multiple sufficient and redundant views, along with traditional and recent distributed document representations (TF-IDF, doc2vec, BERT, ELMo, Universal Sentence Encoder, SkipThoughts, InferSent, GenSen). In the designed experiments, we compare the performance of tri-training with that of its semi-supervised learning counterparts, self-training and co-training. Then, using the results as the new baseline, we evaluate the performance gain from expanding the redundancy of the training data by providing each classifier of tri-training with a different representation. In addition, various conventional classifiers were adopted and evaluated, including Naïve Bayes, Support Vector Machine, Random Forest, Multi-layer Perceptron, and XGBoost.
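The tri-training idea named in the description can be illustrated with a minimal Python sketch. It is a simplified illustration under stated assumptions, not the project's implementation: the names `NearestCentroid`, `tri_train`, and `predict_majority` are hypothetical, the base learner is a toy nearest-centroid classifier standing in for the classifiers listed above, and the error-rate checks that the published algorithm uses to decide whether pseudo-labels may be added are omitted.

```python
import random
from collections import Counter


class NearestCentroid:
    """Toy stand-in base learner: predicts the class whose mean is closest."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for j, value in enumerate(x):
                acc[j] += value
        self.centroids = {c: [s / counts[c] for s in acc] for c, acc in sums.items()}
        return self

    def predict(self, x):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[c]))
        return min(self.centroids, key=sq_dist)


def tri_train(X_lab, y_lab, X_unl, rounds=3, seed=0):
    """Simplified tri-training (assumption: no error-rate conditions)."""
    rng = random.Random(seed)
    n = len(X_lab)
    # Diversity comes from bootstrap samples of the labelled set,
    # so no separate "views" of the data are required.
    samples = []
    for _ in range(3):
        idx = [rng.randrange(n) for _ in range(n)]
        samples.append(([X_lab[i] for i in idx], [y_lab[i] for i in idx]))
    models = [NearestCentroid().fit(X, y) for X, y in samples]

    for _ in range(rounds):
        updated = []
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            X_i, y_i = list(samples[i][0]), list(samples[i][1])
            # An unlabelled example is pseudo-labelled for classifier i
            # only when the other two classifiers agree on its label.
            for x in X_unl:
                label = models[j].predict(x)
                if label == models[k].predict(x):
                    X_i.append(x)
                    y_i.append(label)
            updated.append(NearestCentroid().fit(X_i, y_i))
        models = updated
    return models


def predict_majority(models, x):
    """Final prediction: majority vote of the three classifiers."""
    votes = Counter(m.predict(x) for m in models)
    return votes.most_common(1)[0][0]
```

In a text classification setting, each feature vector would be a document representation such as a TF-IDF or sentence-encoder embedding rather than the low-dimensional points this toy learner is suited to.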
author2 Joty Shafiq Rayhan
author_facet Joty Shafiq Rayhan
Yong, Hao
format Final Year Project
author Yong, Hao
author_sort Yong, Hao
title Evaluation of semi-supervised classification algorithms with deep contextualizes document representations
publisher Nanyang Technological University
publishDate 2021
url https://hdl.handle.net/10356/147954