Evaluation of semi-supervised classification algorithms with deep contextualizes document representations
Automatic text classification is one of the major research topics in the field of text mining and has a variety of applications, including sentiment analysis, spam filtering, and web page categorization, etc. Automatic text classification tasks face two major problems: labelled training data are in...
Main Author: | Yong, Hao |
---|---|
Other Authors: | Joty Shafiq Rayhan |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University 2021 |
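Subjects: | Engineering::Computer science and engineering::Information systems::Information storage and retrieval |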
Online Access: | https://hdl.handle.net/10356/147954 |
Institution: | Nanyang Technological University |
id | sg-ntu-dr.10356-147954 |
---|---|
record_format | dspace |
spelling | sg-ntu-dr.10356-147954; 2021-04-20T07:39:08Z; Evaluation of semi-supervised classification algorithms with deep contextualizes document representations; Yong, Hao; Joty Shafiq Rayhan; Sun Aixin; School of Computer Science and Engineering; AXSun@ntu.edu.sg, srjoty@ntu.edu.sg; Engineering::Computer science and engineering::Information systems::Information storage and retrieval; Bachelor of Engineering (Computer Science); 2021-04-20T07:39:08Z; 2021; Final Year Project (FYP); Yong, H. (2021). Evaluation of semi-supervised classification algorithms with deep contextualizes document representations. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/147954; en; SCSE20-0249; application/pdf; Nanyang Technological University |
institution | Nanyang Technological University |
building | NTU Library |
continent | Asia |
country | Singapore |
content_provider | NTU Library |
collection | DR-NTU |
language | English |
topic | Engineering::Computer science and engineering::Information systems::Information storage and retrieval |
spellingShingle | Engineering::Computer science and engineering::Information systems::Information storage and retrieval; Yong, Hao; Evaluation of semi-supervised classification algorithms with deep contextualizes document representations |
description | Automatic text classification is one of the major research topics in text mining and has a variety of applications, including sentiment analysis, spam filtering, and web page categorization. Automatic text classification tasks face two major problems. First, labelled training data are insufficient and hard to acquire, while unlabelled data are available in abundance. Second, unstructured source texts in diverse formats must be embedded into structured, fixed-length vector representations that preserve the semantic relations and high-level concepts between words. Co-training is a prominent solution to the former problem: the labelled training data are replenished with the classifiers' most confident predictions. However, co-training requires two sufficient and redundant views of the same training data, which might not be available in real-life cases. In 2005, Zhou et al. proposed tri-training, a semi-supervised learning algorithm that extends co-training, and it inspired this investigation. In this project, we conduct a systematic evaluation of tri-training, a semi-supervised text classification algorithm that automatically labels unlabelled data in each training iteration to refine its classifiers and does not assume multiple sufficient and redundant views, in combination with traditional and recent distributed document representations (TFIDF, doc2vec, BERT, ELMo, Universal Sentence Encoder, SkipThoughts, InferSent, GenSen). In the designed experiments, we compare the performance of tri-training with that of its semi-supervised learning counterparts, self-training and co-training. Then, using the results as the new baseline, we evaluate the performance gain from expanding the redundancy of the training data by providing each classifier of tri-training with a different representation. In addition, various conventional classifiers were adopted and evaluated, including Naïve Bayes, Support Vector Machine, Random Forest, Multi-layer Perceptron, and XGBoost. |
author2 | Joty Shafiq Rayhan |
author_facet | Joty Shafiq Rayhan; Yong, Hao |
format | Final Year Project |
author | Yong, Hao |
author_sort | Yong, Hao |
title | Evaluation of semi-supervised classification algorithms with deep contextualizes document representations |
publisher | Nanyang Technological University |
publishDate | 2021 |
url | https://hdl.handle.net/10356/147954 |
_version_ | 1698713686828384256 |
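
The tri-training idea summarized in the description (three classifiers, each refined with unlabelled examples on which the other two agree) can be illustrated with a minimal sketch. This is not the project's code. It is a simplified illustration that omits the error-rate checks and pseudo-label subsampling of the algorithm by Zhou et al. The `NearestCentroid` class is a hypothetical stand-in for the conventional classifiers the project evaluates, and the functions `tri_train` and `vote` are names invented here. Inputs are assumed to be plain lists of fixed-length numeric vectors, standing in for document representations.

```python
import random
from collections import Counter


class NearestCentroid:
    """Hypothetical stand-in classifier: predicts the class with the closest mean vector."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for j, value in enumerate(x):
                acc[j] += value
        self.centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict(self, X):
        def dist2(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            min(self.centroids, key=lambda c: dist2(x, self.centroids[c]))
            for x in X
        ]


def tri_train(X_lab, y_lab, X_unlab, rounds=5, seed=0):
    """Simplified tri-training: no error-rate conditions (an assumption of this sketch)."""
    rng = random.Random(seed)
    X_lab, y_lab = list(X_lab), list(y_lab)
    n = len(X_lab)

    # Each classifier starts from its own bootstrap sample of the labelled data.
    models = []
    for _ in range(3):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(
            NearestCentroid().fit([X_lab[i] for i in idx], [y_lab[i] for i in idx])
        )

    for _ in range(rounds):
        preds = [m.predict(X_unlab) for m in models]
        new_models = []
        for i in range(3):
            j, k = [t for t in range(3) if t != i]
            # Unlabelled examples on which the other two classifiers agree
            # become pseudo-labelled training data for classifier i.
            agreed = [
                (x, pj)
                for x, pj, pk in zip(X_unlab, preds[j], preds[k])
                if pj == pk
            ]
            X_aug = X_lab + [x for x, _ in agreed]
            y_aug = y_lab + [label for _, label in agreed]
            new_models.append(NearestCentroid().fit(X_aug, y_aug))
        models = new_models
    return models


def vote(models, X):
    """Final prediction by majority vote over the three classifiers."""
    per_example = zip(*(m.predict(X) for m in models))
    return [Counter(v).most_common(1)[0][0] for v in per_example]
```

Calling `tri_train(X_lab, y_lab, X_unlab)` returns the three refined classifiers, and `vote(models, X_new)` combines them. Giving each of the three classifiers a different document representation, as the description proposes, would mean passing a separate feature matrix per classifier, which this sketch does not attempt.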