Exploring the use of pre-trained transformer-based models and semi-supervised learning to build training sets for text classification

Data annotation is the process of labeling text, images, or other types of content for machine learning tasks. With the rise in popularity of machine learning for classification tasks, large amounts of labeled data are typically needed to train effective models across different algorithms and architectures. Data annotation is a critical step in developing these models and, while there is an abundance of unlabeled data being generated every day, annotation is often a laborious and costly process. Furthermore, low-resource languages such as Filipino do not have as many readily available datasets as mainstream languages that can be leveraged to fine-tune existing models pre-trained on large amounts of data. In this study, we explored the use of BERT and semi-supervised learning for textual data to see how they might ease the burden of human annotation when building text classification training sets and, at the same time, reduce the amount of manually labeled data needed to fine-tune a pre-trained model for a specific downstream text classification task. We then analyzed relevant factors that may affect pseudo-labeling performance, and compared the accuracy scores of different non-BERT classifiers trained on the same samples with solely human-labeled data versus a counterpart composed of a mixture of human-labeled and pseudo-labeled data produced by semi-supervised learning.
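As a rough illustration of the pseudo-labeling idea summarized in the abstract, the sketch below shows only the confidence-filtering step: a transformer classifier scores unlabeled texts, and predictions above a threshold are kept as pseudo-labels to mix with human-labeled data. The model name, threshold, and placeholder texts are illustrative assumptions, not the setup used in the thesis; in the described workflow the classifier would first be fine-tuned on a small human-labeled subset.

```python
# Minimal pseudo-labeling sketch (illustrative, not the thesis' exact method).
# Assumptions: model name, confidence threshold, and placeholder texts.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-multilingual-cased"  # assumed BERT-style checkpoint
THRESHOLD = 0.95                             # assumed confidence cutoff

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# In practice this model would already be fine-tuned on the small
# human-labeled set before it is used to pseudo-label anything.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

unlabeled_texts = ["sample sentence one", "sample sentence two"]  # placeholders

pseudo_labeled = []
with torch.no_grad():
    for text in unlabeled_texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
        confidence, label = probs.max(dim=-1)
        if confidence.item() >= THRESHOLD:
            # Keep only confident predictions; these would be mixed with the
            # human-labeled samples to train the downstream classifiers.
            pseudo_labeled.append((text, label.item()))

print(pseudo_labeled)
```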

Bibliographic Details
Main Author: Te, Gian Marco I.
Format: text
Language: English
Published: Animo Repository 2022
Subjects: Supervised learning (Machine learning); Natural language processing (Computer science); Computer Sciences
Series: Software Technology Master's Theses
Online Access:https://animorepository.dlsu.edu.ph/etdm_softtech/6
https://animorepository.dlsu.edu.ph/cgi/viewcontent.cgi?article=1005&context=etdm_softtech
Institution: De La Salle University
Language: English
institution De La Salle University
building De La Salle University Library
continent Asia
country Philippines
content_provider De La Salle University Library
collection DLSU Institutional Repository
language English
topic Supervised learning (Machine learning)
Natural language processing (Computer science)
Computer Sciences