Domain-agnostic document and question classification using natural language processing techniques

This thesis addresses the classification of documents and questions into domain-agnostic class labels. Domain refers to the subject matter with which the class labels are associated. Domain-specific document or question classification is commonly applied in article categorization or in factoid questi...

Bibliographic Details
Main Author:	Supraja, S.
Other Authors:	Andy Khong W H
Format:	Thesis-Doctor of Philosophy
Language:	English
Published:	Nanyang Technological University 2022
Subjects:	Engineering::Electrical and electronic engineering
Online Access:	https://hdl.handle.net/10356/157159
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-157159
record_format	dspace
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Engineering::Electrical and electronic engineering
spellingShingle	Engineering::Electrical and electronic engineering Supraja, S. Domain-agnostic document and question classification using natural language processing techniques
description	This thesis addresses the classification of documents and questions into domain-agnostic class labels. Domain refers to the subject matter with which the class labels are associated. Domain-specific document or question classification is commonly applied in article categorization or in factoid question answering, with class labels defined by subject matter. For instance, for digital signal processing (DSP) questions, the explicit meaning of the questions is reflected when the domain-specific class labels include Fourier Transform or z-transform. In contrast, applications of domain-agnostic document classification include classifying job descriptions into generic skillsets, scientific statements into section types, and sentences into argumentative zone functions. Because questions possess different characteristics, domain-agnostic question classification is applied in information query or dialogue interactions in which the class labels may comprise question types or reasoning capabilities. To enhance the effectiveness of deliberate practice, questions are classified into their respective cognitive complexities so that instructors can determine learners’ proficiencies. In scenarios where the size of the question bank is limited, statistical approaches are often adopted for feature extraction. Since domain-agnostic classification takes the implicit substance of a text into account (e.g., the learning outcome of the same DSP question irrespective of the content), it relies on a suitable feature extraction process. This thesis explores the use of topic modeling techniques as feature extractors for questions because of their ability to offer linguistic insights into language patterns by grouping associated words into topics and then computing the probabilities of topics occurring in each document.
Considering the limitations of employing baseline topic modeling algorithms for automatic question classification (AQC), an algorithm that observes the effect of pre-processing procedures and word co-occurrence redundancy is proposed. However, this method is dataset-specific and requires hand-curated word tagging. To address these shortcomings, a new holistic, generalizable, regularized phrase-based topic modeling technique is proposed. This technique is driven by the fact that phrases have been shown to be more effective than words in representing questions. Further elements such as nested regular expressions and scaling parameters are employed to facilitate efficient mapping of questions to class labels. For documents, graph networks are adopted as the baseline algorithm. This thesis shows that graph networks are suitable because establishing the relationships between documents is important for classifying them into domain-agnostic categories. In addition, graphs encompass a global perspective, in contrast to conventional deep learning techniques that are both localized and sequential. In the proposed quad-faceted feature-based graph network, this thesis shows that the addition of a new topical layer is vital for observing the impact of topic modeling on generating a meaningful set of features. It also highlights that the use of regular expressions with a domain-agnostic nature is important for co-occurrence statistics, while the meaning of a document encapsulated via phrase nodes is crucial for semantic relationships.
author2	Andy Khong W H
author_facet	Andy Khong W H Supraja, S.
format	Thesis-Doctor of Philosophy
author	Supraja, S.
author_sort	Supraja, S.
title	Domain-agnostic document and question classification using natural language processing techniques
title_short	Domain-agnostic document and question classification using natural language processing techniques
title_full	Domain-agnostic document and question classification using natural language processing techniques
title_fullStr	Domain-agnostic document and question classification using natural language processing techniques
title_full_unstemmed	Domain-agnostic document and question classification using natural language processing techniques
title_sort	domain-agnostic document and question classification using natural language processing techniques
publisher	Nanyang Technological University
publishDate	2022
url	https://hdl.handle.net/10356/157159
_version_	1772829003370987520
spelling	sg-ntu-dr.10356-157159 2023-07-04T17:48:17Z Domain-agnostic document and question classification using natural language processing techniques Supraja, S. Andy Khong W H School of Electrical and Electronic Engineering AndyKhong@ntu.edu.sg Engineering::Electrical and electronic engineering Doctor of Philosophy 2022-05-09T12:20:44Z 2022-05-09T12:20:44Z 2022 Thesis-Doctor of Philosophy Supraja, S. (2022). Domain-agnostic document and question classification using natural language processing techniques. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/157159 https://hdl.handle.net/10356/157159 10.32657/10356/157159 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University

Similar Items