Generating semantically similar permutations of questions by clustering

With sophisticated machine learning techniques available to the public, many industries have used their own data to solve their own problems, including training chat bots. However, a lack of data is a major concern when trying to train a bot for specific use-cases, such as a university FAQ-answerin...

Bibliographic Details
Main Author:	Famili, Kurniawan Aryanto
Other Authors:	Chng Eng Siong
Format:	Final Year Project
Language:	English
Published:	2018
Subjects:	DRNTU::Engineering
Online Access:	http://hdl.handle.net/10356/74129
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-74129
record_format	dspace
spelling	sg-ntu-dr.10356-74129 2023-03-03T20:33:47Z Generating semantically similar permutations of questions by clustering Famili, Kurniawan Aryanto Chng Eng Siong School of Computer Science and Engineering DRNTU::Engineering
Bachelor of Engineering (Computer Science) 2018-04-29T12:12:01Z 2018-04-29T12:12:01Z 2018 Final Year Project (FYP) http://hdl.handle.net/10356/74129 en Nanyang Technological University 85 p. application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	DRNTU::Engineering
spellingShingle	DRNTU::Engineering Famili, Kurniawan Aryanto Generating semantically similar permutations of questions by clustering
description	With sophisticated machine learning techniques available to the public, many industries have used their own data to solve their own problems, including training chat bots. However, a lack of data is a major concern when trying to train a bot for specific use-cases, such as a university FAQ-answering bot. The researcher proposes a solution that creates more training data by generating permutations of existing questions from the campus FAQ page. The proposed system combines a rule-based approach with a cluster-based approach. The rule-based approach performs part-of-speech tagging on the question, finds synonyms of the applicable words in WordNet, and produces new questions by substituting those synonyms for the original words and restructuring the result according to production rules. The cluster-based approach mines question patterns from existing questions, finds the ones semantically similar to a given question using a clustering algorithm such as K-means or affinity propagation, and generates permutations from those question patterns. An experiment with a small dataset of 30 manually written questions covering 6 topics resulted in an F1 score of 0.561 for both clustering algorithms when paired with sent2vec using a pre-trained model. A web-based user testing experiment required users to ask a question regarding the 6 topics and rate the quality of the generated permutations on a scale of 0 to 3. The overall average score was 1.18/3.00 (39.3%). For the topic with the most questions in the dataset, the average score was 1.92/3.00 (64%). Given a sufficiently large dataset, it is believed that the generator would solve the problem more efficiently and accurately across all topics.
author2	Chng Eng Siong
author_facet	Chng Eng Siong Famili, Kurniawan Aryanto
format	Final Year Project
author	Famili, Kurniawan Aryanto
author_sort	Famili, Kurniawan Aryanto
title	Generating semantically similar permutations of questions by clustering
title_short	Generating semantically similar permutations of questions by clustering
title_full	Generating semantically similar permutations of questions by clustering
title_fullStr	Generating semantically similar permutations of questions by clustering
title_full_unstemmed	Generating semantically similar permutations of questions by clustering
title_sort	generating semantically similar permutations of questions by clustering
publishDate	2018
url	http://hdl.handle.net/10356/74129
_version_	1759856801617018880
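
The synonym-substitution step of the rule-based approach described in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: the `SYNONYMS` table is a hypothetical hand-written stand-in for a WordNet lookup, the function name `generate_permutations` is invented for illustration, and part-of-speech tagging and production-rule restructuring are left out entirely.

```python
from itertools import product

# Hypothetical synonym table standing in for a WordNet lookup (assumption).
# Each key lists itself first so the original wording stays a candidate.
SYNONYMS = {
    "apply": ["apply", "register"],
    "course": ["course", "module"],
}


def generate_permutations(question, synonyms=SYNONYMS):
    """Return variants of `question` made by swapping words for listed synonyms."""
    tokens = question.rstrip("?").split()
    # Each token maps to its candidate replacements (just itself if none listed).
    choices = [synonyms.get(tok.lower(), [tok]) for tok in tokens]
    variants = [" ".join(combo) + "?" for combo in product(*choices)]
    # Drop the variant that is identical to the input question.
    original = " ".join(tokens) + "?"
    return [v for v in variants if v != original]


if __name__ == "__main__":
    for variant in generate_permutations("How do I apply for a course?"):
        print(variant)
```

With two substitutable words that each have one alternative, the sketch yields three variants of the sample question. A real system would presumably restrict substitution by part of speech so that a synonym is only swapped in where it fits grammatically.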
