Generating semantically similar permutations of questions by clustering
Main Author: | |
---|---|
Other Authors: | |
Format: | Final Year Project |
Language: | English |
Published: | 2018 |
Subjects: | |
Online Access: | http://hdl.handle.net/10356/74129 |
Institution: | Nanyang Technological University |
Summary: | With sophisticated machine learning techniques available to the public, many
industries have used their own data to solve their own problems, including training
chat bots. However, a lack of data is a major concern when training a bot
for a specific use case, such as a university FAQ-answering bot.
The researcher proposes a solution that creates more training data by generating
permutations of existing questions from the campus FAQ page. The
proposed system combines a rule-based approach with a cluster-based approach.
The rule-based approach is straightforward: it performs part-of-speech
tagging on the question, finds synonyms of the applicable words in WordNet,
and produces new questions by replacing the original words with those synonyms and
restructuring the result based on production rules.
The cluster-based approach mines question patterns from existing
questions, finds the ones semantically similar to a given question using a
clustering algorithm such as K-means or affinity propagation, and generates
permutations from those question patterns. An experiment with a small dataset of
30 manually written questions covering 6 topics resulted in an F1 score of 0.561
for both clustering algorithms when paired with sent2vec using a pre-trained model.
In a web-based user test, participants asked questions about
6 topics and rated the quality of the generated permutations on a scale of 0 to 3.
The overall average score was 1.18/3.00 (39.3%). For the topic with
the most questions in the dataset, the average score was 1.92/3.00 (64%). Given a
sufficiently large dataset, it is believed that the generator would
solve the problem more efficiently and accurately across all topics. |
---|
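The rule-based approach described in the summary (tag parts of speech, look up synonyms, substitute) can be illustrated with a minimal sketch. This is not the project's implementation: `TOY_POS` and `TOY_SYNONYMS` are hypothetical hand-written stand-ins for a real part-of-speech tagger and for WordNet, and the restructuring step based on production rules is left out.

```python
# Hypothetical stand-in for a part-of-speech tagger (assumption, not from the record).
TOY_POS = {
    "how": "WRB", "can": "MD", "i": "PRP", "pay": "VB",
    "the": "DT", "tuition": "NN", "fee": "NN",
}

# Hypothetical stand-in for WordNet lookups, keyed by (word, POS tag).
TOY_SYNONYMS = {
    ("pay", "VB"): ["settle"],
    ("fee", "NN"): ["charge"],
}


def tag(tokens):
    """Attach a POS tag to each token; unknown words get the tag 'UNK'."""
    return [(tok, TOY_POS.get(tok.lower(), "UNK")) for tok in tokens]


def synonym_permutations(question):
    """Return new questions, each with exactly one word swapped for a synonym."""
    tokens = question.rstrip("?").split()
    results = []
    for i, (word, pos) in enumerate(tag(tokens)):
        for synonym in TOY_SYNONYMS.get((word.lower(), pos), []):
            swapped = tokens[:i] + [synonym] + tokens[i + 1:]
            results.append(" ".join(swapped) + "?")
    return results


permutations = synonym_permutations("How can I pay the tuition fee?")
for p in permutations:
    print(p)
```

Keying the synonym table on the word together with its tag mirrors the reason for tagging first: a synonym is only substituted when it matches the part of speech the word has in the question.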
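The cluster-based approach described in the summary can be sketched in a similar way. The example below is an illustration under stated assumptions, not the project's code: the two-dimensional vectors are hypothetical toy values standing in for sent2vec embeddings, the `<fee>` and `<event>` slot notation for question patterns is invented here, and K-means is initialised from the first `k` points only to keep the run deterministic.

```python
import math

# Hypothetical question patterns with toy 2-D vectors in place of sent2vec embeddings.
PATTERNS = [
    ("How do I pay <fee>?", (0.9, 0.1)),
    ("When does <event> start?", (0.1, 0.9)),
    ("Where can I pay <fee>?", (0.8, 0.2)),
    ("What time is <event>?", (0.2, 0.8)),
]


def kmeans(points, k, iterations=10):
    """Plain K-means; centroids start at the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return centroids, labels


def similar_patterns(query_vector, centroids, labels):
    """Return the patterns in the cluster whose centroid is nearest the query."""
    nearest = min(
        range(len(centroids)),
        key=lambda c: math.dist(query_vector, centroids[c]),
    )
    return [text for (text, _), lab in zip(PATTERNS, labels) if lab == nearest]


vectors = [vec for _, vec in PATTERNS]
centroids, labels = kmeans(vectors, k=2)

# A new question whose (toy) embedding lies near the "pay" patterns.
matches = similar_patterns((0.85, 0.2), centroids, labels)
generated = [m.replace("<fee>", "the hostel fee") for m in matches]
for g in generated:
    print(g)
```

Filling the slot of every pattern in the matched cluster is what turns one incoming question into several permutations; how the real system mines and fills its patterns is not stated in this record.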