Generating domain-specific paraphrases of questions from FAQ

This project introduces a paraphrase generation system that generates domain-specific paraphrases of the questions in Frequently Asked Question (FAQ) corpora. It aims to minimise the cost associated with the manual generation of paraphrases and to perform effective data augmentation...

Bibliographic Details
Main Author: Ng, Jing Rui
Other Authors: Chng Eng Siong
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2021
Subjects: Engineering::Computer science and engineering
Online Access: https://hdl.handle.net/10356/148155
Institution: Nanyang Technological University
id sg-ntu-dr.10356-148155
record_format dspace
spelling sg-ntu-dr.10356-148155 2021-04-24T06:23:54Z Generating domain-specific paraphrases of questions from FAQ Ng, Jing Rui Chng Eng Siong School of Computer Science and Engineering ASESChng@ntu.edu.sg Engineering::Computer science and engineering
Bachelor of Engineering (Computer Science) 2021 Final Year Project (FYP) Ng, J. R. (2021). Generating domain-specific paraphrases of questions from FAQ. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/148155 en application/pdf Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering
spellingShingle Engineering::Computer science and engineering
Ng, Jing Rui
Generating domain-specific paraphrases of questions from FAQ
description This project introduces a paraphrase generation system that generates domain-specific paraphrases of the questions in Frequently Asked Question (FAQ) corpora. It aims to minimise the cost associated with the manual generation of paraphrases and to perform effective data augmentation that complements end-to-end FAQ retrieval using large language models by reducing overfitting. This is achieved by paying attention to several unique characteristics of FAQ corpora and through the use of two large language models, an off-domain labelled paraphrase dataset and abbreviation handling. The two language models used are T5 and Sentence Transformer. The proposed approach involves pre-processing, paraphrase generation, post-processing and candidate paraphrase selection. Firstly, T5 is fine-tuned on the paraphrase dataset for the task of paraphrase generation. Secondly, abbreviation handling is incorporated into the pre-processing of the original question and the post-processing of the generated paraphrase. Thirdly, the Sentence Transformer library is used for candidate paraphrase selection to ensure the semantic similarity of the paraphrase to the original question and the integrity of the paraphrase’s class label. Lastly, a GUI application is provided for users to generate paraphrases of questions from an FAQ dataset. From our experiments, we conclude that the pre-processing, post-processing and candidate paraphrase selection are effective in generating paraphrases and in filtering them to output a set of high-quality domain-specific paraphrases for augmenting the FAQ corpora.
author2 Chng Eng Siong
author_facet Chng Eng Siong
Ng, Jing Rui
format Final Year Project
author Ng, Jing Rui
author_sort Ng, Jing Rui
title Generating domain-specific paraphrases of questions from FAQ
title_short Generating domain-specific paraphrases of questions from FAQ
title_full Generating domain-specific paraphrases of questions from FAQ
title_fullStr Generating domain-specific paraphrases of questions from FAQ
title_full_unstemmed Generating domain-specific paraphrases of questions from FAQ
title_sort generating domain-specific paraphrases of questions from faq
publisher Nanyang Technological University
publishDate 2021
url https://hdl.handle.net/10356/148155
_version_ 1698713742665056256
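
The description field above outlines a four-stage approach: pre-processing, paraphrase generation, post-processing and candidate paraphrase selection. The following is a minimal Python sketch of the stages around generation, not the project's actual code. It rests on several assumptions: the abbreviation table is hypothetical, abbreviation handling is read as "expand before generation, restore afterwards" (the record does not say how it works), the candidate paraphrases are hand-written stand-ins for the output of a fine-tuned T5 model, and a toy bag-of-words cosine similarity replaces the Sentence Transformer embeddings. All function names and the 0.5 threshold are illustrative.

```python
import math
import re
from collections import Counter

# Hypothetical abbreviation table; a real FAQ corpus would supply its own.
ABBREVIATIONS = {
    "NTU": "Nanyang Technological University",
    "FYP": "Final Year Project",
}


def expand_abbreviations(text, table=ABBREVIATIONS):
    """Pre-processing (assumed): replace each known abbreviation with its long form."""
    for short, long_form in table.items():
        pattern = rf"\b{re.escape(short)}\b"
        text = re.sub(pattern, lambda _m, lf=long_form: lf, text)
    return text


def restore_abbreviations(text, table=ABBREVIATIONS):
    """Post-processing (assumed): put abbreviations back into a generated paraphrase."""
    for short, long_form in table.items():
        pattern = re.escape(long_form)
        text = re.sub(pattern, lambda _m, s=short: s, text, flags=re.IGNORECASE)
    return text


def embed(text):
    """Toy bag-of-words vector standing in for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_candidates(question, candidates, threshold=0.5):
    """Keep candidates that are close to the question but not a copy of it."""
    q_vec = embed(question)
    kept = []
    for cand in candidates:
        if cand.strip().lower() == question.strip().lower():
            continue  # an exact copy adds nothing to the augmented corpus
        if cosine(q_vec, embed(cand)) >= threshold:
            kept.append(cand)
    return kept


question = "How do I submit my FYP at NTU?"
expanded = expand_abbreviations(question)

# Hand-written stand-ins for what a paraphrase generator might return.
generated = [
    "How can I submit my final year project at Nanyang Technological University?",
    "What is the weather today?",
    expanded,
]

restored = [restore_abbreviations(g) for g in generated]
selected = select_candidates(question, restored)
print(selected)
```

In this sketch the unrelated candidate falls below the similarity threshold and the unchanged copy of the question is skipped, so only the genuine rewording survives. Replacing `embed` and `cosine` with real sentence embeddings would change the similarity scores, so the threshold would need to be re-tuned.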