Generating domain-specific paraphrases of questions from FAQ
Main Author: | |
---|---|
Other Authors: | |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2021 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/148155 |
Institution: | Nanyang Technological University |
Summary: | This project introduces a paraphrase generation system that generates domain-specific paraphrases of the questions in Frequently Asked Question (FAQ) corpora. It aims to minimise the cost of generating paraphrases manually and to provide effective data augmentation that complements end-to-end FAQ retrieval based on large language models by reducing overfitting. This is achieved by paying attention to several unique characteristics of FAQ corpora and by using two large language models, an off-domain labelled paraphrase dataset and abbreviation handling. The two language models are T5 and Sentence Transformer. The proposed approach involves pre-processing, paraphrase generation, post-processing and candidate paraphrase selection. Firstly, T5 is fine-tuned on the paraphrase dataset for the task of paraphrase generation. Secondly, abbreviation handling is incorporated into the pre-processing of the original question and the post-processing of the generated paraphrase. Thirdly, the Sentence Transformer library is used for candidate paraphrase selection to ensure that each paraphrase is semantically similar to the original question and that the paraphrase's class label remains valid. Lastly, a GUI application is provided for users to generate paraphrases of questions from an FAQ dataset. From our experiments, we conclude that the pre-processing, post-processing and candidate paraphrase selection are effective in generating paraphrases and then filtering them to output a set of high-quality domain-specific paraphrases for augmenting the FAQ corpora. |
---|
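
The summary names four stages: pre-processing, paraphrase generation, post-processing and candidate paraphrase selection. The sketch below illustrates only the stages around generation, under stated assumptions, and is not the project's implementation. The record does not say how abbreviations are handled, so expanding them before generation and restoring them afterwards is an assumption. The function names, the example abbreviation table and the 0.5 threshold are hypothetical. The token-overlap (Jaccard) score is a deliberately simple stand-in for the embedding similarity that a Sentence Transformer model would supply, and the fine-tuned T5 generation step is omitted.

```python
import re
from typing import Callable, Dict, List

# Hypothetical abbreviation table; a real FAQ domain would supply its own.
ABBREVIATIONS: Dict[str, str] = {"GST": "goods and services tax"}


def expand_abbreviations(question: str, table: Dict[str, str]) -> str:
    """Pre-processing (assumed): replace each abbreviation with its long form."""
    for abbr, long_form in table.items():
        pattern = r"\b" + re.escape(abbr) + r"\b"
        question = re.sub(pattern, lambda _m, s=long_form: s, question)
    return question


def restore_abbreviations(paraphrase: str, table: Dict[str, str]) -> str:
    """Post-processing (assumed): map long forms back to their abbreviations."""
    for abbr, long_form in table.items():
        pattern = r"\b" + re.escape(long_form) + r"\b"
        paraphrase = re.sub(
            pattern, lambda _m, s=abbr: s, paraphrase, flags=re.IGNORECASE
        )
    return paraphrase


def jaccard_similarity(a: str, b: str) -> float:
    """Stand-in score: token overlap, NOT an embedding-based similarity."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def select_candidates(
    original: str,
    candidates: List[str],
    similarity: Callable[[str, str], float] = jaccard_similarity,
    threshold: float = 0.5,  # hypothetical cut-off
) -> List[str]:
    """Keep candidates that differ from the original but score above the threshold."""
    seen = {original.strip().lower()}
    kept: List[str] = []
    for candidate in candidates:
        key = candidate.strip().lower()
        if key in seen:
            continue  # skip exact copies of the original and repeated candidates
        if similarity(original, candidate) >= threshold:
            kept.append(candidate)
            seen.add(key)
    return kept


if __name__ == "__main__":
    original = "How do I claim GST?"
    print(expand_abbreviations(original, ABBREVIATIONS))
    generated = [
        "How can I claim goods and services tax?",
        "What is the weather today?",
    ]
    restored = [restore_abbreviations(p, ABBREVIATIONS) for p in generated]
    print(select_candidates(original, restored))
```

Because select_candidates takes the similarity function as a parameter, an embedding-based cosine similarity could be passed in place of the token-overlap stand-in without changing the filtering logic.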