A domain specific virtual assistant using paraphrase generation for data augmentation and sentence transformers on limited data

Bibliographic Details
Main Author: Roque, Matthew Theodore C.
Format: text
Language: English
Published: Animo Repository 2023
Online Access: https://animorepository.dlsu.edu.ph/etdm_ece/28
https://animorepository.dlsu.edu.ph/context/etdm_ece/article/1028/viewcontent/A_Domain2_Specific_Virtual_Assistant_Using_Paraphrase_Generation_f.pdf
Institution: De La Salle University
Description
Summary: Conversational agents can be highly beneficial in settings such as government offices, schools, banks, and malls, where people often make inquiries and responses from personnel can take some time. Many of these settings, however, involve inquiries with domain-specific vocabulary, and they most likely lack the large amount of data or the computational resources needed to properly train a complex natural language processing (NLP) model. This paper proposes a method for creating a domain-specific virtual assistant that uses Generative Pre-trained Transformer 3 (GPT-3) to generate paraphrases from a relatively small dataset, together with a Sentence Transformer (SBERT) model built on a distilled version of BERT (DistilBERT), pretrained on the Quora Question Pairs dataset and fine-tuned on the augmented dataset. The method is evaluated on the MS MARCO, SemEval, and PubMed datasets using mean average precision (MAP), precision at k (P@k), normalized discounted cumulative gain (NDCG), and mean reciprocal rank (MRR) as performance metrics. The method was also demonstrated on a small dataset of 188 frequently asked questions from the De La Salle University website, which likewise includes domain-specific vocabulary. The implementation of the fine-tuned model was demonstrated on a simple webpage, and the results were found to be satisfactory.
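
Illustrative note: the four ranking metrics named in the summary can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the evaluation code used in the thesis. It assumes each query's results are given as a list of relevance labels in ranked order (0 means not relevant), that every relevant item appears somewhere in that list, and that NDCG uses a linear gain with a log2 discount. All function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def precision_at_k(rels: Sequence[int], k: int) -> float:
    """Fraction of the top-k ranked results that are relevant."""
    return sum(1 for r in rels[:k] if r > 0) / k


def average_precision(rels: Sequence[int]) -> float:
    """Mean of precision values taken at each relevant position.

    Assumption: all relevant items are present in `rels`, so the
    number of hits in the list is used as the normalizer.
    """
    hits = 0
    total = 0.0
    for rank, r in enumerate(rels, start=1):
        if r > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank(rels: Sequence[int]) -> float:
    """1 / rank of the first relevant result, or 0 if there is none."""
    for rank, r in enumerate(rels, start=1):
        if r > 0:
            return 1.0 / rank
    return 0.0


def dcg_at_k(rels: Sequence[int], k: int) -> float:
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(r / math.log2(rank + 1)
               for rank, r in enumerate(rels[:k], start=1))


def ndcg_at_k(rels: Sequence[int], k: int) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0


def mean_over_queries(metric: Callable[[Sequence[int]], float],
                      runs: List[Sequence[int]]) -> float:
    """Average a per-query metric over all queries (MAP, MRR, ...)."""
    return sum(metric(r) for r in runs) / len(runs)


# Two made-up queries with binary relevance labels in ranked order.
runs = [[0, 1, 1, 0], [1, 0, 0, 0]]
map_score = mean_over_queries(average_precision, runs)
mrr_score = mean_over_queries(reciprocal_rank, runs)
```

In an FAQ-retrieval setting like the one described, each list would come from ranking the stored questions by similarity to a user query and marking which retrieved entries are correct matches.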