Speech synthesis and quality evaluation

Bibliographic Details
Main Author:	Jiang, Xiaotong
Other Authors:	Tan Yap Peng
Format:	Thesis-Master by Coursework
Language:	English
Published:	Nanyang Technological University 2024
Subjects:	Computer and Information Science; Engineering; Speech quality assessment (SQA); MOSNet; Human speech; Synthetic speech; WhisperX; Word error rate (WER); Character error rate (CER); Speech synthesis; Speech recognition
Online Access:	https://hdl.handle.net/10356/181485
Institution:	Nanyang Technological University

Description
Summary:	The objective of this dissertation is to compare the results of objective Speech Quality Assessment (SQA) on human and synthetic speech, in order to verify the feasibility of using this method to identify whether a speech sample is human-recorded. We also tried using speech synthesis and SQA to quantify the performance of a speech recognition task without the original transcript. Human speech samples were taken from LibriSpeech, VCC 2018, and AISHELL-3, while synthetic speech was generated by the synthesizers VITS, ChatTTS, and Tacotron 2. Preprocessing involved standardizing sampling rates and bit depths, followed by transcription with WhisperX to calculate Word Error Rate (WER) and Character Error Rate (CER). MOSNet, an SQA system, was implemented to score speech quality, with results showing that MOSNet can accurately identify human speech within its training set but struggles to generalize outside it. Despite some correlation between MOSNet predictions and WERs, the results suggest that MOSNet alone cannot reliably assess speech recognition quality. The dissertation also conducted a subjective SQA test with 14 participants to compare human estimations with MOSNet evaluations, revealing challenges in distinguishing natural human speech from synthetic counterparts and underscoring the importance of factors such as authentic accents and natural delivery in speech evaluations.
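
The summary names WER and CER but does not say how they were computed. As a minimal sketch, assuming the standard definition (Levenshtein edit distance between reference and hypothesis, divided by the reference length) and no text normalization, the two metrics could be written as below. The function names `edit_distance`, `wer`, and `cer` are hypothetical and are not taken from the dissertation or from WhisperX.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Illustrative strings only; these are not samples from the datasets.
    print(wer("the cat sat on the mat", "the cat sit on mat"))
    print(cer("abc", "abd"))
```

Both rates can exceed 1.0 when the hypothesis contains many insertions, which is one reason transcripts are usually normalized (case, punctuation) before scoring; that normalization step is left out of this sketch.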