Speech synthesis and quality evaluation

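The objective of this dissertation is to compare objective Speech Quality Assessment (SQA) results for human and synthetic speech, in order to verify whether this method can identify if a recording is human speech. We also tried using speech synthesis and SQA to quantify the performance of a speech recognition task when no original transcript is available. Human speech samples were taken from LibriSpeech, VCC 2018, and AISHELL-3, while synthetic speech was generated with the VITS, ChatTTS, and Tacotron 2 synthesizers. Preprocessing standardized sampling rates and bit depths, after which the audio was transcribed with WhisperX to calculate Word Error Rate (WER) and Character Error Rate (CER). MOSNet, an SQA system, was used to score speech quality. The results show that MOSNet can accurately identify human speech within its training set but struggles to generalize outside it. Despite some correlation between MOSNet predictions and WERs, the results suggest that MOSNet alone cannot reliably assess speech recognition quality. The dissertation also conducted a subjective SQA test with 14 participants to compare human estimations with MOSNet evaluations. The test revealed challenges in distinguishing natural human speech from synthetic counterparts and underscored the importance of factors such as authentic accents and natural delivery in speech evaluations.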
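The record lists word error rate (WER) and character error rate (CER) among its subjects. As a rough illustration of how these two metrics are commonly defined, the sketch below computes them from the Levenshtein edit distance between a reference transcript and a recognizer's output. It is not the thesis's own code: the function names `edit_distance`, `wer`, and `cer` are hypothetical, and no text normalization (casing, punctuation) is applied, which a real evaluation would likely need.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sat on mat"
    print(f"WER = {wer(ref, hyp):.3f}")
    print(f"CER = {cer(ref, hyp):.3f}")
```

Both rates can exceed 1.0 when the hypothesis contains many insertions, since the denominator is the reference length only.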

Bibliographic Details
Main Author:	Jiang, Xiaotong
Other Authors:	Tan Yap Peng
Format:	Thesis-Master by Coursework
Language:	English
Published:	Nanyang Technological University 2024
Subjects:	Computer and Information Science; Engineering; Speech quality assessment (SQA); MOSNet; Human speech; Synthetic speech; WhisperX; Word error rate (WER); Character error rate (CER); Speech synthesis; Speech recognition
Online Access:	https://hdl.handle.net/10356/181485
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-181485
record_format	dspace
spelling	sg-ntu-dr.10356-181485; 2024-12-06T15:49:16Z; Speech synthesis and quality evaluation; Jiang, Xiaotong; Tan Yap Peng; School of Electrical and Electronic Engineering; EYPTan@ntu.edu.sg; Computer and Information Science; Engineering; Speech quality assessment (SQA); MOSNet; Human speech; Synthetic speech; WhisperX; Word error rate (WER); Character error rate (CER); Speech synthesis; Speech recognition; Master's degree; 2024-12-04T05:38:32Z; 2024-12-04T05:38:32Z; 2024; Thesis-Master by Coursework; Jiang, X. (2024). Speech synthesis and quality evaluation. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/181485; en; application/pdf; Nanyang Technological University
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Computer and Information Science; Engineering; Speech quality assessment (SQA); MOSNet; Human speech; Synthetic speech; WhisperX; Word error rate (WER); Character error rate (CER); Speech synthesis; Speech recognition
description	The objective of this dissertation is to compare objective Speech Quality Assessment (SQA) results for human and synthetic speech, in order to verify whether this method can identify if a recording is human speech. We also tried using speech synthesis and SQA to quantify the performance of a speech recognition task when no original transcript is available. Human speech samples were taken from LibriSpeech, VCC 2018, and AISHELL-3, while synthetic speech was generated with the VITS, ChatTTS, and Tacotron 2 synthesizers. Preprocessing standardized sampling rates and bit depths, after which the audio was transcribed with WhisperX to calculate Word Error Rate (WER) and Character Error Rate (CER). MOSNet, an SQA system, was used to score speech quality. The results show that MOSNet can accurately identify human speech within its training set but struggles to generalize outside it. Despite some correlation between MOSNet predictions and WERs, the results suggest that MOSNet alone cannot reliably assess speech recognition quality. The dissertation also conducted a subjective SQA test with 14 participants to compare human estimations with MOSNet evaluations. The test revealed challenges in distinguishing natural human speech from synthetic counterparts and underscored the importance of factors such as authentic accents and natural delivery in speech evaluations.
author2	Tan Yap Peng
author_facet	Tan Yap Peng; Jiang, Xiaotong
format	Thesis-Master by Coursework
author	Jiang, Xiaotong
author_sort	Jiang, Xiaotong
title	Speech synthesis and quality evaluation
publisher	Nanyang Technological University
publishDate	2024
url	https://hdl.handle.net/10356/181485
_version_	1819113005176061952

Similar Items