Voice conversion with parallel/non-parallel data and synthetic speech detection

The objective of voice conversion is to transform a source speaker's voice so that it sounds like that of a target speaker. Voice conversion belongs to the popular area of personalized speech generation. On the one hand, it can be applied to problems such as emotion conversion, improving the intelligibility of speech, or converting whisper or murmur into speech. On the other hand, voice conversion also presents a threat to automatic speaker verification (ASV) systems. Synthetic speech detection discriminates between live and synthetic speech, and so provides a feasible way to improve the robustness of ASV systems and protect them. This thesis focuses on two aspects: improving voice conversion performance and studying countermeasures for synthetic speech detection.

A typical voice conversion system requires parallel data, i.e. source and target utterances with the same linguistic content, for conversion model training. As parallel data is often difficult to collect in practice, the challenge is to estimate a robust conversion function from limited parallel data, or even from non-parallel data. In this thesis, we propose two novel voice conversion methods to improve system performance for both parallel-data and non-parallel-data voice conversion. First, we propose a parallel-data voice conversion framework that combines frequency warping and exemplar-based methods. Under this exemplar-based frequency warping (EFW) framework, each warping function and spectral residual is generated from a dictionary of warping-function exemplars and a dictionary of spectral-residual exemplars, respectively. With a sparsity constraint, we avoid the statistical averaging effect of Gaussian mixture models (GMMs) and obtain more accurate warping functions and residual compensation. Experiments on the VOICES database show that the proposed method significantly improves speech quality compared with state-of-the-art parametric methods.
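
To make the exemplar-based idea concrete, the sketch below estimates non-negative sparse activations of a source spectral frame against a source exemplar dictionary, then reuses the same activations to combine warping-function and residual exemplars. This is a minimal sketch, not the thesis implementation: the KL-NMF multiplicative update, the penalty value, the dictionary shapes, and the simplified warping step via `np.interp` (which assumes a monotonically increasing warping curve) are all illustrative assumptions.

```python
import numpy as np

def sparse_activations(x, D, n_iter=200, penalty=0.1):
    """Non-negative sparse coding of a frame x against dictionary D
    (bins x atoms) via KL-NMF multiplicative updates with an L1 penalty."""
    h = np.full(D.shape[1], 1.0 / D.shape[1])
    for _ in range(n_iter):
        recon = D @ h + 1e-12                     # current reconstruction
        h *= (D.T @ (x / recon)) / (D.sum(axis=0) + penalty)
    return h

def convert_frame(x_src, D_src, W_dict, R_dict):
    """Convert one source spectral frame.
    D_src : source spectral exemplars    (bins x atoms)
    W_dict: warping-function exemplars   (bins x atoms), aligned with D_src
    R_dict: spectral-residual exemplars  (bins x atoms), aligned with D_src"""
    h = sparse_activations(x_src, D_src)
    warp = W_dict @ h                 # frame-specific warping curve
    residual = R_dict @ h             # frame-specific residual compensation
    bins = np.arange(len(x_src), dtype=float)
    # Simplified warping step: resample the source spectrum along the
    # warped frequency axis; assumes warp is monotonically increasing.
    x_warped = np.interp(bins, warp, x_src)
    return x_warped + residual
```

Because the activations are shared across the three aligned dictionaries, a handful of active exemplars determines both the warping and the residual for each frame, which is what avoids the GMM-style averaging effect.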

Second, we propose a non-parallel-data voice conversion framework called the Average Modeling Approach (AMA). The approach uses a multi-speaker average model that maps speaker-independent linguistic features to speaker-dependent acoustic features. As linguistic and acoustic features can be extracted from the same utterance, the approach requires no parallel data for average model training or adaptation. Two average model adaptation approaches are introduced, namely model-based adaptation and feature-based adaptation. Experiments on the Voice Conversion Challenge 2018 (VCC2018) database confirm the effectiveness of the proposed method.
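
The sketch below illustrates the average-model idea under stated assumptions: speaker-independent linguistic features (e.g. phonetic posteriorgrams from an ASR front end) are mapped to acoustic features by a small feedforward network, and model-based adaptation is approximated by fine-tuning on the target speaker's own feature pairs. The architecture, feature dimensions, and training loop are illustrative, not the thesis configuration.

```python
import torch
from torch import nn

class AverageModel(nn.Module):
    """Multi-speaker average model: maps speaker-independent linguistic
    features to speaker-dependent acoustic features. The dimensions and
    depth here are illustrative assumptions."""
    def __init__(self, ling_dim=144, acoustic_dim=60, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ling_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, acoustic_dim),
        )

    def forward(self, ling):
        return self.net(ling)

def adapt_to_target(model, ling_feats, acoustic_feats, epochs=10, lr=1e-4):
    """Model-based adaptation, sketched as plain fine-tuning: both feature
    streams are extracted from the *same* target-speaker utterances, so no
    parallel source-target data is needed."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for ling, ac in zip(ling_feats, acoustic_feats):
            opt.zero_grad()
            loss_fn(model(ling), ac).backward()
            opt.step()
    return model
```

At conversion time, linguistic features extracted from any source speaker can be fed through the adapted model, since the linguistic representation itself carries no speaker identity.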

For synthetic speech detection, we investigate different feature representations for discriminating between live and synthetic speech. Because speech synthesis techniques typically operate on low-dimensional features, much of the fine spectral detail is discarded in synthetic speech. This work therefore focuses on the detailed information carried by high-dimensional features, and especially on their high-frequency components, for synthetic speech detection. Dynamic features are also computed to assess how well temporal information captures artifacts across frames. Experiments on the standard ASVspoof 2015 corpus suggest that both high-dimensional features and dynamic features are useful for synthetic speech detection.
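
The dynamic features mentioned above are conventionally the standard delta coefficients, delta_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2), computed over whatever static features are used. A minimal numpy sketch follows; the window size N=2 is a common choice assumed here for illustration.

```python
import numpy as np

def delta_features(feats, N=2):
    """Delta (dynamic) features for a (frames x dims) feature matrix:
    delta_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum n^2)."""
    T = feats.shape[0]
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")  # replicate ends
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    delta = np.zeros_like(feats, dtype=float)
    for n in range(1, N + 1):
        delta += n * (padded[N + n : N + n + T] - padded[N - n : N - n + T])
    return delta / denom
```

Stacking the static features with their deltas (and delta-deltas, obtained by applying the function twice) gives a classifier direct access to cross-frame artifacts.
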
Bibliographic Details
Main Author: Tian, Xiaohai
Other Authors: Chng Eng Siong
Format: Theses and Dissertations
Language: English
Published: 2019
Subjects: DRNTU::Engineering::Computer science and engineering
Online Access:https://hdl.handle.net/10356/89880
http://hdl.handle.net/10220/47729
Institution: Nanyang Technological University
Thesis: Doctoral thesis (Doctor of Philosophy), School of Computer Science and Engineering
DOI: 10.32657/10220/47729
Citation: Tian, X. (2019). Voice conversion with parallel/non-parallel data and synthetic speech detection. Doctoral thesis, Nanyang Technological University, Singapore.
Extent: 138 p. (application/pdf)
Collection: DR-NTU, NTU Library, Singapore