Voice conversion with parallel/non-parallel data and synthetic speech detection
Main Author: Tian, Xiaohai
Other Authors: Chng Eng Siong
Format: Theses and Dissertations
Language: English
Published: 2019
Subjects: DRNTU::Engineering::Computer science and engineering
Online Access: https://hdl.handle.net/10356/89880 ; http://hdl.handle.net/10220/47729
Institution: Nanyang Technological University
Citation: Tian, X. (2019). Voice conversion with parallel/non-parallel data and synthetic speech detection. Doctoral thesis, Nanyang Technological University, Singapore.
DOI: 10.32657/10220/47729
Degree: Doctor of Philosophy
Extent: 138 p. (application/pdf)
Collection: DR-NTU, NTU Library, Singapore

Description:
The objective of voice conversion is to transform a source speaker's voice so that it sounds like that of a target speaker. Voice conversion is a popular branch of personalized speech generation. On one hand, it can be applied to practical problems such as emotion conversion, improving the intelligibility of disordered speech, or turning whisper or murmur into normal speech. On the other hand, voice conversion also poses a threat to Automatic Speaker Verification (ASV) systems. Synthetic speech detection discriminates between live and synthetic speech and thus offers a feasible way to improve the robustness of ASV systems and to protect them. This thesis focuses on two aspects: improving voice conversion performance and studying countermeasures for synthetic speech detection.
For voice conversion, a typical system requires parallel data, in which the source and target utterances share the same linguistic content, for conversion model training. Because parallel data are difficult to collect in practice, the challenge is to estimate a robust conversion function from limited parallel data, or even from non-parallel data. In this thesis, we propose two novel voice conversion methods that improve system performance for parallel-data and non-parallel-data voice conversion respectively.
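To make "parallel data" concrete: before a frame-wise conversion function can be fitted, the frames of a source/target pair of utterances of the same sentence are typically time-aligned, commonly with dynamic time warping (DTW). The NumPy sketch below illustrates that standard preprocessing step under assumed feature dimensions and synthetic data; it is not the thesis implementation.

```python
# Minimal sketch (assumptions, not thesis code): align one parallel
# utterance pair with DTW to produce frame-level training pairs.
import numpy as np

def dtw_align(src, tgt):
    """Return index pairs (i, j) aligning src[i] to tgt[j] under DTW."""
    n, m = len(src), len(tgt)
    # Pairwise Euclidean distances between all source/target frames.
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from (n, m) to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy spectral features for one parallel sentence (e.g. MCEP-like vectors).
rng = np.random.default_rng(0)
src = rng.standard_normal((120, 24))   # source speaker, 120 frames
tgt = rng.standard_normal((140, 24))   # target speaker, 140 frames
pairs = dtw_align(src, tgt)
X = np.stack([src[i] for i, _ in pairs])  # aligned source frames
Y = np.stack([tgt[j] for _, j in pairs])  # aligned target frames
print(X.shape, Y.shape)  # paired frames for conversion-model training
```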
First, we propose a novel parallel-data voice conversion framework that combines frequency warping with exemplar-based methods. In this exemplar-based frequency warping (EFW) framework, each warping function and spectral residual is generated from a dictionary of warping-function exemplars and a dictionary of spectral-residual exemplars, respectively. With a sparsity constraint, we avoid the statistical averaging effect of Gaussian Mixture Models (GMMs) and obtain more accurate warping functions and residual compensation. Experiments on the VOICES database show that the proposed method significantly improves speech quality compared with state-of-the-art parametric methods.
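The following is a hedged sketch of the exemplar idea, not the EFW implementation: nonnegative sparse coding estimates a sparse activation vector over a dictionary of source exemplars, and the same activations are reused on a coupled target-side dictionary. For simplicity, a single spectral dictionary stands in for the thesis's warping-function and residual dictionaries, and the multiplicative-update solver is an assumed, generic choice.

```python
# Sketch of exemplar-based conversion with a sparsity constraint
# (simplified, assumed setup; not the thesis code).
import numpy as np

def sparse_activations(A, x, lam=0.1, n_iter=200):
    """Nonnegative sparse coding: min ||x - A h||^2 + lam * sum(h), h >= 0,
    solved with NMF-style multiplicative updates."""
    K = A.shape[1]
    h = np.full(K, 1.0 / K)
    AtA, Atx = A.T @ A, A.T @ x
    for _ in range(n_iter):
        # The sparsity weight lam in the denominator shrinks activations.
        h *= Atx / (AtA @ h + lam + 1e-12)
    return h

rng = np.random.default_rng(1)
d, K = 24, 200                               # feature dim, dictionary size
A = np.abs(rng.standard_normal((d, K)))      # source spectral exemplars
B = np.abs(rng.standard_normal((d, K)))      # coupled target-side exemplars
x = A @ np.maximum(rng.standard_normal(K), 0)  # a source frame

h = sparse_activations(A, x)
y = B @ h   # converted frame: same sparse activations, target dictionary
print(f"active exemplars: {(h > 1e-3).sum()} of {K}")
```

Because only a few exemplars receive non-zero activations, the output is built from a handful of real observed frames rather than a statistical average, which is the intuition behind avoiding GMM over-smoothing.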
Second, we propose a non-parallel-data voice conversion framework that we call the Average Modeling Approach (AMA). The approach uses a multi-speaker average model that maps speaker-independent linguistic features to speaker-dependent acoustic features. Because linguistic and acoustic features can be extracted from the same utterance, the approach requires no parallel data for average-model training or adaptation. Two adaptation strategies are introduced: model-based adaptation and feature-based adaptation. Experiments on the Voice Conversion Challenge 2018 (VCC2018) database confirm the effectiveness of the proposed method.
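As a rough illustration of the average-model idea (an assumption-laden sketch, not the AMA implementation), the snippet below fits a linear mapping from speaker-independent linguistic features to acoustic features on pooled multi-speaker data, then interpolates it toward a re-fit on a small amount of target-speaker data as a crude stand-in for model-based adaptation. All features, dimensions, and the interpolation weight are placeholders.

```python
# Hedged sketch of an "average model" plus adaptation (assumed setup).
# Note: both feature streams come from the same audio, so no parallel
# source/target utterances are needed anywhere in this pipeline.
import numpy as np

def ridge_fit(L, A, reg=1e-3):
    """Fit W such that A ~= L @ W, by ridge regression."""
    d = L.shape[1]
    return np.linalg.solve(L.T @ L + reg * np.eye(d), L.T @ A)

rng = np.random.default_rng(2)
d_ling, d_acou = 40, 25   # e.g. linguistic-posterior dim, MCEP dim (illustrative)

# Pooled data from many speakers -> speaker-averaged mapping.
L_pool = rng.standard_normal((5000, d_ling))
A_pool = L_pool @ rng.standard_normal((d_ling, d_acou))
W_avg = ridge_fit(L_pool, A_pool)

# Adaptation: a few target-speaker frames, interpolated with the
# average model so that limited data does not dominate.
L_tgt = rng.standard_normal((200, d_ling))
A_tgt = L_tgt @ rng.standard_normal((d_ling, d_acou))
W_tgt = ridge_fit(L_tgt, A_tgt)
W_adapted = 0.7 * W_avg + 0.3 * W_tgt   # illustrative interpolation weight

# Conversion: extract linguistic features from any source utterance,
# then predict target-speaker acoustics from them.
src_ling = rng.standard_normal((120, d_ling))
converted = src_ling @ W_adapted
print(converted.shape)   # (120, 25) converted acoustic frames
```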
For synthetic speech detection, we investigated different feature representations for discriminating between live and synthetic speech. Because speech synthesis techniques typically operate on low-dimensional features, fine spectral detail is discarded in synthetic speech. This work therefore focused on the detailed information carried by high-dimensional features, especially their high-frequency components, for synthetic speech detection. Dynamic features were also computed to assess how well temporal information captures artifacts across frames. Experiments on the standard ASVspoof 2015 corpus suggest that both high-dimensional features and dynamic features are useful for synthetic speech detection.
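The delta-regression formula below is the standard way dynamic features are computed in speech processing; the window size and feature dimensionality are illustrative, so this is a sketch of the general technique rather than the exact configuration used in the thesis.

```python
# Sketch of dynamic-feature extraction: append delta and delta-delta
# coefficients to high-dimensional static features so a detector can
# see temporal artifacts across frames.
import numpy as np

def deltas(feat, K=2):
    """Regression-based dynamic features over a window of +/- K frames:
    d[t] = sum_k k * (c[t+k] - c[t-k]) / (2 * sum_k k^2)."""
    T = len(feat)
    padded = np.pad(feat, ((K, K), (0, 0)), mode="edge")  # repeat edge frames
    denom = 2 * sum(k * k for k in range(1, K + 1))
    return sum(k * (padded[K + k:K + k + T] - padded[K - k:K - k + T])
               for k in range(1, K + 1)) / denom

rng = np.random.default_rng(3)
static = rng.standard_normal((300, 257))   # e.g. high-resolution log spectra
d1 = deltas(static)                        # delta features
d2 = deltas(d1)                            # delta-delta features
features = np.hstack([static, d1, d2])     # per-frame detector input
print(features.shape)                      # (300, 771)
```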