Voice conversion with parallel/non-parallel data and synthetic speech detection

The objective of voice conversion is to transform a source speaker's voice so that it sounds like that of a target speaker. Voice conversion belongs to the popular area of personalized speech generation. On the one hand, it can be applied to problems such as emotion conversion, improving the intelligibility of speech, or converting whisper or murmur into speech. On the other hand, voice conversion also presents a threat to automatic speaker verification (ASV) systems. Synthetic speech detection discriminates between live and synthetic speech, and so provides a feasible way to improve the robustness of ASV systems and protect them. This thesis focuses on two aspects: improving voice conversion performance and studying countermeasures for synthetic speech detection.

A typical voice conversion system requires parallel data, i.e. source and target utterances with the same linguistic content, for conversion model training. As parallel data is often difficult to collect in practice, the challenge is to estimate a robust conversion function from limited parallel data, or even from non-parallel data. In this thesis, we propose two novel voice conversion methods to improve system performance for both parallel-data and non-parallel-data voice conversion. First, we propose a parallel-data voice conversion framework that combines frequency warping and exemplar-based methods. Under this exemplar-based frequency warping (EFW) framework, each warping function and spectral residual is generated from a dictionary of warping-function exemplars and a dictionary of spectral-residual exemplars, respectively. With a sparsity constraint, we avoid the statistical averaging effect of Gaussian mixture models (GMMs) and obtain more accurate warping functions and residual compensation. Experiments on the VOICES database show that the proposed method significantly improves speech quality compared with state-of-the-art parametric methods.
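
To make the exemplar-based idea concrete, the sketch below estimates non-negative sparse activations of a source spectral frame against a source exemplar dictionary, then reuses the same activations to combine warping-function and residual exemplars. This is a minimal sketch, not the thesis implementation: the KL-NMF multiplicative update, the penalty value, the dictionary shapes, and the simplified warping step via `np.interp` (which assumes a monotonically increasing warping curve) are all illustrative assumptions.

```python
import numpy as np

def sparse_activations(x, D, n_iter=200, penalty=0.1):
    """Non-negative sparse coding of a frame x against dictionary D
    (bins x atoms) via KL-NMF multiplicative updates with an L1 penalty."""
    h = np.full(D.shape[1], 1.0 / D.shape[1])
    for _ in range(n_iter):
        recon = D @ h + 1e-12                     # current reconstruction
        h *= (D.T @ (x / recon)) / (D.sum(axis=0) + penalty)
    return h

def convert_frame(x_src, D_src, W_dict, R_dict):
    """Convert one source spectral frame.
    D_src : source spectral exemplars    (bins x atoms)
    W_dict: warping-function exemplars   (bins x atoms), aligned with D_src
    R_dict: spectral-residual exemplars  (bins x atoms), aligned with D_src"""
    h = sparse_activations(x_src, D_src)
    warp = W_dict @ h                 # frame-specific warping curve
    residual = R_dict @ h             # frame-specific residual compensation
    bins = np.arange(len(x_src), dtype=float)
    # Simplified warping step: resample the source spectrum along the
    # warped frequency axis; assumes warp is monotonically increasing.
    x_warped = np.interp(bins, warp, x_src)
    return x_warped + residual
```

Because the activations are shared across the three aligned dictionaries, a handful of active exemplars determines both the warping and the residual for each frame, which is what avoids the GMM-style averaging effect.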

Second, we propose a non-parallel-data voice conversion framework called the Average Modeling Approach (AMA). The approach uses a multi-speaker average model that maps speaker-independent linguistic features to speaker-dependent acoustic features. As linguistic and acoustic features can be extracted from the same utterance, the approach requires no parallel data for average model training or adaptation. Two average model adaptation approaches are introduced, namely model-based adaptation and feature-based adaptation. Experiments on the Voice Conversion Challenge 2018 (VCC2018) database confirm the effectiveness of the proposed method.
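
The sketch below illustrates the average-model idea under stated assumptions: speaker-independent linguistic features (e.g. phonetic posteriorgrams from an ASR front end) are mapped to acoustic features by a small feedforward network, and model-based adaptation is approximated by fine-tuning on the target speaker's own feature pairs. The architecture, feature dimensions, and training loop are illustrative, not the thesis configuration.

```python
import torch
from torch import nn

class AverageModel(nn.Module):
    """Multi-speaker average model: maps speaker-independent linguistic
    features to speaker-dependent acoustic features. The dimensions and
    depth here are illustrative assumptions."""
    def __init__(self, ling_dim=144, acoustic_dim=60, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ling_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, acoustic_dim),
        )

    def forward(self, ling):
        return self.net(ling)

def adapt_to_target(model, ling_feats, acoustic_feats, epochs=10, lr=1e-4):
    """Model-based adaptation, sketched as plain fine-tuning: both feature
    streams are extracted from the *same* target-speaker utterances, so no
    parallel source-target data is needed."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for ling, ac in zip(ling_feats, acoustic_feats):
            opt.zero_grad()
            loss_fn(model(ling), ac).backward()
            opt.step()
    return model
```

At conversion time, linguistic features extracted from any source speaker can be fed through the adapted model, since the linguistic representation itself carries no speaker identity.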

For synthetic speech detection, we investigate different feature representations for discriminating between live and synthetic speech. Because speech synthesis techniques typically operate on low-dimensional features, much of the fine spectral detail is discarded in synthetic speech. This work therefore focuses on the detailed information carried by high-dimensional features, and especially on their high-frequency components, for synthetic speech detection. Dynamic features are also computed to assess how well temporal information captures artifacts across frames. Experiments on the standard ASVspoof 2015 corpus suggest that both high-dimensional features and dynamic features are useful for synthetic speech detection.
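
The dynamic features mentioned above are conventionally the standard delta coefficients, delta_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2), computed over whatever static features are used. A minimal numpy sketch follows; the window size N=2 is a common choice assumed here for illustration.

```python
import numpy as np

def delta_features(feats, N=2):
    """Delta (dynamic) features for a (frames x dims) feature matrix:
    delta_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum n^2)."""
    T = feats.shape[0]
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")  # replicate ends
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    delta = np.zeros_like(feats, dtype=float)
    for n in range(1, N + 1):
        delta += n * (padded[N + n : N + n + T] - padded[N - n : N - n + T])
    return delta / denom
```

Stacking the static features with their deltas (and delta-deltas, obtained by applying the function twice) gives a classifier direct access to cross-frame artifacts.
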
Bibliographic Details
Main Author: Tian, Xiaohai
Other Authors: Chng Eng Siong
Format: Theses and Dissertations
Language: English
Published: 2019
Subjects: DRNTU::Engineering::Computer science and engineering
Online Access:https://hdl.handle.net/10356/89880
http://hdl.handle.net/10220/47729
Institution: Nanyang Technological University
Thesis: Doctoral thesis (Doctor of Philosophy), School of Computer Science and Engineering
DOI: 10.32657/10220/47729
Citation: Tian, X. (2019). Voice conversion with parallel/non-parallel data and synthetic speech detection. Doctoral thesis, Nanyang Technological University, Singapore.
Extent: 138 p. (application/pdf)
Collection: DR-NTU, NTU Library, Singapore