Spectral mapping for voice conversion

Voice conversion is the process of modifying the speech signal of one speaker (source) so that it sounds like an intended speaker (target) without changing the linguistic content. This technology has several applications, such as personalized speech synthesis and speech-to-singing conversion, and a possible malicious applicat...

Bibliographic Details
Main Author: Wu, Zhi Zheng
Other Authors: Chng Eng Siong
Format: Theses and Dissertations
Language: English
Published: 2015
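Subjects: DRNTU::Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence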
Online Access: http://hdl.handle.net/10356/63286
Institution: Nanyang Technological University
id sg-ntu-dr.10356-63286
record_format dspace
spelling sg-ntu-dr.10356-63286 2023-03-04T00:34:53Z Spectral mapping for voice conversion Wu, Zhi Zheng Chng Eng Siong Li Haizhou School of Computer Engineering Emerging Research Lab DRNTU::Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence Doctor of Philosophy (SCE) 2015-05-12T04:06:05Z 2015-05-12T04:06:05Z 2015 2015 Thesis Wu, Z. Z. (2015). Spectral mapping for voice conversion. Doctoral thesis, Nanyang Technological University, Singapore. http://hdl.handle.net/10356/63286 en 154 p. application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
description Voice conversion is the process of modifying the speech signal of one speaker (source) so that it sounds like an intended speaker (target) without changing the linguistic content. This technology has several applications, such as personalized speech synthesis and speech-to-singing conversion, as well as a possible malicious application: fooling a voice biometric system, also called a speaker verification system. This thesis focuses on techniques to improve voice conversion performance and on the use of voice conversion technology to attack speaker verification systems. The study is important because the robustness of the conversion function directly affects voice conversion performance. To this end, the thesis first proposes a method that improves the voice conversion function by exploiting nonparallel data from background speakers. The strategy is to decompose a speech spectral vector into a phonetic component and a speaker-specific component, which are modeled by a factor analysis model. The nonparallel data are used to estimate the phonetic component and the factor loadings. The speaker-specific component can then be represented by a low-dimensional set of variables via the factor loadings, which allows more robust modeling when training data are sparse. The experimental results show that the proposed method considerably outperforms the conventional Gaussian mixture model (GMM) based method when parallel training data are limited. The second contribution of the thesis implements the conversion function by directly modeling the high-dimensional spectral features using exemplars found in the training data. In this approach, each speech segment is reconstructed as a weighted linear combination of a set of basis exemplars with residual compensation. An exemplar is defined as a speech segment spanning multiple frames extracted from the training data.
The linear combination weights are constrained to be nonnegative, and most of them are restricted to values close to zero. Experiments compare the proposed method with a large set of baseline approaches. The proposed method achieves performance similar to the state-of-the-art GMM-based and dynamic kernel partial least squares based voice conversion methods. The experiments also confirm its flexibility when the amount of training data is varied. The third contribution focuses on the use of voice conversion technology to attack current state-of-the-art speaker verification systems in order to identify the weak links of the verification algorithms. Recently, speaker verification technology has advanced significantly and has led to mass-market adoption, such as user authentication on smartphones. A major concern when deploying speaker verification technology is whether a system remains robust against spoofing attacks. Unfortunately, voice conversion has become one of the most easily accessible techniques for carrying out spoofing attacks, and hence presents a threat to speaker verification systems. To address this concern, the thesis examines the vulnerabilities of nine current state-of-the-art speaker verification systems in the face of voice conversion spoofing attacks.
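The exemplar-based reconstruction summarized in the description can be illustrated with a small Python sketch. It is not the thesis implementation. The squared-error objective, the L1 sparsity penalty, the multiplicative update rule, the reuse of the source weights on paired target exemplars, and every name below (`sparse_nonneg_weights`, `combine`, `sparsity`) are assumptions made for illustration. Residual compensation is omitted.

```python
# Illustrative sketch only: nonnegative, sparse weights over basis exemplars.
# Assumptions (not from the record): squared-error objective, L1 penalty,
# multiplicative updates, and paired source/target exemplars.

def sparse_nonneg_weights(x, exemplars, sparsity=0.1, iters=200, eps=1e-12):
    """Return weights h >= 0 so that sum_k h[k] * exemplars[k] approximates x.

    x         -- nonnegative feature vector (e.g. stacked spectral frames)
    exemplars -- list of nonnegative basis vectors with the same length as x
    sparsity  -- L1 penalty weight; larger values push more weights to zero
    """
    n = len(exemplars)
    h = [1.0] * n  # positive start; multiplicative updates never go negative
    # Precompute A^T x and A^T A, where the exemplars are the columns of A.
    atx = [sum(a_i * x_i for a_i, x_i in zip(a, x)) for a in exemplars]
    ata = [[sum(p * q for p, q in zip(a, b)) for b in exemplars]
           for a in exemplars]
    for _ in range(iters):
        h = [
            h[k] * atx[k]
            / (sum(ata[k][j] * h[j] for j in range(n)) + sparsity + eps)
            for k in range(n)
        ]
    return h


def combine(weights, exemplars):
    """Weighted linear combination of exemplars."""
    dim = len(exemplars[0])
    return [sum(w * e[d] for w, e in zip(weights, exemplars))
            for d in range(dim)]


# Toy data: three paired exemplars (hypothetical values).
source_exemplars = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
target_exemplars = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 4.0]]

x = [0.5, 0.0, 1.0]  # source segment to convert
h = sparse_nonneg_weights(x, source_exemplars, sparsity=0.01)
converted = combine(h, target_exemplars)  # assumed conversion step
```

In this toy case the weight on the unused second exemplar goes to exactly zero, while the other two settle slightly below the values in `x` because of the penalty. A real system would use far more exemplars than feature dimensions, which is where the sparsity constraint matters.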
author2 Chng Eng Siong
format Theses and Dissertations
author Wu, Zhi Zheng
title Spectral mapping for voice conversion
publishDate 2015
url http://hdl.handle.net/10356/63286