Spectral mapping for voice conversion

Voice conversion is the process of modifying a speech signal of one speaker (source) to sound like an intended speaker (target) without changing the language content. This technology has several applications, such as personalized speech synthesis and speech-to-singing conversion, and a possible malicious applicat...

Bibliographic Details
Main Author:	Wu, Zhi Zheng
Other Authors:	Chng Eng Siong
Format:	Theses and Dissertations
Language:	English
Published:	2015
Subjects:	DRNTU::Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Online Access:	http://hdl.handle.net/10356/63286
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-63286
record_format	dspace
spelling	sg-ntu-dr.10356-63286 2023-03-04T00:34:53Z Spectral mapping for voice conversion Wu, Zhi Zheng Chng Eng Siong Li Haizhou School of Computer Engineering Emerging Research Lab DRNTU::Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence Doctor of Philosophy (SCE) 2015-05-12T04:06:05Z 2015-05-12T04:06:05Z 2015 2015 Thesis Wu, Z. Z. (2015). Spectral mapping for voice conversion. Doctoral thesis, Nanyang Technological University, Singapore. http://hdl.handle.net/10356/63286 en 154 p. application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	DRNTU::Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
spellingShingle	DRNTU::Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence Wu, Zhi Zheng Spectral mapping for voice conversion
description	Voice conversion is the process of modifying a speech signal of one speaker (source) to sound like an intended speaker (target) without changing the language content. This technology has several applications, such as personalized speech synthesis and speech-to-singing conversion, and a possible malicious application: fooling a voice biometric system, also called a speaker verification system. This thesis focuses on techniques to improve voice conversion performance and on the use of voice conversion technology to attack speaker verification systems. The study is important because the robustness of the conversion function directly affects the performance of voice conversion. To this end, the thesis first proposes a method to improve the voice conversion function by exploiting nonparallel data from background speakers. The strategy is to decompose a speech spectral vector into a phonetic component and a speaker-specific component, which are modeled by a factor analysis model. The nonparallel data are used to estimate the phonetic component and the factor loadings. The speaker-specific component can then be represented by a low-dimensional set of variables via the factor loadings, which allows for more robust modeling under sparse training data conditions. The experimental results show that the proposed method considerably outperforms the conventional Gaussian mixture model (GMM) based method when parallel training data are limited. The second contribution of the thesis focuses on implementing the conversion function by directly modeling the high-dimensional spectral features using exemplars found in the training data. In this approach, each speech segment is reconstructed as a weighted linear combination of a set of basis exemplars with residual compensation. An exemplar is defined as a speech segment spanning multiple frames extracted from the training data. The linear combination weights are constrained to be nonnegative, and most of them are restricted to values close to zero. Experiments are conducted to compare the proposed method with a large set of baseline approaches. The proposed method is observed to achieve performance similar to that of the state-of-the-art GMM-based and dynamic kernel partial least squares based voice conversion methods. The experiments also confirm its flexibility when the amount of training data is varied. The third contribution focuses on the use of voice conversion technology to attack current state-of-the-art speaker verification systems in order to identify the weak links of the verification algorithms. Recently, speaker verification technology has advanced significantly and has led to mass-market adoption, such as in smartphones for user authentication. A major concern when deploying speaker verification technology is whether a system remains robust against spoofing attacks. Unfortunately, voice conversion has become one of the most easily accessible techniques for carrying out spoofing attacks, and hence presents a threat to speaker verification systems. To address this concern, the thesis examines the vulnerabilities of nine current state-of-the-art speaker verification systems in the face of voice conversion spoofing attacks.
author2	Chng Eng Siong
author_facet	Chng Eng Siong Wu, Zhi Zheng
format	Theses and Dissertations
author	Wu, Zhi Zheng
author_sort	Wu, Zhi Zheng
title	Spectral mapping for voice conversion
publishDate	2015
url	http://hdl.handle.net/10356/63286
_version_	1759856427244978176
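
The exemplar-based idea summarized in the description (each segment reconstructed as a nonnegative, mostly near-zero weighted combination of exemplars) can be illustrated with a minimal sketch. It is not the thesis's implementation: the paired dictionaries `A_src` and `B_tgt`, the Euclidean cost with an L1 penalty, the multiplicative update rule, and all function and parameter names are assumptions made for illustration, and both residual compensation and multi-frame exemplars are left out.

```python
import numpy as np


def estimate_activations(X, A, sparsity=0.1, n_iter=200, eps=1e-12):
    """Estimate nonnegative, sparse weights H such that X is roughly A @ H.

    X: (D, T) nonnegative spectral frames to be explained.
    A: (D, N) nonnegative dictionary whose columns are exemplars.
    The multiplicative update keeps every weight nonnegative; the
    `sparsity` term in the denominator pushes most weights toward zero.
    """
    rng = np.random.default_rng(0)
    H = rng.random((A.shape[1], X.shape[1])) + eps
    AtX = A.T @ X
    AtA = A.T @ A
    for _ in range(n_iter):
        H *= AtX / (AtA @ H + sparsity + eps)
    return H


def objective(X, A, H, sparsity=0.1):
    """Squared reconstruction error plus the L1 sparsity penalty."""
    return 0.5 * np.sum((X - A @ H) ** 2) + sparsity * np.sum(H)


def convert(X_src, A_src, B_tgt, **kwargs):
    """Apply weights estimated on source exemplars to paired target exemplars.

    Assumes column j of A_src and column j of B_tgt come from the same
    aligned segment of parallel training data (hypothetical setup).
    """
    H = estimate_activations(X_src, A_src, **kwargs)
    return B_tgt @ H, H


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A_src = rng.random((20, 50))   # toy source exemplars
    B_tgt = rng.random((20, 50))   # toy paired target exemplars
    X_src = rng.random((20, 8))    # toy source frames
    Y, H = convert(X_src, A_src, B_tgt)
    print(Y.shape, H.shape)
```

The same weights are reused with the target dictionary, so conversion quality in this sketch depends entirely on how well the paired exemplars are aligned.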

Similar Items