Spectral mapping for voice conversion

Bibliographic Details
Main Author: Wu, Zhi Zheng
Other Authors: Chng Eng Siong
Format: Theses and Dissertations
Language: English
Published: 2015
Online Access: http://hdl.handle.net/10356/63286
Institution: Nanyang Technological University
Description
Summary: Voice conversion is the process of modifying the speech signal of one speaker (the source) so that it sounds like an intended speaker (the target) without changing the linguistic content. The technology has several applications, such as personalized speech synthesis and speech-to-singing, as well as a possible malicious application: fooling a voice biometric system, also called a speaker verification system. This thesis focuses on techniques to improve voice conversion performance and on the use of voice conversion technology to attack speaker verification systems. The study is important because the robustness of the conversion function directly affects voice conversion performance.

To this end, the thesis first proposes a method that improves the voice conversion function by exploiting nonparallel data from background speakers. The strategy is to decompose a speech spectral vector into a phonetic component and a speaker-specific component, which are modeled by a factor analysis model. The nonparallel data are used to estimate the phonetic component and the factor loadings. The speaker-specific component can then be represented by a low-dimensional set of variables via the factor loadings, which allows for more robust modeling when training data are sparse. The experimental results show that the proposed method considerably outperforms the conventional Gaussian mixture model (GMM) based method when parallel training data are limited.

The second contribution of the thesis focuses on implementing the conversion function by directly modeling the high-dimensional spectral features using exemplars found in the training data. In this approach, each speech segment is reconstructed as a weighted linear combination of a set of basis exemplars with residual compensation. An exemplar is defined as a speech segment spanning multiple frames extracted from the training data. The linear combination weights are constrained to be nonnegative, and most of them are restricted to values close to zero. Experiments are conducted to compare the proposed method with a large set of baseline approaches. The proposed method achieves performance similar to that of the state-of-the-art GMM based and dynamic kernel partial least squares based voice conversion methods. The experiments also confirm its flexibility when the amount of training data is varied.

The third contribution focuses on the use of voice conversion technology to attack current state-of-the-art speaker verification systems, with the aim of identifying the weak links of the verification algorithms. Speaker verification technology has advanced significantly in recent years and has led to mass market adoption, such as in smartphones for user authentication. A major concern when deploying speaker verification technology is whether a system remains robust against spoofing attacks. Unfortunately, voice conversion has become one of the most easily accessible techniques for carrying out spoofing attacks, and hence presents a threat to speaker verification systems. To address this concern, the thesis examines the vulnerabilities of nine current state-of-the-art speaker verification systems in the face of voice conversion spoofing attacks.
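
Illustrative sketch (not taken from the thesis): the exemplar-based idea in the second contribution can be pictured as nonnegative sparse coding against a fixed dictionary. The Python code below is a minimal sketch under several assumptions: paired source and target exemplars are stored as the columns of two matrices, each exemplar is flattened to one vector, the weights are estimated with a standard multiplicative update for an L1-penalized Euclidean objective, and the residual compensation step is omitted because the summary does not describe it. All function names, parameters, and the random demo data are hypothetical.

```python
import numpy as np


def estimate_activations(source_dict, spectrogram, sparsity=0.1, n_iter=200, eps=1e-12):
    """Estimate nonnegative, sparse weights H such that spectrogram ~ source_dict @ H.

    Assumed objective: 0.5 * ||X - A H||^2 + sparsity * sum(H), with H >= 0.
    All inputs must be nonnegative (e.g. magnitude spectra).
    """
    n_exemplars = source_dict.shape[1]
    n_frames = spectrogram.shape[1]
    # Start from uniform positive weights; multiplicative updates keep them nonnegative.
    activations = np.full((n_exemplars, n_frames), 1.0 / n_exemplars)
    gram = source_dict.T @ source_dict    # A^T A
    cross = source_dict.T @ spectrogram   # A^T X
    for _ in range(n_iter):
        # The sparsity term in the denominator shrinks weights toward zero.
        activations *= cross / (gram @ activations + sparsity + eps)
    return activations


def convert(source_dict, target_dict, spectrogram, **kwargs):
    """Apply the weights found on the source exemplars to the paired target exemplars."""
    activations = estimate_activations(source_dict, spectrogram, **kwargs)
    return target_dict @ activations, activations


# Toy demo with random nonnegative data standing in for real spectral features.
rng = np.random.default_rng(0)
dim, n_exemplars, n_frames = 20, 50, 8
source_dict = rng.random((dim, n_exemplars))   # columns: source exemplars
target_dict = rng.random((dim, n_exemplars))   # columns: paired target exemplars
true_weights = np.zeros((n_exemplars, n_frames))
for t in range(n_frames):
    active = rng.choice(n_exemplars, size=3, replace=False)
    true_weights[active, t] = rng.random(3)
spectrogram = source_dict @ true_weights       # synthetic source utterance

converted, activations = convert(source_dict, target_dict, spectrogram)
```

The sketch shares one weight matrix between the source and target dictionaries, which is what lets weights estimated on source speech produce target-like output. A real system would build both dictionaries from time-aligned parallel recordings.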