Linguistic and acoustic analysis of voice disguise by impersonators

Human voices are distinct and rich with information such as the age, gender, emotional state and identity of the speaker. Voice impersonators possess a great deal of flexibility over their voices which allows them to imitate various people or characters. The question as to how impersonators are able...

Bibliographic Details
Main Author:	Talal Amin
Other Authors:	Pina Marziliano
Format:	Thesis-Doctor of Philosophy
Language:	English
Published:	Nanyang Technological University 2015
Subjects:	DRNTU::Humanities::Language::Linguistics DRNTU::Engineering::Electrical and electronic engineering
Online Access:	http://hdl.handle.net/10356/62949
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-62949
record_format	dspace
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	DRNTU::Humanities::Language::Linguistics DRNTU::Engineering::Electrical and electronic engineering
spellingShingle	DRNTU::Humanities::Language::Linguistics DRNTU::Engineering::Electrical and electronic engineering Talal Amin Linguistic and acoustic analysis of voice disguise by impersonators
description	Human voices are distinct and rich with information such as the age, gender, emotional state and identity of the speaker. Voice impersonators possess a great deal of flexibility over their voices, which allows them to imitate various people or characters. The question of how impersonators are able to transform their voices, and which linguistic and acoustic parameters they rely on, is still relatively unexplored. Understanding how they are able to transform their voices holds the key for many applications, such as speaker recognition, voice transformation, voice disguise detection and speech coding. In the first part of the thesis, the extent to which professional voice artists are able to modulate their voices in order to produce distinct and natural-sounding voice identities was investigated. For this purpose, a database of voice impersonations was first constructed using data from three professional voice artists (one male, two female). Each artist produced 9 different voice identities, including their natural voice. The data included synchronous speech and electroglottograph signals. The electroglottograph signals provide useful insights into the complex periodic movements of the vocal folds. An acoustic and linguistic analysis was then performed to understand how various parameters such as pitch, vocal fold timing (open quotient measured from electroglottograph signals), speech rate and vocal tract formants are manipulated by the artists. The analysis revealed that the artists utilized variation in both their glottal and vocal tract characteristics for impersonating different ages and genders. The glottal measures were found to be highly correlated with the perceived age and gender of the impersonated voices. In a novel finding, the artists were found to make changes to their vowel formants on a vowel-by-vowel basis. 
In terms of vowel space variability, the artists were also found to be more consistent in their natural voices than in their disguised voices. A listening experiment revealed that the artists were highly successful in deceiving human listeners, who correctly identified only 56% of the disguised voices. A new objective metric of voice naturalness was proposed which utilizes the variability related to the vowel space. The objective metric was found to correlate highly with the subjective naturalness ratings of the voices. We also highlight the various constraints involved and the space available to a speaker for producing natural-sounding impersonations. A novel method for the analysis of electroglottograph signals is also introduced. This method models the electroglottograph signal as a sparse signal and allows for the automatic and reliable extraction of the glottal opening and closing instants. Unlike existing methods, this novel method models the glottal opening and closing instants as non-bandlimited signals (Diracs) and thus provides more accurate estimates of their timings. Voice impersonations also present a challenge for forensic and biometric systems. The final part of the thesis builds upon the linguistic and acoustic analysis and focuses on two biometric applications. The first application aims to automatically discriminate disguised voices of speakers from their natural voices. Acoustic variability related to vowel variances in the F1-F2 space was used as a novel feature for this purpose. This feature was used together with a quadratic discriminant classifier for automatic voice disguise detection. The proposed method was found to outperform the state-of-the-art methods. For the second application, the goal was to uncover the identity of speakers from both their natural and disguised voices. 
We proposed a novel method for forensic speaker recognition which uses a phonetic speaker modeling approach for feature extraction and then identifies speakers using the extreme learning machine classifier. This new model requires only a very short duration of speech (a frame of 25 ms) for recognition and was found to be more robust than other speaker recognition models. We also investigated and showed how different phonetic units of speech offer different levels of speaker recognition accuracy.
author2	Pina Marziliano
author_facet	Pina Marziliano Talal Amin
format	Thesis-Doctor of Philosophy
author	Talal Amin
author_sort	Talal Amin
title	Linguistic and acoustic analysis of voice disguise by impersonators
title_short	Linguistic and acoustic analysis of voice disguise by impersonators
title_full	Linguistic and acoustic analysis of voice disguise by impersonators
title_fullStr	Linguistic and acoustic analysis of voice disguise by impersonators
title_full_unstemmed	Linguistic and acoustic analysis of voice disguise by impersonators
title_sort	linguistic and acoustic analysis of voice disguise by impersonators
publisher	Nanyang Technological University
publishDate	2015
url	http://hdl.handle.net/10356/62949
_version_	1772825408341803008
spelling	sg-ntu-dr.10356-62949 2023-07-04T16:58:10Z Linguistic and acoustic analysis of voice disguise by impersonators Talal Amin Pina Marziliano School of Electrical and Electronic Engineering James Sneed German EPina@ntu.edu.sg DRNTU::Humanities::Language::Linguistics DRNTU::Engineering::Electrical and electronic engineering Doctor of Philosophy 2015-05-04T04:49:28Z 2015-05-04T04:49:28Z 2015 2015 Thesis-Doctor of Philosophy Talal Amin. (2015). Linguistic and acoustic analysis of voice disguise by impersonators. Doctoral thesis, Nanyang Technological University, Singapore. http://hdl.handle.net/10356/62949 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). 129 p. application/pdf Nanyang Technological University
