Linguistic and acoustic analysis of voice disguise by impersonators

Human voices are distinct and rich with information such as the age, gender, emotional state and identity of the speaker. Voice impersonators possess a great deal of flexibility over their voices which allows them to imitate various people or characters. The question as to how impersonators are able...

Bibliographic Details
Main Author: Talal Amin
Other Authors: Pina Marziliano
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2015
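Subjects: DRNTU::Humanities::Language::Linguistics; DRNTU::Engineering::Electrical and electronic engineering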
Online Access: http://hdl.handle.net/10356/62949
Institution: Nanyang Technological University
id sg-ntu-dr.10356-62949
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Humanities::Language::Linguistics
DRNTU::Engineering::Electrical and electronic engineering
spellingShingle DRNTU::Humanities::Language::Linguistics
DRNTU::Engineering::Electrical and electronic engineering
Talal Amin
Linguistic and acoustic analysis of voice disguise by impersonators
description Human voices are distinct and rich with information such as the age, gender, emotional state and identity of the speaker. Voice impersonators possess a great deal of flexibility over their voices, which allows them to imitate various people or characters. The question of how impersonators are able to transform their voices, and which linguistic and acoustic parameters they rely on, is still relatively unexplored. Understanding how they are able to transform their voices is key to many applications, such as speaker recognition, voice transformation, voice disguise detection and speech coding. In the first part of the thesis, the extent to which professional voice artists are able to modulate their voices in order to produce distinct and natural sounding voice identities was investigated. For this purpose, a database of voice impersonations was first constructed using data from three professional voice artists (one male, two female). Each artist produced nine different voice identities, including their natural voice. The data included synchronous speech and electroglottograph signals. The electroglottograph signals provide useful insights into the complex periodic movements of the vocal folds. An acoustic and linguistic analysis was then performed to understand how various parameters, such as pitch, vocal fold timing (open quotient measured from the electroglottograph signals), speech rate and vocal tract formants, are manipulated by the artists. The analysis revealed that the artists utilized variation in both their glottal and vocal tract characteristics for impersonating different ages and genders. The glottal measures were found to be highly correlated with the perceived age and gender of the impersonated voices. In a novel finding, the artists were found to make changes to their vowel formants on a vowel-by-vowel basis.
In terms of vowel space variability, the artists were also found to be more consistent in their natural voices than in their disguised voices. A listening experiment revealed that the artists were highly successful in deceiving human listeners, who correctly identified only 56% of the disguised voices. A new objective metric of voice naturalness was proposed which utilizes the variability related to the vowel space. The objective metric was found to correlate highly with the subjective naturalness ratings of the voices. We also highlight the various constraints involved and the space available to a speaker for producing natural sounding impersonations. A novel method for the analysis of electroglottograph signals is also introduced. This method models the electroglottograph signal as a sparse signal and allows for the automatic and reliable extraction of the glottal opening and closing instants. Compared to existing methods, this novel method models the glottal opening and closing instants as non-bandlimited signals (Diracs) and thus provides more accurate estimates of their timings. Voice impersonations also present a challenge for forensic and biometric systems. The final part of the thesis builds upon the linguistic and acoustic analysis and focuses on two biometric applications. The first application aims to automatically discriminate disguised voices of speakers from their natural voices. Acoustic variability related to vowel variances in the F1-F2 space was used as a novel feature for this purpose. This feature was used together with a quadratic discriminant classifier for automatic voice disguise detection. The proposed method was found to outperform the state-of-the-art methods. For the second application, the goal was to uncover the identity of speakers from both their natural and disguised voices.
We proposed a novel method for forensic speaker recognition which uses a phonetic speaker modeling approach for feature extraction and then identifies speakers using the extreme learning machine classifier. This new model requires a very short duration of speech (a frame of 25 ms) for recognition and was found to be more robust than other speaker recognition models. We also investigated and showed how different phonetic units of speech offer different amounts of speaker recognition accuracy.
author2 Pina Marziliano
author_facet Pina Marziliano
Talal Amin
format Thesis-Doctor of Philosophy
author Talal Amin
author_sort Talal Amin
title Linguistic and acoustic analysis of voice disguise by impersonators
title_sort linguistic and acoustic analysis of voice disguise by impersonators
publisher Nanyang Technological University
publishDate 2015
url http://hdl.handle.net/10356/62949
_version_ 1772825408341803008
spelling sg-ntu-dr.10356-62949 2023-07-04T16:58:10Z Linguistic and acoustic analysis of voice disguise by impersonators Talal Amin Pina Marziliano School of Electrical and Electronic Engineering James Sneed German EPina@ntu.edu.sg DRNTU::Humanities::Language::Linguistics DRNTU::Engineering::Electrical and electronic engineering Doctor of Philosophy 2015-05-04T04:49:28Z 2015-05-04T04:49:28Z 2015 2015 Thesis-Doctor of Philosophy Talal Amin. (2015). Linguistic and acoustic analysis of voice disguise by impersonators. Doctoral thesis, Nanyang Technological University, Singapore. http://hdl.handle.net/10356/62949 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). 129 p. application/pdf Nanyang Technological University