Linguistic and acoustic analysis of voice disguise by impersonators
| Main Author | |
|---|---|
| Other Authors | |
| Format | Thesis (Doctor of Philosophy) |
| Language | English |
| Published | Nanyang Technological University, 2015 |
| Subjects | |
| Online Access | http://hdl.handle.net/10356/62949 |
| Institution | Nanyang Technological University |
Summary: Human voices are distinct and rich with information such as the age, gender, emotional state and identity of the speaker. Voice impersonators possess a great deal of flexibility over their voices, which allows them to imitate various people or characters. The question of how impersonators are able to transform their voices and what linguistic and acoustic parameters they rely on is still relatively unexplored. Understanding how they transform their voices holds the key to many applications, such as speaker recognition, voice transformation, voice disguise detection and speech coding.
In the first part of the thesis, the extent to which professional voice artists are able to modulate their voices in order to produce distinct and natural-sounding voice identities was investigated. For this purpose, a database of voice impersonations was first constructed using data from three professional voice artists (one male, two female). Each artist produced nine different voice identities, including their natural voice. The data included synchronous speech and electroglottograph signals; the electroglottograph signals provide useful insights into the complex periodic movements of the vocal folds. An acoustic and linguistic analysis was then performed to understand how various parameters such as pitch, vocal fold timing (the open quotient, measured from the electroglottograph signals), speech rate and vocal tract formants are manipulated by the artists. The analysis revealed that the artists varied both their glottal and vocal tract characteristics to impersonate different ages and genders. The glottal measures were found to be highly correlated with the perceived age and gender of the impersonated voices. In a novel finding, the artists were found to change their vowel formants on a vowel-by-vowel basis. In terms of vowel space variability, the artists were also more consistent in their natural voices than in their disguised voices. A listening experiment revealed that the artists were highly successful in deceiving human listeners, who could correctly identify only 56% of the disguised voices. A new objective metric of voice naturalness was proposed which makes use of the variability of the vowel space; this metric was found to correlate highly with the subjective naturalness ratings of the voices. We also highlight the various constraints involved and the space available to a speaker for producing natural-sounding impersonations.
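As a rough illustration of how a vowel-space variability measure of this kind can be computed, the Python sketch below aggregates per-vowel dispersion in the F1-F2 plane. The function name, the use of the covariance trace as the dispersion measure, and the example formant values are illustrative assumptions, not the exact metric proposed in the thesis.

```python
import numpy as np

def vowel_space_variability(formants_by_vowel):
    """Aggregate F1-F2 dispersion across vowel categories.

    formants_by_vowel: dict mapping a vowel label to an (N, 2) array of
    [F1, F2] measurements in Hz from N tokens of that vowel.
    Returns the mean per-vowel dispersion (trace of the 2x2 covariance),
    a simple stand-in for a vowel-space variability metric.
    """
    dispersions = []
    for vowel, pts in formants_by_vowel.items():
        pts = np.asarray(pts, dtype=float)
        if len(pts) < 2:
            continue  # need at least two tokens to estimate a covariance
        cov = np.cov(pts, rowvar=False)    # 2x2 covariance in the F1-F2 plane
        dispersions.append(np.trace(cov))  # total variance for this vowel
    return float(np.mean(dispersions)) if dispersions else 0.0

# Example: two vowel categories with a few tokens each (values in Hz)
example = {
    "a": [[780, 1250], [810, 1190], [760, 1300]],
    "i": [[290, 2300], [310, 2250], [300, 2400]],
}
print(vowel_space_variability(example))
```

A smaller value indicates tighter per-vowel clustering, which in this framing corresponds to a more consistent (natural-voice-like) vowel space.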
A novel method for the analysis of electroglottograph signals is also introduced. This method models the electroglottograph signal as a sparse signal and allows for the automatic and reliable extraction of the glottal opening and closing instants. Compared to existing methods, it models the glottal opening and closing instants as non-bandlimited signals (Dirac impulses) and thus provides more accurate estimates of their timings.
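For context, the sketch below shows a conventional differentiated-EGG (dEGG) peak-picking baseline for locating glottal instants; it is not the sparse, Dirac-based method introduced in the thesis, and the polarity convention, peak-height threshold and `f0_min` spacing parameter are all assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

def degg_glottal_instants(egg, fs, f0_min=60.0):
    """Baseline glottal closing/opening instant estimation from an EGG signal.

    Assumes the common polarity where EGG amplitude grows with vocal fold
    contact, so closures appear as sharp positive dEGG peaks and openings
    as negative peaks.  `f0_min` (Hz) is the lowest expected fundamental
    frequency and sets the minimum spacing between successive instants.
    """
    degg = np.diff(np.asarray(egg, dtype=float))
    min_dist = int(fs / f0_min)                 # at most one closure per period
    closings, _ = find_peaks(degg, distance=min_dist,
                             height=0.3 * degg.max())
    openings, _ = find_peaks(-degg, distance=min_dist,
                             height=0.3 * (-degg).max())
    return closings / fs, openings / fs         # instants in seconds

# Usage (hypothetical loader):
#   egg, fs = load_egg("sample_egg.wav")
#   gci, goi = degg_glottal_instants(egg, fs)
```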
Voice impersonations also present a challenge for forensic and biometric systems. The final part of the thesis builds upon the linguistic and acoustic analysis and focuses on two biometric applications. The first application aims to automatically discriminate speakers' disguised voices from their natural voices. Acoustic variability related to vowel variances in the F1-F2 space was used as a novel feature for this purpose, and it was combined with a quadratic discriminant classifier for automatic voice disguise detection. The proposed method was found to outperform state-of-the-art methods.
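A minimal sketch of this style of disguise detection with a quadratic discriminant classifier, using scikit-learn, is shown below. The synthetic vowel-variance features are placeholders for illustration only and do not reflect the thesis's actual feature extraction.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)

# Toy stand-ins for vowel-variance features in the F1-F2 plane:
# natural voices (label 0) with smaller per-vowel variance, disguised
# voices (label 1) with larger variance.  Real features would be
# measured from formant tracks, not sampled like this.
natural = rng.normal(loc=1.0, scale=0.2, size=(50, 4))
disguised = rng.normal(loc=2.0, scale=0.5, size=(50, 4))
X = np.vstack([natural, disguised])
y = np.array([0] * 50 + [1] * 50)

qda = QuadraticDiscriminantAnalysis()
qda.fit(X, y)

# Classify a new utterance's feature vector (expected output: [1], disguised)
print(qda.predict(rng.normal(loc=2.0, scale=0.5, size=(1, 4))))
```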
For the second application, the goal was to uncover the identity of speakers from both their natural and disguised voices. We proposed a novel method for forensic speaker recognition which uses a phonetic speaker modeling approach for feature extraction and then identifies speakers with an extreme learning machine classifier. This new model requires only a very short duration of speech (a single 25 ms frame) for recognition and was found to be more robust than other speaker recognition models. We also investigated and showed how different phonetic units of speech yield different levels of speaker recognition accuracy.
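The extreme learning machine itself is simple to sketch: a random, fixed hidden layer followed by a closed-form least-squares solution for the output weights. The implementation below follows that generic recipe; the hidden-layer size, the tanh activation and the per-frame feature choice (e.g. MFCCs) are assumptions, not the thesis's exact configuration.

```python
import numpy as np

class ELMClassifier:
    """Minimal extreme learning machine: a random hidden layer plus output
    weights solved in closed form by least squares (generic ELM recipe)."""

    def __init__(self, n_hidden=200, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Random, fixed nonlinear projection of the input features
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        T = np.eye(int(y.max()) + 1)[y]            # one-hot speaker targets
        H = self._hidden(X)
        # Output weights: least-squares solution of H @ beta = T
        self.beta, *_ = np.linalg.lstsq(H, T, rcond=None)
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)

# Usage sketch: X_train holds one feature vector per 25 ms speech frame
# (e.g. MFCCs) and y_train the integer speaker label of each frame:
#   clf = ELMClassifier().fit(X_train, y_train)
#   speakers = clf.predict(X_test)
```

Because the hidden layer is random and only the output weights are fitted, training reduces to a single linear solve, which keeps per-frame classification fast.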