Speaker-invariant speech emotion recognition with domain adversarial training


Bibliographic Details
Main Author: Leow, Bryan Xuan Zhen
Other Authors: Jagath C Rajapakse
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2021
Online Access: https://hdl.handle.net/10356/148092
Institution: Nanyang Technological University
Description
Summary: Recent advances in technology have given birth to intelligent speech assistants such as Siri and Alexa. While these intelligent speech assistants can perform a myriad of tasks from end users' voice commands, they lack the capability to recognize human emotions when formulating a response, a feature that would enable more ingenious uses of such speech assistants. Since such a Speech Emotion Recognition (SER) system would be used by the general population, it is necessary to derive speaker-invariant representations for the SER system. In this project, we use Domain Adversarial Training (DAT) in a deep neural network to learn representations that are invariant to speaker characteristics. DAT was originally proposed for domain adaptation, in which data at training and test time come from similar but different distributions. Recognizing that speaker-invariant SER can be framed as a domain adaptation problem, with each speaker treated as a domain, we explore the use of DAT to derive speaker-invariant representations for SER and observe whether they perform better than representations formed without DAT. The DAT network for the speaker-invariant emotion recognition (SIER) task consists of an encoder, an emotion classifier, and a speaker classifier. By inserting a Gradient Reversal Layer (GRL) between the encoder and the speaker classifier, the learned emotion representation becomes independent of the speaker. DAT encoders in the existing literature have typically been limited to 1D Convolutional Neural Network (CNN) with Recurrent Neural Network (RNN) architectures. In contrast to such architectures, which use 1D filters to learn features along a single dimension, this project investigates DAT encoders with a 2D CNN with RNN architecture, which use 2D filters to learn features along two dimensions. We also investigate Log Mel Spectrogram (LMS) and Mel Frequency Cepstral Coefficient (MFCC) features for the 2D CNN with RNN DAT encoders. Our experimental results on the Emo-DB and RAVDESS datasets show that MFCC features with 2D CNN with RNN DAT encoders perform better than features and encoders that rely on 1D filters.
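
The summary describes the core DAT mechanism (an encoder feeding an emotion classifier directly and a speaker classifier through a GRL) and a 2D CNN with RNN encoder. The following is a minimal PyTorch sketch of that setup, assuming hypothetical sizes (40 Mel bins, a 128-unit GRU, 7 emotion classes, 10 speakers); the thesis's exact layers and hyperparameters are not given in this record.

    import torch
    import torch.nn as nn

    class GradientReversal(torch.autograd.Function):
        # Identity in the forward pass; scales the gradient by -lambda in the backward pass.
        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lambd * grad_output, None

    class DATEmotionNet(nn.Module):
        # Hypothetical DAT network: 2D CNN + GRU encoder, emotion head,
        # and a speaker head placed behind a gradient reversal layer.
        def __init__(self, n_mels=40, n_emotions=7, n_speakers=10, hidden=128, lambd=1.0):
            super().__init__()
            self.lambd = lambd
            # 2D filters learn along both axes of the time-frequency input
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.rnn = nn.GRU(64 * (n_mels // 4), hidden, batch_first=True)
            self.emotion_head = nn.Linear(hidden, n_emotions)
            self.speaker_head = nn.Linear(hidden, n_speakers)

        def forward(self, spec):                         # spec: (batch, 1, frames, n_mels)
            f = self.cnn(spec)                           # (batch, 64, frames/4, n_mels/4)
            f = f.permute(0, 2, 1, 3).flatten(2)         # (batch, frames/4, 64 * n_mels/4)
            _, h = self.rnn(f)                           # h: (1, batch, hidden)
            z = h.squeeze(0)                             # shared representation
            rev = GradientReversal.apply(z, self.lambd)  # reversed gradient for speaker branch
            return self.emotion_head(z), self.speaker_head(rev)

    # Usage: both heads are trained with cross-entropy on the same batch.
    model = DATEmotionNet()
    spec = torch.randn(8, 1, 128, 40)                    # e.g. 8 log-Mel spectrograms, 128 frames x 40 bins
    emo_logits, spk_logits = model(spec)
    loss = nn.functional.cross_entropy(emo_logits, torch.randint(7, (8,))) \
         + nn.functional.cross_entropy(spk_logits, torch.randint(10, (8,)))
    loss.backward()

Because the GRL negates the gradient flowing from the speaker classifier back into the encoder, minimizing the combined loss trains the speaker head to identify speakers while simultaneously pushing the encoder toward features from which speaker identity cannot be recovered, i.e. a speaker-invariant emotion representation.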