Speaker-invariant speech emotion recognition with domain adversarial training


Bibliographic Details
Main Author: Leow, Bryan Xuan Zhen
Other Authors: Jagath C Rajapakse
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2021
Subjects:
Online Access: https://hdl.handle.net/10356/148092
Institution: Nanyang Technological University
id sg-ntu-dr.10356-148092
record_format dspace
spelling sg-ntu-dr.10356-148092 2021-04-22T13:35:02Z Speaker-invariant speech emotion recognition with domain adversarial training Leow, Bryan Xuan Zhen Jagath C Rajapakse School of Computer Science and Engineering ASJagath@ntu.edu.sg Engineering::Computer science and engineering Recent advances in technology have given birth to intelligent speech assistants such as Siri and Alexa. While these assistants can perform a myriad of tasks from the end user's voice commands alone, they lack the capability to recognize human emotions when formulating a response, a feature that would enable more ingenious uses of such assistants. Since a Speech Emotion Recognition (SER) system would be used by the general population, it is necessary to derive speaker-invariant representations for the SER system. In this project, we use Domain Adversarial Training (DAT) in a deep neural network to learn representations that are invariant to speaker characteristics. DAT was developed for domain adaptation, in which data at training and test time come from similar but different distributions, here, different speakers. Recognising that speaker-invariant SER can be framed as a domain adaptation problem, we explore the use of DAT to derive speaker-invariant representations for SER and observe whether they perform better than representations formed without DAT. The DAT network for speaker-invariant emotion recognition (SIER) tasks consists of an encoder, an emotion classifier, and a speaker classifier. By placing a Gradient Reversal Layer (GRL) between the encoder and the speaker classifier, the learned emotion representation becomes independent of the speakers. DAT encoders in the existing literature have typically been limited to 1D Convolutional Neural Network (CNN) with Recurrent Neural Network (RNN) architectures.
In contrast to such architectures, which use 1D filters to learn features along a single dimension, this paper investigates DAT encoders of 2D CNN with RNN architecture, which use 2D filters to learn features along two dimensions. We also investigate Log Mel Spectrogram (LMS) and Mel Frequency Cepstral Coefficient (MFCC) features for 2D CNN with RNN DAT encoders. Our experimental results on the Emo-DB and RAVDESS datasets show that MFCC features with 2D CNN with RNN DAT encoders perform better than features and encoders that rely on 1D filters. Bachelor of Engineering (Computer Science) 2021-04-22T13:35:02Z 2021-04-22T13:35:02Z 2021 Final Year Project (FYP) Leow, B. X. Z. (2021). Speaker-invariant speech emotion recognition with domain adversarial training. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/148092 https://hdl.handle.net/10356/148092 en application/pdf Nanyang Technological University
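The adversarial setup the abstract describes hinges on the Gradient Reversal Layer: it is the identity in the forward pass, and it negates (and scales) gradients in the backward pass, so the encoder is trained to discard speaker information while the speaker classifier trains normally. A minimal stand-alone NumPy sketch of that behaviour (the class name and `lam` parameter are illustrative assumptions, not the project's code; real frameworks implement this inside autograd):

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; negates (and scales) gradients
    flowing back from the speaker classifier into the encoder."""

    def __init__(self, lam=1.0):
        self.lam = lam  # reversal strength, often annealed during training

    def forward(self, x):
        return x  # features reach the speaker classifier unchanged

    def backward(self, grad_output):
        # The reversed gradient pushes the encoder to *remove*
        # speaker information, yielding speaker-invariant features.
        return -self.lam * grad_output

grl = GradientReversal(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
print(grl.forward(x))            # unchanged: [ 1. -2.  3.]
print(grl.backward(np.ones(3)))  # reversed:  [-0.5 -0.5 -0.5]
```

Because the reversal happens only in the backward pass, the emotion classifier (which sits on the same encoder output without a GRL) still receives ordinary gradients.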
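The 1D-versus-2D encoder distinction can be made concrete with a toy convolution over a spectrogram-shaped array (time frames by mel bands). The filter sizes and the "valid" correlation below are illustrative assumptions, not the project's configuration: a 2D filter slides along both axes, while a full-height filter effectively reduces to a 1D sweep along time.

```python
import numpy as np

def conv2d_valid(spec, kernel):
    """Naive 'valid' cross-correlation of a 2D kernel over a spectrogram."""
    T, F = spec.shape
    kt, kf = kernel.shape
    out = np.empty((T - kt + 1, F - kf + 1))
    for t in range(out.shape[0]):
        for f in range(out.shape[1]):
            out[t, f] = np.sum(spec[t:t + kt, f:f + kf] * kernel)
    return out

spec = np.random.rand(100, 40)    # 100 time frames x 40 mel bands
k2d = np.ones((3, 3)) / 9.0       # 2D filter: local in time AND frequency
k1d = np.ones((3, 40)) / 120.0    # full-height filter: varies along time only

print(conv2d_valid(spec, k2d).shape)  # (98, 38): feature map over two dims
print(conv2d_valid(spec, k1d).shape)  # (98, 1):  feature map over one dim
```

The 2D output retains a frequency axis, so later layers can learn localized time-frequency patterns; the full-height filter collapses frequency immediately, which is the limitation the abstract attributes to 1D encoders.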
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering
spellingShingle Engineering::Computer science and engineering
Leow, Bryan Xuan Zhen
Speaker-invariant speech emotion recognition with domain adversarial training
author2 Jagath C Rajapakse
author_facet Jagath C Rajapakse
Leow, Bryan Xuan Zhen
format Final Year Project
author Leow, Bryan Xuan Zhen
author_sort Leow, Bryan Xuan Zhen
title Speaker-invariant speech emotion recognition with domain adversarial training
title_sort speaker-invariant speech emotion recognition with domain adversarial training
publisher Nanyang Technological University
publishDate 2021
url https://hdl.handle.net/10356/148092
_version_ 1698713656211013632