Speaker-invariant speech emotion recognition with domain adversarial training
Recent advances in technology have given rise to intelligent speech assistants such as Siri and Alexa. While these assistants can perform a myriad of tasks from the end user's voice commands alone, they lack the capability to recognize human emotions when formulating a response, a feature that would enable more sophisticated use of such assistants. Since a Speech Emotion Recognition (SER) system would be used by the general population, it is necessary to derive speaker-invariant representations for the SER system.

In this project, we use Domain Adversarial Training (DAT) in a deep neural network to learn representations that are invariant to speaker characteristics. DAT was originally developed for domain adaptation, in which data at training and test time come from similar but different distributions, in our case different speakers. Recognizing that speaker-invariant SER can be framed as a domain adaptation problem, we explore DAT to derive speaker-invariant representations for SER and test whether they outperform representations learned without DAT.

A DAT network for speaker-invariant emotion recognition (SIER) consists of an encoder, an emotion classifier, and a speaker classifier. By placing a Gradient Reversal Layer (GRL) between the encoder and the speaker classifier, the learned emotion representation becomes independent of the speaker. DAT encoders in the existing literature have typically been limited to 1D Convolutional Neural Network (CNN) with Recurrent Neural Network (RNN) architectures. In contrast to such architectures, which use 1D filters to learn features along a single dimension, this project investigates 2D CNN with RNN DAT encoders, which use 2D filters to learn features along two dimensions. We also investigate Log Mel Spectrogram (LMS) and Mel Frequency Cepstral Coefficient (MFCC) features for these encoders. Experimental results on the Emo-DB and RAVDESS datasets show that MFCC features with 2D CNN with RNN DAT encoders perform better than features and encoders that rely on 1D filters.
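The gradient reversal at the heart of DAT can be sketched in a few lines. This is a framework-free illustration of the idea (in practice the GRL is implemented as a custom autograd function in a deep learning framework); the function names, gradient values, and lambda value are illustrative, not taken from the thesis:

```python
def grl_forward(features):
    # Forward pass: the GRL is the identity, so the speaker
    # classifier sees the encoder's features unchanged.
    return features

def grl_backward(speaker_grad, lambd=1.0):
    # Backward pass: flip the sign (scaled by lambda) so the
    # encoder is pushed to *worsen* speaker classification,
    # which drives it toward speaker-invariant features.
    return [-lambd * g for g in speaker_grad]

# Toy update: the encoder's gradient combines the ordinary
# emotion-classification gradient with the reversed speaker gradient.
emotion_grad = [0.2, -0.1]
speaker_grad = [0.5, 0.3]
encoder_grad = [e + r for e, r in
                zip(emotion_grad, grl_backward(speaker_grad, lambd=0.5))]
```

The key property is that the emotion classifier and the speaker classifier both minimize their own losses, while the encoder receives the speaker loss gradient with its sign reversed.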
Main Author: | Leow, Bryan Xuan Zhen |
---|---|
Other Authors: | Jagath C Rajapakse |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2021 |
Subjects: | Engineering::Computer science and engineering |
Online Access: | https://hdl.handle.net/10356/148092 |
Institution: | Nanyang Technological University |
Citation: | Leow, B. X. Z. (2021). Speaker-invariant speech emotion recognition with domain adversarial training. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/148092 |
---|---|
School: | School of Computer Science and Engineering |
Other Author: | Jagath C Rajapakse (ASJagath@ntu.edu.sg) |
Degree: | Bachelor of Engineering (Computer Science) |
Subjects: | Engineering::Computer science and engineering |
Record ID: | sg-ntu-dr.10356-148092 |
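Both feature types compared in this work, Log Mel Spectrograms and MFCCs, are built on the mel frequency scale. As background, a minimal sketch of the standard HTK-style Hz-to-mel mapping (a common convention; the thesis's exact filterbank parameters are not given in this record):

```python
import math

def hz_to_mel(f_hz):
    # Standard mapping used when constructing the mel filterbanks
    # behind both LMS and MFCC features.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse mapping, used to place filterbank edges back in Hz.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

The mapping is roughly linear below 1 kHz and logarithmic above, which compresses high frequencies in a perceptually motivated way.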