Emotion analysis from speech

Bibliographic Details
Main Author: Mus'ifah Amran
Other Authors: Chng Eng Siong
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2021
Online Access: https://hdl.handle.net/10356/153198
Institution: Nanyang Technological University
Description
Summary: Speech is the first form of communication that humans use instinctively, and much of the time our emotions are expressed through it. Emotion in speech helps us form interpersonal connections, and emotions are conveyed through specific acoustic patterns. Speech emotion recognition systems extract these acoustic features to identify emotions in utterances and analyse the link between the features and their respective emotions. Different techniques exist for speech emotion recognition, such as deep neural networks and Hidden Markov models. In this report, we focus on deep learning techniques to infer emotion from speech, using models from an existing work that approaches the task as an image classification problem. We focus on three networks: AlexNet, a Fully Convolutional Network with Global Average Pooling, and a Residual Network (ResNet). As the first two networks had previously been trained on the IEMOCAP corpus, the ResNet is also trained on it so that the three models' performance can be compared. The three models are then trained again on a down-sampled IEMOCAP corpus and on the THAI SER corpus. The models were evaluated using k-fold cross-validation, in line with publications using the same approach. The models from Ng [1] are used as a benchmark for the ResNet model implemented here. From the experiments conducted, no single model achieved high accuracy across the different corpora. The Stability Training implemented in [1] was updated with tuning of the α-parameter and the addition of environmental noise. Of the three models, the Fully Convolutional Network achieved a 0.9% increase in accuracy over its result in [1], surpassing the benchmark accuracy of AlexNet by 0.2%.
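The Stability Training objective mentioned in the summary can be sketched as a task loss on the clean input plus an α-weighted consistency term between the model's outputs on the clean and noise-perturbed inputs. The sketch below is illustrative only: the function names are ours, and the squared-distance stability term is an assumption (the report's implementation may use a different distance, such as KL divergence).

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a logit vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def stability_training_loss(logits_clean, logits_noisy, label, alpha=0.5):
    """Sketch of a stability-training objective:
    cross-entropy on the clean input plus alpha times a penalty
    on the distance between clean and noisy output distributions.
    `alpha` here corresponds to the tuned α-parameter; its value
    is a placeholder, not the one used in the report."""
    p_clean = softmax(logits_clean)
    p_noisy = softmax(logits_noisy)
    task_loss = -np.log(p_clean[label])            # classification loss (clean input)
    stab_loss = np.sum((p_clean - p_noisy) ** 2)   # output-consistency penalty
    return task_loss + alpha * stab_loss
```

With identical clean and noisy logits the stability term vanishes and the loss reduces to the plain cross-entropy; adding environment noise to the input perturbs the logits and makes the penalty grow, which is what drives the network toward noise-robust predictions.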