Emotion analysis from speech

Speech is the first form of communication that humans use instinctively, and our emotions are often expressed through it. Emotion in speech helps us form interpersonal connections, and emotions are conveyed through specific acoustic patterns. Speech emotion recognition systems extract those acoustic features to identify emotions in utterances and analyse the link between the features and their respective emotions. Various techniques exist for speech emotion recognition, such as deep neural networks and hidden Markov models. In this report, we focus on deep learning techniques that infer emotion from speech, using models from an existing work that approaches the task as an image classification problem. We focus on three networks: AlexNet, a Fully Convolutional Network with Global Average Pooling, and a Residual Network (ResNet). As the first two networks were previously trained on the IEMOCAP corpus, ResNet is also trained on it to compare the models' performance. The three models are then retrained on a down-sampled IEMOCAP corpus and the THAI SER corpus. The models were evaluated using k-fold cross-validation, in line with publications using the same approach. The models from Ng [1] are used as a benchmark for the ResNet model implemented here. From the experiments conducted, no single model achieved high accuracy across the different corpora. The Stability Training implemented in [1] was updated with tuning of the α parameter and the addition of environmental noise. Of the three models, the Fully Convolutional Network achieved a 0.9% increase in accuracy over its result in [1], surpassing the benchmark accuracy of AlexNet by 0.2%.
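The abstract's approach of treating speech emotion recognition as image classification can be sketched roughly: convert the waveform into a log-spectrogram "image" and feed that 2-D array to a CNN such as AlexNet, FCN, or ResNet. The sketch below is illustrative only and is not the report's implementation; the function name is hypothetical, and crude frequency-band pooling stands in for a proper mel filter bank (which a real pipeline would build with a library such as librosa).

```python
import numpy as np

def log_spectrogram_image(signal, frame_len=256, hop=128, n_bins=40):
    """Frame the waveform, take a magnitude FFT per frame, pool the
    linear frequency bins into n_bins bands, and take the log.
    Returns a (n_bins, n_frames) array usable as a CNN input "image"."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hanning(frame_len)
        mag = np.abs(np.fft.rfft(frame))       # magnitude spectrum
        bands = np.array_split(mag, n_bins)    # crude stand-in for mel bands
        frames.append([b.mean() for b in bands])
    return np.log(np.array(frames).T + 1e-8)   # log compresses dynamic range

# one second of a synthetic 8 kHz tone in place of a real utterance
sig = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
img = log_spectrogram_image(sig)
print(img.shape)  # (40, 61): the "image" a CNN classifier would consume
```

In the actual work, such spectrogram images are what AlexNet-, FCN-, and ResNet-style networks classify into emotion labels.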


Bibliographic Details
Main Author: Mus'ifah Amran
Other Authors: Chng Eng Siong
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2021
Subjects:
Online Access:https://hdl.handle.net/10356/153198
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-153198
record_format dspace
spelling sg-ntu-dr.10356-1531982021-11-16T05:15:00Z Emotion analysis from speech Mus'ifah Amran Chng Eng Siong School of Computer Science and Engineering ASESChng@ntu.edu.sg Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence Speech is the first form of communication that humans use instinctively, and our emotions are often expressed through it. Emotion in speech helps us form interpersonal connections, and emotions are conveyed through specific acoustic patterns. Speech emotion recognition systems extract those acoustic features to identify emotions in utterances and analyse the link between the features and their respective emotions. Various techniques exist for speech emotion recognition, such as deep neural networks and hidden Markov models. In this report, we focus on deep learning techniques that infer emotion from speech, using models from an existing work that approaches the task as an image classification problem. We focus on three networks: AlexNet, a Fully Convolutional Network with Global Average Pooling, and a Residual Network (ResNet). As the first two networks were previously trained on the IEMOCAP corpus, ResNet is also trained on it to compare the models' performance. The three models are then retrained on a down-sampled IEMOCAP corpus and the THAI SER corpus. The models were evaluated using k-fold cross-validation, in line with publications using the same approach. The models from Ng [1] are used as a benchmark for the ResNet model implemented here. From the experiments conducted, no single model achieved high accuracy across the different corpora. The Stability Training implemented in [1] was updated with tuning of the α parameter and the addition of environmental noise. Of the three models, the Fully Convolutional Network achieved a 0.9% increase in accuracy over its result in [1], surpassing the benchmark accuracy of AlexNet by 0.2%.
Bachelor of Engineering (Computer Science) 2021-11-16T03:26:08Z 2021-11-16T03:26:08Z 2021 Final Year Project (FYP) Mus'ifah Amran (2021). Emotion analysis from speech. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/153198 https://hdl.handle.net/10356/153198 en application/pdf Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
spellingShingle Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Mus'ifah Amran
Emotion analysis from speech
description Speech is the first form of communication that humans use instinctively, and our emotions are often expressed through it. Emotion in speech helps us form interpersonal connections, and emotions are conveyed through specific acoustic patterns. Speech emotion recognition systems extract those acoustic features to identify emotions in utterances and analyse the link between the features and their respective emotions. Various techniques exist for speech emotion recognition, such as deep neural networks and hidden Markov models. In this report, we focus on deep learning techniques that infer emotion from speech, using models from an existing work that approaches the task as an image classification problem. We focus on three networks: AlexNet, a Fully Convolutional Network with Global Average Pooling, and a Residual Network (ResNet). As the first two networks were previously trained on the IEMOCAP corpus, ResNet is also trained on it to compare the models' performance. The three models are then retrained on a down-sampled IEMOCAP corpus and the THAI SER corpus. The models were evaluated using k-fold cross-validation, in line with publications using the same approach. The models from Ng [1] are used as a benchmark for the ResNet model implemented here. From the experiments conducted, no single model achieved high accuracy across the different corpora. The Stability Training implemented in [1] was updated with tuning of the α parameter and the addition of environmental noise. Of the three models, the Fully Convolutional Network achieved a 0.9% increase in accuracy over its result in [1], surpassing the benchmark accuracy of AlexNet by 0.2%.
author2 Chng Eng Siong
author_facet Chng Eng Siong
Mus'ifah Amran
format Final Year Project
author Mus'ifah Amran
author_sort Mus'ifah Amran
title Emotion analysis from speech
title_short Emotion analysis from speech
title_full Emotion analysis from speech
title_fullStr Emotion analysis from speech
title_full_unstemmed Emotion analysis from speech
title_sort emotion analysis from speech
publisher Nanyang Technological University
publishDate 2021
url https://hdl.handle.net/10356/153198
_version_ 1718368031460032512