Emotion analysis from speech
Saved in:
Main Author: Mus'ifah Amran
Other Authors: Chng Eng Siong
Format: Final Year Project
Language: English
Published: Nanyang Technological University, 2021
Subjects: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Online Access: https://hdl.handle.net/10356/153198
Institution: Nanyang Technological University
Supervisor: Chng Eng Siong (School of Computer Science and Engineering, ASESChng@ntu.edu.sg)
Degree: Bachelor of Engineering (Computer Science)
Citation: Mus'ifah Amran (2021). Emotion analysis from speech. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/153198
Description:
Speech is the first form of communication that humans use instinctively, and much of the time our emotions are expressed through it. Emotion in speech helps us form interpersonal connections, and emotions in speech are produced through specific acoustic patterns.
Speech emotion recognition systems extract these acoustic features to identify the emotions in utterances and to analyse the link between the features and their respective emotions. There are several techniques for speech emotion recognition, such as deep neural networks and hidden Markov models.
In this report, we focus on deep learning techniques that infer emotion from speech, building on models from an existing work that approaches the task as an image classification problem. We focus on three networks: AlexNet, a Fully Convolutional Network with Global Average Pooling, and a Residual Network (ResNet). As the first two networks have already been trained on the IEMOCAP corpus, ResNet is trained on it as well so that the models' performance can be compared.
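The record itself contains no code, but the image-classification framing above typically works by converting each utterance into a 2-D time-frequency representation that a CNN can consume like an image. The sketch below (the function name `log_spectrogram` and all parameter values are illustrative assumptions, not taken from the thesis) shows a minimal log-magnitude spectrogram computed with plain numpy:

```python
import numpy as np

def log_spectrogram(signal, frame_len=256, hop=128):
    """Slice the waveform into overlapping windowed frames and take the
    magnitude FFT of each frame -> a 2-D time-frequency "image"."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))   # (n_frames, frame_len//2 + 1)
    return np.log(mag + 1e-8).T                 # (freq_bins, time_frames)

# A 1-second synthetic "utterance" at 8 kHz: a 440 Hz tone plus noise.
sr = 8000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t) + 0.05 * np.random.randn(sr)
spec = log_spectrogram(wave)
print(spec.shape)  # 2-D array, ready to be fed to a CNN like an image
```

A real pipeline would more likely use mel-scaled spectrograms, but the key idea is the same: the CNNs named above see a fixed-size 2-D array rather than a raw waveform.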
The three models are then retrained on a down-sampled IEMOCAP corpus and on the THAI SER corpus. The models were evaluated using k-fold cross-validation, in line with publications using the same approach.
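As a reminder of what the k-fold evaluation above entails, each sample is held out exactly once: the data is split into k disjoint folds, and each fold in turn serves as the test set while the rest are used for training. A minimal sketch (the helper `k_fold_indices` is mine, not from the thesis):

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k disjoint folds;
    yield (train, test) index arrays with each fold held out once."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

accuracies = []
for train_idx, test_idx in k_fold_indices(100, k=5):
    # In the real experiments a model would be trained on train_idx and
    # scored on test_idx; here we record a placeholder metric instead.
    accuracies.append(len(test_idx) / 100)
print(sum(accuracies) / len(accuracies))
```

The reported cross-validation accuracy is then the mean of the per-fold scores, which is what makes results comparable across publications using the same protocol.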
The models from Ng [1] are used as a benchmark for the ResNet model implemented here. From the experiments conducted, no single model achieved high accuracy across the different corpora.
The Stability Training implemented in [1] was updated with tuning of the α-parameter and the addition of environmental noise. Of the three models, the Fully Convolutional Network achieved a 0.9% increase in accuracy over its result in [1], surpassing the benchmark accuracy of AlexNet by 0.2%.
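The general idea behind a stability-training objective of this kind is to add to the task loss an α-weighted penalty on how much the model's output changes when the input is perturbed (here, by added noise). The toy sketch below is an assumption about the shape of such an objective, not the thesis's actual implementation; the linear `model` and Gaussian perturbation merely stand in for a trained network and recorded environmental noise:

```python
import numpy as np

def stability_loss(model, x, y, alpha=0.1, noise_std=0.05, seed=0):
    """Combined objective: task loss on the clean input plus alpha times a
    stability term penalising output drift under an added-noise perturbation."""
    rng = np.random.default_rng(seed)
    x_noisy = x + rng.normal(0.0, noise_std, size=x.shape)  # noise stand-in
    clean_out = model(x)
    task = np.mean((clean_out - y) ** 2)               # toy task loss (MSE)
    stab = np.mean((clean_out - model(x_noisy)) ** 2)  # output drift term
    return task + alpha * stab

model = lambda x: 2.0 * x        # stand-in for a trained network
x = np.linspace(0.0, 1.0, 10)
y = 2.0 * x
loss = stability_loss(model, x, y, alpha=0.5)
print(loss)
```

Tuning α trades task accuracy against robustness: a larger α weights the drift term more heavily, pushing the model toward outputs that stay stable under noisy inputs.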