End-to-End Speech Emotion Recognition Using Multi-Scale Convolution Networks

Automatic speech emotion recognition is one of the challenging tasks in the machine learning community, mainly due to the significant variations across individuals when expressing the same emotion cue. The success of emotion recognition with machine learning techniques primarily depends on the feature s...

Bibliographic Details
Main Authors:	Sivanagaraja, Tatinati; Ho, Mun Kit; Khong, Andy Wai Hoong; Wang, Yubo
Other Authors:	School of Electrical and Electronic Engineering
Format:	Conference or Workshop Item
Language:	English
Published:	2018
Subjects:	Machine Learning; Emotion Recognition
Online Access:	https://hdl.handle.net/10356/88357 http://hdl.handle.net/10220/44716
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-88357
record_format	dspace
spelling	sg-ntu-dr.10356-88357 2020-03-07T13:24:45Z End-to-End Speech Emotion Recognition Using Multi-Scale Convolution Networks Sivanagaraja, Tatinati Ho, Mun Kit Khong, Andy Wai Hoong Wang, Yubo School of Electrical and Electronic Engineering 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) Machine Learning Emotion Recognition Automatic speech emotion recognition is one of the challenging tasks in the machine learning community, mainly due to the significant variations across individuals when expressing the same emotion cue. The success of emotion recognition with machine learning techniques primarily depends on the feature set chosen to learn. Formulating appropriate features that cater for all variations in emotion cues, however, is not a trivial task. Recent works on emotion recognition with deep learning techniques thus focus on the end-to-end learning scheme, which identifies the features directly from the raw speech signal instead of relying on a hand-crafted feature set. Existing methods in this scheme, however, did not take into account the fact that speech signals often exhibit distinct features at different time scales and frequencies than in the raw form. We propose the multi-scale convolution neural network (MCNN) to identify features at different time scales and frequencies from raw speech signals. This end-to-end model leverages the multi-branch input layer and tunable convolution layers to learn the identified features, which are subsequently employed to recognize the emotion cues accordingly. As a proof-of-concept, the MCNN method with a fixed transformation stage is evaluated using the SAVEE emotion database. Results showed that MCNN improves the emotion recognition performance when compared to existing methods, which underpins the necessity of learning features at different time scales.
NRF (Natl Research Foundation, S’pore) Accepted version 2018-04-25T06:35:53Z 2019-12-06T17:01:26Z 2018-04-25T06:35:53Z 2019-12-06T17:01:26Z 2018-01-01 2017 Conference Paper Sivanagaraja, T., Ho, M. K., Khong, A. W. H., & Wang, Y. (2017). End-to-End Speech Emotion Recognition Using Multi-Scale Convolution Networks. Paper presented at 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Kuala Lumpur, Malaysia (pp. 189-192). https://hdl.handle.net/10356/88357 http://hdl.handle.net/10220/44716 10.1109/APSIPA.2017.8282026 204038 en © 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: [http://dx.doi.org/10.1109/APSIPA.2017.8282026]. 4 p. application/pdf
institution	Nanyang Technological University
building	NTU Library
country	Singapore
collection	DR-NTU
language	English
topic	Machine Learning Emotion Recognition
spellingShingle	Machine Learning Emotion Recognition Sivanagaraja, Tatinati Ho, Mun Kit Khong, Andy Wai Hoong Wang, Yubo End-to-End Speech Emotion Recognition Using Multi-Scale Convolution Networks
description	Automatic speech emotion recognition is one of the challenging tasks in the machine learning community, mainly due to the significant variations across individuals when expressing the same emotion cue. The success of emotion recognition with machine learning techniques primarily depends on the feature set chosen to learn. Formulating appropriate features that cater for all variations in emotion cues, however, is not a trivial task. Recent works on emotion recognition with deep learning techniques thus focus on the end-to-end learning scheme, which identifies the features directly from the raw speech signal instead of relying on a hand-crafted feature set. Existing methods in this scheme, however, did not take into account the fact that speech signals often exhibit distinct features at different time scales and frequencies than in the raw form. We propose the multi-scale convolution neural network (MCNN) to identify features at different time scales and frequencies from raw speech signals. This end-to-end model leverages the multi-branch input layer and tunable convolution layers to learn the identified features, which are subsequently employed to recognize the emotion cues accordingly. As a proof-of-concept, the MCNN method with a fixed transformation stage is evaluated using the SAVEE emotion database. Results showed that MCNN improves the emotion recognition performance when compared to existing methods, which underpins the necessity of learning features at different time scales.
author2	School of Electrical and Electronic Engineering
author_facet	School of Electrical and Electronic Engineering Sivanagaraja, Tatinati Ho, Mun Kit Khong, Andy Wai Hoong Wang, Yubo
format	Conference or Workshop Item
author	Sivanagaraja, Tatinati Ho, Mun Kit Khong, Andy Wai Hoong Wang, Yubo
author_sort	Sivanagaraja, Tatinati
title	End-to-End Speech Emotion Recognition Using Multi-Scale Convolution Networks
title_short	End-to-End Speech Emotion Recognition Using Multi-Scale Convolution Networks
title_full	End-to-End Speech Emotion Recognition Using Multi-Scale Convolution Networks
title_fullStr	End-to-End Speech Emotion Recognition Using Multi-Scale Convolution Networks
title_full_unstemmed	End-to-End Speech Emotion Recognition Using Multi-Scale Convolution Networks
title_sort	end-to-end speech emotion recognition using multi-scale convolution networks
publishDate	2018
url	https://hdl.handle.net/10356/88357 http://hdl.handle.net/10220/44716
_version_	1681043762937069568

Similar Items