Real-Time End-to-End Speech Emotion Recognition with Cross-Domain Adaptation

Language resources are the main factor in the performance of deep learning models for speech emotion recognition (SER). Thai is a low-resource language with less available data than high-resource languages such as German. This paper describes a framework that uses a pretrained-model-based front-end and back-end...
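
The full description in this record names two back-end stages, convolution time reduction (CTR) and linear mean encoding transformation (LMET), but does not specify their layers. The following is a minimal sketch of one plausible reading, not the authors' implementation: a strided 1-D convolution shortens the time axis of the front-end feature frames, the remaining frames are averaged into one vector, and a linear layer maps that vector to emotion scores. The function names, kernel, stride, weights, and toy frames are all hypothetical.

```python
from typing import List

Frame = List[float]  # one feature vector per time step (assumed front-end output)


def conv_time_reduction(frames: List[Frame], kernel: List[float], stride: int) -> List[Frame]:
    """Strided 1-D convolution over time, sharing one kernel across feature dimensions.

    Assumption: this stands in for the CTR stage; the real layer is not described here.
    """
    k = len(kernel)
    reduced = []
    for start in range(0, len(frames) - k + 1, stride):
        window = frames[start:start + k]
        dim = len(window[0])
        reduced.append(
            [sum(kernel[j] * window[j][d] for j in range(k)) for d in range(dim)]
        )
    return reduced


def mean_encode(frames: List[Frame]) -> Frame:
    """Average over the time axis to get one fixed-size utterance vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def linear(vec: Frame, weights: List[Frame], bias: Frame) -> Frame:
    """Plain linear layer: one output score per emotion class."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


# Toy input: 4 time steps, 2 feature dimensions (hypothetical values).
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

reduced = conv_time_reduction(frames, kernel=[0.5, 0.5], stride=2)  # 4 frames -> 2 frames
pooled = mean_encode(reduced)                                       # 2 frames -> 1 vector
scores = linear(pooled, weights=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], bias=[0.0, 0.0, 0.0])
predicted = max(range(len(scores)), key=scores.__getitem__)         # index of the top score
```

In a real system the frames would come from a fine-tuned Wav2Vec2 or XLSR encoder and the kernel and weights would be learned; the sketch only shows how the time axis shrinks before classification.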

Bibliographic Details
Main Author: Wongpatikaseree K.
Other Authors: Mahidol University
Format: Article
Published: 2023
Subjects: Computer Science
Online Access: https://repository.li.mahidol.ac.th/handle/123456789/84256
Institution: Mahidol University
id th-mahidol.84256
record_format dspace
spelling th-mahidol.84256
2023-06-19T00:01:31Z
Real-Time End-to-End Speech Emotion Recognition with Cross-Domain Adaptation
Wongpatikaseree K.
Mahidol University
Computer Science
2023-06-18T17:01:31Z
2023-06-18T17:01:31Z
2022-09-01
Article
Big Data and Cognitive Computing Vol.6 No.3 (2022)
10.3390/bdcc6030079
25042289
2-s2.0-85138994749
https://repository.li.mahidol.ac.th/handle/123456789/84256
SCOPUS
institution Mahidol University
building Mahidol University Library
continent Asia
country Thailand
content_provider Mahidol University Library
collection Mahidol University Institutional Repository
topic Computer Science
description Language resources are the main factor in the performance of deep learning models for speech emotion recognition (SER). Thai is a low-resource language with less available data than high-resource languages such as German. This paper describes a framework that uses a pretrained-model-based front-end network and a back-end network to adapt feature spaces from the speech recognition domain to the speech emotion classification domain. The framework consists of two parts: a speech recognition front-end network and a speech emotion recognition back-end network. For speech recognition, Wav2Vec2 is the state of the art for high-resource languages, while XLSR is used for low-resource languages. Both models provide generalized end-to-end learning for speech understanding, and their feature encoders yield feature-space representations learned in the speech recognition domain. This is one reason Wav2Vec2 and XLSR were selected as the pretrained models for the front-end network. The pretrained Wav2Vec2 and XLSR models serve as front-end networks and are fine-tuned for specific languages using the Common Voice 7.0 dataset. The feature vectors produced by the front-end network are then fed to the back-end network, which comprises convolution time reduction (CTR) and linear mean encoding transformation (LMET). Experiments on two different datasets show that our proposed framework can outperform the baselines in terms of unweighted and weighted accuracy.
author2 Mahidol University
author_facet Mahidol University
Wongpatikaseree K.
format Article
author Wongpatikaseree K.
author_sort Wongpatikaseree K.
title Real-Time End-to-End Speech Emotion Recognition with Cross-Domain Adaptation
title_sort real-time end-to-end speech emotion recognition with cross-domain adaptation
publishDate 2023
url https://repository.li.mahidol.ac.th/handle/123456789/84256
_version_ 1781414573978419200
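
The description reports results as unweighted and weighted accuracies but this record does not define them. The sketch below uses the definitions common in SER work, which may differ in detail from the paper's: weighted accuracy is the overall fraction of correct predictions, and unweighted accuracy is the mean of per-class recalls, so that rare emotions count as much as frequent ones. The function names and the toy labels are hypothetical.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def weighted_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Overall accuracy: correct predictions divided by all samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recalls; every class contributes equally."""
    total = defaultdict(int)  # samples per true class
    hit = defaultdict(int)    # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    recalls = [hit[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Imbalanced toy example: a model that always predicts "neutral".
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 10

wa = weighted_accuracy(y_true, y_pred)    # 8 of 10 correct
ua = unweighted_accuracy(y_true, y_pred)  # recall 1.0 for neutral, 0.0 for angry
```

On imbalanced emotion data the two numbers can diverge sharply, as in the toy example, which is why SER results are often reported with both.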