MACHINE SPEECH CHAIN WITH EMOTION RECOGNITION

Humans communicate with appropriate emotion in speech to convey the intended meaning, so speech recognition and synthesis systems must be able to understand and express those emotions. Producing a good system requires speech data with genuine emotions, but this type of data is difficult to obtain. The machine speech chain uses unpaired data to continue training speech recognition and speech synthesis models that were previously trained on paired data. Because unpaired data is more abundant than paired data, the machine speech chain could be used to recognize emotion in speech, a task for which training data is difficult to obtain. This final project uses speech data with natural emotion and speech data with various emotions to evaluate the machine speech chain for speech emotion recognition and for speech recognition on emotional speech. Character Error Rate (CER) is used to evaluate speech recognition, while accuracy and F1 score are used to evaluate speech emotion recognition. A model trained with 50% of the paired neutral-emotion speech data and 22% of the paired non-neutral emotional speech data lowered its CER from 37.552% to 34.523% when trained further with unpaired neutral-emotion speech data, and from 37.552% to 33.749% when trained further with the combined unpaired speech data. Accuracy on non-neutral emotions increased by 2.18% to 53.51%, but the F1 scores tended to worsen, ranging from a rise of 20.6% to a decrease of 23.4%. Together, these two metrics indicate that the model is biased toward the majority class.
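The abstract reports Character Error Rate (CER) for speech recognition and accuracy with F1 for speech emotion recognition, and reads rising accuracy alongside worsening F1 as majority-class bias. The record does not include the thesis code, so the Python sketch below only illustrates the standard definitions of these metrics under that reading; the function names and toy labels are hypothetical.

# Illustrative only: standard CER and accuracy / macro-F1 definitions,
# not the evaluation code used in the thesis.

def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = character-level Levenshtein distance / number of reference characters."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
        prev = curr
    return prev[-1] / max(len(reference), 1)

def accuracy_and_macro_f1(y_true, y_pred):
    """Accuracy plus macro-averaged F1; a large gap between the two is a
    common symptom of a classifier biased toward the majority class."""
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return accuracy, sum(f1_scores) / len(f1_scores)

if __name__ == "__main__":
    # One substituted character out of four reference characters: CER = 0.25.
    print(character_error_rate("pagi", "lagi"))
    # Toy imbalanced set: predicting only "neutral" keeps accuracy at 0.8
    # while macro F1 drops to about 0.30, mirroring the bias noted above.
    y_true = ["neutral"] * 8 + ["angry", "sad"]
    y_pred = ["neutral"] * 10
    print(accuracy_and_macro_f1(y_true, y_pred))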


Bibliographic Details
Main Author: Pradia Naufal, Akeyla
Format: Final Project
Language: Indonesian
Online Access:https://digilib.itb.ac.id/gdl/view/82051
Institution: Institut Teknologi Bandung
Keywords: speech recognition, speech emotion recognition, machine speech chain, unpaired data