CONVERSATIONAL SPEECH EMOTION RECOGNITION FROM INDONESIAN SPOKEN LANGUAGE USING RECURRENT NEURAL NETWORK BASED MODEL
Main Author:
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/56346
Institution: Institut Teknologi Bandung
Summary: In human interaction, emotion plays a fundamental role in shaping
the information conveyed. Existing studies on Indonesian emotion recognition
model emotion at the utterance level, treating utterances as independent
entities. In reality, however, the relations among utterances affect the
emotional context: humans recognize an emotional abstraction from consecutive
utterances (termed a conversation), which may contain changes or transitions
of emotion. Therefore, an experiment was carried out to build a conversational
emotion recognition system for Indonesian.
Building such a system requires a conversational emotion corpus, but no corpus
suitable for conversation-based modeling was yet available. In this study, a
new emotion corpus was built from data acquired from 46 podcast shows. The
corpus consists of 2,003 conversations and 10,822 utterances, each labeled
with one of six emotion classes: happy, sad, angry, disgusted, afraid, and
surprised.
The conversational emotion recognition system for Indonesian was built through
experiments with the Recurrent Neural Network (RNN) algorithm, which captures
information across consecutive utterances. Learning is carried out on acoustic
features and lexical features, and the experiments search over features and
modeling techniques to produce the best-performing model.
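The abstract does not give architecture details, so the following is only a
minimal, untrained sketch of the two fusion strategies it names, assuming each
utterance is represented by fixed-size acoustic and lexical feature vectors
(the dimensions, hidden size, and simple Elman-style RNN cell here are
illustrative assumptions, not the thesis's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

N_CLASSES = 6        # happy, sad, angry, disgusted, afraid, surprised
ACOUSTIC_DIM = 40    # assumed size of per-utterance acoustic features
LEXICAL_DIM = 50     # assumed size of per-utterance lexical features
HIDDEN = 32          # assumed RNN hidden size

def init_rnn(in_dim, hidden, n_classes, rng):
    """Random parameters for a simple Elman RNN plus softmax classifier."""
    s = 0.1
    return {
        "Wx": rng.normal(0, s, (hidden, in_dim)),
        "Wh": rng.normal(0, s, (hidden, hidden)),
        "b":  np.zeros(hidden),
        "Wo": rng.normal(0, s, (n_classes, hidden)),
        "bo": np.zeros(n_classes),
    }

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(params, utterances):
    """Run the RNN over a conversation (a sequence of utterance feature
    vectors). The hidden state carries context from earlier utterances,
    which is what makes the prediction context-dependent. Returns
    per-utterance class probabilities."""
    h = np.zeros(params["Wh"].shape[0])
    probs = []
    for x in utterances:
        h = np.tanh(params["Wx"] @ x + params["Wh"] @ h + params["b"])
        probs.append(softmax(params["Wo"] @ h + params["bo"]))
    return np.stack(probs)

# A toy conversation of 5 utterances with random features.
T = 5
acoustic = rng.normal(size=(T, ACOUSTIC_DIM))
lexical = rng.normal(size=(T, LEXICAL_DIM))

# Feature-level fusion: concatenate modalities, then one shared RNN.
fused_rnn = init_rnn(ACOUSTIC_DIM + LEXICAL_DIM, HIDDEN, N_CLASSES, rng)
p_feature = rnn_forward(fused_rnn, np.concatenate([acoustic, lexical], axis=1))

# Decision-level fusion: one RNN per modality, then average the posteriors.
ac_rnn = init_rnn(ACOUSTIC_DIM, HIDDEN, N_CLASSES, rng)
lx_rnn = init_rnn(LEXICAL_DIM, HIDDEN, N_CLASSES, rng)
p_decision = 0.5 * (rnn_forward(ac_rnn, acoustic) + rnn_forward(lx_rnn, lexical))

print(p_feature.shape, p_decision.shape)  # one 6-way distribution per utterance
```

The sketch only contrasts where the two modalities meet: before the recurrent
layer (feature level) or after classification (decision level); training and
the actual feature extraction are omitted.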
The models were evaluated on recognition of emotion in conversation data. The
feature-level, context-dependent combined model, built from the combination of
acoustic and lexical features, performed best, with an F-measure of 0.5817 for
6 emotion classes and 0.7252 for 4 emotion classes. The decision-level,
context-dependent combined model gave an F-measure of 0.5578 for 6 emotion
classes and 0.6924 for 4 emotion classes. In addition, the experiments produced
feature-level and decision-level context-independent combined models, as well
as context-independent and context-dependent acoustic and lexical models, for
both the 6-class and 4-class settings.