CONVERSATIONAL SPEECH EMOTION RECOGNITION FROM INDONESIAN SPOKEN LANGUAGE USING RECURRENT NEURAL NETWORK BASED MODEL


Bibliographic Details
Main Author: Nurul Izzah Adma, Aisyah
Format: Final Project
Language: Indonesian
Online Access:https://digilib.itb.ac.id/gdl/view/56346
Institution: Institut Teknologi Bandung
Description
Summary: In human interaction, emotion plays a fundamental role in shaping the information conveyed. Existing studies on Indonesian emotion recognition systems model emotion at the utterance level, treating utterances as independent entities. In natural conversation, however, the relations among utterances affect the emotional context: humans recognize emotion across sequences of consecutive utterances (conversations), which may contain emotional changes or transitions. Therefore, an experiment was carried out to build a conversational emotion recognition system for Indonesian. Building such a system requires a conversational emotion corpus, and no suitable corpus for conversation-based modeling was yet available. In this study, a new emotion corpus was built from data acquired from 46 podcast shows. The corpus consists of 2003 conversations and 10822 utterances, each labeled with one of 6 emotion classes: happy, sad, angry, disgusted, afraid, and surprised. The conversational emotion recognition system was built through experiments with a Recurrent Neural Network (RNN) to capture information across consecutive utterances. Learning is carried out on acoustic and lexical features, and the experiments searched for the combination of features and modeling techniques that yields the best-performing model, evaluated on emotion recognition over conversation data. The feature-level context-dependent combined model, built from the combination of acoustic and lexical features, performs best, with an F-measure of 0.5817 for 6 emotion classes and 0.7252 for 4 emotion classes.
The decision-level context-dependent combined model gives an F-measure of 0.5578 for 6 emotion classes and 0.6924 for 4 emotion classes. In addition, the experiments produced feature-level and decision-level context-independent combined models, context-independent and context-dependent acoustic models, and context-independent and context-dependent lexical models, each for both the 6-class and the 4-class setting.
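The two fusion strategies described above can be sketched in miniature. This is an illustrative toy, not the thesis's actual architecture: the feature dimensions, weight initialization, and the `TinyRNN`, `concat_fusion`, and `decision_fusion` names are all assumptions made for the example. It shows only the structural idea — feature-level fusion concatenates the acoustic and lexical vectors of each utterance before a single context-dependent RNN, while decision-level fusion combines the per-class scores of separate models — with the hidden state carried across the utterances of one conversation.

```python
import math
import random

def concat_fusion(acoustic, lexical):
    # Feature-level fusion: concatenate each utterance's acoustic and
    # lexical feature vectors into one input vector.
    return [a + l for a, l in zip(acoustic, lexical)]

def decision_fusion(scores_a, scores_l):
    # Decision-level fusion: average the per-utterance class scores
    # produced by separate acoustic and lexical models.
    return [[(a + b) / 2 for a, b in zip(sa, sl)]
            for sa, sl in zip(scores_a, scores_l)]

class TinyRNN:
    """Minimal Elman RNN (toy): the hidden state carried across
    utterances is what makes the model context-dependent."""
    def __init__(self, in_dim, hid_dim, n_classes, seed=0):
        rng = random.Random(seed)
        init = lambda rows, cols: [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                                   for _ in range(rows)]
        self.Wxh = init(hid_dim, in_dim)      # input -> hidden
        self.Whh = init(hid_dim, hid_dim)     # hidden -> hidden (context)
        self.Why = init(n_classes, hid_dim)   # hidden -> class scores

    def forward(self, utterances):
        h = [0.0] * len(self.Whh)
        scores = []
        for x in utterances:
            # h_t = tanh(Wxh x_t + Whh h_{t-1})
            h = [math.tanh(sum(w * xi for w, xi in zip(row_x, x)) +
                           sum(w * hi for w, hi in zip(row_h, h)))
                 for row_x, row_h in zip(self.Wxh, self.Whh)]
            # one score per emotion class, per utterance
            scores.append([sum(w * hi for w, hi in zip(row, h))
                           for row in self.Why])
        return scores

# Toy conversation: 3 utterances with 4-dim acoustic and 3-dim lexical
# features (real feature dimensions would be far larger).
acoustic = [[0.2, 0.1, 0.0, 0.5], [0.1, 0.3, 0.2, 0.4], [0.0, 0.2, 0.1, 0.6]]
lexical  = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

fused = concat_fusion(acoustic, lexical)          # feature-level fusion
model = TinyRNN(in_dim=7, hid_dim=8, n_classes=6)  # 6 emotion classes
scores = model.forward(fused)                      # one 6-score vector per utterance
```

A decision-level variant would instead train one `TinyRNN` on the acoustic features and another on the lexical features, then merge their outputs with `decision_fusion`.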