EMOTION DIARIZATION ON INDONESIAN LANGUAGE CONVERSATION AUDIO USING SPEAKER ROLE CODE AND HYBRID RNN-CRF MODEL

Bibliographic Details
Main Author: Nurul Izzah Adma, Aisyah
Format: Theses
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/68653
Institution: Institut Teknologi Bandung
Description
Summary: Emotion diarization is the process of assigning emotion labels to homogeneous segments of an audio stream. It can improve existing conversational emotion recognition systems, which are limited in identifying emotions in long audio and in dividing audio into segments. An emotion diarization system can process the audio of a full-length conversation without manual segmentation and output an emotion label for each segment together with its time stamp. Research on emotion diarization is still scarce, especially for conversations in Indonesian. Although the structure of a speaker diarization system can be adapted, models from speaker diarization research are relatively difficult to apply to emotion diarization because the two tasks differ in their characteristics and feature representations. Speaker labels are discrete and can be grouped into a discrete number of classes, whereas emotional states are more abstract and their classes need to be defined. It is therefore necessary to investigate architectures and algorithms that can handle the emotion diarization task. This thesis proposes a neural network architecture that uses segment-level and frame-level feature representations by combining an RNN-based encoder model with an RNN-CRF-based classifier model. The encoder model extracts the frame-level representation of each segment. The classifier model performs sequential emotion labeling over the sequence of segments. The thesis also investigates the effect of adding a speaker role code to the input features to help the model recognize emotions.
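The two ideas at the end of the summary, appending a speaker role code to the segment features and labeling the segment sequence with a CRF, can be illustrated with a minimal sketch. It is not the thesis implementation. The role names, the one-hot encoding of the role code, and all scores below are assumptions made for illustration; in the proposed system the emission scores would come from the RNN classifier, and the abstract does not state how the role code is encoded.

```python
from typing import List, Sequence

# Hypothetical role inventory; the abstract does not list the actual roles.
ROLES = ["speaker_a", "speaker_b"]


def add_role_code(features: Sequence[float], role: str) -> List[float]:
    """Append a one-hot speaker role code to a segment feature vector."""
    code = [1.0 if role == r else 0.0 for r in ROLES]
    return list(features) + code


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label path of a linear-chain CRF.

    emissions[t][k]   : score of label k for segment t
    transitions[j][k] : score of moving from label j to label k
    """
    n_labels = len(transitions)
    score = list(emissions[0])  # best path score ending in each label
    backptr = []                # best previous label for each step and label
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for k in range(n_labels):
            best_j = max(range(n_labels),
                         key=lambda j: score[j] + transitions[j][k])
            ptr.append(best_j)
            new_score.append(score[best_j] + transitions[best_j][k]
                             + emissions[t][k])
        score = new_score
        backptr.append(ptr)
    # Trace the best path backwards from the best final label.
    path = [max(range(n_labels), key=lambda k: score[k])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Illustrative scores for 3 segments and 2 emotion classes.
emissions = [[2.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
transitions = [[1.0, -1.0], [-1.0, 1.0]]  # staying in a class is rewarded
print(add_role_code([0.5, 0.1], "speaker_b"))  # [0.5, 0.1, 0.0, 1.0]
print(viterbi(emissions, transitions))         # [0, 0, 0]
```

In this toy input, labeling each segment independently would give the path 0, 1, 0, while the transition scores lead the CRF to prefer 0, 0, 0. This is the kind of sequence-level consistency that motivates a CRF layer on top of per-segment scores.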
Experiments were carried out to select the sequential labeling algorithm, determine the segment size used for the sequential segment input, select the best-performing features, determine the encoder algorithm, and determine the method of adding the speaker role code, with the aim of producing the emotion diarization system with the best performance. Based on the experimental results, the proposed architecture and algorithm outperform the baselines in emotion recognition. The hybrid LSTM-BiLSTM-CRF architecture with the added speaker role code achieves an F1-score of 0.6318, an improvement of about 9% over the baseline algorithm and architecture in the Indonesian language domain.
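The diarization output described in the summary, an emotion label for each segment at a specific time stamp, can be sketched as follows. This is a minimal illustration that assumes fixed-size, non-overlapping segments. The abstract does not state whether segments overlap, and the function name, label names, and 2-second segment size are hypothetical.

```python
from typing import List, Tuple


def labels_to_timeline(labels: List[str],
                       segment_seconds: float) -> List[Tuple[float, float, str]]:
    """Turn per-segment labels into (start, end, label) entries,
    merging consecutive segments that share the same label."""
    timeline: List[Tuple[float, float, str]] = []
    for i, label in enumerate(labels):
        start = i * segment_seconds
        end = start + segment_seconds
        if timeline and timeline[-1][2] == label:
            # Extend the previous entry instead of starting a new one.
            timeline[-1] = (timeline[-1][0], end, label)
        else:
            timeline.append((start, end, label))
    return timeline


# Four 2-second segments (illustrative labels and segment size).
for entry in labels_to_timeline(["neutral", "neutral", "angry", "neutral"], 2.0):
    print(entry)
```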