EMOTION DIARIZATION ON INDONESIAN LANGUAGE CONVERSATION AUDIO USING SPEAKER ROLE CODE AND HYBRID RNN-CRF MODEL
Main Author: | |
---|---|
Format: | Theses |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/68653 |
Institution: | Institut Teknologi Bandung |
Summary: | Emotion diarization is the process of assigning emotion labels to homogeneous
segments of an audio stream. It can improve existing conversational emotion
recognition systems, which have difficulty identifying emotions in long audio
and in dividing the audio into segments. An emotion diarization system makes it
possible to process the audio of a full-length conversation without manual
segmentation and to output an emotion label for each segment, together with its
time stamp.
Research on emotion diarization is still limited, especially for conversations in
Indonesian. Although the structure of a speaker diarization system can be
adapted, models from speaker diarization research are relatively difficult to
apply to emotion diarization because the characteristics and feature
representations differ. Speaker labels are discrete and can be grouped into a
discrete number of groups, whereas emotional states are more abstract and their
classes need to be defined. It is therefore necessary to investigate architectures
and algorithms that can handle the emotion diarization task.
This thesis proposes a neural network architecture that uses segment-level and
frame-level feature representations by combining an RNN-based encoder model
with an RNN-CRF-based classifier model. The encoder model extracts the
frame-level representation of each segment. The classifier model performs
sequential emotion labeling over the sequence of segments. The effect of adding
a speaker role code to the input features, to help the model recognize emotions,
is also investigated. Experiments were carried out to select the sequential
labeling algorithm, determine the segment size used for the sequential segment
input, select the best-performing features, determine the encoder algorithm, and
determine the method of adding the speaker role code, with the aim of producing
the emotion diarization system with the best performance.
Based on the experimental results, the proposed architecture and algorithm give
the best emotion recognition performance compared with the baselines. The hybrid
LSTM-BiLSTM-CRF architecture with the addition of a speaker role code achieves
an F1-score of 0.6318. This represents an improvement of about 9% over the
baseline algorithm and architecture in the Indonesian language domain. |
---|
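
The labeling stage described in the summary can be illustrated with a small Python sketch. It is not the thesis implementation: the record does not say how the speaker role code is encoded or added, so the one-hot concatenation in `add_role_code` is an assumption, as are the role names, the two emotion classes, the scores, and the 2-second segment size used in the test values. `viterbi` is the standard decoding step for a linear-chain CRF, which in a hybrid RNN-CRF model would receive its emission scores from the recurrent layers, and `to_diarization` merges equal neighbouring labels into time-stamped spans. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def add_role_code(features: Sequence[Sequence[float]],
                  roles: Sequence[str],
                  role_names: Sequence[str]) -> List[List[float]]:
    """Append a one-hot speaker role code to each segment feature vector.

    One-hot concatenation is an assumed method, chosen for illustration only.
    """
    coded = []
    for vec, role in zip(features, roles):
        one_hot = [1.0 if role == name else 0.0 for name in role_names]
        coded.append(list(vec) + one_hot)
    return coded


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label sequence of a linear-chain CRF.

    emissions[t][j]   : score of label j at segment t
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_labels):
            # Best previous label for reaching label j at this segment.
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(range(n_labels), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


def to_diarization(labels: Sequence[int],
                   segment_seconds: float) -> List[Tuple[float, float, int]]:
    """Merge consecutive equal labels into (start, end, label) spans."""
    spans: List[Tuple[float, float, int]] = []
    for k, label in enumerate(labels):
        start, end = k * segment_seconds, (k + 1) * segment_seconds
        if spans and spans[-1][2] == label:
            spans[-1] = (spans[-1][0], end, label)
        else:
            spans.append((start, end, label))
    return spans
```

With fixed-length segments, the merged spans give the kind of output the summary describes: one emotion label per time-stamped stretch of the conversation, with no manual segmentation of the input.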