EMOTION DIARIZATION ON INDONESIAN LANGUAGE CONVERSATION AUDIO USING SPEAKER ROLE CODE AND HYBRID RNN-CRF MODEL

Main Author: Nurul Izzah Adma, Aisyah
Format: Theses
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/68653
Institution: Institut Teknologi Bandung
id id-itb.:68653
spelling id-itb.:68653 2022-09-19T08:01:40Z
keywords emotion diarization, audio conversation, Indonesian language, speaker role code, hybrid, RNN-CRF
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
description Emotion diarization identifies emotion labels for homogeneous segments of an audio stream. It can improve existing conversational emotion recognition systems, which are limited in recognizing emotions in long audio and in segmenting that audio. An emotion diarization system can process the audio of a full-length conversation without manual segmentation and output an emotion label for each segment with its time stamps. Research on emotion diarization is still scarce, especially for conversations in Indonesian. Although the structure of a speaker diarization system can be adapted, models from speaker diarization research are relatively difficult to apply to emotion diarization because the characteristics and feature representations differ. Speaker labels are discrete and can be grouped into a discrete number of classes, whereas emotional states are more abstract and their classes need to be defined. It is therefore necessary to investigate architectures and algorithms that can handle the emotion diarization task. This thesis proposes a neural network architecture that uses segment-level and frame-level feature representations by combining an RNN-based encoder model with an RNN-CRF-based classifier model. The encoder model extracts the frame representation of each segment. The classifier model performs sequential emotion labeling on sequences of segments. The effect of adding the speaker role code to the input features, to help the model recognize emotions, is also investigated.
Experiments were carried out to select the sequential labeling algorithm, determine the segment size used for the sequential segment input, select the best features, determine the encoder algorithm, and determine how to add the speaker role code, with the aim of producing the best-performing emotion diarization system. Based on the experimental results, the proposed architecture and algorithm achieve better emotion recognition performance than the baselines. The hybrid LSTM-BiLSTM-CRF architecture with the speaker role code achieves an F1-score of 0.6318, an improvement of about 9% over the baseline algorithm and architecture in the Indonesian-language domain.
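
The abstract describes the pipeline only in outline, so the following is a minimal sketch of the steps after the encoder, not the thesis implementation. It assumes a one-hot speaker role code appended to each segment vector (the abstract does not say which method was chosen), uses made-up emission and transition scores in place of a trained BiLSTM-CRF, and assumes a fixed segment length of 2.0 seconds. All function names, role names, and emotion labels are hypothetical.

```python
from itertools import groupby


def add_role_code(segment_features, roles, role_set):
    """Append a one-hot speaker role code to each segment feature vector."""
    out = []
    for feats, role in zip(segment_features, roles):
        one_hot = [1.0 if role == r else 0.0 for r in role_set]
        out.append(list(feats) + one_hot)
    return out


def viterbi(emissions, transitions):
    """CRF-style decoding: return the highest-scoring label index sequence.

    emissions[t][k]   score of label k for segment t (e.g. from a BiLSTM)
    transitions[j][k] score of moving from label j to label k
    """
    n_labels = len(transitions)
    score = list(emissions[0])
    backptr = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for k in range(n_labels):
            best_j = max(range(n_labels),
                         key=lambda j: score[j] + transitions[j][k])
            ptr.append(best_j)
            new_score.append(score[best_j] + transitions[best_j][k]
                             + emissions[t][k])
        score = new_score
        backptr.append(ptr)
    # Trace the best path backwards from the best final label.
    path = [max(range(n_labels), key=lambda k: score[k])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def to_diarization(label_ids, labels, segment_sec):
    """Merge runs of equal labels into (start_sec, end_sec, label) tuples."""
    result, start = [], 0
    for label_id, run in groupby(label_ids):
        length = len(list(run))
        result.append((start * segment_sec,
                       (start + length) * segment_sec,
                       labels[label_id]))
        start += length
    return result


# Hypothetical data: two emotion classes, four consecutive segments.
labels = ["neutral", "angry"]
emissions = [[2.0, 0.0], [0.4, 0.6], [2.0, 0.0], [0.0, 2.0]]
transitions = [[0.5, -0.5], [-0.5, 0.5]]  # favors staying in the same label

path = viterbi(emissions, transitions)
print(to_diarization(path, labels, segment_sec=2.0))
# [(0.0, 6.0, 'neutral'), (6.0, 8.0, 'angry')]
```

In this toy input, a per-segment argmax would label the second segment "angry"; the transition scores pull it back to "neutral", which illustrates why a sequential labeler can suit segment sequences better than independent classification.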