EMOTION DIARIZATION ON INDONESIAN LANGUAGE CONVERSATION AUDIO USING SPEAKER ROLE CODE AND HYBRID RNN-CRF MODEL

Emotion diarization is the process of assigning emotion labels to homogeneous segments of an audio stream. It can improve conversational emotion recognition systems that have already been developed. The existing conversational emotion recognition system has limitations in identifying emo...

Bibliographic Details
Main Author:	Nurul Izzah Adma, Aisyah
Format:	Theses
Language:	Indonesia
Online Access:	https://digilib.itb.ac.id/gdl/view/68653
Institution:	Institut Teknologi Bandung

id	id-itb.:68653
spelling	id-itb.:68653; 2022-09-19T08:01:40Z; EMOTION DIARIZATION ON INDONESIAN LANGUAGE CONVERSATION AUDIO USING SPEAKER ROLE CODE AND HYBRID RNN-CRF MODEL; Nurul Izzah Adma, Aisyah; Indonesia; Theses; emotion diarization, audio conversation, Indonesian language, speaker role code, hybrid, RNN-CRF; INSTITUT TEKNOLOGI BANDUNG; https://digilib.itb.ac.id/gdl/view/68653; text
institution	Institut Teknologi Bandung
building	Institut Teknologi Bandung Library
continent	Asia
country	Indonesia
content_provider	Institut Teknologi Bandung
collection	Digital ITB
language	Indonesia
description	Emotion diarization is the process of assigning emotion labels to homogeneous segments of an audio stream. It can improve conversational emotion recognition systems that have already been developed. Existing conversational emotion recognition systems have limitations in identifying emotions in long audio and in segmenting the audio. An emotion diarization system can process the audio of a full-length conversation without manual segmentation and output an emotion label for each segment at a specific timestamp. Research on emotion diarization is still limited, especially for conversations in Indonesian. Although the structure of a speaker diarization system can be adapted, models from speaker diarization research are relatively difficult to apply to emotion diarization because of differences in characteristics and feature representations. Speaker labels are discrete and can be grouped into a discrete number of classes, whereas emotional states are more abstract and their classes need to be defined. It is therefore necessary to investigate architectures and algorithms that can handle the emotion diarization task. This thesis proposes a neural network architecture that uses segment-level and frame-level feature representations by combining an RNN-based encoder model with an RNN-CRF-based classifier model. The encoder model extracts the frame representation of each segment. The classifier model performs sequential emotion labeling over the sequence of segments. The thesis also investigates the effect of adding a speaker role code to the input features to help the model recognize emotions.
Experiments were carried out to select the sequential labeling algorithm, determine the segment size used for the sequential segment input, select the best-performing features, choose the encoder algorithm, and determine how to add the speaker role code, with the aim of producing the emotion diarization system with the best performance. Based on the experimental results, the proposed architecture and algorithm give the best emotion recognition performance compared with the baselines. The hybrid LSTM-BiLSTM-CRF architecture with the speaker role code achieves an F1-score of 0.6318, an improvement of about 9% over the baseline algorithm and architecture in the Indonesian-language domain.
format	Theses
author	Nurul Izzah Adma, Aisyah
title	EMOTION DIARIZATION ON INDONESIAN LANGUAGE CONVERSATION AUDIO USING SPEAKER ROLE CODE AND HYBRID RNN-CRF MODEL
url	https://digilib.itb.ac.id/gdl/view/68653
_version_	1822933712806871040
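
The description in this record mentions a CRF-based classifier that labels a sequence of segments and an output of emotion labels at specific timestamps. The record does not give the thesis implementation, so the following is only a minimal sketch of that final step under stated assumptions: per-segment label scores (as a BiLSTM might produce) are decoded with standard Viterbi decoding for a linear-chain CRF, and consecutive segments with the same label are merged into time-stamped spans. The label names, the scores, the transition values, and the 2-second segment length are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label sequence for a linear-chain CRF.

    emissions[t][k]   : score of label k for segment t (e.g. from a BiLSTM).
    transitions[j][k] : score of moving from label j to label k.
    """
    if not emissions:
        return []
    n_labels = len(emissions[0])
    score = list(emissions[0])          # best score ending in each label so far
    backptr: List[List[int]] = []       # best previous label, per step and label
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for k in range(n_labels):
            best_j = max(range(n_labels),
                         key=lambda j: score[j] + transitions[j][k])
            ptr.append(best_j)
            new_score.append(score[best_j] + transitions[best_j][k]
                             + emissions[t][k])
        score = new_score
        backptr.append(ptr)
    # Follow the back-pointers from the best final label.
    best = max(range(n_labels), key=lambda k: score[k])
    path = [best]
    for ptr in reversed(backptr):
        best = ptr[best]
        path.append(best)
    path.reverse()
    return path


def merge_segments(labels: Sequence[int], names: Sequence[str],
                   segment_seconds: float) -> List[Tuple[float, float, str]]:
    """Merge runs of equal labels into (start, end, emotion) spans."""
    spans: List[Tuple[float, float, str]] = []
    for i, lab in enumerate(labels):
        start, end = i * segment_seconds, (i + 1) * segment_seconds
        if spans and spans[-1][2] == names[lab]:
            spans[-1] = (spans[-1][0], end, names[lab])   # extend current span
        else:
            spans.append((start, end, names[lab]))
    return spans


# Hypothetical example: two emotion classes, five 2-second segments.
names = ["neutral", "angry"]
emissions = [[2.0, 0.0], [0.4, 0.6], [2.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
transitions = [[0.5, -0.5], [-0.5, 0.5]]   # assumed: staying in a label is favoured

greedy = [max(range(len(row)), key=row.__getitem__) for row in emissions]
path = viterbi_decode(emissions, transitions)
spans = merge_segments(path, names, segment_seconds=2.0)
```

With these made-up numbers, picking the best label for each segment independently gives `[0, 1, 0, 1, 1]`, while Viterbi decoding gives `[0, 0, 0, 1, 1]` because the transition scores discourage the brief switch at the second segment. The merged output is `[(0.0, 6.0, "neutral"), (6.0, 10.0, "angry")]`. This illustrates why a sequence-level decoder can suit diarization, where labels are expected to stay stable over neighbouring segments.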

Similar Items