I-VECTOR AUTOMATIC SPEAKER RECOGNITION DEVELOPMENT USING ARTIFICIAL NEURAL NETWORK FOR INDONESIAN SPEAKER DATABASE

The human voice is a biometric key that carries various kinds of information: age, language, gender, and even the speaker's emotion, feeling, or intent. Based on this information, we can build a system that identifies the characteristics of a speaker from their voice, commonly called a speaker recognition system. An automatic speaker recognition system takes its name from its ability to distinguish a speaker's vocal characteristics on its own, regardless of what is spoken (the utterance). This technology is commonly found and used in a wide range of applications. Nevertheless, there is still ample room for advancement in this area, especially with the rise of neural networks and deep learning for model building and prediction. Many studies report strong results when conventional methods, including several feature extraction and processing techniques, are combined with a machine learning stage, leading to higher accuracy and better predictions thanks to more robust models on highly variable data. This research builds on a previous state-of-the-art speaker recognition pipeline that uses 20 extracted MFCC features and a Gaussian Mixture Model (GMM)-Universal Background Model (UBM), whose supervectors are used to train a total variability matrix (TV matrix) for i-vector modeling and feature extraction; this pipeline is already known to perform very well in speaker recognition. From there, this research uses the i-vector features as the input to the neural network we built. The proposed method was chosen for its ability to produce models with better nonlinearity and prediction performance than conventional statistical models. We optimize the hyperparameter configuration of the deep learning stage: the number of hidden layers, number of nodes, batch size, and number of training epochs. The database used in this research consists of Indonesian speakers with some channel and scenario variability. The results show a significant performance boost in speaker prediction when neural network models are used. With 100-dimensional i-vectors derived from 32 Gaussian components fed into the neural network, a combination of reducing the number of hidden layers and increasing the number of nodes yielded a significant increase in accuracy. On the baseline condition, the equal error rate improved from 7.57% without the neural network model to 5.26% with the new method. A hyperparameter configuration of 2 hidden layers with 1024 nodes, ReLU activation, and a dropout ratio of 0.5 gave a model accuracy of 97.83%, with the validation accuracy on the test data converging without overfitting.

Bibliographic Details
Main Author: Paramarta Saniskara, Gumilang
Format: Theses
Language: Indonesia
Online Access:https://digilib.itb.ac.id/gdl/view/70769
Institution: Institut Teknologi Bandung
Language: Indonesia
id id-itb.:70769
spelling id-itb.:70769 2023-01-20T13:29:50Z I-VECTOR AUTOMATIC SPEAKER RECOGNITION DEVELOPMENT USING ARTIFICIAL NEURAL NETWORK FOR INDONESIAN SPEAKER DATABASE Paramarta Saniskara, Gumilang Indonesia Theses automatic speaker recognition, i-vector, artificial neural network, indonesian language INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/70769 The human voice is a biometric key that carries various kinds of information: age, language, gender, and even the speaker's emotion, feeling, or intent. Based on this information, we can build a system that identifies the characteristics of a speaker from their voice, commonly called a speaker recognition system. An automatic speaker recognition system takes its name from its ability to distinguish a speaker's vocal characteristics on its own, regardless of what is spoken (the utterance). This technology is commonly found and used in a wide range of applications. Nevertheless, there is still ample room for advancement in this area, especially with the rise of neural networks and deep learning for model building and prediction. Many studies report strong results when conventional methods, including several feature extraction and processing techniques, are combined with a machine learning stage, leading to higher accuracy and better predictions thanks to more robust models on highly variable data. This research builds on a previous state-of-the-art speaker recognition pipeline that uses 20 extracted MFCC features and a Gaussian Mixture Model (GMM)-Universal Background Model (UBM), whose supervectors are used to train a total variability matrix (TV matrix) for i-vector modeling and feature extraction; this pipeline is already known to perform very well in speaker recognition. From there, this research uses the i-vector features as the input to the neural network we built. The proposed method was chosen for its ability to produce models with better nonlinearity and prediction performance than conventional statistical models. We optimize the hyperparameter configuration of the deep learning stage: the number of hidden layers, number of nodes, batch size, and number of training epochs. The database used in this research consists of Indonesian speakers with some channel and scenario variability. The results show a significant performance boost in speaker prediction when neural network models are used. With 100-dimensional i-vectors derived from 32 Gaussian components fed into the neural network, a combination of reducing the number of hidden layers and increasing the number of nodes yielded a significant increase in accuracy. On the baseline condition, the equal error rate improved from 7.57% without the neural network model to 5.26% with the new method. A hyperparameter configuration of 2 hidden layers with 1024 nodes, ReLU activation, and a dropout ratio of 0.5 gave a model accuracy of 97.83%, with the validation accuracy on the test data converging without overfitting. text
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description The human voice is a biometric key that carries various kinds of information: age, language, gender, and even the speaker's emotion, feeling, or intent. Based on this information, we can build a system that identifies the characteristics of a speaker from their voice, commonly called a speaker recognition system. An automatic speaker recognition system takes its name from its ability to distinguish a speaker's vocal characteristics on its own, regardless of what is spoken (the utterance). This technology is commonly found and used in a wide range of applications. Nevertheless, there is still ample room for advancement in this area, especially with the rise of neural networks and deep learning for model building and prediction. Many studies report strong results when conventional methods, including several feature extraction and processing techniques, are combined with a machine learning stage, leading to higher accuracy and better predictions thanks to more robust models on highly variable data. This research builds on a previous state-of-the-art speaker recognition pipeline that uses 20 extracted MFCC features and a Gaussian Mixture Model (GMM)-Universal Background Model (UBM), whose supervectors are used to train a total variability matrix (TV matrix) for i-vector modeling and feature extraction; this pipeline is already known to perform very well in speaker recognition. From there, this research uses the i-vector features as the input to the neural network we built. The proposed method was chosen for its ability to produce models with better nonlinearity and prediction performance than conventional statistical models. We optimize the hyperparameter configuration of the deep learning stage: the number of hidden layers, number of nodes, batch size, and number of training epochs. The database used in this research consists of Indonesian speakers with some channel and scenario variability. The results show a significant performance boost in speaker prediction when neural network models are used. With 100-dimensional i-vectors derived from 32 Gaussian components fed into the neural network, a combination of reducing the number of hidden layers and increasing the number of nodes yielded a significant increase in accuracy. On the baseline condition, the equal error rate improved from 7.57% without the neural network model to 5.26% with the new method. A hyperparameter configuration of 2 hidden layers with 1024 nodes, ReLU activation, and a dropout ratio of 0.5 gave a model accuracy of 97.83%, with the validation accuracy on the test data converging without overfitting.
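The abstract does not include code, but the front end it describes (20 MFCC features per frame and a 32-component GMM serving as the universal background model) can be sketched roughly as follows. This is a minimal illustration under assumed settings, not the thesis implementation: the file list, sampling rate, and GMM options beyond the component count are placeholders, and the total-variability (TV) matrix training and i-vector extraction that follow the UBM are not shown.

import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(path, n_mfcc=20):
    # Load one utterance and return an (n_frames, n_mfcc) MFCC matrix.
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, n_frames)
    return mfcc.T

# Hypothetical training utterances; pool all frames and fit the 32-Gaussian UBM.
wav_paths = ["speaker01_utt01.wav", "speaker01_utt02.wav"]
frames = np.vstack([extract_mfcc(p) for p in wav_paths])
ubm = GaussianMixture(n_components=32, covariance_type="diag", max_iter=200)
ubm.fit(frames)

Given pre-computed 100-dimensional i-vectors, the classifier configuration reported in the abstract (2 hidden layers of 1024 nodes, ReLU activation, dropout ratio 0.5) could look roughly like the sketch below. The optimizer, loss, number of speakers, batch size, and epoch count are assumptions, since the abstract only names batch size and training epochs as tuned hyperparameters; the i-vector and label arrays are random stand-ins.

import numpy as np
import tensorflow as tf

IVECTOR_DIM = 100   # i-vector dimensionality stated in the abstract
NUM_SPEAKERS = 50   # placeholder; set to the number of enrolled speakers

# Two hidden layers of 1024 ReLU units with 0.5 dropout, softmax over speakers.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(IVECTOR_DIM,)),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_SPEAKERS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Placeholder arrays standing in for real i-vectors and integer speaker labels.
ivectors = np.random.randn(1000, IVECTOR_DIM).astype("float32")
speaker_ids = np.random.randint(0, NUM_SPEAKERS, size=1000)
model.fit(ivectors, speaker_ids, batch_size=32, epochs=10, validation_split=0.2)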
format Theses
author Paramarta Saniskara, Gumilang
spellingShingle Paramarta Saniskara, Gumilang
I-VECTOR AUTOMATIC SPEAKER RECOGNITION DEVELOPMENT USING ARTIFICIAL NEURAL NETWORK FOR INDONESIAN SPEAKER DATABASE
author_facet Paramarta Saniskara, Gumilang
author_sort Paramarta Saniskara, Gumilang
title I-VECTOR AUTOMATIC SPEAKER RECOGNITION DEVELOPMENT USING ARTIFICIAL NEURAL NETWORK FOR INDONESIAN SPEAKER DATABASE
title_short I-VECTOR AUTOMATIC SPEAKER RECOGNITION DEVELOPMENT USING ARTIFICIAL NEURAL NETWORK FOR INDONESIAN SPEAKER DATABASE
title_full I-VECTOR AUTOMATIC SPEAKER RECOGNITION DEVELOPMENT USING ARTIFICIAL NEURAL NETWORK FOR INDONESIAN SPEAKER DATABASE
title_fullStr I-VECTOR AUTOMATIC SPEAKER RECOGNITION DEVELOPMENT USING ARTIFICIAL NEURAL NETWORK FOR INDONESIAN SPEAKER DATABASE
title_full_unstemmed I-VECTOR AUTOMATIC SPEAKER RECOGNITION DEVELOPMENT USING ARTIFICIAL NEURAL NETWORK FOR INDONESIAN SPEAKER DATABASE
title_sort i-vector automatic speaker recognition development using artificial neural network for indonesian speaker database
url https://digilib.itb.ac.id/gdl/view/70769
_version_ 1822006404707254272