I-VECTOR AUTOMATIC SPEAKER RECOGNITION DEVELOPMENT USING ARTIFICIAL NEURAL NETWORK FOR INDONESIAN SPEAKER DATABASE
Saved in:
Main Author: | Paramarta Saniskara, Gumilang |
---|---|
Format: | Theses |
Language: | Indonesia |
Subjects: | automatic speaker recognition, i-vector, artificial neural network, indonesian language |
Online Access: | https://digilib.itb.ac.id/gdl/view/70769 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:70769 |
---|---|
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
The human voice is a biometric key that carries many kinds of information: age, language, gender, and even the speaker's emotion, feeling, or intent. Based on this information, a system can be built that identifies the characteristics of a speaker from their voice, commonly called a speaker recognition system. An automatic speaker recognition system takes its name from its ability to distinguish a speaker's tonal characteristics on its own, regardless of what is actually said (the utterance). This technology is commonly found and used in a wide range of applications. Nevertheless, the room for advancement in this area is still wide open, especially with the rise of neural networks and deep learning for model building and prediction. Many studies have reported strong results when combining conventional methods, including several feature extraction and processing techniques, with a machine learning pipeline, leading to higher accuracy and better predictions thanks to the ability to build more robust models on highly variable data.
This research starts from a previously state-of-the-art speaker recognition pipeline that uses 20 extracted MFCC features and a Gaussian Mixture Model (GMM) Universal Background Model (UBM), and that builds a total variability matrix (TV matrix) from the UBM supervectors for i-vector modeling and feature extraction, an approach already known to perform very well for speaker recognition. From there, this research uses the i-vector features as the input to the neural network we built. The proposed method was chosen for its ability to create models with better nonlinearity and prediction performance than conventional statistical models. We optimize the hyperparameter configuration of the deep learning process: the number of hidden layers, the number of nodes, the batch size, and the number of training epochs. The database used in this research consists of Indonesian speakers with some channel and scenario variability.
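For illustration only (the thesis itself does not publish code), the front end described above, 20 MFCCs per frame and a 32-component GMM-UBM, might be sketched as follows; the libraries (librosa, scikit-learn), file paths, sample rate, and training settings are assumptions, and the TV-matrix / i-vector extraction step is only indicated by a comment.

```python
# Rough sketch of an MFCC + GMM-UBM front end; all settings are illustrative.
import glob
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(path, n_mfcc=20, sr=16000):
    """Return a (num_frames, 20) matrix of MFCC features for one utterance."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # one row per frame

# Pool frames from all background utterances and fit the 32-component UBM.
ubm_frames = np.vstack([extract_mfcc(p) for p in glob.glob("ubm_data/*.wav")])
ubm = GaussianMixture(n_components=32, covariance_type="diag", max_iter=200)
ubm.fit(ubm_frames)

# Training the total variability (TV) matrix in the UBM supervector space and
# extracting a 100-dimensional i-vector per utterance is normally done with a
# dedicated i-vector toolkit and is omitted from this sketch.
```

In such a pipeline, each utterance's i-vector then becomes a single fixed-length input row for a classifier like the one sketched after the next paragraph.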
The results show a significant performance boost in speaker prediction when neural network models are used. With the 100-dimensional i-vector features, derived from 32 Gaussian components, fed into the neural network, a combination of reducing the number of hidden layers and increasing the number of nodes produced a significant increase in accuracy. On the baseline condition, the equal error rate improves from 7.57% without the neural network model to 5.26% with the new method. A hyperparameter configuration of 2 hidden layers, 1024 nodes, ReLU activation, and a dropout ratio of 0.5 gave a model accuracy of 97.83%, with the validation accuracy on the test data converging without overfitting.
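Again for illustration only, a minimal Keras sketch of the reported configuration (two hidden layers of 1024 ReLU units, dropout 0.5, 100-dimensional i-vector input) could look like the following; the number of speaker classes, the optimizer, the batch size, and the epoch count are assumptions, since the abstract does not state them.

```python
# Minimal sketch of an i-vector speaker classifier; values marked "assumption"
# are not taken from the thesis abstract.
from tensorflow import keras

IVECTOR_DIM = 100      # i-vector dimensionality reported in the abstract
NUM_SPEAKERS = 50      # assumption: number of speakers in the Indonesian database

model = keras.Sequential([
    keras.layers.Input(shape=(IVECTOR_DIM,)),
    keras.layers.Dense(1024, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1024, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(NUM_SPEAKERS, activation="softmax"),
])
model.compile(optimizer="adam",                      # assumption
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# x_train: (n_utterances, 100) i-vectors, y_train: integer speaker labels.
# history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
#                     batch_size=32, epochs=50)      # batch size/epochs: assumptions
```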
|
format |
Theses |
author |
Paramarta Saniskara, Gumilang |
title |
I-VECTOR AUTOMATIC SPEAKER RECOGNITION DEVELOPMENT USING ARTIFICIAL NEURAL NETWORK FOR INDONESIAN SPEAKER DATABASE |
url |
https://digilib.itb.ac.id/gdl/view/70769 |