I-VECTOR AUTOMATIC SPEAKER RECOGNITION DEVELOPMENT USING ARTIFICIAL NEURAL NETWORK FOR INDONESIAN SPEAKER DATABASE

Bibliographic Details
Main Author: Paramarta Saniskara, Gumilang
Format: Theses
Language: Indonesia
Online Access:https://digilib.itb.ac.id/gdl/view/70769
Institution: Institut Teknologi Bandung
Description
Summary: The human voice is a biometric key that can carry various kinds of information: age, language, gender, and even the speaker's emotion, feeling, or intent. Based on this information, we can build a system that identifies the characteristics of a speaker from their voice, commonly called a speaker recognition system. An automatic speaker recognition system gets its name from its ability to distinguish a speaker's tonal characteristics independently, regardless of what is being said (the utterance). This form of technology is commonly found and used in various applications. Nevertheless, the room for advancement in this area is still wide open, especially with the rise of neural networks and deep learning for model creation and prediction. Many have found surprising results when combining conventional methods, including several methods of feature extraction and processing, with a machine learning stage, leading to higher accuracy and better predictions thanks to the ability to build more robust models on data with high variability. This research builds on a previous state-of-the-art speaker recognition pipeline that uses 20 extracted MFCC features and a Gaussian Mixture Model (GMM) / Universal Background Model (UBM), whose UBM supervectors are used to build the total variability matrix (TV matrix) for i-vector modeling and feature extraction; this pipeline is already known to give superb performance for speaker recognition predictions. From there, this research uses the i-vector features as the input to the neural network we built. The proposed method was chosen for its ability to create a model with better nonlinearity and prediction performance than conventional statistical models. We optimize the hyperparameter configuration of the deep learning stage: the number of hidden layers, the number of nodes, the batch size, and the number of training epochs. The database used in this research consists of Indonesian speakers with some channel and scenario variability. Results show a significant performance boost for speaker prediction using the neural network models. With the 100-dimensional i-vector features from a 32-Gaussian UBM fed into the neural network, a combination of reducing the number of hidden layers and increasing the number of nodes gave a significant increase in accuracy. The equal error rate on the baseline condition improves from 7.57% without the neural network model to 5.26% with the new method. A hyperparameter configuration of 2 hidden layers, 1024 nodes, ReLU activation, and a dropout ratio of 0.5 gave a model accuracy of 97.83%, with the validation accuracy on the test data converging without overfitting.
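
As a rough illustration of the best-performing configuration reported in the abstract, the sketch below builds the i-vector classifier as a small feed-forward network: 100-dimensional i-vector inputs, two hidden layers of 1024 nodes with ReLU activation, and dropout of 0.5. The Keras toolchain and the number of speaker classes (NUM_SPEAKERS) are assumptions for illustration, not details taken from the thesis.

```python
# Minimal sketch (not the author's code) of the configuration described above:
# 100-dimensional i-vector inputs, 2 hidden layers of 1024 nodes, ReLU, dropout 0.5.
from tensorflow import keras
from tensorflow.keras import layers

NUM_SPEAKERS = 50   # hypothetical; depends on the Indonesian speaker database used
IVECTOR_DIM = 100   # i-vector dimensionality reported in the abstract

model = keras.Sequential([
    layers.Input(shape=(IVECTOR_DIM,)),
    layers.Dense(1024, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1024, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_SPEAKERS, activation="softmax"),  # one output per enrolled speaker
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# x_train: (n_utterances, 100) i-vectors; y_train: integer speaker labels.
# Batch size and epoch count were tuned in the thesis; values here are placeholders.
# model.fit(x_train, y_train, batch_size=32, epochs=50, validation_split=0.1)
```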