DEVELOPMENT OF SPEAKER RECOGNITION SYSTEM IN INDONESIAN WITH DEEP NEURAL NETWORK AND VECTOR EMBEDDING
Main Author:
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/69107
Institution: Institut Teknologi Bandung
Summary: The speaker recognition system is a human biometric system that identifies a person from voice parameters by modeling the characteristics of each speaker. The i-vector model has long been regarded as the state of the art for speaker recognition. With the development of deep learning, many models are now built with deep neural networks, one of which is the x-vector model. The x-vector model is often considered to perform better than the i-vector model, although some findings suggest that it does not outperform the i-vector. The models built for the speaker recognition system in this final project are the i-vector and x-vector models. The i-vector model is trained with unsupervised learning, while the x-vector model is a discriminative model trained with supervised learning. The data used in this study are recordings collected by the author for multi-channel testing, captured with cellphones and laptops, covering 150 speakers.

To make the speaker recognition system more robust to speaker variability, data augmentation is applied to the training data. The augmentation techniques used are volume perturbation, additive white noise, pitch shifting, time stretching, and room reverberation simulation. Feature extraction uses MFCC with 60 coefficients and filterbank (Fbank) features with 40 coefficients; the features are then processed with voice activity detection (VAD) and cepstral mean and variance normalization (CMVN).
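As a rough sketch of this preprocessing stage (not the code used in the final project), the Python fragment below applies three of the augmentations listed above and then extracts 60-dimensional MFCCs with a simple energy-based VAD and per-utterance CMVN. It assumes librosa and NumPy; the file name, noise level, and VAD threshold are illustrative placeholders.

```python
# Illustrative preprocessing sketch; parameter values other than the
# 60 MFCC coefficients are placeholders, not the thesis configuration.
import numpy as np
import librosa

rng = np.random.default_rng(0)

def augment(y, sr):
    """Three of the augmentations named above: volume perturbation,
    additive white noise, and pitch shifting (time stretching and
    reverberation are omitted for brevity)."""
    y = rng.uniform(0.7, 1.3) * y                      # volume change
    y = y + rng.normal(0.0, 0.005, size=y.shape)       # white noise
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.integers(-2, 3))

def extract_features(y, sr, n_mfcc=60):
    """60-dim MFCCs, a crude energy-based VAD, then per-utterance CMVN."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    energy = librosa.feature.rms(y=y)[0]                     # (frames,)
    n = min(mfcc.shape[1], energy.shape[0])
    mfcc, energy = mfcc[:, :n], energy[:n]
    mfcc = mfcc[:, energy > 0.5 * energy.mean()]             # drop silent frames
    mu = mfcc.mean(axis=1, keepdims=True)
    sd = mfcc.std(axis=1, keepdims=True) + 1e-8
    return ((mfcc - mu) / sd).T                              # (frames, n_mfcc)

y, sr = librosa.load("speaker001_utt01.wav", sr=16000)       # hypothetical file
train_feats = extract_features(augment(y, sr), sr)
```

The 40-dimensional Fbank variant would be obtained in the same way from 40 log-mel filterbank energies instead of MFCCs.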
The i-vector system extracts 400-dimensional vectors using a GMM-UBM with 512 Gaussian components, while the x-vector system extracts vectors with a deep neural network and applies LDA to reduce the vector dimension to 200. The backend that makes the verification decision uses the PLDA method, and the evaluation metric used is the equal error rate (EER).
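The abstract does not specify the network, so the sketch below is only a minimal x-vector-style extractor in the usual form, assuming PyTorch, dilated 1-D convolutions (TDNN layers), statistics pooling, an embedding layer, and a speaker-classification head for supervised training; the layer sizes are placeholders, not the thesis configuration.

```python
# Minimal x-vector-style extractor; layer sizes are assumptions.
import torch
import torch.nn as nn

class XVector(nn.Module):
    def __init__(self, feat_dim=60, emb_dim=512, n_speakers=150):
        super().__init__()
        # Frame-level TDNN layers implemented as dilated 1-D convolutions.
        self.frame = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(), nn.BatchNorm1d(1500),
        )
        # Segment level: statistics pooling (mean + std) followed by the embedding.
        self.embedding = nn.Linear(2 * 1500, emb_dim)
        self.classifier = nn.Sequential(nn.ReLU(), nn.Linear(emb_dim, n_speakers))

    def forward(self, x):                  # x: (batch, frames, feat_dim)
        h = self.frame(x.transpose(1, 2))  # (batch, 1500, frames')
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        emb = self.embedding(stats)        # the x-vector
        return emb, self.classifier(emb)   # logits are only used during training

model = XVector()
emb, logits = model(torch.randn(4, 300, 60))  # 4 utterances, 300 frames of 60 MFCCs
```

During training the classifier is optimized over the 150 speakers; at test time only the embedding is kept, and it is this embedding that is then reduced with LDA and scored with PLDA.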
The test results show that the x-vector model with MFCC features gives the lowest EER when all training data are used, namely 0%. The x-vector model with MFCC features also gives a stable EER under the 5-fold cross-validation scheme, with an average EER of 1.67%. In addition, when the test data are scored against the enrollment data, no non-target speaker is identified as a target speaker.
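As a sketch of the backend and evaluation (with random placeholder data, so the printed number is meaningless), the fragment below reduces embeddings with LDA and computes the EER over all test-versus-enrollment trials. Cosine scoring is used here as a simplified stand-in for PLDA, and note that scikit-learn's LDA caps the output dimension at the number of classes minus one, so 149 rather than the 200 dimensions mentioned above.

```python
# Backend/EER sketch: LDA reduction, cosine scoring as a PLDA stand-in,
# and EER computation over all trials. Data are random placeholders.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_curve

def compute_eer(scores, labels):
    """Equal error rate: where the false-accept and false-reject rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0

rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 512))          # placeholder x-vectors
y = np.repeat(np.arange(150), 10)         # 150 speakers, 10 utterances each

lda = LinearDiscriminantAnalysis(n_components=149)  # sklearn limit: n_classes - 1
X_lda = lda.fit_transform(X, y)

# Split into enrollment and test halves and score every trial pair.
enroll, test = X_lda[::2], X_lda[1::2]
enroll_spk, test_spk = y[::2], y[1::2]
unit = lambda M: M / np.linalg.norm(M, axis=1, keepdims=True)
scores = unit(test) @ unit(enroll).T                       # cosine similarities
labels = (test_spk[:, None] == enroll_spk[None, :]).astype(int)

print(f"EER = {100 * compute_eer(scores.ravel(), labels.ravel()):.2f}%")
```

In the 5-fold cross-validation scheme described above, this EER computation would be repeated per fold and the five values averaged.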