DEVELOPMENT OF SPEAKER RECOGNITION SYSTEM IN INDONESIAN WITH DEEP NEURAL NETWORK AND VECTOR EMBEDDING

Bibliographic Details
Main Author: Angelia, Tifany
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/69107
Institution: Institut Teknologi Bandung
id id-itb.:69107
spelling id-itb.:69107 2022-09-20T11:54:01Z speaker recognition, data augmentation, i-vector, x-vector, deep learning, vector embedding.
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
description A speaker recognition system is a biometric system that identifies a person from the characteristics of their voice. Identification is done by modeling the characteristics of each speaker. The i-vector model is considered a state-of-the-art speaker recognition model. With the development of deep learning, many deep-learning-based models have been designed, one of which is the x-vector model. The x-vector model is considered to perform better than the i-vector model, although some argue that the x-vector cannot outperform the i-vector. This final project builds both i-vector and x-vector models for speaker recognition. The i-vector model is trained with unsupervised learning, while the x-vector model is a discriminative model trained with supervised learning. The data are recordings collected by the author for multi-channel testing, recorded with both cellphones and laptops, from 150 speakers. To make the system more robust to speaker variability, data augmentation is applied to the training data: changing the volume, adding white noise, shifting the pitch, stretching time, and simulating room echo (reverberation). Features are extracted as MFCC with 60 features and Fbank (filter bank) with 40 features, and are then processed with voice activity detection (VAD) and cepstral mean and variance normalization (CMVN). The i-vector model extracts 400-dimensional vectors using a GMM with 512 Gaussian components. The x-vector model extracts its vectors with a deep neural network and applies linear discriminant analysis (LDA) to reduce them to 200 dimensions. The backend uses probabilistic linear discriminant analysis (PLDA) for decision making, and the evaluation metric is the equal error rate (EER).
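As an illustration of the preprocessing described above, the following is a minimal NumPy sketch of two of the listed augmentations (volume change and additive white noise) and of per-utterance CMVN. It is not the author's implementation: the function names, the 16 kHz sample rate, the 6 dB gain, the 20 dB SNR, and the synthetic sine-wave and random "feature" inputs are all assumptions chosen only to make the example runnable. Only the 60-coefficient feature width comes from the abstract.

```python
import numpy as np


def change_gain(signal, gain_db):
    """Scale a waveform by a gain expressed in decibels."""
    return signal * (10.0 ** (gain_db / 20.0))


def add_white_noise(signal, snr_db, rng):
    """Add Gaussian white noise at a target signal-to-noise ratio in dB."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def cmvn(features, eps=1e-8):
    """Per-utterance cepstral mean and variance normalization.

    features has shape (num_frames, num_coefficients); each coefficient
    is normalized to zero mean and unit variance over the frames.
    """
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)


# Hypothetical inputs: one second of a 220 Hz tone at an assumed 16 kHz rate.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = 0.1 * np.sin(2 * np.pi * 220.0 * t)

louder = change_gain(clean, 6.0)                    # volume augmentation
noisy = add_white_noise(clean, snr_db=20.0, rng=rng)  # white-noise augmentation

# Random stand-in for a (frames x 60) MFCC matrix, only to exercise cmvn().
fake_features = rng.normal(3.0, 2.0, size=(200, 60))
normalized = cmvn(fake_features)
```

Pitch shifting, time stretching, and room simulation are omitted here because they usually rely on a signal-processing library rather than a few lines of NumPy.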
The test results show that the x-vector model with MFCC features gives the lowest EER, 0%, when all training data are used. The x-vector model with MFCC features also gives a stable EER under the 5-fold cross-validation test scheme, with an average EER of 1.67%. In addition, when the test data were scored against the enrollment data, no non-target speaker was identified as the target speaker.
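For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (non-target trials accepted) equals the false rejection rate (target trials rejected). The sketch below shows one simple way to estimate it from trial scores by sweeping every observed score as a threshold. The function name `compute_eer` and the toy scores are hypothetical and are not taken from the project; in the described system the scores would come from the PLDA backend.

```python
import numpy as np


def compute_eer(target_scores, nontarget_scores):
    """Estimate the equal error rate and the threshold where it occurs.

    A trial is accepted when its score is >= the threshold. Every observed
    score is tried as a threshold, and the one where the false acceptance
    rate (FAR) and false rejection rate (FRR) are closest is returned.
    """
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))

    far = np.array([(nontarget_scores >= th).mean() for th in thresholds])
    frr = np.array([(target_scores < th).mean() for th in thresholds])

    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]


# Toy scores (made up for illustration): one target trial scores below
# one non-target trial, so the two distributions overlap.
eer, threshold = compute_eer(
    target_scores=[0.9, 0.8, 0.4, 0.7],
    nontarget_scores=[0.1, 0.5, 0.3, 0.2],
)
print(f"EER = {eer:.2%} at threshold {threshold}")

# Perfectly separated scores give an EER of zero.
eer_separated, _ = compute_eer([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
```

With the overlapping toy scores, one of four target trials is rejected and one of four non-target trials is accepted at the chosen threshold, so the estimate is 25%. An EER of 0%, as reported for the x-vector model with MFCC features, corresponds to the fully separated case.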