AUTOMATIC SPEAKER RECOGNITION BASED ON I-VECTOR MODELLING AND DATA AUGMENTATION FOR BAHASA INDONESIA
Automatic speaker recognition system is a technological process that can be done to identify the speaker's identity from his speech. This system can be used for various applications in the industry. In Indonesia, this system is still rare to be applied because the system based on Indonesian lan...
Main Author: | Adhitama Harijanto, Nethanael |
---|---|
Format: | Final Project |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/74654 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:74654 |
---|---|
spelling |
id-itb.:74654 2023-07-20T13:04:57Z Adhitama Harijanto, Nethanael. Keywords: Automated speaker recognition, Data Augmentation, MFCC, I-vector. |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
An automatic speaker recognition system identifies a speaker from his or her speech. Such systems have a variety of industrial applications. In Indonesia they are still rarely deployed, because systems built for the Indonesian language do not yet perform well enough. The existing system covers only interview and conversation scenarios and was built from limited data. A system trained on more data from a wider range of scenarios is therefore needed to achieve lower error rates.
The system developed in this study is an automatic speaker recognition system based on the Identity Vector (i-vector) model. It is trained and tested on an Indonesian speech database recorded in a semi-anechoic room at the Adhiwijogo Acoustics Laboratory, Bandung Institute of Technology; Data Augmentation is then applied to the recordings to increase the amount of speech data in the database. Features are extracted from the speech as Mel Frequency Cepstral Coefficients (MFCC). In addition to the 19+1-dimensional MFCC vector, delta MFCC and delta-delta MFCC values of 20 dimensions each are used to capture how the voice changes over time, for 60 feature dimensions in total. The extracted features are modeled with i-vector modeling using 32 Gaussian components and 100 i-vector dimensions. The training data cover two genders (male and female) and five scenarios (articles, digits, conversations, vocals, and interviews). The similarity of K and UK samples is then assessed using cosine distance.
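The cosine comparison of two i-vectors described above can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names, the toy 4-dimensional vectors (the study uses 100 dimensions), and the decision threshold are all assumptions introduced here, and K and UK are simply taken to be the two samples being compared.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Cosine distance, defined here as 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


def same_speaker(k_ivector, uk_ivector, threshold):
    """Accept the pair as one speaker when the similarity reaches the threshold.

    The threshold is a hypothetical tuning parameter; the source does not
    state how the decision point is chosen.
    """
    return cosine_similarity(k_ivector, uk_ivector) >= threshold


# Toy i-vectors (hypothetical values, 4 dimensions instead of 100).
k_ivector = [0.12, -0.40, 0.33, 0.05]
uk_ivector = [0.10, -0.35, 0.30, 0.09]

print("cosine similarity:", cosine_similarity(k_ivector, uk_ivector))
print("cosine distance:  ", cosine_distance(k_ivector, uk_ivector))
print("same speaker?     ", same_speaker(k_ivector, uk_ivector, threshold=0.9))
```

Because the score depends only on the angle between the vectors, scaling either i-vector leaves the result unchanged, which is one reason cosine scoring is a common choice for comparing i-vectors.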
System performance is assessed by measuring how well the system distinguishes samples from the same speaker (target) from samples of different speakers (non-target). The result is expressed as the Equal Error Rate (EER). For the conversation and interview scenarios, previous research reported EER values of 6.41% and 7.57% for male speakers and 12.78% and 6.04% for female speakers, so this study compares its results against that earlier work for these two scenarios. For the other scenarios, EER values are compared between the system with Data Augmentation (DA) and the system without DA. The lowest EER values obtained in the five scenarios are 2.29%, 6.39%, 4.32%, 6.44%, and 3.72% for male speakers and 3.44%, 6.51%, 6.92%, 6.19%, and 3.56% for female speakers. These results correspond to relative decreases of 61.58%, 24.82%, 32.61%, 50.39%, and 50.86% for male speakers and 39.65%, 40.82%, 45.85%, 56.19%, and 41.06% for female speakers, compared with the previous research results and with the system without DA.
|
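The EER and the percentage decreases quoted in the description can be illustrated with a short sketch. It rests on several assumptions: the threshold sweep is one simple way to approximate an EER and is not necessarily the procedure used in the study, the toy scores are invented, and the third and fifth values in each result list are taken to be the conversation and interview scenarios, following the order in which the scenarios are listed.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold. Returns the pair
    (eer, threshold) at the point where the false rejection rate and the
    false acceptance rate are closest to each other.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")
    best = None
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        # Targets scoring below the threshold are falsely rejected.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # Non-targets scoring at or above the threshold are falsely accepted.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


def relative_decrease(baseline_eer, new_eer):
    """Relative reduction of new_eer with respect to baseline_eer, in percent."""
    return (baseline_eer - new_eer) / baseline_eer * 100.0


# Toy similarity scores (hypothetical, not taken from the study).
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])
print("toy EER:", eer, "at threshold", threshold)

# Previously reported EER vs. lowest EER in this study, in percent.
comparisons = {
    "male, conversation": (6.41, 4.32),
    "male, interview": (7.57, 3.72),
    "female, conversation": (12.78, 6.92),
    "female, interview": (6.04, 3.56),
}
for label, (previous, current) in comparisons.items():
    print(label, round(relative_decrease(previous, current), 2))
```

Under these assumptions the four computed values round to 32.61, 50.86, 45.85, and 41.06, which matches the reported decreases for the conversation and interview scenarios and indicates that the percentages are relative reductions rather than differences in percentage points. The baselines for the other three scenarios are not given in the record, so they cannot be checked the same way.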
format |
Final Project |
author |
Adhitama Harijanto, Nethanael |
spellingShingle |
Adhitama Harijanto, Nethanael AUTOMATIC SPEAKER RECOGNITION BASED ON I-VECTOR MODELLING AND DATA AUGMENTATION FOR BAHASA INDONESIA |
author_facet |
Adhitama Harijanto, Nethanael |
author_sort |
Adhitama Harijanto, Nethanael |
title |
AUTOMATIC SPEAKER RECOGNITION BASED ON I-VECTOR MODELLING AND DATA AUGMENTATION FOR BAHASA INDONESIA |
title_short |
AUTOMATIC SPEAKER RECOGNITION BASED ON I-VECTOR MODELLING AND DATA AUGMENTATION FOR BAHASA INDONESIA |
title_full |
AUTOMATIC SPEAKER RECOGNITION BASED ON I-VECTOR MODELLING AND DATA AUGMENTATION FOR BAHASA INDONESIA |
title_fullStr |
AUTOMATIC SPEAKER RECOGNITION BASED ON I-VECTOR MODELLING AND DATA AUGMENTATION FOR BAHASA INDONESIA |
title_full_unstemmed |
AUTOMATIC SPEAKER RECOGNITION BASED ON I-VECTOR MODELLING AND DATA AUGMENTATION FOR BAHASA INDONESIA |
title_sort |
automatic speaker recognition based on i-vector modelling and data augmentation for bahasa indonesia |
url |
https://digilib.itb.ac.id/gdl/view/74654 |
_version_ |
1822993916773793792 |