CHANNEL NORMALIZATION OF SPEECH ACOUSTIC SIGNAL USING WITHIN-CLASS COVARIANCE NORMALIZATION (WCCN) FOR SPEAKER RECOGNITION SYSTEM WITH BAHASA INDONESIA
Main Author: | |
---|---|
Format: | Theses |
Language: | Indonesia |
Subjects: | |
Online Access: | https://digilib.itb.ac.id/gdl/view/56836 |
Institution: | Institut Teknologi Bandung |
Summary: | A speaker recognition system is a technology that verifies a speaker's identity from an unknown speech sample. In Indonesia, such systems are actively used by the anti-corruption agency, the Police, and the Attorney General's Office to support speaker verification as evidence in court. The speaker recognition system developed in this study uses I-vector modeling. It is trained and tested on an Indonesian speech database owned by the Acoustic Laboratory of Physics Engineering Department at Institut Teknologi Bandung. The test data comprise speech from 46 male and 52 female speakers, and the training data comprise the first 20 speakers of each gender in each recording scenario. Speech features are extracted as 19 Mel-Frequency Cepstral Coefficients (MFCCs) plus 1 energy dimension, 20 delta-MFCCs, and 20 delta-delta-MFCCs. The extracted features are modeled using a Universal Background Model (UBM) with 32 Gaussian components and 100-dimensional I-vectors. The similarity between the Known (K) and Unknown (UK) samples is then scored using the cosine distance method. A previous experiment using the same dataset and parameters achieved its best result, an Equal Error Rate (EER) of 3.50%, on the female interview scenario. This study applies the Within-Class Covariance Normalization (WCCN) technique to improve system performance for recordings made with the same device and with different (mismatched) devices. The hypothesis is that WCCN improves performance in both the same-channel and the mismatched-channel speaker recognition system. In the same-channel experiment, performance improved by 31.43% relative to the previous study, which used the same dataset and parameters without WCCN.
The best EER obtained in this study was 2.40%, in the same-channel experiment on the female interview scenario. Compared with the original I-vector system, the mismatched-channel speaker recognition system with WCCN improved performance by an average of 33.75% across scenarios. The best mismatched-channel EER, 20.52%, was obtained in the female conversation scenario.
Keywords: Speech Recognition System, I-vector, cosine distance, same-channel, channel mismatch, Within-class Covariance Normalization, MFCC, Equal Error Rate.
|
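The summary names WCCN and cosine-distance scoring without showing how they combine. The following is a minimal NumPy sketch of one common formulation, not the thesis's actual implementation. The function names (`wccn_projection`, `cosine_score`, `wccn_cosine_score`) and the small `ridge` regularizer are assumptions introduced here for illustration. The idea is to average the per-speaker covariance of the training I-vectors, factor its inverse with a Cholesky decomposition, and apply the resulting projection to both the Known and Unknown I-vectors before scoring.

```python
import numpy as np


def wccn_projection(ivectors, speaker_ids, ridge=1e-6):
    """Estimate a WCCN projection matrix B such that B @ B.T ~= inv(W).

    ivectors:    array of shape (n_utterances, dim), one I-vector per row.
    speaker_ids: array of shape (n_utterances,), speaker label per row.
    ridge:       hypothetical regularizer (an assumption, not from the
                 thesis) that keeps W invertible with little training data.
    """
    ivectors = np.asarray(ivectors, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    dim = ivectors.shape[1]
    speakers = np.unique(speaker_ids)

    # Within-class covariance: average of each speaker's own covariance.
    W = np.zeros((dim, dim))
    for spk in speakers:
        x = ivectors[speaker_ids == spk]
        centered = x - x.mean(axis=0)
        W += centered.T @ centered / len(x)
    W /= len(speakers)
    W += ridge * np.eye(dim)

    # Symmetrize the inverse to guard against round-off, then factor it.
    W_inv = np.linalg.inv(W)
    W_inv = (W_inv + W_inv.T) / 2.0
    return np.linalg.cholesky(W_inv)  # lower-triangular B


def cosine_score(a, b):
    """Cosine similarity between two vectors (higher means more similar)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def wccn_cosine_score(known, unknown, B):
    """Project both I-vectors with B.T, then score with cosine similarity."""
    return cosine_score(B.T @ known, B.T @ unknown)
```

With 100-dimensional I-vectors as in the summary, `B` would be a 100 x 100 matrix estimated once from the training speakers and reused for every Known/Unknown comparison. After projection, the within-speaker covariance of the training vectors is approximately the identity, which is what down-weights directions dominated by channel variation.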
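The results in the summary are reported as Equal Error Rates and percentage improvements. The sketch below shows one simple way to estimate an EER from trial scores and to compute a relative improvement. The function names and the threshold sweep are illustrative assumptions, not the thesis's evaluation code. Reading the 31.43% figure as a relative EER reduction from 3.50% to 2.40% is an inference from the arithmetic, not something the record states explicitly.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    target_scores:    scores of same-speaker (Known vs. Unknown) trials.
    nontarget_scores: scores of different-speaker trials.
    Returns the error rate (0..1) where false acceptance and false
    rejection are closest to equal.
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))

    # False rejection: same-speaker trials scoring below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance: different-speaker trials at or above the threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])

    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2.0)


def relative_eer_reduction(baseline_eer, new_eer):
    """Relative improvement in percent; lower EER is better."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer
```

Under that interpretation, `relative_eer_reduction(3.50, 2.40)` rounds to 31.43, which matches the same-channel improvement quoted in the summary.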