LANGUAGE RECOGNITION FOR UNCONDITIONED JAVANESE, MALAY, AND SUNDANESE DATA BASED ON I-VECTOR FRAMEWORK
Main Author: | |
---|---|
Format: | Theses |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/67159 |
Institution: | Institut Teknologi Bandung |
Summary: | Spoken language recognition identifies the language of a speech recording. It is useful for speech recognition systems and is deployed in multilingual applications such as multilingual voice assistants, multilingual automatic transcription, automatic call routing, and document retrieval. Spoken language recognition covers two tasks: identification and verification. Language identification classifies an utterance into one of a set of language classes. Language verification is a “yes or no” problem that decides whether an utterance belongs to a specific language. Recent language recognition studies focus on systems that work under real conditions. Such systems need an unconditioned dataset that represents real-world speech, with high variability in audio quality and recording conditions. Sources of variability that may be present in a recording include different microphones, background noise, room reverberation, overlapping voices, and vocal effort. High variability can degrade system performance on unseen data. Several methods have been developed to address this problem, such as normalizing the features with within-class covariance normalization (WCCN) or discarding less important feature dimensions with linear discriminant analysis (LDA).
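As an illustration of the WCCN step mentioned above, the following is a minimal sketch and not the thesis implementation. It assumes the common formulation in which vectors are projected by a matrix B satisfying B Bᵀ = W⁻¹, where W is the average within-class covariance. The function names and the 2-D toy vectors are hypothetical; real i-vectors typically have hundreds of dimensions.

```python
import math


def within_class_covariance(vectors, labels):
    """Average of the per-class covariance matrices (the W used by WCCN)."""
    dim = len(vectors[0])
    classes = sorted(set(labels))
    W = [[0.0] * dim for _ in range(dim)]
    for c in classes:
        members = [v for v, y in zip(vectors, labels) if y == c]
        mean = [sum(col) / len(members) for col in zip(*members)]
        for v in members:
            dev = [a - m for a, m in zip(v, mean)]
            for i in range(dim):
                for j in range(dim):
                    W[i][j] += dev[i] * dev[j] / (len(members) * len(classes))
    return W


def cholesky(A):
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def wccn_transform(L, x):
    """Solve L y = x by forward substitution.

    With W = L L^T this equals B^T x for B = L^{-T}, and B B^T = W^{-1}.
    """
    y = []
    for i in range(len(x)):
        y.append((x[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


# Hypothetical toy data: three 2-D "i-vectors" for each of two languages.
toy_vectors = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0],
               [5.0, 5.0], [7.0, 6.0], [6.0, 9.0]]
toy_labels = ["javanese"] * 3 + ["sundanese"] * 3

L_factor = cholesky(within_class_covariance(toy_vectors, toy_labels))
projected = [wccn_transform(L_factor, v) for v in toy_vectors]
```

After the projection, the within-class covariance of the transformed vectors is the identity matrix, which is the sense in which WCCN normalizes session and channel variability within each language class.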
Language recognition systems mostly rely on machine learning methods, whose predictions are strongly influenced by the characteristics of the data. Speech in a given language is biometric data, so its characteristics are affected by many factors such as race, culture, and demography. Complex combinations of these factors affect, among other things, a language’s grammar, hierarchy, and tone. A language recognition system trained on one set of languages therefore needs adjustment before it is used on a different set of languages. For this reason, many studies have been conducted on different language combinations, and this also motivates studies on local languages, such as recognition systems for Indonesian local languages.
In this study, a local language recognition system is developed for Javanese, Malay, and Sundanese. The speech dataset was collected independently from YouTube and from speech recorded by participants. The dataset has high variability in channel, vocal effort, background noise, and reverberation, so that it represents real-condition speech. The participants’ recordings were made with different types of channel, different software, and under different recording conditions. A total of 222 speech recordings were collected: 102 from YouTube and 120 from participants. The total duration per language class is 44.5 minutes for Javanese, 98.8 minutes for Malay, and 74.4 minutes for Sundanese. The data is divided into two sets: 60% (133 recordings) for training and 40% (89 recordings) for testing.
The system is built in several steps: voice activity detection (VAD) using energy-based VAD and fast robust VAD (fast rVAD); feature extraction using mel-frequency cepstral coefficients (MFCC) and shifted delta cepstral coefficients (SDCC); i-vector modelling; and classification using support vector machine (SVM), logistic regression (LR), K-nearest neighbors (KNN), multilayer perceptron (MLP), and random forest (RF). WCCN and LDA are then applied to normalize the variability in the unconditioned dataset. System performance is evaluated using average detection cost (Cavg), F1-score, and accuracy.
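The shifted delta cepstral features in the feature extraction step can be sketched as follows. This is an illustrative sketch only: the summary does not state the configuration used, so the delta distance `d=1`, block shift `p=3`, and block count `k=7` are assumed values taken from a commonly cited setup, and the function name is hypothetical. The input is a list of per-frame cepstral vectors (for example MFCCs computed elsewhere).

```python
def shifted_delta_cepstra(cepstra, d=1, p=3, k=7):
    """Stack k delta blocks per frame.

    Block i of frame t is c[t + i*p + d] - c[t + i*p - d], so each output
    vector has k * N values, where N is the number of cepstral coefficients.
    Only frames for which every shifted index exists are returned.
    """
    n_frames = len(cepstra)
    stacked_frames = []
    # Frame t needs indices from t - d up to t + (k - 1) * p + d.
    for t in range(d, n_frames - (k - 1) * p - d):
        stacked = []
        for i in range(k):
            ahead = cepstra[t + i * p + d]
            behind = cepstra[t + i * p - d]
            stacked.extend(a - b for a, b in zip(ahead, behind))
        stacked_frames.append(stacked)
    return stacked_frames


# Hypothetical input: 30 frames with 2 coefficients that grow linearly in time.
toy_cepstra = [[t, 2 * t] for t in range(30)]
toy_sdc = shifted_delta_cepstra(toy_cepstra)
```

Because each output vector spans several shifted blocks, it captures longer temporal context than a single frame. Whether the stacked deltas are concatenated with the static MFCCs before i-vector modelling is a design choice that the summary does not specify.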
Experiment results show that the best system uses energy-based VAD, i-vectors without normalization, and a KNN classifier. It achieves an average detection cost of 0.011, 0.011, and 0.051 for the 30-, 10-, and 3-second duration conditions, respectively. For the same conditions, the F1-score is 96%, 98%, and 92%, and the accuracy is 97%, 98%, and 92%.
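To make the Cavg metric concrete, here is a minimal sketch of one way to compute it. It assumes the closed-set formulation used in NIST-style language recognition evaluations with equal miss and false-alarm costs and a target prior of 0.5, and it assumes hard one-language-per-utterance decisions. The summary does not state which variant or parameters the thesis used, so the function name, defaults, and toy labels are all hypothetical.

```python
def average_detection_cost(true_langs, predicted_langs,
                           p_target=0.5, c_miss=1.0, c_fa=1.0):
    """Average over target languages of the miss cost plus false-alarm costs."""
    pairs = list(zip(true_langs, predicted_langs))
    languages = sorted(set(true_langs))
    p_nontarget = (1.0 - p_target) / (len(languages) - 1)
    total = 0.0
    for target in languages:
        # Miss rate: utterances of the target language not labelled as it.
        target_preds = [p for t, p in pairs if t == target]
        p_miss = sum(p != target for p in target_preds) / len(target_preds)
        cost = c_miss * p_target * p_miss
        for other in languages:
            if other == target:
                continue
            # False-alarm rate: utterances of another language labelled as target.
            other_preds = [p for t, p in pairs if t == other]
            p_fa = sum(p == target for p in other_preds) / len(other_preds)
            cost += c_fa * p_nontarget * p_fa
        total += cost
    return total / len(languages)


# Hypothetical decisions: one Javanese utterance is mistaken for Malay.
toy_true = ["javanese", "javanese", "malay", "malay", "sundanese", "sundanese"]
toy_pred = ["javanese", "malay", "malay", "malay", "sundanese", "sundanese"]
toy_cavg = average_detection_cost(toy_true, toy_pred)
```

Lower values are better, and a system with no misses and no false alarms scores 0 under this formulation.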
|
---|