INDONESIAN SPONTANEOUS SPEECH RECOGNITION SYSTEM USING DEEP NEURAL NETWORKS
Main Author: | |
---|---|
Format: | Final Project |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/48149 |
Institution: | Institut Teknologi Bandung |
Summary: | The existing Indonesian speech recognition system still performs poorly on spontaneous speech. That system was trained with an HMM-GMM acoustic model. In this study, 14 hours of spontaneous Indonesian speech data were collected, and recognition performance was improved by replacing the acoustic model with a neural network-based model. The neural network topologies used are Deep Neural Network (DNN), Convolutional Neural Network (CNN), and Time Delay Neural Network (TDNN).
In this paper, the baseline is the HMM-GMM acoustic model trained on dictated speech, which yields a word error rate (WER) of 73.87%. Training the model on noise-augmented data lowers the WER to 71.15%. Applying an adaptation technique to the model lowers the WER to 62.75%, and combining adaptation with noise augmentation lowers it to 62.16%. In subsequent experiments, the model was trained on a mix of dictated and spontaneous speech, and the WER dropped to 57.59%. The acoustic model was then replaced with a neural network-based model: the DNN model reaches a WER of 50.02% and the CNN model 47.58%. The lowest WER, 40.63%, was obtained with the TDNN topology.
|
---|
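The summary reports WER figures but does not say how they were scored. The sketch below assumes the standard definition, in which WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. It also tabulates each reported WER with its relative reduction against the 73.87% baseline. The function names and the Indonesian example sentence are hypothetical illustrations, not taken from the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Assumed standard WER: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, value: float) -> float:
    """Fraction by which `value` improves on `baseline`."""
    return (baseline - value) / baseline


# WER values (percent) exactly as reported in the summary.
REPORTED_WER = [
    ("HMM-GMM, dictated speech (baseline)", 73.87),
    ("HMM-GMM + noise augmentation", 71.15),
    ("HMM-GMM + adaptation", 62.75),
    ("HMM-GMM + adaptation + noise augmentation", 62.16),
    ("HMM-GMM, mixed dictated and spontaneous data", 57.59),
    ("DNN", 50.02),
    ("CNN", 47.58),
    ("TDNN", 40.63),
]

if __name__ == "__main__":
    baseline = REPORTED_WER[0][1]
    for name, wer in REPORTED_WER:
        rel = relative_reduction(baseline, wer)
        print(f"{name:<46} {wer:6.2f}%  relative reduction {rel:6.1%}")

    # Hypothetical sentence pair: one deleted word out of four reference words.
    print(word_error_rate("saya pergi ke pasar", "saya pergi pasar"))
```

Under this definition, the TDNN result corresponds to roughly a 45% relative WER reduction over the baseline, since (73.87 - 40.63) / 73.87 is about 0.45.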