ACOUSTIC MODEL CONSTRUCTION FOR INDONESIAN AUTOMATIC SPEECH RECOGNITION THROUGH AGREEMENT-BASED WEAKLY SUPERVISED LEARNING

Bibliographic Details
Main Author: Zakiah, Iftitakhul
Format: Theses
Language: Indonesian
Subjects:
Online Access:https://digilib.itb.ac.id/gdl/view/48062
Institution: Institut Teknologi Bandung
Description
Summary: Automatic Speech Recognition (ASR) is a rapidly growing field. ASR has been developed for many languages, one of which is Indonesian. However, Indonesian ASR has little transcribed data compared to other languages. Transcribing audio data at the word level takes roughly 6-8 times the audio duration, and phoneme-level transcription takes even longer. Untranscribed data, by contrast, are abundant and easy to collect, which motivates another approach to improving ASR performance. Weakly supervised learning offers several strategies, and using untranscribed data is one of them. In this thesis, we use an agreement-based approach with four heterogeneous model topologies: DNN, LSTM, CNN, and TDNN. Each model decodes the untranscribed data and aligns the result. The aligned data are then voted on per frame by all models and reformed into segments approved by the models. These segments are used as additional data in training. The approach yields relative gains of up to 1.95% for DNN, 1.56% for CNN, and 2.59% for TDNN. LSTM shows no overall improvement, although the approach raises its relative performance on the formal_val corpus by up to 1.65%. The segmented data are not well suited to the LSTM topology because they lose context from the preceding segment; the DNN, CNN, and TDNN models, however, can be further improved.
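The per-frame voting and segment extraction described in the summary can be sketched roughly as follows. This is a minimal illustrative sketch, not the thesis's implementation: it assumes each model's decoded output has already been aligned to per-frame labels of equal length, uses unanimous agreement as the vote, and the function name, `min_len` parameter, and toy data are all hypothetical.

```python
def agreed_segments(alignments, min_len=3):
    """Given per-model frame-label sequences of equal length, return
    (start, end, labels) tuples covering contiguous spans where all
    models agree on every frame, keeping only spans of at least
    `min_len` frames. (Hypothetical helper; `min_len` is an assumed
    threshold, not a value from the thesis.)"""
    n_frames = len(alignments[0])
    # A frame is "approved" when every model assigns it the same label.
    agree = [len({a[t] for a in alignments}) == 1 for t in range(n_frames)]
    segments, start = [], None
    for t, ok in enumerate(agree):
        if ok and start is None:
            start = t                      # open a new agreed span
        elif not ok and start is not None:
            if t - start >= min_len:       # close the span if long enough
                segments.append((start, t, alignments[0][start:t]))
            start = None
    if start is not None and n_frames - start >= min_len:
        segments.append((start, n_frames, alignments[0][start:]))
    return segments

# Toy per-frame labels from the four model topologies:
dnn  = ["a", "a", "b", "b", "b", "c"]
lstm = ["a", "a", "b", "b", "b", "x"]   # disagrees on the last frame
cnn  = ["a", "a", "b", "b", "b", "c"]
tdnn = ["a", "a", "b", "b", "b", "c"]
print(agreed_segments([dnn, lstm, cnn, tdnn]))
# → [(0, 5, ['a', 'a', 'b', 'b', 'b'])]
```

The agreed segments, with their voted labels, would then be treated as transcribed data and appended to the training set; a majority vote could replace the unanimous check here if a looser agreement criterion were wanted.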