ACOUSTIC MODELS CONSTRUCTION ON INDONESIAN AUTOMATIC SPEECH RECOGNITION THROUGH WEAKLY SUPERVISED LEARNING WITH AGREEMENT-BASED
Automatic Speech Recognition (ASR) is growing rapidly. ASR has been developed for many languages, including Indonesian (Bahasa Indonesia). However, Indonesian ASR has little transcribed data compared with other languages. Transcribing audio at the word level takes about 6-8 times the audi...
Main Author: | Zakiah, Iftitakhul |
---|---|
Format: | Theses |
Language: | Indonesian |
Subjects: | Teknik (Rekayasa, enjinering dan kegiatan berkaitan) |
Online Access: | https://digilib.itb.ac.id/gdl/view/48062 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:48062 |
---|---|
spelling |
id-itb.:48062; 2020-06-25T23:01:49Z; ACOUSTIC MODELS CONSTRUCTION ON INDONESIAN AUTOMATIC SPEECH RECOGNITION THROUGH WEAKLY SUPERVISED LEARNING WITH AGREEMENT-BASED; Zakiah, Iftitakhul; Teknik (Rekayasa, enjinering dan kegiatan berkaitan); Indonesian; Theses; deep learning, agreement-based, segments, speech recognition; INSTITUT TEKNOLOGI BANDUNG; https://digilib.itb.ac.id/gdl/view/48062; text |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesian |
topic |
Teknik (Rekayasa, enjinering dan kegiatan berkaitan) |
description |
Automatic Speech Recognition (ASR) is growing rapidly. ASR has been developed for many languages, including Indonesian (Bahasa Indonesia). However, Indonesian ASR has little transcribed data compared with other languages. Transcribing audio at the word level takes about 6-8 times the audio duration, and transcribing at the phoneme level takes even longer. Untranscribed data, by contrast, are abundant and easier to collect, so another approach is needed to improve ASR performance.
Weakly supervised learning covers many approaches, one of which is to exploit untranscribed data. In this thesis, we use an agreement-based approach built on four models with heterogeneous topologies: DNN, LSTM, CNN, and TDNN. Each model decodes the untranscribed data and aligns the result. The aligned data are then voted on per frame by all models, and the frames are regrouped into segments on which the models agree. These segments are used as additional training data. DNN achieves relative gains of up to 1.95%, CNN up to 1.56%, and TDNN up to 2.59%. Overall, LSTM did not improve, although the approach increased relative performance on one corpus, formal_val, by up to 1.65%. The segmented data do not suit the LSTM topology because each segment loses the context of the preceding segment. The DNN, CNN, and TDNN models, however, can be improved further. |
format |
Theses |
author |
Zakiah, Iftitakhul |
title |
ACOUSTIC MODELS CONSTRUCTION ON INDONESIAN AUTOMATIC SPEECH RECOGNITION THROUGH WEAKLY SUPERVISED LEARNING WITH AGREEMENT-BASED |
url |
https://digilib.itb.ac.id/gdl/view/48062 |
_version_ |
1822000013675331584 |
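
The description above mentions per-frame voting across the four models followed by regrouping into agreed segments. The record does not state the voting rule, so the following is only a minimal sketch that assumes unanimous agreement per frame. The function name `agreed_segments`, the `min_frames` parameter, and the toy labels are hypothetical and not taken from the thesis.

```python
from typing import List, Sequence, Tuple

# (start_frame, end_frame_exclusive, frame labels inside the segment)
Segment = Tuple[int, int, List[str]]


def agreed_segments(
    alignments: Sequence[Sequence[str]], min_frames: int = 1
) -> List[Segment]:
    """Return maximal runs of frames on which every model gives the same label.

    Assumption: "agreement" means all models are unanimous on a frame.
    `alignments` holds one label sequence per model, all of equal length.
    """
    if not alignments:
        return []
    n_frames = len(alignments[0])
    if any(len(a) != n_frames for a in alignments):
        raise ValueError("all alignments must cover the same number of frames")

    segments: List[Segment] = []
    start = None  # first frame of the current agreed run, if any
    # Iterate one step past the end so a run reaching the last frame is closed.
    for t in range(n_frames + 1):
        agreed = t < n_frames and len({a[t] for a in alignments}) == 1
        if agreed and start is None:
            start = t
        elif not agreed and start is not None:
            if t - start >= min_frames:
                segments.append((start, t, list(alignments[0][start:t])))
            start = None
    return segments


# Toy frame-level alignments, one list per model (hypothetical labels).
dnn = ["a", "a", "b", "b", "c", "c"]
lstm = ["a", "a", "b", "x", "c", "c"]
cnn = ["a", "a", "b", "b", "c", "c"]
tdnn = ["a", "a", "b", "b", "c", "y"]

segments = agreed_segments([dnn, lstm, cnn, tdnn], min_frames=2)
```

With these toy inputs, frames 0 to 2 are unanimous and form one segment, while the single agreed frame 4 is dropped by `min_frames=2`. A majority vote or a confidence threshold would be equally plausible readings of the description; unanimity is simply the strictest choice.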