ACOUSTIC MODELS CONSTRUCTION ON INDONESIAN AUTOMATIC SPEECH RECOGNITION THROUGH WEAKLY SUPERVISED LEARNING WITH AGREEMENT-BASED

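Automatic Speech Recognition (ASR) is developing rapidly. ASR systems have been built for many languages, including Bahasa Indonesia, but Indonesian ASR has little transcribed data compared with other languages. Transcribing audio at the word level takes about 6 to 8 times the audio duration, and transcribing at the phoneme level takes even longer. Untranscribed audio, on the other hand, is abundant and easy to collect, so another approach is needed to exploit it and improve ASR performance. Weakly supervised learning covers many approaches, and making use of untranscribed data is one of them. In this thesis, we use agreement among four acoustic models with heterogeneous topologies: DNN, LSTM, CNN, and TDNN. Each model decodes the untranscribed data, and the decoding results are then aligned by each model. The aligned labels are voted on frame by frame by all models and then regrouped into segments that the models agree on. These segments are used as additional training data. The approach yields relative gains of up to 1.95% for DNN, 1.56% for CNN, and 2.59% for TDNN. Overall, LSTM did not improve, although the approach raised its relative performance on the formal_val corpus by up to 1.65%. The segmented data is not well suited to the LSTM topology because each segment lacks the context of the preceding segment. The DNN, CNN, and TDNN models, however, still have room for further improvement.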
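The frame-level agreement step described in the abstract can be illustrated with a minimal Python sketch. It assumes that each model's alignment is a per-frame sequence of integer labels (for example phone or state IDs) of equal length, and that a frame counts as agreed when at least `min_votes` models assign it the same label. The function names `vote_frames` and `agreed_segments`, the `min_votes` threshold, the `min_len` filter, and the toy data are hypothetical illustrations, not the thesis's actual implementation or voting rule.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple

# A segment is (start_frame, end_frame_exclusive, agreed_labels).
Segment = Tuple[int, int, List[int]]


def vote_frames(alignments: Sequence[Sequence[int]], min_votes: int) -> List[Optional[int]]:
    """Per-frame vote over model alignments.

    Returns the most frequent label for each frame, or None where fewer than
    min_votes models agree on it. (min_votes is a hypothetical threshold.)
    """
    n_frames = len(alignments[0])
    if any(len(a) != n_frames for a in alignments):
        raise ValueError("all alignments must cover the same number of frames")
    voted: List[Optional[int]] = []
    for t in range(n_frames):
        # Count how many models assigned each label to frame t.
        label, count = Counter(a[t] for a in alignments).most_common(1)[0]
        voted.append(label if count >= min_votes else None)
    return voted


def agreed_segments(voted: Sequence[Optional[int]], min_len: int = 1) -> List[Segment]:
    """Group runs of consecutive agreed frames into segments of >= min_len frames."""
    segments: List[Segment] = []
    start: Optional[int] = None
    # Append a None sentinel so a run that reaches the last frame is closed too.
    for t, label in enumerate(list(voted) + [None]):
        if label is not None and start is None:
            start = t  # a new agreed run begins
        elif label is None and start is not None:
            if t - start >= min_len:
                segments.append((start, t, list(voted[start:t])))
            start = None  # the run has ended
    return segments


# Toy per-frame label IDs from four models (DNN, LSTM, CNN, TDNN); illustrative data only.
dnn = [1, 1, 2, 2, 3, 3, 4, 4]
lstm = [1, 1, 2, 5, 3, 3, 4, 4]
cnn = [1, 1, 2, 2, 3, 3, 4, 0]
tdnn = [1, 1, 2, 2, 3, 3, 4, 4]

voted = vote_frames([dnn, lstm, cnn, tdnn], min_votes=4)  # require unanimous agreement
print(agreed_segments(voted, min_len=2))
# → [(0, 3, [1, 1, 2]), (4, 7, [3, 3, 4])]
```

Requiring all four models to agree (`min_votes=4`) keeps fewer frames but with more reliable labels, while a simple majority (`min_votes=3`) keeps more data at the cost of noisier labels. Because segments are cut wherever agreement breaks, each segment carries no context from the audio before it, which is consistent with the abstract's explanation of why the LSTM topology did not benefit.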

Bibliographic Details
Main Author: Zakiah, Iftitakhul
Format: Thesis
Language: Indonesian
Subjects: Engineering (engineering and related activities)
Online Access:https://digilib.itb.ac.id/gdl/view/48062
Institution: Institut Teknologi Bandung
Keywords: deep learning, agreement-based, segments, speech recognition
Collection: Digital ITB, Institut Teknologi Bandung Library