ADDING LANGUAGE MODEL AND AUTOMATICALLY ADAPTING ACOUSTIC MODEL IN AN INDONESIAN SPEECH RECOGNITION SYSTEM BASED ON SELF-SUPERVISED MODEL USING LIMITED SPEECH CORPUS

The accuracy of current speech recognition models has reached human-level performance, but they are only available for fewer than 100 of the approximately 7,000 languages in the world. This is because developing an accurate speech recognition model requires training on large speech datasets with...

Bibliographic Details
Main Author: Hannania, Nabila
Format: Theses
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/85319
Institution: Institut Teknologi Bandung
id id-itb.:85319
keywords recognition, speech, Indonesian, self-supervised, limited corpus
timestamp 2024-08-20T10:20:06Z
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
description Current speech recognition models have reached human-level accuracy, but they are available for fewer than 100 of the world's approximately 7,000 languages. Developing an accurate speech recognition model requires training on large speech datasets with corresponding transcripts, and most languages have little or no transcribed speech. This scarcity stems from the high cost and long time needed to annotate audio data and, for some languages, from the limited number of speakers. Self-supervised learning (SSL) approaches are needed to address this data limitation. However, developing SSL models requires large amounts of unlabeled speech data, significant computational resources (GPUs), and long training times, as does fine-tuning pre-trained SSL models. Nevertheless, the knowledge embedded in pre-trained SSL models can be leveraged with limited resources through transfer learning. Pre-trained self-supervised models have previously been used to develop Indonesian speech recognition models with limited speech data, but the resulting performance is still not optimal. This thesis therefore further explores the development of effective speech recognition models based on self-supervised models for the Indonesian language.
To improve the performance of these models, an additional language model is developed for use in the decoding process, and the speech recognition system is adapted to handle out-of-vocabulary (OOV) words. To address the OOV problem, an Information Retrieval (IR) system was developed to obtain texts containing OOV words. The retrieved texts are used to train the language model and serve as input to a Text-to-Speech (TTS) model that generates audio data containing the OOV words. The synthesized audio data is then used to retrain the speech recognition model. The proposed approach, which uses the additional language model to enhance the model's understanding of word structure and retrains the speech recognition model to improve character recognition in sound sequences, significantly improved model performance. Evaluated on the test data, the model using the proposed approach achieved a Character Error Rate (CER) of 12.2% and a Word Error Rate (WER) of 45.6%, whereas the previous approach resulted in a CER of 16.5% and a WER of 68.6%. The proposed approach can adapt the speech recognition system automatically, yielding relative reductions of 26% in CER and 34% in WER compared to the initial approach.
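The record does not say how CER and WER were computed, so the following is only a minimal sketch that assumes the usual edit-distance definitions (errors divided by reference length, with spaces counted as characters for CER). The function names and the Indonesian example sentence are illustrative and do not come from the thesis. The last two lines reproduce the relative reductions quoted in the abstract from its reported error rates.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (minimum number of substitutions, insertions and deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length.
    Assumption: spaces are counted as characters."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def relative_reduction(baseline, improved):
    """Relative error reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical transcript pair, for illustration only.
print(wer("saya makan nasi", "saya makan roti"))

# Error rates reported in the abstract (in percent).
print(round(100 * relative_reduction(16.5, 12.2)))  # 26
print(round(100 * relative_reduction(68.6, 45.6)))  # 34
```

The percentages in the abstract are relative, not absolute: the CER falls by 4.3 points out of 16.5, and the WER by 23.0 points out of 68.6.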