ADDING LANGUAGE MODEL AND AUTOMATICALLY ADAPTING ACOUSTIC MODEL IN AN INDONESIAN SPEECH RECOGNITION SYSTEM BASED ON SELF-SUPERVISED MODEL USING LIMITED SPEECH CORPUS
The accuracy of current speech recognition models has reached human-level performance, but they are only available for fewer than 100 of the approximately 7,000 languages in the world. This is because developing an accurate speech recognition model requires training on large speech datasets with...
Main Author: | Hannania, Nabila |
---|---|
Format: | Theses |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/85319 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:85319 |
---|---|
spelling |
id-itb.:85319; 2024-08-20T10:20:06Z; keywords: recognition, speech, Indonesian, self-supervised, limited corpus |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
The accuracy of current speech recognition models has reached human-level
performance, but they are only available for fewer than 100 of the approximately
7,000 languages in the world. This is because developing an accurate speech
recognition model requires training on large speech datasets with corresponding
transcripts. However, most languages do not have such speech data with
transcriptions, or the data is very limited. This data scarcity is due to the high cost
and time-consuming process of annotating or labeling audio data, and the limited
number of speakers for some languages. To address this data limitation issue, it is
necessary to develop speech recognition models using self-supervised learning
(SSL) approaches. However, developing SSL models requires large amounts of
unlabeled speech data, significant computational resources (GPUs), and long
training times, as does fine-tuning pre-trained SSL models. Nevertheless, the
knowledge embedded in pre-trained SSL models can be leveraged with limited
resources through transfer learning. Pre-trained self-supervised models have
already been used to build Indonesian speech recognition models from limited
speech data, but their performance remains suboptimal. This thesis therefore
further explores the development of effective speech recognition models based on
self-supervised models for the Indonesian language.
Efforts to improve the performance of these speech recognition models involve
developing an additional language model used in the decoding process and
adapting the speech recognition system to handle OOV (Out-of-Vocabulary) words.
To address the OOV problem, an Information Retrieval (IR) system was developed
to obtain texts containing OOV words. The texts retrieved by the IR system are used
to train the language model and serve as input for the Text-to-Speech (TTS) model
to generate audio data containing OOV words. The synthesized audio data is then
used to retrain the speech recognition model.
The proposed approach, which involves using an additional language model to
enhance the model's understanding of word structure and retraining the speech
recognition model to improve character recognition in sound sequences, has
significantly improved model performance. The speech recognition model using the
proposed approach achieved a Character Error Rate (CER) of 12.2% and a Word
Error Rate (WER) of 45.6% when evaluated on test data, whereas the previous
approach resulted in a CER of 16.5% and a WER of 68.6%. The proposed
approach automatically adapts the speech recognition system, reducing CER by
26% and WER by 34% in relative terms compared to the initial approach. |
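The error rates and relative improvements quoted in the abstract can be reproduced with a short calculation. The sketch below is illustrative only and is not taken from the thesis: it assumes CER and WER are computed in the usual way, as Levenshtein edit distance divided by reference length, and that the relative improvement is (baseline - proposed) / baseline. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def relative_reduction(baseline, proposed):
    """Relative error reduction of the proposed system against the baseline."""
    return (baseline - proposed) / baseline


# Error rates reported in the abstract (percent).
print(f"CER reduction: {relative_reduction(16.5, 12.2):.1%}")
print(f"WER reduction: {relative_reduction(68.6, 45.6):.1%}")
```

With the reported figures this gives a relative reduction of about 26% for CER and about 33.5% for WER, in line with the rounded values stated above.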
format |
Theses |
author |
Hannania, Nabila |
title |
ADDING LANGUAGE MODEL AND AUTOMATICALLY ADAPTING ACOUSTIC MODEL IN AN INDONESIAN SPEECH RECOGNITION SYSTEM BASED ON SELF-SUPERVISED MODEL USING LIMITED SPEECH CORPUS |
url |
https://digilib.itb.ac.id/gdl/view/85319 |