ADDING LANGUAGE MODEL AND AUTOMATICALLY ADAPTING ACOUSTIC MODEL IN AN INDONESIAN SPEECH RECOGNITION SYSTEM BASED ON SELF-SUPERVISED MODEL USING LIMITED SPEECH CORPUS

Bibliographic Details
Main Author: Hannania, Nabila
Format: Theses
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/85319
Institution: Institut Teknologi Bandung
Description
Summary: The accuracy of current speech recognition models has reached human-level performance, but such models are available for fewer than 100 of the approximately 7,000 languages in the world. This is because developing an accurate speech recognition model requires training on large speech datasets with corresponding transcripts. Most languages, however, have little or no transcribed speech data. This scarcity stems from the high cost and time-consuming process of annotating or labeling audio data, and from the limited number of speakers of some languages. To address this data limitation, speech recognition models can be developed using self-supervised learning (SSL) approaches. However, developing SSL models requires large amounts of unlabeled speech data, significant computational resources (GPUs), and long training times, as does fine-tuning pre-trained SSL models. Nevertheless, the knowledge embedded in pre-trained SSL models can be leveraged with limited resources through transfer learning. Pre-trained self-supervised models have previously been used to develop Indonesian speech recognition models with limited speech data, but the resulting performance is still not optimal. Therefore, this thesis further explores the development of effective speech recognition models based on self-supervised models for the Indonesian language. Efforts to improve the performance of these speech recognition models involve developing an additional language model used in the decoding process and adapting the speech recognition system to handle out-of-vocabulary (OOV) words. To address the OOV problem, an Information Retrieval (IR) system was developed to obtain texts containing OOV words. The texts retrieved by the IR system are used to train the language model and serve as input for the Text-to-Speech (TTS) model to generate audio data containing OOV words.
The synthesized audio data is then used to retrain the speech recognition model. The proposed approach combines an additional language model, which improves the model's handling of word structure, with retraining of the speech recognition model, which improves character recognition in sound sequences; together these significantly improved model performance. The speech recognition model using the proposed approach achieved a Character Error Rate (CER) of 12.2% and a Word Error Rate (WER) of 45.6% when evaluated on test data, whereas the previous approach resulted in a CER of 16.5% and a WER of 68.6%. The proposed approach can thus adapt the speech recognition system automatically, reducing CER by 26% and WER by 34% in relative terms compared to the initial approach.
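
The reported figures can be checked with a short sketch. The Python below is an illustration only, assuming the standard definitions of WER and CER (Levenshtein edit distance divided by reference length) and of relative error reduction; the summary does not state which evaluation tooling the thesis used, and the function names and the toy Indonesian sentence are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def relative_reduction(baseline, improved):
    """Relative error reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# Toy example (hypothetical sentence): one missing word out of four.
print(wer("saya pergi ke pasar", "saya pergi pasar"))

# Figures quoted in the summary, in percent.
print(round(100 * relative_reduction(16.5, 12.2)))  # CER
print(round(100 * relative_reduction(68.6, 45.6)))  # WER
```

Under these definitions, the two relative reductions round to the 26% (CER) and 34% (WER) quoted in the summary.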