INDONESIAN AUTOMATIC SPEECH RECOGNITION DEVELOPMENT USING TRANSFER LEARNING ON MASSIVELY MULTILINGUAL SPEECH (MMS) AND WHISPER


Bibliographic Details
Main Author: Adila, Aulia
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/78316
Institution: Institut Teknologi Bandung
Description
Summary: An ideal speech recognition model is capable of accurately transcribing speech across a variety of voice signal characteristics, such as speaking style (dictated and spontaneous), speech context (formal and informal), and background noise conditions (clean and moderate). Building a model from scratch with a large amount of training data is one possible approach; however, no substantial amount of Indonesian speech training data representing this variability in characteristics is available. An alternative approach is therefore used to build the model effectively by leveraging the knowledge already encoded in pretrained models through transfer learning. In this final project, an Indonesian speech recognition model was developed by applying transfer learning to the state-of-the-art Massively Multilingual Speech (MMS) and Whisper models, using 48,570 recordings. The transfer learning output models (fine-tuned models) were evaluated on speech data representing a range of characteristics and compared with the corresponding models without transfer learning (baseline models). The experimental results indicate an improved predictive capability of the models after transfer learning, marked by a decrease in WER (word error rate). Across all test data groups, the lowest WER was achieved by the fine-tuned Whisper model; the lowest WER was recorded on the DFC (dictated-formal-clean) test set, while the highest was observed on the SIC (spontaneous-informal-clean) set. It was further concluded that the characteristics that most influence the predictive capacity of the model are variations in speaking style and speech context.
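
As a rough illustration of the evaluation step described in the summary (not the code used in this final project), the sketch below loads a pretrained Whisper checkpoint with the Hugging Face transformers library, transcribes an Indonesian recording, and scores it against a reference transcript with WER via jiwer. The checkpoint name ("openai/whisper-small"), the audio file name, and the reference text are illustrative assumptions, since the record does not specify which checkpoints or test files were used.

# Minimal sketch, assuming Hugging Face transformers, librosa, and jiwer are installed.
# Checkpoint, audio path, and reference text below are hypothetical placeholders.
import torch
import librosa
import jiwer
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Baseline (not fine-tuned) multilingual Whisper checkpoint; size is an assumption.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Force Indonesian transcription so the multilingual model does not translate.
forced_ids = processor.get_decoder_prompt_ids(language="indonesian", task="transcribe")

def transcribe(path: str) -> str:
    """Load a waveform at 16 kHz and return the model's transcript."""
    audio, _ = librosa.load(path, sr=16000)
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        ids = model.generate(inputs.input_features, forced_decoder_ids=forced_ids)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]

# Score one hypothetical test recording against its ground-truth transcript.
reference = "selamat pagi semuanya"        # illustrative reference text
hypothesis = transcribe("sample_dfc.wav")  # hypothetical DFC-style test recording
print("WER:", jiwer.wer(reference, hypothesis))

A fine-tuned model would be evaluated the same way by pointing the from_pretrained calls at the fine-tuned checkpoint directory, so baseline and fine-tuned WER can be compared on the same test sets.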