INDONESIAN AUTOMATIC SPEECH RECOGNITION DEVELOPMENT USING TRANSFER LEARNING ON MASSIVELY MULTILINGUAL SPEECH (MMS) AND WHISPER

Bibliographic Details
Main Author: Adila, Aulia
Format: Final Project
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/78316
Institution: Institut Teknologi Bandung
id id-itb.:78316
spelling id-itb.:78316 2023-09-18T23:52:49Z
keywords end-to-end speech recognition model, transfer learning, MMS, Whisper, speech variability
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description An ideal speech recognition model transcribes speech accurately across varied signal characteristics: speaking style (dictated or spontaneous), speech context (formal or informal), and background noise (clean or moderate). Training a model from scratch on a large dataset is one possible approach, but no sufficiently large Indonesian speech corpus covers this variability. This final project therefore applies transfer learning, which reuses the knowledge of pretrained models, to build an Indonesian speech recognition model. Two state-of-the-art pretrained models, Massively Multilingual Speech (MMS) and Whisper, were fine-tuned on 48,570 recordings. The fine-tuned models were evaluated on test data representing a range of speech characteristics and compared with the same models without transfer learning (baseline models). Transfer learning lowered the word error rate (WER) of the models. The fine-tuned Whisper model achieved the lowest WER on every test data group. WER was lowest on the DFC (dictated-formal-clean) test data and highest on the SIC (spontaneous-informal-clean) test data. The study concludes that speaking style and speech context are the characteristics that most affect the model's predictive capacity.
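The abstract reports its results as WER (word error rate). As a reading aid, the standard definition is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the evaluation code used in the project; the function name, the lowercase-and-split normalization, and the sample sentences are all assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration; normalization here is only
    lowercasing and whitespace splitting, which is an assumption.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Invented sample sentences, not taken from the project's test data.
reference = "saya pergi ke pasar"
hypothesis = "saya pergi pasar"
wer = word_error_rate(reference, hypothesis)  # one deletion over four words
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference. A lower value means the transcript is closer to the reference.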
format Final Project
author Adila, Aulia
title INDONESIAN AUTOMATIC SPEECH RECOGNITION DEVELOPMENT USING TRANSFER LEARNING ON MASSIVELY MULTILINGUAL SPEECH (MMS) AND WHISPER
url https://digilib.itb.ac.id/gdl/view/78316
_version_ 1822995703132061696