INDONESIAN AUTOMATIC SPEECH RECOGNITION DEVELOPMENT USING TRANSFER LEARNING ON MASSIVELY MULTILINGUAL SPEECH (MMS) AND WHISPER
An ideal speech recognition model is capable of accurately transcribing speech across a variety of voice signal characteristics, such as speaking style (dictated and spontaneous), speech context (formal and informal), and background noise conditions (clean and moderate). Building a model from scr...
Main Author: | Adila, Aulia |
---|---|
Format: | Final Project |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/78316 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:78316 |
---|---|
spelling |
id-itb.:78316; 2023-09-18T23:52:49Z; keywords: end-to-end speech recognition model, transfer learning, MMS, Whisper, speech variability |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesian |
description |
An ideal speech recognition model accurately transcribes speech across a variety of
voice signal characteristics, such as speaking style (dictated and spontaneous),
speech context (formal and informal), and background noise conditions (clean and
moderate). Building a model from scratch with large training data is one possible
approach. However, no substantial amount of Indonesian speech training data that
represents this variability is available; therefore, an alternative approach is used
that builds the model by reusing the knowledge already held by pretrained models
through transfer learning. In this final project, an Indonesian speech recognition
model was developed by applying transfer learning to the state-of-the-art Massively
Multilingual Speech (MMS) and Whisper models, using 48,570 recordings. The models
produced by transfer learning (fine-tuned models) were tested on speech data
representing a range of characteristics, and the results were compared with those of
models without transfer learning (baseline models). The experimental results
indicate that the models' predictive capability improved after transfer learning, as
shown by a decrease in WER (word error rate). The fine-tuned Whisper model achieved
the lowest WER across all test data groups. The lowest WER score was recorded on the
DFC (dictated-formal-clean) test data, and the highest on the SIC
(spontaneous-informal-clean) test data. Furthermore, it was concluded that the
characteristics that most influence the model's predictive capability are speaking
style and speech context. |
format |
Final Project |
author |
Adila, Aulia |
title |
INDONESIAN AUTOMATIC SPEECH RECOGNITION DEVELOPMENT USING TRANSFER LEARNING ON MASSIVELY MULTILINGUAL SPEECH (MMS) AND WHISPER |
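The abstract reports its results as word error rate (WER). This record does not say how the thesis computed the metric, so the following is only a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a model hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example sentences are hypothetical and are not taken from the thesis or its data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = word-level edit distance / number of reference words.

    Assumption: words are separated by whitespace and compared exactly;
    no text normalization (casing, punctuation) is applied.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref_words)


# Made-up example: one of four reference words is missing from the hypothesis.
example_wer = word_error_rate("saya pergi ke pasar", "saya pergi pasar")
print(example_wer)  # 0.25
```

Under this definition a lower value means fewer word-level errors, and the value can exceed 1.0 when the hypothesis contains many inserted words.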