INDONESIAN AUTOMATIC SPEECH RECOGNITION DEVELOPMENT USING TRANSFER LEARNING ON MASSIVELY MULTILINGUAL SPEECH (MMS) AND WHISPER
Main Author:
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/78316
Institution: Institut Teknologi Bandung
Summary: An ideal speech recognition model can accurately transcribe speech across a variety of voice signal characteristics, such as speaking style (dictated and spontaneous), speech context (formal and informal), and background noise conditions (clean and moderate). Building a model from scratch with a large amount of training data is one possible approach; however, no substantial body of Indonesian speech training data exists that represents this variability in characteristics. An alternative approach was therefore used to build the model effectively: transfer learning, which leverages the knowledge already possessed by pretrained models. In this final project, an Indonesian speech recognition model was developed by applying transfer learning to the state-of-the-art Massively Multilingual Speech (MMS) and Whisper models, using 48,570 recordings. The transfer learning output models (fine-tuned models) were tested on speech data representing a range of characteristics and compared with models without transfer learning (baseline models). The experimental results show improved predictive capability after transfer learning, marked by a decrease in WER (word error rate). The fine-tuned Whisper model achieved the lowest WER across all test data groups; its lowest WER was recorded on the DFC (dictated-formal-clean) test set, and its highest on the SIC (spontaneous-informal-clean) test set. It was further concluded that the characteristics most affecting the model's predictive capacity are variations in speaking style and speech context.
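The evaluation metric in the summary, WER, is the word-level edit distance between a reference transcript and a model's hypothesis, divided by the number of reference words: (substitutions + deletions + insertions) / N. A minimal sketch of that computation is shown below; the example sentences are illustrative only and are not drawn from the thesis data.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits turning ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost  # substitution (or match)
            )
    return dp[len(ref)][len(hyp)] / len(ref)

# One deleted word out of four reference words -> WER 0.25
print(wer("saya pergi ke pasar", "saya pergi pasar"))  # 0.25
```

A lower WER means fewer transcription errors, which is why the decrease after fine-tuning indicates improved predictive capability.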