Transfer Learning from News Domain to Lecture Domain in Automatic Speech Recognition

Bibliographic Details
Main Author: Zakiah, Iftitakhul
Format: Final Project
Language: Indonesian
Online Access:https://digilib.itb.ac.id/gdl/view/39882
Institution: Institut Teknologi Bandung
Description
Summary: In teaching and learning activities, the knowledge conveyed by instructors comes not only from references or presentation slides but also from their own experience and knowledge. Meanwhile, automatic speech recognition (ASR) systems are developing rapidly and are beginning to be applied widely, including in the lecture domain. Building an ASR system from scratch requires very large amounts of data, both voice recordings and text. An alternative is transfer learning, an approach that builds a model by leveraging an existing model as the source model. This final project begins with collecting data from the Informatics ITB lecture domain. The ASR experiments use spontaneous-speech language models from the news domain as source models. The work compares three systems: one trained on the news domain only (baseline), one on the lecture domain only (baseline), and one combining both (transfer learning). All three systems use a triphone GMM-HMM acoustic model; MAP adaptation is applied only in system C. The language models of all three systems use n-grams and an LSTM with a projection layer (LSTMP). Transfer learning is implemented on the language models through n-gram interpolation and by transferring the LSTMP model. The news-domain system yields a WER of 78.30% (5-fold) and 85.18% (10sp); the lecture-domain system 58.232% (5-fold) and 62.18% (10sp); and the transfer learning system 52.734% (5-fold) and 67.0% (10sp). Since a lower WER indicates a better model, the best ASR for lectures uses the transfer learning approach for the language model and a triphone model for the acoustic model.
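All three systems in the abstract are compared by word error rate (WER), where lower is better. As an illustration only (this code is not from the thesis), a minimal sketch of WER computed as word-level Levenshtein edit distance divided by the number of reference words:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, a transcript of three words with one word misrecognized gives a WER of 1/3 (about 33.3%); the reported percentages in the abstract are this ratio over the full test set.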