MGRUIP ACOUSTIC MODEL WITH TEMPORAL CONVOLUTION IN SPEECH RECOGNITION SYSTEM FOR KORAN RECITING EVALUATION

From previous research, the use of future context in acoustic models for speech recognition systems in reading the Koran seems to be able to improve system performance. The acoustic model with the future context is BLSTM. The use of BLSTM in the speech recognition system for reciting the Koran...

Bibliographic Details
Main Author: Kautsar, Isjhar
Format: Theses
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/54153
Institution: Institut Teknologi Bandung
id id-itb.:54153
spelling id-itb.:54153 2021-03-15T13:02:04Z speech recognition system, acoustic model, future context, gated recurrent unit, reciting the Koran
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description Previous research suggests that using future context in the acoustic model improves speech recognition performance for Koran recitation. The acoustic model with future context used in that work is BLSTM, which reduced WER by an average of 4.63% compared to the GMM model. However, the predictive ability of BLSTM comes at a high computational cost because of its architectural complexity. This causes large latency during decoding, which makes the model difficult to deploy in real applications because it degrades the user experience.
This research reduces the latency by using a simpler architecture with predictive capability equivalent to BLSTM, namely the mGRUIPTC acoustic model. mGRUIPTC is derived from a modified GRU architecture with an added projection layer. This layer combines the current input with the previous state output to reduce the number of parameters. Future context is added through the projection layer by also feeding it the previous layer's outputs from several subsequent time steps, a technique known as temporal convolution. In tests on Qur'an recitation, this model reduced decoding latency by up to 11 seconds compared to the BLSTM model, with equivalent prediction results. In the experiments conducted, mGRUIPTC took 3 times longer to train than BLSTM on the data used.
Unlike previous research, which used only recitations by expert speakers, the data in this research also includes non-expert speakers. It is taken from the memorization recordings of students at a Qur'anic tahfidz institution.
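The projection layer and temporal convolution described above can be illustrated with a minimal sketch. It is not the thesis implementation: the class name `ProjectedGRULayer`, the single update gate, the ReLU candidate, the omission of biases and normalization, and all dimensions are assumptions chosen only to show how future frames from the previous layer can enter the projection.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class ProjectedGRULayer:
    """Hypothetical single-gate GRU layer with an input projection.

    future_offsets lists how many steps ahead the layer may look in the
    previous layer's output (the temporal convolution part).
    """

    def __init__(self, in_dim, hid_dim, proj_dim, future_offsets=(), seed=0):
        rng = random.Random(seed)
        self.hid_dim = hid_dim
        self.future_offsets = tuple(future_offsets)
        # The projection sees [x_t, future frames..., h_{t-1}].
        ctx_dim = in_dim * (1 + len(self.future_offsets)) + hid_dim
        self.W_p = rand_matrix(proj_dim, ctx_dim, rng)  # projection
        self.W_z = rand_matrix(hid_dim, proj_dim, rng)  # update gate
        self.W_c = rand_matrix(hid_dim, proj_dim, rng)  # candidate state

    def num_params(self):
        """Count weights (no biases in this sketch)."""
        return sum(len(W) * len(W[0]) for W in (self.W_p, self.W_z, self.W_c))

    def forward(self, frames):
        """Run the layer over a sequence of input vectors."""
        h = [0.0] * self.hid_dim
        zero = [0.0] * len(frames[0])
        outputs = []
        for t, x in enumerate(frames):
            ctx = list(x)
            for k in self.future_offsets:
                # Zero-pad when the lookahead runs past the utterance end.
                ctx += frames[t + k] if t + k < len(frames) else zero
            ctx += h
            p = matvec(self.W_p, ctx)  # low-dimensional projection
            z = [1.0 / (1.0 + math.exp(-a)) for a in matvec(self.W_z, p)]
            c = [max(0.0, a) for a in matvec(self.W_c, p)]
            h = [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h, c)]
            outputs.append(h)
        return outputs


# Toy input: 5 frames of 3 features each (made-up values).
frames = [[0.1 * (t + i) for i in range(3)] for t in range(5)]
layer = ProjectedGRULayer(in_dim=3, hid_dim=16, proj_dim=4, future_offsets=(1, 2))
outputs = layer.forward(frames)
print(len(outputs), len(outputs[0]))  # 5 16
```

Because both gates read the small projected vector rather than the full input and state, the weight count shrinks relative to a plain GRU of the same hidden size. The lookahead is a fixed number of frames per layer, whereas a bidirectional model has to see the whole utterance before it can emit its first output.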
The set of speaker categories was also expanded. Previous studies used only male speech; this study adds two categories to the test data, namely female and boy speech.
The mGRUIPTC acoustic model is also tested for evaluating Koran recitation. The speech recognition system is modified to recognize sounds at the phoneme level because 5 of the 6 errors that occur in reciting the Koran are pronunciation errors, namely errors in letters, lines (vowel marks), humming (nasalization), thick and thin articulation, and length (long versus short). Modifications were made to QScript, which in the previous study mapped Arabic script to Latin script according to the recitation at the word level. The modifications add new rules for cases that QScript did not previously handle. In the test results, the system works better for the male speech category. Of the 5 pronunciation errors, the system is better at detecting line errors and thick and thin errors. Overall, however, the system cannot yet be used to evaluate Koran recitation because the PER of the acoustic model on the major errors, namely errors in letters, lines, and length, reaches 26.82%.
An online speech recognition prototype was also developed, built on the mGRUIPTC model with the best configuration obtained from testing. The prototype can record a Koran recitation and give feedback on it directly.
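The record reports WER and PER without defining them. Under the standard definition, an error rate is the edit distance between the reference and recognized sequences divided by the reference length. The sketch below assumes that definition; the function names and the Latin phoneme symbols are illustrative and are not taken from QScript or the thesis.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of an extra token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """(substitutions + deletions + insertions) / reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Made-up phoneme sequences: one vowel substituted out of five phonemes.
reference = ["b", "i", "s", "m", "i"]
recognized = ["b", "i", "s", "m", "a"]
print(error_rate(reference, recognized))  # 0.2
```

The same function gives WER when the tokens are words and PER when they are phonemes.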
format Theses
author Kautsar, Isjhar
title MGRUIP ACOUSTIC MODEL WITH TEMPORAL CONVOLUTION IN SPEECH RECOGNITION SYSTEM FOR KORAN RECITING EVALUATION
url https://digilib.itb.ac.id/gdl/view/54153