MGRUIP ACOUSTIC MODEL WITH TEMPORAL CONVOLUTION IN SPEECH RECOGNITION SYSTEM FOR KORAN RECITING EVALUATION
Main Author: | |
---|---|
Format: | Theses |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/54153 |
Institution: | Institut Teknologi Bandung |
Summary: | Previous research suggests that using future context in the acoustic model of a
speech recognition system for Koran recitation improves system performance. The
acoustic model that used future context was BLSTM (bidirectional long short-term
memory). In that work, BLSTM reduced the word error rate (WER) by an average of
4.63% compared to the GMM model. However, the predictive ability of BLSTM comes
at a high computational cost because of its architectural complexity, which causes
large latency during decoding. This makes the model difficult to deploy in real
applications, because the latency degrades the user experience.
In this research, latency is reduced by using a simpler architecture with
predictive capability equivalent to BLSTM, namely the mGRUIPTC acoustic
model. mGRUIPTC is derived from a modification of the GRU architecture with the
addition of a projection layer. This layer combines the current input with the
previous state output to reduce the number of parameters. Future context is applied
to this architecture through the projection layer, by adding as input several
subsequent states from the output of the previous layer, which is known as
temporal convolution.
In tests on Qur'an recitation, this model reduced decoding latency by up to
11 seconds compared to the BLSTM model, with equivalent prediction results.
However, in the experiments conducted, training the mGRUIPTC model took 3 times
longer than training BLSTM on the data used.
The data used in this research did not contain only Koran recitations from expert
speakers, as in previous research. Non-expert speakers were also included. These
data were taken from memorization recordings of students at a Qur'anic tahfidz
institution. Data for additional speech categories were also added. Previous
studies used only the male speech category, whereas in this study two additional
speech categories were included in the test data, namely the female and boy
speech categories.
The mGRUIPTC acoustic model was also tested for evaluating Koran recitation.
The speech recognition system was modified to recognize sounds at the phoneme
level, because 5 of the 6 errors that occur in reciting the Koran are
pronunciation errors, namely incorrect letters, lines, humming, thick and thin
sounds, and short length. Modifications were made to QScript, which in the
previous study mapped Arabic to Latin writing according to recitation at the word
level. The modifications add new rules that QScript had not handled before. In
the test results, the system works better in the male speech category. Moreover,
of the 5 errors in reading the Koran, the system is better at detecting line
errors and thick and thin errors. Overall, however, this system cannot be used to
evaluate Koran reading, because the phoneme error rate (PER) of the acoustic
model for predicting the major errors in reciting the Koran, namely errors in
letters, lines, and short length, reaches 26.82%.
An online speech recognition system prototype was also developed in this research.
The prototype was built using the mGRUIPTC model with the best configuration
obtained from testing. The system can record Koran recitation and provide
feedback on it directly.
|
---|
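
The summary describes a GRU-style layer whose projection combines the current input with the previous state output, and which also receives several future frames from the previous layer's output (temporal convolution). The record does not give the thesis's equations, so the following is only a minimal NumPy sketch of that idea under stated assumptions. The function name `mgruip_tc_layer`, the weight names, the single update gate, the ReLU candidate, the future offsets, and all dimensions are hypothetical choices made for illustration; normalization and training are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mgruip_tc_layer(x, W_proj, W_ctx, W_f, W_h, future_offsets=(1, 2)):
    """Illustrative GRU-style layer with input projection and future context.

    x       : (T, d_in) output sequence of the previous layer
    W_proj  : (d_proj, d_in + d_hid) projection of [current input; previous state]
    W_ctx   : (d_proj, len(future_offsets) * d_in) projection of future frames
    W_f     : (d_hid, d_proj) update-gate weights (assumed single gate)
    W_h     : (d_hid, d_proj) candidate-state weights
    """
    T, _ = x.shape
    d_hid = W_f.shape[0]
    h = np.zeros(d_hid)
    out = np.zeros((T, d_hid))
    for t in range(T):
        # Projection layer: mix the current input with the previous state output.
        v = W_proj @ np.concatenate([x[t], h])
        # Temporal convolution: add frames t+k of the lower layer's output,
        # repeating the last frame when t+k runs past the end of the sequence.
        future = np.concatenate([x[min(t + k, T - 1)] for k in future_offsets])
        v = v + W_ctx @ future
        f = sigmoid(W_f @ v)                # update gate
        h_cand = np.maximum(0.0, W_h @ v)   # candidate state (ReLU is an assumption)
        h = (1.0 - f) * h + f * h_cand
        out[t] = h
    return out

# Toy usage with random weights (all sizes are arbitrary).
rng = np.random.default_rng(0)
T, d_in, d_hid, d_proj = 6, 4, 5, 3
offsets = (1, 2)
x = rng.normal(size=(T, d_in))
W_proj = rng.normal(scale=0.5, size=(d_proj, d_in + d_hid))
W_ctx = rng.normal(scale=0.5, size=(d_proj, len(offsets) * d_in))
W_f = rng.normal(scale=0.5, size=(d_hid, d_proj))
W_h = rng.normal(scale=0.5, size=(d_hid, d_proj))
out = mgruip_tc_layer(x, W_proj, W_ctx, W_f, W_h, future_offsets=offsets)
print(out.shape)
```

In this sketch the gate and candidate weights act on the low-dimensional projection rather than on the full input and state, which is where a parameter saving could come from. Each output frame waits only for the largest future offset rather than for the whole utterance, in contrast to a bidirectional model.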