Deep learning architectures for speech recognition
Saved in:
Main Author:
Other Authors:
Format: Final Year Project
Language: English
Published: 2018
Subjects:
Online Access: http://hdl.handle.net/10356/76176
Institution: Nanyang Technological University
Summary: Choosing which deep learning architecture to use for speech recognition can be laborious. Additionally, improving the performance of a given architecture can require a lot of experimentation. The purpose of this project is to investigate different architectures used in speech recognition tasks and highlight their differences. In addition, the performance impacts of different deep learning techniques applied in a recurrent neural network are explored, using two models: DeepSpeech and WaveNet.
The baseline DeepSpeech model, trained with a dropout rate of 0.2367, produced a word error rate (WER) of 0.304, a loss of 27.039, and a mean edit distance of 0.178. Increasing the dropout rate to 0.5 produced a WER of 0.416, a loss of 38.613, and a mean edit distance of 0.259. Reducing the dropout rate to 0 produced a WER of 0.310, a loss of 30.841, and a mean edit distance of 0.175.
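For context, the WER reported above is the word-level Levenshtein (edit) distance between a transcription hypothesis and the reference, normalized by the reference length; the mean edit distance is the analogous character-level quantity. A minimal sketch of the computation (illustrative only, not the project's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sit"))  # 1 substitution / 3 words = 0.333
```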
A DeepSpeech model with Batch Normalization applied to all layers produced a WER of 0.275, a loss of 26.485, and a mean edit distance of 0.155. When Batch Normalization was applied only to the feedforward layers, the model produced a WER of 0.2355, a loss of 22.973, and a mean edit distance of 0.133. When Batch Normalization was applied only to the feedforward layers without any dropout, the model produced a WER of 0.305, a loss of 28.584, and a mean edit distance of 0.172.
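The best configuration above, Batch Normalization on the feedforward layers only and combined with dropout, could be sketched in PyTorch as follows. This is an illustrative assumption, not the project's actual model; the layer sizes (494 input features, 2048 hidden units, 29 output characters) echo common DeepSpeech setups but are placeholders here:

```python
import torch.nn as nn

class DeepSpeechLike(nn.Module):
    """Sketch of a DeepSpeech-style network with Batch Normalization
    applied only to the feedforward layers, alongside dropout."""

    def __init__(self, n_feats=494, n_hidden=2048, n_classes=29, dropout=0.2367):
        super().__init__()
        def ff_block(n_in, n_out):
            # BatchNorm + dropout on each feedforward layer only.
            return nn.Sequential(nn.Linear(n_in, n_out), nn.BatchNorm1d(n_out),
                                 nn.ReLU(), nn.Dropout(dropout))
        self.ff = nn.Sequential(ff_block(n_feats, n_hidden),
                                ff_block(n_hidden, n_hidden),
                                ff_block(n_hidden, n_hidden))
        # The recurrent layer is left without Batch Normalization.
        self.rnn = nn.RNN(n_hidden, n_hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * n_hidden, n_classes)  # per-frame CTC logits

    def forward(self, x):                       # x: (batch, time, n_feats)
        b, t, f = x.shape
        h = self.ff(x.reshape(b * t, f)).reshape(b, t, -1)
        h, _ = self.rnn(h)
        return self.out(h)                      # (batch, time, n_classes)
```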
When running inference on audio files lasting 1.975s, 2.735s, and 2.590s, DeepSpeech took 2.691s, 3.325s, and 2.788s respectively. For the same audio files, WaveNet took 0.296s, 0.221s, and 0.261s. However, DeepSpeech was found to consistently outperform WaveNet in transcription accuracy, even with DeepSpeech's language model decoder deactivated.
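Latency comparisons of this kind amount to wall-clocking each model's transcription call; a hedged sketch, where `transcribe` stands in for whichever inference function a given model wrapper actually exposes:

```python
import time

def time_transcription(transcribe, audio_path):
    """Wall-clock one inference call. `transcribe` is a stand-in for
    the model's inference function (hypothetical, not a real API)."""
    start = time.perf_counter()
    text = transcribe(audio_path)
    return text, time.perf_counter() - start

# Hypothetical usage with any callable model wrapper:
# text, secs = time_transcription(model.transcribe, "clip_1975ms.wav")
# print(f"{secs:.3f}s -> {text}")
```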
Dropout was found to have a significant impact on the performance of a network, and its value must be tuned carefully. Batch Normalization does introduce performance improvements, but only when applied alongside dropout, and it should be applied only to the feedforward layers. WaveNet consistently outperformed DeepSpeech in transcription speed, but produced less accurate transcriptions. One should weigh the relative importance of transcription speed and accuracy before choosing between DeepSpeech and WaveNet.