Deep learning architectures for speech recognition
Choosing which deep learning architecture to use for speech recognition can be laborious. Additionally, improving the performance of a given architecture can require a lot of experimentation. The purpose of this project is to investigate different architectures used in speech recognition tasks and highlight their differences.
Saved in:
Main Author: | Yong, Jia Jie |
---|---|
Other Authors: | Jagath C. Rajapakse |
Format: | Final Year Project |
Language: | English |
Published: | 2018 |
Subjects: | DRNTU::Engineering::Computer science and engineering |
Online Access: | http://hdl.handle.net/10356/76176 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-76176 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-76176 2023-03-03T20:25:33Z Deep learning architectures for speech recognition Yong, Jia Jie Jagath C. Rajapakse School of Computer Science and Engineering DRNTU::Engineering::Computer science and engineering Bachelor of Engineering (Computer Science) 2018-11-22T13:34:10Z 2018 Final Year Project (FYP) http://hdl.handle.net/10356/76176 en Nanyang Technological University 45 p. application/pdf |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
DRNTU::Engineering::Computer science and engineering |
description |
Choosing which deep learning architecture to use for speech recognition can be laborious. Additionally, improving the performance of a given architecture can require a lot of experimentation. The purpose of this project is to investigate different architectures used in speech recognition tasks and highlight their differences. In addition, the performance impacts of different deep learning techniques, namely DeepSpeech and WaveNet, applied in a recurrent neural network are explored.
The baseline DeepSpeech model, trained with a dropout of 0.2367, produced a word error rate (WER) of 0.304, a loss of 27.039, and a mean edit distance of 0.178. Increasing the dropout value to 0.5 yielded a WER of 0.416, a loss of 38.613, and a mean edit distance of 0.259; reducing it to 0 yielded a WER of 0.310, a loss of 30.841, and a mean edit distance of 0.175.
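For context on the metrics reported above: WER is the word-level Levenshtein (edit) distance between the reference transcript and the hypothesis, normalized by the reference length. The following is a minimal sketch, not the project's own evaluation code; the function names are illustrative:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (space-optimized DP)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance between ref[:i] and hyp[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[j] = min(dp[j] + 1,      # deletion
                        dp[j - 1] + 1,  # insertion
                        prev + cost)    # substitution (or match)
            prev = cur
    return dp[n]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

For example, a single substituted word in a three-word reference gives a WER of 1/3.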
A DeepSpeech model with Batch Normalization applied to all layers achieved a WER of 0.275, a loss of 26.485, and a mean edit distance of 0.155. When Batch Normalization was applied only to the feedforward layers, the model achieved a WER of 0.2355, a loss of 22.973, and a mean edit distance of 0.133. When Batch Normalization was applied only to the feedforward layers without any dropout, the model achieved a WER of 0.305, a loss of 28.584, and a mean edit distance of 0.172.
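As background, Batch Normalization standardizes each activation over the mini-batch before applying a learnable scale and shift. A minimal per-feature sketch (the `gamma`, `beta`, and `eps` defaults are illustrative, not values from the project):

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a list of scalar activations over the batch dimension,
    then apply the learnable scale (gamma) and shift (beta)."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]
```

With the defaults, the normalized batch has approximately zero mean and unit variance.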
When running inference on audio files of 1.975 s, 2.735 s, and 2.590 s, DeepSpeech took 2.691 s, 3.325 s, and 2.788 s respectively, while WaveNet took 0.296 s, 0.221 s, and 0.261 s for the same files. However, DeepSpeech was found to consistently outperform WaveNet in transcription accuracy, even with its language model decoder deactivated.
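The timing figures above can be summarized as a real-time factor (RTF), i.e. processing time divided by audio duration, where values below 1 mean faster than real time. A short sketch using the numbers reported in the abstract (the helper name is illustrative):

```python
def real_time_factor(processing_s, audio_s):
    """RTF < 1 means the system transcribes faster than real time."""
    return processing_s / audio_s

audio = [1.975, 2.735, 2.590]        # audio durations (s)
deepspeech = [2.691, 3.325, 2.788]   # DeepSpeech inference times (s)
wavenet = [0.296, 0.221, 0.261]      # WaveNet inference times (s)

ds_rtf = [real_time_factor(p, a) for p, a in zip(deepspeech, audio)]
wn_rtf = [real_time_factor(p, a) for p, a in zip(wavenet, audio)]
```

On these files, every DeepSpeech RTF is above 1 and every WaveNet RTF is well below 1, consistent with the speed comparison above.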
Dropout was found to have a significant impact on the performance of a network, and its value must be tuned carefully. Batch Normalization does introduce performance improvements, but only when applied alongside dropout, and it should be applied only to the feedforward layers. WaveNet consistently outperformed DeepSpeech in transcription speed but produced less accurate transcriptions. One should therefore weigh the relative importance of transcription speed and accuracy before choosing between DeepSpeech and WaveNet. |
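For reference, dropout randomly zeroes each activation with probability p during training; in the common "inverted dropout" formulation, survivors are rescaled so expected activations are unchanged at inference time. A minimal sketch (not the project's implementation):

```python
import random

def dropout(activations, p, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p during training
    and scale survivors by 1/(1-p) so the expected value is unchanged."""
    if not training or p == 0:
        return list(activations)
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in activations]
```

At inference time (`training=False`) the activations pass through unchanged, which is what makes the train-time rescaling necessary.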