Robust visual voice activity detection using Long Short-Term Memory recurrent neural network

© Springer International Publishing Switzerland 2016. Many traditional visual voice activity detection systems utilize features extracted from mouth-region images, which are sensitive to noisy observations in the visual domain. In addition, the hyperparameters of the feature extraction process, which modulate the desired compromise between robustness, efficiency, and accuracy of the algorithm, are difficult to determine. Therefore, a visual voice activity detection algorithm is proposed which utilizes only simple lip shape information as features and a Long Short-Term Memory recurrent neural network (LSTM-RNN) as a classifier. Face detection is performed by a structural SVM based on histogram of oriented gradients (HOG) features. The detected face template is used to initialize a kernelized correlation filter tracker. Facial landmark coordinates are then extracted from the tracked face. The centroid distance function is applied to the geometrically normalized landmarks surrounding the outer and inner lip contours. Finally, discriminative (LSTM-RNN) and generative (Hidden Markov Model) methods are used to model the temporal lip shape sequences during speech and non-speech intervals, and their classification performances are compared. Experimental results show that the proposed algorithm using an LSTM-RNN can achieve a classification rate of 98% in labeling speech and non-speech periods. It is robust and efficient for real-time applications.

Bibliographic Details
Main Authors: Zaw Htet Aung, Panrasee Ritthipravat
Other Authors: Mahidol University
Format: Conference or Workshop Item
Published: 2018
Subjects:
Online Access:https://repository.li.mahidol.ac.th/handle/123456789/43477
Institution: Mahidol University
id th-mahidol.43477
record_format dspace
spelling th-mahidol.434772019-03-14T15:04:32Z Robust visual voice activity detection using Long Short-Term Memory recurrent neural network Zaw Htet Aung Panrasee Ritthipravat Mahidol University Computer Science Mathematics © Springer International Publishing Switzerland 2016. Many traditional visual voice activity detection systems utilize features extracted from mouth-region images, which are sensitive to noisy observations in the visual domain. In addition, the hyperparameters of the feature extraction process, which modulate the desired compromise between robustness, efficiency, and accuracy of the algorithm, are difficult to determine. Therefore, a visual voice activity detection algorithm is proposed which utilizes only simple lip shape information as features and a Long Short-Term Memory recurrent neural network (LSTM-RNN) as a classifier. Face detection is performed by a structural SVM based on histogram of oriented gradients (HOG) features. The detected face template is used to initialize a kernelized correlation filter tracker. Facial landmark coordinates are then extracted from the tracked face. The centroid distance function is applied to the geometrically normalized landmarks surrounding the outer and inner lip contours. Finally, discriminative (LSTM-RNN) and generative (Hidden Markov Model) methods are used to model the temporal lip shape sequences during speech and non-speech intervals, and their classification performances are compared. Experimental results show that the proposed algorithm using an LSTM-RNN can achieve a classification rate of 98% in labeling speech and non-speech periods. It is robust and efficient for real-time applications. 2018-12-11T02:41:20Z 2019-03-14T08:04:32Z 2018-12-11T02:41:20Z 2019-03-14T08:04:32Z 2016-01-01 Conference Paper Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 
Vol.9431, (2016), 380-391 10.1007/978-3-319-29451-3_31 16113349 03029743 2-s2.0-84959019631 https://repository.li.mahidol.ac.th/handle/123456789/43477 Mahidol University SCOPUS https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=84959019631&origin=inward
institution Mahidol University
building Mahidol University Library
continent Asia
country Thailand
content_provider Mahidol University Library
collection Mahidol University Institutional Repository
topic Computer Science
Mathematics
spellingShingle Computer Science
Mathematics
Zaw Htet Aung
Panrasee Ritthipravat
Robust visual voice activity detection using Long Short-Term Memory recurrent neural network
description © Springer International Publishing Switzerland 2016. Many traditional visual voice activity detection systems utilize features extracted from mouth-region images, which are sensitive to noisy observations in the visual domain. In addition, the hyperparameters of the feature extraction process, which modulate the desired compromise between robustness, efficiency, and accuracy of the algorithm, are difficult to determine. Therefore, a visual voice activity detection algorithm is proposed which utilizes only simple lip shape information as features and a Long Short-Term Memory recurrent neural network (LSTM-RNN) as a classifier. Face detection is performed by a structural SVM based on histogram of oriented gradients (HOG) features. The detected face template is used to initialize a kernelized correlation filter tracker. Facial landmark coordinates are then extracted from the tracked face. The centroid distance function is applied to the geometrically normalized landmarks surrounding the outer and inner lip contours. Finally, discriminative (LSTM-RNN) and generative (Hidden Markov Model) methods are used to model the temporal lip shape sequences during speech and non-speech intervals, and their classification performances are compared. Experimental results show that the proposed algorithm using an LSTM-RNN can achieve a classification rate of 98% in labeling speech and non-speech periods. It is robust and efficient for real-time applications.
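The per-frame feature described above converts geometrically normalized lip-contour landmarks into centroid distances. A minimal sketch of that descriptor follows; the function name, the choice of scale normalization, and the toy landmark set are illustrative assumptions, since the record does not give the authors' exact implementation:

```python
import numpy as np

def centroid_distance(landmarks):
    """Centroid distance function: the Euclidean distance of each
    contour point from the contour centroid. A simple shape
    descriptor for the lip landmarks (illustrative sketch, not the
    authors' exact code).
    """
    pts = np.asarray(landmarks, dtype=float)      # shape (N, 2)
    centroid = pts.mean(axis=0)                   # contour centroid
    d = np.linalg.norm(pts - centroid, axis=1)    # distance per point
    # Scale normalization (an assumed choice) makes the descriptor
    # insensitive to face size / distance from the camera.
    return d / d.mean()

# Toy example: 8 evenly spaced points on a circle, standing in for
# lip-contour landmarks. All centroid distances are equal, so the
# normalized descriptor is constant.
theta = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
desc = centroid_distance(circle)
```

A sequence of such per-frame descriptors would then form the temporal input that the LSTM-RNN or HMM classifies into speech and non-speech intervals.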
author2 Mahidol University
author_facet Mahidol University
Zaw Htet Aung
Panrasee Ritthipravat
format Conference or Workshop Item
author Zaw Htet Aung
Panrasee Ritthipravat
author_sort Zaw Htet Aung
title Robust visual voice activity detection using Long Short-Term Memory recurrent neural network
title_short Robust visual voice activity detection using Long Short-Term Memory recurrent neural network
title_full Robust visual voice activity detection using Long Short-Term Memory recurrent neural network
title_fullStr Robust visual voice activity detection using Long Short-Term Memory recurrent neural network
title_full_unstemmed Robust visual voice activity detection using Long Short-Term Memory recurrent neural network
title_sort robust visual voice activity detection using long short-term memory recurrent neural network
publishDate 2018
url https://repository.li.mahidol.ac.th/handle/123456789/43477
_version_ 1763496141646725120