Noise robust voice activity detection

Voice activity detection (VAD) is a fundamental task in various speech-related applications, such as speech coding, speaker diarization and speech recognition. It is often defined as the problem of distinguishing speech from silence/noise. A typical VAD system consists of two core parts: a feature extraction stage and a speech/non-speech decision mechanism. The first part extracts a set of parameters from the signal, which the second part uses to make the final speech/non-speech decision based on a set of decision rules.

Most VAD features proposed in the literature exploit the discriminative characteristics of speech in different domains and can be divided into five categories: energy-based features, spectral-domain features, cepstral-domain features, harmonicity-based features, and long-term features. Energy-based features are simple and can be easily implemented in hardware. Spectral-domain and cepstral-domain features are more noise robust at low SNRs, as they benefit from a wide class of filtering and speech analysis techniques in these domains. When the SNR is around 0 dB, or when the background noise contains complex acoustical events, features relying on the harmonic structure of voiced speech, as well as those that exploit the long-term variability of speech, appear to be more robust.

The second part of a VAD system decides the speech or non-speech class for each signal segment. Existing decision-making mechanisms can be divided into three categories: thresholding, statistical modelling and machine learning. The first is the simplest, yet sufficient in many cases where the features employed possess good discriminative power. The latter two can work well at high SNRs, but their performance declines quickly at lower SNRs.

In order to derive a state-of-the-art VAD algorithm, a comparative study has been carried out in this thesis to evaluate different VAD techniques. Traditionally, VAD algorithms are evaluated as a holistic system, from which it is hard to analyse whether a performance gain comes from a new feature or a new decision mechanism. In this thesis, the author examines the use of P_e, the probability of error of two given distributions, to measure the performance of a VAD feature separately from the other modules in the system. The metric represents the discriminative power of a feature when used for classifying speech and non-speech. The result is a fairer comparison and a more compact performance representation, which allows a deeper analysis of VAD features and reveals interesting trends across different SNRs.

Secondly, a new approach to VAD is proposed, which tackles cases where the SNR can be lower than 0 dB and the background might contain complex audible events. The proposed idea exploits the sub-regions of the noisy speech spectrum that still retain sufficient harmonic structure of human voiced speech. This allows a more robust feature, based on the local harmonicity of the spectral autocorrelation of voiced speech, to be derived to reliably detect heavily corrupted voiced speech segments. Experimental results showed a significant improvement over a recently proposed method in the same category.
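
To make the two-part structure described above concrete, the sketch below pairs the simplest choice from each category: a frame log-energy feature and a fixed-threshold decision rule. It is illustrative only and not taken from the thesis; the frame length, hop size and threshold are assumed values.

```python
# Minimal two-part VAD sketch: feature extraction (frame log-energy)
# followed by a decision mechanism (fixed threshold).
# Frame/hop lengths correspond to 25 ms / 10 ms at 16 kHz (assumed).
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal (len >= frame_len) into overlapping frames."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def log_energy(frames, eps=1e-10):
    """Feature extraction: per-frame log energy."""
    return np.log(np.sum(frames ** 2, axis=1) + eps)

def threshold_decision(feature, threshold):
    """Decision mechanism: 1 = speech, 0 = non-speech."""
    return (feature > threshold).astype(int)

# Example on a random signal standing in for one second of audio.
x = np.random.randn(16000)
feat = log_energy(frame_signal(x))
labels = threshold_decision(feat, threshold=np.median(feat))
```

More robust systems substitute the spectral, cepstral, harmonicity-based or long-term features and the statistical or machine-learning decision mechanisms surveyed above.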

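The P_e metric mentioned above is the probability of error of two given distributions, used to score a feature's discriminative power independently of the decision mechanism. The sketch below assumes one common way to estimate such an error, from the overlap of the empirical speech and non-speech feature distributions with equal priors; the estimator actually used in the thesis may differ.

```python
# Hedged sketch of a P_e-style measure: estimate the classification
# error obtainable from a single scalar feature, using histogram
# density estimates of the speech and non-speech distributions and
# assuming equal class priors.
import numpy as np

def probability_of_error(feat_speech, feat_nonspeech, n_bins=100):
    """Estimate 0.5 * integral of min(p_speech, p_nonspeech)."""
    lo = min(feat_speech.min(), feat_nonspeech.min())
    hi = max(feat_speech.max(), feat_nonspeech.max())
    bins = np.linspace(lo, hi, n_bins + 1)
    p_s, _ = np.histogram(feat_speech, bins=bins, density=True)
    p_n, _ = np.histogram(feat_nonspeech, bins=bins, density=True)
    bin_width = bins[1] - bins[0]
    # Overlap of the two densities: smaller means a more discriminative feature.
    return 0.5 * np.sum(np.minimum(p_s, p_n)) * bin_width

# Example: well-separated distributions give P_e near 0,
# identical distributions give P_e near 0.5.
speech_feat = np.random.normal(3.0, 1.0, 5000)
noise_feat = np.random.normal(0.0, 1.0, 5000)
print(probability_of_error(speech_feat, noise_feat))
```

A P_e near 0 indicates a highly discriminative feature, while a value near 0.5 means the speech and non-speech distributions are essentially indistinguishable.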
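
The second contribution relies on the local harmonicity of the spectral autocorrelation: sub-regions of a noisy spectrum in which the harmonic peaks of voiced speech survive produce a strong, regularly spaced peak in the autocorrelation of the magnitude spectrum. The sketch below illustrates that idea only; the sub-band layout, lag range and normalisation are assumptions for illustration, not the feature definition from the thesis.

```python
# Illustrative local-harmonicity score: autocorrelate the magnitude
# spectrum within sub-bands and keep the strongest normalised peak over
# lags that could correspond to a harmonic (F0) spacing.
import numpy as np

def local_spectral_harmonicity(frame, n_fft=1024, band_len=128, hop=64):
    """Return the strongest normalised spectral-autocorrelation peak
    found in any sub-band of the frame's magnitude spectrum."""
    spec = np.abs(np.fft.rfft(frame, n_fft))
    best = 0.0
    for start in range(0, len(spec) - band_len + 1, hop):
        band = spec[start:start + band_len]
        band = band - band.mean()
        denom = np.dot(band, band)
        if denom <= 0:
            continue
        # Normalised autocorrelation; lags 4-25 bins roughly cover
        # harmonic spacings of 60-400 Hz at 16 kHz with n_fft = 1024.
        ac = np.correlate(band, band, mode="full")[band_len - 1:] / denom
        best = max(best, float(ac[4:26].max()))
    return best  # close to 1 for strongly harmonic (voiced) regions
```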

Bibliographic Details
Main Author: Pham, Chau Khoa.
Other Authors: Chng Eng Siong
School: School of Computer Engineering, Parallel and Distributed Computing Centre
Format: Theses and Dissertations
Degree: Master of Engineering (SCE)
Language: English
Published: 2013
Physical Description: 82 p. (application/pdf)
Subjects: DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition
Online Access: http://hdl.handle.net/10356/52255
Institution: Nanyang Technological University