Robust voice activity detection using DNN approaches

Voice activity detection (VAD) is a pivotal component in various speech processing applications, playing a crucial role in tasks such as speech recognition, speaker diarization, and noise suppression. Recognizing its significance, this thesis delves into the exploration of advancements in single-cha...

Bibliographic Details
Main Author:	Parashar Kshitij
Other Authors:	Chng Eng Siong
Format:	Final Year Project
Language:	English
Published:	Nanyang Technological University 2024
Subjects:	Computer and Information Science; Voice activity detection; Single channel audio; Machine learning; Deep learning; Artificial intelligence; Pyannote; Silero; Marblenet; Speech activity detection; Hyperparameter tuning; Open neural network exchange; Kaldi toolkit
Online Access:	https://hdl.handle.net/10356/175226
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-175226
record_format	dspace
spelling	sg-ntu-dr.10356-175226 2024-05-17T15:37:10Z Robust voice activity detection using DNN approaches Parashar Kshitij Chng Eng Siong School of Computer Science and Engineering ASESChng@ntu.edu.sg Bachelor's degree 2024-04-21T23:23:15Z 2024-04-21T23:23:15Z 2024 Final Year Project (FYP) Parashar Kshitij (2024). Robust voice activity detection using DNN approaches. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/175226 https://hdl.handle.net/10356/175226 en SCSE23-0748 application/pdf Nanyang Technological University
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Computer and Information Science Voice activity detection Single channel audio Machine learning Deep learning Artificial intelligence Pyannote Silero Marblenet Speech activity detection Hyperparameter tuning Open neural network exchange Kaldi toolkit
spellingShingle	Computer and Information Science Voice activity detection Single channel audio Machine learning Deep learning Artificial intelligence Pyannote Silero Marblenet Speech activity detection Hyperparameter tuning Open neural network exchange Kaldi toolkit Parashar Kshitij Robust voice activity detection using DNN approaches
description	Voice activity detection (VAD) is a key component of many speech processing applications, including speech recognition, speaker diarization, and noise suppression. This thesis explores advances in single-channel VAD systems based on deep learning. We evaluate three prominent VAD models, Pyannote, Silero, and MarbleNet, across a range of conditions and scenarios, and examine how parameters such as chunk size, stride, and prediction threshold affect model performance. Pyannote is the standout performer on the DIHARD III dataset, exceeding Silero by approximately 16.87% and MarbleNet by approximately 25.97% in accuracy. We therefore focus on enhancing Pyannote: we study how different parameters affect its performance and train models on varying chunk sizes and strides. We conclude that models trained on small chunk sizes and strides do not necessarily perform well during inference with small chunks and strides. Additionally, we examine scalability and production readiness, exploring strategies facilitated by the Open Neural Network Exchange (ONNX) framework. These efforts provide insights that can support the development of more robust and efficient voice activity detection systems capable of meeting the needs of modern speech processing applications.
author2	Chng Eng Siong
author_facet	Chng Eng Siong Parashar Kshitij
format	Final Year Project
author	Parashar Kshitij
author_sort	Parashar Kshitij
title	Robust voice activity detection using DNN approaches
title_short	Robust voice activity detection using DNN approaches
title_full	Robust voice activity detection using DNN approaches
title_fullStr	Robust voice activity detection using DNN approaches
title_full_unstemmed	Robust voice activity detection using DNN approaches
title_sort	robust voice activity detection using dnn approaches
publisher	Nanyang Technological University
publishDate	2024
url	https://hdl.handle.net/10356/175226
_version_	1814047180833423360

Similar Items