Sound event recognition in unstructured environments using spectrogram image processing

The objective of this research is to develop feature extraction and classification techniques for the task of sound event recognition (SER) in unstructured environments. Although this field is traditionally overshadowed by the popular field of automatic speech recognition (ASR), an SER system that c...

Bibliographic Details
Main Author:	Dennis, Jonathan William
Other Authors:	Chng Eng Siong
Format:	Theses and Dissertations
Language:	English
Published:	2014
Subjects:	DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition; DRNTU::Science::Mathematics::Applied mathematics::Signal processing; DRNTU::Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
Online Access:	https://hdl.handle.net/10356/59272
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-59272
record_format	dspace
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition DRNTU::Science::Mathematics::Applied mathematics::Signal processing DRNTU::Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
spellingShingle	DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition DRNTU::Science::Mathematics::Applied mathematics::Signal processing DRNTU::Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision Dennis, Jonathan William Sound event recognition in unstructured environments using spectrogram image processing
description	The objective of this research is to develop feature extraction and classification techniques for the task of sound event recognition (SER) in unstructured environments. Although this field is traditionally overshadowed by the popular field of automatic speech recognition (ASR), an SER system that can achieve human-like sound recognition performance opens up a range of novel application areas. These include acoustic surveillance, bio-acoustical monitoring, environmental context detection, healthcare applications and, more generally, the rich transcription of acoustic environments. The challenges in such environments are adverse effects such as noise, distortion and multiple sources, which are more likely to occur with distant microphones than with the close-talking microphones that are more common in ASR. In addition, the characteristics of acoustic events are less well defined than those of speech, and there is no sub-word dictionary available like the phonemes in speech. Therefore, the performance of ASR systems typically degrades dramatically in these unstructured environments, and it is important to develop new methods that perform well on this challenging task. In this thesis, the approach taken is to interpret the sound event as a two-dimensional spectrogram image, with time and frequency as the two axes. This enables novel methods for SER to be developed based on spectrogram image processing, inspired by techniques from the field of image processing. The motivation is to find an automatic approach to "spectrogram reading", where it is possible for humans to visually recognise the different sound event signatures in the spectrogram. The advantages of such an approach are twofold. Firstly, the sound event image representation makes it possible to naturally capture the sound information in a two-dimensional feature. This has advantages over conventional one-dimensional frame-based features, which capture only a slice of spectral information within a short time window. Secondly, the problem of detecting sound events in mixtures containing noise or overlapping sounds can be formulated in a way that is similar to image classification and object detection in the field of image processing. This makes it possible to draw on previous work in that field, while taking into account the fundamental differences between spectrograms and conventional images. With this new perspective, three novel solutions to the challenging task of robust SER are developed in this thesis. In the first study, a method for robust sound classification called the Spectrogram Image Feature (SIF) is developed, which is based on a global image feature extracted directly from the time-frequency spectrogram of the sound. This in turn leads to the development of a novel sound event image representation called the Subband Power Distribution (SPD) image. This is derived as an image representation of the stochastic distribution of spectral power over the sound clip, and can overcome some of the issues of extracting image features directly from the spectrogram. In the final study, the challenging task of simultaneous recognition of overlapping sounds in noisy environments is considered. An approach is proposed that is inspired by object recognition in image processing, where the task of finding an object in a cluttered scene has many parallels with detecting a sound event overlapped with other sources and noise. The proposed framework combines keypoint detection and local spectrogram feature extraction with a model that captures the geometrical distribution of the keypoints over time, frequency and spectral power. For each of the proposed systems, a detailed experimental evaluation is carried out to compare its performance against a range of state-of-the-art systems.
author2	Chng Eng Siong
author_facet	Chng Eng Siong Dennis, Jonathan William
format	Theses and Dissertations
author	Dennis, Jonathan William
author_sort	Dennis, Jonathan William
title	Sound event recognition in unstructured environments using spectrogram image processing
title_short	Sound event recognition in unstructured environments using spectrogram image processing
title_full	Sound event recognition in unstructured environments using spectrogram image processing
title_fullStr	Sound event recognition in unstructured environments using spectrogram image processing
title_full_unstemmed	Sound event recognition in unstructured environments using spectrogram image processing
title_sort	sound event recognition in unstructured environments using spectrogram image processing
publishDate	2014
url	https://hdl.handle.net/10356/59272
_version_	1759857810753978368
spelling	sg-ntu-dr.10356-59272 2023-03-04T00:48:59Z Sound event recognition in unstructured environments using spectrogram image processing Dennis, Jonathan William Chng Eng Siong School of Computer Engineering A*STAR Institute for Infocomm Research Tran Huy Dat DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition DRNTU::Science::Mathematics::Applied mathematics::Signal processing DRNTU::Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision DOCTOR OF PHILOSOPHY (SCE) 2014-04-29T01:57:59Z 2014-04-29T01:57:59Z 2014 2014 Thesis Dennis, J. W. (2014). Sound event recognition in unstructured environments using spectrogram image processing. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/59272 10.32657/10356/59272 en 208 p. application/pdf
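
The abstract describes treating a sound clip as a two-dimensional time-frequency image and, for the SPD image, representing the distribution of spectral power over the clip. The sketch below is only a rough illustration of that idea and is not the thesis's implementation: the function names, the frame length and hop size, the number of power levels, the global min-max normalisation, and the toy test tone are all assumptions made for this example.

```python
import cmath
import math


def log_power_spectrogram(signal, frame_len=64, hop=32):
    """Return a list of frames, each a list of log-power values per frequency bin."""
    # Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    # Precompute the DFT basis for the non-negative frequencies.
    basis = [[cmath.exp(-2j * math.pi * k * n / frame_len)
              for n in range(frame_len)]
             for k in range(n_bins)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = [sum(x * b for x, b in zip(frame, row)) for row in basis]
        # Small offset avoids log(0) for empty bins.
        frames.append([math.log10(abs(c) ** 2 + 1e-10) for c in spectrum])
    return frames


def subband_power_histogram(spec, n_levels=16):
    """Per frequency bin, histogram the normalised log power over time.

    Rows index frequency bins, columns index power levels; each row sums to 1.
    This is an assumed, simplified stand-in for a power-distribution image.
    """
    lo = min(min(f) for f in spec)
    hi = max(max(f) for f in spec)
    span = (hi - lo) or 1.0
    n_bins = len(spec[0])
    image = [[0.0] * n_levels for _ in range(n_bins)]
    for frame in spec:
        for k, value in enumerate(frame):
            # Map the normalised power to a level index, clamping the maximum.
            level = min(int((value - lo) / span * n_levels), n_levels - 1)
            image[k][level] += 1.0 / len(spec)
    return image


# Toy input: a 1 kHz tone sampled at 8 kHz (hypothetical test signal).
fs = 8000
tone = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(2048)]
spec = log_power_spectrogram(tone)
spd = subband_power_histogram(spec)
# Frequency bin with the most energy summed over all frames.
peak_bin = max(range(len(spec[0])), key=lambda k: sum(f[k] for f in spec))
```

With these assumed settings the bins are spaced fs / frame_len = 125 Hz apart, so the tone's energy concentrates around the bin for 1 kHz. Unlike the spectrogram, the histogram image has a fixed size regardless of clip length, which is one plausible reason a distribution-style image can be easier to compare across clips.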
