Sound event recognition in home environments

Bibliographic Details
Main Author: Ng, Terence Wen Zheng
Other Authors: Tran Huy Dat
Format: Theses and Dissertations
Language: English
Published: 2014
Online Access: https://hdl.handle.net/10356/61720
Institution: Nanyang Technological University
Description
Summary: The sounds that we hear in everyday environments contain a wide variety of acoustic information that assists us in accomplishing many daily tasks. Sound event recognition (SER) aims to automatically detect and classify these sounds to provide more information about the surroundings. This research focuses on sound events found in home environments, an area with a wide range of potential applications. However, the audio signal received in the home environment is completely unstructured, and these applications face a range of challenges. In this thesis, we focus on two problems commonly faced when performing SER in home environments: (1) the presence of interference noise and (2) limited training data.
The problem of interference noise refers to noise that is highly non-stationary and may be regarded as a signal itself. An example of interference noise commonly occurring in the home environment is the audio signal produced by the television (TV). Conventional noise robustness methods that aim to improve sound recognition under background noise typically assume that the noise is stationary or slowly changing. These assumptions do not hold for interference noise, so such methods are not effective in practice. To reduce the interference noise, many existing methods assume the use of an additional reference microphone to receive the TV signal. With knowledge of the TV reference signal, the problem is simplified to estimating the room impulse response using adaptive filtering. Instead of adaptive filtering, the approach taken in this thesis is based on a regression mapping in the frequency domain. This approach, called regressive noise cancellation (RNC), finds a global minimum of the error function instead of minimising the error function iteratively. While this is shown to improve the cancellation compared to previous techniques, some noise remains in the form of residual noise.
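The contrast between a closed-form regression and iterative adaptive filtering can be illustrated with a minimal sketch. It is not the thesis's RNC implementation: it assumes a single complex weight per frequency bin, works on synthetic spectrograms, and the names `regress_and_cancel`, `mic_spec`, and `ref_spec` are hypothetical.

```python
import numpy as np

def regress_and_cancel(mic_spec, ref_spec, eps=1e-12):
    """Per-bin least-squares cancellation (illustrative assumption, not RNC itself).

    mic_spec, ref_spec: complex arrays of shape (n_bins, n_frames).
    Returns (residual_spec, weights), where weights has shape (n_bins,).
    """
    # Closed-form minimiser of sum_t |Y[f, t] - w[f] * X[f, t]|^2 for each bin f,
    # obtained in one step rather than by iterative updates.
    cross = np.sum(np.conj(ref_spec) * mic_spec, axis=1)
    power = np.sum(np.abs(ref_spec) ** 2, axis=1)
    weights = cross / (power + eps)
    # Subtract the predicted interference from the microphone spectrum.
    residual = mic_spec - weights[:, None] * ref_spec
    return residual, weights

# Synthetic data: the reference (e.g. TV) passes through an unknown per-bin gain
# and is mixed with a weaker target sound event at the microphone.
rng = np.random.default_rng(0)
n_bins, n_frames = 8, 200
ref = rng.standard_normal((n_bins, n_frames)) + 1j * rng.standard_normal((n_bins, n_frames))
true_gain = rng.standard_normal(n_bins) + 1j * rng.standard_normal(n_bins)
event = 0.1 * (rng.standard_normal((n_bins, n_frames)) + 1j * rng.standard_normal((n_bins, n_frames)))
mic = true_gain[:, None] * ref + event

residual, weights = regress_and_cancel(mic, ref)
```

In this toy setting the estimated `weights` approximate `true_gain`, and `residual` contains the target event plus whatever interference the single-weight model could not explain, which corresponds loosely to the residual noise mentioned above.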
To address the residual noise, an existing subband power distribution image feature (SPD-IF) classification framework is employed to localize the noise and signal into separate regions, followed by a missing feature classification performed on the reliable parts. An enhancement to the SPD-IF is proposed in which the subband power distributions are estimated by utilising the temporal information across the subband. In the experimental results, the proposed RNC cancellation, together with the improved SPD-IF, outperforms several combinations of conventional cancellation and classification methods.
The second problem faced in home environments is that of limited training data, which often significantly worsens the performance of any recognition system. Collecting large databases has always been a big challenge, as labelling is time-consuming and expensive. One common way to overcome this problem is to use Semi-Supervised Learning (SSL), which utilizes an initial model trained on a small initial training data set to classify the unlabelled data. The most reliable samples are first selected and subsequently used to improve the initial model. While most research that deals with limited training data focuses on improving the performance of SSL methods, less attention has been paid to directly improving the features for the case of limited training data. To this end, the approach taken in this thesis is to make the features more discriminative, and for this a class-based compensation (CBC) method is proposed. The idea of CBC is to learn a set of filters for each of the binary classes of the SVM classifier to enhance the discriminative capability of the features for classification. To enable this, CBC employs Fisher Linear Discriminant (FLD) analysis on the power spectrum distribution between the class pairs to assign higher weights to the frequency components that best discriminate the class information.
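The idea of weighting frequency components by how well they separate a class pair can be sketched as follows. This is a simplified stand-in rather than the thesis's CBC method: it assumes a per-bin Fisher discriminant ratio instead of a full FLD projection, uses synthetic features, and the name `fisher_bin_weights` is hypothetical.

```python
import numpy as np

def fisher_bin_weights(class_a, class_b, eps=1e-12):
    """Per-bin Fisher ratio for one class pair (illustrative assumption, not CBC itself).

    class_a, class_b: arrays of shape (n_samples, n_bins), e.g. log-power spectra.
    Returns non-negative weights of shape (n_bins,) that sum to about 1.
    """
    # Between-class separation: squared difference of the per-bin class means.
    mean_gap = (class_a.mean(axis=0) - class_b.mean(axis=0)) ** 2
    # Within-class spread: sum of the per-bin class variances.
    spread = class_a.var(axis=0) + class_b.var(axis=0)
    score = mean_gap / (spread + eps)
    return score / (score.sum() + eps)

# Small synthetic training set in which only bin 2 separates the two classes.
rng = np.random.default_rng(1)
n_bins = 6
a = rng.standard_normal((20, n_bins))
b = rng.standard_normal((20, n_bins))
b[:, 2] += 5.0

weights = fisher_bin_weights(a, b)
# Applying the weights acts as a simple per-bin filter on the features.
compensated_a = a * weights
```

Here the weight for bin 2 dominates, so a classifier trained on the compensated features for this class pair would rely mostly on the component that actually discriminates the two classes.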
Experimental results show that the compensation performs well under the constraint of limited training samples. Moreover, the compensation method further improves accuracy when used in conjunction with previous noise robustness techniques in noisy environments. Together, the approaches presented in this thesis can form the basis of a robust SER system that performs well in home environments.