Classification of sound using machine learning

Bibliographic Details
Main Author: Tan, Ki In
Other Authors: Lee Bu Sung, Francis
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2023
Online Access: https://hdl.handle.net/10356/165777
Institution: Nanyang Technological University
Description
Summary: The soundscape of urban parks and cities is composed of a variety of natural and man-made sounds. The benefits of urban parks and the health problems caused by noise pollution make soundscape analysis a valuable topic of study in cities such as Singapore. However, collecting sound data for research and analysis is hampered by the difficulty of recruiting volunteers and of annotating samples manually and accurately. Using machine learning to predict and annotate environmental sounds can help with data collection. In my previous work, sound spectrum data was used in place of stored audio recordings, which may expose personal information about the data collection volunteer. Existing work demonstrated that sound spectrum data can be used to train a deep learning model that predicts with high accuracy. However, the model displayed signs of overfitting. This project explores ways to improve the accuracy of the existing sound spectrum data pipeline while limiting the overfitting found in the models used in the pipeline. An investigation of the pipeline's limitations found that the Convolutional Neural Network (CNN) architecture limited model performance because it cannot capture global relationships among sound spectrum features. A literature review and evaluation of transformer models for sound classification found the Audio Spectrogram Transformer (AST) to be the most suitable replacement for the CNN, because its performance remained stable when its batch size was reduced for training in a resource-constrained environment. Training AST on sound spectrum data yielded an accuracy of 62.30%, compared with the CNN's 46.50%, but the performance improvement was limited.
Class reduction and data augmentation techniques were also tested; both improved the accuracy of AST, with class reduction raising it to as much as 91.25%, but at the cost of a limited and less suitable set of class labels for evaluation. Ultimately, the use of state-of-the-art models and other performance-improving techniques in the sound spectrum data pipeline improved overall model accuracy, and future work is likely to investigate fine-tuning the data pipeline for deployment.
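
The trade-off noted for class reduction can be illustrated with a minimal sketch. The summary does not state how the project grouped its classes, so the label names, the `COARSE_LABEL` mapping, and the helper functions below are all hypothetical. The sketch only shows the relabelling step and how confusions between similar classes stop counting as errors once those classes are merged; it is not the project's pipeline.

```python
# Hypothetical mapping from fine-grained labels to coarser groups.
# These label names are illustrative and are not taken from the project.
COARSE_LABEL = {
    "bird": "nature",
    "insect": "nature",
    "rain": "nature",
    "car": "traffic",
    "motorcycle": "traffic",
    "speech": "human",
    "footsteps": "human",
}


def reduce_classes(labels, mapping=COARSE_LABEL):
    """Replace each fine-grained label with its coarse group."""
    return [mapping[label] for label in labels]


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the ground truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up ground truth and predictions, chosen so that every mistake
# confuses two labels belonging to the same coarse group.
y_true = ["bird", "insect", "car", "speech", "rain", "motorcycle"]
y_pred = ["insect", "insect", "motorcycle", "speech", "bird", "car"]

fine_acc = accuracy(y_true, y_pred)
coarse_acc = accuracy(reduce_classes(y_true), reduce_classes(y_pred))

print(f"fine-grained accuracy: {fine_acc:.2f}")
print(f"coarse accuracy:       {coarse_acc:.2f}")
```

In this toy data the coarse accuracy is higher only because the merged labels no longer distinguish, for example, a bird from an insect, which mirrors the cost described in the summary: fewer and less specific class labels for evaluation.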