Machine learning based audio event recognition

Sound is an important information carrier: it conveys abundant information about the environment and is often used to assist environment perception and video surveillance. In audio event recognition, feature values are extracted from the environmental sound, classi...

Bibliographic Details
Main Author: Lu, Yujing
Other Authors: Jiang Xudong
Format: Thesis-Master by Coursework
Language: English
Published: Nanyang Technological University 2020
Subjects: Engineering::Electrical and electronic engineering
Online Access: https://hdl.handle.net/10356/140286
Institution: Nanyang Technological University
id sg-ntu-dr.10356-140286
record_format dspace
spelling sg-ntu-dr.10356-140286 2023-07-04T16:49:06Z Machine learning based audio event recognition Lu, Yujing Jiang Xudong School of Electrical and Electronic Engineering EXDJiang@ntu.edu.sg Engineering::Electrical and electronic engineering Master of Science (Signal Processing) 2020-05-27T12:53:31Z 2020-05-27T12:53:31Z 2020 Thesis-Master by Coursework https://hdl.handle.net/10356/140286 en application/pdf Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Electrical and electronic engineering
description Sound is an important information carrier: it conveys abundant information about the environment and is often used to assist environment perception and video surveillance. In audio event recognition, feature values are extracted from the environmental sound, classified, and assigned semantic labels such as beach, library, or forest. Audio scene recognition can be applied in many fields, including military reconnaissance, smart homes, security monitoring, and medical monitoring. Deep learning uses neural networks with multiple layers of perceptrons and has achieved great success in image recognition, machine translation, and other applications. It can also serve as the classifier in audio event recognition. Under supervision, deep learning can learn audio features automatically, which overcomes several disadvantages of manual feature selection: long time consumption, heavy manual work, and unstable results. This project therefore studies sound event recognition based on a variety of deep learning models. Deep neural networks with different structures are used for information extraction and representation learning on sound event samples, with the aim of improving the recognition accuracy of sound event recognition systems. First, a DNN-based audio scene recognition system is built, in which MFCC is used to extract audio features and the network consists of 10 dense layers and a dropout layer. This model achieved a training accuracy of 84.5%, but its test accuracy was under 40%. A CNN-based audio scene recognition system is also established.
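The MFCC features named in the abstract are built on the mel frequency scale. As a minimal sketch, assuming the common 2595·log10(1 + f/700) mel formula, an 8 kHz upper band edge, and 26 filters (none of these values are stated in the record, and the function names are illustrative only), the centre frequencies of a mel filterbank could be computed like this:

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (assumed 2595*log10(1 + f/700) form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters=26, f_min=0.0, f_max=8000.0):
    """Centre frequencies (Hz) of n_filters filters spaced evenly on the mel scale.

    n_filters, f_min and f_max are assumed values, not taken from the thesis.
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


if __name__ == "__main__":
    centers = mel_center_frequencies()
    # Spacing is narrow at low frequencies and widens towards f_max.
    print([round(c) for c in centers])
```

Even spacing in mels gives finer resolution at low frequencies than at high ones, which is the property MFCC front ends rely on.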
CNN was chosen because it is currently the most mainstream network structure in deep learning and performs well in image recognition and speech recognition. The system consists of 4 convolutional layers, 4 pooling layers, 1 flatten ("tiled") layer, 2 fully connected layers, and a dropout layer, which helps prevent the network from overfitting during training. This model reached a training accuracy of 80.5%, but its test accuracy was only around 77%. Finally, a CRNN-based audio scene recognition model was established, but its accuracy was lower than that of the CNN model and it took longer to train.
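The CNN layer stack described above can be traced shape by shape. The sketch below is a hedged illustration, not the thesis implementation: the 64×128 single-channel input, 3×3 kernels with "same" padding, 2×2 pooling, channel widths, dense width of 256, 10 output classes, and the position of the dropout layer are all assumptions, since the record gives only the layer counts.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_cnn(height, width, channels=(32, 64, 128, 128),
              kernel=3, pool=2, dense_units=256, n_classes=10):
    """List (layer name, output shape) for 4 conv + 4 pool, flatten, 2 dense, dropout.

    Every size argument is a hypothetical value chosen for illustration.
    """
    shapes = [("input", (height, width, 1))]
    h, w = height, width
    pad = kernel // 2  # 'same' padding for odd kernels (assumed)
    for i, c in enumerate(channels, start=1):
        # Convolution with stride 1 and 'same' padding keeps H and W.
        h, w = conv_out(h, kernel, 1, pad), conv_out(w, kernel, 1, pad)
        shapes.append((f"conv{i}", (h, w, c)))
        # Non-overlapping pooling divides H and W by the pool size.
        h, w = conv_out(h, pool, pool), conv_out(w, pool, pool)
        shapes.append((f"pool{i}", (h, w, c)))
    shapes.append(("flatten", (h * w * channels[-1],)))
    shapes.append(("dense1", (dense_units,)))
    shapes.append(("dropout", (dense_units,)))  # placement is an assumption
    shapes.append(("dense2", (n_classes,)))
    return shapes


if __name__ == "__main__":
    for name, shape in trace_cnn(64, 128):
        print(f"{name:8s} {shape}")
```

With these assumed sizes, the four pooling stages reduce a 64×128 input to 4×8, so the flatten layer yields 4 × 8 × 128 = 4096 values before the fully connected layers.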
author2 Jiang Xudong
author_facet Jiang Xudong
Lu, Yujing
format Thesis-Master by Coursework
author Lu, Yujing
author_sort Lu, Yujing
title Machine learning based audio event recognition
publisher Nanyang Technological University
publishDate 2020
url https://hdl.handle.net/10356/140286
_version_ 1772826132869021696