Machine learning based audio event recognition

As an important information carrier, sound carries abundant information about the environment and is often used to assist environment perception and video surveillance. In audio event recognition, feature values are extracted from the environmental sound, classi...

Bibliographic Details
Main Author:	Lu, Yujing
Other Authors:	Jiang Xudong
Format:	Thesis-Master by Coursework
Language:	English
Published:	Nanyang Technological University 2020
Subjects:	Engineering::Electrical and electronic engineering
Online Access:	https://hdl.handle.net/10356/140286
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-140286
record_format	dspace
spelling	sg-ntu-dr.10356-140286 2023-07-04T16:49:06Z Machine learning based audio event recognition Lu, Yujing Jiang Xudong School of Electrical and Electronic Engineering EXDJiang@ntu.edu.sg Engineering::Electrical and electronic engineering Master of Science (Signal Processing) 2020-05-27T12:53:31Z 2020-05-27T12:53:31Z 2020 Thesis-Master by Coursework https://hdl.handle.net/10356/140286 en application/pdf Nanyang Technological University
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Engineering::Electrical and electronic engineering
spellingShingle	Engineering::Electrical and electronic engineering Lu, Yujing Machine learning based audio event recognition
description	As an important information carrier, sound carries abundant information about the environment and is often used to assist environment perception and video surveillance. In audio event recognition, feature values are extracted from the environmental sound, classified, and assigned semantic labels such as beach, library, or forest. Audio scene recognition can be used in many fields, including military reconnaissance, smart homes, security monitoring, and medical monitoring. Deep learning uses neural networks with multiple layers of perceptrons and has achieved great success in image recognition, machine translation, and other applications. It can also serve as the classifier in audio event recognition. Under supervision, deep learning can learn audio features automatically, which can overcome many disadvantages, including long time consumption, heavy manual work, and unstable manual selection of features. To address these problems, this project studies sound event recognition based on a variety of deep learning models. Deep neural networks with different structures are used for information extraction and representation learning on sound event samples, with the aim of improving the recognition accuracy of sound event recognition systems. First, a DNN-based audio scene recognition system is built, in which MFCC is used to extract audio features and the network consists of 10 dense layers and a dropout layer. This model achieved a training accuracy of 84.5%, but its test accuracy was under 40%. A CNN-based audio scene recognition system is also established. CNN was chosen because it is currently the most mainstream network structure in deep learning and performs well in image recognition and speech recognition. The system consists of 4 convolutional layers, 4 pooling layers, 1 tiled layer, 2 fully connected layers, and a dropout layer, which helps prevent the network from overfitting during training. This model reached a training accuracy of 80.5%, while its test accuracy was around 77%. Finally, a CRNN-based audio scene recognition model was established, but its accuracy was lower than that of the CNN model, and it took longer to train.
author2	Jiang Xudong
author_facet	Jiang Xudong Lu, Yujing
format	Thesis-Master by Coursework
author	Lu, Yujing
author_sort	Lu, Yujing
title	Machine learning based audio event recognition
publisher	Nanyang Technological University
publishDate	2020
url	https://hdl.handle.net/10356/140286
_version_	1772826132869021696
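
The CNN layout named in the description (4 convolutional layers, 4 pooling layers, a tiled layer, 2 fully connected layers, and a dropout layer) can be illustrated with a small shape trace. This is only a sketch, not the thesis implementation: the record gives no input size, kernel size, filter counts, layer ordering, or class count, so the 40 x 128 MFCC input, the filter list, the 256 hidden units, the 10 classes, the placement of dropout between the two dense layers, and the reading of "tiled layer" as a flatten step are all assumptions made here for illustration.

```python
# Shape trace for a CNN laid out like the one in the description.
# All numeric values and the layer ordering are illustrative assumptions;
# the catalog record does not state them.

def conv_same(shape, filters):
    """Stride-1 convolution with 'same' padding: keeps height and width."""
    height, width, _ = shape
    return (height, width, filters)

def max_pool(shape, size=2):
    """Non-overlapping pooling: floor-divides height and width by `size`."""
    height, width, channels = shape
    return (height // size, width // size, channels)

def trace_cnn(input_shape, filters_per_stage, hidden_units, num_classes):
    """Return (layer name, output shape) pairs for the whole stack."""
    trace = [("input", input_shape)]
    shape = input_shape
    for i, filters in enumerate(filters_per_stage, start=1):
        shape = conv_same(shape, filters)
        trace.append((f"conv{i}", shape))
        shape = max_pool(shape)
        trace.append((f"pool{i}", shape))
    # "Tiled layer" is interpreted here as flattening to a 1-D vector.
    flat = shape[0] * shape[1] * shape[2]
    trace.append(("flatten", (flat,)))
    trace.append(("dense1", (hidden_units,)))
    # Dropout does not change the shape; its position is an assumption.
    trace.append(("dropout", (hidden_units,)))
    trace.append(("dense2", (num_classes,)))
    return trace

# Hypothetical input: 40 MFCC coefficients x 128 frames, one channel.
layers = trace_cnn((40, 128, 1), [16, 32, 64, 128],
                   hidden_units=256, num_classes=10)
for name, shape in layers:
    print(f"{name:8s} {shape}")
```

With these assumed sizes, four rounds of 2 x 2 pooling shrink the 40 x 128 input to 2 x 8, so the flatten step hands 2048 values to the first fully connected layer. Different input or filter sizes would change those figures but not the overall layer sequence.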

