Machine learning based audio event recognition
As an important information carrier, sound conveys rich information about the environment and is often used to assist environment perception and video surveillance. In audio event recognition, features are extracted from the environmental sound, classi...
Main Author: | Lu, Yujing |
---|---|
Other Authors: | Jiang Xudong |
Format: | Thesis-Master by Coursework |
Language: | English |
Published: | Nanyang Technological University, 2020 |
Subjects: | Engineering::Electrical and electronic engineering |
Online Access: | https://hdl.handle.net/10356/140286 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-140286 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-140286 2023-07-04T16:49:06Z Machine learning based audio event recognition Lu, Yujing Jiang Xudong School of Electrical and Electronic Engineering EXDJiang@ntu.edu.sg Engineering::Electrical and electronic engineering Master of Science (Signal Processing) 2020-05-27T12:53:31Z 2020-05-27T12:53:31Z 2020 Thesis-Master by Coursework https://hdl.handle.net/10356/140286 en application/pdf Nanyang Technological University |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Electrical and electronic engineering |
spellingShingle |
Engineering::Electrical and electronic engineering Lu, Yujing Machine learning based audio event recognition |
description |
As an important information carrier, sound conveys rich information about the environment and is often used to assist environment perception and video surveillance. In audio event recognition, features are extracted from the environmental sound, classified, and given semantic labels such as beach, library, or forest. Audio scene recognition can be used in various fields, such as military reconnaissance, smart homes, security monitoring, and medical monitoring. Deep learning uses neural networks with multiple layers of perceptrons and has achieved great success in image recognition, machine translation, and other applications. Deep learning can also serve as the classifier in audio event recognition. Under supervision, it can learn audio features automatically, which avoids many disadvantages of manual feature design, including long time consumption, heavy manual work, and unstable manual selection of features.
To address these problems, this project studies sound event recognition based on a variety of deep learning models. Deep neural networks with different structures are used for information extraction and representation learning on sound event samples, with the aim of improving the recognition accuracy of sound event recognition systems.
In this project, a DNN-based audio scene recognition system is built in which MFCC (Mel-frequency cepstral coefficient) features are extracted from the audio; the network consists of 10 dense layers and a dropout layer. This model achieved a training accuracy of 84.5%, but its test accuracy was under 40%.
A CNN-based audio scene recognition system is also built. CNN was chosen because it is currently the most widely used network structure in deep learning and performs well in image recognition and speech recognition. The system consists of 4 convolutional layers, 4 pooling layers, 1 flatten layer, 2 fully-connected layers, and a dropout layer, which helps prevent the network from overfitting during training. This model reached a training accuracy of 80.5%, but its test accuracy was only around 77%.
Finally, a CRNN-based audio scene recognition model was built, but its accuracy was lower than that of the CNN model and it took longer to train. |
author2 |
Jiang Xudong |
author_facet |
Jiang Xudong Lu, Yujing |
format |
Thesis-Master by Coursework |
author |
Lu, Yujing |
author_sort |
Lu, Yujing |
title |
Machine learning based audio event recognition |
title_short |
Machine learning based audio event recognition |
title_full |
Machine learning based audio event recognition |
title_fullStr |
Machine learning based audio event recognition |
title_full_unstemmed |
Machine learning based audio event recognition |
title_sort |
machine learning based audio event recognition |
publisher |
Nanyang Technological University |
publishDate |
2020 |
url |
https://hdl.handle.net/10356/140286 |
_version_ |
1772826132869021696 |
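
The CNN layout named in the description (4 convolutional layers, 4 pooling layers, a flattening step, two fully-connected layers and a dropout layer) can be illustrated by tracing tensor shapes through such a stack. This is only a sketch under stated assumptions: the record does not give the input size, kernel sizes, channel counts, hidden width, number of classes, or where the dropout layer sits. The 40 x 128 single-channel input, the 3x3 "same"-padded convolutions, the 2x2 pooling, the channel counts, the hidden width of 128, the 10 output classes, and the dropout position between the two dense layers are all hypothetical values chosen for illustration.

```python
# Shape trace for a CNN with 4 conv + 4 pooling layers, a flatten step,
# two fully-connected layers and one dropout layer.
# Every size used here is an illustrative assumption, not a value from the thesis.

def conv_same(shape, out_channels):
    """3x3 convolution, stride 1, 'same' padding: height and width unchanged."""
    height, width, _ = shape
    return (height, width, out_channels)

def max_pool_2x2(shape):
    """2x2 max pooling, stride 2: height and width halved (floor division)."""
    height, width, channels = shape
    return (height // 2, width // 2, channels)

def trace_cnn(input_shape, conv_channels, hidden_units, num_classes):
    """Return (layer_name, output_shape) pairs for the whole stack."""
    trace = [("input", input_shape)]
    shape = input_shape
    for i, channels in enumerate(conv_channels, start=1):
        shape = conv_same(shape, channels)
        trace.append((f"conv{i}", shape))
        shape = max_pool_2x2(shape)
        trace.append((f"pool{i}", shape))
    flat_size = shape[0] * shape[1] * shape[2]
    trace.append(("flatten", (flat_size,)))
    trace.append(("dense1", (hidden_units,)))
    # Dropout keeps the shape; its position here is an assumption.
    trace.append(("dropout", (hidden_units,)))
    trace.append(("dense2", (num_classes,)))
    return trace

# Hypothetical input: 40 feature bins x 128 time frames x 1 channel.
layers = trace_cnn((40, 128, 1), [16, 32, 64, 128], hidden_units=128, num_classes=10)
for name, shape in layers:
    print(f"{name:8s} {shape}")
```

With these assumed sizes, each pooling stage halves both spatial dimensions, so the four stages reduce the 40 x 128 input to a 2 x 8 map before flattening. Different input sizes or padding choices would change every shape in the trace.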