Non-reference speech quality assessment based on deep learning

In the field of speech processing, voice quality evaluation is one of the important techniques, and it has been widely used in mobile communications, Internet, public safety, digital entertainment, consumer electronics, and other fields. In the early days, there was only subjective voice quality ass...

Bibliographic Details
Main Author: Fang, Xuhui
Other Authors: Tan Yap Peng
Format: Thesis-Master by Coursework
Language: English
Published: Nanyang Technological University 2023
Subjects: Engineering::Electrical and electronic engineering
Online Access: https://hdl.handle.net/10356/164956
Institution: Nanyang Technological University
id sg-ntu-dr.10356-164956
record_format dspace
spelling sg-ntu-dr.10356-164956
2023-07-04T16:08:07Z
Non-reference speech quality assessment based on deep learning
Fang, Xuhui
Tan Yap Peng
School of Electrical and Electronic Engineering
EYPTan@ntu.edu.sg
Engineering::Electrical and electronic engineering
Master of Science (Communications Engineering)
2023-03-03T02:07:06Z
2023
Thesis-Master by Coursework
Fang, X. (2023). Non-reference speech quality assessment based on deep learning. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/164956
https://hdl.handle.net/10356/164956
en
application/pdf
Nanyang Technological University
In speech processing, voice quality evaluation is an important technique that is widely used in mobile communications, the Internet, public safety, digital entertainment, consumer electronics, and other fields. Early on, only subjective voice quality assessment was available, but it demanded substantial human resources, annotated data, and time. Objective voice quality evaluation methods therefore gradually became popular. Reference-based speech quality assessment models require the clean, original speech signal, which is sometimes difficult to obtain in practice. As a result, non-reference speech quality assessment has received increased attention, especially in recent years. Many researchers have integrated deep learning into non-reference speech quality assessment, leading to major breakthroughs in the field. However, existing deep learning-based speech quality evaluation still has limitations, such as insufficient accuracy and a large number of parameters. To address these limitations, this dissertation studies non-reference speech quality evaluation based on deep learning. The main research is summarized below.
(1) To address the insufficient accuracy of existing voice quality assessment, this dissertation proposes improvements from several perspectives. BiLSTM (Bidirectional Long Short-Term Memory) is used to improve the time-dependency model, exploiting its ability to learn speech context information. On this basis, a Squeeze-and-Excitation (SE) module is added, which learns the correlation between different channels of the feature map and uses the resulting channel attention weights to recalibrate the feature map. In addition, a custom loss function based on the signal loss ratio is used to improve model fitting, which further improves evaluation performance. Experimental results show the effectiveness of this method.
(2) To address the large number of parameters in existing speech quality evaluation models, we propose a low-complexity speech quality evaluation method based on depthwise residual convolution and a Bidirectional Gated Recurrent Unit (BiGRU): the SE-DSResBGRU-NRSQA model. The main goal of this model is to reduce the number of parameters. It does so by using BiGRU and depthwise separable convolution, structuring the convolutional part as a residual network (ResNet), and using shallow feature information to improve evaluation performance through direct mapping. SE modules are then added to learn the importance of different channels, so as to exploit the input information effectively and improve the evaluation performance of the system. The experimental results show that the proposed method achieves good speech quality evaluation with a relatively small number of parameters.
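Two building blocks named in the abstract can be illustrated with a short sketch. The code below is not the thesis implementation: it assumes the commonly used Squeeze-and-Excitation formulation (global average pooling, then two fully connected layers with ReLU and sigmoid) and the usual definition of a depthwise separable convolution (a k x k depthwise convolution followed by a 1 x 1 pointwise convolution). The function names, the weight matrices `w1` and `w2`, and the layer sizes in the demo are hypothetical and chosen only for illustration.

```python
import math


def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Weight count of a k x k depthwise convolution followed by a
    1 x 1 pointwise convolution (bias omitted)."""
    return c_in * k * k + c_in * c_out


def squeeze_excite(feature_map, w1, w2):
    """Recalibrate the channels of a feature map.

    feature_map: list of channels, each a list of rows of floats.
    w1: hidden x channels weight matrix (hypothetical, no bias).
    w2: channels x hidden weight matrix (hypothetical, no bias).
    Returns (rescaled feature map, per-channel weights).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
         for ch in feature_map]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    h = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, h))))
         for row in w2]
    # Scale: multiply every value in channel c by its weight s[c].
    scaled = [[[s[c] * x for x in row] for row in ch]
              for c, ch in enumerate(feature_map)]
    return scaled, s


if __name__ == "__main__":
    # Illustrative layer sizes only; these are not taken from the thesis.
    print(conv_params(64, 128, 3))
    print(depthwise_separable_params(64, 128, 3))
```

The parameter counts show why replacing standard convolutions with depthwise separable ones shrinks a model: the k x k spatial filtering is done once per input channel rather than once per input-output channel pair. The SE function returns the channel weights alongside the rescaled map so that the learned channel importance can be inspected.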
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Electrical and electronic engineering