Improving end-to-end transformer model architecture in ASR
Main Author:
Other Authors:
Format: Thesis - Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2023
Subjects:
Online Access: https://hdl.handle.net/10356/166407
Institution: Nanyang Technological University
Summary: As a result of advancements in deep learning and neural network technology, end-to-end models have been successfully introduced into automatic speech recognition (ASR) and have achieved superior performance compared to conventional hybrid systems. End-to-end models simplify the traditional GMM-HMM pipeline by transcribing speech to text directly, with fast computation and short development time. The transformer model, the latest end-to-end model, has achieved huge success not only in ASR but also in natural language processing and computer vision. In spite of its strong performance, the transformer architecture can be further improved to better suit the characteristics of ASR.
To be more specific, ASR performance is greatly affected by the speaker mismatch between training and test data, and speaker adaptation is a long-standing problem in ASR. Additionally, ASR exhibits a monotonic alignment between the text output and the speech input, and different phonemes along an utterance may require different levels of computation due to varying complexity and noise levels. How to best deploy the transformer model for ASR therefore remains a challenge. In this thesis, three novel architectural changes to the transformer model are proposed to address these ASR characteristics. For each proposed method, a detailed experimental evaluation and analysis is carried out against the transformer baseline.
In the first study, to alleviate the performance drop in ASR caused by speaker mismatch between training and test data, we present a unified framework for speaker adaptation that consists of feature adaptation and model adaptation. A speaker-aware persistent memory model is presented to capture speaker knowledge through persistent memory in order to generalize better to unseen test speakers; this belongs to the feature adaptation methods. Furthermore, a model-based gradual pruning approach is deployed to free up part of the encoder parameters for target speaker adaptation without sacrificing the original model performance. Finally, instead of adapting only to general speakers or to a specific target speaker, we design a multi-speaker adaptation model capable of adapting to multiple speakers simultaneously, which is practical in ubiquitous environments with multiple users. Our proposed approach brings a relative 2.74%-6.52% word error rate (WER) reduction on general speaker adaptation. On target speaker adaptation, our method outperforms the baseline with up to a 20.58% relative WER reduction and surpasses the finetuning method by up to 2.54% relative.
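To make the persistent-memory idea concrete, the sketch below shows one plausible way to augment self-attention with a small bank of learned memory slots that is shared across utterances and can store speaker knowledge. The module name, the single-head simplification, and the slot count `n_mem` are illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PersistentMemoryAttention(nn.Module):
    """Single-head self-attention whose keys/values are augmented with a
    small bank of learned "persistent memory" slots shared across
    utterances (hypothetical sketch, not the thesis code)."""

    def __init__(self, d_model: int, n_mem: int = 8):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Learned persistent memory slots, prepended to keys and values.
        self.mem_k = nn.Parameter(torch.randn(n_mem, d_model) * 0.02)
        self.mem_v = nn.Parameter(torch.randn(n_mem, d_model) * 0.02)
        self.scale = d_model ** -0.5

    def forward(self, x):                      # x: (batch, time, d_model)
        b = x.size(0)
        q = self.q_proj(x)
        # Every frame can attend to the memory slots as well as the utterance.
        k = torch.cat([self.mem_k.expand(b, -1, -1), self.k_proj(x)], dim=1)
        v = torch.cat([self.mem_v.expand(b, -1, -1), self.v_proj(x)], dim=1)
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v                        # (batch, time, d_model)

# Usage: a batch of 2 utterances, 50 frames each, 256-dim features.
out = PersistentMemoryAttention(256)(torch.randn(2, 50, 256))
```

Because the memory slots are ordinary parameters rather than inputs, they are trained once on the multi-speaker training set and then reused unchanged at test time, which is what lets them carry general speaker knowledge to unseen speakers.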
In the second study, we address the misalignment problem between the text output and the speech input in the transformer model, which greatly affects ASR accuracy during training and inference. An effective cross-attention biasing technique for the transformer is proposed that takes the monotonic alignment between text output and speech input into consideration by making use of the cross-attention weights. Specifically, a Gaussian mask is applied to the cross-attention weights to locally limit the input speech context range given the alignment information. A regularizer is further introduced for alignment regularization. Experiments on the LibriSpeech dataset show that our proposed model obtains improved output-input alignment for ASR and yields 14.5%-25.0% relative WER reductions.
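A minimal sketch of such a Gaussian cross-attention bias is given below, assuming the alignment centre for each output token has already been estimated; the function name and the `center` and `sigma` parameters are illustrative assumptions rather than the thesis's exact formulation.

```python
import torch
import torch.nn.functional as F

def gaussian_biased_cross_attention(q, k, v, center, sigma=10.0):
    """Cross-attention whose logits are biased by a Gaussian centred on an
    estimated alignment position, so each output token attends mostly to a
    local window of the speech input (illustrative sketch).

    q:      (batch, n_text, d)    decoder queries
    k, v:   (batch, n_speech, d)  encoder keys / values
    center: (batch, n_text)       expected speech frame for each output token
    """
    d = q.size(-1)
    logits = q @ k.transpose(1, 2) / d ** 0.5                # (b, n_text, n_speech)
    pos = torch.arange(k.size(1), device=k.device).float()   # speech frame index
    # Gaussian penalty: zero at the alignment centre, increasingly negative
    # away from it, which softly limits the attended speech context.
    bias = -((pos[None, None, :] - center[..., None]) ** 2) / (2 * sigma ** 2)
    attn = F.softmax(logits + bias, dim=-1)
    return attn @ v
```

Because the bias is added to the logits before the softmax, speech frames far from the estimated alignment receive exponentially less attention weight, which is one simple way to enforce the locality that the monotonic output-input alignment suggests.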
In the final study, to resolve the problem of a static number of layers in the transformer model, we present the universal speech transformer. It generalizes the speech transformer with dynamic numbers of encoder/decoder layers, which relieves the burden of tuning depth-related hyperparameters. The original universal transformer adds the depth and positional embeddings repeatedly at each layer, which dilutes the acoustic information carried by the hidden representations, and it performs only a partial update of the hidden vectors between layers, which is inefficient, especially for very deep models. To make better use of the universal transformer, we modify its processing framework by removing the depth embedding and adding the positional embedding only once at the transformer encoder frontend. Furthermore, to update the hidden vectors efficiently, especially for very deep models, we adopt a full update. Experiments on the LibriSpeech, Switchboard and AISHELL-1 datasets show that our model outperforms the baseline by 3.88%-13.7% and surpasses other models with lower computation cost.
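The sketch below illustrates the modified universal-transformer encoder loop described above: one shared layer reused for several steps, positional embeddings added only once at the frontend, and a full update of the hidden states at every step. The class name and the fixed step count are assumptions for illustration; the fixed `n_steps` stands in for the dynamic depth mechanism, which is omitted here.

```python
import torch
import torch.nn as nn

class UniversalSpeechEncoder(nn.Module):
    """Encoder that reuses one shared transformer layer for several steps,
    with positional embeddings added once at the frontend (no per-step depth
    embedding) and a full update of the hidden states at every step.
    Names and defaults are illustrative assumptions."""

    def __init__(self, d_model=256, n_heads=4, max_len=2000, n_steps=12):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.n_steps = n_steps

    def forward(self, x):                       # x: (batch, time, d_model)
        pos = torch.arange(x.size(1), device=x.device)
        h = x + self.pos_emb(pos)               # positional info added once only
        for _ in range(self.n_steps):           # same weights reused at each step
            h = self.shared_layer(h)            # full update of all hidden vectors
        return h
```

Sharing one layer across steps keeps the parameter count independent of depth, so the number of steps can be treated as a run-time choice rather than a hyperparameter baked into the model size.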