Improving end-to-end transformer model architecture in ASR

Bibliographic Details
Main Author: Zhao, Yingzhu
Other Authors: Chng Eng Siong
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Subjects:
Online Access: https://hdl.handle.net/10356/166407
Institution: Nanyang Technological University
Summary: As a result of advances in deep learning and neural networks, end-to-end models have been successfully introduced into automatic speech recognition (ASR) and have achieved superior performance compared to conventional hybrid systems. End-to-end models simplify the traditional GMM-HMM pipeline by transcribing speech to text directly, with fast computation and short development time. The transformer model, the latest end-to-end model, has achieved huge success not only in ASR but also in natural language processing and computer vision. Despite its strong performance, the transformer architecture can be further improved to better suit the characteristics of ASR. Specifically, ASR performance is greatly affected by speaker mismatch between training and test data, and speaker adaptation is a long-standing problem in ASR. Additionally, ASR exhibits a monotonic alignment between text output and speech input, and different phonemes along an utterance may require different levels of computation due to varying complexity and noise levels. How to better deploy the transformer model for ASR remains a challenge. In this thesis, three novel architectural changes to the transformer model are proposed to address these ASR characteristics. For each proposed method, a detailed experimental evaluation and analysis is carried out against the transformer baseline.

In the first study, to alleviate the performance drop in ASR caused by speaker mismatch between training and test data, we present a unified framework for speaker adaptation consisting of feature adaptation and model adaptation. A speaker-aware persistent memory model is presented to capture speaker knowledge through persistent memory and thereby generalize better to unseen test speakers; this belongs to the feature adaptation methods. Furthermore, a model-based gradual pruning approach is deployed to free up part of the encoder parameters for target speaker adaptation without sacrificing the original model performance. Finally, instead of adapting only to general speakers or to a specific target speaker, we design a multi-speaker adaptation model capable of adapting to multiple speakers simultaneously, which is practical in ubiquitous environments with multiple users. Our proposed approach brings a relative 2.74-6.52% word error rate (WER) reduction on general speaker adaptation. On target speaker adaptation, our method outperforms the baseline with up to 20.58% relative WER reduction and surpasses the fine-tuning method by up to 2.54% relative.

In the second study, we address the misalignment between text output and speech input in the transformer model, which greatly affects ASR accuracy during training and inference. An effective cross-attention biasing technique is proposed that takes the monotonic alignment between text output and speech input into account by making use of the cross-attention weights. Specifically, a Gaussian mask is applied to the cross-attention weights to restrict the input speech context to a local range given the alignment information, and a regularizer is further introduced for alignment regularization. Experiments on the LibriSpeech dataset show that our proposed model obtains improved output-input alignment for ASR and yields 14.5%-25.0% relative WER reductions.
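The Gaussian biasing of cross-attention described in the second study can be illustrated with a short sketch. The code below is a minimal, hypothetical PyTorch-style illustration rather than the thesis implementation: it assumes the alignment information is summarized by a predicted center position per decoder step and a width hyperparameter sigma, and it adds a log-domain Gaussian penalty to the attention logits before the softmax so that encoder frames far from the expected alignment are suppressed.

```python
import torch
import torch.nn.functional as F

def gaussian_biased_cross_attention(q, k, v, centers, sigma=10.0):
    """Cross attention with a Gaussian mask on the attention weights.

    q:       (batch, tgt_len, d)   decoder queries
    k, v:    (batch, src_len, d)   encoder keys/values
    centers: (batch, tgt_len)      expected source position for each output step
    sigma:   width of the Gaussian window, in encoder frames

    Hypothetical sketch: the alignment information is reduced here to a
    single predicted center per output step, which is an assumption not
    spelled out in the abstract.
    """
    d = q.size(-1)
    # Raw attention logits, shape (batch, tgt_len, src_len)
    logits = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5

    # Distance of every encoder position from the expected alignment center
    src_pos = torch.arange(k.size(1), device=k.device).float()   # (src_len,)
    dist = src_pos.view(1, 1, -1) - centers.unsqueeze(-1)         # (batch, tgt_len, src_len)

    # Log-domain Gaussian penalty: frames far from the center are down-weighted
    logits = logits - dist.pow(2) / (2.0 * sigma ** 2)

    attn = F.softmax(logits, dim=-1)
    return torch.matmul(attn, v), attn
```

In practice the center positions could come from a forced alignment during training or from a monotonically advancing estimate at inference time; the alignment regularizer mentioned in the abstract would then penalize attention mass that strays from this monotonic trajectory.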
In the final study, to resolve the problem of a static number of layers in the transformer model, we present the universal speech transformer. It generalizes the speech transformer with dynamic numbers of encoder/decoder layers, which relieves the burden of tuning depth-related hyperparameters. The original universal transformer adds the depth and positional embeddings repeatedly at each layer, which dilutes the acoustic information carried by the hidden representations, and it performs only a partial update of the hidden vectors between layers, which is inefficient, especially for very deep models. To make better use of the universal transformer, we modify its processing framework by removing the depth embedding and adding the positional embedding only once, at the transformer encoder frontend. Furthermore, to update the hidden vectors efficiently, especially for very deep models, we adopt a full update. Experiments on the LibriSpeech, Switchboard and AISHELL-1 datasets show that our model outperforms the baseline by 3.88%-13.7% relative and surpasses other models at lower computation cost.
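The encoder modifications described in the final study can likewise be sketched in a few lines. The snippet below is a hypothetical PyTorch-style illustration, not the thesis code: a single shared encoder layer stands in for the universal transformer block and is applied a configurable number of times (the dynamic depth), the positional embedding is added only once at the encoder frontend, no depth embedding is used, and each step fully replaces the hidden vectors (a full update) instead of partially updating them.

```python
import torch
import torch.nn as nn

class UniversalSpeechEncoder(nn.Module):
    """Sketch of a universal-transformer-style encoder for speech.

    Assumptions not taken from the thesis text: a standard
    nn.TransformerEncoderLayer is reused as the shared layer, and the
    number of recurrent steps is fixed per forward pass.
    """

    def __init__(self, d_model=256, nhead=4, num_steps=12, max_len=4000):
        super().__init__()
        self.num_steps = num_steps
        # One shared layer applied repeatedly, so depth becomes a runtime choice
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=4 * d_model, batch_first=True
        )
        # Positional embedding added once at the frontend; no depth embedding
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, x):
        # x: (batch, time, d_model) acoustic features after subsampling
        positions = torch.arange(x.size(1), device=x.device)
        h = x + self.pos_emb(positions)      # positional information injected once
        for _ in range(self.num_steps):
            h = self.shared_layer(h)         # full update of the hidden vectors
        return h


# Usage: 12 "layers" of computation with the parameters of a single layer.
enc = UniversalSpeechEncoder()
features = torch.randn(2, 300, 256)
print(enc(features).shape)  # torch.Size([2, 300, 256])
```

Replacing the whole hidden state at every step keeps the recurrence simple and, as the abstract notes, avoids the inefficiency of partial updates when the effective depth is large.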