Language model domain adaptation for automatic speech recognition systems

Bibliographic Details
Main Author: Khassanov, Yerbolat
Other Authors: Chng Eng Siong
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2020
Online Access: https://hdl.handle.net/10356/141323
Institution: Nanyang Technological University
Description
Summary: This research addresses the language model (LM) domain mismatch problem in automatic speech recognition (ASR) systems. ASR systems rely on LMs to constrain their recognition output to linguistically correct hypotheses. While LMs significantly improve the linguistic competence of ASR, they are highly sensitive to domain mismatch between the training (source) and test (target) data. Even a slight difference between the source and target domains can severely degrade an LM’s effectiveness and consequently harm the recognition performance of ASR. Although LM domain mismatch is caused by a combination of factors, this work focuses on three of them: topic, vocabulary coverage, and code-switching practice. We first describe each factor and its impact on LM and ASR performance, and then propose three novel LM adaptation methods, one for each factor.
The first proposed method addresses topic domain mismatch in the count-based N-gram LMs employed at the decoding stage of hybrid ASR systems based on a deep neural network (DNN) and a hidden Markov model (HMM). The method follows the two-pass adaptation approach and uses a data selection technique. Unlike conventional two-pass adaptation methods, which directly modify the LM parameters using the recognition output from the first pass, our method modifies the training data by filtering out irrelevant text segments. Consequently, it avoids the error propagation caused by reusing incorrect recognition hypotheses and prevents distortion of the captured linguistic knowledge.
The second proposed method addresses vocabulary coverage mismatch in the word-level recurrent neural network (RNN) LMs frequently used at the rescoring stage of both DNN-HMM and end-to-end (E2E) ASR systems. Despite the superior generalization capability of RNN LMs, their vocabulary coverage is always limited to the words present in the training data. The remaining words, which are usually rare domain-specific words, are either discarded or mapped to a special token such as <unk>. Consequently, important keywords such as rare person and location names are underrepresented and thus removed from the final recognition output. To overcome this problem, we propose an efficient vocabulary adaptation method based on word embedding matrix augmentation. Our method uses similar words to expand the vocabulary coverage or to enrich the representations of rare words in a pre-trained RNN LM, without requiring additional in-domain training data or expensive post-processing.
The third proposed method addresses the mismatch in code-switching (CS) practice between the training and test data of multilingual E2E ASR systems. Specifically, the training data consists of intersentential CS (inter-CS) utterances, i.e. languages are mixed between utterances, whereas the test data consists of intrasentential CS (intra-CS) utterances, i.e. languages are mixed within utterances, which is considered more challenging. While inter-CS data can be obtained by simply combining monolingual corpora from different languages, labeled intra-CS data is extremely difficult to obtain. Therefore, we propose an effective adaptation method that improves the intra-CS speech recognition capability of an E2E ASR system built using only abundant inter-CS data. In particular, our method constrains the output token embeddings of the individual languages, which reside within the internal LM of the E2E ASR system, to be close to each other, and thereby makes it easier for the system to switch languages.
We evaluated all proposed methods on standard datasets using state-of-the-art tools and compared them against strong baseline systems, achieving significant improvements.
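As a rough illustration of the word embedding matrix augmentation idea behind the second method, the sketch below adds an out-of-vocabulary word to a toy embedding table by combining the rows of similar in-vocabulary words. The summary does not say how the similar words are combined, so the plain averaging used here, the function name `augment_embeddings`, and the toy vocabulary are all assumptions made for illustration, not the procedure from the thesis.

```python
# Illustrative sketch only. Averaging the rows of similar words is an assumed
# combination rule; the thesis summary does not specify the actual one.

def augment_embeddings(vocab, embeddings, new_word, similar_words):
    """Append a row for new_word built from its similar in-vocabulary words.

    vocab: dict mapping word -> row index
    embeddings: list of equal-length lists of floats, one row per word
    Returns a new (vocab, embeddings) pair; the inputs are not modified.
    """
    if new_word in vocab:
        raise ValueError(f"{new_word!r} is already in the vocabulary")

    # Keep only the similar words that already have an embedding row.
    known = [w for w in similar_words if w in vocab]
    if not known:
        raise ValueError("no similar word is in the vocabulary")

    dim = len(embeddings[0])
    rows = [embeddings[vocab[w]] for w in known]
    # Element-wise mean of the selected rows (hypothetical choice).
    new_row = [sum(r[i] for r in rows) / len(rows) for i in range(dim)]

    new_vocab = dict(vocab)
    new_vocab[new_word] = len(embeddings)  # index of the appended row
    new_embeddings = [list(r) for r in embeddings] + [new_row]
    return new_vocab, new_embeddings


# Toy example: "tokyo" is ignored because it has no embedding row.
vocab = {"<unk>": 0, "london": 1, "paris": 2}
embeddings = [[0.0, 0.0], [1.0, 3.0], [3.0, 1.0]]
new_vocab, new_embeddings = augment_embeddings(
    vocab, embeddings, "astana", ["london", "paris", "tokyo"]
)
print(new_vocab["astana"], new_embeddings[3])  # prints: 3 [2.0, 2.0]
```

In an actual RNN LM, the output layer has one row per vocabulary word as well, so it would presumably need a matching extension. The sketch covers only a single embedding table.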