Language model domain adaptation for automatic speech recognition systems

Bibliographic Details
Main Author: Khassanov, Yerbolat
Other Authors: Chng Eng Siong
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2020
Subjects: Engineering::Computer science and engineering::Computer applications
Online Access:https://hdl.handle.net/10356/141323
Institution: Nanyang Technological University
id sg-ntu-dr.10356-141323
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering::Computer applications
spellingShingle Engineering::Computer science and engineering::Computer applications
Khassanov, Yerbolat
Language model domain adaptation for automatic speech recognition systems
description This research addresses the language model (LM) domain mismatch problem in automatic speech recognition (ASR) systems. ASR systems rely on LMs to constrain their recognition output to linguistically correct hypotheses. While LMs significantly improve the linguistic competence of ASR, they are highly sensitive to the domain mismatch between training (source) and test (target) data. Even a slight difference between the source and target domains can severely degrade the LM's effectiveness and, consequently, harm the recognition performance of the ASR system. Although LM domain mismatch is caused by a combination of factors, this work focuses on three of them: topic, vocabulary coverage, and code-switching practice. We first thoroughly describe each of these factors and its impact on LM and ASR performance, and then propose three novel LM adaptation methods, one addressing each factor.

The first proposed method addresses the topic domain mismatch in count-based N-gram LMs employed at the decoding stage of hybrid ASR systems based on deep neural network (DNN) hidden Markov models (HMMs). The method follows the two-pass adaptation approach combined with a data selection technique. Unlike conventional two-pass adaptation methods, which directly modify the LM parameters using the recognition output of the first pass, our method instead modifies the training data by filtering out irrelevant text segments. Consequently, it avoids the error propagation caused by reusing incorrect recognition hypotheses and prevents distortion of the captured linguistic knowledge.

The second proposed method addresses the vocabulary coverage mismatch in word-level recurrent neural network (RNN) LMs, which are frequently used at the rescoring stage of both DNN-HMM and end-to-end (E2E) ASR systems. Despite the superior generalization capability of RNN LMs, their vocabulary coverage is always limited to the words present in the training data; the remaining words, usually rare domain-specific words, are either discarded or mapped to a special token such as <unk>. Consequently, important keywords such as rare person and location names are underrepresented and hence absent from the final recognition output. To overcome this problem, we propose an efficient vocabulary adaptation method based on word embedding matrix augmentation. Our method uses similar words to expand the vocabulary coverage or to enrich the representations of rare words in a pre-trained RNN LM, without requiring additional in-domain training data or expensive post-processing.

The third proposed method addresses the code-switching (CS) practice mismatch between training and test data in multilingual E2E ASR systems. Specifically, the training data consists of intersentential CS (inter-CS) utterances, i.e. languages are mixed between utterances, whereas the test data consists of intrasentential CS (intra-CS) utterances, i.e. languages are mixed within an utterance, which is considered more challenging. While inter-CS data can be obtained by simply combining several monolingual corpora from different languages, labeled intra-CS data is extremely difficult to obtain. We therefore propose an effective adaptation method that improves the intra-CS speech recognition capability of an E2E ASR system built using only abundant inter-CS data. In particular, our method constrains the output token embeddings of the monolingual languages, residing within the internal LM of the E2E model, to be close to each other, and hence encourages the model to switch languages easily.

We evaluated all proposed methods on standard datasets using state-of-the-art tools and compared them against strong baseline systems, achieving significant improvements.
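To make the first method concrete: the abstract does not specify the relevance criterion used for data selection, so the sketch below assumes a simple TF-IDF cosine similarity between each training text segment and the first-pass recognition hypotheses. The function name, keep ratio, and scoring rule are all illustrative assumptions, not the thesis's actual procedure.

```python
# Hypothetical sketch of two-pass LM adaptation via data selection.
# Assumption: relevance is scored by TF-IDF cosine similarity between
# each training segment and the first-pass ASR hypotheses.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_relevant_segments(train_segments, first_pass_hyps, keep_ratio=0.5):
    """Keep the training segments most similar to the first-pass output."""
    vectorizer = TfidfVectorizer()
    seg_vecs = vectorizer.fit_transform(train_segments)
    hyp_vec = vectorizer.transform([" ".join(first_pass_hyps)])
    scores = cosine_similarity(seg_vecs, hyp_vec).ravel()
    n_keep = max(1, int(len(train_segments) * keep_ratio))
    top = scores.argsort()[::-1][:n_keep]
    return [train_segments[i] for i in sorted(top)]

# The selected segments would then be used to retrain the N-gram LM
# (e.g. with SRILM or KenLM) for the second decoding pass, so that
# first-pass errors never enter the LM's training text directly.
```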
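For the second method, a minimal PyTorch sketch of the embedding-matrix-augmentation idea: a new word's row in the input (and, analogously, output) embedding matrix of a pre-trained RNN LM is initialized from the embeddings of similar in-vocabulary words. The mean-of-similar-words rule and all names are assumptions for illustration.

```python
import torch

def augment_embedding_matrix(emb_weight, vocab, new_word, similar_words):
    """Append a row for `new_word` to the (V, d) embedding matrix of a
    pre-trained RNN LM, initialized as the mean embedding of similar
    in-vocabulary words (assumed combination rule, for illustration)."""
    idxs = [vocab[w] for w in similar_words if w in vocab]
    if not idxs:
        raise ValueError("no similar word is in the LM vocabulary")
    new_row = emb_weight[idxs].mean(dim=0, keepdim=True)
    augmented = torch.cat([emb_weight, new_row], dim=0)
    vocab[new_word] = augmented.size(0) - 1
    return augmented

# Usage sketch: both the input embedding matrix and the softmax
# (output) embedding matrix would be augmented this way, expanding the
# vocabulary without retraining or additional in-domain data.
```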
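For the third method, the abstract says only that the monolingual output token embeddings are constrained to be close; the sketch below assumes this is realized as an auxiliary distance penalty between the centroids of the two languages' output embeddings, added to the E2E training loss. The loss form and the weighting scheme are hypothetical.

```python
import torch

def embedding_proximity_loss(output_emb, lang1_ids, lang2_ids):
    """Auxiliary loss pulling the centroids of two languages' output
    token embeddings together. `output_emb` is the (V, d) output
    embedding of the E2E model's internal LM; `lang*_ids` index the
    tokens belonging to each monolingual language."""
    c1 = output_emb[lang1_ids].mean(dim=0)
    c2 = output_emb[lang2_ids].mean(dim=0)
    return torch.norm(c1 - c2, p=2) ** 2

# Assumed training objective (illustrative):
#   total_loss = asr_loss + lambda_emb * embedding_proximity_loss(...)
# where lambda_emb trades the proximity constraint against the main
# ASR loss, making intra-sentential language switches cheaper.
```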
author2 Chng Eng Siong
author_facet Chng Eng Siong
Khassanov, Yerbolat
format Thesis-Doctor of Philosophy
author Khassanov, Yerbolat
author_sort Khassanov, Yerbolat
title Language model domain adaptation for automatic speech recognition systems
title_short Language model domain adaptation for automatic speech recognition systems
title_full Language model domain adaptation for automatic speech recognition systems
title_fullStr Language model domain adaptation for automatic speech recognition systems
title_full_unstemmed Language model domain adaptation for automatic speech recognition systems
title_sort language model domain adaptation for automatic speech recognition systems
publisher Nanyang Technological University
publishDate 2020
url https://hdl.handle.net/10356/141323
_version_ 1683493128405778432
spelling sg-ntu-dr.10356-141323 2020-10-28T08:40:42Z Language model domain adaptation for automatic speech recognition systems Khassanov, Yerbolat Chng Eng Siong School of Computer Science and Engineering ASESChng@ntu.edu.sg Engineering::Computer science and engineering::Computer applications Doctor of Philosophy 2020-06-07T14:15:20Z 2020-06-07T14:15:20Z 2020 Thesis-Doctor of Philosophy Khassanov, Y. (2020). Language model domain adaptation for automatic speech recognition systems. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/141323 10.32657/10356/141323 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University