Enhancing spoken language identification and diarization for multilingual speech

Bibliographic Details
Main Author: Liu, Hexin
Other Authors: Andy W. H. Khong
School: School of Electrical and Electronic Engineering
Format: Thesis (Doctor of Philosophy)
Language: English
Published: Nanyang Technological University, 2023
Subjects: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence; Engineering::Electrical and electronic engineering
Online Access: https://hdl.handle.net/10356/168498
DOI: 10.32657/10356/168498
License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Citation: Liu, H. (2023). Enhancing spoken language identification and diarization for multilingual speech. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/168498
Institution: Nanyang Technological University
Description

Spoken language identification (LID) is the automatic process of determining the identity of the language spoken in a speech signal, and it is widely employed as a preprocessing step in multilingual speech processing systems. While existing approaches perform well for general LID, maintaining that performance across speech of varying duration remains challenging. In addition, general LID methods typically employ only a single type of language cue. Since different cues capture language information from different perspectives, combining them is expected to yield higher performance than using any single cue. This thesis therefore proposes an x-vector self-attention LID (XSA-LID) model to achieve robustness to speech duration, introduces two approaches that improve and incorporate language cues, and finally extends LID to a more complex scenario, language diarization (LD), via an end-to-end LD model.

To mitigate performance degradation due to varying duration, a dual-mode framework with knowledge distillation (KD) is proposed on top of the XSA-LID model. The dual-mode XSA-LID model is trained by jointly optimizing a full mode and a short mode, whose respective inputs are the full-length speech and a short clip extracted from it by a specific Boolean mask; KD is then applied to further boost performance on short utterances. In addition, the impact of clip-wise linguistic variability and lexical integrity on LID is investigated by analyzing how LID performance varies with the lengths and positions of the clips used to mimic short utterances.

To enhance LID from the perspective of language cues, two methods are introduced through which language cues can be exploited efficiently and effectively. The first investigates how to compute reliable representations and discard redundant information using a pre-trained multilingual wav2vec 2.0 model. To determine a strong base system, the performance of wav2vec features extracted from different inner layers of the context network is compared, with the XSA-LID model serving as the backbone that discriminates between languages. Two mechanisms are then employed to suppress information irrelevant to LID: an attentive squeeze-and-excitation (SE) block that performs dimension-wise scaling, and a linear bottleneck (LBN) block that removes irrelevant information through nonlinear dimension reduction. Incorporated within the XSA-LID model, these yield the AttSE-XSA and LBN-XSA models, respectively.

In the second approach, a novel LID model named PHO-LID is proposed to hierarchically incorporate phoneme and phonotactic information without requiring phoneme annotations for training. In PHO-LID, a self-supervised phoneme segmentation task and an LID task share a convolutional neural network (CNN) module, which encodes both language identity and sequential phonemic information in the input speech to generate an intermediate sequence of "phonotactic" embeddings. These embeddings are then fed into transformer encoder layers for utterance-level LID. This architecture is referred to as CNN-Trans.
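To make the CNN-Trans structure concrete, the following is a minimal PyTorch-style sketch of a shared CNN module followed by transformer encoder layers and utterance-level pooling. The class name, layer sizes, and hyperparameters are illustrative assumptions rather than the thesis implementation, and the self-supervised phoneme-segmentation branch is omitted.

```python
# Minimal sketch of a CNN-Trans style LID model (assumed shapes and
# hyperparameters; not the thesis implementation).
import torch
import torch.nn as nn

class CNNTransLID(nn.Module):
    def __init__(self, feat_dim=80, emb_dim=256, n_heads=4,
                 n_layers=2, n_languages=10):
        super().__init__()
        # Shared CNN module: encodes frame-level features into a shorter
        # sequence of "phonotactic" embeddings (stride > 1 downsamples).
        self.cnn = nn.Sequential(
            nn.Conv1d(feat_dim, emb_dim, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(emb_dim, emb_dim, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
        )
        # Transformer encoder layers aggregate the embedding sequence.
        layer = nn.TransformerEncoderLayer(d_model=emb_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(emb_dim, n_languages)

    def forward(self, x):  # x: (batch, frames, feat_dim)
        z = self.cnn(x.transpose(1, 2)).transpose(1, 2)  # (batch, seq, emb)
        z = self.encoder(z)
        return self.classifier(z.mean(dim=1))  # utterance-level logits

logits = CNNTransLID()(torch.randn(2, 300, 80))  # e.g. 3 s of 80-dim frames
```

The key design point this illustrates is the hierarchy: the CNN compresses frame-level detail into intermediate embeddings, and the transformer layers reason over that shorter sequence before a single utterance-level decision is made.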
Finally, LID is extended to a code-switching scenario, language diarization, and two end-to-end neural configurations are proposed for language diarization on bilingual code-switching speech. The first, a BLSTM-E2E architecture, comprises stacked bidirectional LSTMs that compute embeddings and incorporates a deep clustering loss to enforce grouping of segments belonging to the same language. The second, an XSA-E2E architecture, is based on an x-vector model followed by a self-attention encoder; the former encodes frame-level features into segment-level embeddings, while the latter attends over all such embeddings to generate a sequence of segment-level language labels. All proposed approaches are evaluated on standard datasets including NIST LRE 2017, OLR, SEAME, and WSTCSMC 2020, and they exhibit significant performance improvements over the baseline systems on their respective language identification and diarization tasks.
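For illustration, the sketch below mirrors the XSA-E2E structure at a high level: a segment encoder produces one embedding per speech segment, and a self-attention encoder labels every segment with a language. The segment encoder here is a simple stand-in for the x-vector front end, and all names and dimensions are assumptions for the sketch.

```python
# Minimal sketch of an XSA-E2E style language diarizer (hypothetical
# shapes; a stand-in segment encoder replaces the thesis x-vector model).
import torch
import torch.nn as nn

class XSAE2EDiarizer(nn.Module):
    def __init__(self, feat_dim=80, emb_dim=256, n_languages=2):
        super().__init__()
        # Stand-in for the x-vector front end: maps each segment's frames
        # to one embedding via two 1-D conv layers plus mean pooling.
        self.segment_encoder = nn.Sequential(
            nn.Conv1d(feat_dim, emb_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(emb_dim, emb_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=emb_dim, nhead=4,
                                           batch_first=True)
        self.self_attention = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(emb_dim, n_languages)

    def forward(self, segments):  # (batch, n_segments, frames, feat_dim)
        b, s, t, f = segments.shape
        x = segments.reshape(b * s, t, f).transpose(1, 2)  # (b*s, f, t)
        emb = self.segment_encoder(x).mean(dim=2)          # (b*s, emb)
        emb = emb.reshape(b, s, -1)
        emb = self.self_attention(emb)  # attend across all segments
        return self.head(emb)           # per-segment language logits

labels = XSAE2EDiarizer()(torch.randn(2, 20, 100, 80)).argmax(dim=-1)
```

The diarization-specific step is the final self-attention pass: each segment's label is predicted in the context of every other segment, which is what lets the model track language turns in code-switching speech.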