Enhancing spoken language identification and diarization for multilingual speech
Main Author: Liu, Hexin
Other Authors: Andy Khong W H
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2023
School: School of Electrical and Electronic Engineering
Subjects: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence; Engineering::Electrical and electronic engineering
Online Access: https://hdl.handle.net/10356/168498
DOI: 10.32657/10356/168498
License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Institution: Nanyang Technological University
Abstract:
Spoken language identification (LID) refers to the automatic process of determining the identity of the language spoken in a speech signal. It is widely employed as a preprocessing step in multilingual speech processing systems. While existing approaches achieve high performance on general LID, performing well on speech of varying duration remains challenging. In addition, general LID methods employ only a single type of language cue. Since different language cues characterize language information from different perspectives, combining multiple cues is expected to yield higher performance than relying on a single cue. Therefore, in this thesis, an x-vector self-attention LID (XSA-LID) model is first proposed to achieve robustness to speech duration. Two approaches are then introduced to improve and to combine language cues, respectively. Finally, LID is performed in a more complex scenario, language diarization (LD), via an end-to-end LD model.
To mitigate the performance degradation caused by varying speech duration, a dual-mode framework with knowledge distillation (KD) is proposed on top of the XSA-LID model. The dual-mode XSA-LID model is trained by jointly optimizing a full mode and a short mode, whose respective inputs are the full-length speech and a short clip extracted from it by a Boolean mask; KD is then applied to further boost performance on short utterances. In addition, the impact of clip-wise linguistic variability and lexical integrity on LID is investigated by analyzing how LID performance varies with the lengths and positions of the clips used to mimic short utterances.
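To make the dual-mode training concrete, the following PyTorch sketch shows one way such a joint objective could be set up. It is a minimal illustration, not the thesis's implementation: the model interface, clip selection, temperature, and loss weighting are all assumptions.

```python
import torch
import torch.nn.functional as F

def dual_mode_kd_step(model, x, labels, clip_len, temperature=2.0, kd_weight=0.5):
    """Hypothetical joint step: optimize full and short modes together,
    then distill full-mode posteriors into the short mode."""
    batch, total_len, _ = x.shape

    # Short mode: keep only a clip selected by a Boolean mask over frames.
    start = torch.randint(0, total_len - clip_len + 1, (1,)).item()
    mask = torch.zeros(total_len, dtype=torch.bool)
    mask[start:start + clip_len] = True
    x_short = x[:, mask, :]

    logits_full = model(x)         # full-length input (model must accept variable length)
    logits_short = model(x_short)  # short clip of the same utterance

    # Classification loss for both modes.
    ce = F.cross_entropy(logits_full, labels) + F.cross_entropy(logits_short, labels)

    # KD: push short-mode posteriors toward the (detached) full-mode posteriors.
    kd = F.kl_div(
        F.log_softmax(logits_short / temperature, dim=-1),
        F.softmax(logits_full.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    return ce + kd_weight * kd
```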
To enhance LID from the perspective of language cues, two methods are then introduced through which language cues can be utilized efficiently and effectively. The first investigates efficient ways to compute reliable representations and discard redundant information using a pre-trained multilingual wav2vec 2.0 model. To determine the best basic system, the performance of wav2vec features extracted from different inner layers of the context network is compared, with the XSA-LID model serving as the backbone that discriminates between languages. Two mechanisms are then employed to suppress information irrelevant to LID in these representations: an attentive squeeze-and-excitation (SE) block that performs dimension-wise scaling, and a linear bottleneck (LBN) block that removes irrelevant information through nonlinear dimension reduction. Incorporated within the XSA-LID model, these yield the AttSE-XSA and LBN-XSA models, respectively.
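The dimension-wise scaling idea can be illustrated with a plain squeeze-and-excitation block, shown below as a rough PyTorch analogue of the attentive SE mechanism; the reduction ratio and layer sizes are assumptions, not the configuration used in the thesis.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation over feature dimensions: learn a per-dimension
    gate from a global summary and rescale the input (illustrative only)."""
    def __init__(self, dim, reduction=8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (batch, time, dim)
        summary = x.mean(dim=1)            # "squeeze": average over time
        scale = self.gate(summary)         # "excitation": weights in (0, 1)
        return x * scale.unsqueeze(1)      # down-weight irrelevant dimensions
```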
In the second approach, a novel LID model is proposed to hierarchically incorporate phoneme and phonotactic information without requiring phoneme annotations for training. In this model, named PHO-LID, a self-supervised phoneme segmentation task and an LID task share a convolutional neural network (CNN) module, which encodes both language identity and sequential phonemic information in the input speech to generate an intermediate sequence of "phonotactic" embeddings. These embeddings are then fed into transformer encoder layers for utterance-level LID. This architecture is referred to as CNN-Trans.
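A minimal sketch of this CNN-Trans structure is given below: a CNN module maps frame-level features to an intermediate sequence of embeddings, and transformer encoder layers aggregate them for utterance-level LID. Layer shapes and pooling are illustrative assumptions, and the self-supervised phoneme-segmentation head is omitted.

```python
import torch.nn as nn

class CNNTrans(nn.Module):
    """Illustrative CNN-Trans: a CNN encodes the input into a sequence of
    embeddings; transformer encoder layers perform utterance-level LID."""
    def __init__(self, feat_dim=80, embed_dim=256, num_langs=10):
        super().__init__()
        # CNN module (in PHO-LID, shared with a phoneme-segmentation task).
        self.cnn = nn.Sequential(
            nn.Conv1d(feat_dim, embed_dim, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(embed_dim, embed_dim, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
        )
        enc_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.classifier = nn.Linear(embed_dim, num_langs)

    def forward(self, x):                      # x: (batch, time, feat_dim)
        h = self.cnn(x.transpose(1, 2))        # (batch, embed_dim, time')
        h = self.encoder(h.transpose(1, 2))    # contextualized embeddings
        return self.classifier(h.mean(dim=1))  # utterance-level language logits
```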
Finally, LID is extended to language diarization in a code-switching scenario. Two end-to-end neural configurations are proposed for language diarization on bilingual code-switching speech. The first, a BLSTM-E2E architecture, uses stacked bidirectional LSTMs to compute embeddings and incorporates a deep clustering loss to enforce the grouping of segments belonging to the same language. The second, an XSA-E2E architecture, comprises an x-vector model followed by a self-attention encoder: the former encodes frame-level features into segment-level embeddings, while the latter attends over all these embeddings to generate a sequence of segment-level language labels.
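The deep clustering loss mentioned here has a well-known closed form; the sketch below follows the commonly used Frobenius-norm formulation of deep clustering rather than anything taken verbatim from the thesis, and the final normalization is an assumption.

```python
import torch
import torch.nn.functional as F

def deep_clustering_loss(embeddings, labels, num_langs=2):
    """Standard deep clustering objective: make the affinity structure of the
    embeddings match that of the one-hot language assignments.
    embeddings: (frames, dim); labels: (frames,) with values in [0, num_langs)."""
    v = F.normalize(embeddings, dim=-1)
    y = F.one_hot(labels, num_langs).float()
    # ||V V^T - Y Y^T||_F^2, expanded so no frames-by-frames matrix is formed.
    loss = ((v.T @ v).pow(2).sum()
            - 2 * (v.T @ y).pow(2).sum()
            + (y.T @ y).pow(2).sum())
    return loss / labels.numel() ** 2
```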
All proposed approaches are evaluated on standard datasets, including NIST LRE 2017, OLR, SEAME, and WSTCSMC 2020. Compared with the baseline systems, they exhibit significant performance improvements on their respective language identification and diarization tasks.
Citation: Liu, H. (2023). Enhancing spoken language identification and diarization for multilingual speech. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/168498