Towards high performance phonotactic feature for spoken language recognition

With the demands of globalization, multilingual speech is increasingly common in conversational telephone speech, broadcast news and internet podcasts. Therefore, automatic spoken language recognition has become an important technology in multilingual speech related applications. For example, auto...

Main Author: Tong, Rong
Other Authors: Li Haizhou
Format: Theses and Dissertations
Language: English
Published: 2012
Subjects:
Online Access: https://hdl.handle.net/10356/50585
Institution: Nanyang Technological University
Contributors: Li Haizhou; Chng Eng Siong
Affiliations: School of Computer Engineering; Temasek Laboratories
Degree: DOCTOR OF PHILOSOPHY (SCE)
Citation: Tong, R. (2012). Towards high performance phonotactic feature for spoken language recognition. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/50585
DOI: 10.32657/10356/50585
Extent: 145 p.
File format: application/pdf
Record timestamps: 2023-03-04T00:48:25Z; 2012-07-11T06:12:06Z
topic DRNTU::Engineering::Computer science and engineering::Computer systems organization::Performance of systems
Description: With the demands of globalization, multilingual speech is increasingly common in conversational telephone speech, broadcast news and internet podcasts. Automatic spoken language recognition has therefore become an important technology in multilingual speech applications. For example, it has been used as a preprocessing component for spoken language translation, multilingual speech recognition and spoken document retrieval.

Both humans and machines rely on certain informative cues to differentiate one language from another. Inspired by findings on the discriminative cues that humans use to recognize languages, most automatic language recognition systems rely on three types of features: acoustic, prosodic and phonotactic. Acoustic features capture spectral characteristics and can be obtained from short-term speech signals. Prosodic features such as tone, intonation, prominence and rhythm can be derived from energy measurements, pitch contour and rate of change. Phonotactic features capture the statistics of lexical constraints and phonotactic patterns, and can be generated from a tokenization front end that converts speech signals into sequences of sound patterns.

This thesis focuses on effective phonotactic feature extraction methods for high performance automatic language recognition. Specifically, its main contributions are:

(1) A novel target-oriented method is proposed to construct parallel phone recognizers for robust phonotactic feature extraction. A subset of the most discriminative phones from an existing phone recognizer is selected to form a target-oriented phone tokenizer (TOPT). The TOPT phone tokenizers, one for each target language, are constructed from an existing phone recognizer without requiring additional transcribed training data.

(2) A target-aware language model (TALM) method is proposed to generate phone tokenizers by constructing a set of phone language models, each dedicated to a target language. In the front-end decoding process with TALM, all the phone models of the original phone recognizer are used, and they are constrained by the target-aware language models. Each target-aware language model emphasizes the discriminative ability of phones for a specific target language.

(3) An automatic relevance feedback technique is proposed to incorporate more language information when recognizing short utterances. The idea is to augment the short input utterance with relevant utterances from the reference corpus, so that it carries richer information and better language recognition accuracy can be achieved.

(4) A feature selection method is proposed to reduce redundant phonotactic information and make the language recognition system more efficient. The dimensionality reduction is achieved by measuring the importance of features using two different criteria: contribution to the SVM separation margin and Chi-squared value.
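As a rough illustration of the Chi-squared criterion mentioned in the description, the following minimal sketch ranks phone n-gram features by a standard 2x2 Chi-squared statistic (n-gram presence versus membership in a target language). It is not the thesis's implementation: the function names, the toy phone sequences, the labels "X" and "Y", and the use of binary presence rather than counts are all assumptions made for illustration.

```python
from collections import Counter


def phone_ngrams(phones, n=2):
    """Count the phone n-grams in one tokenized utterance."""
    return Counter(tuple(phones[i:i + n]) for i in range(len(phones) - n + 1))


def chi_squared_scores(utterances, labels, target, n=2):
    """Score each n-gram with the 2x2 Chi-squared statistic of
    (n-gram present / absent) versus (target language / other)."""
    present = [set(phone_ngrams(u, n)) for u in utterances]
    vocab = set().union(*present)
    total = len(utterances)
    n_target = sum(1 for lab in labels if lab == target)
    scores = {}
    for gram in vocab:
        # a: target utterances containing the n-gram; b: others containing it
        a = sum(1 for p, lab in zip(present, labels) if gram in p and lab == target)
        b = sum(1 for p, lab in zip(present, labels) if gram in p and lab != target)
        c = n_target - a            # target utterances without the n-gram
        d = total - n_target - b    # other utterances without the n-gram
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[gram] = total * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top(scores, k):
    """Keep the k highest-scoring n-grams (ties broken by n-gram order)."""
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]


# Hypothetical toy data: phone sequences from a tokenizer, two languages.
utterances = [
    ["a", "b", "a", "b"],
    ["a", "b", "c"],
    ["c", "d", "c", "d"],
    ["d", "c", "a"],
]
labels = ["X", "X", "Y", "Y"]

scores = chi_squared_scores(utterances, labels, target="X")
top = select_top(scores, k=2)
```

In this toy data the bigrams that occur in every utterance of one language and in none of the other receive the highest scores, so a reduced feature set would keep those and discard the rest.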