Towards high performance phonotactic feature for spoken language recognition

With the demands of globalization, multilingual speech is increasingly common in conversational telephone speech, broadcast news and internet podcasts. Therefore, automatic spoken language recognition has become an important technology in multilingual speech related applications. For example, auto...

Bibliographic Details
Main Author:	Tong, Rong
Other Authors:	Li Haizhou
Format:	Theses and Dissertations
Language:	English
Published:	2012
Subjects:	DRNTU::Engineering::Computer science and engineering::Computer systems organization::Performance of systems
Online Access:	https://hdl.handle.net/10356/50585
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-50585
record_format	dspace
spelling	sg-ntu-dr.10356-50585; 2023-03-04T00:48:25Z; Towards high performance phonotactic feature for spoken language recognition; Tong, Rong; Li Haizhou; Chng Eng Siong; School of Computer Engineering; Temasek Laboratories; DRNTU::Engineering::Computer science and engineering::Computer systems organization::Performance of systems; DOCTOR OF PHILOSOPHY (SCE); 2012-07-11T06:12:06Z; 2012; Thesis; Tong, R. (2012). Towards high performance phonotactic feature for spoken language recognition. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/50585; 10.32657/10356/50585; en; 145 p.; application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	DRNTU::Engineering::Computer science and engineering::Computer systems organization::Performance of systems
description	With the demands of globalization, multilingual speech is increasingly common in conversational telephone speech, broadcast news and internet podcasts. Automatic spoken language recognition has therefore become an important technology in multilingual speech-related applications. For example, it has been used as a preprocessing component for spoken language translation, multilingual speech recognition and spoken document retrieval. Both humans and machines rely on certain informative cues to differentiate one language from another. Inspired by findings on the discriminative cues used in human language recognition, most automatic language recognition systems rely on three kinds of features: acoustic, prosodic and phonotactic. Acoustic features capture spectral characteristics and can be obtained from short-term speech signals. Prosodic features such as tone, intonation, prominence and rhythm can be derived from energy measurements, pitch contour and rate of change. Phonotactic features capture the statistics of lexical constraints and phonotactic patterns, and can be generated by a tokenization front end that converts speech signals into sequences of sound patterns. This thesis focuses on the study of effective phonotactic feature extraction methods for high-performance automatic language recognition. Specifically, the main contributions of this thesis are:
A novel target-oriented method is proposed to construct parallel phone recognizers for robust phonotactic feature extraction. A subset of the most discriminative phones from an existing phone recognizer is selected to form a target-oriented phone tokenizer (TOPT). The TOPT phone tokenizers, one for each target language, are constructed from an existing phone recognizer without requiring additional transcribed training data.
A target-aware language model (TALM) method is proposed to generate phone tokenizers by constructing a set of phone language models, each dedicated to a target language. In the front-end decoding process with TALM, all the phone models of the original phone recognizer are used, and they are constrained by the target-aware language models. Each target-aware language model emphasizes the discriminative ability of phones for a specific target language.
An automatic relevance feedback technique is proposed to incorporate more language information when recognizing the language of short utterances. The idea is to augment the short input utterance with relevant utterances from the reference corpus. In this way, the short utterance carries richer information and better language recognition accuracy can be achieved.
A feature selection method is proposed to reduce redundant phonotactic information and make the language recognition system more efficient. The dimensionality reduction is achieved by measuring the importance of features using two different criteria: contribution to the SVM separation margin and Chi-squared value.
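As an illustration of the ideas in the abstract (not the thesis's own implementation), the sketch below counts phone n-grams in tokenized utterances and ranks them by a Chi-squared statistic. The phone symbols, the labels `lang_A` and `lang_B`, the function names, and the choice to test n-gram presence against a single target language are all assumptions made for this example; the SVM margin criterion is not shown.

```python
from collections import Counter


def ngram_counts(phones, n_max=2):
    """Count phone n-grams (n = 1..n_max) in one tokenized utterance."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(phones) - n + 1):
            counts[tuple(phones[i:i + n])] += 1
    return counts


def chi_squared(feature, utterances, labels, target):
    """Chi-squared statistic of the 2x2 table
    (n-gram present / absent) x (target language / other language)."""
    a = b = c = d = 0
    for counts, label in zip(utterances, labels):
        present = feature in counts
        is_target = label == target
        if present and is_target:
            a += 1
        elif present:
            b += 1
        elif is_target:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # A degenerate table (an empty row or column) carries no evidence.
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_features(utterances, labels, target, k):
    """Return the k n-grams with the highest Chi-squared score (ties broken by n-gram)."""
    vocab = set().union(*utterances)
    ranked = sorted(
        vocab, key=lambda f: (-chi_squared(f, utterances, labels, target), f)
    )
    return ranked[:k]


# Hypothetical toy data: invented phone sequences for two invented languages.
toy = [
    (["k", "a", "t", "a"], "lang_A"),
    (["t", "a", "k", "a"], "lang_A"),
    (["s", "i", "s", "u"], "lang_B"),
    (["s", "u", "s", "i"], "lang_B"),
]
utterances = [ngram_counts(phones) for phones, _ in toy]
labels = [label for _, label in toy]
top = select_features(utterances, labels, target="lang_A", k=3)
```

In a real system the phone sequences would come from a phone recognizer rather than hand-written lists, and the selected n-grams would feed a classifier such as an SVM.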
author2	Li Haizhou
author_facet	Li Haizhou Tong, Rong
format	Theses and Dissertations
author	Tong, Rong
author_sort	Tong, Rong
title	Towards high performance phonotactic feature for spoken language recognition
publishDate	2012
url	https://hdl.handle.net/10356/50585
_version_	1759854407270268928
