Processing of speech utterances for computer aided training of speaking skills

Computer-aided language learning (CALL) involves the study and application of speech and language processing technologies to improve the process of language acquisition. Ideally, an effective computer-aided language learning system should be able to accurately assess the performance of a language...

Bibliographic Details
Main Author:	Zhao, Sixuan
Other Authors:	Koh Soo Ngee
Format:	Theses and Dissertations
Language:	English
Published:	2014
Subjects:	DRNTU::Engineering::Computer science and engineering::Information systems::Information systems applications
Online Access:	https://hdl.handle.net/10356/61736
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-61736
record_format	dspace
spelling	sg-ntu-dr.10356-61736 2023-07-04T16:21:21Z Processing of speech utterances for computer aided training of speaking skills Zhao, Sixuan Koh Soo Ngee Luke Kang Kwong, Kapathy Soon Ing Yann School of Electrical and Electronic Engineering Institute for Media Innovation DRNTU::Engineering::Computer science and engineering::Information systems::Information systems applications Computer-aided language learning (CALL) involves the study and application of speech and language processing technologies to improve the process of language acquisition. Ideally, an effective computer-aided language learning system should be able to accurately assess the performance of a language learner and generate meaningful feedback during the learning process. This thesis addresses several issues which are relevant to computer-aided language learning systems, particularly for learning of English as a second language (L2). The first issue is the evaluation of prosody in the learner’s speech utterances. Prosody evaluation plays an important role in the automatic assessment of the English proficiency of L2 learners. It requires segmentation of a speech utterance into appropriate units to achieve effective modeling of prosodic features. A segmentation scheme is proposed to improve the prosody evaluation results by taking into account prosodic units. Unlike lexical units such as words or phonemes, prosodic units correspond to phrasing and rhythm information and are more appropriate for the purpose of prosody evaluation. An algorithm is designed to segment the speech signal into prosodic units automatically, and it is shown that the algorithm can detect the proposed prosodic unit with reasonable accuracy. The production of audio feedback, which is an important component of CALL, is also studied in this thesis. The learner’s vocal features and the teacher’s linguistic gestures are combined to produce effective feedback utterances which can facilitate the acquisition of English speaking skills. An accent reduction scheme that reduces the perceived accents in the learner’s utterances is studied. A multi-corpora experiment designed to examine the effects of external factors on the accent reduction results resolves some ambiguities in the literature. In addition, different speech synthesis methods are described and implemented to perform accent reduction. Voice conversion is also applied as a new method to generate feedback utterances which possess the learner’s vocal features and the teacher’s linguistic gestures. The feedback utterances generated by various accent reduction methods are compared with those produced by voice conversion in order to identify an optimal way to produce feedback utterances with high nativeness and acoustic quality. Consequently, a multi-stage feedback scheme is proposed. Finally, the phonetic segmentation process is studied and its performance is improved to produce more accurate phone boundary information. This kind of information can contribute to the development of speech technology areas which can be applied to the design of computer-aided language learning systems. Three different refinement methods, i.e., statistical correction, multi-resolution fusion, and predictive model based refinement, are presented. These methods are combined appropriately to improve the accuracy of the baseline phonetic segmentation system using forced alignment. The proposed refinement scheme is also extended to a cross-corpora scenario, which enables the analysis of a new corpus with limited labeled data and thus facilitates the application of the new corpus for various purposes such as speech recognition and linguistic research. DOCTOR OF PHILOSOPHY (EEE) 2014-09-12T01:48:16Z 2014-09-12T01:48:16Z 2014 2014 Thesis Zhao, S. (2014). Processing of speech utterances for computer aided training of speaking skills. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/61736 10.32657/10356/61736 en 191 p. application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	DRNTU::Engineering::Computer science and engineering::Information systems::Information systems applications
spellingShingle	DRNTU::Engineering::Computer science and engineering::Information systems::Information systems applications Zhao, Sixuan Processing of speech utterances for computer aided training of speaking skills
description	Computer-aided language learning (CALL) involves the study and application of speech and language processing technologies to improve the process of language acquisition. Ideally, an effective computer-aided language learning system should be able to accurately assess the performance of a language learner and generate meaningful feedback during the learning process. This thesis addresses several issues which are relevant to computer-aided language learning systems, particularly for learning of English as a second language (L2). The first issue is the evaluation of prosody in the learner’s speech utterances. Prosody evaluation plays an important role in the automatic assessment of the English proficiency of L2 learners. It requires segmentation of a speech utterance into appropriate units to achieve effective modeling of prosodic features. A segmentation scheme is proposed to improve the prosody evaluation results by taking into account prosodic units. Unlike lexical units such as words or phonemes, prosodic units correspond to phrasing and rhythm information and are more appropriate for the purpose of prosody evaluation. An algorithm is designed to segment the speech signal into prosodic units automatically, and it is shown that the algorithm can detect the proposed prosodic unit with reasonable accuracy. The production of audio feedback, which is an important component of CALL, is also studied in this thesis. The learner’s vocal features and the teacher’s linguistic gestures are combined to produce effective feedback utterances which can facilitate the acquisition of English speaking skills. An accent reduction scheme that reduces the perceived accents in the learner’s utterances is studied. A multi-corpora experiment designed to examine the effects of external factors on the accent reduction results resolves some ambiguities in the literature. In addition, different speech synthesis methods are described and implemented to perform accent reduction. Voice conversion is also applied as a new method to generate feedback utterances which possess the learner’s vocal features and the teacher’s linguistic gestures. The feedback utterances generated by various accent reduction methods are compared with those produced by voice conversion in order to identify an optimal way to produce feedback utterances with high nativeness and acoustic quality. Consequently, a multi-stage feedback scheme is proposed. Finally, the phonetic segmentation process is studied and its performance is improved to produce more accurate phone boundary information. This kind of information can contribute to the development of speech technology areas which can be applied to the design of computer-aided language learning systems. Three different refinement methods, i.e., statistical correction, multi-resolution fusion, and predictive model based refinement, are presented. These methods are combined appropriately to improve the accuracy of the baseline phonetic segmentation system using forced alignment. The proposed refinement scheme is also extended to a cross-corpora scenario, which enables the analysis of a new corpus with limited labeled data and thus facilitates the application of the new corpus for various purposes such as speech recognition and linguistic research.
author2	Koh Soo Ngee
author_facet	Koh Soo Ngee Zhao, Sixuan
format	Theses and Dissertations
author	Zhao, Sixuan
author_sort	Zhao, Sixuan
title	Processing of speech utterances for computer aided training of speaking skills
title_short	Processing of speech utterances for computer aided training of speaking skills
title_full	Processing of speech utterances for computer aided training of speaking skills
title_fullStr	Processing of speech utterances for computer aided training of speaking skills
title_full_unstemmed	Processing of speech utterances for computer aided training of speaking skills
title_sort	processing of speech utterances for computer aided training of speaking skills
publishDate	2014
url	https://hdl.handle.net/10356/61736
_version_	1772825681808326656

Similar Items