Processing of speech utterances for computer aided training of speaking skills

Bibliographic Details
Main Author: Zhao, Sixuan
Other Authors: Koh Soo Ngee
Format: Theses and Dissertations
Language: English
Published: 2014
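Subjects: DRNTU::Engineering::Computer science and engineering::Information systems::Information systems applications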
Online Access: https://hdl.handle.net/10356/61736
Institution: Nanyang Technological University
id sg-ntu-dr.10356-61736
record_format dspace
spelling sg-ntu-dr.10356-61736
2023-07-04T16:21:21Z
Processing of speech utterances for computer aided training of speaking skills
Zhao, Sixuan
Koh Soo Ngee
Luke Kang Kwong, Kapathy
Soon Ing Yann
School of Electrical and Electronic Engineering
Institute for Media Innovation
DRNTU::Engineering::Computer science and engineering::Information systems::Information systems applications
DOCTOR OF PHILOSOPHY (EEE)
2014-09-12T01:48:16Z
2014-09-12T01:48:16Z
2014
2014
Thesis
Zhao, S. (2014). Processing of speech utterances for computer aided training of speaking skills. Doctoral thesis, Nanyang Technological University, Singapore.
https://hdl.handle.net/10356/61736
10.32657/10356/61736
en
191 p.
application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering::Information systems::Information systems applications
description Computer-aided language learning (CALL) involves the study and application of speech and language processing technologies to improve the process of language acquisition. Ideally, an effective computer-aided language learning system should be able to accurately assess the performance of a language learner and generate meaningful feedback during the learning process. This thesis addresses several issues relevant to computer-aided language learning systems, particularly for learning English as a second language (L2).
The first issue is the evaluation of prosody in the learner’s speech utterances. Prosody evaluation plays an important role in the automatic assessment of the English proficiency of L2 learners. It requires segmenting a speech utterance into appropriate units so that prosodic features can be modeled effectively. A segmentation scheme is proposed that improves prosody evaluation results by taking prosodic units into account. Unlike lexical units such as words or phonemes, prosodic units correspond to phrasing and rhythm information and are more appropriate for prosody evaluation. An algorithm is designed to segment the speech signal into prosodic units automatically, and it is shown that the algorithm can detect the proposed prosodic unit with reasonable accuracy.
The production of audio feedback, an important component of CALL, is also studied in this thesis. The learner’s vocal features and the teacher’s linguistic gestures are combined to produce effective feedback utterances that can facilitate the acquisition of English speaking skills. An accent reduction scheme that reduces the perceived accent in the learner’s utterances is studied. A multi-corpora experiment, designed to examine the effects of external factors on the accent reduction results, resolves some ambiguities in the literature. In addition, different speech synthesis methods are described and implemented to perform accent reduction. Voice conversion is also applied as a new method of generating feedback utterances that possess the learner’s vocal features and the teacher’s linguistic gestures. The feedback utterances generated by the various accent reduction methods are compared with those produced by voice conversion in order to identify an optimal way of producing feedback utterances with high nativeness and acoustic quality. Consequently, a multi-stage feedback scheme is proposed.
Finally, the phonetic segmentation process is studied and its performance is improved to produce more accurate phone boundary information. Such information can contribute to the development of speech technologies that can be applied to the design of computer-aided language learning systems. Three refinement methods are presented: statistical correction, multi-resolution fusion, and predictive-model-based refinement. These methods are combined to improve the accuracy of the baseline phonetic segmentation system, which uses forced alignment. The proposed refinement scheme is also extended to a cross-corpora scenario, which enables the analysis of a new corpus with limited labeled data and thus facilitates the use of the new corpus for purposes such as speech recognition and linguistic research.
author2 Koh Soo Ngee
author_facet Koh Soo Ngee
Zhao, Sixuan
format Theses and Dissertations
author Zhao, Sixuan
author_sort Zhao, Sixuan
title Processing of speech utterances for computer aided training of speaking skills
publishDate 2014
url https://hdl.handle.net/10356/61736
_version_ 1772825681808326656