Phoneme based speech to text translation system for Malaysian English pronunciation


Bibliographic Details
Main Author: Sathees Kumar, Nataraj
Format: Thesis
Language: English
Published: Universiti Malaysia Perlis (UniMAP) 2014
Subjects:
Online Access:http://dspace.unimap.edu.my:80/dspace/handle/123456789/31909
Institution: Universiti Malaysia Perlis
Language: English
id my.unimap-31909
record_format dspace
institution Universiti Malaysia Perlis
building UniMAP Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Malaysia Perlis
content_source UniMAP Library Digital Repository
url_provider http://dspace.unimap.edu.my/
language English
topic Phoneme
Speech signal processing
English language
Speech to text translation
Speech recognition systems
spellingShingle Phoneme
Speech signal processing
English language
Speech to text translation
Speech recognition systems
Sathees Kumar, Nataraj
Phoneme based speech to text translation system for Malaysian English pronunciation
description Speech is the most common vocalized form of human communication. Communication through speech conveys linguistic information and also expresses information about the speaker's social and regional origin, health and emotional state. Phoneme-based speech-to-text translation has become one of the most exciting areas of speech signal processing; owing to major advances in the statistical modeling of speech, automatic speech recognition systems have found widespread application in tasks that require a human-machine interface. Speech-to-text translation systems can be used in many applications, such as medical transcription (digital speech to text), automated transcription, telematics and air traffic control. In this research work, two isolated-word speech signal databases have been built, namely the Vowels Class Word Database (VCWD) and the Phonemes Class Word Database (PCWD). The VCWD was initially built to classify isolated words based on eleven classes of vowels. The database has been analyzed using four spectral analysis techniques, namely Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coefficients (LPC), Perceptual Linear Predictive analysis (PLP) and Relative Spectral Perceptual Linear Predictive analysis (RASTA-PLP), to determine the most discriminative features and to identify the network parameters. The PCWD has been built to develop the phoneme-based speech-to-text translation system using Linear Predictive Coefficients (LPC) and Multilayer Neural Network (MLNN) models with a fusion concept for the classification of isolated words and phonemes. The isolated-word speech signals are recorded using a speech acquisition algorithm developed with a MATLAB graphical user interface (GUI). The speech signals are recorded for 15 seconds at a 16 kHz sampling frequency.
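As a rough illustration of the LPC analysis named above, here is a minimal pure-Python sketch of the standard autocorrelation method with the Levinson-Durbin recursion. This is not the thesis's code (which was developed in MATLAB); the function names and the test signal are illustrative assumptions.

```python
def autocorr(x, max_lag):
    """Autocorrelation estimates r[0..max_lag] of one speech frame x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]

def lpc(x, order):
    """LPC coefficients a[0..order] (a[0] = 1) and residual energy,
    computed by the Levinson-Durbin recursion on the autocorrelations."""
    r = autocorr(x, order)
    a = [1.0] + [0.0] * order
    e = r[0]
    for i in range(1, order + 1):
        # reflection coefficient for this recursion order
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / e
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        e *= (1.0 - k * k)
    return a, e

# An AR(1) signal x[n] = 0.5 * x[n-1] is recovered with a[1] close to -0.5
frame = [0.5 ** n for n in range(200)]
coeffs, err = lpc(frame, 2)
```

In a real front end the 16 kHz signal would first be framed and windowed before each frame is passed to `lpc`; a predictor order of roughly 10 to 14 is common at that sampling rate.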
The recorded speech signals are pre-processed and segmented into voiced and unvoiced parts. A simple fuzzy voice classifier has been proposed to extract the voiced portions using frame energy and change-in-energy features. The extracted voiced portions are pre-processed and divided into a number of frames. For each frame, spectral features are extracted and used as the feature set for classification. The classification tasks for the isolated words and phonemes are associated with the extracted features to establish an input-output mapping. The data are then normalized and randomized to rearrange the values into a definite range. The Multilayer Neural Network (MLNN) model has been developed with four combinations of input and hidden activation functions. To improve the performance rate and reduce the training time, a simple systole activation function has been proposed. The neural network models are trained with 60%, 70% and 80% of the total data samples, and the trained networks are validated by simulation on the remaining 40%, 30% and 20%. The performance of the networks is calculated by measuring true positives, false negatives and classification accuracy, and the results are compared. It is observed that the fuzzy voice classifier is less complex and yields better accuracy than the other voiced/unvoiced classification methods reported in the literature. The LPC features show better discrimination, and the MLNN models trained on the LPC spectral band features give better classification accuracy than those trained with the other feature extraction algorithms. Also, the proposed systole activation function reduces the training time and epoch count compared with the other network models.
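The energy-based voiced/unvoiced segmentation described above can be sketched crudely: split the signal into frames, compute each frame's energy and change in energy, and flag frames whose energy exceeds a fraction of the peak. This hard threshold is only a stand-in for the thesis's fuzzy classifier, whose membership functions and rules are not reproduced here; all names and thresholds below are illustrative assumptions.

```python
import math

def frame_signal(x, frame_len):
    """Split a signal into consecutive non-overlapping frames."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, frame_len)]

def voiced_flags(x, frame_len=256, ratio=0.1):
    """Mark a frame voiced when its energy exceeds `ratio` of the peak
    frame energy; also return the change-in-energy sequence, which a
    fuzzy classifier would feed through membership functions and rules."""
    energies = [sum(s * s for s in f) for f in frame_signal(x, frame_len)]
    peak = max(energies) or 1.0
    delta = [energies[0]] + [energies[i] - energies[i - 1] for i in range(1, len(energies))]
    return [e > ratio * peak for e in energies], delta

# silence - tone - silence: only the middle frames should be flagged voiced
sig = [0.0] * 512 + [math.sin(0.1 * n) for n in range(512)] + [0.0] * 512
flags, delta = voiced_flags(sig)
```

The flagged frames would then be concatenated into the voiced portion that feeds the feature extraction stage.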
format Thesis
author Sathees Kumar, Nataraj
author_facet Sathees Kumar, Nataraj
author_sort Sathees Kumar, Nataraj
title Phoneme based speech to text translation system for Malaysian English pronunciation
title_short Phoneme based speech to text translation system for Malaysian English pronunciation
title_full Phoneme based speech to text translation system for Malaysian English pronunciation
title_fullStr Phoneme based speech to text translation system for Malaysian English pronunciation
title_full_unstemmed Phoneme based speech to text translation system for Malaysian English pronunciation
title_sort phoneme based speech to text translation system for malaysian english pronunciation
publisher Universiti Malaysia Perlis (UniMAP)
publishDate 2014
url http://dspace.unimap.edu.my:80/dspace/handle/123456789/31909
_version_ 1643796705708081152
spelling my.unimap-319092014-02-13T10:50:14Z Phoneme based speech to text translation system for Malaysian English pronunciation Sathees Kumar, Nataraj Phoneme Speech signal processing English language Speech to text translation Speech recognition systems 2014-02-13T10:50:14Z 2014-02-13T10:50:14Z 2012 Thesis http://dspace.unimap.edu.my:80/dspace/handle/123456789/31909 en Universiti Malaysia Perlis (UniMAP) School of Mechatronic Engineering