Phoneme based speech to text translation system for Malaysian English pronunciation


Bibliographic Details
Main Author: Sathees Kumar, Nataraj
Format: Thesis
Language: English
Published: Universiti Malaysia Perlis (UniMAP) 2014
Subjects:
Online Access: http://dspace.unimap.edu.my:80/dspace/handle/123456789/31909
Institution: Universiti Malaysia Perlis
Description
Summary: Speech is the most common and vocalized form of human communication. Communication through speech conveys linguistic information and also expresses information about the speaker's social and regional origin, health and emotional state. Recent improvements in phoneme based speech to text translation have made it one of the most exciting areas of speech signal processing; because of major advances in the statistical modeling of speech, automatic speech recognition systems have found widespread application in tasks that require a human-machine interface. Speech to text translation systems can be used in many applications, such as medical transcription (digital speech to text), automated transcription, telematics and air traffic control.

In this research work, two isolated-word speech signal databases have been built, namely the Vowels Class Word Database (VCWD) and the Phonemes Class Word Database (PCWD). The VCWD was initially built to classify isolated words based on the eleven classes of vowels. The database has been analyzed using four spectral analysis techniques, namely Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coefficients (LPC), Perceptual Linear Predictive analysis (PLP) and Relative Spectral Perceptual Linear Predictive analysis (RASTA-PLP), to determine the most discriminative features and to identify the network parameters. The PCWD has been built to develop the phoneme based speech to text translation system using LPC features and Multilayer Neural Network (MLNN) models with a fusion concept for the classification of isolated words and phonemes. The isolated-word speech signals are recorded using a speech acquisition algorithm developed with a MATLAB graphical user interface (GUI). The speech signals are recorded for 15 seconds at a 16 kHz sampling frequency.
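As a concrete illustration of the per-frame LPC analysis mentioned above, the sketch below frames a 16 kHz signal and estimates LPC coefficients with the autocorrelation method and the Levinson-Durbin recursion. The frame length, hop size and LPC order are illustrative assumptions, not the thesis's exact settings:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 16 kHz signal into 25 ms frames with a 10 ms hop
    (illustrative values, not the thesis's exact settings)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def lpc(frame, order=12):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.
    Returns [1, a1, ..., a_order] for the all-pole model."""
    w = frame * np.hamming(len(frame))
    # Autocorrelation lags 0..order of the windowed frame
    r = np.correlate(w, w, mode="full")[len(w) - 1:len(w) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                     # reflection coefficient
        a[1:i + 1] += k * a[i - 1::-1]     # update predictor coefficients
        err *= 1.0 - k * k                 # prediction error update
    return a
```

Stacking `lpc(f)` for each frame yields a per-frame feature matrix of the kind that serves as input to the classification stage.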
The recorded speech signals are pre-processed and segmented into voiced and unvoiced parts. A simple fuzzy voice classifier has been proposed to extract the voiced portions using frame energy and change-in-energy features. The extracted voiced portions are pre-processed and divided into a number of frames. For each frame, the spectral features are extracted and used as the feature set for classification. The classification tasks for the isolated words and phonemes are associated with the extracted features to establish an input-output mapping. The data are then normalized into a definite range and randomized.

The Multilayer Neural Network (MLNN) model has been developed with four combinations of input and hidden activation functions. To improve the performance rate and reduce the training time, a simple systole activation function has been proposed. The neural network models are trained with 60%, 70% and 80% of the total data samples and validated by simulating the network with the remaining 40%, 30% and 20%. The performance of each network is measured in terms of true positives, false negatives and classification accuracy, and the results are compared.

It is observed that the fuzzy voice classifier is less complex and yields better accuracy than the other voiced/unvoiced classification methods available in the literature. The LPC features show better discrimination, and the MLNN models trained on the LPC spectral band features give better classification accuracy than those using the other feature extraction algorithms. The proposed systole activation function also reduces the training time and epoch count compared with the other network models.
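The energy-based voiced/unvoiced segmentation can be sketched as follows. The abstract does not give the thesis's fuzzy rule base, so the sigmoid membership functions and their equal-weight aggregation here are illustrative stand-ins built from the two stated inputs, frame energy and change in energy:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """25 ms frames with a 10 ms hop at 16 kHz (illustrative values)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def voiced_mask(x, frame_len=400, hop=160):
    """Mark frames as voiced from frame energy and change in energy,
    combined through sigmoid membership functions (illustrative)."""
    frames = frame_signal(x, frame_len, hop)
    energy = np.sum(frames ** 2, axis=1)
    denergy = np.abs(np.diff(energy, prepend=energy[0]))

    def membership(v):
        # Map a feature to [0, 1]: high relative to the median -> voiced.
        v = v / (v.max() + 1e-12)
        return 1.0 / (1.0 + np.exp(-12.0 * (v - np.median(v))))

    # Equal-weight aggregation of the two fuzzy inputs.
    score = 0.5 * membership(energy) + 0.5 * membership(denergy)
    return score > 0.5
```

Frames selected by the mask would then be passed on to the spectral feature extraction stage.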
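The normalization, randomization and split-based training procedure can be sketched with a small two-layer network. The data, hidden size, learning rate and tanh hidden activation below are illustrative stand-ins (the abstract does not specify the systole activation function), shown for the 60%/40% split:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in data: 300 twelve-dimensional feature vectors
# (e.g. per-frame LPC features) from 3 hypothetical word classes.
means = rng.normal(scale=2.0, size=(3, 12))
X = np.concatenate([rng.normal(m, 1.0, size=(100, 12)) for m in means])
y = np.repeat(np.arange(3), 100)

# Normalize into a definite range ([0, 1]) and randomize the ordering.
X = (X - X.min(0)) / (X.max(0) - X.min(0))
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]

# 60%/40% train/validation split (70/30 and 80/20 are analogous).
split = int(0.6 * len(X))
X_tr, y_tr, X_va, y_va = X[:split], y[:split], X[split:], y[split:]

# One-hidden-layer network trained by full-batch gradient descent.
hidden, n_out, lr = 16, 3, 0.3
W1 = rng.normal(scale=0.5, size=(12, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.5, size=(hidden, n_out)); b2 = np.zeros(n_out)
Y = np.eye(n_out)[y_tr]                          # one-hot targets
for _ in range(1000):
    H = np.tanh(X_tr @ W1 + b1)                  # hidden activation
    Z = H @ W2 + b2
    P = np.exp(Z - Z.max(1, keepdims=True))
    P /= P.sum(1, keepdims=True)                 # softmax output
    dZ = (P - Y) / len(X_tr)                     # cross-entropy gradient
    dH = (dZ @ W2.T) * (1.0 - H ** 2)            # backprop through tanh
    W2 -= lr * (H.T @ dZ); b2 -= lr * dZ.sum(0)
    W1 -= lr * (X_tr.T @ dH); b1 -= lr * dH.sum(0)

pred = (np.tanh(X_va @ W1 + b1) @ W2 + b2).argmax(1)
accuracy = (pred == y_va).mean()                 # validation accuracy
```

Comparing such validation accuracies across the three splits (and across feature sets) mirrors the evaluation described above.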