Dataset size and dimensionality reduction approaches for handwritten farsi digits and characters recognition / Mohammad Amin Shayegan

In all pattern recognition systems, increasing recognition speed and improving recognition accuracy are two important goals. However, these goals usually work against each other: when one is improved, the other degrades. In this thesis, the focus is on bot...

Bibliographic Details
Main Author:	Shayegan, Mohammad Amin
Format:	Thesis
Published:	2015
Subjects:	QA75 Electronic computers. Computer science
Online Access:	http://studentsrepo.um.edu.my/4948/1/Final_Version_of_the_Thesis.pdf http://studentsrepo.um.edu.my/4948/
Institution:	Universiti Malaya

id	my.um.stud.4948
record_format	eprints
spelling	my.um.stud.49482015-03-11T02:02:54Z Dataset size and dimensionality reduction approaches for handwritten farsi digits and characters recognition / Mohammad Amin Shayegan Shayegan, Mohammad Amin QA75 Electronic computers. Computer science
2015 Thesis NonPeerReviewed application/pdf http://studentsrepo.um.edu.my/4948/1/Final_Version_of_the_Thesis.pdf Shayegan, Mohammad Amin (2015) Dataset size and dimensionality reduction approaches for handwritten farsi digits and characters recognition / Mohammad Amin Shayegan. PhD thesis, University of Malaya. http://studentsrepo.um.edu.my/4948/
institution	Universiti Malaya
building	UM Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Malaya
content_source	UM Student Repository
url_provider	http://studentsrepo.um.edu.my/
topic	QA75 Electronic computers. Computer science
spellingShingle	QA75 Electronic computers. Computer science Shayegan, Mohammad Amin Dataset size and dimensionality reduction approaches for handwritten farsi digits and characters recognition / Mohammad Amin Shayegan
description	In all pattern recognition systems, increasing recognition speed and improving recognition accuracy are two important goals. However, these goals usually work against each other: when one is improved, the other degrades. This thesis addresses both, aiming to decrease overall processing time and increase system accuracy. To this end, the number of training samples is decreased through a proposed dataset size reduction technique, which shortens training/testing time. The number of features is also decreased through a new dimensionality reduction technique, which shortens training and testing time and, by deleting less important features, increases system accuracy as well. Existing dataset size reduction algorithms usually remove samples near the centers of classes, or support vector samples between different classes. However, the former samples carry valuable information about class characteristics and are important for building the system model, while the latter are important for evaluating system efficiency and adjusting system parameters. The proposed dataset size reduction method employs a Modified Frequency Diagram technique to create a template for each class. A similarity value is then calculated for each pattern, and the samples in each class are rearranged based on their similarity values. The number of training samples is then reduced by a Sieving technique, which decreases the training/testing time. In the other part of this study, the number of extracted features is decreased by a new method that analyzes the one-dimensional and two-dimensional spectrum diagrams of the standard deviation and minimum-to-maximum distributions of the initial feature vector elements.
In recent years, the appeal of Optical Character Recognition (OCR) has led researchers to develop various algorithms for recognizing different alphabets. The target performance for an OCR system is to recognize at least five characters per second with 99.9% accuracy. However, the performance of available handwritten Farsi OCR systems is still lacking in both accuracy and speed. The techniques proposed in this thesis were validated in the handwritten OCR domain on two large standard benchmark datasets: Hoda for Farsi digits and letters, and MNIST for Latin digits. The proposed dataset size reduction technique decreased the training time to less than half, while accuracy decreased by only 0.68%. Both datasets (Hoda and MNIST) were also used to evaluate dimensionality reduction. Here, the dimension of the feature vector was reduced to 59.40% for the MNIST dataset, 43.61% for the digits part of the Hoda dataset, and 69.92% for the characters part of the Hoda dataset, while accuracy improved by 2.95%, 4.71%, and 1.92%, respectively. These results show the superiority of the proposed method over rival dimension reduction methods. The proposed size reduction technique can be used for other pictorial datasets, and the proposed dimensionality reduction technique can be employed in any other pattern recognition system with numerical feature vectors.
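The dataset size reduction pipeline summarized in the description (class template, per-sample similarity, reordering, sieving) can be illustrated with a minimal Python sketch. The abstract does not define the Modified Frequency Diagram, the similarity measure, or the Sieving rule, so everything below is an assumption made for illustration only: a plain per-pixel frequency diagram stands in for the template, mean pixel agreement stands in for the similarity value, and "keep every k-th sample of the ranked list" stands in for sieving. The names `frequency_template`, `similarity`, `sieve_class`, and `keep_every` are hypothetical and do not come from the thesis.

```python
from typing import List, Sequence

Image = Sequence[Sequence[int]]  # binary image: rows of 0/1 pixels


def frequency_template(images: List[Image]) -> List[List[float]]:
    """Per-pixel fraction of images whose pixel is set.

    Assumption: a plain frequency diagram, not the thesis's modified variant.
    """
    n = len(images)
    rows, cols = len(images[0]), len(images[0][0])
    return [[sum(img[r][c] for img in images) / n for c in range(cols)]
            for r in range(rows)]


def similarity(image: Image, template: List[List[float]]) -> float:
    """Hypothetical similarity: mean agreement between pixels and template."""
    total = 0.0
    count = 0
    for img_row, tpl_row in zip(image, template):
        for pixel, freq in zip(img_row, tpl_row):
            # A set pixel agrees with a high frequency, a clear pixel with a low one.
            total += freq if pixel else 1.0 - freq
            count += 1
    return total / count


def sieve_class(images: List[Image], keep_every: int = 2) -> List[Image]:
    """Rank one class by similarity to its template, then keep every k-th sample.

    Assumption: sieving is modelled as uniform thinning of the ranked list.
    """
    template = frequency_template(images)
    ranked = sorted(images, key=lambda img: similarity(img, template), reverse=True)
    return ranked[::keep_every]
```

Thinning the ranked list uniformly, rather than cutting off one end, retains samples from across the similarity range. This is one way to avoid discarding only the near-center or only the boundary samples, which the description identifies as a weakness of existing methods. Calling `sieve_class(samples_of_one_class, keep_every=2)` on each class would roughly halve the training set under these assumptions.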
format	Thesis
author	Shayegan, Mohammad Amin
author_facet	Shayegan, Mohammad Amin
author_sort	Shayegan, Mohammad Amin
title	Dataset size and dimensionality reduction approaches for handwritten farsi digits and characters recognition / Mohammad Amin Shayegan
title_short	Dataset size and dimensionality reduction approaches for handwritten farsi digits and characters recognition / Mohammad Amin Shayegan
title_full	Dataset size and dimensionality reduction approaches for handwritten farsi digits and characters recognition / Mohammad Amin Shayegan
title_fullStr	Dataset size and dimensionality reduction approaches for handwritten farsi digits and characters recognition / Mohammad Amin Shayegan
title_full_unstemmed	Dataset size and dimensionality reduction approaches for handwritten farsi digits and characters recognition / Mohammad Amin Shayegan
title_sort	dataset size and dimensionality reduction approaches for handwritten farsi digits and characters recognition / mohammad amin shayegan
publishDate	2015
url	http://studentsrepo.um.edu.my/4948/1/Final_Version_of_the_Thesis.pdf http://studentsrepo.um.edu.my/4948/
_version_	1738505731325296640

Similar Items