Dataset size and dimensionality reduction approaches for handwritten Farsi digits and characters recognition / Mohammad Amin Shayegan

In all pattern recognition systems, increasing recognition speed and improving recognition accuracy are two important goals. However, these goals usually work against each other: when one improves, the other declines. In this thesis, the focus is on bot...

Bibliographic Details
Main Author: Shayegan, Mohammad Amin
Format: Thesis
Published: 2015
Subjects: QA75 Electronic computers. Computer science
Online Access: http://studentsrepo.um.edu.my/4948/1/Final_Version_of_the_Thesis.pdf
http://studentsrepo.um.edu.my/4948/
Institution: Universiti Malaya
id my.um.stud.4948
record_format eprints
spelling my.um.stud.4948 2015-03-11T02:02:54Z Dataset size and dimensionality reduction approaches for handwritten Farsi digits and characters recognition / Mohammad Amin Shayegan Shayegan, Mohammad Amin QA75 Electronic computers. Computer science
2015 Thesis NonPeerReviewed application/pdf http://studentsrepo.um.edu.my/4948/1/Final_Version_of_the_Thesis.pdf Shayegan, Mohammad Amin (2015) Dataset size and dimensionality reduction approaches for handwritten farsi digits and characters recognition / Mohammad Amin Shayegan. PhD thesis, University of Malaya. http://studentsrepo.um.edu.my/4948/
institution Universiti Malaya
building UM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Malaya
content_source UM Student Repository
url_provider http://studentsrepo.um.edu.my/
topic QA75 Electronic computers. Computer science
description In all pattern recognition systems, increasing recognition speed and improving recognition accuracy are two important goals. However, these goals usually work against each other: when one improves, the other declines. This thesis addresses both: decreasing the overall processing time and increasing the system accuracy. To this end, the number of training samples is decreased by a proposed dataset size reduction technique, which shortens the training/testing time. The number of features is also decreased by a new dimensionality reduction technique. It shortens the training and testing time and, by deleting less important features, also increases the system accuracy. Existing dataset size reduction algorithms usually remove samples near the centers of classes, or support vector samples between different classes. However, the former samples carry valuable information about the class characteristics and are important for building the system model, while the latter are important for evaluating system efficiency and adjusting system parameters. The proposed dataset size reduction method employs the Modified Frequency Diagram technique to create a template for each class. A similarity value is then calculated for each pattern, and the samples in each class are rearranged based on their similarity values. Finally, the number of training samples is reduced by the Sieving technique. As a result, the training/testing time is decreased. In the other part of this study, the number of extracted features is decreased by a new method that analyzes the one-dimensional and two-dimensional spectrum diagrams of the standard deviation and minimum-to-maximum distributions of the initial feature vector elements.
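The dataset size reduction steps named in the description (per-class template, similarity value, rearrangement, sieving) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the thesis's method: the template is a plain per-pixel ink frequency rather than the Modified Frequency Diagram, the similarity score is a simple mean agreement, and "sieving" is taken to mean keeping every k-th sample of the ranked list. The names `frequency_template`, `similarity`, `sieve_class`, and `keep_every` are hypothetical.

```python
from typing import List, Sequence

Image = Sequence[Sequence[int]]  # binary image: 1 = ink, 0 = background


def frequency_template(samples: Sequence[Image]) -> List[List[float]]:
    """Per-pixel fraction of the class samples in which the pixel is ink."""
    rows, cols = len(samples[0]), len(samples[0][0])
    n = len(samples)
    return [
        [sum(img[r][c] for img in samples) / n for c in range(cols)]
        for r in range(rows)
    ]


def similarity(img: Image, template: Sequence[Sequence[float]]) -> float:
    """Assumed score: mean agreement between an image and the class template."""
    total = 0.0
    count = 0
    for img_row, tpl_row in zip(img, template):
        for pixel, freq in zip(img_row, tpl_row):
            # An ink pixel scores the ink frequency; a background pixel
            # scores its complement.
            total += freq if pixel else 1.0 - freq
            count += 1
    return total / count


def sieve_class(samples: Sequence[Image], keep_every: int = 2) -> List[Image]:
    """Rank one class by similarity to its template, then keep every k-th sample.

    The every-k-th rule is an assumption standing in for the Sieving technique.
    """
    template = frequency_template(samples)
    ranked = sorted(samples, key=lambda img: similarity(img, template), reverse=True)
    return list(ranked[::keep_every])
```

Sampling at regular intervals from the ranked list keeps both template-like and atypical samples, which is consistent with the description's objection to discarding either group. Whether the thesis's Sieving technique selects samples this way is not stated in this record.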
In recent years, the appeal of Optical Character Recognition (OCR) has led researchers to develop various algorithms for recognizing different alphabets. The target performance for an OCR system is to recognize at least five characters per second with 99.9% accuracy. However, the performance of available handwritten Farsi OCR systems is still lacking in terms of both accuracy and speed. The techniques proposed in this thesis have been validated in the handwritten OCR domain using two large standard benchmark datasets: Hoda for Farsi digits and letters, and MNIST for Latin digits. The proposed dataset size reduction technique decreased the training time to less than half, while the accuracy decreased by only 0.68%. Both datasets (Hoda and MNIST) were also used for dimensionality reduction. Here, the dimension of the feature vector was reduced to 59.40% for the MNIST dataset, 43.61% for the digits part of the Hoda dataset, and 69.92% for the characters part of the Hoda dataset. Meanwhile, the accuracies improved by 2.95%, 4.71%, and 1.92%, respectively. The results showed the superiority of the proposed method over rival dimension reduction methods. The proposed size reduction technique can be used for other pictorial datasets, and the proposed dimensionality reduction technique can be employed in any other pattern recognition system with numerical feature vectors.
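The description bases dimensionality reduction on the standard deviation and minimum-to-maximum spread of each feature. A minimal sketch of that general idea follows. It is a simple threshold filter swapped in for illustration: the thesis analyzes spectrum diagrams, and this record does not say how features are finally chosen. The thresholds `min_std` and `min_range` and all function names are hypothetical.

```python
from statistics import pstdev
from typing import List, Sequence, Tuple


def feature_stats(vectors: Sequence[Sequence[float]]) -> List[Tuple[float, float]]:
    """Return (population standard deviation, max - min) for each feature."""
    columns = list(zip(*vectors))  # one tuple of values per feature
    return [(pstdev(col), max(col) - min(col)) for col in columns]


def select_features(
    vectors: Sequence[Sequence[float]],
    min_std: float = 0.05,    # hypothetical threshold
    min_range: float = 0.1,   # hypothetical threshold
) -> List[int]:
    """Keep indices of features whose spread meets both thresholds."""
    return [
        i
        for i, (std, value_range) in enumerate(feature_stats(vectors))
        if std >= min_std and value_range >= min_range
    ]


def retained_percent(kept: Sequence[int], original_dim: int) -> float:
    """Size of the reduced feature vector as a percentage of the original."""
    return 100.0 * len(kept) / original_dim
```

Features that barely vary across the training set carry little information for separating classes, so a filter of this kind removes them first. `retained_percent` expresses the result in the same form as the percentages quoted in the description, under the assumption that those figures give the reduced dimension relative to the original.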
format Thesis
author Shayegan, Mohammad Amin
title Dataset size and dimensionality reduction approaches for handwritten Farsi digits and characters recognition / Mohammad Amin Shayegan
publishDate 2015
url http://studentsrepo.um.edu.my/4948/1/Final_Version_of_the_Thesis.pdf
http://studentsrepo.um.edu.my/4948/
_version_ 1738505731325296640