Improved random forest for feature selection in writer identification

Writer Identification (WI) is a process to determine the writer of a given handwriting sample. A handwriting sample consists of various types of features. These features are unique due to the writer’s characteristics and individuality, which challenges the identification process. Some features do no...

Bibliographic Details
Main Author: Sukor, Nooraziera Akmal
Format: Thesis
Language: English
Published: 2015
Subjects:
Online Access: http://eprints.utem.edu.my/id/eprint/16842/1/Improved%20Random%20Forest%20For%20Feature%20Selection%20In%20Writer%20Identification.pdf
http://eprints.utem.edu.my/id/eprint/16842/2/Improved%20random%20forest%20for%20feature%20selection%20in%20writer%20identification.pdf
http://eprints.utem.edu.my/id/eprint/16842/
https://plh.utem.edu.my/cgi-bin/koha/opac-detail.pl?biblionumber=96166
Institution: Universiti Teknikal Malaysia Melaka
id my.utem.eprints.16842
record_format eprints
spelling my.utem.eprints.16842 2022-06-07T13:30:20Z Sukor, Nooraziera Akmal (2015) Improved random forest for feature selection in writer identification. Masters thesis, Universiti Teknikal Malaysia Melaka. NonPeerReviewed.
institution Universiti Teknikal Malaysia Melaka
building UTEM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Teknikal Malaysia Melaka
content_source UTEM Institutional Repository
url_provider http://eprints.utem.edu.my/
language English
topic T Technology (General)
TA Engineering (General). Civil engineering (General)
description Writer Identification (WI) is the process of determining the writer of a given handwriting sample. A handwriting sample consists of various types of features. These features are unique due to the writer’s characteristics and individuality, which challenges the identification process. Some features do not provide useful information and may decrease the performance of a classifier. Thus, a feature selection process is implemented in the WI process. Feature selection identifies and selects the most significant features from those present in handwriting documents and eliminates the irrelevant ones. In line with the WI framework, a discretization process is applied before feature selection. Discretization has been shown to increase classification performance and improve identification performance in WI. An algorithm and framework based on an Improved Random Forest (IRF) tree was applied for the feature selection process. An RF tree is a collection of tree predictors that ensembles decision tree models with a randomized selection of features at each split. It uses Classification and Regression Tree (CART) during tree development. Feature importance is measured using Variable Importance (VI), while Mean Absolute Error (MAE) values are used to identify the variance between writers. The VI value is used for the splitting process in the tree, and the MAE value ensures that the intra-class (same writer) invariance is lower than the inter-class (different writer) invariance, because lower intra-class invariance indicates accuracy to the real author. The number of selected features and the classification accuracy are used to indicate the performance of the feature selection method. Experimental results show that, on the discretized dataset, the IRF tree identified the third feature (f3) as the most important feature, with an average classification accuracy of 99.19%. For the undiscretized dataset, the first feature (f1) and third feature (f3) are the most important features, with an average classification accuracy of 40.79%.
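The intra-class versus inter-class MAE comparison described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the thesis's exact MAE formulation, so this assumes a plain pairwise mean absolute error between feature vectors; the function names (`mae`, `intra_inter_mae`) and the toy data are hypothetical and are not taken from the thesis.

```python
from itertools import combinations
from statistics import mean


def mae(a, b):
    """Mean absolute error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return mean(abs(x - y) for x, y in zip(a, b))


def intra_inter_mae(samples):
    """Average pairwise MAE for same-writer and different-writer pairs.

    `samples` is a list of (writer_id, feature_vector) tuples.
    Returns (intra_class_mae, inter_class_mae).
    """
    intra, inter = [], []
    for (w1, f1), (w2, f2) in combinations(samples, 2):
        # Same writer -> intra-class pair; different writers -> inter-class pair.
        (intra if w1 == w2 else inter).append(mae(f1, f2))
    if not intra or not inter:
        raise ValueError("need at least one same-writer and one different-writer pair")
    return mean(intra), mean(inter)


# Hypothetical toy data (assumption): two writers, two samples each, three features.
samples = [
    ("writer_a", [1.0, 2.0, 3.0]),
    ("writer_a", [1.0, 2.0, 4.0]),
    ("writer_b", [5.0, 6.0, 7.0]),
    ("writer_b", [5.0, 7.0, 7.0]),
]

intra, inter = intra_inter_mae(samples)
print(f"intra-class MAE: {intra:.3f}")
print(f"inter-class MAE: {inter:.3f}")
print("intra-class lower than inter-class:", intra < inter)
```

Under this assumed definition, a feature set is behaving as the abstract describes when the first value is smaller than the second, that is, when samples from the same writer sit closer together than samples from different writers.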
format Thesis
author Sukor, Nooraziera Akmal
title Improved random forest for feature selection in writer identification
publishDate 2015