Sequence comparison latent semantic analysis and support vector machine to detect remote protein homology

Remote protein homology detection refers to detecting structural homology between proteins with weak sequence similarity. Remote protein homology is important for identifying the function of new proteins, which could assist in curing genetic diseases, performing drug design, and identifying novel enzymes. To detect remote protein...

Bibliographic Details
Main Author: Ismail, Surayati
Format: Thesis
Language: English
Published: 2010
Subjects: QA75 Electronic computers. Computer science
Online Access: http://eprints.utm.my/id/eprint/16677/7/SurayatiIsmailMFSKSM2010.pdf
http://eprints.utm.my/id/eprint/16677/
Institution: Universiti Teknologi Malaysia
id my.utm.16677
record_format eprints
spelling my.utm.16677 2017-09-17T08:13:19Z http://eprints.utm.my/id/eprint/16677/ Sequence comparison latent semantic analysis and support vector machine to detect remote protein homology Ismail, Surayati QA75 Electronic computers. Computer science 2010 Thesis NonPeerReviewed application/pdf en http://eprints.utm.my/id/eprint/16677/7/SurayatiIsmailMFSKSM2010.pdf Ismail, Surayati (2010) Sequence comparison latent semantic analysis and support vector machine to detect remote protein homology. Masters thesis, Universiti Teknologi Malaysia, Faculty of Computer Science and Information System.
institution Universiti Teknologi Malaysia
building UTM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Teknologi Malaysia
content_source UTM Institutional Repository
url_provider http://eprints.utm.my/
language English
topic QA75 Electronic computers. Computer science
spellingShingle QA75 Electronic computers. Computer science
Ismail, Surayati
Sequence comparison latent semantic analysis and support vector machine to detect remote protein homology
description Remote protein homology detection refers to detecting structural homology between proteins with weak sequence similarity. Remote protein homology is important for identifying the function of new proteins, which could assist in curing genetic diseases, performing drug design, and identifying novel enzymes. Researchers have identified two main problems in detecting remote protein homology: homology detection for hard-to-align proteins, and high-dimensional protein feature vectors caused by redundant and noisy data. To address these problems, a new computational framework for remote protein homology detection has been developed. The framework begins by extracting the structural similarity of proteins using a highly sensitive structural similarity algorithm, which consists of four steps: split the protein sequences into substrings, calculate similarity using pairwise protein substring alignment, build a guide tree, and extract the high structural similarity using multiple protein sequence alignment. Then, the Latent Semantic Analysis (LSA) algorithm is used to produce feature vectors. The LSA consists of three steps: generate protein pattern blocks using the TEIRESIAS algorithm, remove redundant data using the chi-square algorithm, and remove noisy data using the Singular Value Decomposition (SVD) algorithm. Lastly, the framework uses a Support Vector Machine (SVM) to classify all the proteins as homologue or non-homologue members. The proposed framework is evaluated on a dataset from the SCOP database version 1.53, and its performance is compared with that of other methods: the PSI-BLAST and SVM-Pairwise sequence comparison models, the SAM and HMMER generative models, and the SVM-Fisher and SVM-I-Sites discriminative classifier models. The comparison is made in terms of Receiver Operating Characteristic (ROC), Median Rate of False Positives (MRFP), and family-by-family comparison of ROC. The results show that the proposed framework outperforms the other remote protein homology detection methods compared.
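The two summary metrics named in the description can be illustrated with a minimal sketch. It assumes the definitions commonly used in the remote homology literature: the ROC score as the area under the ROC curve, and the median RFP as the fraction of negative test sequences that score at or above the median score of the positive test sequences. The record does not show how the thesis computes them, so the function names `roc_score` and `median_rfp` and the score lists below are hypothetical, not the author's code or data.

```python
from statistics import median


def roc_score(pos_scores, neg_scores):
    """Area under the ROC curve, computed by pairwise comparison.

    Equals the probability that a randomly chosen positive outscores a
    randomly chosen negative; ties count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def median_rfp(pos_scores, neg_scores):
    """Fraction of negatives scoring at or above the median positive score."""
    threshold = median(pos_scores)
    false_pos = sum(1 for n in neg_scores if n >= threshold)
    return false_pos / len(neg_scores)


# Made-up classifier outputs for one hypothetical protein family.
positives = [2.0, 1.0, 0.5, -0.5]
negatives = [1.5, 0.0, -1.0, -2.0, -3.0]

print(roc_score(positives, negatives))   # higher is better
print(median_rfp(positives, negatives))  # lower is better
```

In a family-by-family comparison, both functions would be applied once per family and the per-family values of each method compared.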
format Thesis
author Ismail, Surayati
author_facet Ismail, Surayati
author_sort Ismail, Surayati
title Sequence comparison latent semantic analysis and support vector machine to detect remote protein homology
title_sort sequence comparison latent semantic analysis and support vector machine to detect remote protein homology
publishDate 2010
url http://eprints.utm.my/id/eprint/16677/7/SurayatiIsmailMFSKSM2010.pdf
http://eprints.utm.my/id/eprint/16677/
_version_ 1643646629354405888