Splice junction classification problems for DNA sequences: Representation issues

Splice junction classification in a eukaryotic cell is an important problem because the splice junction indicates which part of the DNA sequence carries protein-coding information. The major issue in building a classifier for this task is how to represent the DNA sequence on a computer...

Bibliographic Details
Main Authors:	Sarkar, M., Tze-Yun LEONG
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2001
Subjects:	Classification DNA Exon Gene Intron Random walk and Hurst coefficient Representation Splice boundary Health Information Technology Numerical Analysis and Scientific Computing
Online Access:	https://ink.library.smu.edu.sg/sis_research/3041
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-4041
record_format	dspace
spelling	sg-smu-ink.sis_research-4041 2016-03-10T02:40:11Z Splice junction classification problems for DNA sequences: Representation issues Sarkar, M., Tze-Yun LEONG 2001-12-01T08:00:00Z text https://ink.library.smu.edu.sg/sis_research/3041 Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Classification DNA Exon Gene Intron Random walk and Hurst coefficient Representation Splice boundary Health Information Technology Numerical Analysis and Scientific Computing
description	Splice junction classification in a eukaryotic cell is an important problem because the splice junction indicates which part of the DNA sequence carries protein-coding information. The major issue in building a classifier for this task is how to represent the DNA sequence on a computer, since the accuracy of any classification technique hinges critically on the adopted representation. This paper presents experimental results on seven representation schemes. The first three representations interpret each DNA sequence as a series of symbols. The fourth and fifth representations treat the sequence as a series of real numbers. Moreover, the first, second and fourth representations do not consider the influence of neighbors on the occurrence of a nucleotide, whereas the third and fifth representations take the influence of neighbors into consideration. To capture regularity within the apparent randomness of the DNA sequence, the sixth representation treats the sequence as a variant of a random walk. The seventh representation uses the Hurst coefficient, which quantifies the roughness of the DNA sequences. The experimental results suggest that the fourth representation scheme places sequences from the same class close together and sequences from different classes far apart, and thus finds a structure in the input space that provides the best classification results.
format	text
author	Sarkar, M., Tze-Yun LEONG
author_facet	Sarkar, M., Tze-Yun LEONG
author_sort	Sarkar, M.
title	Splice junction classification problems for DNA sequences: Representation issues
publisher	Institutional Knowledge at Singapore Management University
publishDate	2001
url	https://ink.library.smu.edu.sg/sis_research/3041
_version_	1770572788554268672
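
The description in this record contrasts representations that read a DNA sequence as a series of symbols with one that treats it as a variant of a random walk. The record does not give the paper's actual encodings, so the following is only a minimal sketch under stated assumptions: the one-hot table `ONE_HOT`, the purine/pyrimidine step rule `STEP`, and the function names `one_hot` and `dna_walk` are all hypothetical illustrations, not the authors' schemes.

```python
from itertools import accumulate

# ASSUMPTION: a one-hot table is one common way to encode nucleotides as symbols;
# the record does not say which symbolic encoding the paper uses.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

# ASSUMPTION: purines (A, G) step up and pyrimidines (C, T) step down.
# This is a commonly used "DNA walk" rule, not necessarily the paper's variant.
STEP = {"A": 1, "G": 1, "C": -1, "T": -1}


def one_hot(seq):
    """Symbolic view: flatten one 4-element indicator vector per nucleotide."""
    return [bit for base in seq.upper() for bit in ONE_HOT[base]]


def dna_walk(seq):
    """Random-walk view: running sum of +1/-1 steps along the sequence."""
    return list(accumulate(STEP[base] for base in seq.upper()))


if __name__ == "__main__":
    fragment = "AGCT"  # toy fragment, not taken from any dataset
    print(one_hot(fragment))
    print(dna_walk(fragment))
```

The symbolic vector grows with sequence length and treats every position independently, while the walk accumulates history along the sequence. That difference is the kind of representational choice the description says the classifier's accuracy depends on.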


Similar Items