An encoding scheme capturing generic priors and properties of amino acids improves protein classification

Feature engineering aims to represent non-numeric data with numeric features that preserve the essential information of the underlying problem, and it is a non-trivial step in building a predictive model. In bioinformatics, a vast number of DNA and protein sequences are available, but far...

Bibliographic Details
Main Authors:	Zhou, Xinrui; Yin, Rui; Zheng, Jie; Kwoh, Chee-Keong
Other Authors:	School of Computer Science and Engineering
Format:	Article
Language:	English
Published:	2019
Subjects:	DRNTU::Engineering::Computer science and engineering; Encoding Scheme; Feature Engineering
Online Access:	https://hdl.handle.net/10356/105937 http://hdl.handle.net/10220/48837 https://doi.org/10.21979/N9/4YDZED
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-105937
record_format	dspace
spelling	sg-ntu-dr.10356-105937 2021-01-18T04:50:19Z An encoding scheme capturing generic priors and properties of amino acids improves protein classification Zhou, Xinrui Yin, Rui Zheng, Jie Kwoh, Chee-Keong School of Computer Science and Engineering DRNTU::Engineering::Computer science and engineering Encoding Scheme Feature Engineering Feature engineering aims to represent non-numeric data with numeric features that preserve the essential information of the underlying problem, and it is a non-trivial step in building a predictive model. In bioinformatics, a vast number of DNA and protein sequences are available, but they are far from being fully utilized. Computational models can facilitate the analysis of large-scale data. However, most computational models require a numeric representation as input. Expert knowledge can help design features that represent the raw symbolic data effectively, but such features generally vary from case to case and have to be redesigned for each problem. Automated feature engineering, i.e., an encoding scheme that automates the construction of features, avoids this redesign and allows researchers to try different representations with minimal effort. This is more in line with the explosion of data and the goal of building an intelligent system. In this paper, we introduce an encoding scheme for protein sequences, which encodes the representative sequence dataset into a numeric matrix that can be fed into a downstream learning model. The method, Context-Free Encoding Scheme (CFreeEnS), was proposed for a dataset with labels for pairwise sequences. Here, we improve the method by making it applicable to a batch of protein sequences, requiring no sequence alignment beforehand. The improved method is applied to protein classification at the functional level, including identifying antimicrobial peptides, screening tumor homing peptides, and detecting hemolytic peptides and phage virion proteins. Compared with traditional methods using task-specific designed features, CFreeEnS improves prediction accuracy, with an increase ranging from 5.54% to 14.14%. The results indicate that the improved CFreeEnS, free from dependence on carefully designed features, is promising in capturing generic priors and essential properties of amino acids, thereby serving as an automated feature engineering method for protein sequences. MOE (Min. of Education, S’pore) Published version 2019-06-19T06:46:18Z 2019-12-06T22:01:06Z 2019-06-19T06:46:18Z 2019-12-06T22:01:06Z 2018 Journal Article Zhou, X., Yin, R., Zheng, J., & Kwoh, C.-K. (2019). An encoding scheme capturing generic priors and properties of amino acids improves protein classification. IEEE Access, 7, 7348-7356. doi:10.1109/ACCESS.2018.2890096 https://hdl.handle.net/10356/105937 http://hdl.handle.net/10220/48837 10.1109/ACCESS.2018.2890096 en IEEE Access https://doi.org/10.21979/N9/4YDZED © 2018 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. 9 p. application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	DRNTU::Engineering::Computer science and engineering; Encoding Scheme; Feature Engineering
spellingShingle	DRNTU::Engineering::Computer science and engineering Encoding Scheme Feature Engineering Zhou, Xinrui Yin, Rui Zheng, Jie Kwoh, Chee-Keong An encoding scheme capturing generic priors and properties of amino acids improves protein classification
description	Feature engineering aims to represent non-numeric data with numeric features that preserve the essential information of the underlying problem, and it is a non-trivial step in building a predictive model. In bioinformatics, a vast number of DNA and protein sequences are available, but they are far from being fully utilized. Computational models can facilitate the analysis of large-scale data. However, most computational models require a numeric representation as input. Expert knowledge can help design features that represent the raw symbolic data effectively, but such features generally vary from case to case and have to be redesigned for each problem. Automated feature engineering, i.e., an encoding scheme that automates the construction of features, avoids this redesign and allows researchers to try different representations with minimal effort. This is more in line with the explosion of data and the goal of building an intelligent system. In this paper, we introduce an encoding scheme for protein sequences, which encodes the representative sequence dataset into a numeric matrix that can be fed into a downstream learning model. The method, Context-Free Encoding Scheme (CFreeEnS), was proposed for a dataset with labels for pairwise sequences. Here, we improve the method by making it applicable to a batch of protein sequences, requiring no sequence alignment beforehand. The improved method is applied to protein classification at the functional level, including identifying antimicrobial peptides, screening tumor homing peptides, and detecting hemolytic peptides and phage virion proteins. Compared with traditional methods using task-specific designed features, CFreeEnS improves prediction accuracy, with an increase ranging from 5.54% to 14.14%. The results indicate that the improved CFreeEnS, free from dependence on carefully designed features, is promising in capturing generic priors and essential properties of amino acids, thereby serving as an automated feature engineering method for protein sequences.
author2	School of Computer Science and Engineering
author_facet	School of Computer Science and Engineering Zhou, Xinrui Yin, Rui Zheng, Jie Kwoh, Chee-Keong
format	Article
author	Zhou, Xinrui; Yin, Rui; Zheng, Jie; Kwoh, Chee-Keong
author_sort	Zhou, Xinrui
title	An encoding scheme capturing generic priors and properties of amino acids improves protein classification
title_short	An encoding scheme capturing generic priors and properties of amino acids improves protein classification
title_full	An encoding scheme capturing generic priors and properties of amino acids improves protein classification
title_fullStr	An encoding scheme capturing generic priors and properties of amino acids improves protein classification
title_full_unstemmed	An encoding scheme capturing generic priors and properties of amino acids improves protein classification
title_sort	encoding scheme capturing generic priors and properties of amino acids improves protein classification
publishDate	2019
url	https://hdl.handle.net/10356/105937 http://hdl.handle.net/10220/48837 https://doi.org/10.21979/N9/4YDZED
_version_	1690658494179442688