Using word embedding technique to efficiently represent protein sequences for identifying substrate specificities of transporters

Membrane transport proteins and their substrate specificities play crucial roles in various cellular functions. Identifying the substrate specificities of membrane transport proteins is closely related to protein-target interaction prediction, drug design, membrane recruitment, and dysregulation ana...

Bibliographic Details
Main Authors: Nguyen, Trinh-Trung-Duong; Le, Nguyen Quoc Khanh; Ho, Quang-Thai; Phan, Dinh-Van; Ou, Yu-Yen
Other Authors: School of Humanities
Format: Article
Language: English
Published: 2021
Subjects: Science::Biological sciences; Word Embeddings; Feature Extraction
Online Access: https://hdl.handle.net/10356/150972
Institution: Nanyang Technological University
id sg-ntu-dr.10356-150972
record_format dspace
spelling sg-ntu-dr.10356-150972 2021-05-31T08:34:45Z
Using word embedding technique to efficiently represent protein sequences for identifying substrate specificities of transporters
Nguyen, Trinh-Trung-Duong; Le, Nguyen Quoc Khanh; Ho, Quang-Thai; Phan, Dinh-Van; Ou, Yu-Yen
School of Humanities
Science::Biological sciences; Word Embeddings; Feature Extraction
The authors acknowledge support from the Ministry of Science and Technology, Taiwan, R.O.C. under Grant no. MOST 106-2221-E-155-068.
2021-05-31T08:34:44Z 2021-05-31T08:34:44Z 2019 Journal Article
Nguyen, T., Le, N. Q. K., Ho, Q., Phan, D. & Ou, Y. (2019). Using word embedding technique to efficiently represent protein sequences for identifying substrate specificities of transporters. Analytical Biochemistry, 577, 73-81. https://dx.doi.org/10.1016/j.ab.2019.04.011
ISSN 0003-2697
https://hdl.handle.net/10356/150972
DOI 10.1016/j.ab.2019.04.011
PMID 31022378
Scopus 2-s2.0-85064809652
Volume 577, pages 73-81
en
Analytical Biochemistry
© 2019 Elsevier Inc. All rights reserved.
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Science::Biological sciences
Word Embeddings
Feature Extraction
description Membrane transport proteins and their substrate specificities play crucial roles in various cellular functions. Identifying the substrate specificities of membrane transport proteins is closely related to protein-target interaction prediction, drug design, membrane recruitment, and dysregulation analysis, which makes it an important problem for bioinformatics researchers. In this study, we applied the word embedding approach, the main driver of the recent breakthroughs in natural language processing, to the protein sequences of transporters. We represented each protein sequence by the word embeddings and frequencies of its biological words. The resulting protein features were then fed into machine learning models for prediction. We also varied the length of the biological words that make up each protein sequence to find the optimal length, that is, the one that generated the most discriminative feature set. Compared with four other feature types derived from protein sequences, our proposed features helped prediction models achieve superior performance. Our best models reach an average area under the curve of 0.96 on 5-fold cross-validation and 0.99 on the independent test. With this result, our study can help biologists identify transporters based on their substrate specificities, and it provides a basis for further research on applying natural language processing techniques in bioinformatics.
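The feature construction described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record does not say how the biological words are extracted, how the embeddings are trained, or how embeddings and frequencies are combined. The sketch assumes overlapping words of length k, a pre-trained embedding lookup table, and a frequency-weighted average as the combination rule. The function names and the toy embedding values are hypothetical and exist only for illustration.

```python
from collections import Counter


def biological_words(sequence, k):
    """Split a protein sequence into overlapping words of k residues (assumed scheme)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def sequence_features(sequence, embeddings, k, dim):
    """Return the frequency-weighted average embedding of a sequence's words.

    Words without an entry in `embeddings` are ignored. If no word is known,
    a zero vector of length `dim` is returned.
    """
    counts = Counter(w for w in biological_words(sequence, k) if w in embeddings)
    total = sum(counts.values())
    features = [0.0] * dim
    if total == 0:
        return features
    for word, count in counts.items():
        weight = count / total  # relative frequency of this word in the sequence
        for j, value in enumerate(embeddings[word]):
            features[j] += weight * value
    return features


# Toy 2-dimensional embeddings, invented for illustration only.
toy_embeddings = {"MK": [1.0, 0.0], "KT": [0.0, 1.0], "TM": [1.0, 1.0]}

features = sequence_features("MKTMK", toy_embeddings, k=2, dim=2)
print(features)  # [0.75, 0.5]
```

In this toy case the sequence MKTMK yields the words MK, KT, TM, MK, so MK carries weight 0.5 and the other two words 0.25 each. Varying k, as the abstract describes, would correspond to calling sequence_features with different word lengths and comparing the downstream prediction performance.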
author2 School of Humanities
format Article
author Nguyen, Trinh-Trung-Duong
Le, Nguyen Quoc Khanh
Ho, Quang-Thai
Phan, Dinh-Van
Ou, Yu-Yen
author_sort Nguyen, Trinh-Trung-Duong
title Using word embedding technique to efficiently represent protein sequences for identifying substrate specificities of transporters
publishDate 2021
url https://hdl.handle.net/10356/150972
_version_ 1702418253853229056