Hash encoding on nucleotide acids for classification
Extracting meaningful information from DNA is a key element of bioinformatics research, and DNA sequence classification has a wide range of applications. In recent years, Machine Learning and Deep Learning techniques have become popular, and Convolutional Neural Networks (CNN) in particular have been widely...
Main Author: | Ni, Wei |
---|---|
Other Authors: | Kwoh Chee Keong |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2021 |
Subjects: | Engineering::Computer science and engineering |
Online Access: | https://hdl.handle.net/10356/147987 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-147987 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-1479872021-04-16T05:01:45Z Hash encoding on nucleotide acids for classification Ni, Wei Kwoh Chee Keong School of Computer Science and Engineering ASCKKWOH@ntu.edu.sg Engineering::Computer science and engineering Extracting meaningful information from DNA is a key element of bioinformatics research, and DNA sequence classification has a wide range of applications. In recent years, Machine Learning and Deep Learning techniques have become popular, and Convolutional Neural Networks (CNN) in particular have been widely used because of their high accuracy. To employ CNN or other Machine Learning/Deep Learning techniques for DNA/RNA classification or other discovery tasks, the input sequences must be numeric. Encoding is therefore required to convert the sequences into a vector or multi-dimensional matrix. The objective of this project was to find a more suitable way of encoding DNA/RNA sequences for classification. In this project, three encoding methods (hash encoding, one-hot encoding, and ordinal encoding) were applied to two datasets, and the encoded data were fed to different Deep Learning models, including FNN and CNN, and to Machine Learning models for classification. The performance of each encoding method was evaluated. This study suggests that hash encoding is an efficient encoding for both binary and multi-class classification problems. One-hot encoding and ordinal encoding are suitable only for the smaller dataset with uniform-length data. On the same dataset, one-hot encoding performed better than ordinal encoding. Bachelor of Engineering (Computer Science) 2021-04-16T05:01:45Z 2021-04-16T05:01:45Z 2021 Final Year Project (FYP) Ni, W. (2021). Hash encoding on nucleotide acids for classification. Final Year Project (FYP), Nanyang Technological University, Singapore. 
https://hdl.handle.net/10356/147987 https://hdl.handle.net/10356/147987 en PSCSE19-0040 application/pdf Nanyang Technological University |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Computer science and engineering |
spellingShingle |
Engineering::Computer science and engineering Ni, Wei Hash encoding on nucleotide acids for classification |
description |
Extracting meaningful information from DNA is a key element of bioinformatics research, and DNA sequence classification has a wide range of applications. In recent years, Machine Learning and Deep Learning techniques have become popular, and Convolutional Neural Networks (CNN) in particular have been widely used because of their high accuracy. To employ CNN or other Machine Learning/Deep Learning techniques for DNA/RNA classification or other discovery tasks, the input sequences must be numeric. Encoding is therefore required to convert the sequences into a vector or multi-dimensional matrix.
The objective of this project was to find a more suitable way of encoding DNA/RNA sequences for classification.
In this project, three encoding methods (hash encoding, one-hot encoding, and ordinal encoding) were applied to two datasets, and the encoded data were fed to different Deep Learning models, including FNN and CNN, and to Machine Learning models for classification. The performance of each encoding method was evaluated.
This study suggests that hash encoding is an efficient encoding for both binary and multi-class classification problems. One-hot encoding and ordinal encoding are suitable only for the smaller dataset with uniform-length data. On the same dataset, one-hot encoding performed better than ordinal encoding. |
author2 |
Kwoh Chee Keong |
author_facet |
Kwoh Chee Keong Ni, Wei |
format |
Final Year Project |
author |
Ni, Wei |
author_sort |
Ni, Wei |
title |
Hash encoding on nucleotide acids for classification |
title_short |
Hash encoding on nucleotide acids for classification |
title_full |
Hash encoding on nucleotide acids for classification |
title_fullStr |
Hash encoding on nucleotide acids for classification |
title_full_unstemmed |
Hash encoding on nucleotide acids for classification |
title_sort |
hash encoding on nucleotide acids for classification |
publisher |
Nanyang Technological University |
publishDate |
2021 |
url |
https://hdl.handle.net/10356/147987 |
_version_ |
1698713688018518016 |
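
The description names three encodings (hash, one-hot, and ordinal) without showing what they produce. The following is a minimal illustrative sketch in Python, not the project's actual implementation. Everything in it is an assumption made for illustration: the four-letter alphabet `ACGT`, the integer codes 1 to 4, the use of overlapping k-mers with `k=3`, the 16-bucket output size, and the choice of MD5 as the hash function. The function names are hypothetical.

```python
import hashlib

ALPHABET = "ACGT"  # assumed alphabet; the project may also handle U, N, etc.


def ordinal_encode(seq):
    """Map each base to an integer code (A=1, C=2, G=3, T=4, an assumed mapping).

    The output has one entry per base, so its length follows the sequence length.
    Raises ValueError for a base outside ALPHABET.
    """
    return [ALPHABET.index(base) + 1 for base in seq]


def one_hot_encode(seq):
    """Map each base to a 4-element indicator vector.

    The output is a len(seq) x 4 matrix, so its size also follows the sequence length.
    """
    return [[1 if base == symbol else 0 for symbol in ALPHABET] for base in seq]


def hash_encode(seq, k=3, n_buckets=16):
    """Count overlapping k-mers into a fixed number of hashed buckets.

    The output always has n_buckets entries, whatever the sequence length.
    Distinct k-mers may collide in the same bucket.
    """
    vector = [0] * n_buckets
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        digest = hashlib.md5(kmer.encode("ascii")).digest()
        bucket = int.from_bytes(digest[:4], "big") % n_buckets
        vector[bucket] += 1
    return vector


if __name__ == "__main__":
    for sequence in ("ACGT", "ACGTACGTAC"):
        print(sequence)
        print("  ordinal:", ordinal_encode(sequence))
        print("  one-hot rows:", len(one_hot_encode(sequence)))
        print("  hash length:", len(hash_encode(sequence)))
```

In this sketch the ordinal and one-hot outputs grow with the input, while the hashed vector keeps a fixed length. That property is one plausible reading of why the description finds one-hot and ordinal encoding suitable only for uniform-length data, though the record itself does not state the reason.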