Hash encoding on nucleotide acids for classification
Main Author: | |
---|---|
Other Authors: | |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2021 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/147987 |
Institution: | Nanyang Technological University |
Summary: | Extracting meaningful information from DNA is a key element of bioinformatics research, and DNA sequence classification has a wide range of applications. In recent years, Machine Learning and Deep Learning techniques have become popular; Convolutional Neural Networks (CNN) in particular have been widely used because of their high accuracy. To employ CNN or other Machine Learning/Deep Learning techniques for DNA/RNA classification or other discovery tasks, the input sequences must be numeric. Encoding is therefore required to convert the sequences into a vector or multi-dimensional matrix.
The objective of this project was to find a more suitable way of encoding DNA/RNA sequences for classification.
In this project, three encoding methods (hash encoding, one-hot encoding, and ordinal encoding) were applied to two datasets, and the encoded data were fed to different Deep Learning models, including FNN and CNN, and to Machine Learning models for classification. The performance of each encoding method was evaluated.
This study suggests that hash encoding is an efficient encoding method for both binary and multi-class classification problems. One-hot encoding and ordinal encoding are suitable only for smaller datasets with sequences of uniform length. On the same dataset, one-hot encoding performed better than ordinal encoding in this study. |
---|
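
The three encodings named in the summary can be illustrated with a minimal Python sketch. The record does not describe the project's actual implementation, so everything here is an assumption made for illustration: the function names, the `ACGT` alphabet, and in particular the hashing scheme, which hashes overlapping k-mers into a fixed number of buckets (the "hashing trick") with hypothetical parameters `k` and `n_buckets`.

```python
import hashlib

ALPHABET = "ACGT"  # assumed DNA alphabet; use "ACGU" for RNA


def ordinal_encode(seq):
    """Map each base to an integer (A=1, C=2, G=3, T=4); unknown symbols map to 0."""
    # str.find returns -1 for a symbol outside the alphabet, which becomes 0.
    return [ALPHABET.find(base) + 1 for base in seq.upper()]


def one_hot_encode(seq):
    """Map each base to a 4-element indicator vector; unknown symbols become all zeros."""
    return [[1 if base == ref else 0 for ref in ALPHABET] for base in seq.upper()]


def hash_encode(seq, k=3, n_buckets=16):
    """Count overlapping k-mers in a fixed-length vector indexed by a hash of each k-mer.

    Assumed scheme, not necessarily the one used in the project.
    """
    vec = [0] * n_buckets
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # MD5 is used only as a deterministic hash, not for security.
        digest = hashlib.md5(kmer.encode("ascii")).digest()
        bucket = int.from_bytes(digest[:4], "big") % n_buckets
        vec[bucket] += 1
    return vec


if __name__ == "__main__":
    print(ordinal_encode("ACGT"))
    print(one_hot_encode("AC"))
    print(hash_encode("ACGTACGT"))
```

In this sketch the ordinal and one-hot outputs grow with the sequence length, while the hashed vector always has `n_buckets` entries. That difference may be one reason the study found one-hot and ordinal encoding suitable only for data of uniform length.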