Hash encoding on nucleotide acids for classification

Bibliographic Details
Main Author: Ni, Wei
Other Authors: Kwoh Chee Keong
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2021
Online Access: https://hdl.handle.net/10356/147987
Institution: Nanyang Technological University
Description
Summary: Extracting meaningful information from DNA is a key element of bioinformatics research, and DNA sequence classification has a wide range of applications. In recent years, Machine Learning and Deep Learning techniques have become popular; Convolutional Neural Networks (CNN) in particular have been widely used because of their high accuracy. To employ CNN or other Machine Learning/Deep Learning techniques for DNA/RNA classification or other discovery tasks, the input sequences must be numeric, so each sequence must first be encoded as a vector or multi-dimensional matrix. The objective of this project was to find a more suitable way of encoding DNA/RNA sequences for classification. Three encoding methods (hash encoding, one-hot encoding, and ordinal encoding) were applied to two datasets, and the encoded data were fed to different models for classification: Deep Learning models, including FNN and CNN, and Machine Learning models. The performance of each encoding method was evaluated. The study suggests that hash encoding is an efficient encoding for both binary and multi-class classification problems. One-hot encoding and ordinal encoding are suitable only for the smaller dataset with uniform-length sequences. On the same dataset, one-hot encoding performed better than ordinal encoding.
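
To make the three encodings named in the summary concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the project's implementation: the record does not say how hash encoding was defined, so the sketch assumes a common variant (counting k-mers into a fixed number of buckets, often called the hashing trick). The function names, the alphabet `ACGT`, and the parameters `k` and `n_buckets` are hypothetical choices made for this example.

```python
import hashlib

ALPHABET = "ACGT"  # assumed DNA alphabet; RNA would use U in place of T


def ordinal_encode(seq):
    """Map each base to an integer 1..4; unknown symbols (e.g. N) map to 0."""
    table = {base: i + 1 for i, base in enumerate(ALPHABET)}
    return [table.get(base, 0) for base in seq.upper()]


def one_hot_encode(seq):
    """Map each base to a 4-element indicator row; unknown symbols give all zeros."""
    return [[1 if base == ref else 0 for ref in ALPHABET] for base in seq.upper()]


def hash_encode(seq, k=3, n_buckets=16):
    """Assumed variant: count overlapping k-mers into n_buckets hashed slots."""
    vec = [0] * n_buckets
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # md5 is used only as a stable hash; Python's built-in hash() is
        # randomized per process for strings, so it would not be reproducible.
        digest = hashlib.md5(kmer.encode("ascii")).digest()
        bucket = int.from_bytes(digest[:8], "big") % n_buckets
        vec[bucket] += 1
    return vec


# Compare output sizes for two sequences of different length.
for s in ("ACGT", "ACGTTGCA"):
    print(s, len(ordinal_encode(s)), len(one_hot_encode(s)), len(hash_encode(s)))
```

In this sketch the ordinal and one-hot outputs grow with the sequence, so variable-length inputs would need padding or truncation before reaching a fixed-input model, while the hashed vector always has `n_buckets` entries. That difference is one plausible reading of why the summary ties one-hot and ordinal encoding to uniform-length data.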