Hash encoding on nucleotide acids for classification

Extracting meaningful information from DNA is a key element of bioinformatics research, and DNA sequence classification has a wide range of applications. In recent years, Machine Learning and Deep Learning techniques have become popular; Convolutional Neural Networks (CNNs) in particular have been widely...

Bibliographic Details
Main Author: Ni, Wei
Other Authors: Kwoh Chee Keong
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2021
Subjects: Engineering::Computer science and engineering
Online Access: https://hdl.handle.net/10356/147987
Institution: Nanyang Technological University
School: School of Computer Science and Engineering
Contact: ASCKKWOH@ntu.edu.sg
Degree: Bachelor of Engineering (Computer Science)
Record date: 2021-04-16T05:01:45Z
Identifier: PSCSE19-0040
File format: application/pdf
Citation: Ni, W. (2021). Hash encoding on nucleotide acids for classification. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/147987
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
topic Engineering::Computer science and engineering
description Extracting meaningful information from DNA is a key element of bioinformatics research, and DNA sequence classification has a wide range of applications. In recent years, Machine Learning and Deep Learning techniques have become popular; Convolutional Neural Networks (CNNs) in particular have been widely used because of their high accuracy. To employ a CNN or other Machine Learning/Deep Learning techniques for DNA/RNA classification or other discovery tasks, the input sequences must be numeric. Encoding is therefore required to convert the sequences into a vector or multi-dimensional matrix. The objective of this project was to find a more suitable way of encoding DNA/RNA sequences for classification. Three encoding methods (hash encoding, one-hot encoding, and ordinal encoding) were applied to two datasets, and the encoded data were fed to different Deep Learning models, including FNN and CNN, and to Machine Learning models for classification. The performance of each encoding method was evaluated. The study suggests that hash encoding is an efficient encoding for both binary and multi-class classification problems. One-hot encoding and ordinal encoding are suitable only for the smaller dataset with uniform-length sequences. On the same dataset, one-hot encoding performed better than ordinal encoding.
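The three encodings named in the description can be illustrated with a minimal Python sketch. This record does not describe how the project implemented them, so everything here is an assumption made for illustration: the function names, the alphabet `ACGT`, and in particular the hash encoding, which is shown as k-mer counting into a fixed number of hashed buckets (the "hashing trick") with hypothetical parameters `k` and `n_buckets`.

```python
import hashlib

NUCLEOTIDES = "ACGT"  # assumed DNA alphabet; other symbols are treated as unknown


def ordinal_encode(seq):
    """Map each base to an integer: A=1, C=2, G=3, T=4, unknown=0."""
    table = {base: i + 1 for i, base in enumerate(NUCLEOTIDES)}
    return [table.get(base, 0) for base in seq.upper()]


def one_hot_encode(seq):
    """Map each base to a 4-element indicator row; unknown bases give all zeros."""
    return [[1 if base == n else 0 for n in NUCLEOTIDES] for base in seq.upper()]


def hash_encode(seq, k=3, n_buckets=16):
    """Count overlapping k-mers into n_buckets hashed slots (illustrative only)."""
    vec = [0] * n_buckets
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # A stable hash is used so the bucket does not change between runs.
        digest = hashlib.sha256(kmer.encode("ascii")).digest()
        bucket = int.from_bytes(digest[:4], "big") % n_buckets
        vec[bucket] += 1
    return vec


if __name__ == "__main__":
    print(ordinal_encode("ACGT"))
    print(one_hot_encode("AG"))
    print(hash_encode("ACGTACGT"))
```

In this sketch the ordinal and one-hot outputs grow with the sequence, while the hashed vector always has `n_buckets` entries. That is one plausible reading of why the description finds one-hot and ordinal encoding suitable only for uniform-length data, though the record itself does not state the reason.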