On the design of capacity-approaching error-correction codes for multi constrained systems

Bibliographic Details
Main Author: Zhang, Jiayu
Other Authors: Erry Gunawan
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2021
Online Access: https://hdl.handle.net/10356/151921
Institution: Nanyang Technological University
Description
Summary: Conventional storage media cannot keep pace with the current growth in data volume, which is a major motivation for developing new storage technologies. Technological advancement in the biological sciences is nothing new, and DNA data storage has benefited from breakthroughs in bioinformatics and from innovations produced by cross-disciplinary collaboration. Because it can potentially store data for centuries at high density, DNA is considered a promising answer to the enormous demand for data generation and storage. DNA sequencing, one step of the DNA data storage process, is error-prone. When analysing DNA nucleotide sequences, clustering plays a vital role in reducing redundancy and correcting errors. Most currently available software tools cluster sequences with greedy approaches, which do not always produce optimal results: they are very sensitive to a single parameter that sets the similarity required among DNA sequences within one cluster. In general, the appropriate similarity is not known in advance, so the clusters generated by these greedy algorithms tend not to match the actual clusters when an imperfect parameter is used. The mean shift algorithm, an unsupervised learning method, has been used many times in fields such as descriptive statistics, audio processing, and computer vision. It is guaranteed to converge to a local optimum, which overcomes the limitations of greedy algorithms. MeShClust is an alignment-free clustering tool that applies the mean shift approach together with a machine learning algorithm to cluster DNA sequences. In this project, the MeShClust tool is implemented and its results are compared with those produced by the SlideSort algorithm on the same DNA sequence dataset.
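The mean shift idea mentioned in the summary can be illustrated with a minimal sketch. This is not the MeShClust implementation: it works on plain one-dimensional numbers with a flat kernel rather than on DNA sequences, and the function names (`mean_shift_1d`, `group_modes`) and the `bandwidth` parameter are assumptions chosen only for illustration. Each point is moved repeatedly to the mean of its neighbours until it stops moving, and points that settle at the same place are placed in one cluster.

```python
def mean_shift_1d(points, bandwidth, max_iter=100, tol=1e-9):
    """Shift every point to a local mode using a flat kernel (illustrative only)."""
    modes = []
    for x in points:
        for _ in range(max_iter):
            # Neighbours of the current position within the bandwidth.
            window = [p for p in points if abs(p - x) <= bandwidth]
            new_x = sum(window) / len(window)
            if abs(new_x - x) < tol:
                break  # the position no longer moves: a local mode is reached
            x = new_x
        modes.append(x)
    return modes


def group_modes(modes, bandwidth):
    """Merge modes that lie close together and return (centers, labels)."""
    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if abs(m - c) <= bandwidth / 2:
                labels.append(i)
                break
        else:
            # No existing center is close enough, so start a new cluster.
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels


# Hypothetical toy data: two well-separated groups of values.
points = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]
centers, labels = group_modes(mean_shift_1d(points, bandwidth=1.0), bandwidth=1.0)
print(len(centers), labels)
```

In this sketch the number of clusters is not fixed in advance; it emerges from where the points converge. A tool for DNA sequences would need a sequence similarity measure in place of the numeric distance used here.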