IMPLEMENTATION OF SEQUENTIAL LABELLING AND DNABERT FOR SPLICE SITES PREDICTION IN HOMO SAPIENS DNA
Main Author: | |
---|---|
Format: | Theses |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/72098 |
Institution: | Institut Teknologi Bandung |
Summary: | Genome sequencing technology has improved significantly in the last few years, producing an abundance of genetic data. To make use of this data, genome annotation must be carried out to identify the functional and structural features of genes. The growing volume of DNA data produced by ever-improving sequencing machines requires more effective and efficient annotation methods. Artificial intelligence has been employed to analyze genetic data in response to its sheer size and variability.
Splice site prediction is one of several processes involved in genome annotation and aims to identify the introns and exons in a gene. Sequence classification with deep learning models such as CNN and LSTM has been widely used recently because such models learn feature representations from data automatically, which eliminates the need for manual feature engineering. However, recent implementations are limited to sequences whose splice site is located at the center of the sequence. To alleviate this limitation, a sequential labelling approach is proposed to identify splice sites regardless of their position in a given sequence.
The proposed sequential labelling model, called DNABERT-SL, is developed using pretrained DNABERT and NCBI RefSeq genome data. Sequential models based on bidirectional LSTM and bidirectional GRU are also developed as baselines. Experiments on DNABERT-SL cover both the fine-tuning and the feature-based approaches, using hyperparameters recommended for DNABERT and BERT, while experiments on the baseline models vary the RNN variant and the data representation.
Validation of DNABERT-SL shows that fine-tuning with AdamW at learning rate = 5 × 10⁻⁵ and epsilon = 10⁻⁸ produces the best result, with F1 scores of 0.998 and 0.996 for the intron and exon labels and F1 scores in the range of 0.8 to 0.9 for the splice site labels. For the baseline models, a 3-mer data representation gives a better model than single-nucleotide tokens. Bidirectional GRU performs slightly better than bidirectional LSTM.
Testing shows that both DNABERT-SL and the baseline models score significantly lower than in validation, while sharing similar precision, recall, and F1 scores. Both models achieve an F1 score of 0.85 for the intron label but manage only 0.48 on average on the other labels. The lowest F1 score is found on the acceptor splice site label (0.109). Error and test result analyses reveal that DNABERT-SL overfits. Token and motif analyses also reveal that the proposed model cannot distinguish GT-AG occurring as a splice site motif from GT-AG occurring within introns or exons, because it fails to recognize the contextual pattern of GT-AG correctly. |
---|
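
The sequential labelling idea described in the summary can be illustrated with a minimal sketch. It assumes overlapping 3-mer tokens (the representation the summary reports for the baseline models) and a hypothetical rule in which each token takes the label of its middle nucleotide. The toy sequence, the label names (E, I, D, A), and the helper functions `kmer_tokens`, `token_labels`, and `motif_positions` are illustrative assumptions and are not taken from the thesis.

```python
from typing import List


def kmer_tokens(seq: str, k: int = 3) -> List[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def token_labels(per_base: List[str], k: int = 3) -> List[str]:
    """Give each k-mer the label of its middle base (an assumed rule)."""
    mid = k // 2
    return [per_base[i + mid] for i in range(len(per_base) - k + 1)]


def motif_positions(seq: str, motif: str) -> List[int]:
    """Return every start index at which `motif` occurs in `seq`."""
    return [i for i in range(len(seq) - len(motif) + 1)
            if seq.startswith(motif, i)]


# Toy gene fragment (made up for illustration): exon, intron, exon.
exon1, intron, exon2 = "ATGGTC", "GTAAGTCCAG", "GAGTAA"
seq = exon1 + intron + exon2

# Per-base labels: E = exon, I = intron, D = donor site (GT at the
# intron start), A = acceptor site (AG at the intron end).
per_base = (["E"] * len(exon1)
            + ["D", "D"] + ["I"] * (len(intron) - 4) + ["A", "A"]
            + ["E"] * len(exon2))

tokens = kmer_tokens(seq)        # one token per position window
labels = token_labels(per_base)  # one label per token

# GT and AG also occur outside the true splice sites, so the
# dinucleotide alone does not identify a donor or acceptor.
gt_hits = motif_positions(seq, "GT")
ag_hits = motif_positions(seq, "AG")
true_donor = per_base.index("D")
true_acceptor = per_base.index("A")

for tok, lab in zip(tokens, labels):
    print(tok, lab)
print("GT occurrences:", gt_hits, "true donor at", true_donor)
print("AG occurrences:", ag_hits, "true acceptor at", true_acceptor)
```

In this toy fragment only one of the four GT occurrences and one of the three AG occurrences is an actual splice site, which is the kind of ambiguity the summary reports the model failing to resolve from context.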