IM_SELATCN: DEEP LEARNING BASED SEQUENTIAL LABELING MODEL FOR TYPE AND INDEX MUTATION DETECTION IN BREAST AND LUNG CANCER DNA SEQUENCES
Early detection of cancer is indispensable, as the number of new cases and deaths caused by cancer increases every year. One of the factors that increases the death rate due to cancer is the lateness of the patient's self-examination, causing delayed diagnosis. The delay causes the cancer to...
Main Author: | |
---|---|
Format: | Dissertations |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/73361 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:73361 |
---|---|
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
Early detection of cancer is indispensable, as the number of new cases and deaths
caused by cancer increases every year. One factor that raises the cancer death
rate is late self-examination by patients, which delays diagnosis. By the time of
diagnosis, the cancer has reached a higher stage, making treatment
less effective. Early detection of cancer can be achieved by carrying out a DNA test
on a patient's blood sample. A solid tumor biopsy, by contrast, is relatively difficult
to perform if the cancer does not form a tumor or if the affected organ
is difficult to reach. Cancer is characterized by abnormalities in DNA
that can be caused by hereditary or cancer-related gene mutations. These mutations
can take the form of point mutations, insertions, and deletions. Each type of
cancer is associated with mutations in certain genes.
In bioinformatics, two approaches are generally used to detect
mutations: alignment and machine learning. Each approach
has its strengths and weaknesses. The alignment approach is superior in detection
accuracy, but it has a long test time because, to detect mutations in a new
sequence, the sequence must be compared with all available reference sequences.
The machine learning approach has a faster test time because
new sequences are fed into a trained detection model, which
produces results without comparing them with reference sequences. However, in
previous studies, the machine learning approach has only classified a sequence as
mutated or normal, and it has still required other tools and supporting
data.
Therefore, this dissertation research aims to build a Deep Learning-based
sequential labeling model (IM_SelaTCN) to detect the type and index of
mutations in DNA sequence data. The data used include the COSMIC datasets for
breast and lung cancer, acquired from the Catalogue of Somatic Mutations
in Cancer (COSMIC) database, as well as the RSCM dataset acquired from breast
cancer patients at Cipto Mangunkusumo Hospital (RSCM), Jakarta, Indonesia. The
COSMIC breast cancer dataset consists of a combination of 21 genes associated
with breast cancer, with a total of 81,272 patient sequences. The COSMIC lung
cancer dataset consists of a combination of 10 genes associated with lung cancer,
with a total of 143,111 patient sequences. The RSCM dataset consists of 24 patients
with a total of 11,384,164 short sequences. The research starts with
data acquisition, data preprocessing, and DNA mapping to convert DNA sequences
into numerical sequences, followed by the design and implementation of the mutation
detection system, system testing and analysis, and report or journal writing.
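The DNA mapping step can be illustrated with a minimal sketch. This record does not specify the exact encoding used in the dissertation, so the alphabet order, the overlapping stride of 1, the index offset, and the function names `integer_map` and `kmer_map` are all assumptions made for illustration only.

```python
from itertools import product

BASES = "ACGT"  # assumed alphabet; ambiguous bases such as N are not handled


def integer_map(seq):
    """Map each nucleotide to an integer (assumed: A=1, C=2, G=3, T=4)."""
    table = {base: i + 1 for i, base in enumerate(BASES)}
    return [table[base] for base in seq.upper()]


def kmer_map(seq, k):
    """Map each overlapping k-mer (stride 1) to an integer index starting at 1."""
    # Vocabulary of all 4**k possible k-mers in lexicographic order.
    vocab = {"".join(p): i + 1 for i, p in enumerate(product(BASES, repeat=k))}
    seq = seq.upper()
    return [vocab[seq[i:i + k]] for i in range(len(seq) - k + 1)]


print(integer_map("ACGT"))   # [1, 2, 3, 4]
print(kmer_map("ACGT", 2))   # [2, 7, 12]
```

With k-mers, each numeric token carries the context of neighbouring bases, at the cost of a vocabulary that grows as 4^k (16 tokens for 2-mers, 64 for 3-mers).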
The Deep Learning models used are the Temporal Convolutional Network (TCN),
Bidirectional Long Short-Term Memory (BiLSTM), and the one-dimensional
Convolutional Neural Network (1D-CNN). The TCN model is well suited to
sequential and time series data: it processes
input sequences in parallel, which shortens computation time, has
a flexible receptive field size, avoids exploding or vanishing gradients,
and shares filters that can be used at different layers, so it requires less
memory. BiLSTM is likewise suited to
sequential data: it handles varying input lengths, and the number of
parameters to be optimized does not increase as the length of the sequence
to be processed increases. Meanwhile, the 1D-CNN model has been shown to
extract features from DNA sequence data, but the existing research still requires
results from other tools as supporting data.
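The "flexible receptive field size" of a TCN can be made concrete with a short calculation. The sketch below assumes a generic TCN built from residual blocks of dilated convolutions, where each convolution with kernel size k and dilation d widens the receptive field by (k - 1) * d. The kernel size, dilation schedule, and two convolutions per block are hypothetical values, not the configuration reported in the dissertation.

```python
def tcn_receptive_field(kernel_size, dilations, convs_per_block=2):
    """Receptive field (in sequence positions) of a stack of dilated convolutions.

    Each convolution adds (kernel_size - 1) * dilation positions; every
    residual block is assumed to contain `convs_per_block` such convolutions.
    """
    return 1 + convs_per_block * (kernel_size - 1) * sum(dilations)


# Hypothetical configuration: kernel size 3, dilations doubling per block.
dilations = [2 ** i for i in range(4)]      # [1, 2, 4, 8]
print(tcn_receptive_field(3, dilations))    # 61
```

Because the dilations double from block to block, the receptive field grows roughly exponentially with depth while the number of filter weights grows only linearly.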
Training and testing of the Deep Learning-based sequential
labeling model show that detection performance can be
improved by examining hyperparameters and selecting the appropriate Deep
Learning model. On the COSMIC breast
cancer dataset, the 2-mers and 3-mers mapping techniques increase the test
F1-score by 30-34% compared to the integer mapping technique. The proposed
TCN model is superior to the BiLSTM and
1D-CNN models in detecting mutation indices on the COSMIC lung cancer dataset and the RSCM dataset, and its
detection time is five times faster than that of the BiLSTM model. This indicates that
the TCN model is more robust on larger datasets
with high heterogeneity. The highest F1-score achieved using the TCN model was
0.9443 for the COSMIC breast cancer dataset, 0.9591 for the COSMIC lung cancer
dataset, and 0.9629 for the RSCM dataset. The BiLSTM model achieved the highest
F1-score of 0.9634 for the COSMIC breast cancer dataset, 0.9457 for the COSMIC
lung cancer dataset, and 0.9576 for the RSCM dataset. |
format |
Dissertations |
author |
Novia Wisesty, Untari |
title |
IM_SELATCN: DEEP LEARNING BASED SEQUENTIAL LABELING MODEL FOR TYPE AND INDEX MUTATION DETECTION IN BREAST AND LUNG CANCER DNA SEQUENCES |
url |
https://digilib.itb.ac.id/gdl/view/73361 |
_version_ |
1822279570107138048 |
spelling |
id-itb.:73361 2023-06-20T08:12:40Z IM_SELATCN: DEEP LEARNING BASED SEQUENTIAL LABELING MODEL FOR TYPE AND INDEX MUTATION DETECTION IN BREAST AND LUNG CANCER DNA SEQUENCES Novia Wisesty, Untari Indonesia Dissertations Cancer Early Detection, Types and Index DNA Mutations, Sequential Labeling, Deep Learning, TCN, BiLSTM, 1D-CNN. INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/73361 text |