IM_SELATCN: DEEP LEARNING BASED SEQUENTIAL LABELING MODEL FOR TYPE AND INDEX MUTATION DETECTION IN BREAST AND LUNG CANCER DNA SEQUENCES

Bibliographic Details
Main Author: Novia Wisesty, Untari
Format: Dissertations
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/73361
Institution: Institut Teknologi Bandung
id id-itb.:73361
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
description Early detection of cancer is indispensable, as the number of new cases and deaths caused by cancer increases every year. One factor that raises the cancer death rate is late self-examination by patients, which delays diagnosis. By the time of diagnosis the cancer has reached a higher stage, making treatment less effective. Early detection can be achieved by running a DNA test on a patient's blood sample. A solid tumor biopsy, by contrast, is relatively difficult to perform if the cancer does not form a tumor or if the affected organ is difficult to reach. Cancer is characterized by abnormalities in DNA that can be caused by hereditary or cancer-related gene mutations. These mutations can take the form of point mutations, insertions, and deletions. Each type of cancer is associated with mutations in certain genes. In bioinformatics, two approaches are generally used to detect mutations: alignment and machine learning. Each has its strengths and weaknesses. The alignment approach is superior in detection accuracy, but its test time is long because, to detect mutations in a new sequence, the sequence must be compared with all available reference sequences. The machine learning approach has a faster test time because new sequences are fed into a trained detection model to obtain results without comparison against reference sequences. However, in existing studies, the machine learning approach has only classified a sequence as mutated or normal, and it requires other tools and supporting data. This dissertation therefore aims to build a Deep Learning-based sequential labeling model (IM_SelaTCN) to detect the type and index of mutations in DNA sequence data.
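As a rough illustration of what sequential labeling yields, the sketch below reads mutation type and index off a per-position label sequence. The label set (`LABELS`) and the helper `mutations_from_labels` are hypothetical; this record does not state the tag scheme actually used by IM_SelaTCN.

```python
from typing import List, Tuple

# Hypothetical label set (an assumption for illustration only);
# the dissertation's actual tag scheme is not given in this record.
LABELS = {0: "normal", 1: "point", 2: "insertion", 3: "deletion"}


def mutations_from_labels(labels: List[int]) -> List[Tuple[int, str]]:
    """Return (index, mutation type) for every position not labeled normal."""
    return [(i, LABELS[label]) for i, label in enumerate(labels) if label != 0]


# One predicted label per nucleotide position of a 7-base sequence.
predicted = [0, 0, 1, 0, 2, 2, 0]
print(mutations_from_labels(predicted))
# [(2, 'point'), (4, 'insertion'), (5, 'insertion')]
```

Unlike whole-sequence classification, which returns a single mutated/normal flag, a per-position output of this kind carries both where a mutation occurs and which kind it is.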
The data used include the COSMIC datasets for breast and lung cancer, acquired from the Catalogue of Somatic Mutations in Cancer (COSMIC) database, and the RSCM dataset, acquired from breast cancer patients at Cipto Mangunkusumo Hospital (RSCM), Jakarta, Indonesia. The COSMIC breast cancer dataset combines 21 genes associated with breast cancer, with a total of 81,272 patient sequences. The COSMIC lung cancer dataset combines 10 genes associated with lung cancer, with a total of 143,111 patient sequences. The RSCM dataset consists of 24 patients with a total of 11,384,164 short sequences. The research starts with data acquisition, data preprocessing, and DNA mapping to convert DNA sequences into numerical sequences, followed by the design and implementation of the mutation detection systems, system testing and analysis, and report or journal writing. The Deep Learning models used are the Temporal Convolutional Network (TCN), Bidirectional Long Short-Term Memory (BiLSTM), and one-dimensional Convolutional Neural Network (1D-CNN). The TCN model is suited to sequential and time series data: it processes input sequences in parallel, which shortens computation time; it has a flexible receptive field size; it avoids exploding or vanishing gradients; and its filters are shared across a layer, so it requires less memory. BiLSTM is also suited to sequential data: it handles varying input lengths, and the number of parameters to be optimized does not grow with the length of the sequence. The 1D-CNN model has been shown to extract features from DNA sequence data, but existing studies using it still require results from other tools as supporting data.
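The "flexible receptive field size" of a TCN can be made concrete with the standard formula for stacked dilated convolutions. The sketch below assumes the common residual-block layout with two convolutions per block; the kernel size and dilation values are illustrative assumptions, not the configuration used in the dissertation.

```python
from typing import Sequence


def tcn_receptive_field(kernel_size: int,
                        dilations: Sequence[int],
                        convs_per_block: int = 2) -> int:
    """Number of input positions seen by one output position.

    Each dilated convolution with dilation d widens the receptive field
    by (kernel_size - 1) * d; a block applies `convs_per_block` of them.
    """
    return 1 + convs_per_block * (kernel_size - 1) * sum(dilations)


# Illustrative hyperparameters (assumed, not taken from the source).
print(tcn_receptive_field(kernel_size=3, dilations=[1, 2, 4, 8]))      # 61
print(tcn_receptive_field(kernel_size=3, dilations=[1, 2, 4, 8, 16]))  # 125
```

Adding one more block with a doubled dilation roughly doubles the span of nucleotides each prediction can draw on, without adding recurrence.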
Based on the training and testing of the Deep Learning-based sequential labeling model, detection performance can be improved by examining hyperparameters and selecting an appropriate Deep Learning model. On the COSMIC breast cancer dataset, the 2-mer and 3-mer mapping techniques increase the test F1-score by 30-34% compared with the integer mapping technique. The proposed TCN model is superior to the BiLSTM and 1D-CNN models in detecting the mutation index on the COSMIC lung cancer dataset and the RSCM dataset, and its detection time is five times faster than that of the BiLSTM model. This indicates that the TCN model is more robust on larger datasets with high heterogeneity. The highest F1-score achieved with the TCN model was 0.9443 for the COSMIC breast cancer dataset, 0.9591 for the COSMIC lung cancer dataset, and 0.9629 for the RSCM dataset. The highest F1-score achieved with the BiLSTM model was 0.9634 for the COSMIC breast cancer dataset, 0.9457 for the COSMIC lung cancer dataset, and 0.9576 for the RSCM dataset.
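The difference between integer mapping and k-mer mapping can be sketched as follows. The vocabulary order, the 1-based indices, and the overlapping stride-1 windows are assumptions made for illustration; the record does not describe the exact encoding used in the dissertation, and ambiguous bases such as N are not handled here.

```python
from itertools import product
from typing import Dict, List

BASES = "ACGT"


def integer_map(seq: str) -> List[int]:
    """Map each nucleotide to an integer (A=1, C=2, G=3, T=4)."""
    lookup = {base: i + 1 for i, base in enumerate(BASES)}
    return [lookup[base] for base in seq]


def kmer_vocab(k: int) -> Dict[str, int]:
    """Assign a 1-based index to each of the 4**k possible k-mers."""
    return {"".join(p): i + 1 for i, p in enumerate(product(BASES, repeat=k))}


def kmer_map(seq: str, k: int) -> List[int]:
    """Map overlapping k-mers (stride 1) to their vocabulary indices."""
    vocab = kmer_vocab(k)
    return [vocab[seq[i:i + k]] for i in range(len(seq) - k + 1)]


print(integer_map("ACGT"))  # [1, 2, 3, 4]
print(kmer_map("ACGT", 2))  # [2, 7, 12]
print(kmer_map("ACGT", 3))  # [7, 28]
```

With k-mers, each numeric token already encodes a nucleotide together with its immediate neighbors, which is one plausible reason such mappings could help a labeling model compared with single-base integers.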
timestamp 2023-06-20T08:12:40Z
keywords Cancer Early Detection, Types and Index DNA Mutations, Sequential Labeling, Deep Learning, TCN, BiLSTM, 1D-CNN