Imputation of missing values in breast cancer data

The critical role of complete and accurate data in breast cancer research and breast cancer diagnosis is the impetus behind this study, which rigorously examines and compares the efficacy of various imputation methods, focusing on the potential superiority of autoencoders over established techniques...

Bibliographic Details
Main Author: Rajagopal, Tejas R.
Other Authors: Fan Xiuyi
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2024
Subjects: Computer and Information Science; Medicine, Health and Life Sciences; Imputation
Online Access: https://hdl.handle.net/10356/176005
Institution: Nanyang Technological University
id sg-ntu-dr.10356-176005
record_format dspace
spelling sg-ntu-dr.10356-176005 2024-05-17T15:38:13Z
Imputation of missing values in breast cancer data
Rajagopal, Tejas R.
Fan Xiuyi
School of Computer Science and Engineering
xyfan@ntu.edu.sg
Computer and Information Science
Medicine, Health and Life Sciences
Imputation
Bachelor's degree
2024-05-13T02:57:56Z
2024-05-13T02:57:56Z
2024
Final Year Project (FYP)
Rajagopal, T. R. (2024). Imputation of missing values in breast cancer data. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/176005
https://hdl.handle.net/10356/176005
en
SCSE23-0708
application/pdf
Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Computer and Information Science
Medicine, Health and Life Sciences
Imputation
description Complete and accurate data are critical to breast cancer research and diagnosis. This study examines and compares the efficacy of several imputation methods, focusing on whether autoencoders can outperform established techniques. The comparison begins with the UCI Wisconsin Breast Cancer Dataset, where Multiple Imputation by Chained Equations (MICE) sets a strong accuracy baseline; autoencoder imputation does not perform as well as MICE on this dataset. The research then moves to the SEER Breast Cancer Dataset, which has a more complex set of features spanning both categorical and numerical data. On this dataset, autoencoders significantly outperform the baseline MICE model. The contrast between the two datasets shows that the performance of an imputation method is conditional and depends heavily on the dataset’s characteristics. The study concludes by testing how robust these methods are when missing values are introduced incrementally. Even with larger volumes of missing data, the autoencoder keeps a competitive edge on the SEER dataset, although the margin narrows. These findings suggest choosing the imputation method according to the dataset’s complexity and composition. Autoencoders emerge as a promising model for complex datasets, potentially enabling better clinical decision-making and aiding breast cancer research. Although the study focuses on breast cancer data, the findings may extend to other forms of medical data with similar characteristics. Overall, the study concludes that autoencoder imputation outperforms MICE on the SEER breast cancer dataset, whereas MICE outperforms autoencoder imputation on the Wisconsin breast cancer dataset.
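The evaluation idea in the description (hide known values at increasing rates, impute them, and score the imputations against the hidden truth) can be sketched roughly as follows. This is not the project's code. It assumes synthetic numeric data in place of the Wisconsin or SEER datasets, uses a simplified single-imputation variant of chained equations (full MICE draws multiple imputations), and leaves out the autoencoder. The function names `mask_at_random`, `chained_regression_impute`, `rmse_on_masked`, and `run_experiment` are all hypothetical.

```python
import numpy as np

def mask_at_random(X, rate, rng):
    """Return a copy of X with roughly `rate` of its entries set to NaN, plus the mask."""
    mask = rng.random(X.shape) < rate
    X_missing = X.copy()
    X_missing[mask] = np.nan
    return X_missing, mask

def chained_regression_impute(X_missing, n_rounds=10):
    """Simplified chained-equations imputation (assumption: OLS models, one imputation).

    Each column with gaps is regressed on all other columns, and its missing
    entries are overwritten with the regression predictions, round after round.
    """
    missing = np.isnan(X_missing)
    col_means = np.nanmean(X_missing, axis=0)
    X = np.where(missing, col_means, X_missing)  # start from mean imputation
    for _ in range(n_rounds):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any() or rows.all():
                continue  # nothing to impute, or nothing to learn from
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])  # intercept + predictors
            coef, *_ = np.linalg.lstsq(A[~rows], X[~rows, j], rcond=None)
            X[rows, j] = A[rows] @ coef
    return X

def rmse_on_masked(X_true, X_imputed, mask):
    """Root-mean-square error computed only over the entries that were hidden."""
    return float(np.sqrt(np.mean((X_true[mask] - X_imputed[mask]) ** 2)))

def run_experiment(rates=(0.1, 0.3, 0.5), seed=0):
    rng = np.random.default_rng(seed)
    # Synthetic, correlated numeric features standing in for real clinical data.
    latent = rng.normal(size=(400, 2))
    X_true = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(400, 6))
    results = {}
    for rate in rates:
        X_missing, mask = mask_at_random(X_true, rate, rng)
        mean_filled = np.where(mask, np.nanmean(X_missing, axis=0), X_missing)
        chained = chained_regression_impute(X_missing)
        results[rate] = {
            "mean": rmse_on_masked(X_true, mean_filled, mask),
            "chained": rmse_on_masked(X_true, chained, mask),
        }
    return results

if __name__ == "__main__":
    for rate, scores in run_experiment().items():
        print(f"missing rate {rate:.0%}: {scores}")
```

Scoring only the masked entries keeps the comparison fair across missing rates, because observed values are never changed by either imputer. Categorical features, such as those in the SEER data, would need encoding before a sketch like this could be applied.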
author2 Fan Xiuyi
author_facet Fan Xiuyi
Rajagopal, Tejas R.
format Final Year Project
author Rajagopal, Tejas R.
author_sort Rajagopal, Tejas R.
title Imputation of missing values in breast cancer data
publisher Nanyang Technological University
publishDate 2024
url https://hdl.handle.net/10356/176005
_version_ 1800916247729143808