Classification of breast cancer disease using bagging fuzzy-ID3 algorithm based on FUZZYDBD
Main Author: | |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2022 |
Subjects: | |
Online Access: | http://umpir.ump.edu.my/id/eprint/37640/1/ir.Classification%20of%20breast%20cancer%20disease%20using%20bagging%20fuzzy-id3%20algorithm%20based%20on%20fuzzydbd.pdf http://umpir.ump.edu.my/id/eprint/37640/ |
Institution: | Universiti Malaysia Pahang |
Summary: | Classification is a data mining technique used to assign data of varied types to classes according to a specific criterion. The decision tree is one of the most powerful machine learning methods for classification problems. Of the various decision tree algorithms, the most commonly used are Iterative Dichotomiser 3 (ID3), CART, and C4.5. ID3 has the most advantages among the three, especially in processing time, as it builds trees the fastest and with short depth. However, although decision trees are widely used for classification, they suffer from high variance and overfitting, which lead to poor generalisation. Combining fuzzy sets with the ID3 algorithm handles the data more efficiently, as it brings together the advantages of both fuzzy logic and decision trees. In the proposed FID3-DBD algorithm, the continuous and discrete (integer) attributes are expressed as linguistic values of fuzzy sets, and the FUZZYDBD method is used to set the fuzzy sets’ parameters. Before tree induction, each input value is replaced with the linguistic label of the fuzzy set with which it has the highest compatibility. The proposed technique addresses the limitation of the classic ID3 algorithm, which cannot classify continuous-valued attributes, and at the same time increases classification accuracy. The bagging method is then applied to the FID3-DBD algorithm to overcome the overfitting and high variance of decision trees. Four breast cancer datasets were used to evaluate classification accuracy: the Wisconsin Breast Cancer (Original) dataset, the WDBC (Diagnostic) dataset, the Breast Cancer Coimbra dataset, and the Mammographic Mass dataset. All four datasets were acquired from the UCI machine learning repository. This study aims to address the classic ID3 algorithm's inability to classify continuous data well and to overcome its high variance and overfitting issues. 
The research methodology consists of four fundamental steps: literature review, data collection, experiment implementation, and report writing. The FID3-DBD algorithm achieved a classification accuracy of 94.362% for the Wisconsin Breast Cancer (Original) dataset, 94.358% for the WDBC (Diagnostic) dataset, 81.119% for the Mammographic Mass dataset and 64.224% for the Coimbra dataset. The BFID3-DBD algorithm achieved a classification accuracy of 96.003% for the Wisconsin Breast Cancer (Original) dataset, 95.273% for the WDBC (Diagnostic) dataset, 81.590% for the Mammographic Mass dataset and 68.966% for the Coimbra dataset. The study verified that the FID3-DBD algorithm can classify continuous data, and that the BFID3-DBD algorithm overcame the overfitting issue, reduced the high variance, and increased classification accuracy on test data. |
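The pipeline described in the summary (replace each numeric value with its most compatible linguistic label, induce an ID3 tree, then bag several trees) can be illustrated with a minimal Python sketch. This is not the thesis implementation. The triangular membership parameters in `SETS` are hand-picked for illustration and are not produced by the FUZZYDBD method, the tree induction is a plain crisp ID3 over the resulting labels, and the attribute names `a1` and `a2`, the function names, and the six toy records are all hypothetical.

```python
import math
import random
from collections import Counter

def triangular(x, a, b, c):
    """Membership of x in a triangular fuzzy set with feet a, c and peak b."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

# Assumed fuzzy sets on a 1..10 scale (hand-picked, NOT derived by FUZZYDBD).
SETS = {"low": (1, 1, 5.5), "medium": (1, 5.5, 10), "high": (5.5, 10, 10)}

def to_label(x, fuzzy_sets):
    """Return the linguistic label whose fuzzy set is most compatible with x."""
    return max(fuzzy_sets, key=lambda name: triangular(x, *fuzzy_sets[name]))

def entropy(labels):
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def id3(rows, labels, attrs):
    """Crisp ID3 over categorical rows; returns a nested dict or a class leaf."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return majority

    def remainder(attr):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[attr], []).append(y)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())

    best = min(attrs, key=remainder)  # lowest remainder = highest information gain
    node = {"attr": best, "default": majority, "children": {}}
    rest = [a for a in attrs if a != best]
    for value in set(row[best] for row in rows):
        subset = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        node["children"][value] = id3(
            [r for r, _ in subset], [y for _, y in subset], rest)
    return node

def predict(tree, row):
    """Walk the tree; unseen attribute values fall back to the node majority."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree

def bagging(rows, labels, attrs, n_trees=11, seed=0):
    """Train one tree per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        trees.append(id3([rows[i] for i in idx], [labels[i] for i in idx], attrs))
    return trees

def vote(trees, row):
    """Majority vote over the predictions of all bagged trees."""
    return Counter(predict(t, row) for t in trees).most_common(1)[0][0]

# Made-up toy records, for illustration only.
RAW = [({"a1": 1, "a2": 2}, "benign"), ({"a1": 2, "a2": 1}, "benign"),
       ({"a1": 3, "a2": 1}, "benign"), ({"a1": 9, "a2": 8}, "malignant"),
       ({"a1": 8, "a2": 10}, "malignant"), ({"a1": 10, "a2": 9}, "malignant")]

rows = [{a: to_label(v, SETS) for a, v in rec.items()} for rec, _ in RAW]
labels = [y for _, y in RAW]
trees = bagging(rows, labels, ["a1", "a2"])
print(vote(trees, {"a1": to_label(2, SETS), "a2": to_label(3, SETS)}))
```

Averaging votes over trees trained on different bootstrap samples is what lets bagging reduce the variance of a single tree, which is the motivation the summary gives for the bagged variant.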