Classifying Heart Disease through Fusion of Multi-Source Datasets: Integration of Feature Selection and Explainable Machine Learning Techniques

Kasiful Aprianto; Mila Desi Anasanti

doi:10.22146/ijccs.92395

Kasiful Aprianto (1), Mila Desi Anasanti (2*)

(1) Nusa Mandiri University
(2) Nusa Mandiri University
(*) Corresponding Author

Abstract

This study investigates heart disease classification by combining feature selection with machine learning, using three datasets that together comprise 4,728 participants and 11 features, with 4.27% missing data. Using machine learning, XGBoost achieved 0.95 accuracy for one feature, while Random Forest (RF) achieved accuracies of 0.92 and 0.99 for the remaining two features. Among the 11 classification models compared, RF and XGBoost classified heart disease with 0.97 and 0.99 accuracy, respectively, using all available features. Feature elimination with Simultaneous Perturbation Feature Selection and Ranking (SpFSR) showed that RF attained 0.99 accuracy with only four selected features (cholesterol level, age, resting electrocardiographic measurements, and maximum heart rate), whereas XGBoost's accuracy dropped to 0.91. An RF model built on these four features improved interpretability without compromising accuracy. Explainable machine learning (XAI) techniques, including Permutation Importance and SHAP summary plot analyses, were used to gauge each feature's impact on heart disease prediction. Resting electrocardiographic measurements had the highest importance value (0.40 ± 0.01), followed by maximum heart rate (0.32 ± 0.01), cholesterol level (0.28 ± 0.01), and age (0.26 ± 0.005). These results underscore the significance of each feature in diagnosing heart disease via machine learning.
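Permutation importance scores are commonly reported as a mean ± standard deviation of the accuracy drop over repeated shuffles of one feature. The following is a minimal sketch of that idea in plain Python, not the authors' implementation: the threshold rule `toy_model`, the synthetic two-feature data, and the function names are all assumptions made for illustration, whereas the study itself applied the technique to an RF model on real clinical features.

```python
import random
import statistics


def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Return (mean, std) of the accuracy drop for each feature column.

    A feature's values are shuffled across rows, which breaks its link
    to the label while keeping its distribution; the resulting loss in
    accuracy is taken as that feature's importance.
    """
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    results = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by shuffled values.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        results.append((statistics.mean(drops), statistics.stdev(drops)))
    return results


# Synthetic data (assumption): feature 0 decides the label, feature 1 is noise.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(500)]
y = [int(row[0] > 0.5) for row in X]


def toy_model(row):
    """Hypothetical stand-in for a trained classifier."""
    return int(row[0] > 0.5)


importances = permutation_importance(toy_model, X, y)
for name, (mean, std) in zip(["informative", "noise"], importances):
    print(f"{name}: {mean:.2f} ± {std:.2f}")
```

In this toy setting the noise feature scores exactly zero because the model never reads it, while the informative feature loses a large share of accuracy when shuffled. With a trained classifier, the same loop would be run on held-out data.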

Keywords

Heart disease classification; Dataset fusion; Imputation; Machine learning; Feature extraction; Explainable Machine Learning; XGBoost; Random Forest




This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Copyright of: IJCCS (Indonesian Journal of Computing and Cybernetics Systems), ISSN 1978-1520 (print); ISSN 2460-7258 (online), a scientific journal of Computing and Cybernetics Systems.
A publication of IndoCEISS. Gedung S1 Ruang 416 FMIPA UGM, Sekip Utara, Yogyakarta 55281. Fax: +62274 555133. Email: ijccs.mipa@ugm.ac.id | http://jurnal.ugm.ac.id/ijccs
