Research Article Open Access

A Comparative Analysis of Smote and CSSF Techniques for Diabetes Classification Using Imbalanced Data

Bashar Hamad Aubaidan1, Rabiah Abdul Kadir1 and Mohamad Taha Ijab1
  • 1 Institute of Visual Informatics, Universiti Kebangsaan Malaysia, Bangi Selangor, Malaysia

Abstract

Diabetes, a prevalent chronic metabolic disorder, poses a significant burden on healthcare systems worldwide. Accurate and timely diagnosis is crucial for effective management and complication prevention. Machine learning presents a promising solution but often faces challenges due to class imbalance within datasets, particularly the underrepresentation of diabetic cases. To address this issue, we introduce Cluster-based Synthetic Sample Filtering (CSSF), a method that enhances synthetic sample quality through advanced clustering and filtering techniques. Building upon the Synthetic Minority Over-sampling Technique (SMOTE), CSSF strategically generates synthetic samples within clusters while eliminating noisy instances, thereby improving classification accuracy and reliability. Comparative analysis demonstrates CSSF's effectiveness in mitigating class imbalance. Initial models achieved a 67% accuracy rate, which improved to 82% after SMOTE preprocessing. CSSF further elevated accuracy to 90%. Notably, Support Vector Machines (SVM), neural networks (deep learning) and random forest achieved 92% accuracy after CSSF preprocessing. Decision tree and K-Nearest Neighbors (KNN) also demonstrated commendable accuracy after CSSF preprocessing. Crucially, CSSF consistently outperformed SMOTE in precision, recall, and F1-score, highlighting its superiority. Recognizing the importance of ethical AI practices, this study addresses ethical considerations and potential biases in machine learning within healthcare data analysis, promoting fairness, transparency and responsible AI utilization. This research underscores the necessity of ethical and effective approaches to address class imbalance in diabetes classification.
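
The general idea of oversampling followed by filtering can be illustrated with a minimal sketch. The interpolation step follows the standard SMOTE rule (a new point is placed on the segment between a minority sample and one of its nearest minority neighbours). The filtering step is only an assumption made for illustration: it keeps a synthetic point when it lies closer to the minority-class centroid than to the majority-class centroid. The abstract does not specify the clustering or filtering rules that CSSF actually uses, so the helper names `smote_like` and `filter_by_centroid`, the parameters, and the toy data below are all hypothetical and should not be read as the authors' procedure.

```python
import random


def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def smote_like(minority, k=2, n_new=4, seed=0):
    """Standard SMOTE-style interpolation between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x, excluding x itself
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: dist2(x, p))[:k]
        nn = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, nn)))
    return synthetic


def filter_by_centroid(synthetic, minority, majority):
    """Hypothetical filter (an assumption, not the CSSF rule): keep a
    synthetic point only if it is closer to the minority centroid."""
    c_min, c_maj = centroid(minority), centroid(majority)
    return [s for s in synthetic if dist2(s, c_min) < dist2(s, c_maj)]


# Toy 2-D data, invented for illustration: one minority point sits
# inside the majority region and can produce noisy synthetic samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (3.0, 3.0)]
majority = [(3.0, 3.2), (3.1, 2.9), (2.9, 3.1), (3.3, 3.0), (2.8, 2.8)]

synthetic = smote_like(minority, k=2, n_new=6, seed=0)
kept = filter_by_centroid(synthetic, minority, majority)
```

In practice a library implementation of SMOTE and a proper clustering algorithm would replace these hand-written helpers; the sketch only shows why synthetic points generated near the majority region are candidates for removal.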

Journal of Computer Science
Volume 20 No. 9, 2024, 1146-1165

DOI: https://doi.org/10.3844/jcssp.2024.1146.1165

Submitted On: 8 October 2023 Published On: 22 July 2024

How to Cite: Aubaidan, B. H., Kadir, R. A. & Ijab, M. T. (2024). A Comparative Analysis of Smote and CSSF Techniques for Diabetes Classification Using Imbalanced Data. Journal of Computer Science, 20(9), 1146-1165. https://doi.org/10.3844/jcssp.2024.1146.1165

Keywords

  • Imbalanced Datasets
  • SMOTE
  • CSSF
  • Synthetic Minority Over-Sampling Technique
  • Cluster-Based Synthetic Sample Filtering
  • Class Imbalance