A comparative study on SMOTE, CTGAN, and hybrid SMOTE-CTGAN for medical data augmentation

(1) * Ninda Khoirunnisa (Informatics Department, Universitas Ahmad Dahlan, Indonesia)
(2) Miftahurrahma Rosyda (Informatics Department, Universitas Ahmad Dahlan, Indonesia)
*corresponding author

Abstract


Class imbalance in clinical datasets remains a challenge in medical data mining, often producing models that are biased toward majority outcomes and less sensitive to rare but clinically critical cases. This study presents a comparative evaluation of three augmentation strategies, namely Synthetic Minority Oversampling Technique (SMOTE), Conditional Tabular GAN (CTGAN), and a hybrid SMOTE+CTGAN, on the Framingham Heart Study dataset for cardiovascular disease prediction. The augmented datasets were evaluated with Decision Tree, Random Forest, and XGBoost classifiers using accuracy, precision, recall, and F1-score. The results show that classifiers trained on the imbalanced data achieved high accuracy but poor minority recall (<0.40), confirming the models' bias toward the majority class. SMOTE yielded the strongest improvements in minority recall (up to 0.88 with XGBoost) and balanced F1 across classes, though at the cost of reduced majority recall. CTGAN and SMOTE+CTGAN delivered more moderate improvements in minority recall (0.66–0.77) while preserving higher majority recall (>0.86), providing a gentler trade-off. These findings indicate that while SMOTE remains a robust baseline for addressing imbalance, hybrid and GAN-based approaches offer practical alternatives when majority-class performance must be preserved. The choice of augmentation method should therefore be informed by the clinical context.
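The two ideas the abstract relies on, SMOTE-style interpolation and the gap between overall accuracy and minority recall, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`smote_like_oversample`, `recall`, `accuracy`), the neighbour count `k`, and the 85/15 class split with its predictions are hypothetical values chosen only for illustration, and the sketch uses only the Python standard library.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def recall(y_true, y_pred, label):
    """Share of samples whose true class is `label` that were predicted as `label`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    total = sum(1 for t in y_true if t == label)
    return hits / total if total else 0.0


def accuracy(y_true, y_pred):
    """Share of all samples predicted correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


if __name__ == "__main__":
    # Hypothetical imbalanced labels: 85 majority (0), 15 minority (1).
    y_true = [0] * 85 + [1] * 15
    # Hypothetical classifier that finds only 3 of the 15 minority cases.
    y_pred = [0] * 85 + [1] * 3 + [0] * 12
    print("accuracy:", accuracy(y_true, y_pred))
    print("minority recall:", recall(y_true, y_pred, 1))

    # Hypothetical two-feature minority samples.
    minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    print(smote_like_oversample(minority, n_new=3, k=2))
```

In the hypothetical example the classifier is right on 88 of 100 cases yet recovers only 3 of 15 minority cases, which is the pattern the abstract describes for models trained on imbalanced data. Because each synthetic point lies on a segment between two existing minority samples, this kind of oversampling densifies the minority region rather than modelling the full feature distribution, which is what a generative model such as CTGAN attempts.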

Keywords


Medical tabular data; Imbalanced data; SMOTE; CTGAN; Data augmentation

   

DOI

https://doi.org/10.31763/sitech.v6i1.2203
      






Copyright (c) 2025 Ninda Khoirunnisa, Miftahurrahma Rosyda

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

___________________________________________________________
Science in Information Technology Letters
ISSN 2722-4139
Published by Association for Scientific Computing Electrical and Engineering (ASCEE)
W : http://pubs2.ascee.org/index.php/sitech
E : sitech@ascee.org, andri@ascee.org, andri.pranolo.id@ieee.org
