Copyright ©The Author(s) 2023. Published by Baishideng Publishing Group Inc. All rights reserved.
World J Methodol. Dec 20, 2023; 13(5): 373-378
Published online Dec 20, 2023. doi: 10.5662/wjm.v13.i5.373
Challenges and limitations of synthetic minority oversampling techniques in machine learning
Ibraheem M Alkhawaldeh, Ibrahem Albalkhi, Abdulqadir Jeprel Naswhan
Ibraheem M Alkhawaldeh, Faculty of Medicine, Mutah University, Karak 61710, Jordan
Ibrahem Albalkhi, Department of Neuroradiology, Alfaisal University, Great Ormond Street Hospital NHS Foundation Trust, London WC1N 3JH, United Kingdom
Abdulqadir Jeprel Naswhan, Nursing for Education and Practice Development, Hamad Medical Corporation, Doha 3050, Qatar
Author contributions: Alkhawaldeh IM, Albalkhi I, and Naswhan AJ contributed to writing and editing the manuscript, preparing the illustrations, and reviewing the literature for this paper; Alkhawaldeh IM and Naswhan AJ designed the overall concept and outline of the manuscript.
Conflict-of-interest statement: All the authors report no relevant conflicts of interest for this article.
Open-Access: This article is an open-access article that was selected by an in-house editor and fully peer-reviewed by external reviewers. It is distributed in accordance with the Creative Commons Attribution NonCommercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: https://creativecommons.org/Licenses/by-nc/4.0/
Corresponding author: Abdulqadir Jeprel Naswhan, MSc, RN, Director, Research Scientist, Senior Lecturer, Senior Researcher, Nursing for Education and Practice Development, Hamad Medical Corporation, Rayyan Road, Doha 3050, Qatar. anashwan@hamad.qa
Received: September 21, 2023
Peer-review started: September 21, 2023
First decision: September 29, 2023
Revised: September 30, 2023
Accepted: November 3, 2023
Article in press: November 3, 2023
Published online: December 20, 2023

Oversampling is the most widely used approach for dealing with class-imbalanced datasets, as evidenced by the plethora of oversampling methods developed over the last two decades. In this editorial, we discuss the problems with oversampling, which stem from the risk of overfitting and from the generation of synthetic cases that might not accurately represent the minority class. These limitations should be considered when using oversampling techniques. We also propose several alternative strategies for dealing with imbalanced data and offer a perspective on future work.

Keywords: Machine learning, Class imbalance, Overfitting, Misdiagnosis

Core Tip: Addressing class imbalance in medical datasets, particularly in machine learning applications, requires a cautious approach. While oversampling methods such as the synthetic minority oversampling technique (SMOTE) are commonly used, it is crucial to recognize their limitations. They may introduce synthetic instances that do not accurately represent the minority class, potentially leading to overfitting and unreliable results in real-world medical scenarios. Alternative approaches are therefore worth exploring, such as ensemble learning-based methods (for example, XGBoost and Easy Ensemble), which have shown promise in mitigating bias and providing more robust performance. Collaborating with data science specialists and medical professionals to design and validate these techniques is essential to ensure their reliability and effectiveness in medical applications.
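
To make the concern about unrepresentative synthetic instances concrete, the following is a minimal sketch of the interpolation step on which SMOTE-style oversampling is based: a new point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbors. The function name `smote_like_samples`, its parameters, and the toy data are illustrative assumptions and are not taken from this editorial or from any particular library implementation.

```python
import numpy as np


def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors.

    Illustrative sketch only; not the implementation of any specific library.
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor

    k = min(k, n - 1)
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest minority neighbors

    base = rng.integers(0, n, size=n_new)  # randomly chosen minority samples
    picked = neighbors[base, rng.integers(0, k, size=n_new)]  # one neighbor each
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)

    return minority[base] + gap * (minority[picked] - minority[base])


# Toy minority class made of two small, well-separated groups (assumed data).
X_min = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
synthetic = smote_like_samples(X_min, n_new=50, k=3, seed=1)
print(synthetic.shape)
```

Because `k=3` here is as large as the number of other minority points, some neighbors belong to the distant group, so some synthetic points can fall in the empty region between the two groups. In a real dataset that region may be occupied by majority-class cases, which is one way in which interpolated samples can fail to represent the minority class.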