Published online Dec 20, 2023. doi: 10.5662/wjm.v13.i5.373
Peer-review started: September 21, 2023
First decision: September 29, 2023
Revised: September 30, 2023
Accepted: November 3, 2023
Article in press: November 3, 2023
Processing time: 89 Days and 20.3 Hours
Oversampling is the most widely used approach for dealing with class-imbalanced datasets, as evidenced by the plethora of oversampling methods developed over the last two decades. In this editorial, we discuss the problems with oversampling that stem from the risk of overfitting and from the generation of synthetic cases that may not accurately represent the minority class. These limitations should be considered whenever oversampling techniques are applied. We also propose several alternative strategies for dealing with imbalanced data, along with a perspective on future work.
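To make the concern above concrete, the following is a minimal, self-contained sketch of SMOTE-style oversampling: each synthetic case is created by linear interpolation between a minority sample and one of its nearest minority neighbors. The function name `smote_like`, the toy data, and the parameter choices are illustrative assumptions, not the implementation used in any particular study; a real analysis would use a maintained library such as imbalanced-learn.

```python
import random

def smote_like(minority, k=2, n_new=4, seed=0):
    """Generate synthetic minority points by interpolating between a
    randomly chosen sample and one of its k nearest minority neighbors
    (the core idea behind SMOTE; a simplified sketch, not a library API)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbors of x by squared Euclidean distance, excluding x
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Toy 2-D minority class: a small cluster near (0, 0) plus one outlier.
minority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (10.0, 10.0)]
new_points = smote_like(minority, k=2, n_new=4)
```

Any synthetic point interpolated between the outlier and the cluster falls in a region where no real minority case was ever observed; this is precisely the "unrepresentative synthetic case" risk, and repeatedly fitting to such points is one route to overfitting.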
Core Tip: Addressing class imbalance in medical datasets, particularly in machine learning applications, requires a cautious approach. Although oversampling methods such as the synthetic minority oversampling technique (SMOTE) are commonly used, it is crucial to recognize their limitations: they may introduce synthetic instances that do not accurately represent the minority class, potentially leading to overfitting and unreliable results in real-world medical scenarios. Alternative approaches, such as ensemble learning-based methods like XGBoost and EasyEnsemble, have shown promise in mitigating bias and providing more robust performance. Collaboration between data science specialists and medical professionals in designing and validating these techniques is essential to ensure their reliability and effectiveness in medical applications.
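The ensemble alternative mentioned above can be illustrated with a minimal EasyEnsemble-style sketch: each ensemble member is trained on the full minority class plus a random undersample of the majority class, and predictions are combined by majority vote. The "learner" here is deliberately trivial (a midpoint threshold on 1-D data), and all names and data are hypothetical; real pipelines would use EasyEnsembleClassifier from imbalanced-learn or a gradient-boosting library such as XGBoost with class weighting.

```python
import random
import statistics

def easy_ensemble_predict(majority, minority, x, n_members=5, seed=0):
    """EasyEnsemble-style sketch: each member sees all minority samples plus
    a balanced random undersample of the majority class; the members'
    predictions are combined by majority vote (1 = minority class)."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_members):
        sub_majority = rng.sample(majority, len(minority))  # balanced subset
        # Trivial illustrative "learner": threshold halfway between class means.
        threshold = (statistics.mean(sub_majority) + statistics.mean(minority)) / 2
        votes.append(1 if x > threshold else 0)
    return 1 if sum(votes) > n_members / 2 else 0

# Toy 1-D data: majority class clustered near 0, minority class near 5.
majority = [0.1, 0.3, -0.2, 0.0, 0.4, 0.2, -0.1, 0.3, 0.1, 0.2]
minority = [4.8, 5.1, 5.0]
prediction = easy_ensemble_predict(majority, minority, x=4.9)
```

Because every member is trained on a balanced subset of real observations, no synthetic cases are fabricated; the design trades the oversampling risks for the (also nontrivial) cost of discarding majority-class information in each member.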